Archive for the ‘Barry’ Category

How many threads?

April 30th, 2010

An aside that came up during a discussion on thread pools was how many threads you should have in your application. This really has no straightforward answer other than “it depends”.

Many would argue that on most systems you really should be working at a level of abstraction that deals with this sort of detail under the hood of the library or programming language. However, that is not much help to those of us stuck with PThreads or similar low level APIs. So let’s look at what you should consider.

  • Look at the number of cores. Going for num_threads = num_cores is not a good idea in general, since any blocked thread will result in an idle core and your code could run into scalability issues as the number of cores is varied. However, for applications targeting a well specified architecture this can allow tuning for very high performance.
  • Hardware threads. Many cores support hardware threading, where the core switches between thread contexts very quickly to mask cache stalls. Tailoring the number of threads to take advantage of this can yield great performance gains.
  • How many ways can you actually partition your algorithm? This is the biggest question. If you go for a pipeline partitioning then you are going to quickly hit scaling problems, since the number of stages caps the number of useful threads. If you go for data partitioning you may create far more threads than you intended and run into problems as they contend for resources.
  • Memory hierarchy. How is the cache going to react to frequent thread swapping? If each thread has a large working set you may spend all your time thrashing the cache. Not to mention the false sharing.
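
As a starting point for the first three considerations, here is a minimal sketch in C. The helper name pick_thread_count and both of its parameters are made up for illustration, and _SC_NPROCESSORS_ONLN is a widely available extension (Linux, macOS, the BSDs) rather than something POSIX guarantees. It usually reports logical processors, so hardware threads are typically already counted.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: derive a worker count from the online CPU count.
 * threads_per_cpu is a tuning knob (1 for compute-bound work, more if
 * workers often block); max_tasks is the number of ways the algorithm
 * can actually be partitioned, which caps the useful thread count. */
static long pick_thread_count(long threads_per_cpu, long max_tasks)
{
    long cpus;
    long threads;

    assert(threads_per_cpu > 0 && max_tasks > 0);

    /* Common extension, not guaranteed by POSIX; fall back to 1 CPU. */
    cpus = sysconf(_SC_NPROCESSORS_ONLN);
    if (cpus < 1)
        cpus = 1;

    threads = cpus * threads_per_cpu;
    return threads < max_tasks ? threads : max_tasks;
}
```

Calling pick_thread_count(1, 64) on a quad core machine without hardware threading would be expected to give 4. Treat the result as a first guess to measure against, not as an answer.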

There are many more issues around this and the impact is very much dependent on the target system. This just illustrates again how the move to multicore is forcing SW engineers to consider architecture and performance far earlier in the design cycle than before.

The best advice for developers of software that will be ported to many platforms is to verify that their parallel implementation is as scalable as possible, with clean interfaces between tasks for easy mapping onto whatever concurrency mechanism is in place on the target systems. That is: make your code multicore ready before going to multicore.

Barry

Performance issues

November 27th, 2009

We spend a lot of time working with customers to ensure that their code is free from the bugs which plague multithreaded programs, such as data races and deadlock - and rightly so. However, worrying over the details makes it easy to lose sight of the whole reason for going multicore in the first place - performance.

Although performance optimization, and the associated architectural awareness required, is familiar to many developers, actual optimization is often an afterthought. Functionality is king and there will always be faster computers.

Now that applications have stopped getting faster by themselves, the software developer must be aware of performance optimization from much earlier in the design process. Multicore means that we all need to dust off the computer architecture textbooks and start thinking about the hardware again.

A good example of this is False Sharing. This occurs when 2 or more threads running across multiple cores do not actually share data but access data at addresses close to each other.

This then works against the way the cache operates. Each core has a small, fast cache memory attached to it which is loaded with data from the slower main memory in an effort to avoid stalls in the processor as it waits for data. The cache works on the principle of data locality: most programs which access a location in memory will probably access it, or a nearby location, multiple times in a short period of time. The cache is therefore designed to load multiple addresses (a line or block) of data at a time to take advantage of this. In False Sharing this means that 2 cores must share a line of cache data even though no actual data is being shared between them.
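
To make the layout problem concrete, here is a small C sketch. It assumes a 64 byte cache line, which is common but by no means universal, and the struct and helper names are invented for illustration. It contrasts per-thread counters packed next to each other with counters padded out to one line each.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64 byte cache lines. Check the target's documentation. */
#define CACHE_LINE 64

/* Packed: in an array of these, several threads' counters land on the
 * same cache line, so every write disturbs the neighbouring cores. */
struct packed_counter {
    long value;
};

/* Padded: each element fills a whole line's worth of bytes, so the
 * value fields of two array elements can never share a cache line. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

/* How many array elements of the given size fit on one cache line. */
static size_t per_line(size_t elem_size)
{
    assert(elem_size > 0);
    return CACHE_LINE / elem_size;
}
```

With 8 byte longs, per_line(sizeof(struct packed_counter)) gives 8 counters per line, whereas the padded version gives 1. The price is memory: the padded array is several times larger.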

The results can be bad apparently - I was not sure how bad so I built a little test: a program which calculated a Sum of Products of 2 large arrays in 2 ways. First INTERLEAVED, where each thread stepped through the data accessing a fixed offset into blocks the size of the number of threads, basically interleaving the accesses to adjacent addresses. For higher numbers of threads this is going to be bad for locality too - so a double whammy.


for (i = params->threadNo; i < dataSize; i += params->threads)
    params->a[i] += params->a[i] * params->b[i];

Next BLOCKY, which allocated a contiguous chunk of data to each thread:


for (i = aBlockSize * params->threadNo; i < aBlockSize * (params->threadNo + 1); ++i)
    params->a[i] += params->a[i] * params->b[i];
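
For anyone wanting to try something similar, here is one way the two loops might be wrapped into a complete pass. This is a sketch, not the original test program. The field names follow the fragments above, but the struct layout, the worker and run_pass functions, the MAX_THREADS limit and the requirement that the data size divides evenly by the thread count are all assumptions.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

enum mode { INTERLEAVED, BLOCKY };

/* Assumed layout; only the field names come from the fragments above. */
struct params {
    double *a;
    const double *b;
    size_t dataSize;
    size_t threads;
    size_t threadNo;
    enum mode mode;
};

static void *worker(void *arg)
{
    struct params *p = arg;
    size_t i;

    if (p->mode == INTERLEAVED) {
        /* Neighbouring elements belong to different threads. */
        for (i = p->threadNo; i < p->dataSize; i += p->threads)
            p->a[i] += p->a[i] * p->b[i];
    } else {
        /* Each thread owns one contiguous chunk of the arrays. */
        size_t aBlockSize = p->dataSize / p->threads;
        for (i = aBlockSize * p->threadNo;
             i < aBlockSize * (p->threadNo + 1); ++i)
            p->a[i] += p->a[i] * p->b[i];
    }
    return NULL;
}

#define MAX_THREADS 32

/* Run one pass over the arrays with the chosen partitioning.
 * Returns 0 on success or the error code from pthread_create. */
static int run_pass(double *a, const double *b, size_t n,
                    size_t threads, enum mode mode)
{
    pthread_t tid[MAX_THREADS];
    struct params p[MAX_THREADS];
    size_t t, started = 0;
    int rc = 0;

    assert(threads > 0 && threads <= MAX_THREADS && n % threads == 0);

    for (t = 0; t < threads; ++t) {
        p[t] = (struct params){ a, b, n, threads, t, mode };
        rc = pthread_create(&tid[t], NULL, worker, &p[t]);
        if (rc != 0)
            break;
        ++started;
    }
    /* Join only the threads that were actually started. */
    for (t = 0; t < started; ++t)
        pthread_join(tid[t], NULL);
    return rc;
}
```

Both modes compute the same result, so any timing difference between run_pass(a, b, n, threads, INTERLEAVED) and run_pass(a, b, n, threads, BLOCKY) comes from the access pattern alone.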

Running the code for 1, 2, 4, 8, 16 and 32 threads on a Quad-Core Xeon produces the following numbers from the time command. As you might expect, INTERLEAVED does much worse than BLOCKY as the number of threads increases.

Graph showing false sharing penalty

What is interesting here is the Real and User numbers for INTERLEAVED. While the Real (wall clock) time at first improves slightly and then rapidly becomes worse than that of a single thread, it is the User time that increases massively. This highlights a major issue when things go wrong like this on multicore. The User time remains at about 4x the Real time, which is unsurprising since this is a 4 core machine, but the lesson is that all of those cores are continually running. That means we are burning 4x the power to achieve ever worsening performance, resulting in a huge overall energy drain. Maybe not a problem for a fan cooled, mains powered server but very bad news for your mobile phone.

Barry

The Horizon is Closer Than You Think

October 14th, 2009

I’ve started seeing promotion for the EE Times ManyCore Virtual conference on various websites which follows on from their MultiCore conference in the summer. The first thing I noticed is that they have drawn the line between multi and many core between 16 and 32 cores.

This seems fairly reasonable for server class platforms, but for embedded systems, where interconnect speeds tend to be a lot slower, the latency at the lowest levels of memory makes it hard to see how traditional shared memory multicore programming can be practical above 4 or 8 cores.

So this raises the question: is many-core something mainstream programmers should be thinking about? After all, we have only just started looking at multicore and that is hairy enough, but now we are told to throw away the idea of a global address space too!

One of the main features of many-core is the huge array of new (and not so new) languages and architectures that are vying for attention. Which way should you go to avoid investing in the next Betamax?

Fortunately there is no need to jump at random. It is likely to take several years for the market to sort itself out and for a few dominant approaches to emerge. In the meantime there is plenty that can be done now to make the move to new concurrent frameworks a lot less risky.

The key is to make sure current and legacy code is multi/many/whatever-core ready. Even when targeting single core, take the time to understand how the code is structured and where the performance hotspots are. Make sure that the iterations of key loops and modules share as little state as possible (data independence), and that all necessary communication is organized as clearly as possible and is preferably wrapped in a function that is called only as necessary.
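
As a rough illustration of what that structure can look like, even in single core code, here is a sketch in C. The function names are invented for the example. The work is split so that each slice touches only its own data, and the one place where results meet is a single function.

```c
#include <assert.h>
#include <stddef.h>

/* Independent work: reads only its own slice [begin, end) and writes
 * nothing shared, so slices can run in any order or on any core. */
static double partial_dot(const double *a, const double *b,
                          size_t begin, size_t end)
{
    double sum = 0.0;
    size_t i;

    assert(begin <= end);
    for (i = begin; i < end; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* The single place where results from different slices are combined. */
static double combine(const double *partials, size_t count)
{
    double total = 0.0;
    size_t i;

    for (i = 0; i < count; ++i)
        total += partials[i];
    return total;
}
```

Run sequentially this is just a dot product computed in pieces. The point is that moving partial_dot onto threads, tasks or another framework later only changes who calls it, not what it does.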

If all this sounds a bit like hard work don’t worry, as the guys at Apple would say: There’s an app for that ;-)

Barry

Tim Toady

September 30th, 2009

“There’s more than one way to do it” may be the Perl motto, but it can fairly be said about just about anything in general and computer science in particular.

When dealing with typical multi-threaded software (written in C or C++, with lots of pointers) tools have to rely on dynamic analysis in order to extract useful information about data dependencies and the inherent hazards these pose to the correct behavior of the application.

There are broadly 2 ways (but with uncounted variations) to collect this dynamic information:

  • Instrumentation of the application as it runs, typically on hardware.
  • Recording the state changes in a Virtual Machine which is running the application.

But how do these 2 approaches stack up against each other?

Instrumentation benefits from speed and convenience. In many cases users will be tracing applications directly on their workstation hardware or on a development board which they always run their apps on. This gives a reassuring sense of knowing that the captured behavior is from the actual target hardware that will be used by customers.

The problem is that this is a false sense of security. Instrumentation will always disrupt the behavior of the program it is tracing. Typically the scheduling, core allocation and completion ordering of threads will be disrupted, so anyone hoping to see exactly what will happen on the real hardware is going to be disappointed. It is this sort of interference that makes it notoriously difficult to debug multi-threaded applications.

Turning to Virtual Machines, you get the possibility of completely hands off tracing and the opportunity to target many different system configurations without cluttering up your desk with expensive dev boards.

However things are not that straightforward. The virtual platform is going to be slower than the actual hardware in most cases, especially when tracing in great detail, and it is a lot harder to extract the higher level scheduling information from the OS making it difficult to establish which thread is running on which core at what time. Few Virtual platform vendors have this capability as yet but many are adding it as demand for this sort of tracing grows.

So which approach is best? As always it depends on what you are looking for.

Data dependency analysis does not depend much on the order of thread execution. If 2 threads ever access the same memory without synchronization then that is probably bad, so instrumentation is still very useful.
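
For reference, this is the kind of access a dependency analysis is looking for, shown here in its repaired form. It is a minimal C sketch with invented names: two threads update the same counter, and the mutex is what makes that safe. Remove the lock and unlock calls and the final total depends on the schedule.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static long shared_total = 0;
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

/* Adds 1 to the shared total *count times, holding the lock for each
 * update so the read-modify-write cannot interleave with the other
 * thread's update. */
static void *add_locked(void *arg)
{
    const long *count = arg;
    long i;

    assert(count != NULL);
    for (i = 0; i < *count; ++i) {
        pthread_mutex_lock(&total_lock);
        ++shared_total;
        pthread_mutex_unlock(&total_lock);
    }
    return NULL;
}

/* Runs two workers and returns the final total, or -1 if a thread
 * could not be started. */
static long run_two_workers(long count)
{
    pthread_t t1, t2;

    shared_total = 0;
    if (pthread_create(&t1, NULL, add_locked, &count) != 0)
        return -1;
    if (pthread_create(&t2, NULL, add_locked, &count) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return shared_total;
}
```

With the lock in place, run_two_workers(100000) returns 200000 whichever thread happens to run first, which is why thread ordering matters so little for this kind of analysis.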

Since many programs will be running under an OS with any number of other active programs, it is unlikely that you can ever expect to see a typical schedule anyway. If your application expects a certain schedule for correct execution without enforcing it, then you deserve all the bugs you get.

Virtual Machines come into their own mostly in the deeply embedded space, where many-core systems are often carved up into several virtualized multicore domains in which the OS only runs a single application (telecoms is a typical example of this) and it pays to analyze the detailed runtime behavior to squeeze out every gram of performance.

So, in conclusion, I’m off to find a comfy fence to sit on…

Barry

Multicore Apples

July 16th, 2009

There is no shortage of multicore technologies at the moment. Not far beyond the SMP offerings of Intel and AMD there is a wealth of alternative manycore architectures, most widely available in the form of general purpose graphics processors. In embedded systems the choice is even wider, and this is before we even start looking at the various software libraries and runtimes that support a vast array of parallel programming paradigms.

Into this Apple has recently announced the features of its next OS update and there is a lot in it for multicore developers. Apple controls both the hardware and OS aspects of its platform and so can dictate what it provides to its developers, which makes the choices it has made very interesting. The central technology is Grand Central Dispatch, a runtime thread management layer and API developed by Apple which attempts to remove some of the housekeeping burden from its developers. In addition there is fully integrated support for OpenCL, an open standard initially aimed at opening up graphics processors for general purpose processing.

The combination of these approaches with the range of server/desktop hardware available (which ranges from 2 Core SMP + 16 core GPU up to 2×4 Core SMP + 32 core GPU) will open up heterogeneous multicore programming to a substantial and, importantly, mainstream group of developers.

Looking 1 or 2 years further on, and assuming that rumours of a multicore iPhone are true, it is very possible that the same GCD + OpenCL programming model could find its way onto one of the most widely developed-for embedded systems.

Barry

Of Apples and Oranges

June 5th, 2009

EEMBC have released CoreMark. This is a simple benchmark for comparing embedded processors which focuses on evaluating the performance of the core itself without being too heavily affected by the rest of the system configuration.

Normally I don’t, as a rule, get hugely excited about benchmarks but CoreMark has 3 properties which particularly appeal to me:

  1. It is very easily portable across platforms. Saving on effort.
  2. It distills the performance of the core into a single number for easy comparison. Saving yet more effort.
  3. It’s free.

After some mucking about generating CoreMark scores for the various processor and compiler combinations lying around the office, I thought some actual work might be in order and started to wonder how this benchmark, being representative of typical embedded workloads, could be modified to take advantage of multicore platforms.

So, casually ignoring the fact that modifying the source code invalidates the test scores, I recorded some traces from an ARM9 core and started on an analysis with Prism. You can read my initial findings in this PDF. The next step is to put these findings into action with a PThreads implementation.

Barry

Thanks for the Memory

January 23rd, 2009

When retro-fitting a sequential application with threads there are many potential problems; however, it’s always the side effects that seem to get you with unexpected bugs. This situation becomes worse when you are attempting to crowbar your application into the resource constrained environment of an embedded system.

Threads (here I’m talking about PThreads, but it is true for most libraries) have overhead. There are design, code, maintenance and runtime overheads and, of course, memory overhead, which is the one you really have to look out for.

PThreads are basically lightweight processes and so have to carry around a good deal of state with them, chief among which is the stack. In a typical implementation, the stack for each thread is allocated on the heap when the thread is created and is a fixed size. This sounds straightforward enough but can immediately cause several failure modes.

  • The per thread stack size multiplied by the number of threads is greater than the available heap space, resulting in thread creation failure.
  • There is enough space on the heap for all of the thread overhead but the next malloc call fails because there is no more room.
  • The per thread stack is too small for the amount of data that needs to be pushed onto it, leading to any number of obscure errors.

So what can you do about this? Fortunately PThreads provides a solution via its thread attributes API. If you are not familiar with attributes it is simply a structure for configuring a thread which is passed to pthread_create as the second parameter which is otherwise set to NULL, as below.


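pthread_t aThread;
pthread_attr_t myAttr;
pthread_attr_init(&myAttr);
/* myFunc must have the signature void *myFunc(void *); no cast is needed */
pthread_create(&aThread, &myAttr, myFunc, myParams);
pthread_attr_destroy(&myAttr);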

Once the pthread_attr_t variable is initialized you can use it to specify the size of the stack you require (among other things). However, it is not always that straightforward (is anything?). You will need to test whether your PThreads library actually supports a variable stack size, and then you need to test what the minimum allowable stack size is before you set it. Both are handled via macros defined for your system: _POSIX_THREAD_ATTR_STACKSIZE and PTHREAD_STACK_MIN.


/* Needs <pthread.h>, <limits.h> (PTHREAD_STACK_MIN), <stdio.h> and <unistd.h> */
pthread_attr_t myAttr;
size_t myStackSize;
pthread_attr_init(&myAttr);
#ifdef _POSIX_THREAD_ATTR_STACKSIZE
/* Report the default before overriding it; %zu is the size_t format */
pthread_attr_getstacksize(&myAttr, &myStackSize);
printf("Default Stack is %zu Bytes.\n", myStackSize);
myStackSize = 16*1024; /* Try to set stack to 16K */
if (myStackSize >= PTHREAD_STACK_MIN)
    pthread_attr_setstacksize(&myAttr, myStackSize);
else
    printf("PANIC! New Stack size too small!\n");
#else
printf("PANIC! Cannot set stacksize.\n");
#endif

So there you go. Each thread can now have a custom stack size tailored to its own needs, minimizing the total thread overhead of the system. Of course the problems don’t end there. Many modifications when parallelizing serial code, particularly when attempting to improve performance by adding thread local buffers, will further increase the memory footprint of your application and unexpected combinations of threads may result in memory usage spikes. So on behalf of future code maintainers everywhere please, please check the return value of malloc() and friends for failures!
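
The same goes for the thread calls themselves, which report failure through their return values rather than by crashing. Here is a minimal sketch of a creation wrapper that checks each step. The names create_with_stack and echo_arg are invented for the example, and the error codes in the comments are those listed by POSIX.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Placeholder thread body for illustration: hands its argument back. */
static void *echo_arg(void *arg)
{
    return arg;
}

/* Hypothetical helper: start a thread with the requested stack size.
 * Returns 0 on success or the error code from the failing call. */
static int create_with_stack(pthread_t *thread, size_t stack_size,
                             void *(*func)(void *), void *arg)
{
    pthread_attr_t attr;
    int rc;

    assert(thread != NULL && func != NULL);

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    /* EINVAL here means the size is below PTHREAD_STACK_MIN or exceeds
     * a system limit. */
    rc = pthread_attr_setstacksize(&attr, stack_size);
    if (rc == 0) {
        /* EAGAIN here means the system lacks the resources (such as
         * memory for the stack) to create another thread. */
        rc = pthread_create(thread, &attr, func, arg);
    }

    pthread_attr_destroy(&attr);
    return rc;
}
```

A caller would check the result of create_with_stack(&t, 64 * 1024, echo_arg, NULL) and, on a non-zero return, either retry with fewer threads or report the error, rather than carrying on with a thread that never started.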

Barry

Making the Most of the Multicore Crisis

November 18th, 2008

The multicore revolution is looking pretty bloody at the moment. Holy wars are erupting over whether message passing or shared memory is the best programming model; factions are pushing their favored language as the new standard or are resurrecting technologies from the 70s and, most of all, everyone is fretting about where these developers with parallel programming skills are going to come from to rewrite the entire software canon.

This makes it easy to lose sight of the huge opportunities now available to those who keep their heads and are ready to exploit the best technologies to come out of the melee.

The big win for multicore, particularly for embedded systems, is power savings. These come in a number of guises, and things are still moving quickly, but right now there are a number of approaches that can deliver based on the familiar shared memory model which currently holds sway in the desktop market.

Consolidate

This is the simplest way to cut power and is already familiar from SoCs. By integrating a number of identical cores onto a single die rather than several packages on a PCB, substantial power savings are being realized. This approach is proving popular with DSP vendors, such as TI and Qualcomm, many of whom are delivering dual to quad core versions of their most popular processors as well as combinations of DSPs with general purpose processors. From a software perspective this maybe isn’t true multicore, since there is little if any support or encouragement for synchronization between the cores and software silo based design continues to be the norm.

Reduce Clock Speed

High end embedded systems with operating systems, like Smart Phones and Mobile Internet Devices, often run on a monolithic processor clocked as fast as possible. The clock rate for the processor is typically much higher than is actually required for any one application in order to provide a responsive multitasking experience for the user. The processors for these systems are typically designed to run at up to 1GHz, which carries a substantial power penalty. By simply opting for the multicore version of these processors, such as ARM MPCore, and using an SMP OS, the number of tasks sharing each core is reduced, allowing the designer to reduce the clock frequency for these cores.

The trick here is that lowering the required clock frequency for each core can allow the processor to be implemented with a low power IC process library. This allows power savings while retaining or even increasing the number of MIPS available to tasks in the OS. For example, on the ARM11MPCore web page, high performance and low power figures are quoted, with the low power library timing to only 320 MHz. So by switching from a single core at 500 MHz and 0.43 mW/MHz (215 mW) to a 300 MHz dual core at 0.23 mW/MHz (138 mW in total) you can achieve a 35% power reduction while gaining 100% more L1 cache and 20% more MIPS. If there is area budget available, it is an easy win.
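
The arithmetic behind those figures is easy to rerun for other parts. A small C sketch follows, with invented function names. It assumes power scales linearly with clock frequency at the quoted mW/MHz figure, and that available MIPS scale with the total MHz across cores.

```c
#include <assert.h>

/* Power in mW for `cores` identical cores clocked at `mhz`, each
 * drawing `mw_per_mhz` (assumes linear scaling with frequency). */
static double power_mw(int cores, double mhz, double mw_per_mhz)
{
    assert(cores > 0);
    return cores * mhz * mw_per_mhz;
}

/* Percentage change from `before` to `after`; negative is a reduction. */
static double percent_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

Here power_mw(1, 500, 0.43) is 215 mW and power_mw(2, 300, 0.23) is 138 mW, a change of a little under -36%, in line with the roughly 35% reduction quoted. The aggregate clock goes from 500 MHz to 2 x 300 = 600 MHz, which is where the 20% MIPS gain comes from.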

Processor Downgrade

Following on from reducing the clock frequency, there is the further option of downgrading the actual core used. This can be a good option for more deeply embedded applications where the load is concentrated in a predictable, fixed set of applications which run one at a time and are amenable to multithreading. If the load is evenly balanced, each core will have a lower performance requirement, so it can make sense to switch to an architecture that attains lower MIPS per MHz, which typically also means lower mW per MHz.

Unfortunately, the mainstream IP vendors do not yet appear to be offering this downgrade path, preferring to focus on multicore versions of their high end processors. However, some system integrators are starting to offer multicore platforms based around microcontroller architectures, which should start to fill this niche. If you are feeling brave it is also possible to roll your own multicore systems based on the likes of the Cortex M3, as TI presented at ARM DevCon.

Opportunity

So are things all that bad? There may be a multicore crisis, but processor vendors are already rolling out practical, easy to use, shared memory systems based around existing architectures, which allow system designers to access the benefits of parallelism with surprisingly little effort. Since these platforms support the thread model of parallel programming, software engineers who invest the time in reading up on the subject can only benefit.

Barry