How many threads ?

April 30th, 2010

An aside that can up during a discussion on Thread Pools was how many threads should you have in your application ? This really has no straightforward answer other than “depends”.

Many would argue that on most systems you really should be working at a level of abstraction which deals with this sort of detail under to hood of the library or programming language however that is not much help to those of use stuck with PThreads or similar low level APIs. So lets look at what you should consider.

  • Look at the number of cores. Now going for num_threads = num_cores is not a good idea since any blocked threads will result in an idle core and your code could run into scalability issues as the number of cores is varied. However for applications targeting a well specified architecture this can allow tuning for very high performance.
  • Hardware Threads. Many cores support hardware threading where the core will switch between thread contexts very quickly to mask cache stalls. Tailoring to the number of threads to take advantage of this can yield great performance gains.
  • How many ways can you actually partition your algorithm. This is the biggest question. If you go for a pipeline partitioning then you are going to quickly hit scaling problems. If you go for data partitioning you may create far more threads than you intended and run into problems as they content for resources.
  • Memory hierarchy. How is the cache going to react to frequent thread swapping ? If each thread has a large working set you may spend all your time thrashing the cache. Not to mention the false sharing.

There are many more issues around this and impact is very much dependent on the target system. This just illustrates again how the move to multicore is forcing SW engineers to consider architecture and performance far earlier in the design cycle than before.

The best advice for developers of software that will be ported to many platforms can only be to ensure that they verify that their parallel implementation is as scalable as possible with clean interfaces between tasks for easy mapping onto whatever concurrency mechanism is in place on the target systems. That is: Make you code MultiCore ready before going to multicore.

Barry

Performance issues

November 27th, 2009

We spend a lot of time working with customers to ensure that there code is free from the bugs which plague multithreaded programs such as data races and deadlock - and rightly so. However worrying over the details make it easy to lose sight of the whole reason for going multicore in the first place - performance.

Although performance optimization, and the associated architectural awareness required, is familiar to many developers actual optimization is often an after thought. Functionality is king and there will always be faster computers.

Now that applications have stopped getting faster by themselves the software developer must be aware of performance optimization from much earlier in the design process. Multicore means that we all need to dust off the computer architecture text books and start thinking about the hardware again.

A good example of this is False Sharing. This occurs when 2 or more threads running across multiple cores do not actually share data but access data at nearby addresses to each other.

This then works against the way the cache operates. Each core as a small, fast cache memory attached to it which is loaded with data from the slower main memory in an effort to avoid stalls in the processor as it waits for data. The cache works on the principle of data locality: Most programs which access a location in memory will probably access it, or an nearby location, multiple times in a short period of time. The cache therefore is designed to load multiple addresses (a line or block) of data at a time to take advantage of this. In False Sharing this means that 2 cores must share a line of cache data even though no actual data is being shared between them.

The results can be bad apparently - I was not sure how bad so I built a little test. A program which calculated a Sum of Products of 2 large arrays in 2 ways. First INTERLEAVED where each thread stepped through the data accessing a fixed offset into blocks the size of the number of threads, basically interleaving the accesses to adjacent addresses. For higher numbers of threads this is going to be bad for locality too - so a double whammy.


for (i=params->threadNo; i < dataSize; i += params->threads)
params->a[i] += params->a[i] * params->b[i];

Next BLOCKY which allocated a contiguous chunk of data to each thread


for (i = aBlockSize * params->threadNo; i < aBlockSize * (params->threadNo + 1); ++i)
params->a[i] += params->a[i] * params->b[i];

Running the code for 1, 2, 4, 8, 16 and 32 threads on a Quad-Core Xeon produces the following numbers from time. As you might expect INTERLEAVE does much worse than blocky as the number of threads increases

Graph showing false sharing penalty

What is interesting here is the Real and User numbers for INTERLEAVE. While the actual Real or wall clock time at first improves slightly and then rapidly becomes worse than that of a single thread it is the User time increases massively. This highlights a major issue when things go wrong like this on multicore. The User time remains at 4x of the real time which is unsurprising since this is a 4 core machine but the lesson is that all of those cores are continually running which means we are burning 4x the power to achieve ever worsening performance resulting in a huge overall energy drain. Maybe not a problem for a fan cooled, mains powered server but very bad news for your mobile phone.

Barry

The Horizon is Closer Than You Think

October 14th, 2009

I’ve started seeing promotion for the EE Times ManyCore Virtual conference on various websites which follows on from their MultiCore conference in the summer. The first thing I noticed is that they have drawn the line between multi and many core between 16 and 32 cores.

This seems fairly reasonable server class platforms but for embedded systems, where interconnect speeds tend to be a lot slower, the latency at the lowest levels of memory make it hard to see how practical traditional shared memory multicore programming is possible above 4 or 8 cores.

So this begs the question; is many-core something mainstream programmers should be thinking about? After all we have only jut started looking at multicore and that is hairy enough but now we are told we to throw away the idea of a global address space too !

One of the main features of many-core is the huge array of new (and not so new) languages and architectures that are vying for attention. Which way should you go to avoid investing in the next Betamax?

Fortunately there is no need to jump at random. It is likely to take several years for the market to sort it self out and for a few dominant approaches to emerge. In the meantime there is plenty that can be done now to make to move to new concurrent frameworks a lot less risky.

The key is to make sure current and legacy code is multi/many/whatever-core ready. Even when targeting single core take the time to understand how the code is structured, where the performance hotspots are and that the iterations of key loops and modules share as little state as possible (data independence) and that all necessary communication is organized clearly as possible and is preferably wrapped in a function that is called only as necessary.

If all this sounds a bit like hard work don’t worry, as the guys at Apple would say: There’s an app for that ;-)

Barry

Tim Toady

September 30th, 2009

There’s more than one way to do it may be the Perl motto but it is can be fairly said about just about anything in general and computer science in particular.

When dealing with typical multi-threaded software (written in C or C++, with lots of pointers) tools have to rely on dynamic analysis in order to extract useful information about data dependencies and the inherent hazards these pose to the correct behavior of the application.

There are broadly 2 ways (but with uncounted variations) to collect this dynamic information:

  • Instrumentation of the application as it runs, typically on hardware.
  • Recording the state changes in a Virtual Machine which is running the application.

But how do these 2 approaches stack up against each other?

Instrumentation benefits from speed and convenience, in many cases user will be tracing applications directly on their workstation hardware or a development board which they alway run their apps on. This gives a reassuring sense of knowing that the captured behavior is from the actual target hardware that will be used by customers.

The problem is that this is a false sense of security. Instrumentation will alway disrupt the behavior of the program it is tracing. Typically the scheduling, core allocation and completion ordering of threads will be disrupted so anyone hoping to see exactly what will happen on the real hardware is going to be disappointed. It is this sort of interference that makes it notoriously difficult to debug multi-threaded applications.

Turning to Virtual Machines, you get the possibility of completely hands off tracing and the opportunity to target many different system configurations without cluttering up your desk with expensive dev boards.

However things are not that straightforward. The virtual platform is going to be slower than the actual hardware in most cases, especially when tracing in great detail, and it is a lot harder to extract the higher level scheduling information from the OS making it difficult to establish which thread is running on which core at what time. Few Virtual platform vendors have this capability as yet but many are adding it as demand for this sort of tracing grows.

So which approach is best ? As always it depends on what you are looking for.

Data dependency analysis does not depend much on the order of thread execution. If 2 threads ever access the same memory without synchronization then that is probably bad, so instrumentation is still very useful.

Since many programs will be running under an OS with any number of other active programs it is unlikely that you can ever expect to see a typical schedule anyway so if your application expects a certain schedule for correct execution without enforcing it then you deserve all the bugs you get.

Virtual Machines come into there own mostly in the deeply embedded space where many-core systems are often carved up into several virtualized multicore domains where the OS only runs a single application (telecoms is a typical example of this) and it pays to analyze the detailed runtime behavior to squeeze out every gram of performance.

So, in conclusion, I’m off to find a comfy fence to sit on…

Barry

It’s Just Not Fair

September 17th, 2009

Whilst debugging some pthread code recently I came across a somewhat unexpected and seemingly non-intuitive result regarding locks. Here is a simple pthread example program. The main creates a single child thread. This thead that works away incrementing the sGlobal counter until sContinue is cleared by main. The main simply checks the count and stops the child when the count exceeds 10.  Of course, the accesses to sGlobal are protected by a mutex to ensure that a partially updated result isn’t read and there are no nasty data races (okay, so an int can be read atomically on pretty much any machine but sGlobal could be a more complex object).

static int sGlobal;
static int sContinue;
static pthread_mutex_t sMutex;

static void* childthread(void* p)
{
  do
  {
    pthread_mutex_lock(&sMutex);
    sGlobal++;
    printf("Count = %d\n", sGlobal);
    pthread_mutex_unlock(&sMutex);
  }
  while (sContinue);
}

int main(int argc, char *argv[])
{
  pthread_t aThreadID;
  int aCount;

  pthread_mutex_init(&sMutex, NULL);
  sContinue = 1;
  sGlobal = 0;
  pthread_create(&aThreadID, NULL, childthread, NULL);
  do
  {
    pthread_mutex_lock(&sMutex);
    aCount = sGlobal;
    pthread_mutex_unlock(&sMutex);
  }
  while (aCount < 10);
  sContinue = 0;
}

So what would you expect the output to be? Maybe a count 0 to 10 and then the program exits?  I was expecting that main thread and the child thread to contend for the lock but ultimately for both of them to get pretty fair access to it. If the child thread has the lock but the main thread wants it I was expecting it to be next in line for acquiring it. Unfortunately there isn’t an orderly queue waiting for the lock, its a total free for all. As it turns out pthread mutexes provide absolutely no guarantee of fairness for lock accesses whatsoever. It doesn’t matter how long a thread has been waiting for a lock, another thread can push into the queue and get the lock again even after its just released it.

So what do you get when you run this. Well, it all depends. I ran this on a Linux Redhat 4 system with NPTL threads and the final count at termination were for successive runs was 641, 573, 229, 29, 25, 14791, 1708. Spot any pattern? No, neither can I. I also tried it on an ARM Debian Linux version with the older LinuxThreads implementation. Its still going. So this is rather random and very non-portable.

The child thread is hogging the lock. Even though it releases it, it grabs it again very quickly afterwards. Unless the process happens to get pre-empted just at the that point it will get the lock back again, even if the main thread has been waiting for ages. If most of the processing is done while the lock is held then there is very little opportunity for another thread to acquire the lock.

So why this rather rude free-for-all with locks? Its all in the name of efficiency. If orderely queues were maintained for each lock then every time a thread wants to acquire a lock it would need to be polite by checking if any other thread were waiting rather than just grabbing it. The queue itself would probably also need to be protected with locks since it could be accessed by any thread, and so it goes on. Its just simple and efficient to grab the lock if its available, no questions asked. Its up to the programmer to ensure fairness and synchronisation using other methods such as condition variables.

And what was the answer for ARM LinuxThreads? Well, its at 389000 and still counting. I suspect its never going to halt, but you never know (as Turing famously proved…).

Richard

Stay focused

September 1st, 2009

I’m teaching two classes at ESC Boston in September:

[ESC-442] Practical Migration of Sequential C/C++ Code to Multi-core Systems

[ESC-322] Coprocessing and Multiprocessing Techniques to Accelerate Software

and I have been busy working up examples and polishing slides.

Lots of people suggest that parallel programming is hard, so I thought I would share a few of my lessons learned during the run up to ESC Boston. Of course, some parallel programming lessons are subtle and hard won, but then there are others which remind you that no matter how complex things get, some mistakes are just plain dumb. Might as well start there…

I was reworking an image processing example to demonstrate a loop fusion technique which makes it more efficient to parallelize a processing pipeline. Part of the algorithm uses a Sobel filter, part of which includes:

x_sum  = 1 * in_pixels[rn * ncols + cw];
x_sum += 2 * in_pixels[r  * ncols + cw];
x_sum += 1 * in_pixels[rs * ncols + cw];
x_sum -= 1 * in_pixels[rn * ncols + ce];
x_sum -= 2 * in_pixels[r  * ncols + ce];
x_sum -= 1 * in_pixels[rs * ncols + ce];

This works fine (I’ve used it before), but as I was refactoring the algorithm and adding threading, I decided that the computation might look a wee bit clearer if I stopped repeating the x_sum assignment. The quick code modification looks like this:

x_sum  = 1 * in_pixels[rn * ncols + cw];
       + 2 * in_pixels[r  * ncols + cw];
       + 1 * in_pixels[rs * ncols + cw];
       - 1 * in_pixels[rn * ncols + ce];
       - 2 * in_pixels[r  * ncols + ce];
       - 1 * in_pixels[rs * ncols + ce];

I just couldn’t leave well enough alone. I finished up all the refactoring, and pretty quickly I got a clean compile. Great!

But of course, when I ran it, I got nothing close to the test image I was expecting. Well that’s not too unexpected- I figured I’d need to do a little debugging- I was fusing loops and I had to lag some computations in buffers to make everything pipe out correctly- it’s pretty easy to make an error in there somewhere.

You probably know the rest. After adding a fair amount of debug scaffolding to examine what was happening, I concluded that I had done all the refactoring and fusing correctly. I finally narrowed the problem down to the Sobel, and it took a lot of head scratching to realize that for the x_sum computation, I’d removed ‘=‘ assignments, but I had failed to remove the corresponding ‘;‘ at all but the last line. No complaints from my compiler; it’s perfectly legal though not terribly useful C code.

So today’s important parallel programming lesson- stay focused. As you migrate from sequential to parallel code, change the minimum necessary to move from one stable program to another. Resist the urge to clean up a few incidentals along the way- some people think parallel programming is hard enough without making it more so.

- Skip

BTW: for GCC, anybody know which compiler switches to add (or remove) to warn about a statement with no lvalue and no side effects?

Skip

Multicore Apples

July 16th, 2009

There are no shortage of multicore technologies at the moment. Not far beyond the SMP offerings of Intel and AMD there is a wealth of alternative manycore architectures - most widely available in the form of General Purpose graphic processors. In embedded systems the choice is even wider and this is before we even start looking at the various software libraries and runtimes that support a vast array of parallel programing paradigms.

Into this Apple has recently announced the features of its next OS update and there is a lot in it for multicore developers. Apple controls both the hardware and OS aspects of its platform and so can dictate what it provides to its developers so the choices it has made are very interesting. The central technology is Grand Central Dispatch which is a runtime thread management layer and API developed by Apple which is attempting to remove some of the housekeeping burden from its developers. In addition is fully integrated support for OpenCL which is an open standard initially aimed at opening up graphic processors for general purpose processing.

The combination of these approaches with the range of server/desktop hardware available (which ranges from 2 Core SMP + 16 core GPU up to 2×4 Core SMP + 32 core GPU) will open up heterogeneous multicore programing to a substantial and, importantly, mainstream group of developers.

Looking 1 or 2 years further on and assuming that rumours of a mutlcore iphone are true then it is very possible that the same GCD + OpenCL programing model could find its way onto one of the most widely developed for embedded systems.

Barry

Of Apples and Oranges

June 5th, 2009

EEMBC have released CoreMark. This is a simple benchmark for comparing embedded processors which focuses on evaluating the performance of the core itself without being too heavily effected by the rest of the system configuration.

Normally I don’t, as a rule, get hugely excited about benchmarks but CoreMark has 3 properties which particularly appeal to me:

  1. It is very easily portable across platforms. Saving on effort.
  2. It distills the performance of the core into a single number for easy comparison. Saving yet more effort.
  3. It’s free.

After some mucking about generating coremark scores for the various processor and compiler combinations laying around the office, I thought some actual work might be in order and started to wonder how this benchmark, being representative of typical embedded workloads, could be modified to take advantage of multicore platforms.

So, casually ignoring the fact that modifying the source code invalidates the test scores, I recorded some traces from an ARM9 core and started on an analysis with Prism. You can read my initial findings in this PDF. The next step is to put these findings into action with a PThreads implmentation.

Barry

If you can’t get rid of them, at least isolate them

May 27th, 2009

After reading the title above, did you think I would be talking about your co-workers?

Dependency analysis is a big part of the Prism parallel migration flow. Data dependencies are characterized as true data (read-after-write), anti (write-after-read) and output (write-after-write) dependencies.

Data Dependency Types

Dependency relationships between threads limit the amount of parallelism that can be achieved. They must either be removed or properly synchronized. Improperly handled dependencies show up as data races during execution.

Anti-Dependencies

Anti-dependencies are a common artifact of optimized sequential code when data storage is reused to save space. Loops are a common source of these anti-dependencies. The good news is that, once identified, these dependencies can be removed by duplicating or localizing variables, often making the loops embarrassingly parallel. Consider this made up sequential code block:

int i;
x_t x;
y_t y[N];

for (i = 0; i < N; ++i) {
  x = f(i);
  y[i] = g(i, x);
}

The x variable is shared by all iterations. In sequential code, this works fine because the iterations occur one after another and do not overlap. Unfortunately, the shared x variable causes problems when trying to run iterations in parallel. There is an anti-dependency between iterations where one iteration cannot write a new x value until after another iteration has read the existing x value.

This is easy to fix. If the x declaration is moved inside each iteration, each iteration maintains its own local x variable:

int i;
y_t y[N];

for (i = 0; i < N; ++i) {
  x_t x;
  x = f(i);
  y[i] = g(i, x);
}

In this case, there are no more dependencies between iterations, and the iterations can be run in parallel with each other. This code is now easy to parallelize using any typical parallel API, such as Pthreads or OpenMP.

True Data Dependencies

True data dependencies are usually fundamental to the algorithm being implemented. A value cannot be used (read) until after it has been computed (written). These dependencies are usually impossible to remove without changing algorithms.

Despite this, their impact can be minimized. Consider this made up sequential code block:

int i;
x_t x[N];
y_t y;

y = y_init();
for (i = 0; i < N; ++i) {
  x[i] = f(i);
  y = g(y, x[i]);
}

To run properly, the y values must computed in order. (Finicky readers will note the implicit assumption that the g(a, b) function is not associative.) The next iteration’s initial y value cannot be read until the previous iteration has written it. This is a true data dependency.

Though it cannot be removed, the dependency can be isolated. Consider this code refactoring:

int i;
x_t x[N];
y_t y;

for (i = 0; i < N; ++i) {
  x[i] = f(i);
}

y = y_init();
for (i = 0; i < N; ++i) {
  y = g(y, x[i]);
}

Here the x and y computations have been separated. Though the y values must still be computed serially, the x iterations can all be run in parallel using common parallel APIs.

If the f(i) function is expensive, this separation pays dividends. For example, assume the upper loop takes 98% of the run time and the second iteration only 2%. Using Amdahl’s law, the speed up for 7 cores is 100/(98/7 +2) = 6.25x.

So, when migrating optimized sequential code to parallel, I usually start with the low hanging fruit first- refactor the code to remove anti-dependencies. If lucky, this will make critical loops easy to parallelize. For remaining loops with true data dependencies, consider splitting the loops to isolate the code which must remain serial.

And do try to be respectful to your co-workers; you never know…

Skip

Algorithms in the real world

May 14th, 2009

With the release of Prism, we provide a recommended flow for migrating sequential code to parallel. One of the first things we suggest is to understand the intent and code structure of your application and make sure it’s well optimized.

No matter how well optimized though, when you go to parallelize, there are typically some fundamental data dependencies which will serialize parts of your application. The standard advice is to consider switching algorithms to one that is more parallel friendly. It sounds good, but many applications I’ve worked on, for example, h.264 video decode, have algorithms prescribed by standard.

I recently took a look at parallelizing the Dijkstra algorithm as part of a network routing application.

Dijkstra Algorithm - source: Wikipedia

The algorithm works to identify shortest paths in a network by following edges and keeping track of the minimum cost to get to each node. When starting from a source node, the algorithm maintains a priority queue of active nodes, choosing the minimum distance node as the next node to start form as the algorithm traverses the edges. This gives a worst case performance of:

O(|E| + |V|log|V|)

where |E| is the number of edges, and |V| is the number of vertices (nodes). The |E| part is easy to understand; all the reachable edges should be considered. The |V|log|V| part is trickier- that one depends on the priority queue implementation. Each time a node is added, the queue must be reprioritized. A binary heap implementation requires O(log|V|) swaps, and there are |V| nodes. That explains it.

But, I’m free to choose other implementations- a simple list would work but require O(|V|) searches to find the minimum distance node each time. A Fibonacci heap priority queue might even have better performance, statistically speaking, than a binary heap. That statistical part finally made me wake up. I realized that when thinking about changing algorithms, I’d always been thinking about going to something that was considered universally better, like the difference between a quicksort and a bubble sort. In practice, for Dijkstra, it appears that the Fibonacci queue often performs worse than the binary heap.

The best choice algorithm depends on the properties of the graph being analyzed. The real world can be so messy, but that makes room for good engineering and job security!

So imagine my surprise when I picked up the serial Dijkstra benchmark code inside MiBench and discovered it didn’t even bother to reprioritize the nodes in the queue, it just used the oldest nodes first in a simple queue. With a little thinking, you realize that if you reprioritize to always select the minimum distance node first, you are guaranteed not to have to revisit previous nodes. You lose that benefit if you just take the oldest node first, so the worst case performance appears to me to be really bad:

O(|E||V|)

because at each node you might have to backtrack over previously traversed edges. That sounds wasteful, but the real world may rescue us. If edges all have similar costs, then the algorithm will add nodes with the fewest node hops first, and that will greatly reduce the number of backtracked edges needed. With no reprioritization and no backtracking, the best case performance for this simpler implementation would become:

O(|E| + |V|)

Real world results will vary, but for typical routing graphs, I bet they are closer to O(|E| + |V|).

So I have a new found appreciation for ‘changing algorithms’. Will it help my parallelization? I’ll let you know.

- Skip

Skip