Of Apples and Oranges

June 5th, 2009

EEMBC have released CoreMark. This is a simple benchmark for comparing embedded processors which focuses on evaluating the performance of the core itself without being too heavily effected by the rest of the system configuration.

Normally I don’t, as a rule, get hugely excited about benchmarks but CoreMark has 3 properties which particularly appeal to me:

  1. It is very easily portable across platforms. Saving on effort.
  2. It distills the performance of the core into a single number for easy comparison. Saving yet more effort.
  3. It’s free.

After some mucking about generating coremark scores for the various processor and compiler combinations laying around the office, I thought some actual work might be in order and started to wonder how this benchmark, being representative of typical embedded workloads, could be modified to take advantage of multicore platforms.

So, casually ignoring the fact that modifying the source code invalidates the test scores, I recorded some traces from an ARM9 core and started on an analysis with Prism. You can read my initial findings in this PDF. The next step is to put these findings into action with a PThreads implmentation.

Barry

If you can’t get rid of them, at least isolate them

May 27th, 2009

After reading the title above, did you think I would be talking about your co-workers?

Dependency analysis is a big part of the Prism parallel migration flow. Data dependencies are characterized as true data (read-after-write), anti (write-after-read) and output (write-after-write) dependencies.

Data Dependency Types

Dependency relationships between threads limit the amount of parallelism that can be achieved. They must either be removed or properly synchronized. Improperly handled dependencies show up as data races during execution.

Anti-Dependencies

Anti-dependencies are a common artifact of optimized sequential code when data storage is reused to save space. Loops are a common source of these anti-dependencies. The good news is that, once identified, these dependencies can be removed by duplicating or localizing variables, often making the loops embarrassingly parallel. Consider this made up sequential code block:

int i;
x_t x;
y_t y[N];

for (i = 0; i < N; ++i) {
  x = f(i);
  y[i] = g(i, x);
}

The x variable is shared by all iterations. In sequential code, this works fine because the iterations occur one after another and do not overlap. Unfortunately, the shared x variable causes problems when trying to run iterations in parallel. There is an anti-dependency between iterations where one iteration cannot write a new x value until after another iteration has read the existing x value.

This is easy to fix. If the x declaration is moved inside each iteration, each iteration maintains its own local x variable:

int i;
y_t y[N];

for (i = 0; i < N; ++i) {
  x_t x;
  x = f(i);
  y[i] = g(i, x);
}

In this case, there are no more dependencies between iterations, and the iterations can be run in parallel with each other. This code is now easy to parallelize using any typical parallel API, such as Pthreads or OpenMP.

True Data Dependencies

True data dependencies are usually fundamental to the algorithm being implemented. A value cannot be used (read) until after it has been computed (written). These dependencies are usually impossible to remove without changing algorithms.

Despite this, their impact can be minimized. Consider this made up sequential code block:

int i;
x_t x[N];
y_t y;

y = y_init();
for (i = 0; i < N; ++i) {
  x[i] = f(i);
  y = g(y, x[i]);
}

To run properly, the y values must computed in order. (Finicky readers will note the implicit assumption that the g(a, b) function is not associative.) The next iteration’s initial y value cannot be read until the previous iteration has written it. This is a true data dependency.

Though it cannot be removed, the dependency can be isolated. Consider this code refactoring:

int i;
x_t x[N];
y_t y;

for (i = 0; i < N; ++i) {
  x[i] = f(i);
}

y = y_init();
for (i = 0; i < N; ++i) {
  y = g(y, x[i]);
}

Here the x and y computations have been separated. Though the y values must still be computed serially, the x iterations can all be run in parallel using common parallel APIs.

If the f(i) function is expensive, this separation pays dividends. For example, assume the upper loop takes 98% of the run time and the second iteration only 2%. Using Amdahl’s law, the speed up for 7 cores is 100/(98/7 +2) = 6.25x.

So, when migrating optimized sequential code to parallel, I usually start with the low hanging fruit first- refactor the code to remove anti-dependencies. If lucky, this will make critical loops easy to parallelize. For remaining loops with true data dependencies, consider splitting the loops to isolate the code which must remain serial.

And do try to be respectful to your co-workers; you never know…

Skip

Algorithms in the real world

May 14th, 2009

With the release of Prism, we provide a recommended flow for migrating sequential code to parallel. One of the first things we suggest is to understand the intent and code structure of your application and make sure it’s well optimized.

No matter how well optimized though, when you go to parallelize, there are typically some fundamental data dependencies which will serialize parts of your application. The standard advice is to consider switching algorithms to one that is more parallel friendly. It sounds good, but many applications I’ve worked on, for example, h.264 video decode, have algorithms prescribed by standard.

I recently took a look at parallelizing the Dijkstra algorithm as part of a network routing application.

Dijkstra Algorithm - source: Wikipedia

The algorithm works to identify shortest paths in a network by following edges and keeping track of the minimum cost to get to each node. When starting from a source node, the algorithm maintains a priority queue of active nodes, choosing the minimum distance node as the next node to start form as the algorithm traverses the edges. This gives a worst case performance of:

O(|E| + |V|log|V|)

where |E| is the number of edges, and |V| is the number of vertices (nodes). The |E| part is easy to understand; all the reachable edges should be considered. The |V|log|V| part is trickier- that one depends on the priority queue implementation. Each time a node is added, the queue must be reprioritized. A binary heap implementation requires O(log|V|) swaps, and there are |V| nodes. That explains it.

But, I’m free to choose other implementations- a simple list would work but require O(|V|) searches to find the minimum distance node each time. A Fibonacci heap priority queue might even have better performance, statistically speaking, than a binary heap. That statistical part finally made me wake up. I realized that when thinking about changing algorithms, I’d always been thinking about going to something that was considered universally better, like the difference between a quicksort and a bubble sort. In practice, for Dijkstra, it appears that the Fibonacci queue often performs worse than the binary heap.

The best choice algorithm depends on the properties of the graph being analyzed. The real world can be so messy, but that makes room for good engineering and job security!

So imagine my surprise when I picked up the serial Dijkstra benchmark code inside MiBench and discovered it didn’t even bother to reprioritize the nodes in the queue, it just used the oldest nodes first in a simple queue. With a little thinking, you realize that if you reprioritize to always select the minimum distance node first, you are guaranteed not to have to revisit previous nodes. You lose that benefit if you just take the oldest node first, so the worst case performance appears to me to be really bad:

O(|E||V|)

because at each node you might have to backtrack over previously traversed edges. That sounds wasteful, but the real world may rescue us. If edges all have similar costs, then the algorithm will add nodes with the fewest node hops first, and that will greatly reduce the number of backtracked edges needed. With no reprioritization and no backtracking, the best case performance for this simpler implementation would become:

O(|E| + |V|)

Real world results will vary, but for typical routing graphs, I bet they are closer to O(|E| + |V|).

So I have a new found appreciation for ‘changing algorithms’. Will it help my parallelization? I’ll let you know.

- Skip

Skip

Under the Spousal Radar!

March 31st, 2009

I attended the first day of the Embedded Systems Conference in San Jose today. Previous conferences have included hands on tracks, and this year continues that theme. This conference’s track is called Build Your Own Embedded System (BYOSE) and uses the Beagle Board from Texas Instruments.

I picked up my Beagle Board at registration and headed off to the Beagle Board 101 class taught by Jason Kridner and Gerald Coley from TI. At least 120 engineers connected their boards into screens, keyboards, mice, and USB hubs, and shortly we were all running a basic Linux distribution.

TI describes the $149 price point and 3” by 3” size as “under the spousal radar”. True enough, although unfortunately it seems to have been easily detected by all my kids’ radars. They haven’t ratted me out yet, but it’s costing me. TI’s recommendations- if you’ve got less money but more time- this board is for you. If you have more money but less time, go with one of TI’s evaluation boards and a commercial Linux distribution.

The centerpiece of the board is TI’s OMAP 3430. It includes an ARM Cortex A8 application processor, an Imagination Technologies’ PowerVR SGX graphics accelerator, and a TI C64x DSP with a number of specialized accelerators. This is a great example of a highly integrated asymmetric multiprocessing (AMP) system. Developing applications for the ARM A8 is fairly straightforward using the GNU toolchain. With a little more work, the developer can gain access to the graphics engine and the DSP subsystem.

TI OMAP 3430

TI OMAP 3430

What most interests me is how this platform lets a programmer evaluate processing tradeoffs between the different processing engines and accelerators, including running and coordinating their parallel activities- something we all will be doing much more off in the future. One of the big things we do at CriticalBlue, with products like Prism and Cascade, is help engineers efficiently and easily evaluate these tradeoffs. The Beagle Board appears to be a very practical proving ground for different application design and programming approaches.

The best place to get started- beagleboard.org. It contains a wealth of information, an active community, and quite a few interesting projects.

TI recently announced the OMAP4 architecture which, among other enhancements, ups the application processor to a dual core Cortex A9MPCore- adding true symmetric multiprocessing (SMP) to the mix. When the time comes, I certainly hope TI offers a Beagleboard equivalent, but maybe by then they can figure out how to keep these multiprocessing boards under even my kids’ radars.

Skip

Thanks for the Memory

January 23rd, 2009

When retro-fitting a sequential application with threads, there are many potential problems however its always the side effects that seem to get you with unexpected bugs. This situation becomes worse when you are attempting to crowbar your application into the resource constrained environment of an embedded system.

Threads, here I’m talking about PThreads but it is true for most libraries, have overhead. There is design overhead, code , maintenance and runtime overheads and of course memory overhead which is the one you really have to lookout for.

PThreads are basically lightweight processes and so have to carry around a good deal of state with them and chief among these is the stack. In a typical implementation, the stack for each thread is allocated on the heap when the thread is created and is a fixed size. This sounds straightforward enough but can immediately cause several failure modes.

  • The per thread stack size X the number of threads is greater than the available heap space resulting in thread creation failure.
  • There is enough space on the heap for all of thread overhead but the next malloc call fails because there is no more room.
  • The per thread stack is too small for the amount of data that needs to be pushed onto it, leading to any number of obscure errors.

So what can you do about this? Fortunately PThreads provides a solution via its thread attributes API. If you are not familiar with attributes it is simply a structure for configuring a thread which is passed to pthread_create as the second parameter which is otherwise set to NULL, as below.


pthread_attr_t myAttr;
pthread_attr_init(&myAttr);
pthread_create(&aThread, &myAttr, (void *) myFunc, (void *) myParams);

Once the pthread_attr_t variable is initialized you can use it to specify the size of the stack you require (among other things). However it is not always that straightforward (is anything) and you will need to test if your PThreads library actually supports this variable stack size and then you need to test what the minimum allowable stack size is before you set it. This is handled via declarations for your system.


pthread_attr_t myAttr;
pthread_attr_init(&myAttr);
size_t myStackSize;
#ifdef _POSIX_THREAD_ATTR_STACKSIZE
pthread_attr_getstacksize(&myAttr, &myStackSize);
printf(“Default Stack is %d Bytes.\n”, myStackSize);
myStackSize = 16*1024; “// Try to set stack to 16K”
if (myStackSize >= PTHREAD_STACK_MIN)
pthread_attr_setstacksize(&myAttr, myStackSize);
else
printf(“PANIC! New Stack size too small!.\n”);
#else
printf(“PANIC! Cannot set stacksize.\n”);
#endif

So there you go. Each thread can now have a custom stack size tailored to its own needs, minimizing the total thread overhead of the system. Of course the problems don’t end there. Many modifications when parallelizing serial code, particularly when attempting to improve performance by adding thread local buffers, will further increase the memory footprint of your application and unexpected combinations of threads may result in memory usage spikes. So on behalf of future code maintainers everywhere please, please check the return value of malloc() and friends for failures!

Barry

Multicore Mobile Mainstream

January 12th, 2009

Believe it or not, there were a few rumors at this month’s Macworld Expo that didn’t speculate about Steve Jobs. One of the most interesting was posted by Jason O’Grady at ZDNet.  Jason reported rumors about a mysterious quad core iPhone and corresponding firmware 3.0 which would include support for the quad core. If true, this would be a significant milestone for multicore use on mainstream consumer platforms.

Of course, there are already embedded multicore devices in production today. Even though they’re in the market, a significant percentage of these devices are still running single core software. Though this allows independent applications to run in parallel, it doesn’t lower each application’s power profile or, conversely, allow each application greater throughput. To achieve these benefits, application software must be multicore ready.

In these challenging economic times, new platforms must disrupt markets. There is little profit in launching a quad core phone without stunning benefits to the end consumer, and that means applications must take full advantage of the multicore architecture. Likewise, getting the most out of existing multicore platforms can extend their product lifetime revenues, and software enhancements are the way to do it. Either way, the time to make applications multicore friendly is now.

For multithreaded platforms programmed in C, Pthreads is the natural choice. The Pthreads’ API is available on many operating systems, and the Pthreads’ library is especially well supported on Linux variants. Applications can be parallelized and debugged using pthreads, and they will still run correctly in a single core environment. An unexpected benefit of threading serial code is that it’s likely that some previously undiscovered bugs will be flushed out during the refactoring.

In some cases, OpenMP may be an attractive initial approach. OpenMP uses a set of compiler directives to describe parallelism available in the serial code. Maximum parallel performance may not be achieved in all cases. Offsetting that, the code can always be run serially by simply disabling the directives during compilation.

Even if not immediately releasing multithreaded applications, ensuring that libraries and drivers are multithread safe provides significant leverage going forward. When you do discover an application requiring the extra performance or lower power of multicore, having multicore-ready libraries and drivers will enable you to respond quickly with less learning curve and better debug isloation.

Signs are that 2009 is the year for multicore to impact the mainstream embedded market. Making existing software multicore ready will pay dividends to developers willing to make the investment now, even when times are tough.

Skip

The Push and Pull of Multicore

December 18th, 2008

Have you ever heard a software developer say ‘What I really need on my next project is a multicore platform to program..’? No, I didn’t think so. As we all probably know, the shift to multicore is driven by silicon platform vendors recognizing that such architectures are the only way they can meet end product performance and power consumption requirements. Enough has been written on that topic, and I don’t need to re-state the rationale behind this trend here. The marketing books tell you that for rapid adoption of a new product there needs to be a ‘pull’ from the end users into the market that is satisfied by a product which has a ‘push’ behind from what will become the dominant vendors in that new market. Clearly, in the transition to multicore, silicon vendors are already providing an almighty ‘push’ into the market. What interests me is seeking out and categorizing the market ‘pull’…

These days I spend a lot of time thinking about the software side of the evolution to multicore- mainly because of my work with the Multicore Association’s working group on Multicore Programming Practices (MPP), but also because of customer interactions concerning CriticalBlue products. So what can we say about market ‘pull’ in this domain? As far as end users are concerned, the consumers of multicore silicon platforms are systems companies and, more specifically, the software developers within those companies. I tend to categorize the multicore markets into three sectors, namely Server, Desktop and Embedded.

Coming from the chip design end of the market, I am most familiar with the embedded space; although I have recently been looking at the other two markets to see what can be learned. In particular, I am interested in how software development is changing with the widespread adoption of multicore platforms. In the server market, the most significant impact has been the massive growth of virtualization techniques and technologies, to the extent that some analysts even predict that the number of x86 based servers will drop in years to come. Why? The appeal is simple I think. Virtualization allows users to make more efficient and more flexible use of the platforms they have with – wait for it – no change to the software applications they are running. The server market also has the characterization of being mainly transaction based, ie. the application run time is not the most important factor, so any application performance overhead added by virtualization is not a big deal. In the desktop world, there is a lot of noise around tools and methodologies, but in equal measure there is also a lot of noise about how little of the actual multicore performance is being utilized by software running on desktops. Game applications and Adobe Photoshop are examples of a handful of consumer volume, compute intensive applications that could benefit from multiple processors; the majority of things we do on desktop platforms do not. Consequently, the widespread adoption of multicore platforms has not yet had a significant impact on the vast majority of software developers.

But what of the embedded market? This market is the newest to be touched by the multicore trend. Yes I know multicore isn’t new in embedded, but shared memory multicore is, and it is in this configuration that multicore architectures can, if they so desire, be exploited directly by software developers. Another observation is that moving from server, through desktop and into embedded means that software applications become more and more compute intensive and therefore need to run closer and closer to the actual hardware platform. In other words the performance implications of using different methods of exploiting multicore come into sharp focus.

Right now we are just beginning to see the reaction of the deeply embedded software developer to the multicore revolution. What we can learn by looking at the server and desktop markets is that software developers will not make a mad dash to multithreaded programming if there is an alternative that allows them to stick with what they have while meeting their performance/cost/power goals. This shouldn’t be a surprise. In the server market they found virtualization; in desktop they found that most applications run just fine and wouldn’t necessarily benefit from being multithreaded.

In the embedded market it may be too early to be certain what will happen. What we can say is that there are a lot of shared memory multicore platforms being put together, and there is a lot of interest in multithreading software techniques. It is also likely that application performance/power considerations will force the adoption of low overhead multicore software approaches.

However, as I have analyzed the differences and similarities between the way software developers have approached multicore in the three sectors, one observation about the server/desktop market jumps out at me and may serve as an important indicator how what will happen in embedded. And it is the software developer saying ‘I want the system benefits of multicore with the absolute minimum amount of change to my software and my development environment’. In the new world of multicore software development, that’s the real market ‘pull’. Ignore it at your peril.

David

Making the Most of the Multicore Crisis

November 18th, 2008

The multicore revolution is looking pretty bloody at the moment. Holy wars are erupting over whether message passing or shared memory is the best programming model; factions are pushing their favored language as the new standard or are resurrecting technologies from the 70s and, most of all, everyone is fretting about where these developers with parallel programming skills are going to come from to rewrite the entire software canon.

This makes it easy to lose sight of the huge opportunities now available to those who keep their heads and are ready to exploit the best technologies to come out of the melee.

The big win for multicore, particularly for embedded systems, is power savings. These come in a number of guises, and things are still moving quickly, but right now there are a number of approaches that can deliver based on the familer shared memory model which currently holds sway in the desktop market.

Consolidate

This is the simplest way to cut power and is already familiar from SoCs. By integrating a number of identical cores onto a single die rather than several packages on a PCB substantial power savings are being realized. This approach is proving popular with DSP vendors, such as TI and Qualcomm, many of whom are delivering dual to quad core versions of their most popular processors as will as DSP with general processor combinations. From a software perspective this maybe isn’t true multicore since there is little if any support or encouragement for synchronization between the cores and software silo based design continues to be the norm.

Reduce Clock Speed

High end embedded systems with operating systems, like Smart Phones and Mobile Internet Devices, often run on a monolithic processor clocked as fast as possible. The clock rate for the processor is typically much higher than is actually required for any one application in order to provide a responsive multitasking experience for the user. The processors for these systems are typically designed to run up to 1GHz which carries a substantial power penalty. By simply opting for the multicore version of these processors, such as ARM MPCore, and using and SMP OS the amount number of tasks sharing each core is reduced allowing the designer to reduce the clock frequency for these cores.

The trick here is that by lowering the required clock frequency for each core can allow the processor to be implemented with a low power IC process library. This allows power savings while retaining or even increasing the number of MIPS available to tasks in the OS. For example, on the ARM11MPCore web page, high performance and low power figures are quoted with the low power library timing to only 320 MHz. So by switching from a single core at 500 MHz @ 0.43 mW/MHz to a 300 MHz dual core with 0.23 mW/MHz you can achieve a 35% power reduction while gaining 100% L1 cache and 20% MIPS. If there is area budget available, it is an easy win.

Processor Downgrade

Following on from reducing the clock frequency, there is the further option of downgrading the actual core used. This can be a good option for more deeply embedded applications where the load is concentrated in a predictable, fixed set of applications which run on at a time and are amenable to multithreading. If evenly balanced each core will have a lower performance requirement so it can make since to switch to an architecture attains lower MIPS per MHz which typically also means lower mW per MHz.

Unfortunately, the mainstream IP vendors do yet not appear to be offering this downgrade path preferring to focus on multicore versions of their high end processors. However some system integrators are starting to offer multicore platforms based around microcontroller architectures which should start to fill this niche. If you are feeling brave it is also possible to role your own multicore systems based on the likes of the Cortex M3 as TI presented at ARM DevCon.

Opportunity

So are things all that bad ? There may be a multicore crisis but processor vendors are already rolling out practical, easy to use, shared memory systems based around existing architectures allow system designers to access the benefits of parallelism with surprisingly little effort. Since these platforms support the thread model of parallel programming, software engineers who invest the time in reading up on the subject can only benefit.

Barry

Keeping it Consistent

November 7th, 2008

As we dive deeper into this brave new world of the multicore programmer, it surprises me how many new and interesting ways there are to introduce subtle bugs into multithreaded applications. No wonder such programming is widely considered to be difficult stuff, not yet fully emerged from the shadows of a black art. One such set of subtleties are memory consistency models and, more precisely, how they can screw you up.

Something that has always impressed me about the last 30 years of microarchitecture innovation has been the insulation between the software binary and hardware worlds. Processors have undergone enormous change, from simple creatures where every transistor was precious to hugely complex speculative, out-of-order, multi-issue beasts where hundreds of instructions are thrown into the mix and the results land again miraculously in the right order. All this complexity has been insulated from the programmer; they’ve been happy enough ingesting that free lunch courtesy of Moore, safe in the knowledge that their unmodified software will just get faster in the future. Unfortunately in the multicore world, free lunch is over, and the hardware abstraction can now cause a little indigestion. An obscure hardware topic like memory consistency models suddenly has relevance to the application programmer.

So what is a memory consistency model? Well it’s a complicated subject, but essentially it describes the order in which data gets read and written to memory. Let’s consider a simple example on a single core:


int *x, y, z;

x = &y;
y = 1;
z = *x;

You would quite reasonably expect, despite all the optimization and trickery in the compiler and processor, that z would end up with the value 1. In reality, all sorts of things might be going on in the internals of processor and cache, but it will always ensure that it looks like things happen in a sequential order, even if the system has to work very hard underneath to give that impression.

When it comes to multicore, however, retaining that pretense becomes a whole lot harder. Consider a similar example split over two threads:


int ready = 0, data;
void thread1() {
  data = 42;
  ready = 1;
}
void thread2() {
  int answer;
  while (!ready)
    <do something useful>;
  answer = data;
}

Sure, this isn’t the greatest example of multithreaded coding ever, thread2 hangs about waiting for the data to be ready in a busy wait loop. Thread1 makes the data available in memory and then afterwards marks its presence with the ready flag. When thread2 detects ready is non-zero, it copies data into answer. So it will get the answer 42 right? Wrong! Well to be precise, it almost always will, but there is a very small chance that it won’t.  To me that sounds worse than it not working at all.

So what can possibly be wrong with this? If in thread1 you read back data immediately after you had set the ready flag, you would always get 42. Indeed, if thread2 happened to run on the same CPU as thread1, it would always get 42 as well. The problem comes when thread2 is on a different CPU. Even if a CPU sees that its own stores happen in sequential order, that does not guarantee that another CPU will see them in the same order. It depends on the type of consistency model used, but many multicore systems use a relaxed consistency model where this type of mind bending relativism is all too possible. They have to, or else their cache coherence would take too long. The problem is that accessing data might cause a second level cache miss or go to a different bank of DRAM than ready. In modern pipelined bus systems different bus transactions can overtake each other and get themselves out of order.

The real issue is that the chances of this actually going wrong are very small. It’s much more subtle than many other, already pretty subtle, threading issues. The two threads must be on different cores, and the memory layout, cache contents and transaction ordering must be lined up in just the wrong way. You can be sure though that when things do go wrong this is a very nasty issue to debug and find.

So what’s the solution? Well you may have spotted that the access to ready and data between the two threads might be considered to be a data race. Tools can be used to find such data races and point them out to the programmer. The way you would fix the data race would be to introduce a mutex-lock around the accesses rather than using a DIY spin-lock like ready. The reason this works is that mutexes do slightly more than just lock and unlock. Normally the implementation of a mutex in the threads library also executes a memory barrier instruction. In short, this is special instruction that tells the CPU to make its memory ordering sequential with respect to other CPUs. So this side effect of making a mutex call actually causes a serialization of all memory state and ensures that everything previously written by any core is fully committed so that the executing CPU can properly see it.

Presuming you already use them, the problem with data race detection tools is that they can generate false positives. You might look at this code and be sure you know what you are doing, and there might be good reasons for writing the code in this way. Just beware though, sequential ordering isn’t quite as ordered as it used to be…

Richard Taylor, CTO

Richard

Threads, it’s easy to overlook the dependencies

October 27th, 2008

At CriticalBlue, we often hear that thread based programming is hard and that it requires us all to ‘think parallel’. Conceptually though, I don’t think the fundamentals of thread programming are all that hard to grasp. In pthreads, we create threads so functions can execute in parallel; we use mutexes to coordinate mutual exclusion through critical regions, and we use condition variables to wait until conditions are right before proceeding. Seems simple enough.

Rather, what really is hard is to constrain a parallel program such that all possible interleavings of program execution are correct. While it’s easy to create threads and run them in parallel, it’s hard to catch all the data dependencies between the threads which must be properly ordered to ensure always race free execution.

I recently proved once again just how easy it is to overlook a critical dependency. I was preparing some examples for a talk I was giving at the ARM Developers’ Conference this October. I was extending a simple edge detection example to show different ways the image pipeline could be parallelized. I’d previously parallelized just the Sobel function using a row-based data decomposition strategy. No problem here I thought, all I need to do is replace the Sobel slice function with one processing a slice through the correction, noise filter and sobel functions. I ran the program, and I ended up with a pretty good looking result.

I say pretty good, because there was one line of pixels which looked a little off, but that was probably just an artifact, right? Wrong. That artifact saved me a lot of embarrassment, because, I’d blindly introduced data races into my code.

When you think it through, it’s easy to see that a filter slice can’t start processing until the correcting slice has completed and the edges of it’s neighbors have completed processing. Within a slice, the processing is sequential, but at the edges, there’s no guarantee that the edges of one slice will be corrected before another slice needs them. A simple but slightly overconstrained solution is to ensure that all correction processing completes before any filtering processing occurs, and, likewise, that all filtering completes before the Sobel slices begin.

The lesson learned is that it is very easy to overlook a critical dependency. I was lucky that my visual result gave me a tiny hint that something was wrong. It could have very easily passed all my tests, and I could have had an observant conference attendee put me in my place.

Simple though it was, this incident gave me a real appreciation for the importance of good tools, in this case, not so much for doing my parallel thinking for me, but for checking that what I did wasn’t doing anything wrong, and when there are mistakes, helping me understand what they are and ensuring that I’ve corrected them. I’ve added data race detection tools to my list of must haves for parallel programming.

Alas, there are still plenty of other ways I routinely embarrass myself…

- Skip Hovsmith

Skip