Archive

Archive for the ‘Skip’ Category

Stay focused

September 1st, 2009

I’m teaching two classes at ESC Boston in September:

[ESC-442] Practical Migration of Sequential C/C++ Code to Multi-core Systems

[ESC-322] Coprocessing and Multiprocessing Techniques to Accelerate Software

and I have been busy working up examples and polishing slides.

Lots of people suggest that parallel programming is hard, so I thought I would share a few of my lessons learned during the run up to ESC Boston. Of course, some parallel programming lessons are subtle and hard won, but then there are others which remind you that no matter how complex things get, some mistakes are just plain dumb. Might as well start there…

I was reworking an image processing example to demonstrate a loop fusion technique which makes it more efficient to parallelize a processing pipeline. Part of the algorithm uses a Sobel filter, part of which includes:

x_sum  = 1 * in_pixels[rn * ncols + cw];
x_sum += 2 * in_pixels[r  * ncols + cw];
x_sum += 1 * in_pixels[rs * ncols + cw];
x_sum -= 1 * in_pixels[rn * ncols + ce];
x_sum -= 2 * in_pixels[r  * ncols + ce];
x_sum -= 1 * in_pixels[rs * ncols + ce];

This works fine (I’ve used it before), but as I was refactoring the algorithm and adding threading, I decided that the computation might look a wee bit clearer if I stopped repeating the x_sum assignment. The quick code modification looks like this:

x_sum  = 1 * in_pixels[rn * ncols + cw];
       + 2 * in_pixels[r  * ncols + cw];
       + 1 * in_pixels[rs * ncols + cw];
       - 1 * in_pixels[rn * ncols + ce];
       - 2 * in_pixels[r  * ncols + ce];
       - 1 * in_pixels[rs * ncols + ce];

I just couldn’t leave well enough alone. I finished up all the refactoring, and pretty quickly I got a clean compile. Great!

But of course, when I ran it, I got nothing close to the test image I was expecting. Well that’s not too unexpected- I figured I’d need to do a little debugging- I was fusing loops and I had to lag some computations in buffers to make everything pipe out correctly- it’s pretty easy to make an error in there somewhere.

You probably know the rest. After adding a fair amount of debug scaffolding to examine what was happening, I concluded that I had done all the refactoring and fusing correctly. I finally narrowed the problem down to the Sobel, and it took a lot of head scratching to realize that for the x_sum computation, I’d removed ‘=‘ assignments, but I had failed to remove the corresponding ‘;‘ at all but the last line. No complaints from my compiler; it’s perfectly legal though not terribly useful C code.

So today’s important parallel programming lesson- stay focused. As you migrate from sequential to parallel code, change the minimum necessary to move from one stable program to another. Resist the urge to clean up a few incidentals along the way- some people think parallel programming is hard enough without making it more so.

- Skip

BTW: for GCC, anybody know which compiler switches to add (or remove) to warn about a statement with no lvalue and no side effects?

Skip

If you can’t get rid of them, at least isolate them

May 27th, 2009

After reading the title above, did you think I would be talking about your co-workers?

Dependency analysis is a big part of the Prism parallel migration flow. Data dependencies are characterized as true data (read-after-write), anti (write-after-read) and output (write-after-write) dependencies.

Data Dependency Types

Dependency relationships between threads limit the amount of parallelism that can be achieved. They must either be removed or properly synchronized. Improperly handled dependencies show up as data races during execution.

Anti-Dependencies

Anti-dependencies are a common artifact of optimized sequential code when data storage is reused to save space. Loops are a common source of these anti-dependencies. The good news is that, once identified, these dependencies can be removed by duplicating or localizing variables, often making the loops embarrassingly parallel. Consider this made up sequential code block:

int i;
x_t x;
y_t y[N];

for (i = 0; i < N; ++i) {
  x = f(i);
  y[i] = g(i, x);
}

The x variable is shared by all iterations. In sequential code, this works fine because the iterations occur one after another and do not overlap. Unfortunately, the shared x variable causes problems when trying to run iterations in parallel. There is an anti-dependency between iterations where one iteration cannot write a new x value until after another iteration has read the existing x value.

This is easy to fix. If the x declaration is moved inside each iteration, each iteration maintains its own local x variable:

int i;
y_t y[N];

for (i = 0; i < N; ++i) {
  x_t x;
  x = f(i);
  y[i] = g(i, x);
}

In this case, there are no more dependencies between iterations, and the iterations can be run in parallel with each other. This code is now easy to parallelize using any typical parallel API, such as Pthreads or OpenMP.

True Data Dependencies

True data dependencies are usually fundamental to the algorithm being implemented. A value cannot be used (read) until after it has been computed (written). These dependencies are usually impossible to remove without changing algorithms.

Despite this, their impact can be minimized. Consider this made up sequential code block:

int i;
x_t x[N];
y_t y;

y = y_init();
for (i = 0; i < N; ++i) {
  x[i] = f(i);
  y = g(y, x[i]);
}

To run properly, the y values must computed in order. (Finicky readers will note the implicit assumption that the g(a, b) function is not associative.) The next iteration’s initial y value cannot be read until the previous iteration has written it. This is a true data dependency.

Though it cannot be removed, the dependency can be isolated. Consider this code refactoring:

int i;
x_t x[N];
y_t y;

for (i = 0; i < N; ++i) {
  x[i] = f(i);
}

y = y_init();
for (i = 0; i < N; ++i) {
  y = g(y, x[i]);
}

Here the x and y computations have been separated. Though the y values must still be computed serially, the x iterations can all be run in parallel using common parallel APIs.

If the f(i) function is expensive, this separation pays dividends. For example, assume the upper loop takes 98% of the run time and the second iteration only 2%. Using Amdahl’s law, the speed up for 7 cores is 100/(98/7 +2) = 6.25x.

So, when migrating optimized sequential code to parallel, I usually start with the low hanging fruit first- refactor the code to remove anti-dependencies. If lucky, this will make critical loops easy to parallelize. For remaining loops with true data dependencies, consider splitting the loops to isolate the code which must remain serial.

And do try to be respectful to your co-workers; you never know…

Skip

Algorithms in the real world

May 14th, 2009

With the release of Prism, we provide a recommended flow for migrating sequential code to parallel. One of the first things we suggest is to understand the intent and code structure of your application and make sure it’s well optimized.

No matter how well optimized though, when you go to parallelize, there are typically some fundamental data dependencies which will serialize parts of your application. The standard advice is to consider switching algorithms to one that is more parallel friendly. It sounds good, but many applications I’ve worked on, for example, h.264 video decode, have algorithms prescribed by standard.

I recently took a look at parallelizing the Dijkstra algorithm as part of a network routing application.

Dijkstra Algorithm - source: Wikipedia

The algorithm works to identify shortest paths in a network by following edges and keeping track of the minimum cost to get to each node. When starting from a source node, the algorithm maintains a priority queue of active nodes, choosing the minimum distance node as the next node to start form as the algorithm traverses the edges. This gives a worst case performance of:

O(|E| + |V|log|V|)

where |E| is the number of edges, and |V| is the number of vertices (nodes). The |E| part is easy to understand; all the reachable edges should be considered. The |V|log|V| part is trickier- that one depends on the priority queue implementation. Each time a node is added, the queue must be reprioritized. A binary heap implementation requires O(log|V|) swaps, and there are |V| nodes. That explains it.

But, I’m free to choose other implementations- a simple list would work but require O(|V|) searches to find the minimum distance node each time. A Fibonacci heap priority queue might even have better performance, statistically speaking, than a binary heap. That statistical part finally made me wake up. I realized that when thinking about changing algorithms, I’d always been thinking about going to something that was considered universally better, like the difference between a quicksort and a bubble sort. In practice, for Dijkstra, it appears that the Fibonacci queue often performs worse than the binary heap.

The best choice algorithm depends on the properties of the graph being analyzed. The real world can be so messy, but that makes room for good engineering and job security!

So imagine my surprise when I picked up the serial Dijkstra benchmark code inside MiBench and discovered it didn’t even bother to reprioritize the nodes in the queue, it just used the oldest nodes first in a simple queue. With a little thinking, you realize that if you reprioritize to always select the minimum distance node first, you are guaranteed not to have to revisit previous nodes. You lose that benefit if you just take the oldest node first, so the worst case performance appears to me to be really bad:

O(|E||V|)

because at each node you might have to backtrack over previously traversed edges. That sounds wasteful, but the real world may rescue us. If edges all have similar costs, then the algorithm will add nodes with the fewest node hops first, and that will greatly reduce the number of backtracked edges needed. With no reprioritization and no backtracking, the best case performance for this simpler implementation would become:

O(|E| + |V|)

Real world results will vary, but for typical routing graphs, I bet they are closer to O(|E| + |V|).

So I have a new found appreciation for ‘changing algorithms’. Will it help my parallelization? I’ll let you know.

- Skip

Skip

Under the Spousal Radar!

March 31st, 2009

I attended the first day of the Embedded Systems Conference in San Jose today. Previous conferences have included hands on tracks, and this year continues that theme. This conference’s track is called Build Your Own Embedded System (BYOSE) and uses the Beagle Board from Texas Instruments.

I picked up my Beagle Board at registration and headed off to the Beagle Board 101 class taught by Jason Kridner and Gerald Coley from TI. At least 120 engineers connected their boards into screens, keyboards, mice, and USB hubs, and shortly we were all running a basic Linux distribution.

TI describes the $149 price point and 3” by 3” size as “under the spousal radar”. True enough, although unfortunately it seems to have been easily detected by all my kids’ radars. They haven’t ratted me out yet, but it’s costing me. TI’s recommendations- if you’ve got less money but more time- this board is for you. If you have more money but less time, go with one of TI’s evaluation boards and a commercial Linux distribution.

The centerpiece of the board is TI’s OMAP 3430. It includes an ARM Cortex A8 application processor, an Imagination Technologies’ PowerVR SGX graphics accelerator, and a TI C64x DSP with a number of specialized accelerators. This is a great example of a highly integrated asymmetric multiprocessing (AMP) system. Developing applications for the ARM A8 is fairly straightforward using the GNU toolchain. With a little more work, the developer can gain access to the graphics engine and the DSP subsystem.

TI OMAP 3430

TI OMAP 3430

What most interests me is how this platform lets a programmer evaluate processing tradeoffs between the different processing engines and accelerators, including running and coordinating their parallel activities- something we all will be doing much more off in the future. One of the big things we do at CriticalBlue, with products like Prism and Cascade, is help engineers efficiently and easily evaluate these tradeoffs. The Beagle Board appears to be a very practical proving ground for different application design and programming approaches.

The best place to get started- beagleboard.org. It contains a wealth of information, an active community, and quite a few interesting projects.

TI recently announced the OMAP4 architecture which, among other enhancements, ups the application processor to a dual core Cortex A9MPCore- adding true symmetric multiprocessing (SMP) to the mix. When the time comes, I certainly hope TI offers a Beagleboard equivalent, but maybe by then they can figure out how to keep these multiprocessing boards under even my kids’ radars.

Skip

Multicore Mobile Mainstream

January 12th, 2009

Believe it or not, there were a few rumors at this month’s Macworld Expo that didn’t speculate about Steve Jobs. One of the most interesting was posted by Jason O’Grady at ZDNet.  Jason reported rumors about a mysterious quad core iPhone and corresponding firmware 3.0 which would include support for the quad core. If true, this would be a significant milestone for multicore use on mainstream consumer platforms.

Of course, there are already embedded multicore devices in production today. Even though they’re in the market, a significant percentage of these devices are still running single core software. Though this allows independent applications to run in parallel, it doesn’t lower each application’s power profile or, conversely, allow each application greater throughput. To achieve these benefits, application software must be multicore ready.

In these challenging economic times, new platforms must disrupt markets. There is little profit in launching a quad core phone without stunning benefits to the end consumer, and that means applications must take full advantage of the multicore architecture. Likewise, getting the most out of existing multicore platforms can extend their product lifetime revenues, and software enhancements are the way to do it. Either way, the time to make applications multicore friendly is now.

For multithreaded platforms programmed in C, Pthreads is the natural choice. The Pthreads’ API is available on many operating systems, and the Pthreads’ library is especially well supported on Linux variants. Applications can be parallelized and debugged using pthreads, and they will still run correctly in a single core environment. An unexpected benefit of threading serial code is that it’s likely that some previously undiscovered bugs will be flushed out during the refactoring.

In some cases, OpenMP may be an attractive initial approach. OpenMP uses a set of compiler directives to describe parallelism available in the serial code. Maximum parallel performance may not be achieved in all cases. Offsetting that, the code can always be run serially by simply disabling the directives during compilation.

Even if not immediately releasing multithreaded applications, ensuring that libraries and drivers are multithread safe provides significant leverage going forward. When you do discover an application requiring the extra performance or lower power of multicore, having multicore-ready libraries and drivers will enable you to respond quickly with less learning curve and better debug isloation.

Signs are that 2009 is the year for multicore to impact the mainstream embedded market. Making existing software multicore ready will pay dividends to developers willing to make the investment now, even when times are tough.

Skip

Threads, it’s easy to overlook the dependencies

October 27th, 2008

At CriticalBlue, we often hear that thread based programming is hard and that it requires us all to ‘think parallel’. Conceptually though, I don’t think the fundamentals of thread programming are all that hard to grasp. In pthreads, we create threads so functions can execute in parallel; we use mutexes to coordinate mutual exclusion through critical regions, and we use condition variables to wait until conditions are right before proceeding. Seems simple enough.

Rather, what really is hard is to constrain a parallel program such that all possible interleavings of program execution are correct. While it’s easy to create threads and run them in parallel, it’s hard to catch all the data dependencies between the threads which must be properly ordered to ensure always race free execution.

I recently proved once again just how easy it is to overlook a critical dependency. I was preparing some examples for a talk I was giving at the ARM Developers’ Conference this October. I was extending a simple edge detection example to show different ways the image pipeline could be parallelized. I’d previously parallelized just the Sobel function using a row-based data decomposition strategy. No problem here I thought, all I need to do is replace the Sobel slice function with one processing a slice through the correction, noise filter and sobel functions. I ran the program, and I ended up with a pretty good looking result.

I say pretty good, because there was one line of pixels which looked a little off, but that was probably just an artifact, right? Wrong. That artifact saved me a lot of embarrassment, because, I’d blindly introduced data races into my code.

When you think it through, it’s easy to see that a filter slice can’t start processing until the correcting slice has completed and the edges of it’s neighbors have completed processing. Within a slice, the processing is sequential, but at the edges, there’s no guarantee that the edges of one slice will be corrected before another slice needs them. A simple but slightly overconstrained solution is to ensure that all correction processing completes before any filtering processing occurs, and, likewise, that all filtering completes before the Sobel slices begin.

The lesson learned is that it is very easy to overlook a critical dependency. I was lucky that my visual result gave me a tiny hint that something was wrong. It could have very easily passed all my tests, and I could have had an observant conference attendee put me in my place.

Simple though it was, this incident gave me a real appreciation for the importance of good tools, in this case, not so much for doing my parallel thinking for me, but for checking that what I did wasn’t doing anything wrong, and when there are mistakes, helping me understand what they are and ensuring that I’ve corrected them. I’ve added data race detection tools to my list of must haves for parallel programming.

Alas, there are still plenty of other ways I routinely embarrass myself…

- Skip Hovsmith

Skip

Hello world!

September 30th, 2008

This blogs intent is to follow trends and share experiences regarding multicore programming. Stay tuned, we’re just getting started. The CriticalBlue team!

Skip