Archive

Archive for May, 2009

If you can’t get rid of them, at least isolate them

May 27th, 2009

After reading the title above, did you think I would be talking about your co-workers?

Dependency analysis is a big part of the Prism parallel migration flow. Data dependencies are characterized as true data (read-after-write), anti (write-after-read) and output (write-after-write) dependencies.

Data Dependency Types

Dependency relationships between threads limit the amount of parallelism that can be achieved. They must either be removed or properly synchronized. Improperly handled dependencies show up as data races during execution.

Anti-Dependencies

Anti-dependencies are a common artifact of optimized sequential code when data storage is reused to save space. Loops are a common source of these anti-dependencies. The good news is that, once identified, these dependencies can be removed by duplicating or localizing variables, often making the loops embarrassingly parallel. Consider this made up sequential code block:

int i;
x_t x;
y_t y[N];

for (i = 0; i < N; ++i) {
  x = f(i);
  y[i] = g(i, x);
}

The x variable is shared by all iterations. In sequential code, this works fine because the iterations occur one after another and do not overlap. Unfortunately, the shared x variable causes problems when trying to run iterations in parallel. There is an anti-dependency between iterations where one iteration cannot write a new x value until after another iteration has read the existing x value.

This is easy to fix. If the x declaration is moved inside each iteration, each iteration maintains its own local x variable:

int i;
y_t y[N];

for (i = 0; i < N; ++i) {
  x_t x;
  x = f(i);
  y[i] = g(i, x);
}

In this case, there are no more dependencies between iterations, and the iterations can be run in parallel with each other. This code is now easy to parallelize using any typical parallel API, such as Pthreads or OpenMP.

True Data Dependencies

True data dependencies are usually fundamental to the algorithm being implemented. A value cannot be used (read) until after it has been computed (written). These dependencies are usually impossible to remove without changing algorithms.

Despite this, their impact can be minimized. Consider this made up sequential code block:

int i;
x_t x[N];
y_t y;

y = y_init();
for (i = 0; i < N; ++i) {
  x[i] = f(i);
  y = g(y, x[i]);
}

To run properly, the y values must computed in order. (Finicky readers will note the implicit assumption that the g(a, b) function is not associative.) The next iteration’s initial y value cannot be read until the previous iteration has written it. This is a true data dependency.

Though it cannot be removed, the dependency can be isolated. Consider this code refactoring:

int i;
x_t x[N];
y_t y;

for (i = 0; i < N; ++i) {
  x[i] = f(i);
}

y = y_init();
for (i = 0; i < N; ++i) {
  y = g(y, x[i]);
}

Here the x and y computations have been separated. Though the y values must still be computed serially, the x iterations can all be run in parallel using common parallel APIs.

If the f(i) function is expensive, this separation pays dividends. For example, assume the upper loop takes 98% of the run time and the second iteration only 2%. Using Amdahl’s law, the speed up for 7 cores is 100/(98/7 +2) = 6.25x.

So, when migrating optimized sequential code to parallel, I usually start with the low hanging fruit first- refactor the code to remove anti-dependencies. If lucky, this will make critical loops easy to parallelize. For remaining loops with true data dependencies, consider splitting the loops to isolate the code which must remain serial.

And do try to be respectful to your co-workers; you never know…

Skip

Algorithms in the real world

May 14th, 2009

With the release of Prism, we provide a recommended flow for migrating sequential code to parallel. One of the first things we suggest is to understand the intent and code structure of your application and make sure it’s well optimized.

No matter how well optimized though, when you go to parallelize, there are typically some fundamental data dependencies which will serialize parts of your application. The standard advice is to consider switching algorithms to one that is more parallel friendly. It sounds good, but many applications I’ve worked on, for example, h.264 video decode, have algorithms prescribed by standard.

I recently took a look at parallelizing the Dijkstra algorithm as part of a network routing application.

Dijkstra Algorithm - source: Wikipedia

The algorithm works to identify shortest paths in a network by following edges and keeping track of the minimum cost to get to each node. When starting from a source node, the algorithm maintains a priority queue of active nodes, choosing the minimum distance node as the next node to start form as the algorithm traverses the edges. This gives a worst case performance of:

O(|E| + |V|log|V|)

where |E| is the number of edges, and |V| is the number of vertices (nodes). The |E| part is easy to understand; all the reachable edges should be considered. The |V|log|V| part is trickier- that one depends on the priority queue implementation. Each time a node is added, the queue must be reprioritized. A binary heap implementation requires O(log|V|) swaps, and there are |V| nodes. That explains it.

But, I’m free to choose other implementations- a simple list would work but require O(|V|) searches to find the minimum distance node each time. A Fibonacci heap priority queue might even have better performance, statistically speaking, than a binary heap. That statistical part finally made me wake up. I realized that when thinking about changing algorithms, I’d always been thinking about going to something that was considered universally better, like the difference between a quicksort and a bubble sort. In practice, for Dijkstra, it appears that the Fibonacci queue often performs worse than the binary heap.

The best choice algorithm depends on the properties of the graph being analyzed. The real world can be so messy, but that makes room for good engineering and job security!

So imagine my surprise when I picked up the serial Dijkstra benchmark code inside MiBench and discovered it didn’t even bother to reprioritize the nodes in the queue, it just used the oldest nodes first in a simple queue. With a little thinking, you realize that if you reprioritize to always select the minimum distance node first, you are guaranteed not to have to revisit previous nodes. You lose that benefit if you just take the oldest node first, so the worst case performance appears to me to be really bad:

O(|E||V|)

because at each node you might have to backtrack over previously traversed edges. That sounds wasteful, but the real world may rescue us. If edges all have similar costs, then the algorithm will add nodes with the fewest node hops first, and that will greatly reduce the number of backtracked edges needed. With no reprioritization and no backtracking, the best case performance for this simpler implementation would become:

O(|E| + |V|)

Real world results will vary, but for typical routing graphs, I bet they are closer to O(|E| + |V|).

So I have a new found appreciation for ‘changing algorithms’. Will it help my parallelization? I’ll let you know.

- Skip

Skip