Archive

Archive for September, 2009

Tim Toady

September 30th, 2009

There’s more than one way to do it may be the Perl motto but it is can be fairly said about just about anything in general and computer science in particular.

When dealing with typical multi-threaded software (written in C or C++, with lots of pointers) tools have to rely on dynamic analysis in order to extract useful information about data dependencies and the inherent hazards these pose to the correct behavior of the application.

There are broadly 2 ways (but with uncounted variations) to collect this dynamic information:

  • Instrumentation of the application as it runs, typically on hardware.
  • Recording the state changes in a Virtual Machine which is running the application.

But how do these 2 approaches stack up against each other?

Instrumentation benefits from speed and convenience, in many cases user will be tracing applications directly on their workstation hardware or a development board which they alway run their apps on. This gives a reassuring sense of knowing that the captured behavior is from the actual target hardware that will be used by customers.

The problem is that this is a false sense of security. Instrumentation will alway disrupt the behavior of the program it is tracing. Typically the scheduling, core allocation and completion ordering of threads will be disrupted so anyone hoping to see exactly what will happen on the real hardware is going to be disappointed. It is this sort of interference that makes it notoriously difficult to debug multi-threaded applications.

Turning to Virtual Machines, you get the possibility of completely hands off tracing and the opportunity to target many different system configurations without cluttering up your desk with expensive dev boards.

However things are not that straightforward. The virtual platform is going to be slower than the actual hardware in most cases, especially when tracing in great detail, and it is a lot harder to extract the higher level scheduling information from the OS making it difficult to establish which thread is running on which core at what time. Few Virtual platform vendors have this capability as yet but many are adding it as demand for this sort of tracing grows.

So which approach is best ? As always it depends on what you are looking for.

Data dependency analysis does not depend much on the order of thread execution. If 2 threads ever access the same memory without synchronization then that is probably bad, so instrumentation is still very useful.

Since many programs will be running under an OS with any number of other active programs it is unlikely that you can ever expect to see a typical schedule anyway so if your application expects a certain schedule for correct execution without enforcing it then you deserve all the bugs you get.

Virtual Machines come into there own mostly in the deeply embedded space where many-core systems are often carved up into several virtualized multicore domains where the OS only runs a single application (telecoms is a typical example of this) and it pays to analyze the detailed runtime behavior to squeeze out every gram of performance.

So, in conclusion, I’m off to find a comfy fence to sit on…

Barry

It’s Just Not Fair

September 17th, 2009

Whilst debugging some pthread code recently I came across a somewhat unexpected and seemingly non-intuitive result regarding locks. Here is a simple pthread example program. The main creates a single child thread. This thead that works away incrementing the sGlobal counter until sContinue is cleared by main. The main simply checks the count and stops the child when the count exceeds 10.  Of course, the accesses to sGlobal are protected by a mutex to ensure that a partially updated result isn’t read and there are no nasty data races (okay, so an int can be read atomically on pretty much any machine but sGlobal could be a more complex object).

static int sGlobal;
static int sContinue;
static pthread_mutex_t sMutex;

static void* childthread(void* p)
{
  do
  {
    pthread_mutex_lock(&sMutex);
    sGlobal++;
    printf("Count = %d\n", sGlobal);
    pthread_mutex_unlock(&sMutex);
  }
  while (sContinue);
}

int main(int argc, char *argv[])
{
  pthread_t aThreadID;
  int aCount;

  pthread_mutex_init(&sMutex, NULL);
  sContinue = 1;
  sGlobal = 0;
  pthread_create(&aThreadID, NULL, childthread, NULL);
  do
  {
    pthread_mutex_lock(&sMutex);
    aCount = sGlobal;
    pthread_mutex_unlock(&sMutex);
  }
  while (aCount < 10);
  sContinue = 0;
}

So what would you expect the output to be? Maybe a count 0 to 10 and then the program exits?  I was expecting that main thread and the child thread to contend for the lock but ultimately for both of them to get pretty fair access to it. If the child thread has the lock but the main thread wants it I was expecting it to be next in line for acquiring it. Unfortunately there isn’t an orderly queue waiting for the lock, its a total free for all. As it turns out pthread mutexes provide absolutely no guarantee of fairness for lock accesses whatsoever. It doesn’t matter how long a thread has been waiting for a lock, another thread can push into the queue and get the lock again even after its just released it.

So what do you get when you run this. Well, it all depends. I ran this on a Linux Redhat 4 system with NPTL threads and the final count at termination were for successive runs was 641, 573, 229, 29, 25, 14791, 1708. Spot any pattern? No, neither can I. I also tried it on an ARM Debian Linux version with the older LinuxThreads implementation. Its still going. So this is rather random and very non-portable.

The child thread is hogging the lock. Even though it releases it, it grabs it again very quickly afterwards. Unless the process happens to get pre-empted just at the that point it will get the lock back again, even if the main thread has been waiting for ages. If most of the processing is done while the lock is held then there is very little opportunity for another thread to acquire the lock.

So why this rather rude free-for-all with locks? Its all in the name of efficiency. If orderely queues were maintained for each lock then every time a thread wants to acquire a lock it would need to be polite by checking if any other thread were waiting rather than just grabbing it. The queue itself would probably also need to be protected with locks since it could be accessed by any thread, and so it goes on. Its just simple and efficient to grab the lock if its available, no questions asked. Its up to the programmer to ensure fairness and synchronisation using other methods such as condition variables.

And what was the answer for ARM LinuxThreads? Well, its at 389000 and still counting. I suspect its never going to halt, but you never know (as Turing famously proved…).

Richard

Stay focused

September 1st, 2009

I’m teaching two classes at ESC Boston in September:

[ESC-442] Practical Migration of Sequential C/C++ Code to Multi-core Systems

[ESC-322] Coprocessing and Multiprocessing Techniques to Accelerate Software

and I have been busy working up examples and polishing slides.

Lots of people suggest that parallel programming is hard, so I thought I would share a few of my lessons learned during the run up to ESC Boston. Of course, some parallel programming lessons are subtle and hard won, but then there are others which remind you that no matter how complex things get, some mistakes are just plain dumb. Might as well start there…

I was reworking an image processing example to demonstrate a loop fusion technique which makes it more efficient to parallelize a processing pipeline. Part of the algorithm uses a Sobel filter, part of which includes:

x_sum  = 1 * in_pixels[rn * ncols + cw];
x_sum += 2 * in_pixels[r  * ncols + cw];
x_sum += 1 * in_pixels[rs * ncols + cw];
x_sum -= 1 * in_pixels[rn * ncols + ce];
x_sum -= 2 * in_pixels[r  * ncols + ce];
x_sum -= 1 * in_pixels[rs * ncols + ce];

This works fine (I’ve used it before), but as I was refactoring the algorithm and adding threading, I decided that the computation might look a wee bit clearer if I stopped repeating the x_sum assignment. The quick code modification looks like this:

x_sum  = 1 * in_pixels[rn * ncols + cw];
       + 2 * in_pixels[r  * ncols + cw];
       + 1 * in_pixels[rs * ncols + cw];
       - 1 * in_pixels[rn * ncols + ce];
       - 2 * in_pixels[r  * ncols + ce];
       - 1 * in_pixels[rs * ncols + ce];

I just couldn’t leave well enough alone. I finished up all the refactoring, and pretty quickly I got a clean compile. Great!

But of course, when I ran it, I got nothing close to the test image I was expecting. Well that’s not too unexpected- I figured I’d need to do a little debugging- I was fusing loops and I had to lag some computations in buffers to make everything pipe out correctly- it’s pretty easy to make an error in there somewhere.

You probably know the rest. After adding a fair amount of debug scaffolding to examine what was happening, I concluded that I had done all the refactoring and fusing correctly. I finally narrowed the problem down to the Sobel, and it took a lot of head scratching to realize that for the x_sum computation, I’d removed ‘=‘ assignments, but I had failed to remove the corresponding ‘;‘ at all but the last line. No complaints from my compiler; it’s perfectly legal though not terribly useful C code.

So today’s important parallel programming lesson- stay focused. As you migrate from sequential to parallel code, change the minimum necessary to move from one stable program to another. Resist the urge to clean up a few incidentals along the way- some people think parallel programming is hard enough without making it more so.

- Skip

BTW: for GCC, anybody know which compiler switches to add (or remove) to warn about a statement with no lvalue and no side effects?

Skip