Archive

Archive for the ‘Richard’ Category

It’s Just Not Fair

September 17th, 2009

Whilst debugging some pthread code recently I came across a somewhat unexpected and seemingly non-intuitive result regarding locks. Here is a simple pthread example program. The main creates a single child thread. This thead that works away incrementing the sGlobal counter until sContinue is cleared by main. The main simply checks the count and stops the child when the count exceeds 10.  Of course, the accesses to sGlobal are protected by a mutex to ensure that a partially updated result isn’t read and there are no nasty data races (okay, so an int can be read atomically on pretty much any machine but sGlobal could be a more complex object).

static int sGlobal;
static int sContinue;
static pthread_mutex_t sMutex;

static void* childthread(void* p)
{
  do
  {
    pthread_mutex_lock(&sMutex);
    sGlobal++;
    printf("Count = %d\n", sGlobal);
    pthread_mutex_unlock(&sMutex);
  }
  while (sContinue);
}

int main(int argc, char *argv[])
{
  pthread_t aThreadID;
  int aCount;

  pthread_mutex_init(&sMutex, NULL);
  sContinue = 1;
  sGlobal = 0;
  pthread_create(&aThreadID, NULL, childthread, NULL);
  do
  {
    pthread_mutex_lock(&sMutex);
    aCount = sGlobal;
    pthread_mutex_unlock(&sMutex);
  }
  while (aCount < 10);
  sContinue = 0;
}

So what would you expect the output to be? Maybe a count 0 to 10 and then the program exits?  I was expecting that main thread and the child thread to contend for the lock but ultimately for both of them to get pretty fair access to it. If the child thread has the lock but the main thread wants it I was expecting it to be next in line for acquiring it. Unfortunately there isn’t an orderly queue waiting for the lock, its a total free for all. As it turns out pthread mutexes provide absolutely no guarantee of fairness for lock accesses whatsoever. It doesn’t matter how long a thread has been waiting for a lock, another thread can push into the queue and get the lock again even after its just released it.

So what do you get when you run this. Well, it all depends. I ran this on a Linux Redhat 4 system with NPTL threads and the final count at termination were for successive runs was 641, 573, 229, 29, 25, 14791, 1708. Spot any pattern? No, neither can I. I also tried it on an ARM Debian Linux version with the older LinuxThreads implementation. Its still going. So this is rather random and very non-portable.

The child thread is hogging the lock. Even though it releases it, it grabs it again very quickly afterwards. Unless the process happens to get pre-empted just at the that point it will get the lock back again, even if the main thread has been waiting for ages. If most of the processing is done while the lock is held then there is very little opportunity for another thread to acquire the lock.

So why this rather rude free-for-all with locks? Its all in the name of efficiency. If orderely queues were maintained for each lock then every time a thread wants to acquire a lock it would need to be polite by checking if any other thread were waiting rather than just grabbing it. The queue itself would probably also need to be protected with locks since it could be accessed by any thread, and so it goes on. Its just simple and efficient to grab the lock if its available, no questions asked. Its up to the programmer to ensure fairness and synchronisation using other methods such as condition variables.

And what was the answer for ARM LinuxThreads? Well, its at 389000 and still counting. I suspect its never going to halt, but you never know (as Turing famously proved…).

Richard

Keeping it Consistent

November 7th, 2008

As we dive deeper into this brave new world of the multicore programmer, it surprises me how many new and interesting ways there are to introduce subtle bugs into multithreaded applications. No wonder such programming is widely considered to be difficult stuff, not yet fully emerged from the shadows of a black art. One such set of subtleties are memory consistency models and, more precisely, how they can screw you up.

Something that has always impressed me about the last 30 years of microarchitecture innovation has been the insulation between the software binary and hardware worlds. Processors have undergone enormous change, from simple creatures where every transistor was precious to hugely complex speculative, out-of-order, multi-issue beasts where hundreds of instructions are thrown into the mix and the results land again miraculously in the right order. All this complexity has been insulated from the programmer; they’ve been happy enough ingesting that free lunch courtesy of Moore, safe in the knowledge that their unmodified software will just get faster in the future. Unfortunately in the multicore world, free lunch is over, and the hardware abstraction can now cause a little indigestion. An obscure hardware topic like memory consistency models suddenly has relevance to the application programmer.

So what is a memory consistency model? Well it’s a complicated subject, but essentially it describes the order in which data gets read and written to memory. Let’s consider a simple example on a single core:


int *x, y, z;

x = &y;
y = 1;
z = *x;

You would quite reasonably expect, despite all the optimization and trickery in the compiler and processor, that z would end up with the value 1. In reality, all sorts of things might be going on in the internals of processor and cache, but it will always ensure that it looks like things happen in a sequential order, even if the system has to work very hard underneath to give that impression.

When it comes to multicore, however, retaining that pretense becomes a whole lot harder. Consider a similar example split over two threads:


int ready = 0, data;
void thread1() {
  data = 42;
  ready = 1;
}
void thread2() {
  int answer;
  while (!ready)
    <do something useful>;
  answer = data;
}

Sure, this isn’t the greatest example of multithreaded coding ever, thread2 hangs about waiting for the data to be ready in a busy wait loop. Thread1 makes the data available in memory and then afterwards marks its presence with the ready flag. When thread2 detects ready is non-zero, it copies data into answer. So it will get the answer 42 right? Wrong! Well to be precise, it almost always will, but there is a very small chance that it won’t.  To me that sounds worse than it not working at all.

So what can possibly be wrong with this? If in thread1 you read back data immediately after you had set the ready flag, you would always get 42. Indeed, if thread2 happened to run on the same CPU as thread1, it would always get 42 as well. The problem comes when thread2 is on a different CPU. Even if a CPU sees that its own stores happen in sequential order, that does not guarantee that another CPU will see them in the same order. It depends on the type of consistency model used, but many multicore systems use a relaxed consistency model where this type of mind bending relativism is all too possible. They have to, or else their cache coherence would take too long. The problem is that accessing data might cause a second level cache miss or go to a different bank of DRAM than ready. In modern pipelined bus systems different bus transactions can overtake each other and get themselves out of order.

The real issue is that the chances of this actually going wrong are very small. It’s much more subtle than many other, already pretty subtle, threading issues. The two threads must be on different cores, and the memory layout, cache contents and transaction ordering must be lined up in just the wrong way. You can be sure though that when things do go wrong this is a very nasty issue to debug and find.

So what’s the solution? Well you may have spotted that the access to ready and data between the two threads might be considered to be a data race. Tools can be used to find such data races and point them out to the programmer. The way you would fix the data race would be to introduce a mutex-lock around the accesses rather than using a DIY spin-lock like ready. The reason this works is that mutexes do slightly more than just lock and unlock. Normally the implementation of a mutex in the threads library also executes a memory barrier instruction. In short, this is special instruction that tells the CPU to make its memory ordering sequential with respect to other CPUs. So this side effect of making a mutex call actually causes a serialization of all memory state and ensures that everything previously written by any core is fully committed so that the executing CPU can properly see it.

Presuming you already use them, the problem with data race detection tools is that they can generate false positives. You might look at this code and be sure you know what you are doing, and there might be good reasons for writing the code in this way. Just beware though, sequential ordering isn’t quite as ordered as it used to be…

Richard Taylor, CTO

Richard