Keeping it Consistent
As we dive deeper into this brave new world of the multicore programmer, it surprises me how many new and interesting ways there are to introduce subtle bugs into multithreaded applications. No wonder such programming is widely considered to be difficult stuff, not yet fully emerged from the shadows of a black art. One such set of subtleties are memory consistency models and, more precisely, how they can screw you up.
Something that has always impressed me about the last 30 years of microarchitecture innovation has been the insulation between the software binary and hardware worlds. Processors have undergone enormous change, from simple creatures where every transistor was precious to hugely complex speculative, out-of-order, multi-issue beasts where hundreds of instructions are thrown into the mix and the results land again miraculously in the right order. All this complexity has been insulated from the programmer; they’ve been happy enough ingesting that free lunch courtesy of Moore, safe in the knowledge that their unmodified software will just get faster in the future. Unfortunately in the multicore world, free lunch is over, and the hardware abstraction can now cause a little indigestion. An obscure hardware topic like memory consistency models suddenly has relevance to the application programmer.
So what is a memory consistency model? Well it’s a complicated subject, but essentially it describes the order in which data gets read and written to memory. Let’s consider a simple example on a single core:
int *x, y, z;
…
x = &y;
y = 1;
z = *x;
You would quite reasonably expect, despite all the optimization and trickery in the compiler and processor, that z would end up with the value 1. In reality, all sorts of things might be going on in the internals of processor and cache, but it will always ensure that it looks like things happen in a sequential order, even if the system has to work very hard underneath to give that impression.
When it comes to multicore, however, retaining that pretense becomes a whole lot harder. Consider a similar example split over two threads:
int ready = 0, data;
void thread1() {
data = 42;
ready = 1;
}
void thread2() {
int answer;
while (!ready)
<do something useful>;
answer = data;
}
Sure, this isn’t the greatest example of multithreaded coding ever, thread2 hangs about waiting for the data to be ready in a busy wait loop. Thread1 makes the data available in memory and then afterwards marks its presence with the ready flag. When thread2 detects ready is non-zero, it copies data into answer. So it will get the answer 42 right? Wrong! Well to be precise, it almost always will, but there is a very small chance that it won’t. To me that sounds worse than it not working at all.
So what can possibly be wrong with this? If in thread1 you read back data immediately after you had set the ready flag, you would always get 42. Indeed, if thread2 happened to run on the same CPU as thread1, it would always get 42 as well. The problem comes when thread2 is on a different CPU. Even if a CPU sees that its own stores happen in sequential order, that does not guarantee that another CPU will see them in the same order. It depends on the type of consistency model used, but many multicore systems use a relaxed consistency model where this type of mind bending relativism is all too possible. They have to, or else their cache coherence would take too long. The problem is that accessing data might cause a second level cache miss or go to a different bank of DRAM than ready. In modern pipelined bus systems different bus transactions can overtake each other and get themselves out of order.
The real issue is that the chances of this actually going wrong are very small. It’s much more subtle than many other, already pretty subtle, threading issues. The two threads must be on different cores, and the memory layout, cache contents and transaction ordering must be lined up in just the wrong way. You can be sure though that when things do go wrong this is a very nasty issue to debug and find.
So what’s the solution? Well you may have spotted that the access to ready and data between the two threads might be considered to be a data race. Tools can be used to find such data races and point them out to the programmer. The way you would fix the data race would be to introduce a mutex-lock around the accesses rather than using a DIY spin-lock like ready. The reason this works is that mutexes do slightly more than just lock and unlock. Normally the implementation of a mutex in the threads library also executes a memory barrier instruction. In short, this is special instruction that tells the CPU to make its memory ordering sequential with respect to other CPUs. So this side effect of making a mutex call actually causes a serialization of all memory state and ensures that everything previously written by any core is fully committed so that the executing CPU can properly see it.
Presuming you already use them, the problem with data race detection tools is that they can generate false positives. You might look at this code and be sure you know what you are doing, and there might be good reasons for writing the code in this way. Just beware though, sequential ordering isn’t quite as ordered as it used to be…
Richard Taylor, CTO