Archive for November, 2008

Making the Most of the Multicore Crisis

November 18th, 2008

The multicore revolution is looking pretty bloody at the moment. Holy wars are erupting over whether message passing or shared memory is the best programming model. Factions are pushing their favored language as the new standard or resurrecting technologies from the 70s. Most of all, everyone is fretting about where the developers with parallel programming skills are going to come from to rewrite the entire software canon.

This makes it easy to lose sight of the huge opportunities now available to those who keep their heads and are ready to exploit the best technologies to come out of the melee.

The big win for multicore, particularly for embedded systems, is power savings. These come in a number of guises, and things are still moving quickly, but right now there are several approaches that can deliver, all based on the familiar shared memory model that currently holds sway in the desktop market.

Consolidate

This is the simplest way to cut power and is already familiar from SoCs. By integrating a number of identical cores onto a single die, rather than placing several packages on a PCB, substantial power savings are being realized. This approach is proving popular with DSP vendors such as TI and Qualcomm, many of whom are delivering dual to quad core versions of their most popular processors as well as combinations of a DSP with a general purpose processor. From a software perspective this perhaps isn’t true multicore, since there is little if any support or encouragement for synchronization between the cores, and software silo based design continues to be the norm.

Reduce Clock Speed

High end embedded systems with operating systems, like smartphones and Mobile Internet Devices, often run on a single monolithic processor clocked as fast as possible. To provide a responsive multitasking experience for the user, the clock rate is typically much higher than any one application actually requires. The processors for these systems are typically designed to run at up to 1 GHz, which carries a substantial power penalty. By simply opting for the multicore version of these processors, such as ARM MPCore, and using an SMP OS, the number of tasks sharing each core is reduced, allowing the designer to reduce the clock frequency of those cores.

The trick here is that lowering the required clock frequency for each core can allow the processor to be implemented with a low power IC process library. This allows power savings while retaining or even increasing the number of MIPS available to tasks in the OS. For example, on the ARM11 MPCore web page, high performance and low power figures are quoted, with the low power library timing to only 320 MHz. So by switching from a single core at 500 MHz and 0.43 mW/MHz to a dual core at 300 MHz and 0.23 mW/MHz, you can achieve a 35% power reduction (roughly 215 mW down to 138 mW, taking the dual core figure as per core) while gaining 100% more L1 cache and 20% more MIPS. If there is area budget available, it is an easy win.
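The arithmetic behind those percentages can be checked with a minimal sketch. It assumes that power scales linearly with clock frequency and core count, that the 0.23 mW/MHz figure applies to each core, and that both cores deliver the same MIPS per MHz as the single core. The function names are hypothetical and chosen only for illustration.

```c
#include <stdio.h>

/* Total power in mW, ASSUMING power scales linearly with clock
   frequency and with the number of cores (a simplification). */
double total_power_mw(int cores, double mhz, double mw_per_mhz) {
    return cores * mhz * mw_per_mhz;
}

/* Signed percentage change going from `before` to `after`. */
double percent_change(double before, double after) {
    return 100.0 * (after - before) / before;
}

/* Prints the single core versus dual core comparison from the text. */
void print_comparison(void) {
    double single_mw = total_power_mw(1, 500.0, 0.43); /* about 215 mW */
    double dual_mw   = total_power_mw(2, 300.0, 0.23); /* about 138 mW */

    /* Aggregate MHz is used as a stand-in for MIPS, ASSUMING the
       same MIPS per MHz on every core and perfectly spread tasks. */
    double single_mhz = 1 * 500.0;
    double dual_mhz   = 2 * 300.0;

    printf("power:  %.1f mW -> %.1f mW (%.1f%%)\n",
           single_mw, dual_mw, percent_change(single_mw, dual_mw));
    printf("cycles: %.0f MHz -> %.0f MHz (%+.1f%%)\n",
           single_mhz, dual_mhz, percent_change(single_mhz, dual_mhz));
}
```

Calling print_comparison() from main reproduces the comparison; real silicon will deviate from this linear model once leakage and shared logic such as the cache controller are taken into account.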

Processor Downgrade

Following on from reducing the clock frequency, there is the further option of downgrading the actual core used. This can be a good option for more deeply embedded applications where the load is concentrated in a predictable, fixed set of applications which run one at a time and are amenable to multithreading. If the load is evenly balanced, each core will have a lower performance requirement, so it can make sense to switch to an architecture that attains lower MIPS per MHz, which typically also means lower mW per MHz.

Unfortunately, the mainstream IP vendors do not yet appear to be offering this downgrade path, preferring to focus on multicore versions of their high end processors. However, some system integrators are starting to offer multicore platforms based around microcontroller architectures, which should start to fill this niche. If you are feeling brave, it is also possible to roll your own multicore system based on the likes of the Cortex M3, as TI presented at ARM DevCon.

Opportunity

So are things all that bad? There may be a multicore crisis, but processor vendors are already rolling out practical, easy to use, shared memory systems based around existing architectures that allow system designers to access the benefits of parallelism with surprisingly little effort. Since these platforms support the thread model of parallel programming, software engineers who invest the time in reading up on the subject can only benefit.

Barry

Keeping it Consistent

November 7th, 2008

As we dive deeper into this brave new world of the multicore programmer, it surprises me how many new and interesting ways there are to introduce subtle bugs into multithreaded applications. No wonder such programming is widely considered to be difficult stuff, not yet fully emerged from the shadows of a black art. One such set of subtleties is memory consistency models and, more precisely, how they can screw you up.

Something that has always impressed me about the last 30 years of microarchitecture innovation has been the insulation between the software binary and hardware worlds. Processors have undergone enormous change, from simple creatures where every transistor was precious to hugely complex speculative, out-of-order, multi-issue beasts where hundreds of instructions are thrown into the mix and the results land again miraculously in the right order. All this complexity has been hidden from programmers; they’ve been happy enough ingesting that free lunch courtesy of Moore, safe in the knowledge that their unmodified software will just get faster in the future. Unfortunately, in the multicore world the free lunch is over, and the hardware abstraction can now cause a little indigestion. An obscure hardware topic like memory consistency models suddenly has relevance to the application programmer.

So what is a memory consistency model? Well, it’s a complicated subject, but essentially it describes the order in which reads from and writes to memory appear to happen. Let’s consider a simple example on a single core:


int *x, y, z;

x = &y;
y = 1;
z = *x;

You would quite reasonably expect, despite all the optimization and trickery in the compiler and processor, that z would end up with the value 1. In reality, all sorts of things might be going on in the internals of the processor and cache, but the hardware will always ensure that things appear to happen in sequential order, even if the system has to work very hard underneath to give that impression.

When it comes to multicore, however, retaining that pretense becomes a whole lot harder. Consider a similar example split over two threads:


int ready = 0, data;
void thread1() {
  data = 42;
  ready = 1;
}
void thread2() {
  int answer;
  while (!ready)
    /* do something useful */;
  answer = data;
}

Sure, this isn’t the greatest example of multithreaded coding ever: thread2 hangs about waiting for the data to be ready in a busy wait loop. Thread1 makes the data available in memory and then afterwards marks its presence with the ready flag. When thread2 detects that ready is non-zero, it copies data into answer. So it will get the answer 42, right? Wrong! Well, to be precise, it almost always will, but there is a very small chance that it won’t. To me that sounds worse than it not working at all.

So what can possibly be wrong with this? If in thread1 you read back data immediately after you had set the ready flag, you would always get 42. Indeed, if thread2 happened to run on the same CPU as thread1, it would always get 42 as well. The problem comes when thread2 is on a different CPU. Even if a CPU sees its own stores happen in sequential order, that does not guarantee that another CPU will see them in the same order. It depends on the type of consistency model used, but many multicore systems use a relaxed consistency model where this type of mind bending relativism is all too possible. They have to, or else their cache coherence would take too long. The problem is that an access to the variable data might cause a second level cache miss, or go to a different bank of DRAM than the access to ready. In modern pipelined bus systems, different bus transactions can overtake each other and end up out of order.

The real issue is that the chances of this actually going wrong are very small. It’s much more subtle than many other, already pretty subtle, threading issues. The two threads must be on different cores, and the memory layout, cache contents and transaction ordering must be lined up in just the wrong way. You can be sure, though, that when things do go wrong this is a very nasty issue to find and debug.

So what’s the solution? Well, you may have spotted that the accesses to ready and data in the two threads might be considered a data race. Tools can be used to find such data races and point them out to the programmer. The way you would fix the data race would be to introduce a mutex lock around the accesses rather than using a DIY spin-lock like ready. The reason this works is that mutexes do slightly more than just lock and unlock. Normally the implementation of a mutex in the threads library also executes a memory barrier instruction. In short, this is a special instruction that tells the CPU to make its memory ordering sequential with respect to other CPUs. So this side effect of making a mutex call actually causes a serialization of all memory state and ensures that everything previously written by any core is fully committed so that the executing CPU can properly see it.
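The mutex fix can be sketched as follows, assuming a POSIX threads library. The names producer, consumer and run_handoff are hypothetical, and the consumer still polls, as in the original example, rather than blocking on a condition variable. This is one possible arrangement, not the only one.

```c
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int ready = 0;
static int data;

/* Plays the role of thread1: both stores happen while holding the lock. */
static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    data = 42;
    ready = 1;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Plays the role of thread2: ready and data are only read under the lock. */
static void *consumer(void *arg) {
    int *answer = arg;
    int done = 0;
    while (!done) {
        pthread_mutex_lock(&lock);
        if (ready) {
            *answer = data;
            done = 1;
        }
        pthread_mutex_unlock(&lock);
        /* do something useful between polls */
    }
    return NULL;
}

/* Hypothetical driver: runs both threads and returns what the consumer saw. */
int run_handoff(void) {
    pthread_t t1, t2;
    int answer = 0;

    pthread_mutex_lock(&lock);
    ready = 0;                       /* reset so the driver can be re-run */
    pthread_mutex_unlock(&lock);

    pthread_create(&t2, NULL, consumer, &answer);
    pthread_create(&t1, NULL, producer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return answer;
}
```

Because every access to ready and data now happens between a lock and an unlock of the same mutex, the consumer cannot observe ready set without also observing the store to data. Error checking on the pthread calls is omitted for brevity.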

Presuming you already use them, the problem with data race detection tools is that they can generate false positives. You might look at this code and be sure you know what you are doing, and there might be good reasons for writing the code in this way. Just beware though, sequential ordering isn’t quite as ordered as it used to be…
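If there really is a good reason to keep the flag-based handoff, the ordering has to be stated explicitly instead of being inherited from a mutex. The sketch below is an illustration under assumptions, not something the original example used: it relies on a compiler with C11 `<stdatomic.h>` support and on POSIX threads, and the names ready_flag, payload, writer, reader and run_flag_handoff are hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int ready_flag;
static int payload;

/* Writer: the release store orders the payload write before the flag. */
static void *writer(void *arg) {
    (void)arg;
    payload = 42;
    atomic_store_explicit(&ready_flag, 1, memory_order_release);
    return NULL;
}

/* Reader: the acquire load orders the flag read before the payload read. */
static void *reader(void *arg) {
    int *answer = arg;
    while (!atomic_load_explicit(&ready_flag, memory_order_acquire))
        ; /* do something useful */
    *answer = payload;
    return NULL;
}

/* Hypothetical driver: runs both threads and returns what the reader saw. */
int run_flag_handoff(void) {
    pthread_t t1, t2;
    int answer = 0;

    atomic_store(&ready_flag, 0);    /* reset before any thread starts */
    pthread_create(&t2, NULL, reader, &answer);
    pthread_create(&t1, NULL, writer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return answer;
}
```

The release and acquire pair gives the compiler and the CPU the same instruction the mutex gave implicitly: anything written before the flag is set must be visible to a thread that sees the flag set. On architectures with a relaxed consistency model this typically compiles to the barrier instructions discussed above.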

Richard Taylor, CTO

Richard