Performance issues
We spend a lot of time working with customers to ensure that their code is free from the bugs that plague multithreaded programs, such as data races and deadlock, and rightly so. However, worrying over the details makes it easy to lose sight of the whole reason for going multicore in the first place: performance.
Although performance optimization, and the architectural awareness it requires, is familiar to many developers, actual optimization is often an afterthought. Functionality is king, and there will always be faster computers.
Now that applications have stopped getting faster by themselves, the software developer must be aware of performance optimization from much earlier in the design process. Multicore means that we all need to dust off the computer architecture textbooks and start thinking about the hardware again.
A good example of this is False Sharing. This occurs when two or more threads running on different cores do not actually share data but access data at addresses close to each other.
This works against the way the cache operates. Each core has a small, fast cache memory attached to it, which is loaded with data from the slower main memory in an effort to avoid stalling the processor while it waits for data. The cache works on the principle of data locality: most programs that access a location in memory will probably access it, or a nearby location, multiple times in a short period of time. The cache is therefore designed to load multiple adjacent addresses (a line or block) at a time to take advantage of this. In False Sharing this means that two cores must share a line of cache data, and pass it back and forth whenever one of them writes to it, even though no actual data is being shared between them.
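To make the idea of a cache line concrete, here is a minimal sketch that maps an address to the line containing it. The 64-byte line size is an assumption (common on x86 parts, but not universal), and the helper names cache_line_of and same_cache_line are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Check your own processor's documentation. */
#define CACHE_LINE_BYTES 64u

/* Index of the cache line that contains the given address. */
static uintptr_t cache_line_of(const void *p)
{
    return (uintptr_t)p / CACHE_LINE_BYTES;
}

/* Nonzero if both addresses fall within the same cache line, so that two
   cores touching them would be working on one and the same line. */
static int same_cache_line(const void *a, const void *b)
{
    return cache_line_of(a) == cache_line_of(b);
}
```

With 8-byte doubles and a 64-byte line, eight consecutive elements of an array that starts on a line boundary land in the same line, so elements 0 and 7 share a line while elements 0 and 8 do not.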
The results can apparently be bad. I was not sure how bad, so I built a little test: a program that calculates a Sum of Products of two large arrays in two ways. The first is INTERLEAVED, where each thread starts at an offset equal to its own thread number and steps through the data with a stride equal to the number of threads, basically interleaving the threads' accesses to adjacent addresses. For higher numbers of threads this is going to be bad for locality too, so a double whammy.
for (i = params->threadNo; i < dataSize; i += params->threads)
    params->result += params->a[i] * params->b[i];
The second is BLOCKY, which allocates a contiguous chunk of data to each thread:
for (i = aBlockSize * params->threadNo; i < aBlockSize * (params->threadNo + 1); ++i)
    params->result += params->a[i] * params->b[i];
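For readers who want to try this themselves, the two loops above can be wrapped in a small POSIX threads harness along the following lines. This is a sketch, not the original test program: the Params layout, the interleaved flag, the MAX_THREADS limit and the sum_of_products function are all assumptions made for illustration, and the last block is stretched to cover any remainder when the data size does not divide evenly.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 32   /* assumed upper limit, matching the largest run */

/* Hypothetical per-thread parameter block. */
typedef struct {
    const double *a, *b;   /* input arrays (read only) */
    size_t dataSize;       /* number of elements in each array */
    size_t threads;        /* total number of threads */
    size_t threadNo;       /* this thread's index, 0..threads-1 */
    int interleaved;       /* nonzero: INTERLEAVED, zero: BLOCKY */
    double result;         /* this thread's partial sum */
} Params;

static void *worker(void *arg)
{
    Params *params = arg;
    size_t i;

    params->result = 0.0;
    if (params->interleaved) {
        /* Stride through the data, one element per "round" of threads. */
        for (i = params->threadNo; i < params->dataSize; i += params->threads)
            params->result += params->a[i] * params->b[i];
    } else {
        /* One contiguous block per thread; the last thread takes the tail. */
        size_t aBlockSize = params->dataSize / params->threads;
        size_t end = (params->threadNo + 1 == params->threads)
                         ? params->dataSize
                         : aBlockSize * (params->threadNo + 1);
        for (i = aBlockSize * params->threadNo; i < end; ++i)
            params->result += params->a[i] * params->b[i];
    }
    return NULL;
}

/* Computes sum(a[i] * b[i]) using the given number of threads.
   Returns 0 on success and stores the total in *out, or -1 on error. */
static int sum_of_products(const double *a, const double *b, size_t n,
                           size_t threads, int interleaved, double *out)
{
    pthread_t tid[MAX_THREADS];
    Params p[MAX_THREADS];
    size_t t, started = 0;
    double total = 0.0;

    if (threads == 0 || threads > MAX_THREADS)
        return -1;

    for (t = 0; t < threads; ++t) {
        p[t] = (Params){ a, b, n, threads, t, interleaved, 0.0 };
        if (pthread_create(&tid[t], NULL, worker, &p[t]) != 0)
            break;
        ++started;
    }
    /* Join whatever was started, then add up the partial sums. */
    for (t = 0; t < started; ++t) {
        pthread_join(tid[t], NULL);
        total += p[t].result;
    }
    if (started != threads)
        return -1;
    *out = total;
    return 0;
}
```

Calling sum_of_products once with interleaved set to 1 and once with it set to 0, on the same arrays, gives the two variants to compare. Note that in this sketch the Params structures sit next to each other in an array, so the result fields of neighbouring threads may themselves land in the same cache line. That is a property of the sketch and may not match the original program.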
Running the code with 1, 2, 4, 8, 16 and 32 threads on a quad-core Xeon produces the following numbers from the time command. As you might expect, INTERLEAVED does much worse than BLOCKY as the number of threads increases.

What is interesting here is the Real and User numbers for INTERLEAVED. While the Real (wall clock) time at first improves slightly and then rapidly becomes worse than that of a single thread, it is the User time that increases massively. This highlights a major issue when things go wrong like this on multicore. The User time remains at 4x the Real time, which is unsurprising since this is a 4-core machine, but the lesson is that all of those cores are running continually. We are burning 4x the power to achieve ever-worsening performance, resulting in a huge overall energy drain. Maybe not a problem for a fan-cooled, mains-powered server, but very bad news for your mobile phone.
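The relationship between the two numbers can also be measured from inside a program. The sketch below assumes a POSIX system that provides clock_gettime with CLOCK_MONOTONIC and CLOCK_PROCESS_CPUTIME_ID; the Stopwatch type and the function names are made up for illustration. The process CPU clock counts system time as well as user time across all threads, so it only approximates the User figure.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* request clock_gettime from <time.h> */
#endif
#include <time.h>

/* A pair of readings: wall clock time and CPU time used by the process. */
typedef struct {
    double real;   /* seconds on the monotonic (wall) clock */
    double cpu;    /* CPU seconds consumed by all threads of the process */
} Stopwatch;

/* Reads one clock and converts it to seconds; returns -1.0 on failure. */
static double seconds_on(clockid_t clock)
{
    struct timespec ts;
    if (clock_gettime(clock, &ts) != 0)
        return -1.0;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Takes a reading of both clocks. */
static Stopwatch stopwatch_now(void)
{
    Stopwatch s;
    s.real = seconds_on(CLOCK_MONOTONIC);
    s.cpu = seconds_on(CLOCK_PROCESS_CPUTIME_ID);
    return s;
}

/* Average number of cores kept busy between two readings:
   CPU seconds divided by wall clock seconds. */
static double busy_cores(Stopwatch start, Stopwatch end)
{
    double real = end.real - start.real;
    return real > 0.0 ? (end.cpu - start.cpu) / real : 0.0;
}
```

Taking one reading before the threads start and one after they are joined, a result near 4 on a 4-core machine means every core was busy for the whole run, whether or not the work finished any sooner.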


