Prism Evaluation

Implementing Prism Scheduling with PThreads

This tutorial looks at the practicalities of implementing the force threads and data dependency resolution types used in Prism as PThread programs.

Setup

This tutorial assumes you are familiar with the basic use of Prism as detailed in the previous tutorials and have imported and built the tutorial examples.

Forced Threads

We looked at the force thread mechanism in an earlier tutorial and saw how this could be used evaluate the impact of running the specified function, and its call tree, in a thread. In this section we will look at how to implement the threading model using PThreads.

When a function is forced as a thread, the thread will run until the function returns. This thread is modeled with no synchronization back to its parent, often this is the cs3_start_c main thread but could be any existing thread in the trace. Therefore it is possible for a forced thread to have a longer lifetime than the main function if there are no data dependencies back to the parent thread which is rarely the case in real applications.


int main (int argc, char* argv[])

{

   pthread_t aThreadA;

   pthread_t aThreadSetB[INSTANCE_COUNT];



   pthread_create(&aThread, NULL, funcA, NULL);



   for (i=0; i < INSTANCE_COUNT; ++i)

      pthread_create(aThreadSetB + i, 

	                 NULL, 

	                 funcB, 

	                 NULL);



   return EXIT_SUCCESS;

}

PThread equivalent of force thread on single call to funcA and multiple calls to funcB.

While the above example shows the literal implementation of the Prism schedule in PThreads this is rarely the best implementation. However, it does give a good starting point for going from sequential to parallel code and can be optimized later with further iterations through Prism.

Dependency Resolutions

As we explored in a previous tutorial, Prism will detect dependencies and data races between threads and model the effects of various solutions to resolve these and therefore prevent errors when running on a multicore platform. This section looks at how these resolutions can be translated into PThread code.

None

The None resolution is probably the best resolution in terms of performance since it completely removes the restrictions of the dependency with inserting any synchronization overhead. The problem is that it often requires substantial restructuring of not only the functions involved but sometimes the entire algorithm.

Generally there are two ways of implementing a None resolution . Firstly, if the dependency is not actually a real communication between the threads then simply change the scope of the shared data to be local to each thread.


// Remove global variable

// int gVar;



void funcA(void)

{

   int aVar; // local version of gVar

   /* etc */

}



void funcB(void)

{

   int aVar; // local version of gVar

   /* etc */

}



int main (int argc, char* argv[])

{

   pthread_t aThreadA, aThreadB;



   pthread_create(&aThreadA, NULL, funcA, NULL);

   pthread_create(&aThreadB, NULL, funcB, NULL);



   return EXIT_SUCCESS;

}

Localizing unnecessarily shared data

Secondly, if the dependency is real and the threads are communicating through the data then try to restructure the algorithm so that local versions of the data are generated by each thread which are then combined by another function which runs once all of the threads have finished.


void funcA(int *pTotal)

{

   /* Some calculation */

   *pTotal += aLocalCalc;

}



int main (int argc, char* argv[])

{

   // This was the shared variable 

   // but will now only be used in this thread.

   int aTotal = 0; 

   // Provide storage for each thread instead

   int aTotalSet[ITERATIONS]; 



   pthread_t aThreadSet[ITERATIONS];



   for (i=0 i < ITERATIONS; ++i)

   {

      aTotalSet[i] = 0;

      pthread_create(

	      aThreadSet + i, NULL, 

		  funcA, (void*)(aTotalSet + i));

   }



   // Wait for all of the threads to complete

   for (i=0 i < ITERATIONS; ++i)

      pthread_join(aThreadSet[i], NULL);



   // Calculate the actual value we want from 

   // the partial solutions from the threads.

   aTotal = calcTotal(aTotalSet);



   /* etc */



   return EXIT_SUCCESS;

}

Accumulating data from multiple threads

While the first situation may seem straightforward it can be a common cause of data races when it isn't clear that data is actually been shared. This is often due to pointers. The second case is obviously much more dependent on the type of code being threaded and requires a lot of application knowledge to implement successfully. If the amount of work required in the accumulation function, calcTotal in the example above, is excessive it could be that None is not the correct resolution type to be using in this situation.

Serialized

Serialized resolution is often the most efficient for dealing with threads which have a genuine dependency and must share data as they execute together. However, it can be quite subtle in its implementation and requires careful handling in order to ensure correct behavior in the resulting code.

The first thing to understand is the meaning of a thread segment. A thread segment is defined as a continuously executing block of code from a thread and, in terms of the PThreads library used with Prism, a thread can only be broken into segments by inserting synchronization events such as creating threads, unlocking mutexes, joining and explicitly calling pthread_yield(). If the thread is created in Prism with a Force Thread action then the entire function and all of its callees are considered to form a single segment which is the same length as the thread. The caller thread segment for the force thread function is split into 2 segments at the point of the function call.

For Serialized this means the scheduler in Prism does not allow another thread segment to do its first access to the shared data until the first thread segment has completed its last access to that same data. This means that while some of the thread segments my run in parallel they will have to be sequential over the range of all of the accesses to shared data so the length of the sequential region could be a lot longer than expected.

The other important issue is ordering. Prism will ensure that any shared memory locations are accesses by threads in the same order as in the original trace when it generates a schedule. It is up to the user to determine if a data dependency must be accessed in a fixed order and enforce that ordering if necessary in his software. If ordering is unimportant, or can be designed out, it may be possible to improve on the schedule estimated by Prism.

So again we have two cases; ordered and unordered serialization. Lets see how this is implemented in an example. The following threaded code is loaded into Prism.


int aSharedVar; 



void funcA(void)

{

   aSharedVar += doWork(aSharedVar);

   doMoreWork();

}



void funcB(void)

{

   aSharedVar += doWork(aSharedVar);

   doMoreWork();

}



int main (int argc, char* argv[])

{
   pthread_t aThreadA, aThreadB;



   aSharedVar = 2;



   pthread_create(aThreadA, NULL, funcA, NULL);

   pthread_create(aThreadB, NULL, funcB, NULL);

   

   printf("%d", aSharedVar);



   return EXIT_SUCCESS;

}

Threaded code with several data races

Obviously there are data races due to accesses made to aSharedVar from the 3 threads: aThreadA, aThreadB and main. Prism duly identifies this as several data races and matching dependencies. In this example, the dependencies between aThreadA and aThreadB are due to a read then a write of aSharedVar must be resolved as a single atomic transaction but the order in which these transactions occur are not important. The dependency due to the printf call from the main thread most be resolved in order, otherwise the result may be printed before it has been calculated. A possible implementation for this is a follows


int aSharedVar;



typedef struct

{

   pthread_mutex_t aMutex;

   pthread_cond_t aCondition;

   int aCounter;

} resolution_t;



resolution_t aResolution = 

   {PTHREAD_MUTEX_INITIALIZER, 

   PTHREAD_COND_INITIALIZER, 

   2};



void funcA(void)

{

   pthread_mutex_lock(&aResolution.aMutex);

   aSharedVar += doWork(aSharedVar);

   aResolution.aCounter--;

   pthread_cond_signal(&aResolution.aCondition);

   pthread_mutex_unlock(&aResolution.aMutex);

   

   doMoreWork();

}



void funcB(void)

{

   pthread_mutex_lock(&aResolution.aMutex);

   aSharedVar += doWork(aSharedVar);

   aResolution.aCounter--;

   pthread_cond_signal(&aResolution.aCondition);

   pthread_mutex_unlock(&aResolution.aMutex);



   doMoreWork();

}

int main (int argc, char* argv[])

{

   pthread_t aThreadA, aThreadB;

   aSharedVar = 2;

   

   pthread_create(

      &aThreadA, NULL, (void *) funcA, NULL);

   pthread_create(

      &aThreadB, NULL, (void *) funcB, NULL);

   

   pthread_mutex_lock(&aResolution.aMutex);



   while (aResolution.aCounter > 0)

   	pthread_cond_wait(

	  &aResolution.aCondition, 

	  &aResolution.aMutex);



   printf("%d", aSharedVar);

   

   pthread_mutex_unlock(&aResolution.aMutex);

   

   return EXIT_SUCCESS;

}

Threaded code with a data races resolved as per Serialized

While Serialized allows for threads containing data dependencies to safely execute in parallel some of the time, the granularity of thread segments can lead to excessively long regions of serialized code. In our example above, an entire call to doWork is serialized because it falls between the first and last accesses to aSharedVar. Obviously it would be far more efficient to take a local copy serially to pass to doWork before serializing again to do the sum. In these situations the programmer can make the optimizations straightaway, however, if the situation is more complex Prism can be used to support this process by liberally adding pthread_yield() calls into thread segments containing many dependencies and then reloading the new trace into Prism. This will result in smaller thread segments allowing more freedom in scheduling when choosing resolutions.


Proceed to the Next Tutorial