Prism Evaluation

Finding Concurrency in Sequential Code

Prism provides a large set of code analysis features to assist developers of multicore software. The various features are typically used in combination to support particular design flows and project life-cycle stages.

This tutorial introduces how Prism can be used when first starting on a multicore project. The ideas presented can be applied in a number of scenarios, typically in the earlier phases of the design process:

The following sections work from purely sequential code examples and walk through the process of evaluating a threaded version without having to make any code changes.

Setup

This tutorial requires the Tutorial_Examples project to setup as described in a previous tutorial and assumes you are familiar with creating PAD files and generating traces.

Create a PAD in the Data_Decomp directory and generate traces for Data_Decomp/example. Once the traces are loaded the Histogram tab in the PAD view it displayed and we can begin the analysis of the code.

Program Analysis

After loading the traces Prism displays the information gathered as a series of graphs under the histogram tab of the PAD view. These show :

Loading a Slice

From this view you can navigate around the trace and select a slice for in depth analysis using the other view in Prism. The reason for this slice approach is to allow Prism to handle very large traces representing 'real' applications. Since the amount of data involved can run to many GB it is impossible to load all of the trace into memory at the same time so instead the user selects a slice of interest and loads that.

Runtime histogram of example showing the slice to be loaded (Click to enlarge)

The slice is the highlighted region over the graphs and can be moved back and forth through the cycles by clicking and dragging. By default the slice is set to its smallest size but this can be increased with the Slice Size slider in the top left of the view. Bare in mind that the larger the slice the more processing is required so performance will suffer as larger slices are loaded.

Once loaded the other views are populated with data for the slice. It is important to remember that all the information displayed in the other views corresponds to the current slice only. However, any changes made will effect the program as a whole.

Understanding Program Structure

The Functions and Graph views aid the understanding of the programs structure at the function level and show how much of the execution time is spent in which parts of the call tree.

By default the graph view displays the call tree for the slice, with each edge annotated with the number of calls made from parent to child function. Note that the call graph contains all of the functions in the trace, not just the user functions, so main is not at the root.

Call graph for a slice of example (Click to enlarge)

The graph can be zoomed in and out, manipulated by dragging with the mouse and be used to further analyze the code via right clicking on the function nodes.

The functions view displays the execution profile of each function and function call tree in the slice. Initially this data is displayed in Call Tree mode showing how the flow of control moves down for any threads which are running in the current slice. In the case of sequential only code, the startup of the C runtime which invokes main is the only thread root present.

Function View showing the full call tree before filtering. (Click to enlarge)

We shall see more when we started forcing threads and loading programs containing pthread_create() calls. Various runtime information is displayed for each function, see the Function_View reference page for more details. The most important feature here is the Force Thread column which allows you to specify functions to become thread roots as part of the 'what if' analysis introduced later.

The view may show all of the LIBC functions in addition to user functions but can be simplified by filtering the kernel API functions

This will also filter PThread API functions when present.

The function view can be used to navigate around the source code of the program.

An annotated source code window will appear for the function. Most of the table views referencing the source code will provide this functionality.

In the flat view every function in the program is listed along with execution details for the current slice.

Viewing the Runtime Schedule

The Schedule View is accessed via the Schedule tab. This view graphically presents the schedule of the current slice of the program. A detailed description of its features is available Schedule_View reference pages but generally the view is based upon the trace captured from the simulator which is presented as a timeline.

This timeline can be viewed in 2 modes. The Core schedule, which shows activity on a core by core basis and the Thread schedule, which shows the activity of each thread. By default the core schedule is selected.

The core schedule. (Click to enlarge)

Each schedule type can be used to view schedules generated in 2 ways. The first is called Trace, which simply shows what happened in the simulator and is fixed unless the program itself is changed and its traces reloaded. The other is FIFO, which generates a schedule based on the simulator trace but is effected by changes made in Prism such as the addition of threads or cores and is the primary tool in evaluating opportunities for parallelizing sequential code.

We will look at most of these capabilities when parallelizing code in the next section but for now lets look at some of the basic navigation functionality.

At this point we can clearly see the call and return structure of the code and that most time is spent in doWork. The grey bar represent memory accesses. Hovering over any of the blocks (schedule segments) will cause it to highlight and eventually a tooltip will appear with details of the schedule segment.

Zoomed in schedule showing memory accesses. (Click to enlarge)

Further operations for the selected schedule segment are accessed via the right click menu and allow you to map from the timeline back to instruction in the source code. Linking from the other views into the schedule view is also enabled.

The schedule view is updated with markers for the function entry (green line) and function exit (red line). This can be repeated for any function and is valid for most of the table views in Prism. Markers will persist between changing between core and thread schedules and zoom levels.

In addition to showing how the schedule changes over time, Prism provides insight into the changing memory requirements of the program as it runs.

The memory view provides a table of all of the memory areas allocated by the program during the execution of the slice. The memory ranges are organized by allocation type; global, static, stack and heap and where possible are linked back to the statement in the user code which caused the allocation.

The source code annotation view will appear highlighting the malloc() call allocating the heap space.

The memory view can also be used to graphically examine the memory extent at any point in the slice.

The memory map will now be displayed for all of the allocations valid from the start of that function at the time at selected colored area.

Allocated memory for the slice (Click to enlarge)

Evaluating Code for Parallelism

Now that you are familiar with the basic tools available in Prism we will start to look at how this analysis can be put to use when considering how to extract parallelism from a sequential program.

The general flow is to load up the traces, select slice of the overall trace which represents the typical execution of the algorithm, for example the start and end slice are rarely where most of the interesting functionality is located in real applications. If the program has several major blocks of different functionality you may wish to analyze them as separate slices, focusing on decompositions which suit the particular needs of the algorithm at that point.

Once a slice is loaded the first stop is usually the Function View where the Call Tree view will quickly guide you to the most likely candidates for thread root functions. It depends heavily on the application but a good choice can typically be found by looking at the functions which have a good ratio between their call count and the percentage of the total cycles. The larger this ratio the more useful work will be done for the overhead cost of initiating the work on a thread.

In Prism, thread creation is modeled by selecting the Force Thread option for a given function in the function view. Prism will then model calls to the function as a pthread_create() call starting from the function. This results of this modeling can be seen in both the function view, where new call trees have been added for each forced thread root, and in the schedule view which will now contain many entries in the Thread Schedule rather than just one. Note that you must switch to the FIFO scheduler to see these effects.

Data Decomposition

Data decomposition is a very common way to extract parallelism where the several instances of the same function are run in parallel operating on separate parts of a data set. It can often lead to very scalable solutions if handled correctly but can be limited if data dependencies are present in the code.

Here we see that doWork is responsible nearly all of the loading of the processor and has a relatively low call count so could be a good candidate for a thread root. Lets see if it is by forcing a thread from it.

The schedule timeline (in Core mode) now shows a sequence of block rather than the single block we started with. Each of these blocks represents a thread executing but since we only have a single core in this system we don't get any benefit from this. Lets have a closer look at what is happening.

Thread interaction in the schedule view. (Click to enlarge)

Here the interaction between the threads can be clearly seen. The green, representing the main function, thread in blocked (Ready) while the forced thread (in blue) runs. Once it finishes the main thread runs again after a short delay. This delay is modeling the overhead of the context switch and thread management. The thread continues until it reaches a call to doWork at which point a thread create occurs and the context is switched to the new thread which begins executing while the parent thread blocks again. The thread creation is indicated by a blue arrow. Hover over it for a descriptive tool tip.

Arrows are used to indicate a number of relationships between different blocks in the schedule. We will see the other arrow types shortly.

So now we need to find out what sort of parallelism and potential performance increase is possible if we go ahead and implement a threaded version of the code. The simplest way to do this in Prism is to add extra core into the FIFO schedule which can model what would happen if the code was actually run on the equivalent multicore system, if it existed.

The schedule view will now update to show the execution on 2 cores. The green dashed line shows the original number of cycles in the trace from the simulator. The red dashed line shows the number of cycles required to execute the same code a multicore platform with the forced threads as estimated by the FIFO scheduler.

Data_Decomp Example running on 2 cores.(Click to enlarge)

Hovering over the red line produces a tool tip detailing the percentage performance gain so the benefits of adding more cores can easily be measured across multiple configurations.

As we can see this code appears to be highly scalable since with the multi threaded part of the code improving performance as we add more cores. So it is likely that doWork is a good function to execute in a thread to get scalable data decomposition.

Once we have more than one thread in Prism then it is likely that Data Dependencies will exist between them and this may lead to Data Races. Fortunately our simple example did not have any such complexities however this is rarely the case in real life applications so Prism provides a number of feature to deal with this which we shall examine in the next tutorial.


Proceed to the Next Tutorial