Prism Evaluation

Evaluating Complex Code for Parallelism

The onset of the Multicore revolution promises dramatic leaps in performance for software which can easily be modified to run concurrently.

However this class of software application is not very common, particularly in the realm of embedded multimedia platforms where the engineer has to look hard to find any potential parallelism in the algorithms themselves let alone in the highly optimized sequential implementation code of standard codecs. This paper looks at how the program analysis and concurrency exploration features of Prism can be employed to help engineers to:

Parallelizing the H.264 Decoder

The H.264 video compression codec is of particular interest for embedded mobile media applications due to its very high encoding efficiency which is well suited for transmission over low bandwidth connections such as mobile phone networks.

The codec takes advantage of both spatial and temporal similarities in the video stream to remove redundant information by essentially expressing the video stream in terms of changes between and within frames when compared to a known reference frame. The following diagram illustrates how such a stream is decoded.

Block diagram of a H.264 Decoder (click to enlarge)

The incoming bit stream from the communication channel is quantized and transformed back from the frequency domain before being used to reconstruct each video frame. The video frame is sent either in inter mode where the frame can be constructed with only the data contained in the bit stream (the Intra Loop in the digram above) or in intra mode, where the frame is reconstructed with reference to the contents of the previous (and potentially future) frames. This loop can be followed through the Motion Estimation block in the diagram.

H.264 is particularly difficult to find concurrency in due to the dependency between frames in inter mode and between individual macro blocks in intra and inter modes however various potential sources of concurrency have been identified in research papers which will be discussed in the next section.

Potential Sources of Concurrency

Typically the first place to look for parallelism is in the algorithm the software is implementing. This makes it a lot easier to spot natural areas of concurrency in the flow of data and the ordering of tasks before they are obscured by the details of a sequential implementation.

Clearly there are two basic forms of parallelism: Task Parallelism, where separate stages of the algorithm are running concurrently, and Data Parallelism, which focuses on partitioning the data set the algorithm works on and processing those partitions simultaneously.

In h.264, Task parallelism only allows for the Motion Compensation to run in parallel with the De-quantizer and IDCT steps. Data Decomposition presents far more opportunities for concurrency at various levels of processing:

In the following sections we will concentrate our investigations at the Macro block and Pixel levels in a bid to find easily exploitable concurrency without having to resort to major restructuring of the codec implementation.

Benchmark

To model a typical optimized implementation we are using the h.264 decoder as implemented in the open source ffmpeg library. The test data stream is the first 10 frames of the common foreman sequence at QCIF resolution.

Discovering Concurrency with Prism

Prism assists the the programmer in uncovering concurrency in 4 ways

These capabilities are particularly useful when dealing with a very complex code base such as a video decoder.

Decoder Overview

Once a simulation has been traced and loaded into Prism the analysis can begin with the histogram which gives a high level overview of the execution of the entire program.

Histogram view of the decoder trace(Click to Enlarge)

Here we can clearly see a repeated cycle which presumably relates to the the decoding of each frame. This can be clarified by calling up the select overview dialog which list all of the functions in the trace.

Overview of traced functions (Click to Enlarge)

Here the functions can be filtered by how much time is spent in each; in terms of total runtime, callee tree and individually. Since the decode_frame function should take up most of the runtime we sort by percentage of all cycles and quickly find the function decode_frame which looks like the function we are after. This can be verified by displaying the activity trace of the function in another histogram group by itself.

Examining decode_frame activity (Click to Enlarge)

The next step is to use the same tools to attempt to map what we already know about the structure of the h.264 decoder onto this particular implementation.

Ideally we would expect the major high level functional blocks to be represented in the top few entries of the overview table when sorted in terms of the call tree. However looking at the top 30 or so functions, those contributing 1 or more % to the overall runtime, it is obvious that the actual loading does not correspond well to our understanding of the algorithm. More investigation is required. Before moving on we note the following:

Frame Level Analysis

Since we have already established that there is little opportunity for concurrency at either Frame or Slice level we will take a closer look at the behavior within the decode of an individual frame.

Sizing the slice selection window to match the duration of a call to decode_frame in the histogram we make sure we are picking up a typical inter frame by adding a trace of the hl_motion function activity which shows we are picking up an interesting frame type with a high ratio of motion compensated macro blocks.

(Click to Enlarge)

Now Prism can present the detailed call tree for an entire frame decode. This allows very close analysis of the structure of the code.

(Click to Enlarge)

Immediately we can see that a lot of the runtime is spent in low level functions forming part of the motion estimation. This maps well back into what we saw in the trace level run times with put_h264_qpel8_v_lowpass and its precedents consuming much of the total runtime.

Exploring Threading Options

When considering how to partition code into threads we need to understand how much work is required to refactor the code and how much performance gain will be realized for the effort expended. Prism allows you to quickly explore the options available with its force thread feature. There are a number ways of selecting where to make the thread partition when thinking purely at the software level, as opposed to thinking algorithmically as discussed previously.

Generally, creating a thread for a function call which is high in the tree will ensure that the thread has a lot of work to do during its life time which will amortize the overhead of creating the thread. It is also common for the amount of shared data to be smaller higher in the call making synchronization between threads easier to implement. On the other hand, a function further down the call tree can be easier to understand and may often work on very simple sets of data which can be more easily partitioned between threads. In this case a more complex thread pool type arrangement may be required to reduce the overhead of the costly creation of many threads for little processing per thread. Using Prism we can quickly evaluate these properties for our h.264 example.

Prism allows the developer to model running a function in a thread simply by checking the relevant property for the function, or functions, of interest and Prism will model a resulting schedule across multiple cores. For this initial investigation lets assume a 4 core target.

Setting the architecture (Click to Enlarge)

Initially we will try threading at a high level in the call tree and force the call to the hl_decode_mb function which is as high a level parallelism which may exist in a frame. This process is very straightforward, simply enable force thread in the functions view and asking Prism to generate a new schedule.

The resulting schedule (below) shows that simply introducing threads at this level does not have any real effect of the runtime. This is due to Data Dependencies between the function now running in a thread and the rest of the code base which is effectively running in the main thread. The schedule generated by Prism serializes interdependent threads by default to maintain the ordering of dependent data accesses. While this ensures that the schedule is guaranteed to be a correct program the result is that execution remains essentially sequential.

Sequential schedule (Click to Enlarge)

Prism provides facilities for the developer to relax some of these data dependency constraints to evaluate the results of potential code refactorings

Resolving Data Dependencies

Prism provides details of all of the dependencies between threads in it's Data Dependency view. From here the location and degree of each dependency can be examined on at the thread, function and instruction level by providing a direct mapping back into the relevant lines of source code accessing the shared memory areas. Ideally the next step would be to use the information to make the necessary changes in the source code to remove the dependencies or limit their extent using synchronization methods such as mutexes. This would then hopefully allow for more concurrency between threads.

Data Dependency View

Data Dependencies between concurrent hl_decode_mb threads. (Click to Enlarge)

A quick look at the number of dependencies, there are hundreds, indicates that this would be extremely time consuming. Worse we would have no guarantee that the investment in time would pay off with additional performance. Fortunately we can use Prism to evaluate the benefits of resolving these dependencies without having to modify any code first, saving substantial amounts of time.

Resolving a dependency (Click to Enlarge)

Prism models resolving dependencies between complete threads which, while very fast, can be too coarse grain to produce conclusive results in some situations. In this case, it would be beneficial to remove the dependencies from hl_decode_mb to the main thread since this would allow the loop which calls hl_decode_mb repeatedly to continue running so generating a large number of hl_decode_mb threads to run in parallel.

Unfortunately this scenario is unrealistic since once the the loop has completed the main thread will continue to run, starting the yuv to RGB conversion before all of the macro block had been decoded. Clearly the range of the dependency resolution needs to be restricted. The way to do this in Prism is to simply force an additional thread for the parent function, in this case decode_slice

Once this containing thread has been created the dependencies between it and the original child thread can be resolved allowing it to run in parallel while remaining sequential to the main thread of the program. This has the added advantage of reducing the overall number of dependencies to be resolved.

The resulting optimal schedule (Click to Enlarge)

The performance improvement can be seen in the schedule view above. Prism reports the new runtime estimate as a percentage improvement over the sequential trace. This benefit however must be balanced against the cost of implementation. Prism provides a clear indication of the work involved in the dependency view by listing the numeber and type of dependendecies to be resolved. 3 types of data dependency are cataloged: Read After Write (RAW), Write After Write (WAW) and Write After read (WAR).

Of these RAW dependendecies are particularly difficult to remove from sequential code since they represent a true dependency where one thread consumes the results of the other and implies substantial refatoring of the code to accomadate. WAW and WAR dependencies can often be protected with mutexes or simply removed entierly by localizing the variable shared variable to each thread.

Evaluating Alternatives

The resulting performance looks promising but there is a lot of work involved due to the number of data dependencies which need to be resolved. Since it takes little time to evaluate a threading strategy we can investigate alternative threading points in a similar manner and compare the results.

Threaded Function Best Improvment Change Complexity
RAW WAW WAR
hl_decode_mb 69% 298 238 695
hl_motion 12% 555 269 51
mc_dir_part 31% 0 60 23
put_h264_qpel16_mc03_c No Change NA NA NA

The above table summarizes a number of threading options at various levels in the call tree for hl_decode_mb. The effort is summarized in terms of the number and type of dependencies to be resolved in order to obtain the Best Improvement. Clearly threading hl_decode_mb has the greatest potential here but the implmentation cost appears huge. A more attractive proposition might be to implement the mc_dir_part threading instead. While the performance gain is only half that of hl_decode_mb the complexity of the job is far less too.

Conclusion

Faced with a complete re-implementation or a pragmatic refactoring, Prism gives developer the tools to quickly uncover available parallelism and evaluate the trade off between effort and reward in order to make the optimal design choices to meet the needs of the current project.

Once a coarse of implementation has been decided upon, the Prism provides the developer markers within the source code to guide refactoring to ensure that no dependency is missed leading to a potential threeaading bug. Further, the developer can continually use the schedule vizualization features of Prism to continually evaluate and verify the threading implementation work as it is in progress.