
The onset of the Multicore revolution promises dramatic leaps in performance for software which can easily be modified to run concurrently.
However this class of software application is not very common, particularly in the realm of embedded multimedia platforms where the engineer has to look hard to find any potential parallelism in the algorithms themselves let alone in the highly optimized sequential implementation code of standard codecs. This paper looks at how the program analysis and concurrency exploration features of Prism can be employed to help engineers to:
The H.264 video compression codec is of particular interest for embedded mobile media applications due to its very high encoding efficiency which is well suited for transmission over low bandwidth connections such as mobile phone networks.
The codec takes advantage of both spatial and temporal similarities in the video stream to remove redundant information by essentially expressing the video stream in terms of changes between and within frames when compared to a known reference frame. The following diagram illustrates how such a stream is decoded.
The incoming bit stream from the communication channel is quantized and transformed back from the frequency domain before being used to reconstruct each video frame. The video frame is sent either in inter mode where the frame can be constructed with only the data contained in the bit stream (the Intra Loop in the digram above) or in intra mode, where the frame is reconstructed with reference to the contents of the previous (and potentially future) frames. This loop can be followed through the Motion Estimation block in the diagram.
H.264 is particularly difficult to find concurrency in due to the dependency between frames in inter mode and between individual macro blocks in intra and inter modes however various potential sources of concurrency have been identified in research papers which will be discussed in the next section.
Typically the first place to look for parallelism is in the algorithm the software is implementing. This makes it a lot easier to spot natural areas of concurrency in the flow of data and the ordering of tasks before they are obscured by the details of a sequential implementation.
Clearly there are two basic forms of parallelism: Task Parallelism, where separate stages of the algorithm are running concurrently, and Data Parallelism, which focuses on partitioning the data set the algorithm works on and processing those partitions simultaneously.
In h.264, Task parallelism only allows for the Motion Compensation to run in parallel with the De-quantizer and IDCT steps. Data Decomposition presents far more opportunities for concurrency at various levels of processing:
In the following sections we will concentrate our investigations at the Macro block and Pixel levels in a bid to find easily exploitable concurrency without having to resort to major restructuring of the codec implementation.
To model a typical optimized implementation we are using the h.264 decoder as implemented in the open source ffmpeg library. The test data stream is the first 10 frames of the common foreman sequence at QCIF resolution.
Prism assists the the programmer in uncovering concurrency in 4 ways
These capabilities are particularly useful when dealing with a very complex code base such as a video decoder.
Once a simulation has been traced and loaded into Prism the analysis can begin with the histogram which gives a high level overview of the execution of the entire program.
Here we can clearly see a repeated cycle which presumably relates to the the decoding of each frame. This can be clarified by calling up the select overview dialog which list all of the functions in the trace.
Here the functions can be filtered by how much time is spent in each; in terms of total
runtime, callee tree and individually. Since the decode_frame function should take up most of the
runtime we sort by percentage of all cycles and quickly find the function decode_frame which
looks like the function we are after. This can be verified by displaying the activity trace of the function in another histogram
group by itself.
The next step is to use the same tools to attempt to map what we already know about the structure of the h.264 decoder onto this particular implementation.
Ideally we would expect the major high level functional blocks to be represented in the top few entries of the overview table when sorted in terms of the call tree. However looking at the top 30 or so functions, those contributing 1 or more % to the overall runtime, it is obvious that the actual loading does not correspond well to our understanding of the algorithm. More investigation is required. Before moving on we note the following:
hl_decode_mb is probably the top of the macro block decode routine since it is
called 1188 times for 12 calls to decode frame and we are expecting 99 macro blocks per QCIF frame.hl_motion suggests that it is the motion compensation block and is called from hl_decode_mb. With a call count of 981 this suggests that a high percentage of blocks must be inter-predicted in
the test stream.decode_slice is called the same number of times as
decode_frame confirming our suspicion from the initial analysis that there are likely to be few
slices per frame.Since we have already established that there is little opportunity for concurrency at either Frame or Slice level we will take a closer look at the behavior within the decode of an individual frame.
Sizing the slice selection window to match the duration of a call to decode_frame in the
histogram we make sure we are picking up a typical inter frame by adding a trace of the hl_motion function activity which shows we are picking up an interesting frame type with a high ratio of motion compensated macro blocks.
Now Prism can present the detailed call tree for an entire frame decode. This allows very close analysis of the structure of the code.
Immediately we can see that a lot of the runtime is spent in low level functions forming part
of the motion estimation. This maps well back into what we saw in the trace level run times with
put_h264_qpel8_v_lowpass and its precedents consuming much of the total runtime.
When considering how to partition code into threads we need to understand how much work is required to refactor the code and how much performance gain will be realized for the effort expended. Prism allows you to quickly explore the options available with its force thread feature. There are a number ways of selecting where to make the thread partition when thinking purely at the software level, as opposed to thinking algorithmically as discussed previously.
Generally, creating a thread for a function call which is high in the tree will ensure that the thread has a lot of work to do during its life time which will amortize the overhead of creating the thread. It is also common for the amount of shared data to be smaller higher in the call making synchronization between threads easier to implement. On the other hand, a function further down the call tree can be easier to understand and may often work on very simple sets of data which can be more easily partitioned between threads. In this case a more complex thread pool type arrangement may be required to reduce the overhead of the costly creation of many threads for little processing per thread. Using Prism we can quickly evaluate these properties for our h.264 example.
Prism allows the developer to model running a function in a thread simply by checking the relevant property for the function, or functions, of interest and Prism will model a resulting schedule across multiple cores. For this initial investigation lets assume a 4 core target.
Initially we will try threading at a high level in the call tree and force the
call to the hl_decode_mb function which is as high a level parallelism which may exist in a frame.
This process is very straightforward, simply enable force thread in the functions view and
asking Prism to generate a new schedule.
The resulting schedule (below) shows that simply introducing threads at this level does not have any real
effect of the runtime. This is due to Data Dependencies between the function now running in a thread
and the rest of the code base which is effectively running in the main thread. The schedule generated by Prism serializes interdependent threads by default to maintain the ordering of dependent data accesses. While this ensures that the schedule is guaranteed to be a correct program the result is that execution remains essentially sequential.
Prism provides facilities for the developer to relax some of these data dependency constraints to evaluate the results of potential code refactorings
Prism provides details of all of the dependencies between threads in it's Data Dependency view. From here the location and degree of each dependency can be examined on at the thread, function and instruction level by providing a direct mapping back into the relevant lines of source code accessing the shared memory areas. Ideally the next step would be to use the information to make the necessary changes in the source code to remove the dependencies or limit their extent using synchronization methods such as mutexes. This would then hopefully allow for more concurrency between threads.
A quick look at the number of dependencies, there are hundreds, indicates that this would be extremely time consuming. Worse we would have no guarantee that the investment in time would pay off with additional performance. Fortunately we can use Prism to evaluate the benefits of resolving these dependencies without having to modify any code first, saving substantial amounts of time.
Prism models resolving dependencies between complete threads which, while very fast, can be too
coarse grain to produce conclusive results in some situations. In this case, it would be beneficial to
remove the dependencies from hl_decode_mb to the main thread since this would allow the loop
which calls hl_decode_mb repeatedly to continue running so generating a large number of
hl_decode_mb threads to run in parallel.
Unfortunately this scenario is unrealistic since once the the loop has completed the main
thread will continue to run, starting the yuv to RGB conversion before all of the macro block had
been decoded. Clearly the range of the dependency resolution needs to be restricted. The way to do
this in Prism is to simply force an additional thread for the parent function, in this case decode_slice
Once this containing thread has been created the dependencies between it and the original child thread
can be resolved allowing it to run in parallel while remaining sequential to the main thread of the
program. This has the added advantage of reducing the overall number of dependencies to be resolved.
The performance improvement can be seen in the schedule view above. Prism reports the new runtime estimate as a percentage improvement over the sequential trace. This benefit however must be balanced against the cost of implementation. Prism provides a clear indication of the work involved in the dependency view by listing the numeber and type of dependendecies to be resolved. 3 types of data dependency are cataloged: Read After Write (RAW), Write After Write (WAW) and Write After read (WAR).
Of these RAW dependendecies are particularly difficult to remove from sequential code since they represent a true dependency where one thread consumes the results of the other and implies substantial refatoring of the code to accomadate. WAW and WAR dependencies can often be protected with mutexes or simply removed entierly by localizing the variable shared variable to each thread.
The resulting performance looks promising but there is a lot of work involved due to the number of data dependencies which need to be resolved. Since it takes little time to evaluate a threading strategy we can investigate alternative threading points in a similar manner and compare the results.
| Threaded Function | Best Improvment | Change Complexity | ||
|---|---|---|---|---|
| RAW | WAW | WAR | ||
| hl_decode_mb | 69% | 298 | 238 | 695 |
| hl_motion | 12% | 555 | 269 | 51 |
| mc_dir_part | 31% | 0 | 60 | 23 |
| put_h264_qpel16_mc03_c | No Change | NA | NA | NA |
The above table summarizes a number of threading options at various levels in the call tree for hl_decode_mb. The effort is summarized in terms of the number and type of dependencies to be resolved in order to obtain the Best Improvement. Clearly threading hl_decode_mb has the greatest potential here but the implmentation cost appears huge. A more attractive proposition might be to implement the mc_dir_part threading instead. While the performance gain is only half that of hl_decode_mb the complexity of the job is far less too.
Faced with a complete re-implementation or a pragmatic refactoring, Prism gives developer the tools to quickly uncover available parallelism and evaluate the trade off between effort and reward in order to make the optimal design choices to meet the needs of the current project.
Once a coarse of implementation has been decided upon, the Prism provides the developer markers within the source code to guide refactoring to ensure that no dependency is missed leading to a potential threeaading bug. Further, the developer can continually use the schedule vizualization features of Prism to continually evaluate and verify the threading implementation work as it is in progress.