FAQ


Comparison With Alternative Solutions

1. Why use Cascade coprocessor synthesis instead of simply deploying another general-purpose (GP) processor?

Deploying an additional GP RISC core with the same instruction set may accelerate embedded software execution. However, a GP RISC core lacks both the instruction-level parallelism and the parallel hardware resources needed to execute compute-intensive software efficiently. Consequently, the acceleration achieved does not scale in proportion to the number of processors added or to the power they consume.

Cascade coprocessor synthesis generates a coprocessor with the instruction-level parallelism and the parallel hardware resources needed to deliver superior execution performance and power efficiency.

2. What is the difference between Cascade coprocessor synthesis and custom instruction set processor (CISP) design using EDA tools?

EDA CISP design tools are processor development tools for expert processor designers. They require a manual description of the processor architecture, generally in a proprietary language, before the RTL implementation can be generated automatically. The whole process consumes several weeks of design time. Moreover, if the CISP's instruction set differs from that of the processor for which the legacy software was originally developed, the embedded software must be redeveloped.

By contrast, Cascade requires no processor design expertise. Cascade automatically generates both the optimum coprocessor architecture and its RTL implementation in a matter of days, and legacy software is re-used 'as is'. It thus brings expert coprocessor design within reach of both embedded software developers and RTL implementation designers.

3. What is the difference between Cascade coprocessor synthesis and custom instruction set processor (CISP) design using configurable IP?

Configurable IP approaches to CISP design require varying degrees of processor design expertise, generally requiring a processor architectural description in a proprietary language. As with EDA approaches to CISP design, use of an instruction set that is different from the processor for which the legacy software was originally developed mandates redevelopment of the embedded software.

By contrast, Cascade requires no processor design expertise. Cascade automatically generates both the optimum coprocessor architecture and its RTL implementation in a matter of days, and legacy software is re-used 'as is'. It thus brings expert coprocessor design within reach of both embedded software developers and RTL implementation designers.

4. What is the difference between Cascade coprocessor synthesis and behavioural synthesis?

Coprocessor synthesis generates a software-programmable engine that may be re-programmed to deliver modified or different functionality. Behavioural synthesis generates fixed-function hardware that must be re-designed to accommodate function changes. Moreover, current behavioural synthesis tools often cannot implement complex algorithms in their entirety.

A fixed-function block may well deliver the requisite block-level performance, but the system-level performance gain may be significantly lower: the block has no local cache memory, so it must access system memory over the system bus, and bus loading may delay that access. By contrast, the programmable coprocessor's optimised local cache minimises latencies to deliver optimum system-level performance.

In addition, behavioural synthesis tools use disparate C dialects, which:

  • Prevent standard algorithms expressed in standard C from being used without modification.
  • Oblige the designer to develop proprietary algorithms in the C dialect of the selected behavioural synthesis tool. That is, the algorithm developer must know in advance which tool the hardware implementation designers will use.
  • Require any algorithm, standard or proprietary, to be re-developed before it can be implemented with a different behavioural synthesis tool, hampering the algorithm's re-use across multiple design teams.


Embedded Software Questions

5. Why does Cascade use object code as the input format? Would it not be better to use the intermediate output from the front-end of the compiler, before it is flattened, optimized and constrained by the specifics of the target RISC architecture?

Many compiler optimizations benefit code execution on a Cascade coprocessor just as they do on a RISC processor, so little is lost by starting from optimized object code. Moreover, use of the object code enables heavily optimized legacy assembler code to be offloaded. It also enables continued use of the designer's existing development environment.

6. How is the original code modified to make use of the coprocessor?

Cascade automatically modifies the original binary code. The first few instructions of each offloaded function are replaced with a branch to a coprocessor handler function that initiates coprocessor operation, passes the parameters to the coprocessor and collects the results.
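The handler's job can be illustrated with a minimal C sketch. Everything in it is hypothetical: the register layout (`copro_regs`), the function names and the multiply-accumulate being offloaded are invented for illustration, and a host-side `copro_simulate()` stands in for the hardware so the code runs anywhere. It is not Cascade's generated handler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register block of a slave (blocking) coprocessor. On real
 * hardware this would be a volatile pointer to memory-mapped registers;
 * here it is an ordinary struct so the sketch runs on a host. */
typedef struct {
    uint32_t args[3];   /* parameter registers */
    uint32_t result;    /* result register */
    uint32_t start;     /* written to 1 to start */
    uint32_t done;      /* reads 1 when the offloaded work has finished */
} copro_regs;

/* Original software version of the offloaded function (illustrative). */
static uint32_t mac_sw(uint32_t a, uint32_t b, uint32_t c)
{
    return a * b + c;
}

/* Host-side stand-in for the coprocessor hardware. */
static void copro_simulate(copro_regs *r)
{
    if (r->start) {
        r->result = r->args[0] * r->args[1] + r->args[2];
        r->start = 0;
        r->done = 1;
    }
}

/* Hypothetical handler reached through the branch patched into mac_sw():
 * pass the parameters, start the coprocessor, wait, collect the result. */
static uint32_t mac_handler(copro_regs *r, uint32_t a, uint32_t b, uint32_t c)
{
    r->args[0] = a;
    r->args[1] = b;
    r->args[2] = c;
    r->done = 0;
    r->start = 1;
    while (!r->done)
        copro_simulate(r);   /* on hardware: poll the status register */
    return r->result;
}
```

Here the handler takes the register block as an extra argument so it can be exercised on a host; a real handler would presumably use a fixed memory-mapped address and keep the original function's signature.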

7. How is a 'golden' version of the software (C and microcode) maintained and managed to support future end product maintenance?

Cascade generates a special architecture description file that must be maintained under version control to allow new code to be generated for the coprocessor.

8. How does patching object code affect debug?

When the original code is modified, the code itself is not removed. Only the first three instructions of each offloaded function are overwritten, so the debugger can still disassemble the remainder of the function.
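The patching step can be modelled on a function image held as an array of 32-bit instruction words. The stub values and the `patch_function()` name are placeholders, not real instruction encodings; an actual patch would encode a branch to the handler for the target ISA. The sketch shows the point made above: only three words change, and everything after them is left intact.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PATCH_WORDS 3  /* the FAQ states that three instructions are overwritten */

/* Placeholder words for a three-instruction branch-to-handler stub.
 * A real patch would use the target ISA's encodings and handler address. */
static const uint32_t stub[PATCH_WORDS] = { 0xAAAA0001u, 0xAAAA0002u, 0xAAAA0003u };

/* Overwrite the first PATCH_WORDS instructions of a function image with the
 * stub, saving the originals. Returns 0 on success, -1 if the function is
 * too short to patch. The remaining words are not touched. */
static int patch_function(uint32_t *code, size_t n_words, uint32_t saved[PATCH_WORDS])
{
    if (n_words < PATCH_WORDS)
        return -1;
    memcpy(saved, code, sizeof stub);
    memcpy(code, stub, sizeof stub);
    return 0;
}
```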

9. Does Cascade support multi-threaded software execution?

Cascade supports two communication mechanisms between the main processor and the coprocessor. In the case of the slave (blocking) model, the main processor waits while the coprocessor performs the offloaded task. In the streaming coprocessor (non-blocking) model, the coprocessor operates in parallel with the main processor and independently of it. The non-blocking model supports the execution of individual threads.
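The difference between the two models can be sketched with a host-side model of a coprocessor that needs a fixed number of ticks to finish. All names (`copro`, `copro_start()`, `copro_tick()`) and the doubling "work" are hypothetical; `main_work` simply counts units of main-processor activity that overlap the coprocessor in the non-blocking case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Host-side model of a coprocessor that needs `remaining` ticks to finish. */
typedef struct {
    uint32_t input, output;
    unsigned remaining;   /* ticks left before the result is ready */
    bool busy;
} copro;

static void copro_start(copro *c, uint32_t input, unsigned latency)
{
    assert(latency > 0);
    c->input = input;
    c->remaining = latency;
    c->busy = true;
}

/* Advance the model by one tick; the offloaded work is illustrative. */
static void copro_tick(copro *c)
{
    if (c->busy && --c->remaining == 0) {
        c->output = c->input * 2u;
        c->busy = false;
    }
}

/* Slave (blocking) model: the main processor only waits. */
static uint32_t run_blocking(copro *c, uint32_t in, unsigned latency)
{
    copro_start(c, in, latency);
    while (c->busy)
        copro_tick(c);
    return c->output;
}

/* Streaming (non-blocking) model: the main processor keeps working and
 * collects the result later. Returns the units of main-processor work
 * completed while the coprocessor was running. */
static unsigned run_nonblocking(copro *c, uint32_t in, unsigned latency, uint32_t *out)
{
    unsigned main_work = 0;
    copro_start(c, in, latency);
    while (c->busy) {
        main_work++;      /* stand-in for independent main-processor work */
        copro_tick(c);
    }
    *out = c->output;
    return main_work;
}
```

With a latency of five ticks, `run_nonblocking()` completes five units of main-processor work while the coprocessor runs, whereas `run_blocking()` completes none.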

10. What profiler output does Cascade support?

Cascade supports the text formats produced by the armprof and GNU gprof profilers. Support for other toolchain-specific profilers will be added as Cascade gains support for further processor architectures.
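The FAQ does not say how Cascade uses the profile; one plausible use, assumed here, is ranking functions by their share of run time to pick offload candidates. The sketch below parses one row of a GNU gprof flat profile and applies an arbitrary percentage threshold. The function names are hypothetical and the armprof format is not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one row of a GNU gprof flat profile, e.g.
 *   " 33.34      0.02     0.02     7208     0.00     0.00  open"
 * Returns 1 and fills *pct_time and name for a data row, 0 for headers and
 * other text. The name is taken as the last whitespace-separated token,
 * so demangled C++ names containing spaces are not handled. */
static int parse_flat_row(const char *line, double *pct_time, char *name, size_t name_len)
{
    double cumulative, self;
    if (sscanf(line, "%lf %lf %lf", pct_time, &cumulative, &self) != 3)
        return 0;
    const char *end = line + strlen(line);
    while (end > line && (end[-1] == '\n' || end[-1] == ' ' || end[-1] == '\t'))
        end--;                       /* drop trailing whitespace */
    const char *start = end;
    while (start > line && start[-1] != ' ' && start[-1] != '\t')
        start--;                     /* back up to the start of the last token */
    size_t len = (size_t)(end - start);
    if (len == 0 || len >= name_len)
        return 0;
    memcpy(name, start, len);
    name[len] = '\0';
    return 1;
}

/* Treat a function as an offload candidate if it accounts for at least
 * `threshold` percent of run time (the threshold is an arbitrary choice). */
static int is_candidate(const char *line, double threshold, char *name, size_t name_len)
{
    double pct;
    return parse_flat_row(line, &pct, name, name_len) && pct >= threshold;
}
```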

11. Does Cascade accept hand-optimized assembly code? Does it achieve the same performance improvements as with compiled code?

Cascade can accept well-behaved hand-optimized assembly code, i.e. code that conforms to the standard ABI rules for function parameter passing, etc. Such code achieves the same performance as compiled code with the same instruction-level parallelism. Hand-optimized assembler may actually perform better, because its optimized register allocation may reduce the number of memory accesses.

12. How does Cascade handle complex software situations such as real time interrupts, OS effects (especially Linux), self-modifying code, etc.?

The main processor continues to receive real time interrupts even when the coprocessor is active, so system latency is not impacted. However, interrupt routines must not rely on a memory state that is being updated by the coprocessor. Because the interrupt is an asynchronous event, special attention must be paid to state synchronization even if the code is being run on the main processor.
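One way to honour the rule about interrupt routines is an ownership flag that the main processor sets before starting the coprocessor and clears after collecting its results; the interrupt routine defers any access while the coprocessor owns the buffer. This is a minimal single-core sketch with hypothetical names. A real system would also need memory barriers around the flag updates, since `volatile` only constrains the compiler and does not guarantee the ordering of memory accesses on the bus.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ownership flag for a buffer the coprocessor updates.
 * volatile because it is read from interrupt context. */
enum owner { OWNER_CPU, OWNER_COPRO };

static volatile enum owner buf_owner = OWNER_CPU;
static volatile bool isr_deferred = false;
static uint32_t shared_buf[16];

/* Called by the main processor around an offloaded call. */
static void offload_begin(void)    { buf_owner = OWNER_COPRO; }
static void offload_complete(void) { buf_owner = OWNER_CPU; }

/* Interrupt handler body: reads the buffer only when the coprocessor is not
 * updating it; otherwise records that the work must be done later. */
static bool isr_try_read(uint32_t *out)
{
    if (buf_owner == OWNER_COPRO) {
        isr_deferred = true;   /* handle after offload_complete() */
        return false;
    }
    *out = shared_buf[0];
    return true;
}
```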

Cascade does not handle self-modifying code. Such code is quite rare in modern software systems, with the possible exception of program loader applications. The Cascade coprocessor performs algorithmic acceleration and has no direct interaction with the operating system. Any function calls performed by the coprocessor that result in operating system interaction are passed to the main processor for execution.


Memory Questions

13. Where is the CriticalBlue microcode stored and how is it downloaded onto the coprocessor? Are there ROMs in the architecture?

A Cascade coprocessor contains an instruction memory unit that holds the microcode. Cascade automatically places the microcode in a data section within the overall main processor executable. Thus, Cascade's microcode download mechanism is the same as that for main processor code download.

There are no ROMs in the architecture because this would destroy the flexibility afforded by microcode changes.

14. How does the Cascade shared memory scheme work?

Cascade automatically generates the hardware and software interfaces necessary to communicate with a pre-existing DMA controller, utilizing data input/output parameters provided by the user. DMA transfers are initiated transparently at run time whenever an offloaded function is invoked. The coprocessor controls the flow of data as required. The design requires no special streaming constructs, and thus maintains compatibility with existing embedded software implementations.
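The flow can be sketched in C with a host-side stand-in for the DMA controller. The descriptor layout (`dma_desc`), the coprocessor-local buffer and the scaling "work" are assumptions made for illustration; a real controller defines its own descriptor format, and the generated interface would program it directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical DMA descriptor; a real controller defines its own layout. */
typedef struct {
    const void *src;
    void *dst;
    size_t bytes;
} dma_desc;

/* Host-side stand-in for the pre-existing DMA controller. */
static void dma_run(const dma_desc *d)
{
    memcpy(d->dst, d->src, d->bytes);
}

/* Sketch of one offloaded call driven by user-declared I/O parameters:
 * DMA the input into coprocessor-local memory, compute, DMA the result back. */
static void offloaded_scale(const int32_t *in, int32_t *out, size_t n, int32_t k,
                            int32_t *local, size_t local_capacity)
{
    assert(n <= local_capacity);
    dma_desc to_copro = { in, local, n * sizeof *in };
    dma_run(&to_copro);
    for (size_t i = 0; i < n; i++)      /* the coprocessor's work */
        local[i] *= k;
    dma_desc from_copro = { local, out, n * sizeof *out };
    dma_run(&from_copro);
}
```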

15. How does the Cascade shared memory scheme support virtual memory systems?

The coprocessor uses only virtual addresses. Any coprocessor page misses invoke main processor requests for the requisite page mappings. Where the virtual-to-physical mappings are altered during coprocessor operation, the operating system calls for a map update.
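The miss-and-request behaviour can be sketched as a small coprocessor-side translation cache that asks the main processor for a mapping on a miss and is flushed when the operating system updates mappings. The 4 KiB page size, the four-entry round-robin table, the callback standing in for the main-processor request and the flush-everything policy are all assumptions; the FAQ does not describe the mechanism at this level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12u   /* 4 KiB pages (assumption) */
#define TLB_ENTRIES 4u   /* table size (assumption) */

typedef struct { uint32_t vpn, pfn; bool valid; } tlb_entry;

/* Stands in for the request sent to the main processor on a page miss. */
typedef uint32_t (*host_lookup_fn)(uint32_t vpn);

typedef struct {
    tlb_entry e[TLB_ENTRIES];
    unsigned next;            /* round-robin replacement index */
    unsigned misses;
    host_lookup_fn host_lookup;
} copro_tlb;

/* Hypothetical page table used only to exercise the sketch. */
static uint32_t example_page_table(uint32_t vpn) { return vpn + 0x100u; }

static uint32_t translate(copro_tlb *t, uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1u);
    for (unsigned i = 0; i < TLB_ENTRIES; i++)
        if (t->e[i].valid && t->e[i].vpn == vpn)
            return (t->e[i].pfn << PAGE_SHIFT) | offset;
    /* Page miss: ask the main processor for the mapping and cache it. */
    t->misses++;
    uint32_t pfn = t->host_lookup(vpn);
    t->e[t->next] = (tlb_entry){ vpn, pfn, true };
    t->next = (t->next + 1u) % TLB_ENTRIES;
    return (pfn << PAGE_SHIFT) | offset;
}

/* Called when the OS alters mappings while the coprocessor is running. */
static void tlb_invalidate_all(copro_tlb *t)
{
    for (unsigned i = 0; i < TLB_ENTRIES; i++)
        t->e[i].valid = false;
}
```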

16. How can the cache sizes be modified?

CriticalBlue provides a range of cache sizes and organizations suitable for diverse applications. It also provides an interactive environment to configure cache designs and to obtain rapid performance feedback on the basis of memory access traces captured from the user's actual application.
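To illustrate trace-driven cache sizing, the sketch below counts hits for a direct-mapped cache over a memory access trace. It is not CriticalBlue's environment, which the FAQ describes only in outline, and it models just one organization; set-associative caches would need tags per way and a replacement policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Count hits for a direct-mapped cache with `lines` lines of `line_bytes`
 * bytes each (both powers of two) over an address trace. */
static size_t count_hits(const uint32_t *trace, size_t n, size_t lines, size_t line_bytes)
{
    assert(lines > 0 && (lines & (lines - 1)) == 0);
    assert(line_bytes > 0 && (line_bytes & (line_bytes - 1)) == 0);
    uint32_t *tags = calloc(lines, sizeof *tags);
    bool *valid = calloc(lines, sizeof *valid);
    if (!tags || !valid)
        abort();
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t block = trace[i] / (uint32_t)line_bytes;  /* memory block number */
        size_t index = block & (lines - 1);                /* cache line it maps to */
        uint32_t tag = block / (uint32_t)lines;
        if (valid[index] && tags[index] == tag) {
            hits++;
        } else {                                           /* miss: fill the line */
            valid[index] = true;
            tags[index] = tag;
        }
    }
    free(tags);
    free(valid);
    return hits;
}
```

For a trace that alternates between addresses 0 and 64 with 16-byte lines, a four-line cache maps both addresses to the same line and never hits, while an eight-line cache hits on every access after the first two.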


Hardware Design Questions

17. What is a custom functional unit?

Cascade automatically constructs a programmable coprocessor architecture, optimized to execute the offloaded software, as an array of simple computational elements that operate at the instruction level (ADD, XOR, SHIFT, etc.).

Users can increase performance further by specifying some offloaded functionality to be mapped directly onto a custom functional unit, which is a user-designed hardware implementation that is embedded in the coprocessor. Cascade allows users to explore the performance benefits of deploying custom functional units without having to implement RTL before the architecture is finalized.

To facilitate the easy integration of the custom functional unit into the coprocessor design, Cascade generates RTL Verilog module or VHDL entity declarations that define the block interface to the coprocessor RTL. This approach provides a simple path to integrating new or existing hardware IP into a software-programmable coprocessor.

An example of such a unit can be found in the CriticalBlue Application Note of November 2005.

18. How are the inputs and outputs specified for custom functional units?

Cascade enables the user to specify each of the inputs and outputs of a custom functional unit and their widths. It then generates a template HDL file to enable the user to instantiate the unit's implementation. The unit is then automatically inserted into the coprocessor implementation.
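Before any RTL exists, a bit-accurate C model of a unit can help check the software side against the declared port widths. The unit below (Hamming distance between two 16-bit inputs, with a 5-bit output) and its width constants are hypothetical; the FAQ states only that Cascade takes per-port widths and generates an HDL template, not that it consumes a C model like this one.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-port widths, playing the role of the widths the user
 * declares for each input and output of the custom functional unit. */
#define CFU_IN_A_BITS 16u
#define CFU_IN_B_BITS 16u
#define CFU_OUT_BITS   5u   /* enough for results 0..16 */

static uint32_t mask_bits(uint32_t v, unsigned bits)
{
    return bits >= 32u ? v : (v & ((1u << bits) - 1u));
}

/* Bit-accurate model: inputs are truncated to their declared widths and the
 * result to the declared output width, as fixed-width hardware ports would. */
static uint32_t cfu_hamming(uint32_t a, uint32_t b)
{
    uint32_t x = mask_bits(a, CFU_IN_A_BITS) ^ mask_bits(b, CFU_IN_B_BITS);
    uint32_t count = 0;
    while (x) {
        x &= x - 1u;   /* clear the lowest set bit */
        count++;
    }
    return mask_bits(count, CFU_OUT_BITS);
}
```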


System and Coprocessor Verification Questions

19. How is the coprocessor verified?

Cascade automatically captures and applies main processor stimulus and results to the coprocessor. The coprocessor is verified by (a) a performance simulation using an automatically generated bit- and instruction-accurate C model and (b) RTL simulation using an automatically generated RTL testbench. Users can verify the coprocessor using their existing software verification environment.
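The capture-and-replay idea can be sketched in C: each captured call stores the arguments the main processor passed and the result its software produced, and the vectors are replayed against a model of the coprocessor. The `test_vector` layout, `replay()` and the trivial adder model are hypothetical stand-ins for Cascade's generated C model and testbench.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One captured call: the main processor's arguments and its result. */
typedef struct { int32_t a, b; int32_t expected; } test_vector;

typedef int32_t (*model_fn)(int32_t a, int32_t b);

/* Replay captured vectors against a coprocessor model, print each
 * mismatch and return the number of mismatches. */
static size_t replay(const test_vector *v, size_t n, model_fn model)
{
    size_t mismatches = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t got = model(v[i].a, v[i].b);
        if (got != v[i].expected) {
            printf("vector %zu: expected %ld, got %ld\n",
                   i, (long)v[i].expected, (long)got);
            mismatches++;
        }
    }
    return mismatches;
}

/* Hypothetical coprocessor model under test. */
static int32_t copro_model_add(int32_t a, int32_t b) { return a + b; }
```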

20. Can coprocessor simulations be used in a transaction-level simulation environment?

Cascade simulations are cycle accurate provided they are used in connection with a cycle accurate bus model for external coprocessor communication. These same models may also be connected to a transaction level bus interface.

21. How is 'cycle accurate' defined for a coprocessor model?

The model matches the external behaviour of the RTL implementation at every clock edge. Simulation is cycle accurate only when the model is used with a cycle-accurate memory model.

22. How does the user know that a particular block of code is being executed on a coprocessor?

Cascade does not require the user to modify source code, so there is no formal identification mechanism. However, CriticalBlue recommends that offloaded code be annotated with suitable comments, or that offloaded code segments be segregated within a well defined and separate location in the software project source tree. Cascade automatically detects major changes to offloaded code, and warns when they cannot be mapped to the coprocessor.
