| By |
|
During the last five years, DSP manufacturers have heavily invested in in-house code-generation tools that fully target their DSP's feature set, bringing the compiler to the point where it now can efficiently support contemporary DSP architectures. Some independent compiler vendors also invested in compiler R&D, advancing the technology, and delivering compilers that are competitive with or that outperform the proprietary tools of DSP manufacturers.
The mechanisms enabling this technology to work differ significantly from traditional compiler technology. This article focuses on code-generation techniques for DSPs and examines the architecture and optimizations applied in a DSP compiler as well as the framework to build such a compiler. TASKING's StarCore compiler and VIPER compiler framework will be the reference platform to illustrate the concepts in this article.
The compiler framework
Packaged standard DSP cores with a fixed architecture are not always the most optimal solution in terms of speed, power consumption, size, and economics. Therefore, configurable DSP cores exist, which are customizable in terms of:
- Instruction set and availability of hardware accelerators
- Number and mix of functional units
- Addressing modes and instruction/addressing-mode combinations
- Available number of registers
- Pipeline configuration
- Availability, size, and associativity of data caches
- Modular in such a way that new components can be easily added and existing components can be removed, modified, or replaced without impacting other components.
- Tight, requiring as few lines of target-dependent source code as possible so that the compiler writer can operate at a high level of abstraction. The target-dependent parts should be written almost entirely in languages other than C. The framework should automatically check at compiler-creation time for errors in the specification.
Reconfigurable cores require compilers that address changes and variants in the hardware at runtime. Major architectural changes, such as new instructions, addressing modes, or pipeline architecture, traditionally require a new compiler. Without rebuilding, this framework enables changes such as deletion of opcodes, registers, and addressing modes; changes in number and mix of functional-units; and changes in cache configuration.
Basic compiler architecture
The primary components of a compiler are the front and back ends. A front end reads the source files, performs lexical, syntactic, and semantic analysis, applies optimizations, and generates intermediate code. It depends primarily on the source language and is largely target-independent. The output of the front end is a Medium-level Intermediate Language (MIL), which is a canonical representation of the source code that contains all the information needed for code generation, such as symbol tables and debug information.
A back end includes the compiler sections that depend on the target machine. First, the instruction selector reads MIL input and translates it into a Low-level Intermediate Language (LIL). The LIL objects are defined using a target description language and correspond to a target processor instruction with the opcode, operands, and additional information used within the compiler. The back end then continues with instruction scheduling, register allocation, and optimizations that operate on the LIL. The output from a back end is either assembly or object code.
Figure 3 illustrates how the intermediate language representations MIL and LIL act as buses transferring information between optimization and code-generation phases. A set of local and global optimizers operates on MIL and LIL levels. The optimizers' behavior and the order in which they are invoked are both target- and application-dependent and can be further influenced by data obtained from the linker map file and execution profiles.
Advanced front ends can process multiple source modules, and then present the MIL for the entire application to the back end, thereby widening the scope for applying global optimizations. Typically, at the application-wide level, optimizations include:
- Global overlaying of constants
- Aggregation of global references
- Global inlining
- Inter-procedural alias analysis
- Inter-procedural constant propagation
- Code compression
ISO-C and C++ do not efficiently support DSP architectural features such as complex memory systems, address-generation units, fixed-point arithmetic, and data/instruction-level parallelism. To enable the compiler to employ these features, either C/C++ language extensions and/or a large set of intrinsic functions and pragmas are built into a DSP compiler.
Language extensions provide a superior solution because they increase readability and maintainability of source code, allow the compiler to perform thorough static error checking, enable optimizers to fully exploit the parallelism of the target architecture, and simplify debugging. Through language extensions, the debugger can display fixed-point types in the correct format and apply modulo and bit-reversed address modification when applicable.
Intrinsic functions, however, are essential to telecom applications where the basic operations used in the reference algorithms distributed by standards organizations are required to be available as intrinsic functions in the compiler.
A DSP compiler front end typically supports features to access special characteristics of the underlying hardware such as DSP-C/C++ language extensions, intrinsic functions, and inline assembly code. It provides facilities to easily interface C/C++ to legacy assembly code via user specifiable calling conventions, for example. Furthermore, it applies various kinds of optimizations such as classical optimizations, loop transformations, and processor- or application-specific optimizations.
To efficiently translate control code, a large set of classical optimizations must be applied. These classical optimizations each improve execution speed and/or code size by a small percentage. Since more than twenty types of optimizations are typically applied, the effects on code quality can be significant.
Loop transformations are of particular importance in signal processing since many inner loops contain conditions that prevent the back end from scheduling vector instructions (SIMD) or from executing instructions in parallel (MIMD). The loop optimizers rewrite the loop into a format more suited for parallel execution, typically by applying transformations, such as loop unrolling, reversal, interchange, fusion, fission, splitting, peeling, and switching. To make these transformations possible, data elements should be properly aligned, and supporting transformations such as induction variable rewriting, scalar expansion, and creation of temporary arrays, should be available as well.
The back end
Optimal exploitation of parallelism is not trivial for orthogonal homogeneous RISC architectures, but code generation for DSP is far more complex and requires different techniques. Register allocation is further complicated by the heterogeneous nature of a DSP's register set. In addition, instruction scheduling and software pipelining are not separate phases any more but interact with instruction selection and register allocation.
Figure 4 illustrates the primary components and stages that address a back end designed to meet DSP complexities. Most components are generic and are parameterized by target characteristics taken from a target description file.
The instruction selector reads the MIL and symbol information created by the front end and emits the corresponding LIL. In the meanwhile, the pre-allocator assigns a home for each automatic variable, reads and modifies symbols, and creates virtual registers. The peepholer uses a LIL editor to traverse the LIL stream, making improvements as directed by a LIL editor pattern file. A generic LIL optimizer performs dead-code elimination, hoisting, and value tracking. Subsequently, the instruction scheduler, which is a group of code generator phases that change the LIL stream to maximize efficiency of the application for a given target, reorders the instructions. The register allocator chooses a physical register to use for each virtual register. The code compactor reduces code size by scanning for identical sequences of instructions and replaces all occurrences with a call to one single copy of the sequence. Finally, the formatter reads the LIL operations to generate assembly language output.
|
|
| Figure 5: Instruction scheduler components |
An instruction scheduler (see Figure 5) is a complex multiphase component that interacts with other parts of the back end, such as the register allocator. When applicable for a given target, the if converter replaces if-then-else constructs by predicated instructions. Subsequently, when allowed, the predicated instruction promoter removes dependencies between LIL operations by removing the predicate from LIL operations. The partitioner separates targets with multiple register groupings, each associated with one or more functional units. Next, the software pipeliner optimizes inner loop schedules, keeps track of register pressure, and ensures it never exceeds the available resources. The list scheduler runs after software pipelining and again after register allocation to schedule any spill code. Finally, the delay slot filler addresses the various delay slot approaches of a processor.
Advanced DSP architecture
Advancements in process technology and electronic design automation enable the development of very complex customizable DSP architectures. Most DSPs are used in price sensitive systems. Therefore, cost constraints inhibit the implementation of super-scalar hardware facilities that simplify the code generation process. Furthermore, the size of DSP application software increases. These factors drive the requirement for better C/C++ compiler support.
A new generation of DSP compilers now generates efficient instruction schedules for control code and inner loops, enabling developers to maximize the DSP's potential without losing their productivity by programming the DSPs solely in assembler.


Jump to main articles index
