Doug makes the case for applying clock-less design techniques to high-performance DSPs. This approach balances power efficiency and flexibility.
For years, Digital Signal Processor (DSP) designers have tackled the task of providing high-performance chips, in a small footprint, with maximum flexibility and software programmability.
Recently, the pace of performance improvement has slowed just as new, more complex applications have emerged. To bridge the gap, high-performance multicore DSPs are increasingly being used in telecommunications access, edge, and infrastructure equipment to process voice, video, and radio signals. Even with the advent of multicore DSPs, power dissipation challenges limit their ability to deliver a complete solution from a density, cost, and power perspective.
To meet product requirements, telecommunications equipment manufacturers have necessarily turned to combinations of DSPs and Application Specific Integrated Circuits (ASICs). This trend conflicts with the need for upgradable solutions for access and infrastructure equipment that must last many years in network deployments. ASICs are not as flexible or field-programmable as DSPs, but they can be significantly more cost- and power-efficient. For DSPs to achieve the required performance and efficiency, the issue of power dissipation must be overcome.
The power crisis
Previously, clock frequencies could simply be increased to boost computing performance between technology nodes, resulting in increased silicon efficiency (lower cost) because the circuit shrank and it ran faster. Beginning with the 90nm node, excessive power consumption began to work against this strategy. Each clock frequency increase resulted in a power consumption increase. This created an unsustainable race between device performance and heat dissipation.
Multicore DSPs
One of the strategies to solve this issue has been the use of several slower processors within a single device and an architectural shift to multicore DSPs. Reducing the clock frequency makes each processor more power efficient, pushing back the power barrier.
Multicore DSPs address power dissipation by reducing the frequency of a processor (say, by half) and doubling the number of processors on a device. This lowers the power consumption but also causes a steep increase in silicon area, because the new solution requires two processors, which together can be 1.4 to 1.8 times the area of the original. Nonetheless, the move to a multicore architecture has the merit of reducing power per cycle, thus packing more computing power into a single device. The process shrink reduces the cost, albeit at a slower rate.
Power density dissipation
Given a fixed power dissipation environment (package and system), if the heat generated on a silicon die increases, the temperature of the silicon will increase proportionally. To achieve maximum computing performance, high-performance processors already operate very close to the maximum junction temperature tolerable for device reliability. It is thus not possible to increase the power generated by a single die without a revolution in packaging technology.
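The proportional relationship described above is commonly modeled with a junction-to-ambient thermal resistance. The sketch below assumes that simple steady-state model; the function name, the ambient temperature, the thermal resistance, and the power figures are all hypothetical values chosen for illustration and do not come from any specific device.

```python
def junction_temperature_c(ambient_c, power_w, theta_ja_c_per_w):
    """Steady-state junction temperature for a fixed package and system.

    Assumes the simple model T_j = T_ambient + P * theta_JA, where theta_JA
    (degrees C per watt) is fixed by the package and system cooling.
    """
    return ambient_c + power_w * theta_ja_c_per_w


# Hypothetical numbers, for illustration only.
AMBIENT_C = 50.0   # air temperature inside the equipment
THETA_JA = 10.0    # degrees C per watt for this package and airflow

base_tj = junction_temperature_c(AMBIENT_C, 5.0, THETA_JA)
hotter_tj = junction_temperature_c(AMBIENT_C, 9.0, THETA_JA)

# The rise above ambient scales in proportion to the power dissipated.
rise_ratio = (hotter_tj - AMBIENT_C) / (base_tj - AMBIENT_C)
print(base_tj, hotter_tj, rise_ratio)
```

With the thermal resistance fixed, the only ways to hold the junction temperature are to dissipate less power or to change the package, which is the constraint the paragraph above describes.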
Silicon process shrink and power density limitations
Power consumed depends generally on three key factors: frequency, capacitance, and core voltage. The power consumed varies proportionally with frequency, proportionally with capacitance, and proportionally with the square of the core voltage.
Prior to 90nm, the core voltage dropped considerably with each technology node and significantly helped to reduce power density. However, for technology nodes below 90nm, the core voltage is only being reduced by around 5 percent per node. The average combined reduction of capacitance for wires and transistors of a typical circuit is approximately 25 percent. With the shorter wires and lower capacitance, the frequency of a typical circuit increases by a factor of about 1.33 at the 5 percent lower core voltage.
Given that the silicon surface area of the circuit is reduced by half, power density, defined as the power dissipated per unit of silicon area, goes up by a factor of about 1.8 when a technology shrink is performed below 90nm. Figure 1 shows normalized power density for a given design at various technology nodes.
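The scaling figures quoted above can be checked with a short calculation. This is a minimal sketch that assumes power scales as frequency times capacitance times the square of core voltage, as stated earlier; the function and parameter names are illustrative only.

```python
def scaled_power_density(cap_scale, freq_scale, volt_scale, area_scale):
    """Relative power density after a process shrink.

    Each argument is the ratio of the new value to the old one.
    Power is taken as proportional to C * f * V^2, and power density
    is power divided by silicon area.
    """
    power_scale = cap_scale * freq_scale * volt_scale ** 2
    return power_scale / area_scale


# Per-node figures from the text for shrinks below 90nm.
density = scaled_power_density(
    cap_scale=0.75,   # capacitance down about 25 percent
    freq_scale=1.33,  # frequency up by a factor of about 1.33
    volt_scale=0.95,  # core voltage down about 5 percent
    area_scale=0.5,   # silicon area halved
)
print(f"relative power density: {density:.2f}")
```

Total power actually falls slightly (to roughly 0.9 of the original), but because it is spread over half the area, the density comes out at roughly 1.8 times the starting value.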
Figure 1: Between a rock and a hot place: Power density cannot rise without problems maintaining an internal junction temperature below a tolerable level.
When junction temperatures are already at maximum, the heat dissipation of the package and silicon would need to nearly double, which is not happening. To keep the internal junction temperature below a tolerable value, the power density cannot increase. To achieve this, the clock frequency must be turned back by approximately 25 percent, and performance proportionally with it. A traditional process migration can thus no longer boost performance while also halving the die size and cost.
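One way to read the 25 percent figure is to solve the same power relation for the clock frequency that keeps power density flat after a shrink. The sketch below assumes the per-node numbers given earlier (capacitance down 25 percent, core voltage down 5 percent, area halved) and measures the rollback against the pre-shrink clock; the names are illustrative.

```python
CAP_SCALE = 0.75   # capacitance ratio, new node versus old
VOLT_SCALE = 0.95  # core voltage ratio, new node versus old
AREA_SCALE = 0.5   # silicon area ratio, new node versus old


def max_freq_scale(cap_scale, volt_scale, area_scale, density_limit=1.0):
    """Largest frequency ratio that keeps relative power density at the limit.

    Solves (C * f * V^2) / area = density_limit for f, with every
    quantity expressed as a ratio to the previous node.
    """
    return density_limit * area_scale / (cap_scale * volt_scale ** 2)


freq_scale = max_freq_scale(CAP_SCALE, VOLT_SCALE, AREA_SCALE)
rollback_percent = (1.0 - freq_scale) * 100.0
print(f"frequency scale: {freq_scale:.2f}, rollback: {rollback_percent:.0f}%")
```

Under these assumptions the shrunk design would have to run at about 0.74 of the original clock, which is in line with the roughly 25 percent reduction stated above.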
A solution is required that can significantly reduce the power required to achieve a given performance, allowing more performance to be contained in a given amount of silicon.
Eliminate the clock
Synchronous high-speed DSP designs require a massive clock tree to keep sequential cells synchronized. These clock trees, with their high-power buffers and high-capacitance wires, change state twice per clock cycle, consuming power on every edge. The clock tree performs no information processing and thus provides no useful computing work, yet it consumes a significant portion of the total power. Eliminating it alone can reduce power consumption by as much as 40 percent in a high-performance processor. Eliminating the clock trees also eliminates high-speed buffers, but the related silicon area saving is marginal.
Modern high-performance DSP architectures also require the use of a very large number of inter-stage flip-flops and state elements to operate at high clock frequency. How much do these inter-stage flip-flops and state elements contribute to the actual data processing and computing tasks performed by the processor? Absolutely nothing.
In addition, these inter-stage flip-flops have their own set-up and hold times that consume a portion of the precious time between clock edges in high-frequency synchronous designs. The usable clock period is further shortened by an ever-increasing amount of path timing uncertainty. Thus, the inter-stage circuit logic needs to be designed to operate even faster, much faster than a single clock period. This requires the use of over-sized, power-hungry circuits, further increasing total power.
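The squeeze on inter-stage logic can be illustrated with a simple timing budget. The sketch below assumes the usual set-up path accounting, in which flip-flop clock-to-output delay, set-up time, and timing uncertainty are subtracted from the clock period; every number is hypothetical and chosen only to show the arithmetic.

```python
def logic_time_budget_ps(clock_period_ps, clk_to_q_ps, setup_ps, uncertainty_ps):
    """Time left for combinational logic between two pipeline flip-flops.

    Assumes a single-cycle set-up path: the launching flip-flop's
    clock-to-output delay, the capturing flip-flop's set-up time, and
    the timing uncertainty all come out of the clock period.
    """
    return clock_period_ps - clk_to_q_ps - setup_ps - uncertainty_ps


# Hypothetical figures for a 1 GHz design (1000 ps period).
PERIOD_PS = 1000.0
budget_ps = logic_time_budget_ps(
    clock_period_ps=PERIOD_PS,
    clk_to_q_ps=80.0,
    setup_ps=60.0,
    uncertainty_ps=110.0,
)
usable_fraction = budget_ps / PERIOD_PS
print(f"logic budget: {budget_ps:.0f} ps ({usable_fraction:.0%} of the period)")
```

In this illustration a quarter of each cycle is spent on overhead, so the logic must be sized to finish in the remaining time rather than in the full clock period.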
Clock-less processor design
Asynchronous or clock-less designs have gotten a bad rep in the semiconductor industry. This reputation has its roots in academia. One of the first topics discussed in basic digital design courses is that asynchronous circuitry is an oddity whose behavior and performance cannot be readily characterized because of race conditions and metastability. Accordingly, the advice to students is to avoid such circuits and concentrate on those that can be readily analyzed and characterized: synchronous clocked circuits. Usually this is the last time an electronic engineer hears about asynchronous designs.
Yet a few asynchronous chips have been released commercially and do work. For instance, ARM has released a very low-power asynchronous processor design, and a company based in Southern California sells a quite successful line of Ethernet interconnect chips based on its internally developed asynchronous design methodology. Despite this, the industry has characterized clock-less designs as intractable and unreliable. Developers have avoided using such designs to implement high-performance DSPs or other large, complex circuits. However, a fresh look at the issues of asynchronous design has proven those doubts to be unfounded given the right approach.
By approaching the challenges with an open mind and appropriate support, processor designers have been able to define a suitable, rigorous, and workable asynchronous design methodology to deal with the related issues. Investing the time and effort to develop the complementary design tools to support this methodology has led to an efficient, reliable, and practical design environment for asynchronous development. In this environment, complex clock-less circuits can be implemented efficiently and predictably, simulated and verified functionally, and timed as thoroughly as synchronous designs are.
Some of the more significant advantages of clock-less designs relate to power consumption and silicon area efficiency. Figure 2 illustrates graphically at a high level the typical combined effects of moving from a synchronous to a clock-less implementation methodology in processor design.
Figure 2: Tick-tock for the flip-flop: Without the clock tree and inter-stage flip-flops, power as well as space savings can be achieved.
A clock-less design methodology eliminates both the clock, with its associated clock tree, and the need for an inter-stage pipeline state. Discarding the clock tree and all those inter-stage flip-flops and state elements saves the silicon space they occupy and the large amount of power they consume.
In a clock-less design, circuits do not have to deal with meeting the rigid inter-stage timing and therefore can be built using slower, smaller, and less power-hungry circuits, while still delivering the same level of overall performance. This further reduces power consumption and die area.
The silicon area savings discussed earlier translate into even more power savings, because wires connecting circuit elements get shorter as the circuits themselves shrink. Shorter wires bear less capacitance, so switching them requires smaller, less power-hungry drivers.
So, clock-less designs eliminate clock trees and sequential elements that contribute nothing to processing and computing tasks. They also loosen timing constraints on the circuits that do useful processing work, lowering the area and power consumption of these circuits. This means significantly less silicon is required to perform an equivalent function and the power density goes down, thereby substantially reducing die area and thus real product costs.
Harnessing the power-performance beast
As applications become increasingly diverse, and tools become increasingly complex, designers of telecommunications access and infrastructure equipment are scratching their heads about how best to build a high-performance product, at the right price point, with a service life that makes sense. Applying clock-less design techniques to high-performance DSPs can solve the power density issue and rebalance the power-performance trade-off, allowing them to continue delivering the density, cost, and power efficiency required while maintaining flexibility. Revisiting assumptions about asynchronous design has allowed the development of the necessary tools and techniques to make clock-less DSPs a practical solution to the “power crisis.”
Doug Morrissey is Vice President and Chief Technology Officer at Octasic and has over 10 years of experience in the definition and marketing of semiconductor devices. Since joining Octasic in 1999, Morrissey has focused on the technical evolution of future Octasic products within the Voice over Packet market. Prior to joining Octasic, Morrissey worked as Marketing Manager for ATM and DSL products for Agere (formerly Lucent Technologies, Microelectronics Group). Before that, he was Senior Systems Architect at Unisys Corporation.
Octasic
dmorrissey@octasic.com
www.octasic.com



