
Realistic DSP benchmarking takes a systems view

By
Eric Verhulst
Eonic BV
and
Peter Beukelman
Eonic BV

DSP-based system developers are often frustrated when their systems’ final performance doesn’t come close to benchmark-based expectations. Often, they estimate system-level performance on the basis of 1024-point Fast Fourier Transform (FFT) DSP benchmarks, but the system performs at only a fraction of that level. This is particularly true in high-data-rate applications such as radar, sonar, or image processing. In this article, Eric and Peter discuss various DSP architectures at both the chip and system level to illustrate how these discrepancies arise. They also describe the leading floating-point devices and compare their performance in a realistic setting.

Don’t neglect system effects
When deciding which DSP best fits an application, engineers rightly look to the devices’ ability to handle an FFT as a first benchmark. In today’s sophisticated applications, it’s safe to say that from one-third to one-half of all processor cycles are spent on this single algorithm and its derivatives, such as convolutions or correlations. However, many engineers mistakenly focus on peak gigaflops or consider FFT benchmarks optimized for specific circumstances. In doing so, they neglect system-level considerations that are far more important. It’s easy to generate impressive results if you calculate the times for a 1024-point FFT running from on-chip cache alone. From a systems standpoint, though, even the best-known processors drop to efficiencies as low as 5 percent of their highly touted benchmarks.

While traditional DSPs perform best with all data in local cache or internal memory, even in this circumstance their ability to run the FFT algorithm varies widely. Consider that each of the log2(N) = 10 stages of a 1024-point radix-2 FFT involves N/2 = 512 butterfly operations, for a total of (N/2)log2(N) = 5120 butterflies. In the case of a complex FFT, each butterfly requires 10 floating-point operations (six additions and four multiplications). Almost all DSP chips are built around a Von Neumann architecture. That is, they operate on data with a sequence of operations. An ideal processor would perform all butterfly operations in one cycle, but a traditional DSP runs through them sequentially in a series of cycles. Some chips incorporate parallel-processing units, perhaps as many as four or eight, but even when they pipeline instructions they still fall short of the theoretical maximum.
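The operation count can be checked with a few lines of Python. This is only a sketch under the usual radix-2 assumptions: N/2 butterflies per stage, log2(N) stages, and 10 floating-point operations per complex butterfly. The function name `fft_work` is introduced here for illustration.

```python
def fft_work(n):
    """Return (stages, butterflies, flops) for a radix-2 complex FFT of length n."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    stages = n.bit_length() - 1        # log2(n) for a power of two
    butterflies = (n // 2) * stages    # n/2 butterflies in every stage
    flops = 10 * butterflies           # 4 multiplications + 6 additions each
    return stages, butterflies, flops


stages, butterflies, flops = fft_work(1024)
print(stages, butterflies, flops)      # 10 5120 51200
```

A processor that retires one floating-point operation per cycle therefore needs at least 51,200 cycles for a 1024-point complex FFT, before any memory access is counted.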

Programmable data-flow architecture offers an alternative to traditional DSP methods. Instead of running operations sequentially in a series of cycles, the module performs its computations when all operands are available. Programmable data-flow architecture modules are typically designed for a specific task, such as an FFT. Developers can implement this approach in FPGAs by having the devices perform many parallel multiplies. One problem with FPGAs concerns precision. The dynamic range of the FFT’s computations increasingly demands floating-point precision. Developers find FPGA programming straightforward for fixed-point operations, but achieving floating-point equivalent precision requires tracking overflows and related occurrences. Therefore, writing a reliable algorithm can be complex. In addition, FPGAs tend to be somewhat expensive, and their heat dissipation can be relatively high, especially at high rates.

Going a step further, some manufacturers have created data-flow devices optimized for FFTs. For example, the PowerFFT performs multiple butterfly operations, the FFT’s core operation, in parallel and consumes only 1W to 2W maximum power.

Getting data to the device
Which architectures better suit FFTs becomes a secondary issue when other factors present a data input or output bottleneck. How a device handles memory and data I/O can dominate overall performance. Storing all data and instructions on a chip is ideal, but it’s too expensive to put that amount of cache on a chip. Although today’s caches can reach the megabit range, they nonetheless can’t handle the large square matrices common in image processing and related tasks.

Predicting real-time behavior also poses a cache-related problem. If a system were dedicated solely to a signal-processing algorithm, the instructions once loaded would stay in cache. In most embedded systems, when an asynchronous event requires processing, the CPU flushes the cache to load the instructions for the new task. Before returning to the signal-processing routines, the CPU must reload those instructions into the cache.

A cache fetches its contents from main memory, and this interface introduces bandwidth and addressing problems that can slow down signal processing. In a typical application, a CPU must acquire data from sensors into RAM (often best done with DMA), transfer that data into its cache, run the algorithm on the data, transfer results into system RAM, and then export the results to other system components (again, best done with DMA).

However, when the input data rate from the sensors matches the memory bandwidth, the DMA input channel might monopolize access to the memory. This prevents the CPU from gaining access to the data and, thus, blocks the CPU from running at full speed. On the other hand, the need to interact with main memory can keep the DSP’s bus so occupied that there’s often little bandwidth remaining for other I/O operations, thereby causing the system’s peak performance to suffer. In theory, the external memory bandwidth should be a multiple of the internal bandwidth. Often, however, today’s fast-clocked DSPs have memory bandwidth that’s just a fraction of internal bandwidth.
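A rough way to see this contention is to budget the external memory bandwidth. The sketch below is a deliberately simple model, not a description of any particular device: it assumes one shared memory bus, ignores arbitration and refresh overhead, and uses hypothetical rates (in Mbytes/sec) chosen only for illustration.

```python
def cpu_memory_share(mem_bw, dma_in, dma_out=0.0):
    """Fraction of external memory bandwidth left for the CPU (0.0 to 1.0).

    All three arguments use the same unit, for example Mbytes/sec.
    """
    if mem_bw <= 0:
        raise ValueError("memory bandwidth must be positive")
    used = dma_in + dma_out
    return max(0.0, 1.0 - used / mem_bw)


# Hypothetical numbers, not taken from any of the boards discussed here.
print(cpu_memory_share(800.0, 200.0, 200.0))   # 0.5: half the bus is left
print(cpu_memory_share(800.0, 800.0))          # 0.0: DMA input monopolizes memory
```

When the sensor input alone consumes the full memory bandwidth, nothing is left for the CPU to fetch data or write results, which is the situation described above.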

While some DSPs include on-board DMA controllers, it’s not necessarily easy to program them. For instance, the TigerSHARC and Hammerhead feature on-chip DMA controllers, plus several high-speed serial Link Ports intended to ease the I/O bottleneck. However, configuring multiple DMA controllers to run concurrently and keep the CPU supplied with the correct data creates a programming challenge, especially because high-level languages were created for sequential programming.

The myth of Random Access Memory
To achieve peak performance, a DSP must have the fastest possible access to data in external memory. In real-world applications, such access demands can raise problems. Despite its name, Random Access Memory (RAM) doesn’t actually give random access at its clock frequency (an exception is static RAM, but those devices entail higher cost and smaller capacities). All dynamic RAMs are built up in rows, columns, and banks, and a row’s contents become accessible only after opening the row and waiting for a few cycles. Row addressing boosts signal processing efficiency because consecutive data points coincide with consecutive memory addresses. If the processor encounters a cache miss, it fetches a new row, which conveniently contains the next group of samples.

While this setup is convenient for 1D signal-processing algorithms, many DSP applications, such as image processing, SAR processing, or correlations, involve two dimensions. After processing the data from rows in a data matrix, the DSP must do the same in the other dimension, although the data is no longer sequential in memory. Processing a series of values requires the CPU to perform stride (or indexed) addressing, with each required data point being a certain number of addresses away from the previous one. Even in the best case, a row of values in cache might have a required data point, but the system might need to bring into cache a new row for each data point to find the next sample. Even DSPs with built-in support for indexed addressing can’t overcome mismatched stride length and DRAM configuration inefficiencies. 
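The penalty of stride addressing can be illustrated by counting how often a DRAM row must be opened. The following sketch assumes a very simple model: a single bank with one open row at a time, a matrix stored row-major, and a hypothetical DRAM row size given in data words. Real SDRAM has several banks and its own timing parameters, so the counts only show the trend.

```python
def row_activations(rows, cols, words_per_dram_row, by_column=False):
    """Count DRAM row openings while walking a rows x cols row-major matrix."""
    open_row = None
    activations = 0
    outer, inner = (cols, rows) if by_column else (rows, cols)
    for i in range(outer):
        for j in range(inner):
            # Row-wise walk: (i, j). Column-wise walk: (j, i), a stride of cols words.
            r, c = (j, i) if by_column else (i, j)
            dram_row = (r * cols + c) // words_per_dram_row
            if dram_row != open_row:   # access leaves the currently open row
                activations += 1
                open_row = dram_row
    return activations


# A 256 x 256 matrix with an assumed DRAM row of 256 words.
print(row_activations(256, 256, 256))                  # 256
print(row_activations(256, 256, 256, by_column=True))  # 65536
```

In this model the row-wise pass opens one DRAM row per matrix row, while the column-wise pass opens a new DRAM row for every single sample.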

One way to get around inefficient stride addressing is to transpose the array, which involves a corner-turn operation. In other words, you perform the first dimension of processing on data stored in the DRAM’s fast dimension, then reorganize the data with a corner-turn so the data’s second dimension falls in the DRAM’s fast dimension. However, as corner turning the dataset also requires stride addressing, not much is gained.
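The row, corner-turn, row sequence can be written down compactly. The sketch below uses a naive DFT as a stand-in for an optimized FFT routine, so it is only suitable for small matrices; the function names are introduced here for illustration. Note that `corner_turn` itself reads its input column by column, which is exactly the stride access the transpose was meant to avoid.

```python
import cmath


def dft(x):
    """Naive 1D DFT, used here only as a stand-in for a real FFT routine."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]


def corner_turn(a):
    """Transpose a matrix held as a list of rows (reads a column by column)."""
    return [list(col) for col in zip(*a)]


def fft2_row_column(a):
    """2D transform: row pass, corner turn, row pass, corner turn back."""
    pass1 = [dft(row) for row in a]         # first dimension, fast direction
    turned = corner_turn(pass1)             # second dimension becomes rows
    pass2 = [dft(row) for row in turned]    # second dimension, fast direction
    return corner_turn(pass2)               # restore the original orientation


result = fft2_row_column([[1, 2], [3, 4]])
print([[round(v.real) for v in row] for row in result])   # [[10, -2], [-4, 0]]
```

The final corner turn only restores the original orientation; an application that can consume transposed output may skip it.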

A novel architecture
Although traditional DSP devices have failed to live up to their reputations from a systems perspective, specialized chips with architectures tuned specifically for FFT processing have emerged. One example is the PowerFFT from Eonic, a 600-pin chip fabricated in 0.18-micron CMOS technology. The device, as shown in Figure 1, supplies sufficient on-chip resources, including a 128-MHz core that performs four simultaneous radix-2 butterfly operations, to accomplish a 1024-point complex FFT internally. An input bus that accepts 64-bit complex data runs independently of a 64-bit output bus so the chip can sustain continuous throughput rates of 100 Msamples/sec. With these resources, the device computes a 1024-point complex FFT in about 10 µsecs.

Figure 1:

To keep the core running at full speed when working with large 1D FFTs or square matrices, the PowerFFT supplies four external memory ports and optional addressing logic. With these features, the device can do multiple simultaneous data transfers, such as:

  • Use one bank to accept external sensor data.
  • Send processing results from a second bank to the output port.
  • Instruct the core to read new data from a third bank while simultaneously writing results to a fourth memory bank.

To enable such varied operations, the chip employs a programmable switched fabric to connect the external memory ports and the core I/O buses. This DSP differs from devices such as FPGAs because the PowerFFT supports floating-point variables and lets users change the algorithm on the fly. In just a few clock cycles, users can reprogram the device to switch to an inverse FFT, correlation, convolution, or other FFT-derived algorithm. For this feature, Eonic created a dedicated high-level software development environment. Similar to a traditional compiler, it allows users to generate low-level code from a high-level perspective.

Eonic has also developed a patented memory controller that eliminates SDRAM addressing latency. With the PowerFFT, engineers can always leave data in place. Therefore, for 2D operations, a corner-turn memory operation is essentially free. This controller uses a combination of techniques to achieve a high throughput in both horizontal and vertical addressing modes of SDRAM. With a 1024-point FFT, the SDRAM behaves like SRAM, so whatever the data size, the core runs at its maximum throughput level.

Comparison to traditional processors
We compared the PowerFFT to floating-point processors from Analog Devices and Motorola, and selected ADI’s TS101 TigerSHARC and ADSP-21160 Hammerhead processors. Many designers don’t think of the Motorola G4 as a mainstream DSP. However, it has a superscalar core with a double-precision floating-point unit and is often employed to run signal-processing algorithms. Thus, we included the G4/MPC7410 in our evaluations.

To create realistic benchmarks, we ran complex FFTs of different lengths in floating-point format. To make certain that the results were not influenced by specific backplane bandwidth limitations, such as those of the PCI bus, the measured processing time includes I/O times to the processor, but excludes the time to move data into or out of memory.

In another effort to create realistic benchmarks, we ran the tests in conditions similar to what most engineers experience on the bench. Specifically, we created the test programs from publicly available code, most often selected as the best algorithms an engineer can obtain from the semiconductor manufacturers. While some of the vendor-supplied code is written in assembler, we did not specifically handcraft assembly code and used high-level languages when possible. Most engineers prefer not to work in assembler, as it’s a tedious and error-prone job that requires testing the algorithms’ numerical accuracy. In addition, we compiled the necessary C routines with all compiler optimization switches turned on.

We ran the tests on DSP boards commercially available from Eonic or the device vendor. Because we’ve stressed in this article the importance of how to use on-chip cache and how much external memory is available, it’s important to review the resources each board provides:

  • The PowerFFT, clocked at 100 MHz, is on an Eonic PCI-64 board, which supplies four 64-Mbyte banks of SDRAM.
  • The Motorola G4 (MPC7410), clocked at 450 MHz, is resident on an Eonic Atlas3 3U G4 board. The G4 includes the AltiVec unit that can execute special floating-point instructions in parallel with other instructions. The G4 itself has 2 Mbytes of L2 cache (which is turned on for our tests), but the user has little control over its use. The board, however, supplies 128 Mbytes of CAS3 SDRAM under Tundra chipset control.
  • The TS101 TigerSHARC, clocked at 250 MHz, is resident on an evaluation board from ADI. For internal memory, the chip has three banks, each 64k x 32 bits. The board comes with 32 Mbytes of CAS2 SDRAM clocked at 83.3 MHz.
  • The ADSP-21160 Hammerhead, clocked at 72 MHz, is resident on an Eonic Atlas2-HS V1.1 board. The chip provides a 32 x 48-bit cache for data and instructions, as well as 4 Mbits of dual-port SRAM, but the programmer must make optimal use of that SRAM manually. The board supplies 1 Mbyte of Zero Wait State SRAM clocked at 36 MHz.

To show a processor’s efficiency, it’s best to relate absolute timings to the device’s clock frequency and the maximum theoretical performance. Therefore, for some tests we normalized the benchmarks to what was the highest speed floating-point CPU for this study, the G4. At its full theoretical efficiency, the G4 would use all four of its processing units at each cycle, leading to 4 x 450 MFLOPS = 1.8 GFLOPS. Thus we set that level equal to unity on the graphs and measured the others relative to this value. In other cases, an engineer is less interested in relative efficiency and simply wants to know if the device can do a certain job in a certain amount of time. Thus, we list some of the results in absolute time.
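The normalization can be sketched as follows. The 1.8 GFLOPS reference comes from the text; the flop count per transform is an assumption, using the common convention of 10 operations per radix-2 butterfly (5 N log2 N in total) and leaving out the windowing multiply. The timing in the example is hypothetical and is not one of the measured results.

```python
G4_PEAK_FLOPS = 4 * 450e6   # four units at 450 MHz = 1.8 GFLOPS, set to unity


def fft_flops(n):
    """Assumed flop count for an n-point complex FFT, n a power of two: 5 n log2(n)."""
    return 5 * n * (n.bit_length() - 1)


def normalized_score(n, seconds, reference_peak=G4_PEAK_FLOPS):
    """Sustained FFT flop rate relative to the reference peak (1.0 = unity)."""
    return fft_flops(n) / seconds / reference_peak


# Hypothetical timing: a 1024-point FFT that takes 100 microseconds.
score = normalized_score(1024, 100e-6)
print(round(score, 2))   # 0.28
```

A score above 1.0 simply means the device sustains more than the G4’s theoretical 1.8 GFLOPS under this counting convention.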

Most applications combine the FFT operation, or one of its derived algorithms such as correlation or convolution, with some filtering or windowing operation in sustained operations. Thus, the benchmark routines include a windowing function.
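As an illustration of what such a windowing step costs, the sketch below multiplies each sample by a precomputed coefficient before the transform. The article does not state which window the benchmarks used; the Hann window here is an assumption chosen only as a familiar example.

```python
import math


def hann(n):
    """Symmetric Hann window of length n (n >= 2); an assumed window type."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def apply_window(samples, window):
    """Multiply each (possibly complex) sample by its real window coefficient."""
    if len(samples) != len(window):
        raise ValueError("samples and window must have the same length")
    return [s * w for s, w in zip(samples, window)]


window = hann(5)
windowed = apply_window([1 + 1j] * 5, window)
print([round(w, 3) for w in window])   # [0.0, 0.5, 1.0, 0.5, 0.0]
```

For complex input, each sample costs two real multiplications, so a processor without a dedicated multiplier in the data path pays for an extra pass over the data before the FFT starts.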

Analyzing the results
Consider first the results for a 1D complex FFT with a windowing function applied. This test clearly indicates the importance of the system-level architecture and memory interfacing. The efficiency scores for the general-purpose CPUs all hover within the same general range, which is far below that possible with the PowerFFT. Besides its memory interface and specialized core, the PowerFFT has a programmable built-in multiplier unit in the data pipeline to provide windowing with no overhead. In contrast, the impact of adding the windowing function to general-purpose DSPs is significant. 

Figure 2 shows that even using code from Motorola, the G4 can’t achieve 100 percent efficiency and has a relative score consistently at 0.4 or below, dropping to 0.04 for a 1 Msample FFT. Running the test on subsequent iterations of a loop with already-cached instructions also affects the score.

Figure 2:

Although the TS101 runs at approximately half the clock speed of the G4, the TS101 outperforms the G4 as long as the ADI board can run the benchmark from internal memory. As noted earlier, most practical applications require external memory. To see the best results under all conditions, we ran one TS101 test with internal memory only and another that uses external memory as well. Figure 2 shows the considerable difference in efficiency between the two plots. Memory use also plays a role in overall performance. The programmer makes tradeoffs when placing code, data, and coefficients in various segments. This is one reason why we use code from the chip vendors for the tests. The two TS101 curves stop at 8K and 32K samples due to the amount of memory on the test boards. It’s safe to assume that, with more memory, their performance curves would at best stay at their current levels and might well drop even lower.

Finally, note that for the 1 Msample FFT, the PowerFFT has a rating of 2.68 and beats the G4 by a factor of 67. Be aware that the times for the PowerFFT assume sustained processing and don’t take initializations into account. The drop in performance after 1024 samples arises because the device is optimized for processing a 1K complex FFT and must perform larger transforms in multiple passes. Even so, it far outperforms the other devices.

To compare the devices on a 2D FFT, we used a square matrix. Figure 3 depicts the results in absolute time instead of a normalized score (note that the available onboard memory did not allow the full range of tests). In many cases, engineers need absolute times, and normalized efficiency scores for a 2D FFT would not differ dramatically from those for a 1D FFT. For a 32-sample 2D FFT, the times range from 21 µsecs for the PowerFFT to 1.257 msecs for the Hammerhead. The differences, however, rise tremendously for a 1k x 1k FFT. The PowerFFT needs 21.8 msecs, whereas the TS101 needs 466 msecs and the G4 takes 768 msecs.
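The PowerFFT figures can be compared with a simple streaming bound. The sketch below assumes that a 2D FFT needs two full passes over the matrix (one per dimension), that the device sustains the 100 Msamples/sec quoted earlier, and that the "32-sample" case refers to a 32 x 32 matrix. These are assumptions for a back-of-the-envelope check, not a description of how the measurements were made.

```python
def streaming_time(samples, passes, rate=100e6):
    """Lower bound in seconds for streaming `samples` through the core `passes` times."""
    return samples * passes / rate


t_small = streaming_time(32 * 32, 2)        # about 20.5 microseconds
t_large = streaming_time(1024 * 1024, 2)    # about 21.0 milliseconds
print(round(t_small * 1e6, 1), round(t_large * 1e3, 1))   # 20.5 21.0

# Ratios of the reported 1k x 1k times (msecs) to the PowerFFT's 21.8 msecs.
print(round(466 / 21.8, 1), round(768 / 21.8, 1))          # 21.4 35.2
```

Under these assumptions the reported 21 µsecs and 21.8 msecs sit close to the streaming bound, which suggests that the PowerFFT stays near its sustained throughput at both matrix sizes, while the TS101 and G4 trail it by factors of roughly 21 and 35 on the large matrix.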

Figure 3:

It’s clear that the PowerFFT’s data-flow-based design outperforms the other devices on every FFT length. When the other processors must access main memory, direct SDRAM addressing reduces their performance, although a cache can somewhat mask the performance penalty. In addition, some of the code obtained from the manufacturers makes explicit use of internal memory, so for larger FFT runs we had to run different code.

The more attentive reader will undoubtedly see how performance correlates to the strength of the various devices’ instruction sets. Comparing results is also difficult because many interdependent factors are at work, such as to what degree the code is optimized, the placement of code and data in memory, external memory speed, and even test board design.

Because chip manufacturers upgrade their devices so often, introduce new models, and increase clock speeds, performing a series of benchmarks is like shooting at a moving target. Nonetheless, these benchmarks should aid your understanding of the system issues and help you formulate questions as you investigate which processor to use.


©MMIX DSP-FPGA.com. An OpenSystems Media, LLC publication.
