
A proposed architecture for a low-cost, high-performance data acquisition and control card

By
Shyam Chandra
Lattice Semiconductor

Present-day market pressures are forcing data acquisition and control card providers to increase the number of channels on a data acquisition card, increase the sampling rate, and at the same time, reduce cost. Further, the actual function of the card is either determined during manufacturing or configured in-system. Sometimes designers expect the same card to perform different functions in different slots of the same system.

The use of a DSP CPU onboard, while providing the flexibility of hardware reuse across a wide range of applications, also becomes a bottleneck for board-level performance.

In this article, Shyam Chandra discusses a proposed architecture that would provide a low-cost, high-performance solution.

DSP CPU efficiency decreases with increased channel loading
As the number of channels increases, the load on the DSP engine also increases. This is due to the serial processing nature of current DSP CPUs, which are fetch/execute engines. In most applications, front-end preprocessing tasks such as offset shifting, gain adjustment, and preliminary filtering consume most of the DSP CPU’s bandwidth. Developing this code is also difficult and time consuming because, for efficiency reasons, it is usually written in assembly language. Even though DMA (Direct Memory Access) transfers samples from each channel to different memory locations, CPU performance still suffers because these transfers reduce the memory bandwidth available to the CPU. The faster the sample rate, the less time the processor has for the actual DSP functions, such as signal analysis, video processing, and compression.
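To make the front-end load concrete, the following is a minimal Python sketch of the three preprocessing tasks named above (offset shifting, gain adjustment, and a preliminary FIR filter) applied to one sample from each of four channels. The function name, offset, gain, and filter taps are illustrative assumptions, not values from the design described here.

```python
from collections import deque

def make_preprocessor(offset, gain, fir_taps):
    """Return a function that preprocesses one sample for one channel."""
    # Per-channel filter history, newest sample first (hypothetical layout).
    history = deque([0.0] * len(fir_taps), maxlen=len(fir_taps))

    def process(raw_sample):
        # Offset shifting, then gain adjustment.
        adjusted = (raw_sample - offset) * gain
        # Preliminary FIR filtering over the most recent samples.
        history.appendleft(adjusted)
        return sum(tap * x for tap, x in zip(fir_taps, history))

    return process

# A fetch/execute CPU runs this once per sample, one channel after another.
channels = [make_preprocessor(offset=2048, gain=0.5, fir_taps=[0.25, 0.5, 0.25])
            for _ in range(4)]
frame = [2048, 2052, 2044, 2056]   # one raw sample from each channel
outputs = [proc(sample) for proc, sample in zip(channels, frame)]
print(outputs)
```

Every additional channel adds another pass through this chain for each sample period, which is the serial cost the paragraph above describes.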

Since the DSP processor must switch contexts from one channel to the next, the resulting cache thrashing further reduces the available bus bandwidth. In addition to processing the data, the DSP CPU must also manage data buffering and movement, transfer the processed data to the host processor through the backplane, handle backplane protocol management, and so forth.

To satisfy the processing demands of an increased number of channels, DSP CPU performance must increase exponentially. Raising the processor operating frequency, moving to a more powerful processor, or both can meet this demand. Either way, the result is a much higher board cost.

This proposed architecture enables an economical increase in the number of channels and the sample rate per channel for a given DSP CPU. Using low-cost FPGAs, such as the LatticeECP and LatticeEC, as a coprocessor (offload engine) to the main DSP CPU achieves this goal by minimizing the front-end preprocessing as well as the non-DSP operational load. The resulting design remains flexible enough to address a wide variety of applications. As these FPGA devices provide ample local storage, it is possible to realize a programmable data-flow architecture that further enhances performance. A data-flow architecture lets the DSP perform its computation once all the operands are available, instead of performing the computation sequentially as the operands arrive.
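The data-flow idea can be illustrated with a small Python sketch: a node buffers operands as they arrive and runs its computation only when the full set is present. The class name, operand names, and the subtraction used as the operation are hypothetical and chosen only for illustration.

```python
class DataflowNode:
    """Fires its operation only when every named operand has arrived."""

    def __init__(self, operand_names, operation):
        self.operand_names = tuple(operand_names)
        self.operation = operation
        self.pending = {}

    def deliver(self, name, value):
        """Store one operand; return a result if the node can fire, else None."""
        if name not in self.operand_names:
            raise KeyError(f"unknown operand: {name}")
        self.pending[name] = value
        if len(self.pending) < len(self.operand_names):
            return None                    # still waiting for operands
        args = [self.pending[n] for n in self.operand_names]
        self.pending.clear()               # ready for the next operand set
        return self.operation(*args)

# Hypothetical example: combine preprocessed samples from two channels.
node = DataflowNode(["ch0", "ch1"], lambda a, b: a - b)
first = node.deliver("ch1", 7)     # ch0 has not arrived yet
second = node.deliver("ch0", 10)   # both operands present, node fires
print(first, second)
```

In the architecture described here, the local FPGA storage would play the role of the `pending` buffer, so the DSP is involved only when a complete operand set is ready.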

The following sections describe two approaches to implementing data acquisition. The first method uses an FPGA to interface with an ADC bank, generate the digital I/O interface, and manage the data transfer between the ADC and DAC and the SDRAM memory. The second method uses an FPGA with DSP math processing abilities not only to interface the CPU bus to the ADC and DAC, but also to preprocess the acquired digital samples, as a DSP coprocessor would.

Board architecture description
The block diagram in Figure 1 illustrates the architecture of a data acquisition and control card. The card can process 20 Msamples per second from the ADC bank.

Figure 1: DSP Board with Integrated 4-Channel Data Acquisition with Playback

Follow the processing steps in the diagram starting from the left:

  • The ADC bank quantizes and transfers multiple, high-frequency analog signals to the FPGA via an LVDS interface
  • The FPGA transfers the data directly to the SDRAM attached to the DSP CPU through DMA
  • The DSP CPU processes the data in the memory and transfers analyzed data to the host processor through the PCI backplane
  • The FPGA generates the control signals for communication with the external digital subsystem through digital I/O at the CPU’s command

In benchmarks, this board handled four channels at 5 Msamples per second each. At that rate, no processing power remained for analog control (driving a DAC through the digital I/O) or for improved signal analysis. Because the DSP CPU performed all the processing, including the preprocessing functions and communication with the host processor, any increase in the processing requirement called for a next-generation data acquisition board with a more expensive DSP CPU. The alternative was to reduce the number of input channels so that the CPU could perform the additional processing.
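A rough cycle budget shows why the CPU runs out of headroom. The Python sketch below divides a CPU clock rate by the aggregate sample rate to get the cycles available per sample. The article does not give the DSP CPU's clock rate, so the 600 MHz figure is purely an assumption for illustration, and the sample rate is read as per second.

```python
def cycles_per_sample(cpu_clock_hz, channels, samples_per_second_per_channel):
    """CPU clock cycles available per incoming sample, across all channels."""
    aggregate_rate = channels * samples_per_second_per_channel
    return cpu_clock_hz / aggregate_rate

# Hypothetical clock rate, chosen only to make the arithmetic concrete.
ASSUMED_CPU_CLOCK_HZ = 600_000_000
SAMPLE_RATE = 5_000_000   # 5 Msamples per channel, read as per second

budget_4ch = cycles_per_sample(ASSUMED_CPU_CLOCK_HZ, 4, SAMPLE_RATE)
budget_8ch = cycles_per_sample(ASSUMED_CPU_CLOCK_HZ, 8, SAMPLE_RATE)
print(budget_4ch, budget_8ch)
```

Under these assumed numbers, four channels leave 30 cycles per sample for preprocessing, analysis, and host communication combined, and doubling the channel count halves that to 15.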

The proposed approach explores the use of FPGAs with embedded hardware accelerators for performing preprocessing DSP functions, resulting in a significant reduction of the processing burden on the DSP CPU. The resulting design not only offers higher performance, but also frees the DSP CPU to perform more sophisticated DSP processing and control functions.

FPGA with hardware DSP math processing blocks
Mathematical operations such as multiplication, addition, and subtraction are required to perform the digital signal processing. It is both expensive and inefficient to implement mathematical functions in an FPGA using general-purpose FPGA fabric. However, FPGAs with hardware blocks having multiplication, addition, and subtraction abilities are capable of offloading the DSP CPUs economically and efficiently. These multiple blocks enable the FPGA to process multiple streams of data simultaneously, substantially improving the performance of the system.

sysDSP
This section briefly describes one such hardware math processing block implementation, called sysDSP, on a LatticeECP-DSP FPGA. Designers can configure the high-performance sysDSP block to address a wide variety of DSP operations.

There are 4 to 10 sysDSP blocks per device, enabling parallel DSP processing across multiple channels. Using the design software, a designer can configure each sysDSP block in one of the following modes:

  • 36 x 36 mode
    • One 36 x 36 multiplier
  • 18 x 18 mode
    • Four multipliers
    • Two 52-bit MACs
    • Two sums of two 18 x 18 multipliers each
    • One sum of four 18 x 18 multipliers
  • 9 x 9 mode
    • Eight multipliers
    • Two 34-bit MACs
    • Four sums of two 9 x 9 multipliers each
    • Two sums of four 9 x 9 multipliers each
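The accumulator widths in the list are consistent with a full-width product plus 16 extra bits (36 + 16 = 52 for the 18 x 18 mode, 18 + 16 = 34 for the 9 x 9 mode). That is an observation drawn from the listed numbers, not a statement about the block's internal design. The Python sketch below is a behavioral model of a fixed-width multiply-accumulate under that reading; the function names are hypothetical.

```python
def to_signed(value, bits):
    """Wrap an integer into a two's-complement field of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def mac(pairs, operand_bits=18, acc_bits=52):
    """Multiply-accumulate signed operand pairs into a fixed-width accumulator."""
    lo = -(1 << (operand_bits - 1))
    hi = (1 << (operand_bits - 1)) - 1
    acc = 0
    for a, b in pairs:
        if not (lo <= a <= hi and lo <= b <= hi):
            raise ValueError("operand does not fit the multiplier width")
        # The accumulator wraps on overflow, as a hardware register would.
        acc = to_signed(acc + a * b, acc_bits)
    return acc

# Hypothetical 4-tap filter step: coefficients paired with recent samples.
taps = [1, 2, 3, 4]
samples = [10, 20, 30, 40]
fir_output = mac(zip(taps, samples))
print(fir_output)
```

With 16 bits of headroom above the 36-bit product, 2^16 worst-case 18 x 18 products sum to 2^50, which still fits in a signed 52-bit accumulator.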

The flexibility of the sysDSP block is applicable across a wide variety of DSP operations. The block diagram in Figure 2 shows the sysDSP block’s configuration in the 18 x 18 mode.

Figure 2: A Proposed Architecture for a Low-Cost, High-Performance Data Acquisition and Control Card

These blocks can perform all functions at 250 MHz, resulting in an overall processing capacity of 10,000 MMACs (million multiply-accumulate operations per second). Because hardware presents the data to the sysDSP blocks, rather than a microprocessor fetching it from memory, the performance of these blocks meets benchmark-based expectations. The pipeline registers enable DSP processing operations at peak operating speeds. Because the sysDSP block performs MAC, sum, and add or subtract operations within the block, without using the general-purpose FPGA fabric, it is immune to fabric routing delays.
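One reading that reproduces the 10,000 MMAC figure is the largest device (10 sysDSP blocks) in the 18 x 18 mode (four multipliers per block), with each multiplier completing one operation per clock at 250 MHz. The article does not spell out this decomposition, so the mapping in the Python sketch below is an assumption.

```python
def peak_mmacs(blocks, multipliers_per_block, clock_mhz):
    """Peak multiply-accumulate rate in millions of operations per second."""
    return blocks * multipliers_per_block * clock_mhz

# Assumed mapping: 10 blocks, four 18 x 18 multipliers each, 250 MHz.
largest_device = peak_mmacs(blocks=10, multipliers_per_block=4, clock_mhz=250)
# Same assumptions applied to the smallest device in the stated range.
smallest_device = peak_mmacs(blocks=4, multipliers_per_block=4, clock_mhz=250)
print(largest_device, smallest_device)
```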

The next step is to use this FPGA with the sysDSP building block in a circuit as a preprocessing DSP engine while maintaining the flexibility of the earlier architecture. The following section describes the proposed architecture, which uses two FPGAs. Together, the LatticeECP-DSP FPGA, which contains sysDSP blocks (referred to as ECP-DSP), and the LatticeEC FPGA, which does not (referred to as EC), improve overall data handling and processing efficiency.

Improving performance through repartitioning using ECP-DSP and EC FPGAs
The design shown in the block diagram in Figure 3 uses the same DSP CPU and addresses all the increased performance requirements, while doubling the channel capacity at a much lower cost than the board that used the newer DSP CPU.

Figure 3: 40 MSample DSP Board with Integrated 4/8 Channel Data Acquisition with Playback and Control

Starting from the left, the ADC bank can now sample up to eight high-frequency analog signals and communicate with the ECP-DSP FPGA using the LVDS signaling interface. The ECP-DSP FPGA performs all real-time signal preprocessing functions on all channels simultaneously and writes the cleaned, ready-to-process sample data into the DDR memory through the EC FPGA.

The 64-bit CPU bus interface, implemented in the EC FPGA, enables the transfer of data and instructions between the high-performance DDR memory and the CPU cache in burst mode. The EC FPGA also provides internal memory to buffer the data for processing, reducing the non-sequential access penalty of the DDR memory. While the CPU is processing data in its cache, the intelligent bus arbiter and switch logic implemented in the EC FPGA enable data transfer to and from the peripheral devices (ECP-DSP FPGA, PCI backplane, and digital I/O interface).
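The article does not describe the arbitration policy. As one plausible illustration, the Python sketch below models a simple round-robin arbiter that grants the memory path to one requesting peripheral at a time. The class, the requester names, and the choice of round-robin are all assumptions.

```python
class RoundRobinArbiter:
    """Grants the bus to requesters in rotating order (assumed policy)."""

    def __init__(self, requesters):
        self.requesters = list(requesters)
        self.next_index = 0

    def grant(self, requesting):
        """Return the next requester that wants the bus, or None if idle."""
        count = len(self.requesters)
        for step in range(count):
            index = (self.next_index + step) % count
            name = self.requesters[index]
            if name in requesting:
                # Start the next search just after the requester served now.
                self.next_index = (index + 1) % count
                return name
        return None

# Hypothetical requester names for the peripherals behind the EC FPGA.
arbiter = RoundRobinArbiter(["ecp_dsp", "pci", "digital_io"])
grants = [arbiter.grant({"ecp_dsp", "pci"}) for _ in range(3)]
print(grants)
```

A real design might instead prioritize the sample stream so that ADC data is never dropped; the point of the sketch is only that these transfers can proceed without involving the CPU.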

This architecture improves DSP CPU efficiency by:

  • Reducing the preprocessing load on the DSP CPU
  • Reducing time critical channel and context switching interruptions
  • Reducing the non-DSP operation load
  • Increasing the memory bandwidth by using faster DDR memory
  • Reducing non-sequential access to DDR during processing

Because the DSP CPU has more time available, the board can support more input and output channels.

  • Functions of the ECP-DSP FPGA
    • High speed ADC interface
    • Per-channel, real-time preprocessing, implementing
      functions such as offset shifting, gain adjustment, FIR
      filter, etc.
    • Digital I/O interface, configurable as the
      application demands
    • The digital interface can also drive DAC with buffering
    • High-speed communication interface with the EC FPGA
  • Functions of the EC FPGA
    • 64-bit DSP CPU interface
    • DDR interface
    • Local memory storage for partially processed data
    • CompactPCI/PXI interface
    • Logic to configure the ECP-DSP device using the
      configuration stored on the host CPU
    • High-speed communication interface with the ECP-DSP
      FPGA
    • Logic to transfer the boot code to DDR memory for the
      CPU to execute
    • Update boot code Flash with the updated code received
      from the host CPU
    • Intelligent bus-arbitration logic required for
      managing data transfer among the LatticeECP, memory,
      CompactPCI backplane, etc.

Configuring the board for different applications
Designers can configure this board either in-system through the backplane interface or during manufacturing.

Configuration from the backplane
The code in the boot ROM for the DSP CPU enables communication with the host processor, and the SPI configuration Flash attached to the LatticeEC FPGA is loaded with configuration data to perform all the functions described previously.

After the user plugs the card into the slot, the EC FPGA configures itself from the SPI Flash, transfers the ROM code to DDR memory, and signals the Power Manager to release the CPU Reset. The DSP CPU then communicates with the host processor and initiates both the transfer of the DSP algorithm into DDR memory and the loading of the preprocessing configuration directly into the ECP-DSP FPGA through the sysConfig port, which the EC FPGA controls. The DSP CPU and the ECP-DSP FPGA are then ready to process signals as required by that slot.

Configuration during manufacturing
The DSP algorithm is stored in the boot ROM and the FPGA configurations are stored in their respective SPI Config Flash memory.

After a user plugs the card into the slot, the EC FPGA configures itself from its SPI Flash and transfers the contents of the boot ROM into the DDR memory; during this time, the Power Manager holds the CPU in Reset. Simultaneously, the ECP-DSP FPGA configures itself from its own SPI Config Flash device. After both FPGAs are configured, the Power Manager is signaled to release the CPU Reset.
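The two configuration options described in this section can be summarized as ordered step lists. The Python sketch below is only a reading of the prose; the step wording, function names, and mode names are hypothetical, and steps that the text says happen simultaneously are shown in one possible order.

```python
RELEASE_RESET = "Power Manager releases CPU Reset"

def boot_sequence(mode):
    """Ordered steps for the two configuration options described in the text."""
    if mode == "backplane":
        return [
            "EC FPGA configures from SPI Flash",
            "EC FPGA copies boot ROM code to DDR memory",
            RELEASE_RESET,
            "DSP CPU contacts host processor",
            "host transfers DSP algorithm to DDR memory",
            "ECP-DSP FPGA is configured through the EC FPGA",
        ]
    if mode == "manufacturing":
        return [
            # The two FPGAs configure at the same time; a list can only
            # show one possible ordering of those steps.
            "EC FPGA configures from SPI Flash",
            "EC FPGA copies boot ROM code to DDR memory",
            "ECP-DSP FPGA configures from its own SPI Flash",
            RELEASE_RESET,
        ]
    raise ValueError(f"unknown configuration mode: {mode}")

def steps_before_reset_release(mode):
    """Steps that complete while the CPU is still held in Reset."""
    steps = boot_sequence(mode)
    return steps[:steps.index(RELEASE_RESET)]

print(steps_before_reset_release("manufacturing"))
```

The visible difference is where the ECP-DSP configuration happens: before the CPU leaves Reset when the board is configured at manufacturing, and after the CPU has contacted the host when it is configured from the backplane.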

Advantages of the proposed architecture
With the proposed architecture, the performance of the enhanced data acquisition board more than doubled compared to the old implementation, while using the same DSP CPU. The increase came primarily from removing the time-consuming, difficult-to-maintain, assembly-language preprocessing functions from the DSP CPU. Switching to faster and less expensive DDR memory provided additional performance.

The resulting design is significantly less expensive, offers higher performance, and is more flexible than boards with a traditional architecture.


©MMX DSP-FPGA.com. An OpenSystems Media, LLC publication.
