Peter argues that GPGPU technology can, in many instances, replace FPGAs, DSPs, or both, for reasons including greater performance, easier programming, and lower cost.
Despite having the same number of legs as a mermaid, manycore devices such as Graphics Processing Units (GPUs) and Tilera processors are increasingly stepping forward to replace Field Programmable Gate Arrays (FPGAs) in certain applications. Developers are specifying GPUs to perform general-purpose, rather than purely graphics, processing, which has given rise to General Purpose computing on Graphics Processing Units (GPGPU) technology, most notably with NVIDIA’s CUDA. When assessing whether GPGPU, TILE, or FPGA technology is preferable for a given application, the checklist of what to consider includes but is not limited to:
· Available processing power
· Latency
· Scalability
· Cost of development
· Technology insertion
· Price
Processing power
Sizing up the processing power one method delivers over another depends on the type of data being processed and the algorithms doing the processing.
GPUs are especially adept at handling Single Precision (SP) – and in some cases Double Precision (DP) – Floating Point (FP) operations due to their origins as graphics rendering machines. Tilera’s TILE devices do not currently support hardware FP operation, instead requiring software emulation at a large performance penalty. And in general the same is true of FPGAs, with these devices gobbling many resources to tackle FP operations. Achieving acceptable performance demands IP blocks that consume many gates and require deep pipelining. For example, current generation Tesla class GPUs can peak at 10^12 Floating Point Operations Per Second or 1 TFLOPS, while Xilinx Virtex-6 devices are quoted at 150 GFLOPS.
When considering fixed point operation, the picture changes somewhat. New-generation GPUs perform integer operations at the same rate as floating point, i.e., 10^12 Operations Per Second or 1 TOPS, while the Virtex-6 device improves to 500 GOPS. Integer performance is the forte of TILE processors: The TILE-Gx (Figure 1) can peak at 750 GOPS for 8-bit data and at 188 GOPS for 32-bit data.
Figure 1: Tilera’s TILE-Gx processor can peak at 750 GOPS for 8-bit data.
Use of fixed point processing in a signal processing application can lengthen development time. Much analysis takes place at system definition time to (1) measure the dynamic range requirements at each processing stage and (2) ensure no overflow or underflow occurs with real-world signals. Otherwise developers expend extra resources at runtime to continually monitor dynamic range and adjust block scaling factors.
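One common form of the runtime monitoring described above is block floating point, where a block of samples shares a single scaling exponent. The following is a minimal Python sketch, assuming signed 16-bit storage; the function names `block_scale` and `block_unscale` are illustrative and not taken from any particular library.

```python
import math

INT16_MAX = 32767  # assumption: samples are stored as signed 16-bit integers


def block_scale(samples, max_int=INT16_MAX):
    """Quantize a block of floats to integers sharing one power-of-two exponent.

    Returns (ints, exponent) such that sample ~= int * 2**exponent.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return [0] * len(samples), 0
    # Smallest exponent for which the largest sample still fits in max_int.
    exponent = math.ceil(math.log2(peak / max_int))
    scale = 2.0 ** exponent
    # Clamp to guard against overflow from floating-point rounding.
    ints = [max(-max_int, min(max_int, round(s / scale))) for s in samples]
    return ints, exponent


def block_unscale(ints, exponent):
    """Recover approximate float values from a scaled block."""
    return [i * 2.0 ** exponent for i in ints]


if __name__ == "__main__":
    ints, exponent = block_scale([1000.0, -250.0, 0.5])
    print(ints, exponent)
```

The exponent has to be recomputed whenever the signal level changes between blocks, which is the extra runtime cost that up-front dynamic range analysis aims to avoid.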
Bitwise operations tip the scales heavily in favor of FPGAs, as the overhead for processing on more general-purpose architectures can be significantly larger both temporally and spatially. Once again, in the case of bitwise operations implemented on an FPGA, the development time needs to be considered.
With all compute systems it is well understood that the actual achievable processing power can differ from theoretical peak processing power by a large amount. Two factors that greatly affect this difference include the suitability of the algorithm as it is applied to the hardware architecture and the time it takes to optimize the implementation.
For instance, FPGAs can get much closer to theoretical maximums due to their abilities to exploit parallelism and adapt to a wide variety of algorithms. However, FPGAs require larger silicon real estate and long development times to approach those theoretical maximums. GPUs have been shown to achieve 20-30 percent of peak for algorithms that fit the GPU’s hardware parallelism model. They also have reasonable silicon densities (40nm process, with 32nm on the way) and development times (typically weeks versus months for an FPGA implementation). A TILEPro64 processor offers FPGA-like adaptability and GPU-like programmability, but lacks the ability to exploit fine-grained parallelism to the same degree as FPGAs and GPUs do, due to its coarse, task-level decomposition of problems.
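As a rough worked example of the gap between peak and achieved throughput, the sketch below applies the 20 to 30 percent range quoted above to the 1 TFLOPS GPU peak cited earlier. The helper name `sustained_gflops` is illustrative, and the result is an estimate rather than a measurement.

```python
def sustained_gflops(peak_gflops, efficiency):
    """Estimate achieved throughput as a fraction of theoretical peak."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return peak_gflops * efficiency


GPU_PEAK_GFLOPS = 1000.0  # 1 TFLOPS, the Tesla-class peak quoted in the article

low = sustained_gflops(GPU_PEAK_GFLOPS, 0.20)
high = sustained_gflops(GPU_PEAK_GFLOPS, 0.30)
print(f"Estimated sustained GPU throughput: {low:.0f} to {high:.0f} GFLOPS")
```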
Equally important in assessing processor performance is memory bandwidth, where the GPU offers a 3x advantage over FPGAs and 6x over the TILEPro64. However, it must be noted that this bandwidth is not available without conditions: Large latencies can be incurred that must be managed by overlapping processing, and accesses should be coalesced by blocking them together in an optimal access pattern. With FPGAs, memory locality will need to be fully coordinated by the developer. Newer-generation GPUs and the TILEPro64 processor have traditional cache arrangements that can help to optimize memory locality and again reduce development time.
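To illustrate why the access pattern matters, the following Python sketch counts how many memory segments a group of threads touches for a given stride. It is a simplified model, not a description of any specific device: the group size of 32 threads, the 4-byte element, and the 128-byte segment are assumptions chosen for illustration, and `segments_touched` is a hypothetical helper.

```python
def segments_touched(stride, threads=32, elem_bytes=4, segment_bytes=128):
    """Count distinct memory segments read when thread t accesses element t * stride."""
    segments = {(t * stride * elem_bytes) // segment_bytes for t in range(threads)}
    return len(segments)


for stride in (1, 2, 32):
    print(f"stride {stride}: {segments_touched(stride)} segment(s)")
# stride 1: 1 segment(s)
# stride 2: 2 segment(s)
# stride 32: 32 segment(s)
```

Under these assumptions, unit-stride access is served by a single segment fetch, while a stride of 32 elements needs one fetch per thread, so only a small part of the quoted bandwidth carries useful data.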
Latency
The factor most likely to rule out the use of a GPGPU is latency. The time to invoke a kernel and the long access time to main memory, for example, can provoke long latencies. In many cases this can be mitigated to varying degrees but not fully avoided. Large data sets are preferred, as is a large number of operations per datum, or in other words, high computational intensity. Where tight latency requirements need to be met (for example, closed loop control), FPGAs will be preferred. TILE processors can show good latency characteristics, particularly when programmed in their “bare metal” mode.
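The preference for large data sets and high computational intensity can be seen in a simple timing model. The sketch below is a back-of-envelope estimate only; every default value (launch overhead, link bandwidth, sustained rate) is a placeholder assumption rather than a measured figure, and `overhead_fraction` is a hypothetical helper.

```python
def overhead_fraction(n_items, ops_per_item,
                      launch_s=10e-6,          # assumed fixed kernel launch cost
                      bytes_per_item=4,        # assumed 32-bit data
                      link_bytes_per_s=8e9,    # assumed host-to-GPU bandwidth
                      gpu_ops_per_s=250e9):    # assumed sustained GPU rate
    """Return the share of total time spent on launch and data transfer."""
    transfer_s = n_items * bytes_per_item / link_bytes_per_s
    compute_s = n_items * ops_per_item / gpu_ops_per_s
    total_s = launch_s + transfer_s + compute_s
    return (launch_s + transfer_s) / total_s


for n_items, ops in ((1_000, 10), (1_000_000, 10), (1_000_000, 1_000)):
    share = overhead_fraction(n_items, ops)
    print(f"{n_items} items, {ops} ops/item: {share:.1%} overhead")
```

In this model a larger data set amortizes the fixed launch cost, while more operations per datum amortize the transfer cost. Neither removes the minimum latency of a single invocation, which is why closed-loop control still tends to favor FPGAs.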
Scalability
FPGAs can be tightly coupled with low-overhead links, such as Aurora, or can implement standard serial fabrics such as Serial RapidIO or PCI Express. GPGPUs are coprocessors, always requiring a host processor. Multiple GPUs can be linked, as Figure 2 illustrates, to a single host (preferably multicore), but the use of shared resources will start to limit the returns when using one GPU per host core.
A common way to execute code across multiple closely linked GPGPUs is to use OpenMP. This method allows for automatic execution of processing loops in parallel threads that can each utilize a different GPU. Further scalability comes at the cluster (host + GPU [or GPUs]) level. Such clusters can be linked by PCI Express, 10G Ethernet, InfiniBand, and other links, and are typically programmed using such middleware as MPI.
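The loop-splitting idea can be sketched as follows. This is a Python analogy using host threads, not OpenMP itself: `split_static` mimics dividing loop iterations into contiguous, near-equal chunks, and `process_on_device` is a hypothetical stand-in for the code that would select a GPU and launch a kernel on it.

```python
from concurrent.futures import ThreadPoolExecutor


def split_static(n_items, n_workers):
    """Split range(n_items) into n_workers contiguous, near-equal chunks."""
    base, extra = divmod(n_items, n_workers)
    bounds, start = [], 0
    for worker in range(n_workers):
        stop = start + base + (1 if worker < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def process_on_device(device_id, data, start, stop):
    # Hypothetical stand-in: a real program would select GPU `device_id`
    # and launch a kernel here. This sketch just squares values on the host.
    return [x * x for x in data[start:stop]]


def parallel_for(data, n_devices):
    """Run one host thread per device, each handling its own chunk."""
    bounds = split_static(len(data), n_devices)
    with ThreadPoolExecutor(max_workers=n_devices) as pool:
        futures = [pool.submit(process_on_device, dev, data, lo, hi)
                   for dev, (lo, hi) in enumerate(bounds)]
        results = []
        for future in futures:  # collect in chunk order
            results.extend(future.result())
    return results


if __name__ == "__main__":
    print(parallel_for(list(range(10)), n_devices=4))
```

Each host thread owns one device for the duration of its chunk, which is why a multicore host is preferable when several GPUs are attached.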
TILE processors are highly connected internally (between cores) with several meshed fabrics tuned to different types of processing. The TILE processor multiple fabrics allow for inter-core general and low-latency IPC as well as inter-core memory coherence. Device-to-device connections can be made via 10G Ethernet and PCI Express. The whole device – or clusters of cores – can be programmed as a Symmetrical Multi Processing device.
Cost of development
A cost-of-development metric can be difficult to derive. Qualitatively, it is generally accepted that programming a multicore device in C or C++ is easier than programming an FPGA. It’s also accepted that you can more readily find qualified engineers to program multicore devices than you can recruit individuals to program FPGA devices in VHDL or Verilog. FPGAs require multiple skill sets to reach near-theoretic performance, as the developer has to develop and optimize both the hardware and the algorithm (software). In the case of multicore, the developer is free to focus on algorithm development and optimization (software only), as the hardware is already defined.
Quantifying this difference is the problem. One method is to consider Source Lines of Code (SLOCs). This varies depending on the algorithm, but ratios of greater than three to one in favor of manycore processors are not unusual. Use of higher-level abstractions can blur the picture – MATLAB in the case of GPGPUs, products such as Agility-C or MATLAB System-Generator for FPGAs.
SLOC count alone does not accurately represent the cost of development. Many of the tool and language innovations that have pushed software development productivity forward such as integrated development environments, debuggers, test coverage generation, and object-oriented programming are starting to have an impact on FPGA development – but there is a long way to go. Additionally, FPGA development out of the box does not have the quick test and modify cycles of software due to long synthesis and place and route times, reduced processor state transparency and simulation times that can be very long. There are solutions for these issues, but they require additional investment.
Technology insertion
On the application side, GPGPU code can take advantage of newer devices with more cores without change in many cases. The method of expressing parallelism via small execution units (kernels), for example, still holds, irrespective of the number of cores available. At development time and at runtime, the toolchain and drivers respectively abstract the application from the hardware. Hundreds to thousands of concurrent threads are invoked to execute these kernels.
A single binary can be run on different devices with different core counts. In many cases this can provide migration to new platforms with minimal pain.
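The reason the same kernel code carries over to devices with different core counts can be shown with a small host-side simulation. This is a sequential Python sketch, not CUDA: `saxpy_kernel` and `launch` are illustrative names, and the strided mapping of logical threads onto cores is an assumption made for the example.

```python
def saxpy_kernel(i, a, x, y, out):
    """Work for one logical thread: a single element of a * x + y."""
    out[i] = a * x[i] + y[i]


def launch(kernel, n_threads, n_cores, *args):
    """Run n_threads kernel instances on n_cores simulated cores.

    Core c handles logical threads c, c + n_cores, c + 2 * n_cores, ...
    The kernel itself never sees n_cores.
    """
    for core in range(n_cores):
        for i in range(core, n_threads, n_cores):
            kernel(i, *args)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0] * 5

out_small = [0.0] * 5
launch(saxpy_kernel, len(x), 2, 2.0, x, y, out_small)   # "older" 2-core device

out_large = [0.0] * 5
launch(saxpy_kernel, len(x), 16, 2.0, x, y, out_large)  # "newer" 16-core device

print(out_small == out_large)  # True
```

Because the kernel is written in terms of a logical thread index only, the mapping onto physical cores can change without touching application code.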
Similarly, SMP applications for TILE processors can be written to automatically scale to a larger number of cores when newer devices become available.
In contrast, moving an FPGA application to a newer device can mean substantially reworking the hardware expression code to fit a different target platform even if the algorithm remains the same.
Price
Commercial-grade GPGPU boards (Figure 3) can be bought for as little as $50 and as much as $4,000 for the latest boards targeted at supercomputing applications. A board containing a high-end Virtex-6 FPGA will likely run in the region of $4,000. Tilera boards will most likely be much higher than both of these due to their niche nature. Ruggedized versions of all three types will be much higher due to lower volumes, board construction techniques, testing and screening. For example, a fully ruggedized, conduction-cooled GPU board can cost in the region of $7,000. Such boards are required for military/aerospace applications as commercial boards will not survive the environmental stresses of deployment in harsh environments and do not have the required life cycle support for long-term programs.
Figure 3: NVIDIA’s CUDA-enabled GeForce GT130M is an example of an inexpensive entry level into GPGPU technology.
Conclusion
For many use cases FPGA performance with regard to processing prowess and tight latency remains unrivalled. However, there are many other cases where the use of a multicore device can and should be considered. Due to their fixed point performance, TILE processors can be considered as a direct replacement for FPGAs. As GPGPUs are more adept at floating point operation, they can be considered as both a replacement of, and a complement to, FPGAs.
Multicore processors are making their way ever closer to the sensor, and it’s likely that modules will come to market soon with such devices sitting right behind an analog-to-digital converter (ADC) where the FPGA used to be. Some applications that moved from General Purpose Processors (GPPs) such as PowerPC with AltiVec to arrays of FPGAs are now starting to migrate to manycore architectures. For example, medical imaging applications such as Computed Tomography backprojection and Magnetic Resonance Imaging now employ GPGPUs to generate their images.
Radar systems that currently employ heterogeneous mixes of FPGAs and GPPs are evaluating the viability of using GPGPUs to reduce the Size, Weight and Power (SWaP) of the processing subsystems to allow deployment in smaller platforms such as UAVs or increase processing capability in the same footprint. Some imaging applications such as 360-degree situational awareness that once used dedicated hardware now use TILE processors and GPUs to ingest multiple camera streams, then to warp, stitch, and display panoramas (Figure 4).
Figure 4: A GPGPU can support the ingestion of multiple camera streams to create panoramic images in real time.
Given the ease of programming these manycore devices, their processing power, their low acquisition cost, and their typically lower development cost, there is much to commend them for applications once dominated by programmable hardware arrays.
Peter Thompson graduated in 1980 from the University of Birmingham, UK, with a BSc Eng. (Hons) in Electrical and Electronic Engineering. He began his career at Racal Redac, eventually joining Radstone Technology. He is currently Director of Applications, Business Development, at GE Intelligent Platforms, tasked with evangelizing new technologies to customers and gathering market intelligence on new technology directions.



