Over the past several years, the Intel FPGA family has drawn attention for its support of OpenCL. This programming framework helps design teams implement parallel computing applications: developers can create and simulate complex designs with comparatively little effort. Some limitations remain, however. Low-voltage operation reduces transistor switching speed, and the Altera SDK currently supports only PCI Express memory controllers and x86 hosts, although work is under way to broaden the range of supported architectures.
Low voltage decreases transistor switching speed
In the high-level design process using Intel FPGAs, designers evaluate transistor switching speed with the Intel Quartus Timing Analyzer. The analyzer reports FPGA timing under different operating conditions, including high temperature and low voltage. The following table illustrates the impact of different factors on transistor switching speed.
The high-level design flow on Intel FPGAs enables deep pipelines, which suit applications with abundant data parallelism. Low-voltage operation, however, keeps the FPGA from exploiting all of its available resources, and its throughput is higher with single-precision floating point: with double-precision floating point, the FPGA's execution time is reported to be 3.5 times longer than the GPU's.
The high-level design process can also target multiple CPUs and GPUs. An FPGA's low-voltage logic runs at lower clock rates than a standard processor, so it cannot simply execute the same algorithm at the same speed; the high-level design process with Intel OpenCL for FPGA helps overcome this limitation through pipelining and parallelism. The CPU and GPU remain the two most popular processors in modern applications.
When implementing higher-level physical-layer functions on an Intel FPGA, the device's low-voltage logic limits transistor switching speed. In exchange, the FPGA is considerably more energy-efficient, which makes it attractive for cloud-computing applications. The Arria 10 GX 1150, for example, provides roughly 1.15 million logic elements.
The OpenCL language lets designers write software that is translated into hardware, easing hardware/software co-design and co-processing. The Liquid Metal project is a good example. Sunil Shukla, a research staff member at the IBM T.J. Watson Research Center, works on accelerator-based computing and dynamically reconfigurable accelerators.
The proposed approach addresses throughput latency during the high-level design process on Intel FPGAs. The underlying programming model allows OpenCL kernels to be compiled to an FPGA in a standard context, which reduces programming cost. The authors demonstrate the approach on a hyperspectral-image classifier, making it a helpful tool for designers interested in integrating OpenCL into their design flow.
Integer arithmetic is helpful in an OpenCL design for Intel FPGAs because it avoids excessive DSP-block usage; floating-point calculations require complex circuits and consume DSP resources. To compute each OFDM symbol, the OpenCL algorithm receives an input equal to the symbol size, and a cyclic-prefix-addition kernel prevents interference between symbols.
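To make the cyclic-prefix idea concrete, here is a minimal plain-C sketch of what such a kernel computes: the last `cp_len` samples of an OFDM symbol are copied in front of the symbol itself. The function name and signature are illustrative, not taken from any real OpenCL codebase; note the samples stay as 16-bit integers, which maps to plain FPGA logic rather than DSP-heavy floating-point circuits.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch of a cyclic-prefix-addition step: prepend the
 * tail of the symbol to the symbol itself. Integer samples avoid
 * floating-point DSP usage. `out` must hold cp_len + sym_len values. */
void add_cyclic_prefix(const short *symbol, int sym_len,
                       short *out, int cp_len)
{
    /* prefix = last cp_len samples of the symbol */
    memcpy(out, symbol + sym_len - cp_len, cp_len * sizeof(short));
    /* followed by the full symbol */
    memcpy(out + cp_len, symbol, sym_len * sizeof(short));
}
```

The prefix makes the symbol look periodic to the receiver, which is what guards against inter-symbol interference.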
A pipeline design is another effective way to reduce throughput latency during the high-level design process. For example, token decoding and the history buffer can sit in different pipeline stages, enabling continuous processing of compressed data and reducing latency when data dependencies occur.
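The second of those two stages can be sketched in plain C. In this hypothetical LZ-style decompressor, stage 1 would decode a token (either a literal byte or a back-reference), and the stage below resolves it against the history buffer; in hardware, the two stages run as separate pipeline steps so a new token can be decoded while the previous one is still being applied. All names here are illustrative assumptions.

```c
#include <assert.h>

/* Hypothetical decoded token: either a literal byte, or a copy of
 * `len` bytes starting `off` positions back in the history buffer. */
typedef struct { int is_copy; int lit; int off; int len; } token_t;

/* Stage 2 of the sketch: apply one decoded token to the history
 * buffer at write position `pos`; returns the new write position. */
int apply_token(token_t t, unsigned char *hist, int pos)
{
    if (!t.is_copy) {
        hist[pos++] = (unsigned char)t.lit;   /* literal byte */
    } else {
        for (int i = 0; i < t.len; i++) {     /* back-reference copy */
            hist[pos] = hist[pos - t.off];
            pos++;
        }
    }
    return pos;
}
```

The data dependency the article mentions is visible in the copy loop: a back-reference reads bytes the pipeline may only just have written, which is exactly why the history buffer gets its own stage.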
Throughput latency during the high-level design process depends on several factors: greater complexity demands more computation lanes, DSP slices, pipe FIFOs, and LUTs. By tuning these parameters, an Intel OpenCL FPGA design can reduce its overall throughput latency.
Multiple engines can be implemented on a single FPGA to reduce communication latency. Although data-access control is a concern, this method yields throughput close to the interface bandwidth. Adding engines raises throughput, but not the speed of data-access control, and the theoretical maximum is the accelerator's interface bandwidth. Using just enough engines to saturate that bandwidth, and no more, is therefore key to efficient performance.
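That scaling behaviour can be captured in a back-of-the-envelope model: total throughput grows linearly with the number of engines until the interface bandwidth caps it. The function below is an illustrative assumption, not a vendor formula; units are arbitrary (for example, GB/s).

```c
#include <assert.h>

/* Hypothetical throughput model: n engines each sustaining
 * `per_engine` units, capped by the accelerator's interface
 * bandwidth. Beyond the cap, extra engines add nothing. */
double total_throughput(int n_engines, double per_engine,
                        double interface_bw)
{
    double t = n_engines * per_engine;
    return t < interface_bw ? t : interface_bw;  /* link-limited */
}
```

With a 10-unit link and 3-unit engines, four engines already saturate the interface; a fifth only consumes FPGA resources.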
The Intel FPGA SDK for OpenCL is an efficient development environment for heterogeneous platforms. It uses Intel Quartus® Prime software to abstract away FPGA details and deliver optimized results, supports dynamic profiling and analysis of local memory usage (for example, with Intel VTune), and includes a comprehensive refactoring toolchain.
The authors demonstrate the performance impact of reducing calculation latency during the high-level design process using OpenCL. They use a 2D interconnection between PEs and a two-dimensional dispatcher to reduce external memory bandwidth. They also develop a spatial-spectral classifier for hyperspectral images based on a K-Nearest-Neighbours filter, implemented in the OpenCL model and deployed on the FPGA.
Array-partitioning and loop-unrolling tools help generate efficient hardware circuits with little manual intervention, producing circuits that meet timing requirements and use hardware resources efficiently. Problems may still arise outside the computational kernels, for instance when merge trees and prefetching are required.
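Loop unrolling is easiest to see in software form. The sketch below (an illustration, not output of any real tool) processes four elements per iteration with four independent accumulators; an HLS compiler can turn each accumulator into its own hardware lane, which is the effect an unrolling pragma requests. For simplicity it assumes the length is a multiple of four.

```c
#include <assert.h>

/* Hypothetical hand-unrolled reduction: four independent
 * accumulators let the four additions per iteration proceed in
 * parallel lanes. Assumes n is a multiple of 4. */
int sum_unrolled(const int *a, int n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];      /* lane 0 */
        s1 += a[i + 1];  /* lane 1 */
        s2 += a[i + 2];  /* lane 2 */
        s3 += a[i + 3];  /* lane 3 */
    }
    return s0 + s1 + s2 + s3;  /* final combine */
}
```

Splitting one accumulator into four also breaks the loop-carried dependency on a single sum variable, which is what lets the tool pipeline the loop with one iteration per clock.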
In an Intel OpenCL FPGA design, compute load is a major component of overall execution time, so the load placed on CPU resources is a key concern in the high-level design process. With OpenCL, the same resources can serve all symbol sizes, saving CPU processing time; in a video-game simulation, for example, CPU resources are saved by reducing the latency of each calculation task.
Image processing improvement
The FPGA overlay fabric consists of a streaming memory framework and softcore processors that target OpenCL. With it, the FPGA can cut test time from hours to seconds and improve performance while reducing development time. The streaming memory framework is another advantage of the FPGA over the GPU: it lets developers run OpenCL applications outside a programmable acceleration workload. Image-processing applications, for example, can include a DAQ module.
Edge processing is a crucial part of computer-vision applications. It happens closest to the camera or sensor and performs pre-processing functions such as object detection, image enhancement, and convolutional neural networks. Intel FPGA OpenCL lets developers build such applications on embedded processor cores, because OpenCL makes it easy to partition the vision-algorithm chain functionally. The technology also supports using multiple on-die processor cores for computing at the camera edge.
Using a task-parallel programming model to implement an Intel OpenCL FPGA
The OpenCL framework supports building kernels online from the host program. Each kernel command adds a compilation task that overlaps the previous one, and a dependency edge protects each compilation task so it finishes before any kernel task that needs it.
In the task-graph implementation, each compute group has a work queue, with a worker thread for each physical core bound to a compute node. A task graph is constructed from these two data sources, with enqueued commands ordered by arrival. When a task completes execution, the command queue is flushed to dismantle finished tasks. Task graphs are distributed among the compute groups and pushed in round-robin fashion.
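The round-robin distribution step can be sketched in a few lines of C. This is an illustrative data structure, not the runtime's actual one: tasks are pushed onto per-compute-group work queues in arrival order, rotating through the groups.

```c
#include <assert.h>

#define MAX_TASKS 64

/* Hypothetical per-compute-group work queue. */
typedef struct {
    int tasks[MAX_TASKS];
    int count;
} work_queue_t;

/* Push tasks 0..n_tasks-1 across n_groups queues round-robin,
 * preserving arrival order within each queue. */
void distribute(work_queue_t *q, int n_groups, int n_tasks)
{
    for (int g = 0; g < n_groups; g++)
        q[g].count = 0;                        /* reset queues */
    for (int t = 0; t < n_tasks; t++) {
        work_queue_t *dst = &q[t % n_groups];  /* next group in rotation */
        dst->tasks[dst->count++] = t;
    }
}
```

Round-robin keeps the queues balanced without any global scheduler state, which is why it pairs naturally with one worker thread per core.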
OpenCL supports a task-parallel programming model with multiple threads, letting multiple CPUs run applications in parallel while the CPU and GPU share the same set of resources. OpenCL code can be optimized with this approach, which is useful for parallel computing because it lets multiple kernels work in parallel.
Beyond task parallelism, OpenCL programmers must ensure that kernel executions synchronize with one another without letting global synchronization dominate the workload. Using a task graph, the OpenCL runtime can extract task parallelism from an in-order queue and run in parallel tasks that a single-task CPU would execute sequentially.