Algorithms based on convolutional neural networks (CNNs) have been successful in solving image recognition problems, showing very large accuracy improvements. In recent years, deconvolution layers have been widely used in state-of-the-art CNNs for end-to-end training and in models that support tasks such as image segmentation and super resolution. However, deconvolution algorithms are computationally intensive, which limits their applicability to real-time applications. In particular, there has been little research on efficient implementations of deconvolution algorithms on FPGA platforms. In this work, we propose and develop highly optimized and efficient FPGA architectures for both deconvolution and CNN algorithms. A non-linear optimization model based on the performance model is introduced to efficiently explore the design space of the CNN accelerator in order to achieve optimal system throughput and improve power efficiency. Finally, we implement our designs on a Xilinx Zynq ZC706 board. The deconvolution accelerator achieves a performance of 90.1 GOPS at a working frequency of 200 MHz and a performance density of 0.10 GOPS/DSP using 32-bit quantization. Our CNN accelerator achieves a performance of 107 GOPS and 0.12 GOPS/DSP using 16-bit quantization, and supports up to 17 frames per second for 512x512 image segmentation with a power consumption of only 9.6 W.
Conventional homogeneous multicore processors can no longer provide the continued performance and energy improvements that we have come to expect. FPGA-based heterogeneous architectures are considered a promising paradigm to resolve this issue. As a consequence, a variety of CPU-FPGA acceleration platforms with diversified microarchitectural features have been supplied by industry vendors. Such diversity, however, poses a serious challenge to application developers in selecting the appropriate platform for a specific application or application domain. This paper aims to address this challenge by determining which microarchitectural characteristics affect performance, and how they affect it. Specifically, we conduct a quantitative analysis on four state-of-the-art CPU-FPGA acceleration platforms: 1) the Alpha Data board that represents the traditional PCIe-based platform with private device memory, 2) IBM CAPI that represents the PCIe-based system with coherent shared memory, 3) Intel HARPv1 that represents the QPI-based system with coherent shared memory, and 4) HARPv2 that represents a hybrid PCIe (non-coherent) and QPI (coherent) based system with shared memory. Based on the analysis of their CPU-FPGA communication latency and bandwidth characteristics, we provide a series of insights for both application developers and platform designers.
Recent research on neural networks has shown great advantages in computer vision over traditional algorithms based on handcrafted features and models. Neural networks are now widely adopted in areas such as image, speech, and video recognition. However, the high computation and storage complexity of neural network based algorithms makes them difficult to deploy. CPU platforms struggle to offer enough computation capacity. GPU platforms are the first choice for neural network processing because of their high computation capacity and easy-to-use development frameworks. On the other hand, FPGA-based neural network accelerators are becoming a research topic, because specifically designed hardware is the next possible solution to surpass GPUs in speed and energy efficiency. Various FPGA-based accelerator designs have been proposed with software and hardware optimization techniques to achieve high speed and energy efficiency. In this paper, we give an overview of previous work on FPGA-based neural network accelerators and summarize the main techniques used. The investigation spans software to hardware and circuit level to system level, providing a complete analysis of FPGA-based neural network accelerator design and serving as a guide to future work.
Modern FPGA synthesis tools typically apply a predetermined sequence of logic optimizations on the input logic network before carrying out technology mapping. While the "known recipes" of logic transformations often lead to improved mapping results, there remains a nontrivial gap between the quality metrics driving the pre-mapping logic optimizations and those targeted by the actual technology mapping. Needless to mention, such miscorrelations would eventually result in suboptimal quality of results. We propose PIMap, which couples logic transformations and technology mapping under an iterative improvement framework for LUT-based FPGAs. In each iteration, PIMap randomly proposes a transformation on the given logic network from an ensemble of candidate optimizations; it then invokes technology mapping and makes use of the mapping result to determine the likelihood of accepting the proposed transformation. By adjusting the optimization objective and incorporating required time constraints during the iterative process, PIMap can flexibly optimize for different objectives including area minimization, delay optimization, and delay-constrained area reduction. To mitigate the runtime overhead, we further introduce parallelization techniques to decompose a large design into multiple smaller sub-netlists that can be optimized simultaneously. Experimental results show that PIMap achieves promising quality improvement over a set of commonly used benchmarks.
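The propose, map, and accept loop described in the PIMap abstract can be illustrated with a minimal sketch. Everything here is a toy stand-in rather than the authors' implementation: the "network" is a tuple of gate fan-ins, `lut_count` is a hypothetical cost model that plays the role of technology mapping, `merge` and `split` are hypothetical transformations, and the Metropolis-style acceptance rule is an assumption, since the abstract only says that the mapping result determines the likelihood of acceptance.

```python
import math
import random

K = 4  # LUT input count (toy assumption)

def lut_count(network):
    """Toy mapping cost: a wide AND-like gate with n inputs is assumed
    to need ceil((n - 1) / (K - 1)) K-input LUTs."""
    return sum(max(1, math.ceil((n - 1) / (K - 1))) for n in network)

def merge(network, rng):
    """Hypothetical transform: collapse two adjacent gates into one."""
    if len(network) < 2:
        return network
    i = rng.randrange(len(network) - 1)
    return network[:i] + (network[i] + network[i + 1] - 1,) + network[i + 2:]

def split(network, rng):
    """Hypothetical transform: break one gate into two chained gates."""
    i = rng.randrange(len(network))
    n = network[i]
    if n < 3:
        return network
    a = rng.randrange(2, n)  # fan-in of the first half, 2..n-1
    return network[:i] + (a, n - a + 1) + network[i + 1:]

def iterative_improve(network, transforms, map_cost, iters=200, temp=1.0, seed=0):
    """Randomly propose a transform, map the result, accept probabilistically."""
    rng = random.Random(seed)
    current, current_cost = network, map_cost(network)
    best, best_cost = current, current_cost
    for _ in range(iters):
        candidate = rng.choice(transforms)(current, rng)
        cand_cost = map_cost(candidate)  # stand-in for invoking technology mapping
        delta = cand_cost - current_cost
        # Assumed rule: always accept improvements, accept regressions
        # with probability exp(-delta / temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost

if __name__ == "__main__":
    start = (3, 3, 3, 3, 3)
    best, cost = iterative_improve(start, [merge, split], lut_count)
    print("initial LUTs:", lut_count(start), "best network:", best, "LUTs:", cost)
```

Swapping `lut_count` for a delay estimate, or rejecting candidates that violate a required time, would correspond to the different objectives the abstract mentions. The key design point is that the acceptance decision uses the post-mapping metric rather than a pre-mapping proxy.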
The introduction of cache-coherent processor-logic interconnects in CPU-FPGA platforms promises low-latency communication between CPU and FPGA fabrics. This reduced latency improves the performance of heterogeneous systems implemented on such devices and gives rise to new software architectures that can better use the available hardware. Via an extended study accelerating the software task scheduler of a microkernel operating system, this paper reports on the potential for accelerating applications that exhibit fine-grained interactions. In doing so, we evaluate the performance of direct and cache-coherent communication methods for applications that involve frequent, low-bandwidth transactions between CPU and programmable logic. In the specific case we studied, we found that replacing a highly optimised software implementation of the task scheduler with an FPGA-based accelerator reduces the cost of communication between two software threads by 5.5%. We also found that, while hardware acceleration reduces cache footprint, we still observe execution time variability because of other non-deterministic features of the CPU.
This paper focuses on the FPGA-based acceleration of the Frequency-Domain Acceleration Search module, which is part of the SKA pulsar search engine. In this module, the frequency-domain input signals have to be processed by 85 finite impulse response (FIR) filters within a short time limit and for thousands of input arrays. Because of the large input length and FIR filter size, even high-end FPGA devices cannot parallelize the task completely. We start by investigating both the time-domain FIR filter (TDFIR) and the frequency-domain FIR filter (FDFIR) to tackle this task. We apply the overlap-add algorithm to split the coefficient array of the TDFIR and the overlap-save algorithm to split the input signals of the FDFIR. To enable rapid prototyping, we employ OpenCL, a high-level FPGA development technique. The performance and power consumption are evaluated using multiple FPGA devices simultaneously and compared with GPU results, which are obtained by porting the FPGA-based OpenCL kernels. The experimental evaluation shows that the FDFIR solution is very competitive in terms of performance, with a clear energy consumption advantage over the GPU solution.
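As a rough illustration of how overlap-save splits a long input into blocks for frequency-domain filtering, the following pure-Python sketch may help. It is not the paper's OpenCL kernel: the function names, the block size of 16, and the small recursive FFT are all assumptions chosen to keep the example self-contained, and a real design would use a much larger, hardware-friendly FFT.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def overlap_save(signal, taps, block=16):
    """Filter `signal` with FIR `taps` using FFTs over overlapping input blocks."""
    m = len(taps)
    step = block - (m - 1)  # new output samples produced per block
    assert step > 0 and block & (block - 1) == 0
    h_freq = fft(list(taps) + [0j] * (block - m))
    padded = [0j] * (m - 1) + list(signal)  # history for the first block
    out = []
    for start in range(0, len(signal), step):
        seg = padded[start:start + block]
        seg = seg + [0j] * (block - len(seg))  # zero-pad the final block
        y = fft([a * b for a, b in zip(fft(seg), h_freq)], inverse=True)
        # The first m-1 samples are corrupted by circular wrap-around; discard them.
        out.extend(v / block for v in y[m - 1:])
    return out[:len(signal)]

def direct_fir(signal, taps):
    """Reference time-domain FIR: y[n] = sum_k taps[k] * signal[n - k]."""
    return [sum(taps[k] * signal[n - k] for k in range(len(taps)) if n - k >= 0)
            for n in range(len(signal))]

if __name__ == "__main__":
    sig = [complex(i % 7 - 3, i % 3) for i in range(50)]
    h = [0.5, -0.25, 0.125, 1.0, 2.0]
    err = max(abs(a - b) for a, b in zip(overlap_save(sig, h), direct_fir(sig, h)))
    print("max abs error vs direct FIR:", err)
```

Each block reuses the last `len(taps) - 1` input samples of the previous block, so every block can be transformed independently. That independence is what makes the scheme attractive when the full input is too long to process in one pass.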
In this work, a computer-aided design (CAD) approach that unrolls loops for designs targeted to low-cost FPGAs is described. Our approach considers latency constraints in an effort to minimize energy consumption for loop-based computation. To reduce glitch power, a glitch filtering approach is introduced that provides a balance between glitch reduction and design performance. Glitch filter enable signals are generated and routed to the filters using resources best suited to the target FPGA. Our approach automatically inserts glitch filters and associated control logic into a design prior to processing with FPGA synthesis, place, and route tools. Our energy-saving loop unrolling approach has been evaluated using five benchmarks often used in low-cost FPGAs. The energy-saving capabilities of the approach have been evaluated for an Intel Cyclone IV and a Xilinx Artix-7 FPGA using both simulation-based power estimation and board-level power measurement. The use of unrolling and glitch filtering is shown to reduce energy by at least 65% for an Artix-7 device and 50% for a Cyclone IV device while meeting design latency constraints.