Web search engines deploy large-scale selection services on CPUs to identify a set of web pages that match user queries. An FPGA-based accelerator can exploit multiple levels of parallelism and provide a lower-latency, higher-throughput, more energy-efficient solution than commodity CPUs. However, maintaining such a customized accelerator in a commercial search engine is challenging because selection services change often. This paper presents our design for FlexSaaS (Flexible Selection as a Service), an FPGA-based accelerator for web search selection. To address efficiency and flexibility challenges, FlexSaaS abstracts computing models and separates memory access from computation. Specifically, FlexSaaS i) contains a reconfigurable number of matching processors that can handle various possible query plans, ii) decouples index stream reading from matching computation to fetch and decode index files, and iii) includes a universal memory accessor that hides the complex memory hierarchy and reduces host data access latency. Evaluated on FPGAs in the selection service of a commercial web search engine, FlexSaaS evolves quickly to accommodate service updates. Compared with the software baseline, FlexSaaS reduces average latency by 30% and increases throughput by 1.5× and energy efficiency by 9×.
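The core computation a selection service performs is conjunctive matching: finding documents whose posting lists contain every query term. As a point of reference for what FlexSaaS's matching processors accelerate in hardware, here is a minimal software analogue (a simplified sketch of the general technique, not the paper's design):

```python
def intersect_postings(posting_lists):
    """Conjunctive matching: return doc IDs that appear in every sorted
    posting list. A software analogue of the matching step a selection
    service performs; FlexSaaS implements this class of computation in
    hardware with reconfigurable matching processors."""
    result = set(posting_lists[0])
    for plist in posting_lists[1:]:
        result &= set(plist)
    return sorted(result)

# Query "fpga accelerator": docs containing both terms
# (hypothetical posting lists for illustration)
fpga_docs = [2, 5, 9, 14, 21]
accel_docs = [5, 9, 17, 21, 30]
print(intersect_postings([fpga_docs, accel_docs]))  # [5, 9, 21]
```

In a real engine the posting lists are compressed on disk, which is why FlexSaaS dedicates separate hardware to fetching and decoding index streams before matching begins.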
FPGAs are becoming more heterogeneous to better adapt to different markets, motivating rapid exploration of different blocks/tiles for FPGAs. To evaluate a new FPGA architectural idea, one should be able to accurately obtain the area, delay, and energy consumption of the block of interest. However, current FPGA circuit design tools can only model simple, homogeneous FPGA architectures with basic logic blocks and lack support for DSP and other heterogeneous blocks. Modern FPGAs are instead composed of many different tiles, some of which are designed in a full custom style and some of which mix standard cell and full custom styles. To fill this modelling gap, we introduce COFFE 2, an open-source FPGA design toolset for automatic FPGA circuit design. COFFE 2 uses a mix of full custom and standard cell flows and supports not only complex logic blocks with fracturable lookup tables and hard arithmetic but also arbitrary heterogeneous blocks. To validate COFFE 2 and demonstrate its features we design and evaluate a multi-mode Stratix III-like DSP block and several logic tiles with fracturable LUTs and hard arithmetic. We also demonstrate how COFFE 2's interface to VTR allows full evaluation of block-routing interfaces and various fracturable 6-LUT architectures.
Conventional homogeneous multicore processors are not able to provide the continued performance and energy improvements that we have come to expect in the past. FPGA-based heterogeneous architectures are considered a promising paradigm to resolve this issue. As a consequence, a variety of CPU-FPGA acceleration platforms with diversified microarchitectural features have been supplied by industry vendors. Such diversity, however, poses a serious challenge to application developers in selecting the appropriate platform for a specific application or application domain. This paper aims to address this challenge by determining which microarchitectural characteristics affect performance, and how. Specifically, we conduct a quantitative analysis of four state-of-the-art CPU-FPGA acceleration platforms: 1) the Alpha Data board that represents the traditional PCIe-based platform with private device memory, 2) IBM CAPI that represents the PCIe-based system with coherent shared memory, 3) Intel HARPv1 that represents the QPI-based system with coherent shared memory, and 4) HARPv2 that represents a hybrid PCIe (non-coherent) and QPI (coherent) based system with shared memory. Based on the analysis of their CPU-FPGA communication latency and bandwidth characteristics, we provide a series of insights for both application developers and platform designers.
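A useful intuition behind such latency/bandwidth characterizations is that every transfer pays a fixed startup latency before the payload moves at (up to) peak bandwidth, so small transfers achieve only a fraction of peak. A toy first-order model (illustrative only; the latency and bandwidth figures below are assumed, not the paper's measurements):

```python
def effective_bandwidth(payload_bytes, startup_latency_s, peak_bw_bytes_per_s):
    """First-order model of CPU-FPGA communication: a transfer pays a fixed
    startup latency plus payload / peak_bw, so effective bandwidth for small
    payloads falls far below the link's peak."""
    transfer_time = startup_latency_s + payload_bytes / peak_bw_bytes_per_s
    return payload_bytes / transfer_time

# Hypothetical PCIe-like link: ~1 us startup latency, 8 GB/s peak
for size in (4 * 1024, 1024 * 1024, 64 * 1024 * 1024):
    bw = effective_bandwidth(size, 1e-6, 8e9)
    print(f"{size:>9} B -> {bw / 1e9:.2f} GB/s")
```

This is exactly why coherent shared-memory interfaces (CAPI, HARP's QPI path), with their lower per-access latency, favor fine-grained communication, while high-peak-bandwidth PCIe DMA favors bulk transfers.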
Memory bandwidth is a crucial bottleneck that impedes datapath parallelization. An effective method in high-level synthesis for FPGAs is to map data elements of the same array to multiple on-chip memory banks using a memory partitioning technique. In this paper, we propose a three-step memory partitioning approach for parallel multi-pattern data access via data reuse. First, we combine the multiple access patterns into a single pattern to reduce the complexity of handling multiple patterns. Second, we perform data reuse analysis on the combined pattern to identify data reuse opportunities as well as the remaining non-reusable accesses. Finally, we propose an efficient memory partitioning algorithm with low complexity and low overhead to find the optimal bank mapping. Experimental results show that compared with the state-of-the-art method, our proposed approach reduces the required number of BRAMs by 58.9% on average, with average reductions of 79.6% in SLICEs, 85.3% in LUTs, 67.9% in Flip-Flops, 54.6% in DSP48Es, 83.9% in SRLs, 50.0% in storage overhead, 95.0% in execution time, and 77.3% in dynamic power consumption, while improving performance by 14.0%.
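The bank-mapping idea underlying memory partitioning can be illustrated with the simplest scheme, cyclic partitioning (a standard technique used here for illustration, not the paper's algorithm): element `i` lives in bank `i mod N`, and a set of simultaneous accesses is conflict-free when their offsets land in distinct banks.

```python
def cyclic_bank(addr, num_banks):
    """Cyclic partitioning: element at address addr goes to bank addr % num_banks."""
    return addr % num_banks

def conflict_free(offsets, num_banks):
    """Simultaneous accesses a[i + o] for o in offsets are conflict-free under
    cyclic partitioning iff the offsets map to pairwise-distinct banks."""
    banks = {cyclic_bank(o, num_banks) for o in offsets}
    return len(banks) == len(offsets)

# 3-point stencil a[i], a[i+1], a[i+2] read in the same cycle:
print(conflict_free([0, 1, 2], 3))  # True: three banks suffice
print(conflict_free([0, 2, 4], 2))  # False: all even offsets collide in bank 0
```

Finding the cheapest mapping for several overlapping patterns at once, and exploiting reuse so some accesses need no bank at all, is the harder problem the abstract's three-step approach addresses.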
Recent research on neural networks has shown great advantages over traditional algorithms based on handcrafted features and models in computer vision. Neural networks are now widely adopted in domains such as image, speech, and video recognition. However, the high computation and storage complexity of neural-network-based algorithms makes them difficult to deploy. CPU platforms can hardly offer enough computation capacity. GPU platforms are the first choice for neural network processing because of their high computation capacity and easy-to-use development frameworks. Meanwhile, FPGA-based neural network accelerators have become a research topic, because purpose-designed hardware is the next candidate to surpass GPUs in speed and energy efficiency. Various FPGA-based accelerator designs have been proposed with software and hardware optimization techniques to achieve high speed and energy efficiency. In this paper, we give an overview of previous work on FPGA-based neural network accelerators and summarize the main techniques used. We survey designs from software to hardware and from circuit level to system level, providing a complete analysis of FPGA-based neural network accelerator design and a guide for future work.
In this work, a computer-aided design (CAD) approach that unrolls loops for designs targeted to low-cost FPGAs is described. Our approach considers latency constraints in an effort to minimize energy consumption for loop-based computation. To reduce glitch power, a glitch filtering approach is introduced that balances glitch reduction against design performance. Glitch filter enable signals are generated and routed to the filters using resources best suited to the target FPGA. Our approach automatically inserts glitch filters and associated control logic into a design prior to processing with FPGA synthesis, place, and route tools. Our energy-saving loop unrolling approach has been evaluated using five benchmarks commonly run on low-cost FPGAs. The energy-saving capabilities of the approach have been evaluated for an Intel Cyclone IV and a Xilinx Artix-7 FPGA using both simulation-based power estimation and board-level power measurement. The use of unrolling and glitch filtering is shown to reduce energy by at least 65% for an Artix-7 device and 50% for a Cyclone IV device while meeting design latency constraints.
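The latency-constrained unroll selection at the heart of such a flow can be sketched with a deliberately simplified model (my own illustration, not the paper's cost model): unrolling by a factor U replicates U loop bodies, cutting the iteration count to ceil(N/U), so the tool picks the smallest U that still meets the latency budget, since larger U means more hardware and more switching energy.

```python
import math

def min_unroll_factor(trip_count, cycles_per_iter, clock_hz, latency_budget_s):
    """Simplified model: unrolling by U runs U loop bodies per iteration, so
    the loop takes ceil(trip_count / U) * cycles_per_iter cycles.  Return the
    smallest U meeting the latency budget (larger U costs more area/energy)."""
    for u in range(1, trip_count + 1):
        cycles = math.ceil(trip_count / u) * cycles_per_iter
        if cycles / clock_hz <= latency_budget_s:
            return u
    return None  # infeasible even when fully unrolled

# 1024 iterations, 4 cycles each, 100 MHz clock, 15 us budget
# (all figures hypothetical):
print(min_unroll_factor(1024, 4, 100e6, 15e-6))  # 3
```

The real flow must additionally account for the glitch power that grows with the wider combinational datapath, which is what the inserted glitch filters counteract.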
The primary motivation for this work is to demonstrate the use of low-end FPGAs in constrained environments such as edge computing and IoT, while still meeting the needs of high-end, computationally and memory-intensive CNN applications. We present a design methodology for deriving resource-efficient CNN implementations on low-end, resource-constrained FPGAs. The work includes several contributions and several CNN-related design optimizations. The first contribution is the derivation of a design that is platform independent and implemented on two different vendor platforms. The second contribution is a reduction of resource utilization for convolutions using signal flow graph optimization and re-organizing stride and processing order. Third, the design is modular, providing flexibility for multi-layer CNN implementations with different topologies. Fourth, data transfers are timed so that data arrives just as the active processing layer needs it, minimizing resource utilization. Finally, several insights are drawn from the designs to set a roadmap for future CNN optimizations on low-end FPGAs. We apply the methodology to AlexNet, a popular CNN for image classification that requires 700 million operations on 61 million parameters for every image. Our results demonstrate these contributions on the Zedboard and Cyclone V platforms.
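To see where operation counts like AlexNet's ~700 million come from, the standard cost formula for a convolutional layer is: every output element needs a k×k×in_channels dot product. A quick sketch applying it to AlexNet's first layer (using the well-known layer shape; the function name is my own):

```python
def conv_macs(out_h, out_w, out_c, k, in_c):
    """Multiply-accumulate count for one convolutional layer: each of the
    out_h * out_w * out_c output elements needs a k*k*in_c dot product."""
    return out_h * out_w * out_c * k * k * in_c

# AlexNet conv1: 227x227x3 input, 11x11 kernels, stride 4, 96 filters
# -> 55x55x96 outputs, roughly 105M MACs in this layer alone:
print(conv_macs(55, 55, 96, 11, 3))  # 105415200
```

Budgets of this size per layer are why reducing convolution resource usage (the abstract's second contribution) dominates the design effort on small FPGAs.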