Recent research on neural networks has shown great advantages in computer vision over traditional algorithms based on hand-crafted features and models. Neural networks are now widely adopted in fields such as image, speech, and video recognition. However, the high computation and storage complexity of neural network based algorithms poses great difficulty for their application. CPU platforms can hardly offer enough computation capacity. GPU platforms are the first choice for neural network processing because of their high computation capacity and easy-to-use development frameworks. Meanwhile, FPGA-based neural network accelerators are becoming a research topic, because specifically designed hardware is the next possible way to surpass GPUs in speed and energy efficiency. Various FPGA-based accelerator designs have been proposed with software and hardware optimization techniques to achieve high speed and energy efficiency. In this paper, we give an overview of previous work on FPGA-based neural network accelerators and summarize the main techniques used. An investigation from software to hardware, from circuit level to system level, is carried out to provide a complete analysis of FPGA-based neural network accelerator design and to serve as a guide for future work.
Loop pipelining is an essential technique in high-level synthesis (HLS) to increase the throughput and resource utilisation of FPGA-based accelerators. It relies on modulo schedulers to compute an operator schedule that allows subsequent loop iterations to overlap partially when executed, while still honouring all precedence and resource constraints. Modulo schedulers face a bi-criteria problem: minimise the initiation interval (II), i.e. the number of time steps after which new iterations are started, and minimise the schedule length. We present Moovac, a novel exact formulation that models all aspects of the modulo scheduling problem (including the II minimisation) as a single integer linear program (ILP), and discuss simple measures to prevent excessive runtimes, to challenge the old preconception that exact modulo scheduling is impractical. We substantiate this claim by conducting an experimental study covering 188 loops from two established HLS benchmark suites, four different time limits, and three bounds for the schedule length, to compare our approach against a highly tuned exact formulation and a state-of-the-art heuristic algorithm. In the fastest configuration, an accumulated runtime of under 35 minutes is spent on scheduling all loops, and proven optimal IIs are found for 175 test instances.
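To make the initiation-interval concept concrete, the following toy scheduler illustrates the modulo scheduling problem the abstract describes: it searches candidate IIs from small to large and, for each II, greedily places operations while checking precedence edges and a modulo reservation table. This is only an illustrative sketch under invented operation names and a single shared resource type; it is a greedy heuristic, not the Moovac ILP formulation.

```python
# Toy modulo scheduler: illustrates the initiation interval (II).
# Greedy sketch with an invented example, NOT the Moovac ILP from the paper.

def modulo_schedule(ops, deps, resource_limit, max_ii=16):
    """ops: {name: latency}, listed in topological order.
    deps: [(src, dst)] intra-iteration precedence edges.
    Every op is assumed to occupy the single shared resource for one cycle."""
    order = list(ops)
    for ii in range(1, max_ii + 1):
        slots = {}   # op -> start time step
        usage = {}   # (time step mod II) -> ops issued in that modulo slot
        ok = True
        for op in order:
            # earliest start honouring precedence constraints
            t = max((slots[s] + ops[s] for s, d in deps if d == op), default=0)
            # advance until a modulo slot with a free resource is found
            while usage.get(t % ii, 0) >= resource_limit:
                t += 1
                if t > 10 * max_ii:   # all modulo slots full: give up on this II
                    ok = False
                    break
            if not ok:
                break
            slots[op] = t
            usage[t % ii] = usage.get(t % ii, 0) + 1
        if ok:
            return ii, slots
    return None

# Four chained single-cycle ops competing for 2 resource units:
ops = {"load": 1, "mul": 1, "add": 1, "store": 1}
deps = [("load", "mul"), ("mul", "add"), ("add", "store")]
ii, schedule = modulo_schedule(ops, deps, resource_limit=2)
print(ii, schedule)   # II = 2: a new iteration can start every 2 cycles
```

With four ops and two resource units, the resource-constrained lower bound is ceil(4/2) = 2, and the sketch finds a feasible schedule at II = 2. An exact formulation such as Moovac instead proves optimality of both the II and the schedule length.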
We present a general automata processing framework on FPGAs, which can generate an RTL kernel for automata processing together with AXI- and PCIe-based I/O circuitry. We implement the framework, with novel features, on both local nodes and cloud platforms (Amazon AWS and Nimbix). A full performance comparison of the framework is conducted against state-of-the-art automata processing engines on CPUs, GPUs, and Micron's Automata Processor, using the ANMLZoo benchmark suite and several real-world datasets. Results show that FPGAs enable extremely high-throughput automata processing compared to von Neumann architectures. We also collect the resource utilization and power consumption on the two cloud platforms, and find that the I/O circuitry consumes most of the hardware resources and power. Furthermore, we propose a fast, symbol-only reconfiguration mechanism based on the framework for large pattern sets that cannot fit on one device and must be partitioned. The proposed method supports multiple passes of the input stream and reduces the re-compilation cost from hours to seconds.
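The execution model behind such spatial automata engines can be sketched in a few lines: every state matches a character class, all active states step in parallel, and one input symbol is consumed per cycle. The following is a minimal software model of that homogeneous-automaton semantics; the pattern, state encoding, and field names are invented for illustration and do not come from the framework described above.

```python
# Minimal model of homogeneous-NFA processing, the execution style that
# spatial automata engines (FPGA kernels, Micron's AP) implement in hardware.
# State layout and the example pattern are invented for illustration.

def run_automaton(states, stream):
    """states: list of dicts with 'match' (set of chars), 'next' (state ids),
    'start' (bool), 'accept' (bool). Returns input offsets of reports."""
    reports = []
    active = set()
    for pos, ch in enumerate(stream):
        # start states are candidates on every cycle ('all-input' start)
        candidates = active | {i for i, s in enumerate(states) if s["start"]}
        # all candidate states compare the symbol in parallel
        fired = {i for i in candidates if ch in states[i]["match"]}
        reports += [pos for i in fired if states[i]["accept"]]
        # activate successors of every state that fired
        active = set().union(*(states[i]["next"] for i in fired)) if fired else set()
    return reports

# The pattern /ab+c/ as a homogeneous automaton (one state per symbol class):
states = [
    {"match": set("a"), "next": {1},    "start": True,  "accept": False},
    {"match": set("b"), "next": {1, 2}, "start": False, "accept": False},
    {"match": set("c"), "next": set(),  "start": False, "accept": True},
]
print(run_automaton(states, "xabbc abc"))   # -> [4, 8]
```

On an FPGA, each state becomes a flip-flop plus symbol-match logic, so the per-symbol step above costs one clock cycle regardless of how many states are active, which is the source of the throughput advantage over von Neumann architectures.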
NPN classification of Boolean functions is a powerful technique used in many logic synthesis and technology mapping tools in both standard cell and FPGA design flows. Computing the canonical form is the most common approach to Boolean function classification. This paper proposes two different hybrid NPN canonical forms and a new algorithm to compute them. By exploiting symmetries under different phase assignments as well as higher-order symmetries, the search space of NPN canonical form computation is pruned and the runtime is dramatically reduced. Nevertheless, the runtime for some difficult functions remains high. For such functions, a fast heuristic method can be used to compute semi-canonical forms in a reasonable time. The proposed algorithm can be adjusted to be a slow exact algorithm or a fast heuristic algorithm with lower quality. For exact NPN classification, the proposed algorithm is 40× faster than the state of the art. For heuristic classification, it performs comparably to the state of the art, with the possibility of trading runtime for quality.
The primary motivation for this work is to demonstrate the use of low-end FPGAs in constrained environments such as edge computing and IoT, while still meeting the needs of computationally and memory-intensive high-end CNN applications. We present a design methodology for deriving resource-efficient CNN implementations on low-end, resource-constrained FPGAs. The work includes several contributions and CNN-related design optimizations. The first contribution is a platform-independent design, implemented on two different vendor platforms. The second is a reduction in resource utilization for convolutions through signal flow graph optimization and re-organization of stride and processing order. Third, the design is modular, providing flexibility for multi-layer CNN implementations with different topologies. Fourth, data transfers are timed so that data reaches the active processing layer exactly when needed, minimizing resource utilization. Finally, several insights are drawn from the designs to set a roadmap for future CNN optimizations on low-end FPGAs. We apply the methodology to AlexNet, a popular CNN for image classification, which performs roughly 700 million operations per image using 61 million parameters. Our results demonstrate the claimed contributions on the Zedboard and Cyclone V platforms.
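The quoted 61-million-parameter figure follows directly from AlexNet's layer shapes, and reproducing it shows why the fully-connected layers, not the convolutions, dominate the memory budget on a small FPGA. The arithmetic below uses the grouped-convolution layer dimensions from the original AlexNet network; it is an independent back-of-the-envelope check, not a computation from the paper.

```python
# Back-of-the-envelope check of AlexNet's parameter count, using the
# grouped-convolution layer shapes of the original network.
# Each conv entry: (out_channels, kernel_h, kernel_w, in_channels_per_group)
conv = [(96, 11, 11, 3),     # conv1
        (256, 5, 5, 48),     # conv2 (2 groups)
        (384, 3, 3, 256),    # conv3
        (384, 3, 3, 192),    # conv4 (2 groups)
        (256, 3, 3, 192)]    # conv5 (2 groups)
fc = [(4096, 9216), (4096, 4096), (1000, 4096)]   # fc6, fc7, fc8

conv_params = sum(o * kh * kw + o // (kh * kw * i * o // (kh * kw * i * o))
                  for o, kh, kw, i in []) if False else \
              sum(o * kh * kw * i + o for o, kh, kw, i in conv)  # weights + biases
fc_params = sum(o * i + o for o, i in fc)                        # weights + biases
params = conv_params + fc_params
print(conv_params, fc_params, params)   # 2334080 58631144 60965224
```

Roughly 96% of the ~61 million parameters sit in the fully-connected layers, while the convolutions dominate the 700 million operations, which is why timing data delivery to the active layer matters so much on a resource-constrained device.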