ACM Transactions on Reconfigurable Technology and Systems (TRETS)

Latest Articles

High-Efficiency Convolutional Ternary Neural Networks with Custom Adder Trees and Weight Compression

Although performing inference with artificial neural networks (ANN) was until quite recently...

FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks

Convolutional Neural Networks have rapidly become the most successful machine-learning algorithm, enabling ubiquitous machine vision and intelligent decisions on even embedded computing systems. While the underlying arithmetic is structurally simple, compute and memory requirements are challenging....

Lightening the Load with Highly Accurate Storage- and Energy-Efficient LightNNs

Hardware implementations of deep neural networks (DNNs) have been adopted in many systems because of their higher classification speed. However, while...

NEURAghe: Exploiting CPU-FPGA Synergies for Efficient and Flexible CNN Inference Acceleration on Zynq SoCs

Deep convolutional neural networks (CNNs) obtain outstanding results in tasks that require human-level understanding of data, like image or speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a...

You Cannot Improve What You Do not Measure: FPGA vs. ASIC Efficiency Gaps for Convolutional Neural Network Inference

Recently, deep learning (DL) has become best-in-class for numerous applications but at a high computational cost that necessitates high-performance energy-efficient acceleration. The reconfigurability of FPGAs is appealing due to the rapid change in DL models but also causes lower performance and area-efficiency compared to ASICs. In this article,...

ReDCrypt: Real-Time Privacy-Preserving Deep Learning Inference in Clouds Using FPGAs

Artificial Intelligence (AI) is increasingly incorporated into the cloud business in order to improve the functionality (e.g., accuracy) of the service. The adoption of AI as a cloud service raises serious privacy concerns in applications where the risk of data leakage is not acceptable. Examples of such applications include scenarios where clients...

Instruction Driven Cross-layer CNN Accelerator for Fast Detection on FPGA

In recent years, Convolutional Neural Networks (CNNs) have been widely applied in computer vision and have achieved significant improvements in object...

NEWS

Call for Editor-In-Chief Nominations

Nominations, including self-nominations, are invited for a three-year term as TRETS EiC, beginning on March 1, 2019. 

The deadline for submitting nominations is 15 October 2018, although nominations will continue to be accepted until the position is filled.

------------------------------------------

Special Issue on Security in FPGA-Accelerated Cloud and Datacenters

(deadline extended to Oct 31, 2018)

http://trets.acm.org/call-for-papers.cfm

------------------------------------------

2017 TRETS Best Paper Award Winner:

We are pleased to announce the winner of the 2017 TRETS Best Paper:

Hoplite: A Deflection-Routed Directional Torus NoC for FPGAs

by Nachiket Kapre and Jan Gray

------------------------------------------

New Page Limit:

Effective April 12, 2018, the page limit for TRETS submissions has increased from 22 to 28 pages.

-----------------------------------------------

 

Forthcoming Articles

Optimizing CNN-based Segmentation with Deeply Customized Convolutional and Deconvolutional Architectures on FPGA

Convolutional Neural Network (CNN) based algorithms have been successful in solving image recognition problems, showing very large accuracy improvements. In recent years, deconvolution layers have been widely used in state-of-the-art CNNs for end-to-end training and in models that support tasks such as image segmentation and super resolution. However, deconvolution algorithms are computationally intensive, which limits their applicability to real-time applications. In particular, there has been little research on efficient implementations of deconvolution algorithms on FPGA platforms. In this work, we propose and develop highly optimized and efficient FPGA architectures for both deconvolution and CNN algorithms. A non-linear optimization model based on the performance model is introduced to efficiently explore the design space of the CNN accelerator in order to achieve optimal system throughput and improve power efficiency. Finally, we implement our designs on a Xilinx Zynq ZC706 board; the deconvolution accelerator achieves a performance of 90.1 GOPS at a 200 MHz working frequency and a performance density of 0.10 GOPS/DSP using 32-bit quantization. Our CNN accelerator achieves a performance of 107 GOPS and 0.12 GOPS/DSP using 16-bit quantization, and supports up to 17 frames per second for 512x512 image segmentation with a power consumption of only 9.6 W.

In-depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms

Conventional homogeneous multicore processors are no longer able to provide the continued performance and energy improvements that we have come to expect. FPGA-based heterogeneous architectures are considered a promising paradigm to resolve this issue. As a consequence, a variety of CPU-FPGA acceleration platforms with diversified microarchitectural features have been supplied by industry vendors. Such diversity, however, poses a serious challenge to application developers in selecting the appropriate platform for a specific application or application domain. This paper aims to address this challenge by finding out which microarchitectural characteristics affect performance, and how they affect it. Specifically, we conduct a quantitative analysis on four state-of-the-art CPU-FPGA acceleration platforms: 1) the Alpha Data board that represents the traditional PCIe-based platform with private device memory, 2) IBM CAPI that represents the PCIe-based system with coherent shared memory, 3) Intel HARPv1 that represents the QPI-based system with coherent shared memory, and 4) HARPv2 that represents a hybrid PCIe (non-coherent) and QPI (coherent) based system with shared memory. Based on the analysis of their CPU-FPGA communication latency and bandwidth characteristics, we provide a series of insights for both application developers and platform designers.

A Survey of FPGA Based Neural Network Inference Accelerator

Recent research on neural networks has shown great advantages in computer vision over traditional algorithms based on handcrafted features and models. Neural networks are now widely adopted in areas like image, speech, and video recognition. But the great computation and storage complexity of neural network based algorithms makes them difficult to deploy. CPU platforms struggle to offer enough computation capacity. GPU platforms are the first choice for neural network processing because of their high computation capacity and easy-to-use development frameworks. On the other hand, FPGA-based neural network accelerators are becoming a research topic, because specifically designed hardware is the next possible solution to surpass GPUs in speed and energy efficiency. Various FPGA-based accelerator designs have been proposed with software and hardware optimization techniques to achieve high speed and energy efficiency. In this paper, we give an overview of previous work on neural network accelerators based on FPGAs and summarize the main techniques used. An investigation from software to hardware, and from circuit level to system level, is carried out to give a complete analysis of FPGA-based neural network accelerator design and to serve as a guide for future work.

PIMap: A Flexible Framework for Improving LUT-Based Technology Mapping via Parallelized Iterative Optimization

Modern FPGA synthesis tools typically apply a predetermined sequence of logic optimizations on the input logic network before carrying out technology mapping. While the "known recipes" of logic transformations often lead to improved mapping results, there remains a nontrivial gap between the quality metrics driving the pre-mapping logic optimizations and those targeted by the actual technology mapping. Needless to say, such miscorrelations eventually result in suboptimal quality of results. We propose PIMap, which couples logic transformations and technology mapping under an iterative improvement framework for LUT-based FPGAs. In each iteration, PIMap randomly proposes a transformation on the given logic network from an ensemble of candidate optimizations; it then invokes technology mapping and makes use of the mapping result to determine the likelihood of accepting the proposed transformation. By adjusting the optimization objective and incorporating required time constraints during the iterative process, PIMap can flexibly optimize for different objectives including area minimization, delay optimization, and delay-constrained area reduction. To mitigate the runtime overhead, we further introduce parallelization techniques to decompose a large design into multiple smaller sub-netlists that can be optimized simultaneously. Experimental results show that PIMap achieves promising quality improvement over a set of commonly used benchmarks.
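
The propose, map, and accept-or-reject loop described in this abstract can be illustrated with a small Python sketch. It is not PIMap's implementation: the abstract does not state the acceptance rule, so a Metropolis-style rule is assumed here, and the names `iterative_improve`, `mapped_cost`, `shrink`, and `grow` are hypothetical. The "network" is a toy tuple of per-block LUT counts, and `sum` stands in for a real technology-mapping run.

```python
import math
import random

def iterative_improve(network, transforms, mapped_cost,
                      iters=1000, temperature=0.5, seed=0):
    """Propose random transformations; accept each based on the mapped cost."""
    rng = random.Random(seed)
    current, current_cost = network, mapped_cost(network)
    best, best_cost = current, current_cost
    for _ in range(iters):
        transform = rng.choice(transforms)   # pick one candidate optimization
        candidate = transform(current, rng)
        cost = mapped_cost(candidate)        # stands in for technology mapping
        delta = cost - current_cost
        # Assumed Metropolis rule: always accept improvements,
        # accept regressions with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = candidate, cost
    return best, best_cost

# Toy stand-ins (hypothetical): each transform edits one block's LUT count.
def shrink(net, rng):
    i = rng.randrange(len(net))
    return net[:i] + (max(1, net[i] - 1),) + net[i + 1:]

def grow(net, rng):
    i = rng.randrange(len(net))
    return net[:i] + (net[i] + 1,) + net[i + 1:]

start = (8, 5, 9, 4)
best, best_cost = iterative_improve(start, [shrink, grow], mapped_cost=sum)
```

Occasionally accepting a worse candidate is what lets such a loop escape local minima; tracking `best` separately means the returned result is never worse than the starting point.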

Efficient fine-grained processor-logic interactions on the cache-coherent Zynq platform

The introduction of cache-coherent processor-logic interconnects in CPU-FPGA platforms promises low-latency communication between CPU and FPGA fabrics. This reduced latency improves the performance of heterogeneous systems implemented on such devices and gives rise to new software architectures that can better use the available hardware. Via an extended study accelerating the software task scheduler of a microkernel operating system, this paper reports on the potential for accelerating applications that exhibit fine-grained interactions. In doing so, we evaluate the performance of direct and cache-coherent communication methods for applications that involve frequent, low-bandwidth transactions between CPU and programmable logic. In the specific case we studied, we found that replacing a highly optimised software implementation of the task scheduler with an FPGA-based accelerator reduces the cost of communication between two software threads by 5.5%. We also found that, while hardware acceleration reduces cache footprint, we still observe execution time variability because of other non-deterministic features of the CPU.

FPGA-based Acceleration of FT Convolution for Pulsar Search Using OpenCL

This paper focuses on the FPGA-based acceleration of the Frequency-Domain Acceleration Search module, which is part of the SKA pulsar search engine. In this module, the frequency-domain input signals have to be processed by 85 Finite Impulse Response (FIR) filters within a short time limit and for thousands of input arrays. Because of the large input length and FIR filter size, even high-end FPGA devices cannot parallelize the task completely. We start by investigating both the time-domain FIR filter (TDFIR) and the frequency-domain FIR filter (FDFIR) to tackle this task. We applied the overlap-add algorithm to split the coefficient array of TDFIR and the overlap-save algorithm to split the input signals of FDFIR. To enable fast prototyping, we employed OpenCL, which is a high-level FPGA development technique. The performance and power consumption are evaluated using multiple FPGA devices simultaneously and compared with GPU results, which are obtained by porting the FPGA-based OpenCL kernels. The experimental evaluation shows that the FDFIR solution is very competitive in terms of performance, with a clear energy consumption advantage over the GPU solution.
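
For readers unfamiliar with overlap-save, the input-splitting idea behind a frequency-domain FIR filter can be sketched in a few lines of NumPy. This is a generic textbook formulation, not the paper's OpenCL kernel; the function name `overlap_save` and the default `fft_size` are assumptions made for illustration.

```python
import numpy as np

def overlap_save(x, h, fft_size=64):
    """FIR-filter x with taps h via overlap-save; matches np.convolve(x, h)."""
    x = np.asarray(x)
    h = np.asarray(h)
    m = len(h)
    step = fft_size - (m - 1)        # valid output samples per block
    if step <= 0:
        raise ValueError("fft_size must be larger than len(h) - 1")
    n_out = len(x) + m - 1           # length of the full linear convolution
    n_blocks = -(-n_out // step)     # ceiling division
    # m-1 leading zeros act as initial history; trailing zeros flush the filter.
    padded = np.concatenate(
        [np.zeros(m - 1), x, np.zeros(n_blocks * step - len(x))]
    )
    H = np.fft.fft(h, fft_size)      # filter spectrum, computed once
    y = np.empty(n_blocks * step, dtype=complex)
    for b in range(n_blocks):
        # Consecutive blocks overlap by m-1 input samples.
        seg = padded[b * step : b * step + fft_size]
        block = np.fft.ifft(np.fft.fft(seg) * H)
        # The first m-1 samples are corrupted by circular wrap-around; drop them.
        y[b * step : (b + 1) * step] = block[m - 1 :]
    y = y[:n_out]
    return y if np.iscomplexobj(x) or np.iscomplexobj(h) else y.real
```

Each block costs one forward and one inverse FFT of fixed size, so a long input can be streamed through hardware that only implements an `fft_size`-point transform. The result can be checked against `np.convolve(x, h)` for any input.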

Loop Unrolling for Energy Efficiency in Low-Cost FPGAs

In this work, a computer-aided design (CAD) approach that unrolls loops for designs targeted to low-cost FPGAs is described. Our approach considers latency constraints in an effort to minimize energy consumption for loop-based computation. To reduce glitch power, a glitch filtering approach is introduced that provides a balance between glitch reduction and design performance. Glitch filter enable signals are generated and routed to the filters using resources best suited to the target FPGA. Our approach automatically inserts glitch filters and associated control logic into a design prior to processing with FPGA synthesis, place, and route tools. Our energy-saving loop unrolling approach has been evaluated using five benchmarks often used in low-cost FPGAs. The energy-saving capabilities of the approach have been evaluated for an Intel Cyclone IV and a Xilinx Artix-7 FPGA using both simulation-based power estimation and board-level power measurement. The use of unrolling and glitch filtering is shown to reduce energy by at least 65% for an Artix-7 device and 50% for a Cyclone IV device while meeting design latency constraints.

Introduction to the Special Section on Deep Learning in FPGAs
