
ACM Transactions on

Reconfigurable Technology and Systems (TRETS)

Latest Articles

Framework for Rapid Performance Estimation of Embedded Soft Core Processors

The large number of embedded soft core processors available today makes it tedious and time-consuming to select the best processor for a given... (more)

Preemption of the Partial Reconfiguration Process to Enable Real-Time Computing With FPGAs

To improve computing performance in real-time applications, modern embedded platforms comprise hardware accelerators that speed up the task’s... (more)

Wotan: Evaluating FPGA Architecture Routability without Benchmarks

FPGA routing architectures consist of routing wires and programmable switches that together account for the majority of the fabric delay and area, making evaluation and optimization of an FPGA’s routing architecture very important. Routing architectures have traditionally been evaluated using a full synthesis, pack, place, and route CAD... (more)

Reconfigurable Hardware Architecture for Authenticated Key Agreement Protocol Over Binary Edwards Curve

In this article, we present a high-performance hardware architecture for Elliptic curve based... (more)

Automated Synthesis of Streaming Transfer Level Hardware Designs

As modern field-programmable gate arrays (FPGA) enable high computing performance and efficiency, their programming with low-level hardware... (more)

NEWS

Call for Editor-In-Chief Nominations

Nominations, including self-nominations, are invited for a three-year term as TRETS EiC, beginning on March 1, 2019. 

The deadline for submitting nominations is 15 October 2018, although nominations will continue to be accepted until the position is filled.


------------------------------------------

Special Issue on Security in FPGA-Accelerated Cloud and Datacenters

(deadline extended to Oct 31, 2018)

http://trets.acm.org/call-for-papers.cfm

------------------------------------------

2017 TRETS Best Paper Award Winner:

We are pleased to announce the winner of the 2017 TRETS Best Paper:

Hoplite: A Deflection-Routed Directional Torus NoC for FPGAs

by Nachiket Kapre and Jan Gray

------------------------------------------

New Page Limit:

Effective April 12, 2018, the page limit for TRETS submissions has increased from 22 to 28 pages.

-----------------------------------------------

 

Forthcoming Articles
High-Efficiency Convolutional Ternary Neural Networks with Custom Adder Trees and Weight Compression

Although performing inference with artificial neural networks (ANNs) was until quite recently considered essentially compute intensive, the emergence of deep neural networks, coupled with the evolution of integration technology, has transformed inference into a memory-bound problem. Given this observation, many recent works have focused on minimizing memory accesses, either by enforcing and exploiting sparsity in the weights or by using few bits to represent activations and weights, so that ANN inference can be used in embedded devices. In this work, we detail an architecture dedicated to inference using ternary {-1, 0, 1} weights and activations. This architecture is configurable at design time to provide a choice of throughput vs. power trade-offs. It is also generic in the sense that it uses information drawn from the target technology (memory geometries and cost, number of available cuts, etc.) to adapt as well as possible to the FPGA resources. This allows the design to achieve up to 5.2k fps per Watt for classification on a VC709 board, using approximately half of the FPGA's resources.
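As an illustrative aside on the ternary arithmetic this abstract describes, the Python sketch below shows how a dot product over {-1, 0, 1} weights reduces to additions and subtractions only, which is why adder trees rather than multipliers suffice; the function name and data are illustrative, not taken from the paper.

    import numpy as np

    def ternary_dot(activations, weights):
        """Dot product for ternary weights: no multipliers, only adds/subtracts."""
        assert set(np.unique(weights)).issubset({-1, 0, 1})
        acc = 0
        for a, w in zip(activations, weights):
            if w == 1:
                acc += a          # +1 weight: add the activation
            elif w == -1:
                acc -= a          # -1 weight: subtract the activation
            # w == 0: contributes nothing, and such weights compress well in memory
        return acc

    x = np.array([3, -2, 5, 1])
    w = np.array([1, 0, -1, 1])
    print(ternary_dot(x, w))  # 3 - 5 + 1 = -1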

Lightening the Load with Highly Accurate Storage- and Energy-Efficient LightNNs

Hardware implementations of Deep Neural Networks (DNNs) have been adopted in many systems because of their higher classification speed. However, while larger DNNs may offer better accuracy, they require significant energy and area, thereby limiting their wide adoption. The energy consumption of DNNs is driven by both memory accesses and computation. Binarized Neural Networks (BNNs) show poor accuracy when a smaller DNN configuration is adopted. In this paper, we propose a new DNN architecture, LightNN, which replaces multiplications with one shift or a constrained number of shifts and adds. Our theoretical analysis of LightNNs shows that their accuracy is maintained while their storage and energy requirements are dramatically reduced. For a fixed DNN configuration, LightNNs have better accuracy than BNNs at a slight energy increase, yet are more energy efficient than conventional DNNs with only slightly lower accuracy. Therefore, LightNNs provide more options for hardware designers to trade off accuracy and energy. These conclusions are verified by experiments using the MNIST and CIFAR-10 datasets for different DNN configurations. Our FPGA implementations of conventional DNNs and LightNNs confirm all theoretical and simulation results and show that LightNNs reduce latency and use fewer FPGA resources than conventional DNN architectures.
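A minimal Python sketch of the shift-and-add idea, assuming weights are approximated as signed sums of powers of two (the abstract's "one shift or a constrained number of shifts and adds"); the helper is illustrative, not the paper's implementation.

    def shift_add_multiply(x, shift_exponents, signs):
        """Compute x * w where w = sum(sign * 2**e), using only shifts and adds."""
        acc = 0
        for e, s in zip(shift_exponents, signs):
            acc += s * (x << e)   # one shift plus one add per term
        return acc

    # Example: w = 2**3 - 2**0 = 7, so x * 7 costs two shifts and two adds.
    print(shift_add_multiply(5, [3, 0], [+1, -1]))  # 5 * 7 = 35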

Optimizing CNN-based Segmentation with Deeply Customized Convolutional and Deconvolutional Architectures on FPGA

Convolutional Neural Network (CNN) based algorithms have been successful in solving image recognition problems, showing very large accuracy improvements. In recent years, deconvolution layers have been widely used in state-of-the-art CNNs for end-to-end training and in models supporting tasks such as image segmentation and super resolution. However, deconvolution algorithms are computationally intensive, which limits their applicability to real-time applications. In particular, there has been little research on efficient implementations of deconvolution algorithms on FPGA platforms. In this work, we propose and develop highly optimized and efficient FPGA architectures for both deconvolution and CNN algorithms. A non-linear optimization model based on the performance model is introduced to efficiently explore the design space of the CNN accelerator, in order to achieve optimal system throughput and improve power efficiency. Finally, we implement our designs on a Xilinx Zynq ZC706 board; the deconvolution accelerator achieves a performance of 90.1 GOPS at a 200MHz working frequency and a performance density of 0.10 GOPS/DSP using 32-bit quantization. Our CNN accelerator achieves a performance of 107 GOPS and 0.12 GOPS/DSP using 16-bit quantization, and supports up to 17 frames per second for 512x512 image segmentation with a power consumption of only 9.6W.
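For readers unfamiliar with deconvolution (transposed convolution) layers, the Python sketch below shows the standard zero-insertion view of a 1-D stride-2 deconvolution; it is a generic illustration of the operation, not the paper's FPGA architecture.

    import numpy as np

    def deconv1d(x, kernel, stride=2):
        # Insert (stride - 1) zeros between input samples ("upsampling"),
        # then run an ordinary convolution over the expanded signal.
        up = np.zeros(len(x) * stride - (stride - 1))
        up[::stride] = x
        return np.convolve(up, kernel, mode="full")

    print(deconv1d(np.array([1.0, 2.0, 3.0]), np.array([1.0, 0.5])))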

A Survey of FPGA Based Neural Network Inference Accelerator

Recent research on neural networks has shown great advantages in computer vision over traditional algorithms based on handcrafted features and models. Neural networks are now widely adopted in areas such as image, speech, and video recognition. However, the high computation and storage complexity of neural network based algorithms poses great difficulty for their application. CPU platforms can hardly offer enough computation capacity. GPU platforms are the first choice for neural network processing because of their high computation capacity and easy-to-use development frameworks. FPGA based neural network accelerators, on the other hand, are becoming a research topic, because purpose-designed hardware is the next possible way to surpass GPUs in speed and energy efficiency. Various FPGA based accelerator designs have been proposed, using software and hardware optimization techniques to achieve high speed and energy efficiency. In this paper, we give an overview of previous work on FPGA based neural network accelerators and summarize the main techniques used. The investigation ranges from software to hardware and from circuit level to system level, providing a complete analysis of FPGA based neural network accelerator design and serving as a guide for future work.

[DL] Instruction Driven Cross-layer CNN Accelerator For Fast Detection on FPGA

In recent years, Convolutional Neural Networks (CNNs) have been widely applied in computer vision and have achieved significant improvements in object detection tasks. FPGAs have been widely explored to accelerate CNNs due to their high performance, high energy efficiency, and flexibility. By fusing multiple CNN layers, intermediate data transfer can be reduced. With a faster algorithm using the Winograd transformation, the computation of convolution can be accelerated further. However, previous accelerators with cross-layer scheduling or the Winograd algorithm are designed for a particular CNN model and lack flexibility. In this work, we design an instruction driven CNN accelerator supporting the Winograd algorithm and cross-layer scheduling for object detection. We first modify the cross-layer loop unrolling order to extract basic operations as instructions, and then improve the on-chip memory architecture for a higher utilization rate in the Winograd computation. We evaluate the hardware architecture and scheduling policy on a Xilinx KU115 FPGA platform. Intermediate data transfer is reduced by over 90% on the VGG-D CNN model with the cross-layer policy. The performance of our hardware accelerator reaches 1700 GOP/s. Beyond classification tasks, we implement a framework for object detection algorithms, achieving 2.3x and 50x improvements in power efficiency compared to GPU and CPU, respectively.
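As a brief aside on the Winograd transformation mentioned above, the Python sketch below shows the classic F(2,3) form, which produces two outputs of a 3-tap convolution with four multiplications instead of six; this is textbook material, not the accelerator's actual datapath.

    def winograd_f23(d, g):
        """d: 4 input samples, g: 3 filter taps -> 2 convolution outputs."""
        m1 = (d[0] - d[2]) * g[0]
        m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
        m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
        m4 = (d[1] - d[3]) * g[2]
        return [m1 + m2 + m3, m2 - m3 - m4]

    d, g = [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0]
    print(winograd_f23(d, g))  # [14.0, 20.0]
    # Reference: direct 3-tap convolution gives the same two outputs.
    print([d[0]*g[0] + d[1]*g[1] + d[2]*g[2], d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])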

PIMap: A Flexible Framework for Improving LUT-Based Technology Mapping via Parallelized Iterative Optimization

Modern FPGA synthesis tools typically apply a predetermined sequence of logic optimizations on the input logic network before carrying out technology mapping. While the "known recipes" of logic transformations often lead to improved mapping results, there remains a nontrivial gap between the quality metrics driving the pre-mapping logic optimizations and those targeted by the actual technology mapping. Needless to say, such miscorrelations eventually result in suboptimal quality of results. We propose PIMap, which couples logic transformations and technology mapping under an iterative improvement framework for LUT-based FPGAs. In each iteration, PIMap randomly proposes a transformation on the given logic network from an ensemble of candidate optimizations; it then invokes technology mapping and uses the mapping result to determine the likelihood of accepting the proposed transformation. By adjusting the optimization objective and incorporating the required timing constraints during the iterative process, PIMap can flexibly optimize for different objectives, including area minimization, delay optimization, and delay-constrained area reduction. To mitigate the runtime overhead, we further introduce parallelization techniques that decompose a large design into multiple smaller sub-netlists that can be optimized simultaneously. Experimental results show that PIMap achieves promising quality improvements over a set of commonly used benchmarks.
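A hedged Python sketch of the iterative accept/reject loop the abstract outlines, using a simulated-annealing-style acceptance rule as one plausible reading; propose_transform, tech_map, and cost are hypothetical stand-ins, not PIMap's actual interfaces.

    import math, random

    def pimap_style_loop(netlist, propose_transform, tech_map, cost,
                         iterations=1000, temperature=1.0, cooling=0.995):
        # propose_transform: returns a randomly transformed copy of the logic network
        # tech_map:          runs LUT mapping and returns the mapped netlist
        # cost:              quality metric of the mapped result (area, delay, ...)
        current = netlist
        current_cost = cost(tech_map(current))
        best, best_cost = current, current_cost
        for _ in range(iterations):
            candidate = propose_transform(current)
            cand_cost = cost(tech_map(candidate))      # mapping result drives acceptance
            accept = (cand_cost <= current_cost or
                      random.random() < math.exp((current_cost - cand_cost) / temperature))
            if accept:
                current, current_cost = candidate, cand_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
            temperature *= cooling                     # make regressions rarer over time
        return best, best_cost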

You Can't Improve What You Don't Measure: FPGA vs. ASIC Efficiency Gaps for Convolutional Neural Network Inference

Recently, Deep Learning (DL) has become best-in-class for numerous applications but at a high computational cost that necessitates high-performance energy-efficient acceleration. The reconfigurability of FPGAs is appealing due to the rapid change in DL models but also causes lower performance and area-efficiency compared to ASICs. In this paper, we implement three state-of-the-art Computing Architectures (CAs) for Convolutional Neural Network (CNN) inference on FPGAs and ASICs. By comparing the FPGA and ASIC implementations, we highlight the area and performance costs of programmability to pinpoint the inefficiencies in current FPGA architectures. We perform our experiments using three variations of these CAs: AlexNet, VGG-16 and ResNet-50 to allow extensive comparisons. We find that the performance gap varies significantly from 3.8x to 9.4x while the area gap is consistent across CAs with an average FPGA-to-ASIC area ratio of 9.8. Among different blocks of the CAs, the convolution engine, constituting 32% to 61% of the total area, has a high area ratio ranging from 16 to 40. Motivated by our FPGA vs. ASIC comparisons, we suggest FPGA architectural changes such as increasing DSP block count, enhancing low-precision support in DSP blocks and rethinking the on-chip memories to reduce the programmability gap for DL applications.

Efficient fine-grained processor-logic interactions on the cache-coherent Zynq platform

The introduction of cache-coherent processor-logic interconnects in CPU-FPGA platforms promises low-latency communication between CPU and FPGA fabrics. This reduced latency improves the performance of heterogeneous systems implemented on such devices and gives rise to new software architectures that can better use the available hardware. Via an extended study accelerating the software task scheduler of a microkernel operating system, this paper reports on the potential for accelerating applications that exhibit fine-grained interactions. In doing so, we evaluate the performance of direct and cache-coherent communication methods for applications that involve frequent, low-bandwidth transactions between CPU and programmable logic. In the specific case we studied, we found that replacing a highly optimised software implementation of the task scheduler with an FPGA-based accelerator reduces the cost of communication between two software threads by 5.5%. We also found that, while hardware acceleration reduces cache footprint, we still observe execution time variability because of other non-deterministic features of the CPU.

FPGA-based Acceleration of FT Convolution for Pulsar Search Using OpenCL

This paper focuses on the FPGA-based acceleration of the Frequency-Domain Acceleration Search module, which is part of the SKA pulsar search engine. In this module, the frequency-domain input signals have to be processed by 85 Finite Impulse Response (FIR) filters within a tight time limit, for thousands of input arrays. Because of the large input length and FIR filter size, even high-end FPGA devices cannot fully parallelize the task. We start by investigating both time-domain FIR filters (TDFIR) and frequency-domain FIR filters (FDFIR) to tackle this task. We apply the overlap-add algorithm to split the coefficient array of the TDFIR and the overlap-save algorithm to split the input signals of the FDFIR. For fast prototyping, we employ OpenCL, a high-level FPGA development technique. Performance and power consumption are evaluated using multiple FPGA devices simultaneously and compared with GPU results obtained by porting the FPGA-based OpenCL kernels. The experimental evaluation shows that the FDFIR solution is very competitive in terms of performance, with a clear energy consumption advantage over the GPU solution.
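The overlap-save technique mentioned above can be sketched in a few lines of NumPy: a long input is filtered block by block in the frequency domain, and the wrapped-around prefix of each block is discarded. This is a generic textbook formulation, not the paper's OpenCL kernel.

    import numpy as np

    def overlap_save_fir(x, h, fft_len=4096):
        """Filter x with taps h using block-wise FFT convolution (overlap-save)."""
        m = len(h)
        step = fft_len - (m - 1)                      # new samples consumed per block
        H = np.fft.rfft(h, fft_len)
        x_padded = np.concatenate([np.zeros(m - 1), x])
        out = []
        for start in range(0, len(x), step):
            block = x_padded[start:start + fft_len]
            if len(block) < fft_len:
                block = np.pad(block, (0, fft_len - len(block)))
            y = np.fft.irfft(np.fft.rfft(block) * H, fft_len)
            out.append(y[m - 1:])                     # discard the wrapped-around prefix
        return np.concatenate(out)[:len(x)]

    x, h = np.random.randn(10000), np.random.randn(85)
    print(np.allclose(overlap_save_fir(x, h), np.convolve(x, h)[:len(x)]))  # True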

ReDCrypt: Real-Time Privacy-Preserving Deep Learning Inference in Clouds Using FPGAs

Artificial Intelligence (AI) is increasingly incorporated into the cloud business to improve the functionality of the service. The adoption of AI as a cloud service raises serious privacy concerns in applications where the risk of data leakage is not acceptable. Examples of such applications include scenarios where clients hold potentially sensitive private information such as medical records, financial data, and/or location. This paper proposes ReDCrypt, the first reconfigurable hardware-accelerated framework that empowers privacy-preserving execution of deep learning models in cloud servers. ReDCrypt is well-suited for streaming settings where clients need to dynamically analyze their data as it is collected over time, without having to queue the samples to meet a certain batch size. Unlike prior work, ReDCrypt neither requires changing how AI models are trained nor relies on two non-colluding servers to perform. The secure computation in ReDCrypt is executed using Yao's Garbled Circuit (GC) protocol. We implement high-throughput and power-efficient functional APIs for efficient realization of the GC protocol on cloud servers supporting FPGA accelerators. Our API provides support for GC-optimized implementations of various computational layers used in deep learning. Proof-of-concept evaluations for different DL applications demonstrate up to 57-fold higher throughput per core compared to the prior art.
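As a very rough, heavily simplified sketch of the garbled-circuit idea underlying ReDCrypt, the Python below garbles and evaluates a single AND gate; real Garbled Circuit implementations use authenticated encryption, point-and-permute, and oblivious transfer, all of which are omitted here, and none of this reflects ReDCrypt's actual APIs.

    import hashlib, os

    def H(*labels):
        # Hash-derived "encryption pad" for a row, truncated to the label length.
        return hashlib.sha256(b"".join(labels)).digest()[:16]

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    # Garbler: pick a random 16-byte label for each value (0/1) of each wire a, b, c.
    wire = {name: {0: os.urandom(16), 1: os.urandom(16)} for name in "abc"}

    # Garbled table for c = a AND b: each row hides the output label under the
    # hash of the corresponding pair of input labels.
    table = [xor(H(wire["a"][a], wire["b"][b]), wire["c"][a & b])
             for a in (0, 1) for b in (0, 1)]

    def evaluate(label_a, label_b):
        # Evaluator holds one label per input (via oblivious transfer in the real
        # protocol) and tries each row; only the matching row decrypts correctly.
        for row in table:
            candidate = xor(row, H(label_a, label_b))
            if candidate in (wire["c"][0], wire["c"][1]):   # simplified validity check
                return candidate
        return None

    print(evaluate(wire["a"][1], wire["b"][1]) == wire["c"][1])  # True: 1 AND 1 -> 1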

Loop Unrolling for Energy Efficiency in Low-Cost FPGAs

In this work, a computer-aided design (CAD) approach that unrolls loops for designs targeted to low-cost FPGAs is described. Our approach considers latency constraints in an effort to minimize energy consumption for loop-based computation. To reduce glitch power, a glitch filtering approach is introduced that provides a balance between glitch reduction and design performance. Glitch filter enable signals are generated and routed to the filters using resources best suited to the target FPGA. Our approach automatically inserts glitch filters and associated control logic into a design prior to processing with FPGA synthesis, place, and route tools. Our energy-saving loop unrolling approach has been evaluated using five benchmarks often used in low-cost FPGAs. The energy-saving capabilities of the approach have been evaluated for an Intel Cyclone IV and a Xilinx Artix-7 FPGA using both simulation-based power estimation and board-level power measurement. The use of unrolling and glitch filtering is shown to reduce energy by at least 65% for an Artix-7 device and 50% for a Cyclone IV device while meeting design latency constraints.
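For readers unfamiliar with loop unrolling, the Python sketch below shows the software analogue of the transformation: replicating the loop body so each trip through the loop does the work of four, which in hardware exposes more parallel work per cycle. It illustrates the general technique only, not the paper's CAD flow.

    def sum_rolled(data):
        acc = 0
        for x in data:
            acc += x
        return acc

    def sum_unrolled_by_4(data):
        # Four independent accumulators: four elements processed per loop trip.
        acc0 = acc1 = acc2 = acc3 = 0
        n4 = len(data) - len(data) % 4
        for i in range(0, n4, 4):
            acc0 += data[i]
            acc1 += data[i + 1]
            acc2 += data[i + 2]
            acc3 += data[i + 3]
        return acc0 + acc1 + acc2 + acc3 + sum(data[n4:])   # handle the remainder

    data = list(range(10))
    print(sum_rolled(data), sum_unrolled_by_4(data))  # 45 45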

NEURAghe: Exploiting CPU-FPGA Synergies for Efficient and Flexible CNN Inference Acceleration on Zynq SoCs

Deep convolutional neural networks obtain outstanding results in tasks that require human-level understanding of data, like image or speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs. NEURAghe leverages the synergistic usage of the Zynq ARM cores and of a powerful and flexible Convolution-Specific Processor deployed on the reconfigurable logic. The Convolution-Specific Processor embeds both a convolution engine and a programmable soft core, releasing the ARM processors from most of the supervision duties and allowing the accelerator to be controlled by software at ultra-fine granularity. This methodology opens the way for cooperative heterogeneous computing: while the accelerator takes care of the bulk of the CNN workload, the ARM cores can seamlessly execute hard-to-accelerate parts of the computational graph, taking advantage of the NEON vector engines to further speed up computation. Through the companion NeuDNN SW stack, NEURAghe supports end-to-end CNN-based classification with a peak performance of 169 Gops/s and an energy efficiency of 17 Gops/W. Thanks to our heterogeneous computing model, our platform improves upon the state of the art, achieving a frame rate of 5.5 fps on VGG-16 and 6.6 fps on ResNet-18.

FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks

Convolutional Neural Networks have rapidly become the most successful machine learning algorithm, enabling ubiquitous machine vision and intelligent decisions even on embedded computing systems. While the underlying arithmetic is structurally simple, compute and memory requirements are challenging. One of the promising opportunities is leveraging reduced-precision representations for inputs, activations, and model parameters. The resulting scalability in performance, power efficiency, and storage footprint provides interesting design compromises in exchange for a small reduction in accuracy. FPGAs are ideal for exploiting low-precision inference engines that leverage custom precisions to achieve the required numerical accuracy for a given application. In this article, we describe the second generation of the FINN framework, an end-to-end tool that enables design space exploration and automates the creation of fully customized inference engines on FPGAs. Given a neural network description, the tool optimizes for a given platform, design target, and precision. We introduce formalizations of resource cost and performance predictions, and elaborate on optimization algorithms. Finally, we evaluate a selection of reduced-precision neural networks ranging from CIFAR-10 classifiers to YOLO-based object detection on a range of platforms including PYNQ and AWS-F1, demonstrating unprecedented measured throughput of 36 TOp/s on AWS-F1 and 5 TOp/s on embedded devices.
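A minimal Python sketch of the reduced-precision representation discussed above: uniform quantization of activations to a chosen bit width. The function and its parameters are illustrative assumptions, not part of FINN-R's API.

    import numpy as np

    def quantize_uniform(x, bits, x_max):
        """Map real values in [-x_max, x_max] to signed integers with `bits` bits."""
        levels = 2 ** (bits - 1) - 1
        q = np.clip(np.round(x / x_max * levels), -levels, levels).astype(np.int32)
        return q, x_max / levels          # integer codes and the scale to dequantize

    x = np.array([-1.0, -0.3, 0.0, 0.45, 0.9])
    q, scale = quantize_uniform(x, bits=4, x_max=1.0)
    print(q)            # [-7 -2  0  3  6]
    print(q * scale)    # dequantized approximation of x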

