ACM Transactions on Reconfigurable Technology and Systems (TRETS)

Latest Articles

Framework for Rapid Performance Estimation of Embedded Soft Core Processors

The large number of embedded soft core processors available today makes it tedious and time-consuming to select the best processor for a given... (more)

Preemption of the Partial Reconfiguration Process to Enable Real-Time Computing With FPGAs

To improve computing performance in real-time applications, modern embedded platforms comprise hardware accelerators that speed up the task’s... (more)

Wotan: Evaluating FPGA Architecture Routability without Benchmarks

FPGA routing architectures consist of routing wires and programmable switches that together account for the majority of the fabric delay and area, making evaluation and optimization of an FPGA's routing architecture very important. Routing architectures have traditionally been evaluated using a full synthesis, pack, place, and route CAD... (more)


Call for Editor-In-Chief Nominations

Nominations, including self-nominations, are invited for a three-year term as TRETS EiC, beginning on March 1, 2019. 

The deadline for submitting nominations is 15 October 2018, although nominations will continue to be accepted until the position is filled.



Special Issue on Security in FPGA-Accelerated Cloud and Datacenters


2017 TRETS Best Paper Award Winner:

We are pleased to announce the winner of the 2017 TRETS Best Paper:

Hoplite: A Deflection-Routed Directional Torus NoC for FPGAs

by Nachiket Kapre and Jan Gray


New Page Limit:

Effective April 12, 2018, the page limit for TRETS submissions has increased from 22 to 28 pages.



Forthcoming Articles
High-Efficiency Convolutional Ternary Neural Networks with Custom Adder Trees and Weight Compression

Although inference with artificial neural networks (ANNs) was until recently considered essentially compute-intensive, the emergence of deep neural networks, coupled with the evolution of integration technology, has transformed inference into a memory-bound problem. Given this observation, many recent works have focused on minimizing memory accesses, either by enforcing and exploiting sparsity on weights or by using few bits to represent activations and weights, so that ANN inference can be used in embedded devices. In this work, we detail an architecture dedicated to inference using ternary {-1, 0, 1} weights and activations. The architecture is configurable at design time to offer a choice of throughput vs. power trade-offs. It is also generic in the sense that it uses information drawn from the target technology (memory geometries and cost, number of available cuts, etc.) to adapt as well as possible to the FPGA resources. This allows it to achieve up to 5.2k fps per Watt for classification on a VC709 board using approximately half of the FPGA's resources.
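The key arithmetic property behind ternary networks can be illustrated in a few lines: with weights restricted to {-1, 0, 1}, every "multiply" in a dot product reduces to an add, a subtract, or a skip, so no hardware multiplier is needed. This is only a minimal behavioral sketch, not the paper's adder-tree FPGA architecture:

```python
def ternary_dot(weights, activations):
    """Dot product with ternary {-1, 0, 1} weights: no multiplications,
    only adds, subtracts, and skips (zero weights are free sparsity)."""
    assert all(w in (-1, 0, 1) for w in weights)
    acc = 0
    for w, a in zip(weights, activations):
        if w == 1:
            acc += a          # +1 weight: add the activation
        elif w == -1:
            acc -= a          # -1 weight: subtract the activation
        # w == 0: skip entirely -- exploited as sparsity
    return acc

print(ternary_dot([1, 0, -1, 1], [1, -1, 1, 0]))  # -> 0
```

In the paper's setting the activations are ternary as well, which further shrinks each adder-tree operand to two bits.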

Lightening the Load with Highly Accurate Storage- and Energy-Efficient LightNNs

Hardware implementations of Deep Neural Networks (DNNs) have been adopted in many systems because of their higher classification speed. However, while larger DNNs may offer better accuracy, they require significant energy and area, thereby limiting their wide adoption. The energy consumption of DNNs is driven by both memory accesses and computation. Binarized Neural Networks (BNNs), meanwhile, show poor accuracy when a smaller DNN configuration is adopted. In this paper, we propose a new DNN architecture, LightNN, which replaces multiplications with one shift, or a constrained number of shifts and adds. Our theoretical analysis of LightNNs shows that their accuracy is maintained while storage and energy requirements are dramatically reduced. For a fixed DNN configuration, LightNNs have better accuracy than BNNs at a slight energy increase, yet are more energy efficient than conventional DNNs with only slightly lower accuracy. LightNNs therefore give hardware designers more options for trading off accuracy and energy. These conclusions are verified experimentally using the MNIST and CIFAR-10 datasets for different DNN configurations. Our FPGA implementation of conventional DNNs and LightNNs confirms all theoretical and simulation results, and shows that LightNNs reduce latency and use fewer FPGA resources than conventional DNN architectures.
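The core idea of replacing a multiplication with a constrained number of shifts and adds can be sketched by approximating each weight as a sum of at most k signed powers of two, so x*w becomes at most k shift-and-add operations. This is an illustrative sketch of the arithmetic substitution only, not the paper's training or quantization scheme:

```python
import math

def shift_approx(w, k=2):
    """Greedily approximate weight w by a sum of at most k signed powers
    of two (LightNN-style shift-add substitution; illustrative only).
    Returns the approximation and the list of power-of-two terms."""
    approx, residual = 0.0, w
    terms = []
    for _ in range(k):
        if residual == 0:
            break
        e = round(math.log2(abs(residual)))        # nearest exponent
        term = math.copysign(2.0 ** e, residual)   # signed power of two
        terms.append(term)
        approx += term
        residual -= term
    return approx, terms

print(shift_approx(0.3, 1))  # -> (0.25, [0.25])
```

With k=1 a multiply collapses to a single shift; with k=2 the residual error shrinks further (0.3 becomes 0.25 + 0.0625 = 0.3125) at the cost of one extra shift and add.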

Optimizing CNN-based Segmentation with Deeply Customized Convolutional and Deconvolutional Architectures on FPGA

Convolutional Neural Network (CNN) based algorithms have been successful in solving image-recognition problems, showing very large accuracy improvements. In recent years, deconvolution layers have been widely used in state-of-the-art CNNs for end-to-end training and in models supporting tasks such as image segmentation and super-resolution. However, deconvolution algorithms are computationally intensive, which limits their applicability to real-time applications. In particular, there has been little research on efficient implementations of deconvolution algorithms on FPGA platforms. In this work, we propose and develop highly optimized and efficient FPGA architectures for both deconvolution and CNN algorithms. A non-linear optimization model based on a performance model is introduced to efficiently explore the design space of the CNN accelerator, in order to achieve optimal system throughput and improve power efficiency. Finally, we implement our designs on a Xilinx Zynq ZC706 board: the deconvolution accelerator achieves a performance of 90.1 GOPS at a 200 MHz clock frequency and a performance density of 0.10 GOPS/DSP using 32-bit quantization. Our CNN accelerator achieves 107 GOPS and 0.12 GOPS/DSP using 16-bit quantization, and supports up to 17 frames per second for 512x512 image segmentation with a power consumption of only 9.6 W.

PIMap: A Flexible Framework for Improving LUT-Based Technology Mapping via Parallelized Iterative Optimization

Modern FPGA synthesis tools typically apply a predetermined sequence of logic optimizations to the input logic network before carrying out technology mapping. While the "known recipes" of logic transformations often lead to improved mapping results, there remains a nontrivial gap between the quality metrics driving the pre-mapping logic optimizations and those targeted by the actual technology mapping. Needless to say, such miscorrelation eventually results in suboptimal quality of results. We propose PIMap, which couples logic transformations and technology mapping in an iterative improvement framework for LUT-based FPGAs. In each iteration, PIMap randomly proposes a transformation of the given logic network, drawn from an ensemble of candidate optimizations; it then invokes technology mapping and uses the mapping result to determine the likelihood of accepting the proposed transformation. By adjusting the optimization objective and incorporating timing constraints during the iterative process, PIMap can flexibly optimize for different objectives, including area minimization, delay optimization, and delay-constrained area reduction. To mitigate the runtime overhead, we further introduce parallelization techniques that decompose a large design into multiple smaller sub-netlists that can be optimized simultaneously. Experimental results show that PIMap achieves promising quality improvements over a set of commonly used benchmarks.
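The propose/map/accept loop described above resembles a Metropolis-style iterative improvement. A minimal sketch of one such iteration follows; the function names (`transforms`, `map_to_luts`, `cost`) are hypothetical stand-ins for a logic-transform library, a technology mapper, and a quality metric, and the acceptance rule is a generic simulated-annealing one, not necessarily the authors' exact likelihood function:

```python
import math
import random

def pimap_step(state, transforms, map_to_luts, cost, temp=1.0):
    """One PIMap-style iteration (illustrative sketch): randomly propose
    a logic transformation, run technology mapping on the result, and
    accept or reject based on the mapped cost (e.g. LUT count or depth)."""
    candidate = random.choice(transforms)(state)   # propose a transform
    new_cost = cost(map_to_luts(candidate))        # map, then measure
    old_cost = cost(map_to_luts(state))
    if new_cost <= old_cost:
        return candidate                           # improvement: accept
    # worse mapping: accept with probability decaying in the cost gap,
    # which lets the search escape local minima
    if random.random() < math.exp((old_cost - new_cost) / temp):
        return candidate
    return state
```

The parallelization in the paper runs this loop concurrently on decomposed sub-netlists; here a single-netlist step is shown for clarity.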

You Can't Improve What You Don't Measure: FPGA vs. ASIC Efficiency Gaps for Convolutional Neural Network Inference

Recently, Deep Learning (DL) has become best-in-class for numerous applications, but at a high computational cost that necessitates high-performance, energy-efficient acceleration. The reconfigurability of FPGAs is appealing given the rapid change in DL models, but it also results in lower performance and area efficiency compared to ASICs. In this paper, we implement three state-of-the-art Computing Architectures (CAs) for Convolutional Neural Network (CNN) inference on FPGAs and ASICs. By comparing the FPGA and ASIC implementations, we highlight the area and performance costs of programmability to pinpoint the inefficiencies in current FPGA architectures. We perform our experiments using variations of these CAs for three networks (AlexNet, VGG-16, and ResNet-50) to allow extensive comparisons. We find that the performance gap varies significantly, from 3.8x to 9.4x, while the area gap is consistent across CAs, with an average FPGA-to-ASIC area ratio of 9.8. Among the different blocks of the CAs, the convolution engine, constituting 32% to 61% of the total area, has a high area ratio ranging from 16 to 40. Motivated by our FPGA vs. ASIC comparisons, we suggest FPGA architectural changes such as increasing the DSP block count, enhancing low-precision support in DSP blocks, and rethinking on-chip memories to reduce the programmability gap for DL applications.

Efficient fine-grained processor-logic interactions on the cache-coherent Zynq platform

The introduction of cache-coherent processor-logic interconnects in CPU-FPGA platforms promises low-latency communication between CPU and FPGA fabrics. This reduced latency improves the performance of heterogeneous systems implemented on such devices and gives rise to new software architectures that can better use the available hardware. Via an extended study accelerating the software task scheduler of a microkernel operating system, this paper reports on the potential for accelerating applications that exhibit fine-grained interactions. In doing so, we evaluate the performance of direct and cache-coherent communication methods for applications that involve frequent, low-bandwidth transactions between the CPU and programmable logic. In the specific case we studied, we found that replacing a highly optimised software implementation of the task scheduler with an FPGA-based accelerator reduces the cost of communication between two software threads by 5.5%. We also found that, while hardware acceleration reduces the cache footprint, we still observe execution-time variability because of other non-deterministic features of the CPU.

FPGA-based Acceleration of FT Convolution for Pulsar Search Using OpenCL

This paper focuses on FPGA-based acceleration of the Frequency-Domain Acceleration Search module, which is part of the SKA pulsar search engine. In this module, the frequency-domain input signals have to be processed by 85 Finite Impulse Response (FIR) filters within a strict time limit, for thousands of input arrays. Because of the large input length and FIR filter size, even high-end FPGA devices cannot parallelize the task completely. We start by investigating both time-domain FIR filters (TDFIR) and frequency-domain FIR filters (FDFIR) to tackle this task. We apply the overlap-add algorithm to split the coefficient array of the TDFIR, and the overlap-save algorithm to split the input signals of the FDFIR. For fast design prototyping, we employ OpenCL, a high-level FPGA development technique. Performance and power consumption are evaluated using multiple FPGA devices simultaneously and compared with GPU results obtained by porting the FPGA-based OpenCL kernels. The experimental evaluation shows that the FDFIR solution is very competitive in terms of performance, with a clear energy-consumption advantage over the GPU solution.
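The overlap-save algorithm mentioned above splits a long input into overlapping FFT blocks, multiplies each block by the filter's frequency response, and discards the circularly aliased samples. A reference NumPy sketch of the method (block size and filter here are illustrative, not the SKA module's 85-filter configuration):

```python
import numpy as np

def overlap_save_fir(x, h, block=64):
    """Frequency-domain FIR filtering via overlap-save.
    Each FFT block of length `block` yields block - M + 1 valid output
    samples, where M = len(h); the first M - 1 samples of each inverse
    FFT are circular-convolution aliases and are discarded."""
    M = len(h)
    step = block - M + 1
    H = np.fft.rfft(h, block)                     # filter response, once
    xp = np.concatenate([np.zeros(M - 1), x])     # prepend filter history
    out = []
    for start in range(0, len(x), step):
        seg = xp[start:start + block]
        if len(seg) < block:                      # zero-pad final block
            seg = np.concatenate([seg, np.zeros(block - len(seg))])
        yblk = np.fft.irfft(np.fft.rfft(seg) * H, block)
        out.append(yblk[M - 1:])                  # drop aliased samples
    return np.concatenate(out)[:len(x)]
```

The output matches direct time-domain filtering (`np.convolve(x, h)[:len(x)]`), but the per-sample cost drops from O(M) to O(log(block)), which is what makes the FDFIR variant attractive for large filters.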

ReDCrypt: Real-Time Privacy-Preserving Deep Learning Inference in Clouds Using FPGAs

Artificial Intelligence (AI) is increasingly incorporated into the cloud business to improve the functionality of the service. The adoption of AI as a cloud service raises serious privacy concerns in applications where the risk of data leakage is not acceptable. Examples of such applications include scenarios where clients hold potentially sensitive private information such as medical records, financial data, and/or location. This paper proposes ReDCrypt, the first reconfigurable hardware-accelerated framework that empowers privacy-preserving execution of deep learning models in cloud servers. ReDCrypt is well-suited for streaming settings where clients need to dynamically analyze their data as it is collected over time, without having to queue the samples to meet a certain batch size. Unlike prior work, ReDCrypt neither requires changing how AI models are trained nor relies on two non-colluding servers. The secure computation in ReDCrypt is executed using Yao's Garbled Circuit (GC) protocol. We implement high-throughput and power-efficient functional APIs for efficient realization of the GC protocol on cloud servers supporting FPGA accelerators. Our API provides support for the GC-optimized implementation of various computational layers used in deep learning. Proof-of-concept evaluations for different DL applications demonstrate up to 57-fold higher throughput per core compared to the prior art.

Reconfigurable Hardware Architecture for Authenticated Key Agreement Protocol Over Binary Edwards Curve

We present a high-performance hardware architecture for the elliptic-curve-based authenticated key agreement protocol of Menezes, Qu, and Vanstone (ECMQV) over a Binary Edwards Curve (BEC). First, we analyze the inversion module on a 251-bit binary field. Subsequently, we present FPGA implementations of the unified formula for computing elliptic-curve point addition on a BEC in affine and projective coordinates, and investigate the relative performance of the two coordinate systems. We then implement the $w$-coordinate-based differential addition formulae suitable for use in a Montgomery ladder. Next, we present a novel hardware architecture for BEC point multiplication using mixed $w$-coordinates of the Montgomery laddering algorithm, and analyze its resistance to Simple Power Analysis (SPA) attacks. To improve performance, the architecture utilizes registers efficiently, in addition to the efficient scheduling mechanisms introduced for the BEC arithmetic implementations. Our implementation results show that the proposed architecture is resistant to SPA attacks and yields better performance than existing state-of-the-art BEC designs for computing point multiplication. Finally, we present an FPGA design of the ECMQV key agreement protocol using a BEC defined over $\mathbb{GF}({2^{251}})$. To the best of our knowledge, this is the first FPGA design of the ECMQV protocol using a BEC.
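The SPA resistance claimed above comes from the structure of the Montgomery ladder itself: every scalar bit triggers exactly one point addition and one point doubling, regardless of the bit's value, so the power trace does not reveal the key. A generic sketch of the ladder follows, parameterized over abstract group operations; the paper's $w$-coordinate BEC formulas are not reproduced here:

```python
def montgomery_ladder(k, P, add, double, identity):
    """Compute k*P with the Montgomery ladder. The invariant R1 = R0 + P
    holds throughout, and each iteration performs one add and one double
    independent of the key bit -- the property exploited for SPA resistance."""
    R0, R1 = identity, P
    for bit in bin(k)[2:]:                 # scan scalar bits MSB-first
        if bit == '1':
            R0, R1 = add(R0, R1), double(R1)
        else:
            R0, R1 = double(R0), add(R0, R1)
    return R0

# Toy check over the additive group of integers, where k*P is just k*P:
print(montgomery_ladder(13, 1, lambda a, b: a + b, lambda a: 2 * a, 0))  # -> 13
```

In a real design the `add`/`double` callbacks would be the BEC differential addition and doubling formulas over the binary field, and `R0`/`R1` would be register pairs updated every clock phase.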

FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks

Convolutional Neural Networks have rapidly become the most successful machine-learning algorithm, enabling ubiquitous machine vision and intelligent decisions even on embedded computing systems. While the underlying arithmetic is structurally simple, the compute and memory requirements are challenging. One promising opportunity is leveraging reduced-precision representations for inputs, activations, and model parameters. The resulting scalability in performance, power efficiency, and storage footprint provides interesting design compromises in exchange for a small reduction in accuracy. FPGAs are ideal for exploiting low-precision inference engines that leverage custom precisions to achieve the required numerical accuracy for a given application. In this article, we describe the second generation of the FINN framework, an end-to-end tool that enables design-space exploration and automates the creation of fully customized inference engines on FPGAs. Given a neural network description, the tool optimizes for given platforms, design targets, and a specific precision. We introduce formalizations of resource cost and performance predictions, and elaborate on optimization algorithms. Finally, we evaluate a selection of reduced-precision neural networks, ranging from CIFAR-10 classifiers to YOLO-based object detection, on a range of platforms including PYNQ and AWS-F1, demonstrating unprecedented measured throughput of 36 TOp/s on AWS-F1 and 5 TOp/s on embedded devices.

Automated Synthesis of Streaming Transfer Level Hardware Designs

While modern FPGAs enable high computing performance and efficiency, programming them in low-level hardware description languages is time-consuming and remains a major obstacle to their adoption. High-level synthesis compilers can produce RTL designs from C/C++ algorithmic descriptions, but despite allowing significant design-time improvements, these tools are not always able to generate hardware designs that compare to hand-written RTL designs. In this paper, we consider synthesis from an intermediate-level (IL) language that allows the description of algorithmic state machines handling connections between streaming sources and sinks. We propose new advanced connection operators to simplify the description of pipelined architectures. However, blind interconnection of sources and sinks can lead to cyclic combinational relations, resulting in undesirable behavior. We propose a functional-level methodology to automate the resolution of such cyclic relations into acyclic combinational functions. The proposed IL synthesis methodology has been applied to the design of pipelined floating-point cores. The results obtained with our compiler for a floating-point accumulator, a Gaussian elimination coprocessor, and a Gauss-Jordan matrix inversion coprocessor show how the proposed IL methodology can simplify the description of pipelined architectures while delivering performance close to that achievable with an RTL design methodology.
