ACM Transactions on

Reconfigurable Technology and Systems (TRETS)

Latest Articles

Leakier Wires: Exploiting FPGA Long Wires for Covert- and Side-channel Attacks

In complex FPGA designs, implementations of algorithms and protocols from third-party sources are common. However, the monolithic nature of FPGAs means that all sub-circuits share common on-chip infrastructure, such as routing resources. This presents an attack vector for all FPGAs that contain designs from multiple vendors, especially for FPGAs... (more)

Mitigating Electrical-level Attacks towards Secure Multi-Tenant FPGAs in the Cloud

A rising trend is the use of multi-tenant FPGAs, particularly in cloud environments, where partial access to the hardware is given to multiple third... (more)

A Protection and Pay-per-use Licensing Scheme for On-cloud FPGA Circuit IPs

Using security primitives, a novel scheme for licensing hardware intellectual properties (HWIPs) on Field Programmable Gate Arrays (FPGAs) in public... (more)

Recent Attacks and Defenses on FPGA-based Systems

Field-programmable gate array (FPGA) is a kind of programmable chip that is widely used in many areas, including automotive electronics, medical... (more)

Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing

Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data... (more)

Novel Congestion-estimation and Routability-prediction Methods based on Machine Learning for Modern FPGAs

Effectively estimating and managing congestion during placement can save substantial placement and... (more)


Editorial: A Message from the new Editor-in-Chief


2018 TRETS Best Paper Award Winner:

We are pleased to announce the winner of the 2018 TRETS Best Paper:

General-Purpose Computing with Soft GPUs on FPGAs

Muhammed Al Kadi, Benedikt Janssen, Jones Yudi, and Michael Huebner.


New Editor-in-Chief:

TRETS welcomes Deming Chen as its new Editor-in-Chief for the term March 1, 2019, to February 28, 2022. Deming is a Professor in the Electrical and Computer Engineering Department at the University of Illinois at Urbana-Champaign, as well as a Research Professor in the Coordinated Science Laboratory and an Affiliate Professor in the CS Department.


New Page Limit:

Effective immediately, the page limit for TRETS submissions has increased to 32 pages.



Forthcoming Articles

GraVF-M: Graph Processing System Generation for Multi-FPGA Platforms

Due to the irregular nature of connections in most graph datasets, partitioning graph algorithms across multiple computational nodes that do not share a common memory inevitably leads to large amounts of interconnect traffic. Previous research has shown that FPGAs can outcompete software-based graph processing in shared memory contexts, but it remains an open question if this advantage can be maintained in distributed systems. In this work, we present GraVF-M, a framework designed to ease the implementation of FPGA-based graph processing accelerators for multi-FPGA platforms with distributed memory. Based on a lightweight description of the algorithm kernel, the framework automatically generates optimized RTL code for the whole multi-FPGA design. We exploit an aspect of the programming model to present a familiar message-passing paradigm to the user, while under the hood implementing a more efficient architecture that can reduce the necessary inter-FPGA network traffic by a factor equal to the average degree of the input graph. A performance model based on a theoretical analysis of the factors influencing performance serves to evaluate the efficiency of our implementation. With a throughput of up to 5.8 GTEPS on a 4-FPGA system, the designs generated by GraVF-M compare favorably to state-of-the-art frameworks from the literature.

Unrolling Ternary Neural Networks

The computational complexity of neural networks for large-scale or real-time applications necessitates hardware acceleration. Most approaches assume that the network architecture and parameters are unknown at design time, permitting usage in a large number of applications. This paper demonstrates, for the case where the neural network architecture and ternary weight values are known a priori, that extremely high-throughput implementations of neural network inference can be made by customising the datapath and routing to remove unnecessary computations and data movement. This approach is ideally suited to FPGA implementations, as a specialized implementation of a trained network improves efficiency while still retaining generality through the reconfigurability of an FPGA. A VGG-style network with ternary weights and fixed-point activations is implemented for the CIFAR10 dataset on Amazon's AWS F1 instance. This paper demonstrates how to remove 90% of the operations in convolutional layers by exploiting sparsity and compile-time optimizations. The implementation in hardware achieves 91.9% accuracy and 100k frames per second with a latency of 27 µs, which is the fastest CNN inference implementation reported so far.
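
As a rough software illustration of the idea in this abstract (not the authors' hardware flow), the sketch below specialises a dot product for ternary weights that are fixed in advance. The function name `unroll_ternary_dot` and the sample weights are hypothetical. Zero weights are dropped when the specialised function is built, and the remaining +1 and -1 weights become plain additions and subtractions, so no multiplications remain.

```python
def unroll_ternary_dot(weights):
    """Build a dot-product function specialised for fixed ternary weights.

    Hypothetical helper for illustration only. Weights must be -1, 0 or +1.
    Returns the specialised function and the number of add/subtract terms kept.
    """
    if any(w not in (-1, 0, 1) for w in weights):
        raise ValueError("weights must be ternary: -1, 0 or +1")

    # Decided once, when the weights are known: which inputs are added,
    # which are subtracted. Inputs with weight 0 are dropped entirely.
    plus = [i for i, w in enumerate(weights) if w == 1]
    minus = [i for i, w in enumerate(weights) if w == -1]

    def dot(x):
        # Only additions and subtractions remain; no multiplications.
        return sum(x[i] for i in plus) - sum(x[i] for i in minus)

    return dot, len(plus) + len(minus)


# Example: 8 weights, 5 of them zero, so only 3 terms survive.
weights = [1, 0, 0, -1, 0, 1, 0, 0]
inputs = [3, 7, 2, 5, 9, 4, 1, 8]

dot, kept_terms = unroll_ternary_dot(weights)
specialised = dot(inputs)
reference = sum(w * v for w, v in zip(weights, inputs))
```

Here `specialised` and `reference` agree, while the specialised version touches only the three inputs with nonzero weights. In hardware the same decision would be made at synthesis time rather than at function-construction time.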

The FPOA, A Medium-Grained Reconfigurable Architecture for High-Level Synthesis

In this paper we present a novel type of medium-grained reconfigurable architecture that we term the Field Programmable Operation Array (FPOA). This device has been designed specifically for the implementation of HLS-generated circuitry. At the core of the FPOA is the OP-block. Unlike a standard LUT, an OP-block performs multi-bit operations through gate-based logic structures, translating into greater speed and efficiency in digital circuit implementation. Our device is not optimized for a specific application domain. Rather, we have created a device that is optimized for a specific circuit structure, namely those generated by HLS. This gives the FPOA a significant advantage as it can be used across all application domains. In this work, we add support for both distributed and block memory to the FPOA architecture. Experimental results show up to a 13.5x reduction in logic area and a 9.5x reduction in critical path delay for circuit implementation using the FPOA compared to a standard FPGA.

A DSL-Based Hardware Generator in Scala for Fast Fourier Transforms and Sorting Networks

We present a hardware generator for computations with regular structure including the fast Fourier transform (FFT), sorting networks, and others. The input of the generator is a high-level description of the algorithm; the output is a token-based, synchronized design in the form of RTL-Verilog. Building on prior work, the generator uses several layers of domain-specific languages (DSLs) to represent and optimize at different levels of abstraction to produce a RAM- and area-efficient hardware implementation. Two of these layers and DSLs are novel. The first one allows the use and domain-specific optimization of state-of-the-art streaming permutations. The second DSL enables the automatic pipelining of a streaming hardware dataflow and the synchronization of its data-independent control signals. The generator, including the DSLs, is implemented in Scala, leveraging its type system, and uses concepts from lightweight modular staging (LMS) to handle the constraints of streaming hardware. In particular, these concepts offer genericity over hardware number representation, including seamless switching between fixed-point arithmetic and FloPoCo-generated IEEE floating-point operators, while ensuring type-safety. We show benchmarks of generated FFTs, sorting networks and Walsh-Hadamard transforms that outperform prior generators.
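
To make the notion of a sorting network with data-independent control concrete, here is a minimal sketch of a bitonic sorting network in software. It is a textbook construction chosen for familiarity and is not taken from the generator described above; the function names `bitonic_network` and `apply_network` are hypothetical. The list of compare-exchange pairs depends only on the input size, never on the data, which is the property that lets such a network map onto fixed hardware.

```python
def bitonic_network(n):
    """Return the compare-exchange pairs of a bitonic sorting network.

    Each pair (lo, hi) means: after the step, position lo holds the smaller
    value and position hi the larger. n must be a power of two.
    """
    if n <= 0 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two")
    pairs = []
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # distance between compared positions
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    if i & k == 0:
                        pairs.append((i, partner))   # ascending block
                    else:
                        pairs.append((partner, i))   # descending block
            j //= 2
        k *= 2
    return pairs


def apply_network(pairs, values):
    """Run the fixed comparator sequence on a copy of the input values."""
    out = list(values)
    for lo, hi in pairs:
        if out[lo] > out[hi]:
            out[lo], out[hi] = out[hi], out[lo]
    return out


network = bitonic_network(8)
sorted_values = apply_network(network, [5, 2, 7, 1, 8, 3, 6, 4])
```

For eight inputs the network contains 24 comparators, and the same `network` list sorts any eight-element input. A hardware generator would emit these comparators as pipelined stages instead of executing them one after another.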

A Design Flow Engine for the Support of Customized Dynamic High Level Synthesis Flows

High Level Synthesis is a set of methodologies aimed at generating hardware descriptions starting from specifications written in high level languages. While these methodologies share several elements with traditional compilation flows, some characteristics of the problem they address require ad hoc management. In particular, unlike in most traditional compilation flows, the complexity and execution time of High Level Synthesis techniques are much less relevant than the quality of the produced results. For this reason, fixed point analyses as well as successive refinement optimizations can be accepted, provided that they can improve the quality of the generated designs. This paper presents the design flow engine developed for Bambu, an open source High Level Synthesis tool based on GNU GCC. The engine allows the execution of complex and customized synthesis flows and supports dynamic addition of passes, cyclic dependences, and selective pass invalidation and skipping. Experimental results show the benefits of this type of design flow with respect to static linear design flows when applied to High Level Synthesis.
