ACM Transactions on Reconfigurable Technology and Systems (TRETS)

Latest Articles

An Efficient Memory Partitioning Approach for Multi-Pattern Data Access via Data Reuse

Memory bandwidth has become a bottleneck that impedes performance improvement during the parallelism optimization of the datapath. Memory partitioning... (more)

COFFE 2: Automatic Modelling and Optimization of Complex and Heterogeneous FPGA Architectures

FPGAs are becoming more heterogeneous to better adapt to different markets, motivating rapid exploration of different blocks/tiles for FPGAs. To evaluate a new FPGA architectural idea, one should be able to accurately obtain the area, delay, and energy consumption of the block of interest. However, current FPGA circuit design tools can only model... (more)

In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms

Conventional homogeneous multicore processors are not able to provide the continued performance and energy improvement that we have expected from past... (more)

FlexSaaS: A Reconfigurable Accelerator for Web Search Selection

Web search engines deploy large-scale selection services on CPUs to identify a set of web pages that match user queries. An FPGA-based accelerator can exploit various levels of parallelism and provide a lower latency, higher throughput, more energy-efficient solution than commodity CPUs. However, maintaining such a customized accelerator in a... (more)


New Editor-in-Chief:

TRETS welcomes Deming Chen as its new Editor-in-Chief for the term March 1, 2019, to February 28, 2022. Deming is a Professor in the Electrical and Computer Engineering Department at the University of Illinois at Urbana-Champaign, as well as a Research Professor in the Coordinated Science Laboratory and an Affiliate Professor in the CS Department.

2017 TRETS Best Paper Award Winner:

We are pleased to announce the winner of the 2017 TRETS Best Paper:

Hoplite: A Deflection-Routed Directional Torus NoC for FPGAs

by Nachiket Kapre and Jan Gray


New Page Limit:

Effective immediately, the page limit for TRETS submissions has increased to 32 pages.



Forthcoming Articles
A Survey of FPGA-Based Neural Network Inference Accelerator

Recent research on neural networks has shown great advantages in computer vision over traditional algorithms based on handcrafted features and models. Neural networks are now widely adopted in fields such as image, speech, and video recognition. However, the great computational and storage complexity of neural-network-based algorithms poses great difficulty for their application. CPU platforms can hardly offer enough computational capacity. GPU platforms are the first choice for neural network processing because of their high computational capacity and easy-to-use development frameworks. Meanwhile, FPGA-based neural network accelerators are becoming a growing research topic, because specifically designed hardware is the next candidate to surpass GPUs in speed and energy efficiency. Various FPGA-based accelerator designs have been proposed with software and hardware optimization techniques to achieve high speed and energy efficiency. In this paper, we give an overview of previous work on FPGA-based neural network accelerators and summarize the main techniques used. An investigation spanning software to hardware, and circuit level to system level, provides a complete analysis of FPGA-based neural network accelerator design and serves as a guide for future work.

Exact and Practical Modulo Scheduling for High-level Synthesis

Loop pipelining is an essential technique in high-level synthesis (HLS) to increase the throughput and resource utilisation of FPGA-based accelerators. It relies on modulo schedulers to compute an operator schedule that allows subsequent loop iterations to overlap partially when executed, while still honouring all precedence and resource constraints. Modulo schedulers face a bi-criteria problem: minimise the initiation interval (II), i.e. the number of time steps after which new iterations are started, and minimise the schedule length. We present Moovac, a novel exact formulation that models all aspects (including the II minimisation) of the modulo scheduling problem as a single integer linear program (ILP), and discuss simple measures to prevent excessive runtimes, to challenge the old preconception that exact modulo scheduling is impractical. We substantiate this claim by conducting an experimental study covering 188 loops from two established HLS benchmark suites, four different time limits, and three bounds for the schedule length, to compare our approach against a highly-tuned exact formulation and a state-of-the-art heuristic algorithm. In the fastest configuration, an accumulated runtime of under 35 minutes is spent on scheduling all loops, and proven optimal IIs are found for 175 test instances.
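To make the abstract's terminology concrete: any modulo scheduler's search for the initiation interval (II) starts from a lower bound given by resource pressure, commonly called resMII. The sketch below is a hedged illustration of that bound only, not the Moovac ILP; the operator and unit counts in the example are hypothetical.

```python
from math import ceil

def res_mii(op_counts, unit_counts):
    """Resource-constrained lower bound on the II: each operator type r with
    unit_counts[r] functional units must fit op_counts[r] operations of the
    loop body into II time slots, so II >= ceil(op_counts[r] / unit_counts[r])."""
    return max(ceil(n / unit_counts[r]) for r, n in op_counts.items())

# Hypothetical loop body: 6 multiplies sharing 2 multipliers, 4 loads
# sharing 1 memory port -> II >= max(ceil(6/2), ceil(4/1)) = 4.
ii_lower_bound = res_mii({"mul": 6, "load": 4}, {"mul": 2, "load": 1})
```

An exact scheduler such as the one described above must additionally respect recurrence (loop-carried dependence) constraints, which give a second lower bound; the II minimization searches upward from the larger of the two.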

Automata Processing in Reconfigurable Architectures: In-the-cloud Deployment, Cross-platform Evaluation and Fast Symbol-only Reconfiguration

We present a general automata processing framework on FPGAs, which can generate an RTL kernel for automata processing together with AXI- and PCIe-based I/O circuitry. We implement the framework on both local nodes and cloud platforms (Amazon AWS and Nimbix) with novel features. A full performance comparison of the framework is conducted against state-of-the-art automata processing engines on CPUs, GPUs, and Micron's Automata Processor using the ANMLZoo benchmark suite and some real-world datasets. Results show that FPGAs enable extremely high-throughput automata processing compared to von Neumann architectures. We also collect the resource utilization and power consumption on the two cloud platforms, and find that the I/O circuitry consumes most of the hardware resources and power. Furthermore, we propose a fast, symbol-only reconfiguration mechanism based on the framework for large pattern sets that cannot fit on one device and must be partitioned. The proposed method supports multiple passes of the input stream and reduces the recompilation cost from hours to seconds.
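For readers unfamiliar with the processing model, here is a minimal software sketch assuming ANML-style homogeneous automata (each state carries a symbol match set, and start states re-arm every cycle); it illustrates the semantics only, not the paper's RTL kernel, and the `run_automaton` helper and two-state "ab" pattern are illustrative inventions.

```python
def run_automaton(match_sets, edges, start_states, accept_states, stream):
    """Simulate a homogeneous automaton one symbol per step.
    match_sets[s]: set of symbols state s matches; edges[s]: successor states.
    A state fires when it is active AND the current symbol is in its match set;
    firing accept states report their position."""
    active, reports = set(start_states), []
    for pos, sym in enumerate(stream):
        fired = {s for s in active if sym in match_sets[s]}
        reports += [(pos, s) for s in sorted(fired) if s in accept_states]
        active = set(start_states)        # all-input start states re-arm
        for s in fired:
            active |= set(edges[s])       # successors activate next cycle
    return reports

# Pattern "ab" as two states: state 0 matches 'a' (start), state 1 matches 'b'
# (reporting). On "xabab", state 1 reports at positions 2 and 4.
hits = run_automaton({0: {"a"}, 1: {"b"}}, {0: [1], 1: []}, [0], {1}, "xabab")
```

Under this model, symbol-only reconfiguration corresponds to swapping the `match_sets` table while keeping the transition structure (`edges`) fixed, which is why it can avoid a full recompilation.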

Fast Adjustable NPN Classification Using Generalized Symmetries

NPN classification of Boolean functions is a powerful technique used in many logic synthesis and technology mapping tools in both standard-cell and FPGA design flows. Computing a canonical form is the most common approach to Boolean function classification. This paper proposes two different hybrid NPN canonical forms and a new algorithm to compute them. By exploiting symmetries under different phase assignments as well as higher-order symmetries, the search space of NPN canonical form computation is pruned and the runtime is dramatically reduced. Nevertheless, the runtime for some difficult functions remains high; a fast heuristic method can be used for such functions to compute semi-canonical forms in a reasonable time. The proposed algorithm can thus be adjusted to be a slow exact algorithm or a fast heuristic algorithm with lower quality. For exact NPN classification, the proposed algorithm is 40× faster than the state of the art. For heuristic classification, it performs similarly to the state of the art, with the possibility of trading runtime for quality.
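For context: two Boolean functions are NPN-equivalent if one is obtained from the other by Negating inputs, Permuting inputs, and/or Negating the output. A brute-force canonical form (the minimum truth table over the whole equivalence class) can be sketched as below; this naive version is exponential in the input count and is only meant to define the object that the paper's pruned algorithm computes quickly.

```python
from itertools import permutations, product

def npn_canonical(tt, n):
    """Brute-force NPN canonical form of an n-input function given as a
    truth-table integer (bit i holds f applied to the bits of i).
    Enumerates all input permutations, input polarities, and output
    polarities, returning the minimum transformed truth table."""
    best = None
    for perm in permutations(range(n)):
        for flips in product((0, 1), repeat=n):
            for out_neg in (0, 1):
                val = 0
                for i in range(1 << n):
                    j = 0                      # minterm i mapped through the
                    for b in range(n):         # signed input permutation
                        bit = ((i >> perm[b]) & 1) ^ flips[b]
                        j |= bit << b
                    val |= (((tt >> j) & 1) ^ out_neg) << i
                if best is None or val < best:
                    best = val
    return best
```

For example, 2-input AND (truth table `0b1000`) and OR (`0b1110`) fall in the same class, since OR(a, b) = NOT(AND(NOT a, NOT b)), while XOR (`0b0110`) lies in a different class.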

FeatherNet: An Accelerated Convolutional Neural Network Design for Resource-Constrained FPGAs

The primary motivation for this work is to demonstrate the use of low-end FPGAs in constrained environments such as edge computing and IoT, while still meeting the needs of computationally and memory-intensive high-end CNN applications. We present a design methodology for deriving resource-efficient CNN implementations on low-end, resource-constrained FPGAs. The work makes several contributions through CNN-related design optimizations. The first contribution is the derivation of a design that is platform independent and implemented on two different vendor platforms. The second is a reduction in resource utilization for convolutions using signal-flow-graph optimization and a reorganization of stride and processing order. Third, the design is modular, providing flexibility for multi-layer CNN implementations with different topologies. Fourth, data transfers are scheduled so that data arrives exactly when the active processing layer needs it, minimizing resource utilization. Finally, several insights are drawn from the designs to set a roadmap for future CNN optimizations on low-end FPGAs. We apply the methodology to AlexNet, a popular CNN for image classification, which requires 700 million operations with 61 million parameters for every image. Our results demonstrate the claimed contributions on the Zedboard and Cyclone V platforms.
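As a rough illustration of why AlexNet-scale workloads strain low-end FPGAs, a convolutional layer's parameter and operation counts follow directly from its shape. The helper below is a generic back-of-the-envelope sketch (not the paper's methodology); the AlexNet first-layer shape used in the example is the standard published one.

```python
def conv_cost(in_c, out_c, k, out_h, out_w):
    """Weights and multiply-accumulate (MAC) count for one k-by-k conv layer:
    params = out_c * in_c * k * k (biases ignored), and every output pixel
    of the out_h-by-out_w map reuses all of those weights once."""
    params = out_c * in_c * k * k
    macs = params * out_h * out_w
    return params, macs

# AlexNet conv1: 96 filters of 11x11x3 producing a 55x55 output map
# -> 34,848 weights and 105,415,200 MACs for this single layer.
p, m = conv_cost(3, 96, 11, 55, 55)
```

Summing such per-layer costs over the whole network is what yields totals on the order of the hundreds of millions of operations quoted above, which must then be fit into the DSP and on-chip memory budget of a small device.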
