Due to the irregular nature of connections in most graph datasets, partitioning graph algorithms across multiple computational nodes that do not share a common memory inevitably leads to large amounts of interconnect traffic. Previous research has shown that FPGAs can outcompete software-based graph processing in shared-memory contexts, but it remains an open question whether this advantage can be maintained in distributed systems. In this work, we present GraVF-M, a framework designed to ease the implementation of FPGA-based graph processing accelerators for multi-FPGA platforms with distributed memory. Based on a lightweight description of the algorithm kernel, the framework automatically generates optimized RTL code for the whole multi-FPGA design. We exploit an aspect of the programming model to present a familiar message-passing paradigm to the user, while under the hood implementing a more efficient architecture that can reduce the necessary inter-FPGA network traffic by a factor equal to the average degree of the input graph. A performance model based on a theoretical analysis of the factors influencing performance serves to evaluate the efficiency of our implementation. With a throughput of up to 5.8 GTEPS on a 4-FPGA system, the designs generated by GraVF-M compare favorably to state-of-the-art frameworks from the literature.
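The traffic reduction described above can be illustrated with a minimal software simulation of a vertex-centric, message-passing BFS superstep. This is an illustrative sketch, not the GraVF-M kernel API: the key point is that one value is sent per active vertex, and the fan-out to individual edges happens on the receiving side, so inter-partition traffic shrinks by roughly the average degree compared to per-edge messaging.

```python
INF = float("inf")

def bfs_superstep(values, edges, frontier):
    """One message-passing superstep of BFS over an adjacency list.

    Each active vertex emits a single message (its level + 1); the
    destination side fans it out to all local out-edges, which is why
    only one value per vertex needs to cross a partition boundary.
    """
    new_frontier = set()
    for u in frontier:
        msg = values[u] + 1            # scatter: one message per vertex
        for v in edges.get(u, []):     # fan-out at the destination side
            if msg < values[v]:        # apply: keep the minimum level
                values[v] = msg
                new_frontier.add(v)
    return new_frontier

# Toy graph: 0 -> {1, 2}, 1 -> 3, 2 -> 3
edges = {0: [1, 2], 1: [3], 2: [3], 3: []}
values = {v: INF for v in range(4)}
values[0] = 0
frontier = {0}
while frontier:
    frontier = bfs_superstep(values, edges, frontier)
# values now holds BFS levels: {0: 0, 1: 1, 2: 1, 3: 2}
```

In a distributed setting, `values[u] + 1` would be serialized once per remote partition containing neighbors of `u`, rather than once per edge.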
In this paper, we present a novel type of medium-grained reconfigurable architecture that we term the Field Programmable Operation Array (FPOA). This device has been designed specifically for the implementation of HLS-generated circuitry. At the core of the FPOA is the OP-block. Unlike a standard LUT, an OP-block performs multi-bit operations through gate-based logic structures, translating into greater speed and efficiency in digital circuit implementation. Our device is not optimized for a specific application domain. Rather, we have created a device that is optimized for a specific circuit structure, namely that generated by HLS. This gives the FPOA a significant advantage, as it can be used across all application domains. In this work, we add support for both distributed and block memory to the FPOA architecture. Experimental results show up to a 13.5x reduction in logic area and a 9.5x reduction in critical path delay for circuits implemented on the FPOA compared to a standard FPGA.
In earlier technology nodes, FPGAs had low power consumption compared to other compute chips such as CPUs and GPUs. However, in the 14-nm technology node, FPGAs are consuming unprecedented power in the 100+ Watt range, making power consumption a pressing concern. To reduce FPGA power consumption, researchers have proposed deploying dynamic voltage scaling (DVS). While the previously proposed solutions show promising results, they are restricted to applications that only use the soft logic part of the FPGA. In this work, we present the first DVS solution that is able to fully handle FPGA applications that use BRAMs. Our solution not only robustly tests the soft logic component of the application but also tests all components connected to the BRAMs. We extend a previously proposed CAD tool, FRoC, to automatically generate calibration bitstreams that are used to measure the application's critical-path delays on silicon. The calibration bitstreams also include testers that ensure all used SRAM cells operate safely while scaling Vdd. We experimentally show that using our DVS solution, we are able to save 32% of the total power consumed by a discrete Fourier transform application running with a fixed Vdd and clocked at the Fmax reported by static timing analysis.
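The control loop at the heart of such a DVS scheme can be sketched abstractly: lower Vdd step by step as long as the critical-path delay measured on silicon (via calibration bitstreams) still meets the clock period with a safety margin. This is a hypothetical sketch, not FRoC's actual algorithm; the function names, parameters, and the toy delay model are assumptions for illustration only.

```python
def scale_vdd(measure_delay, vdd_nominal, vdd_min, step, t_clk, margin):
    """Lower Vdd while the measured critical-path delay still fits t_clk.

    measure_delay(vdd) stands in for an on-silicon measurement using
    calibration circuitry; `margin` is a fractional safety guard band.
    """
    vdd = vdd_nominal
    while vdd - step >= vdd_min:
        # Stop scaling as soon as the next step would violate timing.
        if measure_delay(vdd - step) * (1 + margin) > t_clk:
            break
        vdd -= step
    return vdd

# Toy delay model (assumption): delay grows as Vdd falls, delay = 10/Vdd ns.
chosen = scale_vdd(lambda v: 10.0 / v, vdd_nominal=0.9, vdd_min=0.6,
                   step=0.05, t_clk=14.0, margin=0.1)
```

A real implementation would additionally run the BRAM/SRAM testers at each candidate voltage before committing to it, as the abstract describes.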
Given the growth in data inputs and application complexity, it is often the case that a single hardware accelerator is not enough to solve a given problem. In particular, the computational demands and I/O of many tasks in Machine Learning often require a cluster of accelerators to make a relevant difference in performance. In this paper, we explore the efficient construction of FPGA clusters using inference over Decision Tree Ensembles as the target application. The paper explores several levels of the problem: 1) a lightweight inter-FPGA communication protocol and routing layer to facilitate the communication between the different FPGAs; 2) data partitioning and distribution strategies that maximize performance; and 3) an in-depth analysis of how applications can be efficiently distributed over such a cluster. The experimental analysis shows that the resulting system can support inference over decision tree ensembles at a significantly higher throughput than that achieved by existing systems.
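One natural partitioning strategy for ensemble inference, sketched below, is tree-level: each device holds a disjoint subset of trees, evaluates them locally, and only the scalar partial results cross the inter-device network before a final reduction. This is an illustrative sketch under assumed data structures, not the partitioning scheme of the paper's system.

```python
def eval_tree(tree, x):
    # Walk a binary decision tree, given as nested dicts, down to a leaf.
    node = tree
    while "leaf" not in node:
        node = node["left"] if x[node["feat"]] <= node["thr"] else node["right"]
    return node["leaf"]

def partition_ensemble(trees, n_devices):
    # Tree-level partitioning: each device gets a disjoint subset of trees.
    return [trees[i::n_devices] for i in range(n_devices)]

def distributed_predict(partitions, x):
    # Each device computes a partial sum locally; only one scalar per
    # device crosses the network, then a final reduction averages them.
    partials = [sum(eval_tree(t, x) for t in part) for part in partitions]
    return sum(partials) / sum(len(p) for p in partitions)

# Toy ensemble of three decision stumps on feature 0.
stump = lambda thr, lo, hi: {"feat": 0, "thr": thr,
                             "left": {"leaf": lo}, "right": {"leaf": hi}}
trees = [stump(0.5, 0.0, 1.0), stump(0.3, 0.0, 1.0), stump(0.8, 0.0, 1.0)]
parts = partition_ensemble(trees, 2)
pred = distributed_predict(parts, [0.6])   # two stumps fire: 2/3
```

Alternatives such as replicating the ensemble and partitioning the input batch trade network traffic for memory footprint, which is exactly the design space item 2) of the abstract refers to.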
High Level Synthesis is a set of methodologies aimed at generating hardware descriptions starting from specifications written in high-level languages. While these methodologies share many elements with traditional compilation flows, there are characteristics of the addressed problem which require ad hoc management. In particular, unlike in most traditional compilation flows, the complexity and the execution time of High Level Synthesis techniques are much less relevant than the quality of the produced results. For this reason, fixed-point analyses as well as successive refinement optimizations can be accepted, provided that they improve the quality of the generated designs. This paper presents the design flow engine developed for Bambu, an open source High Level Synthesis tool based on GNU GCC. The engine allows the execution of complex and customized synthesis flows and supports dynamic addition of passes, cyclic dependencies, and selective pass invalidation and skipping. Experimental results show the benefits of this type of design flow with respect to static linear design flows when applied to High Level Synthesis.
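The combination of features the abstract lists, dynamic pass addition, dependencies, and selective invalidation with re-execution to a fixed point, can be sketched as a small pass engine. This is a hypothetical illustration of the general technique, not Bambu's actual engine or API; all class and method names are assumptions.

```python
class Pass:
    def __init__(self, name, deps=()):
        self.name, self.deps = name, tuple(deps)
    def run(self, ir):
        raise NotImplementedError  # must return (new_ir, changed)

class FlowEngine:
    def __init__(self):
        self.passes = {}    # name -> Pass (insertion order preserved)
        self.valid = set()  # passes whose results are currently up to date
    def add(self, p):
        # Dynamic addition: a pass may be registered at any point.
        self.passes[p.name] = p
        self.valid.discard(p.name)
    def invalidate(self, name):
        # Selective invalidation: force a single pass to re-run.
        self.valid.discard(name)
    def run(self, ir, max_iters=100):
        for _ in range(max_iters):
            pending = [p for n, p in self.passes.items()
                       if n not in self.valid]
            if not pending:
                break  # fixed point reached: every pass is valid
            for p in pending:
                if all(d in self.valid for d in p.deps):
                    ir, changed = p.run(ir)
                    self.valid.add(p.name)
                    if changed:
                        # A change invalidates every dependent pass,
                        # which re-runs on the next outer iteration.
                        for q in self.passes.values():
                            if p.name in q.deps:
                                self.valid.discard(q.name)
        return ir

class Tag(Pass):
    # Toy pass: appends its tag to the IR once (idempotent afterwards).
    def __init__(self, name, tag, deps=()):
        super().__init__(name, deps)
        self.tag = tag
    def run(self, ir):
        return (ir, False) if self.tag in ir else (ir + [self.tag], True)

engine = FlowEngine()
engine.add(Tag("lower", "lowered"))
engine.add(Tag("sched", "scheduled", deps=("lower",)))
result = engine.run([])   # runs "lower", then "sched", then converges
```

The outer loop naturally accommodates cyclic dependencies as well: mutually dependent passes keep re-running until none of them reports a change, with `max_iters` as a safety bound.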