This paper presents a new method for Monte-Carlo (MC) option pricing on FPGAs that uses a discrete-space random walk over a binomial lattice, rather than the continuous-space walks used by existing approaches. The underlying hypothesis is that the discrete-space walk will significantly reduce the area needed for each MC engine, and that the resulting increase in parallelisation and raw performance outweighs any accuracy loss introduced by the discretisation. Experimental results support this hypothesis, showing that for a given MC simulation size there is no significant loss in accuracy when using a discrete-space model to price path-dependent exotic financial options. Analysis of the binomial simulation model shows that only limited-precision fixed-point arithmetic is needed, and that pairs of MC kernels can share RAM resources. Under realistic constraints on pricing problems, the size of a discrete-space MC engine can be kept to 370 flip-flops and 233 LUTs, allowing up to 3000 variance-reduced MC cores in one FPGA. The combination of a highly parallelisable architecture and model-specific optimisations means that the binomial pricing technique allows a 50x improvement in throughput over existing FPGA approaches, without any reduction in accuracy.
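The core idea above can be illustrated with a minimal software sketch, assuming a standard Cox-Ross-Rubinstein (CRR) lattice: each MC path is reduced to a sequence of Bernoulli up/down steps, so the simulated price depends only on the net lattice position rather than on continuous-valued draws. The function name, parameters, and payoff choice here are illustrative, not the paper's actual engine, which is a fixed-point hardware design.

```python
import math
import random

def crr_mc_price(S0, K, r, sigma, T, steps, paths, seed=0):
    """Discrete-space MC pricing: each path is a random walk on a CRR binomial lattice."""
    rng = random.Random(seed)
    dt = T / steps
    u = math.exp(sigma * math.sqrt(dt))    # up-move factor
    d = 1.0 / u                            # down-move factor
    p = (math.exp(r * dt) - d) / (u - d)   # risk-neutral up probability
    disc = math.exp(-r * T)                # discount factor to present value
    total = 0.0
    for _ in range(paths):
        # net lattice position k = (#up - #down); terminal price depends only on k
        k = sum(1 if rng.random() < p else -1 for _ in range(steps))
        ST = S0 * u ** k
        total += max(ST - K, 0.0)          # European call payoff (for illustration)
    return disc * total / paths
```

Because every path reduces to counting Bernoulli outcomes and one exponentiation, the per-path state is an integer lattice index, which is what makes the small fixed-point hardware engine possible.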
Using FPGAs as a substrate to deploy soft GPUs would enable offering FPGA compute power through a flexible, GPU-like tool flow. Application-specific adaptations, such as selective hardening of floating-point operations, would mitigate the high area and power demands of soft GPUs. This work explores the capabilities and limitations of soft General-Purpose Computing on GPUs (GPGPU) for both fixed- and floating-point arithmetic. For this purpose, we have developed FGPU: a configurable, scalable and portable GPU architecture designed especially for FPGAs. FGPU is open-source and implemented entirely in RTL. It can be programmed in OpenCL and controlled through a Python API. In comparison to homogeneous Multi-Processor Systems-on-Chip (MPSoCs), we found that a soft GPU is a Pareto-optimal solution in terms of throughput per area and energy consumption. On average, FGPU has 2.9x better compute density and 11.2x lower energy consumption than a single MicroBlaze processor when computing in IEEE-754 floating-point format. An average speedup of about 4x over the ARM Cortex-A9, supported by the NEON vector co-processor, has been measured for fixed- and floating-point benchmarks. In addition, the biggest FGPU cores we could realize on a Xilinx Zynq-7000 System-on-Chip (SoC) deliver performance similar to equivalent implementations produced with High-Level Synthesis (HLS).
We show that continuously monitoring on-chip delays at the LUT-to-LUT link level during operation allows an FPGA to detect and self-adapt to aging and environmental timing effects. Using a lightweight (<4% added area) mechanism for monitoring transition timing, a Difference Detector with First-Fail Latch, we can estimate the timing margin of circuits and identify the individual links that have degraded and whose delay determines the worst-case circuit delay. Combined with Choose-Your-own-Adventure precomputed, fine-grained repair alternatives, we introduce a strategy for rapid, in-system incremental repair of links with degraded timing. We show that these techniques allow us to respond to a single aging event in less than 130 ms for the toronto20 benchmarks. The result is a step toward systems where adaptive reconfiguration on the time-scale of seconds is viable and beneficial.
Space processing applications deployed on SRAM-based Field Programmable Gate Arrays (FPGAs) are vulnerable to radiation-induced Single Event Upsets (SEUs). Compared with the well-known SEU mitigation solution --- Triple Modular Redundancy (TMR) with configuration memory scrubbing --- TMR with module-based error recovery (MER) is notably more energy-efficient and responsive in repairing soft errors in the system. Unfortunately, TMR-MER systems must still resort to scrubbing when errors occur between sub-components, such as in interconnection nets, which are not recovered by MER. This paper addresses this problem by proposing a fine-grained module-based error recovery technique that, without additional system hardware, can localize and correct errors that classic MER cannot. We evaluate our proposal via fault-injection campaigns on three types of circuits implemented in Xilinx 7-Series devices. With respect to scrubbing, we observed reductions in the mean time to repair configuration memory errors of between 48.5% and 89.4%, while reductions in the energy used to recover from configuration memory errors were estimated at between 77.4% and 96.1%. These improvements yield higher reliability for systems employing TMR with fine-grained reconfiguration than for equivalent systems relying on scrubbing for configuration error recovery.
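The TMR voting and fault-localization principle underlying MER-style recovery can be sketched in a few lines of software, assuming bit-vector replica outputs. This is only a behavioural illustration of 2-of-3 majority voting and of identifying which replica disagrees (and therefore which module to reconfigure); the names and interfaces are hypothetical, and the paper's contribution concerns recovery at a finer grain than whole modules.

```python
def vote(a, b, c):
    """Bitwise 2-of-3 majority over three replica output words."""
    return (a & b) | (a & c) | (b & c)

def faulty_replica(a, b, c):
    """Return the index (0, 1, or 2) of the single disagreeing replica,
    or None if all replicas match the voted result."""
    m = vote(a, b, c)
    bad = [i for i, v in enumerate((a, b, c)) if v != m]
    return bad[0] if len(bad) == 1 else None
```

In a TMR-MER system, the identified replica would then be repaired by partial reconfiguration of just that module, rather than by scrubbing the whole configuration memory.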
While plentiful on-chip memory is necessary for many designs to fully utilize an FPGA's computational capacity, SRAM scaling is becoming more difficult because of increasing device variation. An alternative is to build FPGA block RAM (BRAM) from magnetic tunnel junctions (MTJs), as this emerging embedded memory has a small cell size, low energy usage, and good scalability. We conduct a detailed comparison study of SRAM and MTJ BRAMs that includes cell designs robust to device variation, transistor-level design and optimization of all the required BRAM-specific circuits, and variation-aware simulation at the 22nm node. At a 256 Kb block size, MTJ-BRAM is 3.06× denser and 55% more energy-efficient, and its Fmax is 274 MHz, which is adequate for most FPGA system clock domains. We also detail further enhancements that allow these 256 Kb MTJ BRAMs to operate at a higher speed of 353 MHz for streaming FIFOs, which are very common in FPGA designs, and describe how the non-volatility of MTJ BRAM enables novel on-chip configuration and power-down modes. For a RAM architecture similar to the latest commercial FPGAs, MTJ-BRAMs could expand FPGA memory capacity by 2.95× with no die size increase.
In an FPGA system-on-chip design, it is often insufficient to merely assess the power consumption of the entire circuit by compile-time estimation or runtime power measurement. Instead, to make better runtime decisions, one must understand the power consumed by each module in the system. In this work, we combine system-level power measurements with register-level activity counting to build an adaptive model that produces a breakdown of power consumption within the design. Online model refinement avoids time-consuming characterisation while also allowing the model to track long-term operating condition changes. Central to our method is an automated flow that selects signals predicted to be indicative of high power consumption and instruments them for monitoring. We name this technique KAPow ('K'ounting Activity for Power estimation) and show it to be accurate, with low overheads, across a range of representative benchmarks. We also propose a strategy for identifying and subsequently eliminating counters found to be of low significance at runtime, reducing algorithmic complexity without sacrificing accuracy. Finally, we demonstrate an application example in which a module-level power breakdown is used to determine an efficient mapping of tasks to modules, reducing system-wide power consumption by up to 7%.
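The modelling step described above can be sketched as a linear fit: each activity counter contributes power in proportion to a learned weight, an affine term captures static power, and the fitted weights let total measured power be apportioned across modules. This is a minimal offline sketch of the idea, assuming a simple least-squares fit rather than the paper's online refinement; all function names and the counter-to-module mapping are illustrative.

```python
import numpy as np

def fit_power_model(counts, power):
    """Least-squares fit of per-counter power weights from system-level samples.

    counts: (samples, counters) array of activity counts per interval
    power:  (samples,) array of measured system power in watts
    Returns (per-counter weights, static-power offset)."""
    X = np.hstack([counts, np.ones((counts.shape[0], 1))])  # affine column = static power
    w, *_ = np.linalg.lstsq(X, power, rcond=None)
    return w[:-1], w[-1]

def module_breakdown(weights, static, counts_now, modules):
    """Attribute dynamic power to modules via a counter-to-module mapping."""
    dyn = weights * counts_now
    out = {}
    for m, d in zip(modules, dyn):
        out[m] = out.get(m, 0.0) + float(d)
    out["static"] = float(static)
    return out
```

An online variant would update the same weights incrementally (e.g. with recursive least squares) as new power samples arrive, which is what lets the model track slow operating-condition drift without a separate characterisation phase.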