A system of interacting agents is, by definition, demanding in terms of computational resources. Although multi-agent systems have been used to solve complex problems in many areas, it is usually very difficult to perform large-scale simulations on the serial computing platforms they typically target. Reconfigurable hardware, in particular Field Programmable Gate Array (FPGA) devices, has been used successfully in High Performance Computing applications due to its inherent flexibility, data parallelism and algorithm acceleration capabilities. Reconfigurable hardware therefore seems to be the next logical step for the agency paradigm, yet only a few attempts to implement multi-agent systems on these platforms have been successful, and there is currently no clear methodology that integrates the agency paradigm with the digital design life cycle. This paper discusses the problem of inter-agent communication on Field Programmable Gate Arrays. It proposes a Network-on-Chip in a hierarchical star topology that enables agent transactions through message broadcasting, using the Open Core Protocol as the interface between hardware modules. A customizable router microarchitecture is described, and a multi-agent system is created to simulate and analyse message exchanges in a generic agent-based application under heavy traffic load. Experiments show a throughput of 1.6 Gbps per port at 100 MHz without packet loss, together with seamless scalability.
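As a rough illustration of the broadcasting scheme described in this abstract, and not of the paper's actual router microarchitecture or Open Core Protocol signalling, the following Python sketch models a two-level hierarchical star: leaf routers deliver a message to their locally attached agents and forward it to a central hub, which re-broadcasts it to every other leaf. All class and method names are illustrative assumptions.

```python
# Hedged sketch: message broadcasting over a two-level (hierarchical) star topology.
# Leaf routers connect agents; a central hub router connects the leaf routers.

class Agent:
    def __init__(self, name):
        self.name = name
        self.inbox = []

    def receive(self, src, payload):
        self.inbox.append((src, payload))


class LeafRouter:
    def __init__(self):
        self.agents = []
        self.hub = None

    def attach(self, agent):
        self.agents.append(agent)

    def deliver(self, src, payload, exclude=None):
        # Hand the message to every locally attached agent except the sender.
        for agent in self.agents:
            if agent is not exclude:
                agent.receive(src, payload)

    def broadcast(self, sender, payload):
        # Local delivery first, then forward upward to the hub router.
        self.deliver(sender.name, payload, exclude=sender)
        if self.hub is not None:
            self.hub.forward(self, sender.name, payload)


class HubRouter:
    def __init__(self):
        self.leaves = []

    def attach(self, leaf):
        self.leaves.append(leaf)
        leaf.hub = self

    def forward(self, origin_leaf, src, payload):
        # Re-broadcast to every other leaf router in the star.
        for leaf in self.leaves:
            if leaf is not origin_leaf:
                leaf.deliver(src, payload)


if __name__ == "__main__":
    hub = HubRouter()
    leaves = [LeafRouter() for _ in range(2)]
    agents = [Agent(f"agent{i}") for i in range(4)]
    for leaf in leaves:
        hub.attach(leaf)
    for i, agent in enumerate(agents):
        leaves[i % 2].attach(agent)

    leaves[0].broadcast(agents[0], "hello")   # reaches agents 1, 2 and 3
    print([(a.name, a.inbox) for a in agents])
```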
Conventional software simulators are so slow that studying large-scale Networks-on-Chip (NoCs) for emerging many-core systems with hundreds to thousands of cores is challenging. In this article, we propose novel methods for fast and cycle-accurate emulation of large-scale NoCs using a single FPGA, supporting both direct and indirect topologies. We first describe a method that avoids the use of slow off-chip memory even when emulating NoCs with thousands of nodes under synthetic workloads. We next present a novel use of time-division multiplexing in which fast on-chip memory blocks are used to emulate a network with only a small number of physical nodes, thereby substantially reducing logic usage. Our evaluation results show that (1) the size of the largest NoC that can be emulated depends only on the FPGA's on-chip memory capacity, (2) a mesh-based NoC with 16,384 nodes (128x128 NoC) and a fat-tree-based NoC with 6,144 switch nodes and 4,096 terminal nodes (4-ary 6-tree NoC) can each be emulated using a single Virtex-7 XC7VX485T FPGA, and (3) when emulating these two NoCs, we achieve 5,050x and 232x speedups, respectively, over BookSim, one of the most widely used NoC simulators, while maintaining the same level of accuracy.
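The time-division multiplexing idea can be sketched in software, although the paper's emulator is RTL on an FPGA. In the hedged Python sketch below, a single "physical" node-update engine is swept across many logical node contexts held in memory to produce one emulated network cycle; the trivial node behaviour, the synthetic workload and all names are illustrative assumptions.

```python
# Hedged sketch: time-division multiplexing of one physical engine
# across N logical NoC nodes whose contexts are kept in memory.

N_NODES = 8          # logical nodes to emulate
node_state = [{"queue": [], "sent": 0} for _ in range(N_NODES)]  # stand-in for on-chip RAM

def inject(node_id, cycle):
    # Synthetic workload: every node injects one packet per emulated cycle.
    return {"src": node_id, "dst": (node_id + 1) % N_NODES, "cycle": cycle}

def node_step(node_id, state, incoming, cycle):
    # Trivial stand-in for one node's per-cycle behaviour:
    # accept incoming packets, inject a new one, forward the head of the queue.
    state["queue"].extend(incoming)
    state["queue"].append(inject(node_id, cycle))
    outgoing = [state["queue"].pop(0)] if state["queue"] else []
    state["sent"] += len(outgoing)
    return outgoing

def emulate(cycles):
    pending = [[] for _ in range(N_NODES)]            # packets in flight per node
    for cycle in range(cycles):
        next_pending = [[] for _ in range(N_NODES)]
        # One physical engine is time-shared: iterate over all node contexts.
        for node_id in range(N_NODES):
            out = node_step(node_id, node_state[node_id], pending[node_id], cycle)
            for pkt in out:
                next_pending[pkt["dst"]].append(pkt)
        pending = next_pending
    return sum(s["sent"] for s in node_state)

if __name__ == "__main__":
    print("packets forwarded:", emulate(100))
```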
Kernel adaptive filters (KAFs) are online machine learning algorithms amenable to highly efficient streaming implementations. They require only a single pass through the data and can act as universal approximators, i.e. they can approximate any continuous function to arbitrary accuracy. KAFs belong to the family of kernel methods, which apply an implicit nonlinear mapping of input data to a high-dimensional feature space, permitting learning algorithms to be expressed entirely in terms of inner products. This approach avoids explicit projection into the feature space, enabling computational efficiency. In this paper, we propose the first fully pipelined implementation of the kernel normalised least mean squares algorithm for regression. Independent training tasks required for hyperparameter optimisation fill the pipeline stages, so no stall cycles are needed to resolve dependencies. Together with other optimisations to reduce resource utilisation and latency, our core achieves 161 GFLOPS on a Virtex-7 XC7VX485T FPGA for a floating-point implementation and 211 GOPS for fixed point. Our PCI Express based floating-point system implementation achieves 80% of the core's speed, corresponding to a speedup of 230x over an optimised implementation on a desktop processor.
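The kernel normalised least mean squares update itself is compact. The sketch below gives a standard software formulation in Python/NumPy, with coherence-based dictionary sparsification, to make the inner-product structure concrete; the Gaussian kernel choice, hyperparameter values and class names are illustrative assumptions and say nothing about the paper's pipelined hardware mapping.

```python
# Hedged sketch: kernel normalised least mean squares (KNLMS) regression
# with a Gaussian kernel and coherence-based dictionary growth.
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

class KNLMS:
    def __init__(self, eta=0.5, eps=1e-2, mu0=0.9, gamma=1.0):
        self.eta, self.eps, self.mu0, self.gamma = eta, eps, mu0, gamma
        self.dictionary = []       # stored centres
        self.alpha = np.zeros(0)   # expansion coefficients

    def _kappa(self, x):
        return np.array([gaussian_kernel(x, c, self.gamma) for c in self.dictionary])

    def predict(self, x):
        if not self.dictionary:
            return 0.0
        return float(self._kappa(x) @ self.alpha)

    def update(self, x, y):
        if not self.dictionary:
            self.dictionary.append(x)
            self.alpha = np.zeros(1)
        k = self._kappa(x)
        # Coherence criterion: add x as a new centre only if it is
        # sufficiently novel with respect to the current dictionary.
        if np.max(k) <= self.mu0:
            self.dictionary.append(x)
            self.alpha = np.append(self.alpha, 0.0)
            k = np.append(k, gaussian_kernel(x, x, self.gamma))
        # Normalised LMS step on the expansion coefficients.
        err = y - k @ self.alpha
        self.alpha += (self.eta / (self.eps + k @ k)) * err * k
        return err

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model = KNLMS()
    for _ in range(2000):
        x = rng.uniform(-3, 3, size=2)
        y = np.sin(x[0]) * np.cos(x[1])        # target function
        model.update(x, y)
    x_test = np.array([0.5, -1.0])
    print(model.predict(x_test), np.sin(0.5) * np.cos(-1.0))
```

Because each update only needs the kernel evaluations against the current dictionary and one normalised correction of the coefficients, independent training runs (for example, over a grid of eta, mu0 and gamma) can proceed concurrently, which is what allows the pipeline stages to be kept full.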
In an FPGA system-on-chip design, it is often insufficient to merely assess the power consumption of the entire circuit by compile-time estimation or runtime power measurement. Instead, to make better runtime decisions, one must understand the power consumed by each module in the system. In this work, we combine system-level power measurements with register-level activity counting to build an adaptive model that produces a breakdown of power consumption within the design. Online model refinement avoids time-consuming characterisation while also allowing the model to track long-term operating condition changes. Central to our method is an automated flow that selects signals predicted to be indicative of high power consumption and instruments them for monitoring. We name this technique KAPow, for 'K'ounting Activity for Power estimation, and show it to be accurate and to have low overheads across a range of representative benchmarks. We also propose a strategy for identifying and subsequently eliminating counters found to be of low significance at runtime, reducing algorithmic complexity without sacrificing accuracy. Finally, we demonstrate an application example in which a module-level power breakdown is used to determine an efficient mapping of tasks to modules, reducing system-wide power consumption by up to 7%.
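To make the model-fitting idea concrete, here is a hedged Python sketch, not the KAPow implementation itself: per-signal activity counts form a feature vector, a system-level power measurement provides the target, and an online least squares update refines per-counter weights; a module's estimated share is then the sum of its counters' contributions. The use of recursive least squares and all names here are illustrative assumptions.

```python
# Hedged sketch: online fit of a linear power model from activity counters,
# then a per-module breakdown from the learned per-counter weights.
import numpy as np

class OnlinePowerModel:
    """Recursive least squares over [1, a_1, ..., a_K] -> measured power."""

    def __init__(self, n_counters, forgetting=0.99):
        n = n_counters + 1                # +1 for a constant (static power) term
        self.w = np.zeros(n)              # model weights
        self.P = np.eye(n) * 1e3          # inverse correlation estimate
        self.lam = forgetting

    def update(self, activities, measured_power):
        x = np.concatenate(([1.0], activities))
        gain = self.P @ x / (self.lam + x @ self.P @ x)
        err = measured_power - self.w @ x
        self.w += gain * err
        self.P = (self.P - np.outer(gain, x @ self.P)) / self.lam
        return err

    def module_breakdown(self, activities, counters_per_module):
        # Attribute each counter's contribution to the module that owns it.
        contrib = self.w[1:] * activities
        return {m: float(np.sum(contrib[idx])) for m, idx in counters_per_module.items()}

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_w = np.array([0.5, 0.02, 0.05, 0.01])          # hidden synthetic "true" model
    model = OnlinePowerModel(n_counters=3)
    counters_per_module = {"modA": [0, 1], "modB": [2]}  # which counters belong to which module
    for _ in range(500):
        a = rng.integers(0, 1000, size=3).astype(float)      # activity counts per interval
        p = true_w[0] + true_w[1:] @ a + rng.normal(0, 0.05)  # noisy board-level reading
        model.update(a, p)
    a = np.array([400.0, 200.0, 800.0])
    print(model.module_breakdown(a, counters_per_module))
```

Dropping a counter whose learned weight stays near zero corresponds, in this sketch, to removing its column from the model, which mirrors the paper's strategy of eliminating low-significance counters at runtime.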