TL;DR

TinyTinyTPU is an educational, open-source 2×2 systolic-array implementation of a TPU-style matrix-multiply unit written in SystemVerilog and deployed on a Basys3 FPGA. The project includes a full post-MAC pipeline, UART host interface, multi-layer MLP inference support, simulation tests, and both vendor and open-source FPGA toolflow instructions.

What happened

A minimal TPU-like accelerator named TinyTinyTPU has been published as a SystemVerilog project and targeted to the Basys3 (Xilinx Artix-7) FPGA. The design implements a 2×2 systolic array composed of four processing elements and a complete post-MAC pipeline that performs accumulation, activation (ReLU/ReLU6), normalization, and quantization. A UART-based bridge and a Python host driver let a PC load weights and activations, trigger execution, and read results; example scripts include an inference demo and a gesture-recognition demo built around a two-layer MLP. The repository provides a simulation environment using Verilator and cocotb with testbenches and waveform generation, as well as FPGA build files and instructions for both Xilinx Vivado and an open-source flow using Yosys and nextpnr. Resource usage on the Basys3 is documented and the project includes automated tests for individual modules and system integration.

Why it matters

  • Provides a compact, hands-on reference for how systolic arrays implement matrix multiply and dataflow timing.
  • Packages a full inference path (weights → MMU → accumulator → activation → quantization) suitable for learning and FPGA prototyping.
  • Includes both vendor and open-source FPGA toolflow guidance, lowering the barrier to experiment with RTL and bitstream generation.
  • Demonstrates an end-to-end host interface and example applications (inference and gesture demo) that make the design usable on real hardware.
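The inference path above (weights → MMU → accumulator → activation → quantization) can be sketched as a small numeric model. This is an illustrative approximation, not the project's RTL: the fixed-point multiplier/shift values (`mult=77`, `shift=8`) are hypothetical placeholders for whatever normalization constants the real quantizer uses.

```python
import numpy as np

def post_mac_pipeline(acc, mult=77, shift=8, relu6_limit=None):
    """Model of the post-MAC stages: activation, normalization
    (fixed-point rescale), and quantization to signed 8-bit."""
    # Activation: ReLU on the 32-bit accumulator values
    act = np.maximum(acc, 0)
    if relu6_limit is not None:          # ReLU6 also clamps the top
        act = np.minimum(act, relu6_limit)
    # Normalize/requantize: fixed-point multiply then right-shift
    scaled = (act * mult) >> shift
    # Saturate into the signed 8-bit output range
    return np.clip(scaled, -128, 127).astype(np.int8)

# An accumulator tile such as a 2x2 systolic pass might produce
acc = np.array([[19, 10], [43, 14]], dtype=np.int32)
out = post_mac_pipeline(acc)
```

The multiply-then-shift pattern mirrors how hardware typically requantizes without a divider: one DSP multiply and a constant right-shift per output element.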

Key facts

  • Implements a 2×2 systolic array (4 processing elements) in SystemVerilog.
  • Includes a full post-MAC pipeline: accumulator, activation (ReLU/ReLU6), normalizer, and quantizer.
  • Host interface is UART-based with a byte-oriented command protocol (commands for write weight, write activation, execute, read result, read status).
  • Target FPGA: Basys3 (Xilinx Artix-7); UART pins: RX on B18, TX on A18; clock is the onboard 100 MHz oscillator.
  • Documented Basys3 resource usage: ~1,000 LUTs (~5%), ~1,000 flip-flops (~3%), 8 DSP48E1 slices, ~10–15 BRAM blocks; estimated gate count ~25,000.
  • Simulation/test tooling: Verilator 5.022+, cocotb, Python 3.8+, with per-module tests and waveform generation.
  • Host software: Python drivers and demos (inference_demo.py, gesture_demo.py) that can write weights/activations, execute inference, and read results.
  • MLP support: multi-layer sequential inference with a controller state machine and double-buffered activations to overlap computation and weight loading.
  • Open-source build option: Yosys for synthesis and nextpnr for place-and-route for Xilinx 7-series devices.
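A byte-oriented command protocol like the one described (write weight, write activation, execute, read result, read status) can be framed from the host side as below. The opcode values and 4-byte payload size are hypothetical placeholders; the real byte assignments live in the repository's Python driver, and the transport object would typically be a pyserial `Serial` opened on the Basys3's UART.

```python
import struct

# Hypothetical opcode values -- the actual assignments are defined
# in the repository's host driver, these are illustrative only.
CMD_WRITE_WEIGHT, CMD_WRITE_ACT = 0x01, 0x02
CMD_EXECUTE, CMD_READ_RESULT, CMD_READ_STATUS = 0x03, 0x04, 0x05

class TPULink:
    """Byte-oriented command framing over any UART-like transport
    exposing .write(bytes) and .read(n), e.g. serial.Serial."""
    def __init__(self, uart):
        self.uart = uart

    def write_weights(self, w4):
        # Command byte, then the four signed int8 weights of a 2x2 tile
        self.uart.write(bytes([CMD_WRITE_WEIGHT]) + struct.pack("4b", *w4))

    def write_activations(self, a4):
        self.uart.write(bytes([CMD_WRITE_ACT]) + struct.pack("4b", *a4))

    def execute(self):
        self.uart.write(bytes([CMD_EXECUTE]))

    def read_status(self):
        self.uart.write(bytes([CMD_READ_STATUS]))
        return self.uart.read(1)[0]

    def read_result(self):
        self.uart.write(bytes([CMD_READ_RESULT]))
        return list(struct.unpack("4b", self.uart.read(4)))
```

Keeping the framing this simple (one opcode byte plus a fixed payload) is what makes a plain UART at a modest baud rate sufficient for a 2×2 array.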

What to watch next

  • Efforts to scale the design beyond the 2×2 array for larger, production-like accelerators (the repo notes scaling to sizes such as 256×256).
  • Compatibility and maturity of the Yosys + nextpnr flow for Basys3 builds versus the Vivado flow.
  • Further demonstrations or benchmarks comparing the tiny TPU's inference behavior and accuracy across more models and datasets.

Quick glossary

  • Systolic array: A hardware architecture where data pulses through a grid of processing elements in lockstep, enabling efficient regular matrix operations.
  • TPU (Tensor Processing Unit): A class of accelerators optimized for machine learning workloads, particularly matrix-multiply and convolution operations.
  • FPGA (Field-Programmable Gate Array): A reconfigurable semiconductor device that can be programmed with hardware descriptions to implement custom digital circuits.
  • BRAM (Block RAM): On-chip memory blocks built into FPGAs, used for storage such as activation buffers or weight memories.
  • DSP slice: A specialized hardware block on FPGAs designed to perform arithmetic operations like multiply-accumulate efficiently.
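The systolic dataflow in the glossary can be made concrete with a cycle-level software model. This sketch assumes an output-stationary organization (each PE accumulates one output element while operands skew past it); it illustrates the lockstep timing, not the specific dataflow the RTL implements.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level model of an output-stationary systolic array:
    row i of A enters from the left delayed by i cycles, column j
    of B enters from the top delayed by j cycles, and PE(i, j)
    multiply-accumulates whichever operand pair passes through it."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=np.int64)
    cycles = 3 * n - 2  # time for the skewed wavefront to drain
    for t in range(cycles):
        for i in range(n):
            for j in range(n):
                # A[i, k] reaches PE(i, j) at cycle t = i + k + j
                k = t - i - j
                if 0 <= k < n:
                    C[i, j] += A[i, k] * B[k, j]
    return C
```

For the 2×2 case this takes 4 cycles end to end; the same skewing rule scales directly to larger arrays, which is why the architecture generalizes to sizes like 256×256.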

Reader FAQ

What is TinyTinyTPU?
An educational, open-source SystemVerilog implementation of a 2×2 TPU-style systolic-array matrix-multiply unit, with full pipeline and FPGA deployment files.

Which FPGA board is supported?
The project targets the Basys3 (Xilinx Artix-7) board; Basys3-specific pinouts and build scripts are included.

Can it run multi-layer inference?
Yes. The design includes an MLP controller and double-buffered activations to support sequential multi-layer inference.
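The double-buffering idea can be sketched in software: two ping-pong activation buffers alternate between feeding the array and collecting the next layer's output, so weight loading for layer N+1 can overlap computation of layer N. This is a hypothetical software analogue of the controller's behavior, not a transcription of the RTL state machine.

```python
import numpy as np

def mlp_inference(x, layers):
    """Sequential multi-layer inference with ping-pong activation
    buffers, mirroring the double-buffered controller structure."""
    buf = [x, None]  # two activation buffers that swap roles
    cur = 0
    for W in layers:
        nxt = 1 - cur
        # Compute this layer into the idle buffer; in hardware the
        # next layer's weights could stream in concurrently.
        buf[nxt] = np.maximum(buf[cur] @ W, 0)  # matmul + ReLU
        cur = nxt                               # swap buffers
    return buf[cur]
```

Because each layer reads one buffer and writes the other, no layer ever overwrites activations it still needs, which is the property that makes the overlap safe.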

Is an open-source FPGA toolchain supported?
Yes. The repository documents using Yosys for synthesis and nextpnr for place-and-route as an alternative to Vivado.
