TL;DR

TinyTinyTPU is an educational, open-source 2×2 systolic-array implementation of a TPU-style matrix-multiply unit written in SystemVerilog and deployed on a Basys3 FPGA. The project includes a full post-MAC pipeline, UART host interface, multi-layer MLP inference support, simulation tests, and both vendor and open-source FPGA toolflow instructions.

What happened

A minimal TPU-like accelerator named TinyTinyTPU has been published as a SystemVerilog project and targeted to the Basys3 (Xilinx Artix-7) FPGA. The design implements a 2×2 systolic array composed of four processing elements and a complete post-MAC pipeline that performs accumulation, activation (ReLU/ReLU6), normalization, and quantization. A UART-based bridge and a Python host driver let a PC load weights and activations, trigger execution, and read results; example scripts include an inference demo and a gesture-recognition demo built around a two-layer MLP. The repository provides a simulation environment using Verilator and cocotb with testbenches and waveform generation, as well as FPGA build files and instructions for both Xilinx Vivado and an open-source flow using Yosys and nextpnr. Resource usage on the Basys3 is documented and the project includes automated tests for individual modules and system integration.

Why it matters

  • Provides a compact, hands-on reference for how systolic arrays implement matrix multiply and dataflow timing.
  • Packages a full inference path (weights → MMU → accumulator → activation → quantization) suitable for learning and FPGA prototyping.
  • Includes both vendor and open-source FPGA toolflow guidance, lowering the barrier to experiment with RTL and bitstream generation.
  • Demonstrates an end-to-end host interface and example applications (inference and gesture demo) that make the design usable on real hardware.
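The inference path above (weights → MMU → accumulator → activation → quantization) can be sketched as a small numeric model. This is an illustrative approximation, not the project's RTL: the fixed-point multiplier/shift values (`mult=77`, `shift=8`) are hypothetical placeholders for whatever normalization constants the real quantizer uses.

```python
import numpy as np

def post_mac_pipeline(acc, mult=77, shift=8, relu6_limit=None):
    """Model of the post-MAC stages: activation, normalization
    (fixed-point rescale), and quantization to signed 8-bit."""
    # Activation: ReLU on the 32-bit accumulator values
    act = np.maximum(acc, 0)
    if relu6_limit is not None:          # ReLU6 also clamps the top
        act = np.minimum(act, relu6_limit)
    # Normalize/requantize: fixed-point multiply then right-shift
    scaled = (act * mult) >> shift
    # Saturate into the signed 8-bit output range
    return np.clip(scaled, -128, 127).astype(np.int8)

# An accumulator tile such as a 2x2 systolic pass might produce
acc = np.array([[19, 10], [43, 14]], dtype=np.int32)
out = post_mac_pipeline(acc)
```

The multiply-then-shift pattern mirrors how hardware typically requantizes without a divider: one DSP multiply and a constant right-shift per output element.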

Key facts

  • Implements a 2×2 systolic array (4 processing elements) in SystemVerilog.
  • Includes a full post-MAC pipeline: accumulator, activation (ReLU/ReLU6), normalizer, and quantizer.
  • Host interface is UART-based with a byte-oriented command protocol (commands for write weight, write activation, execute, read result, read status).
  • Target FPGA: Basys3 (Xilinx Artix-7); UART pins: RX on B18, TX on A18; clock is the onboard 100 MHz oscillator.
  • Documented Basys3 resource usage: ~1,000 LUTs (~5%), ~1,000 flip-flops (~3%), 8 DSP48E1 slices, ~10–15 BRAM blocks; estimated gate count ~25,000.
  • Simulation/test tooling: Verilator 5.022+, cocotb, Python 3.8+, with per-module tests and waveform generation.
  • Host software: Python drivers and demos (inference_demo.py, gesture_demo.py) that can write weights/activations, execute inference, and read results.
  • MLP support: multi-layer sequential inference with a controller state machine and double-buffered activations to overlap computation and weight loading.
  • Open-source build option: Yosys for synthesis and nextpnr for place-and-route for Xilinx 7-series devices.
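A byte-oriented command protocol like the one described (write weight, write activation, execute, read result, read status) can be framed from the host side as below. The opcode values and 4-byte payload size are hypothetical placeholders; the real byte assignments live in the repository's Python driver, and the transport object would typically be a pyserial `Serial` opened on the Basys3's UART.

```python
import struct

# Hypothetical opcode values -- the actual assignments are defined
# in the repository's host driver, these are illustrative only.
CMD_WRITE_WEIGHT, CMD_WRITE_ACT = 0x01, 0x02
CMD_EXECUTE, CMD_READ_RESULT, CMD_READ_STATUS = 0x03, 0x04, 0x05

class TPULink:
    """Byte-oriented command framing over any UART-like transport
    exposing .write(bytes) and .read(n), e.g. serial.Serial."""
    def __init__(self, uart):
        self.uart = uart

    def write_weights(self, w4):
        # Command byte, then the four signed int8 weights of a 2x2 tile
        self.uart.write(bytes([CMD_WRITE_WEIGHT]) + struct.pack("4b", *w4))

    def write_activations(self, a4):
        self.uart.write(bytes([CMD_WRITE_ACT]) + struct.pack("4b", *a4))

    def execute(self):
        self.uart.write(bytes([CMD_EXECUTE]))

    def read_status(self):
        self.uart.write(bytes([CMD_READ_STATUS]))
        return self.uart.read(1)[0]

    def read_result(self):
        self.uart.write(bytes([CMD_READ_RESULT]))
        return list(struct.unpack("4b", self.uart.read(4)))
```

Keeping the framing this simple (one opcode byte plus a fixed payload) is what makes a plain UART at a modest baud rate sufficient for a 2×2 array.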

What to watch next

  • Efforts to scale the design beyond the 2×2 array for larger, production-like accelerators (the repo notes scaling to sizes such as 256×256).
  • Compatibility and maturity of the Yosys + nextpnr flow for Basys3 builds versus the Vivado flow.
  • Further demonstrations or benchmarks comparing the tiny TPU's inference behavior and accuracy across more models and datasets.

Quick glossary

  • Systolic array: A hardware architecture where data pulses through a grid of processing elements in lockstep, enabling efficient regular matrix operations.
  • TPU (Tensor Processing Unit): A class of accelerators optimized for machine learning workloads, particularly matrix-multiply and convolution operations.
  • FPGA (Field-Programmable Gate Array): A reconfigurable semiconductor device that can be programmed with hardware descriptions to implement custom digital circuits.
  • BRAM (Block RAM): On-chip memory blocks built into FPGAs, used for storage such as activation buffers or weight memories.
  • DSP slice: A specialized hardware block on FPGAs designed to perform arithmetic operations like multiply-accumulate efficiently.
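The systolic dataflow in the glossary can be made concrete with a cycle-level software model. This sketch assumes an output-stationary organization (each PE accumulates one output element while operands skew past it); it illustrates the lockstep timing, not the specific dataflow the RTL implements.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level model of an output-stationary systolic array:
    row i of A enters from the left delayed by i cycles, column j
    of B enters from the top delayed by j cycles, and PE(i, j)
    multiply-accumulates whichever operand pair passes through it."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=np.int64)
    cycles = 3 * n - 2  # time for the skewed wavefront to drain
    for t in range(cycles):
        for i in range(n):
            for j in range(n):
                # A[i, k] reaches PE(i, j) at cycle t = i + k + j
                k = t - i - j
                if 0 <= k < n:
                    C[i, j] += A[i, k] * B[k, j]
    return C
```

For the 2×2 case this takes 4 cycles end to end; the same skewing rule scales directly to larger arrays, which is why the architecture generalizes to sizes like 256×256.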

Reader FAQ

What is TinyTinyTPU?
An educational, open-source SystemVerilog implementation of a 2×2 TPU-style systolic-array matrix-multiply unit, with full pipeline and FPGA deployment files.

Which FPGA board is supported?
The project targets the Basys3 (Xilinx Artix-7) board; Basys3-specific pinouts and build scripts are included.

Can it run multi-layer inference?
Yes. The design includes an MLP controller and double-buffered activations to support sequential multi-layer inference.
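The double-buffering idea can be sketched in software: two ping-pong activation buffers alternate between feeding the array and collecting the next layer's output, so weight loading for layer N+1 can overlap computation of layer N. This is a hypothetical software analogue of the controller's behavior, not a transcription of the RTL state machine.

```python
import numpy as np

def mlp_inference(x, layers):
    """Sequential multi-layer inference with ping-pong activation
    buffers, mirroring the double-buffered controller structure."""
    buf = [x, None]  # two activation buffers that swap roles
    cur = 0
    for W in layers:
        nxt = 1 - cur
        # Compute this layer into the idle buffer; in hardware the
        # next layer's weights could stream in concurrently.
        buf[nxt] = np.maximum(buf[cur] @ W, 0)  # matmul + ReLU
        cur = nxt                               # swap buffers
    return buf[cur]
```

Because each layer reads one buffer and writes the other, no layer ever overwrites activations it still needs, which is the property that makes the overlap safe.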

Is an open-source FPGA toolchain supported?
Yes. The repository documents using Yosys for synthesis and nextpnr for place-and-route as an alternative to Vivado.
