TL;DR

Z80-μLM is a tiny character-level 'conversational' model engineered to run on a 4MHz Z80 with 64KB RAM, packaged as an ~40KB CP/M .COM file that includes the inference code, weights, and a chat UI. It uses trigram hash encoding, 2-bit quantized weights, and 16-bit integer arithmetic to squeeze neural inference onto classic 8-bit hardware.

What happened

A developer published Z80-μLM, a micro language model built to run on vintage Z80 processors and fit inside CP/M's Transient Program Area as an ~40KB .COM file. The model hashes input text into 128 trigram buckets for the query plus another 128 for conversational context, performs inference with 2-bit weight quantization ({-2, -1, 0, +1}, packed four weights per byte), and carries out all math with the 16-bit integer operations native to the Z80. Output is generated autoregressively, one character at a time; interaction modes include a 'tinychat' chatbot and a 20-questions-style 'guess' game. The repository includes training guidance (TRAINING.md), tools to synthesize training data via LLM APIs, and tightly optimized Z80 inner loops for unpacking weights and performing multiply-accumulate operations. The project is dual-licensed under MIT or Apache-2.0.
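
For intuition, here is a minimal C sketch of the kind of inner loop described above: unpack four 2-bit weights from each byte and multiply-accumulate into a 16-bit total, as a Z80 would in a register pair. The two's-complement decode table and LSB-first bit order are assumptions for illustration; the project itself uses hand-tuned Z80 assembly, not C.

    #include <stdint.h>

    /* Decode table for 2-bit codes 00,01,10,11 -> {0,+1,-2,-1}.
     * This two's-complement mapping is an assumption; the repo's
     * exact encoding and bit order may differ. */
    static const int8_t W_DECODE[4] = { 0, 1, -2, -1 };

    /* Dot product of one packed weight row against an 8-bit
     * activation vector. `n` is the weight count (multiple of 4);
     * the accumulator is 16-bit, mirroring a Z80 register pair. */
    int16_t mac_row_2bit(const uint8_t *packed, const int8_t *act, int n)
    {
        int16_t acc = 0;
        for (int i = 0; i < n; i += 4) {
            uint8_t byte = packed[i / 4];
            for (int j = 0; j < 4; j++) {      /* unpack LSB-first */
                int8_t w = W_DECODE[byte & 0x03];
                byte >>= 2;
                acc += (int16_t)w * act[i + j];
            }
        }
        return acc;
    }

Because each weight is only -2, -1, 0, or +1, every "multiply" on real hardware collapses into a skip, an add, a subtract, or a double subtract, which is why the scheme suits a CPU with no hardware multiplier.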

Why it matters

  • Demonstrates how quantization-aware training and careful engineering can enable neural inference on decades-old 8-bit hardware.
  • Shows a path to extremely small, self-hosted AI binaries that include both model and UI within constrained storage.
  • Provides an educational reference for low-level implementation techniques (packed weights, 16-bit accumulators, fixed-point math).
  • Highlights a different interaction paradigm: terse, categorized outputs rather than free-form, generative responses.

Key facts

  • Final binary size is roughly 40KB as a CP/M .COM file, claimed to include inference, weights and a chat-style UI.
  • Target platform is a Z80 CPU running at 4MHz with 64KB of RAM (historic Z80-era hardware).
  • Input encoding is trigram hashing into 128 buckets (query) plus 128 context buckets, designed to be typo tolerant and word-order invariant (see the hashing sketch after this list).
  • Weights are quantized to 2 bits per weight with possible values {-2, -1, 0, +1} and packed four weights per byte.
  • Inference uses 16-bit signed integer arithmetic (Z80 register pairs) and fixed-point scaling; no floating point is used.
  • Model outputs characters autoregressively, producing short character-by-character responses.
  • Two example models are included: 'tinychat' (casual Q&A) and 'guess' (a 20 Questions yes/no/maybe game).
  • Repository includes training guidance (TRAINING.md) and utilities for generating training data via Ollama or Claude APIs.
  • Dual-licensed: available under MIT or Apache-2.0.
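
As an illustration of the trigram-bucket encoding listed above, the following C sketch hashes overlapping three-character windows into a 128-bucket count vector. The multiplicative hash, lowercasing step, and saturating counts are illustrative assumptions, not details taken from the repository.

    #include <ctype.h>
    #include <stdint.h>
    #include <string.h>

    #define N_BUCKETS 128   /* per the key facts: 128 query + 128 context */

    /* Hash each overlapping trigram of `text` into one of 128 buckets.
     * Lowercasing gives some case/typo tolerance, and because only
     * bucket counts survive, word order is largely discarded. */
    void trigram_encode(const char *text, uint8_t buckets[N_BUCKETS])
    {
        memset(buckets, 0, N_BUCKETS);
        size_t len = strlen(text);
        for (size_t i = 0; i + 2 < len; i++) {
            uint16_t h = 0;
            for (int j = 0; j < 3; j++)
                h = h * 31 + (uint8_t)tolower((unsigned char)text[i + j]);
            uint8_t b = h & (N_BUCKETS - 1);    /* fold into 0..127 */
            if (buckets[b] < 255) buckets[b]++; /* saturating count */
        }
    }

In this scheme the query and the running conversation context would each get their own 128-bucket vector, matching the 128 + 128 layout described above.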

What to watch next

  • Whether the project draws more community contributions or ports to other retro platforms (not confirmed in the source).
  • Performance and latency measurements on real Z80 hardware beyond the provided examples (not confirmed in the source).
  • Additional prebuilt models, larger vocabularies or multi-turn context tracking improvements in future updates (not confirmed in the source).

Quick glossary

  • Quantization-aware training (QAT): A training approach that accounts for reduced numeric precision during training so model weights remain effective when quantized for deployment.
  • Trigram hash encoding: A method that maps overlapping three-character sequences from input text into a fixed set of buckets to create a compact, typo-tolerant feature vector.
  • Autoregressive generation: A generation method where the model produces output one token (here, a character) at a time, conditioning each prediction on previous outputs (see the decoding sketch after this glossary).
  • CP/M Transient Program Area (TPA): The memory region in the CP/M operating system where transient executable programs (like .COM files) are loaded and run.
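
To make the autoregressive entry concrete, here is a greedy character-by-character decoding loop in C. Greedy argmax selection and the helpers model_forward, vocab_char, and state_push are hypothetical stand-ins; the project's actual decoding strategy is not specified here.

    #include <stdint.h>
    #include <stdio.h>

    #define VOCAB 64   /* illustrative character-vocabulary size (assumed) */

    /* Hypothetical model interface: score the next character, map an
     * index to a character, and fold an emitted character back into
     * the model state. None of these names are from the repository. */
    extern void model_forward(const uint8_t *state, int16_t logits[VOCAB]);
    extern char vocab_char(int idx);
    extern void state_push(uint8_t *state, char c);

    /* Emit the highest-scoring character, feed it back in, repeat. */
    void generate(uint8_t *state, int max_len)
    {
        int16_t logits[VOCAB];
        for (int t = 0; t < max_len; t++) {
            model_forward(state, logits);
            int best = 0;
            for (int i = 1; i < VOCAB; i++)
                if (logits[i] > logits[best]) best = i;
            char c = vocab_char(best);
            if (c == '\0') break;    /* end-of-response marker */
            putchar(c);
            state_push(state, c);    /* condition on what was just emitted */
        }
    }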

Reader FAQ

Can Z80-μLM run on real Z80 hardware?
The project targets a 4MHz Z80 with 64KB RAM and builds into an ~40KB CP/M .COM, indicating it is designed for real Z80 environments.

Does the model use floating point math?
No. All arithmetic is integer-based with fixed-point scaling; no floating point is used.
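
As a rough sketch of what that fixed-point scaling could look like, the snippet below multiplies a 16-bit accumulator by a scale constant in Q8.8 format and shifts the fractional bits back out; the Q8.8 format and round-to-nearest choice are assumptions for illustration.

    #include <stdint.h>

    /* Rescale a 16-bit accumulator by a Q8.8 fixed-point constant. */
    static inline int16_t rescale_q88(int16_t acc, int16_t scale_q88)
    {
        int32_t wide = (int32_t)acc * scale_q88;  /* 16x16 -> 32-bit */
        return (int16_t)((wide + 128) >> 8);      /* round, drop 8 frac bits */
    }

On a real Z80 the 32-bit intermediate would itself be built from 8-bit partial products; the sketch shows only the arithmetic, not the register-level implementation.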

Will this pass the Turing test or act like a modern chatbot?
No. The source states it will not pass the Turing test and is not intended to produce novel, deeply contextual sentences.

Is the source code and license available?
Yes; the repository offers MIT or Apache-2.0 licensing options.

Can it be fine-tuned or trained by users?
The repository includes a TRAINING.md and tools for generating training data, but details about training on specific hardware are not confirmed in the source.

Sources

  • Z80-μLM: A Retrocomputing Micro Language Model. "Z80-μLM is a 'conversational AI' that generates short character-by-character sequences, with quantization-aware training (QAT) to run on a Z80 processor with 64KB of RAM…"
