TL;DR

ByteShape used a bitlength-learning approach (ShapeLearn) to quantize Qwen3-30B-A3B-Instruct-2507 so it fits and runs interactively on constrained hardware. On a Raspberry Pi 5 (16GB), a ByteShape Q3_K_S-2.70bpw configuration reached about 8.03 TPS while preserving ~94.18% of BF16 quality; ByteShape models also showed favorable TPS/quality tradeoffs across CPUs, desktop GPUs and high-end GPUs versus Unsloth and MagicQuant.

What happened

ByteShape applied its ShapeLearn bitlength-learning technique to the Qwen3-30B-A3B-Instruct-2507 model, selecting per-weight datatypes to balance tokens-per-second (TPS) and output fidelity while keeping the quantized model within device memory. On a Raspberry Pi 5 with 16 GB of RAM, the selected Q3_K_S-2.70bpw variant ran at 8.03 TPS and retained about 94.18% of BF16-equivalent quality, which the team describes as perceptually real-time for interactive generation. The team also evaluated the same model family on other CPUs, including an Intel i7 machine with 64 GB of RAM, and on GPUs (RTX 5090 and RTX 4080). Results showed device-dependent sweet spots, notably a ~4-bit region on the RTX 5090 where several quantizations clustered at high TPS, and demonstrated that using fewer bits does not always yield faster inference because of kernel and overhead effects. Across the tested setups, ByteShape quantizations generally produced a stronger TPS-versus-quality curve than competing methods.
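
The selection idea can be illustrated with a small sketch. This is not ShapeLearn itself (which learns per-weight bitlengths), and the candidate numbers below are placeholders rather than measurements; it only shows the "fit memory first, then trade off TPS against quality" logic described above.

```python
# Illustrative sketch only: a simple device-aware picker over hypothetical
# quantization candidates. Numbers are placeholders, not source measurements.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str             # quantization label
    footprint_gb: float   # model size after quantization
    tps: float            # measured tokens per second on the target device
    quality: float        # fraction of BF16 quality retained (0..1)

def pick(candidates, mem_budget_gb, quality_floor):
    """Keep candidates that fit in memory, then maximize TPS above a quality floor."""
    fitting = [c for c in candidates if c.footprint_gb <= mem_budget_gb]
    eligible = [c for c in fitting if c.quality >= quality_floor]
    return max(eligible, key=lambda c: c.tps) if eligible else None

# Placeholder candidates for illustration.
cands = [
    Candidate("Q3_K_S-2.70bpw", 10.3, 8.0, 0.94),
    Candidate("Q3_K_S-3.25bpw", 12.4, 7.1, 0.97),
    Candidate("IQ4_XS-4.67bpw", 17.8, 5.9, 0.995),
]
print(pick(cands, mem_budget_gb=16.0, quality_floor=0.90))
```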

Why it matters

  • A 30B model can run interactively on a Pi-class device when quantized and targeted to fit available memory.
  • Perceived responsiveness for interactive generation appears achievable at roughly 8 tokens per second on edge devices (see the latency arithmetic after this list).
  • Quantization choices must account for kernel and device-specific overheads; lower bitwidths do not guarantee higher speed.
  • Device-aware bitlength selection (prioritizing TPS and quality once memory fits) can yield better practical tradeoffs than minimizing model size alone.
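
For context on the ~8 TPS figure above, a quick back-of-the-envelope conversion (my arithmetic, not from the source) shows why that rate feels interactive:

```python
# Convert throughput to perceived responsiveness.
# 8.03 TPS is the Raspberry Pi 5 figure reported above; the rest is arithmetic.
tps = 8.03
ms_per_token = 1000.0 / tps        # ~125 ms between tokens
words_per_second = tps * 0.75      # rough heuristic: ~0.75 English words per token
print(f"{ms_per_token:.0f} ms/token, ~{words_per_second:.1f} words/sec")
# Roughly 125 ms/token and ~6 words/sec, comparable to a comfortable reading pace.
```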

Key facts

  • Model evaluated: Qwen3-30B-A3B-Instruct-2507 using ByteShape's ShapeLearn quantization.
  • Raspberry Pi 5 (16 GB): Q3_K_S-2.70bpw achieved 8.03 TPS at 2.70 bits-per-weight (BPW) and ~94.18% of BF16 quality.
  • The team demonstrated sustained throughput of roughly 8–8.5 TPS at more than 92% of baseline accuracy with a 30B model on a Raspberry Pi.
  • CPU behavior: once a model fits in RAM, reducing footprint generally increased TPS in a predictable way if datatypes were chosen well.
  • Intel i7 (64 GB): ByteShape reported ~23.1 TPS at ~98% accuracy for the Q3_K_S-3.25bpw (3.25 BPW) configuration.
  • RTX 5090 (32 GB): a ~4-bit quantization cluster produced very high TPS (~302–303 TPS) at ≈98.4–98.9% accuracy; ByteShape had the most accurate high-end model (IQ4_XS-4.67bpw: 272.98 TPS, 99.75% accuracy).
  • RTX 4080 (16 GB): the smaller VRAM ruled out the 4-bit sweet spot, but ByteShape models still outperformed Unsloth under the same 16 GB constraint (example: IQ4_XS-3.87bpw at ~214.81 TPS and ~98.66% accuracy); see the footprint arithmetic after this list.
  • Across multiple device classes, ByteShape models often delivered higher TPS at comparable quality or higher quality at similar TPS versus Unsloth and MagicQuant in these tests.
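
To see why memory fit, not raw speed, is the first gate in these results, here is a rough weight-only footprint estimate. It assumes ~30.5B total parameters for Qwen3-30B-A3B (an assumption, not stated in the source) and ignores KV cache, activations, and runtime buffers, which consume additional memory:

```python
# Rough weight-only footprint: parameters * bits-per-weight / 8 bits-per-byte.
# Assumes ~30.5B total parameters for Qwen3-30B-A3B; KV cache and runtime
# buffers add more on top, so these are optimistic lower bounds.
PARAMS = 30.5e9

def weight_gb(bpw: float) -> float:
    return PARAMS * bpw / 8 / 1e9

for name, bpw in [("Q3_K_S-2.70bpw", 2.70),
                  ("IQ4_XS-3.87bpw", 3.87),
                  ("IQ4_XS-4.67bpw", 4.67)]:
    print(f"{name}: ~{weight_gb(bpw):.1f} GB of weights")
# ~10.3 GB and ~14.8 GB fit a 16 GB Pi 5 or RTX 4080; ~17.8 GB needs the 32 GB RTX 5090.
```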

What to watch next

  • How kernel-level decode and matmul/matvec paths in frameworks like llama.cpp evolve, since they drive device-specific quantization sweet spots.
  • Broader adoption of per-weight bitlength learning for other large models and edge devices, and whether similar TPS/quality tradeoffs appear.
  • Not confirmed in the source: whether this exact 30B configuration will run acceptably on older Pi models or on Pi devices with less than 16 GB of RAM.

Quick glossary

  • Bits-per-weight (BPW): Average number of bits used to store each model parameter after quantization; lower BPW reduces model memory but can affect accuracy and performance.
  • Tokens per second (TPS): A throughput metric measuring how many output tokens the model generates each second; used to gauge latency and perceived responsiveness.
  • Quantization: The process of reducing numeric precision of model weights and activations to lower memory use and potentially increase speed, at the cost of some numerical fidelity.
  • BF16: Brain Floating Point 16-bit format; a lower-precision floating-point representation often used as a baseline for quality comparisons.
  • ShapeLearn: ByteShape’s bitlength-learning approach that selects per-weight datatypes to optimize the tradeoff between speed (TPS) and output quality given a device memory constraint.

Reader FAQ

Can a 30B Qwen model run in real time on a Raspberry Pi?
Yes. In these tests, Qwen3-30B quantized with ByteShape (Q3_K_S-2.70bpw) ran on a Raspberry Pi 5 (16 GB) at about 8.03 TPS while retaining ~94.18% BF16-equivalent quality.

Does using fewer bits always make inference faster?
No. The report shows that lower bitwidths can trigger different kernels and overheads; on some devices, reducing bits further can slow inference rather than speed it up.
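
If you want to check this on your own hardware, one simple approach is to benchmark several quantized GGUF files directly. The sketch below uses the llama-cpp-python bindings (my choice of tooling; the source does not specify it) with placeholder file names, and reads the completion token count from the library's OpenAI-style response dict.

```python
# Hypothetical sketch: measure tokens/sec for several quantized GGUF files on
# the current machine. File names are placeholders; as noted above, the
# fastest file is often not the one with the fewest bits.
import time
from llama_cpp import Llama

CANDIDATE_FILES = [
    "qwen3-30b-a3b-q3_k_s-2.70bpw.gguf",   # placeholder paths
    "qwen3-30b-a3b-q3_k_s-3.25bpw.gguf",
    "qwen3-30b-a3b-iq4_xs-3.87bpw.gguf",
]

for path in CANDIDATE_FILES:
    llm = Llama(model_path=path, n_ctx=2048, n_threads=8, verbose=False)
    start = time.perf_counter()
    out = llm("Explain quantization in one short paragraph.", max_tokens=128)
    elapsed = time.perf_counter() - start
    generated = out["usage"]["completion_tokens"]
    # Note: this times prompt processing plus generation, so it is a rough
    # end-to-end rate rather than a pure decode-speed measurement.
    print(f"{path}: {generated / elapsed:.2f} TPS")
```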

How did ByteShape models compare to Unsloth and MagicQuant?
Across the tested CPUs and GPUs, ByteShape’s quantizations typically produced a better TPS/quality tradeoff, often achieving higher TPS at the same quality or higher quality at similar TPS.

Will reproducing these results require special software or kernel changes?
Not confirmed in the source.

Sources

  • A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time. ByteShape Team, January 5, 2026.
