TL;DR

ByteShape used its Shapelearn bitlength-learning method to quantize Qwen3-30B-A3B-Instruct-2507 for target devices, trading bits per weight to maximize tokens per second (TPS) while meeting memory constraints. On a Raspberry Pi 5 (16 GB), a ByteShape configuration hit about 8.03 TPS at 2.70 BPW with ~94.18% of BF16 quality, and ByteShape models generally outperform Unsloth and MagicQuant across the tested CPUs and GPUs.

What happened

ByteShape applied its Shapelearn bitlength optimization to Qwen3-30B-A3B-Instruct-2507 under a practical constraint: the quantized model must comfortably fit device memory before any further tuning. The goal was to maximize observable performance, measured in tokens per second, while maintaining output quality. On a Raspberry Pi 5 with 16 GB RAM, the recommended Q3_K_S-2.70bpw configuration delivered 8.03 TPS at 2.70 BPW and preserved 94.18% of a BF16 quality baseline, which ByteShape characterizes as perceptually real-time. Across other hardware, ByteShape reports consistent gains: an Intel i7 with 64 GB RAM reached higher-throughput operating points, including configurations above 26 TPS, and on GPUs the team observed a roughly 4-bit sweet spot on an RTX 5090, where several ~4-bit models clustered around 302–303 TPS. On an RTX 4080 (16 GB), ByteShape's top model that fits in VRAM delivered 214.81 TPS at ~98.66% quality.

Why it matters

  • A 30B model reaching ~8 TPS on a Raspberry Pi 5 suggests larger models can be viable for interactive edge use when quantized carefully.
  • Optimizing bit allocations under a memory budget produces more predictable speed/quality tradeoffs for on-device inference.
  • GPU performance does not scale monotonically with lower bits; quantization-specific kernel choices create device-dependent sweet spots.
  • ByteShape’s Shapelearn approach may enable better resource use on both constrained CPUs and high-end GPUs compared with alternatives tested.
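The memory-first tradeoff described above can be sketched as a filter-then-maximize step. This is an illustrative toy, not ByteShape's Shapelearn algorithm (which learns per-weight bitlengths); the candidate table below reuses figures the report attributes to different devices purely to exercise the filter, and the helper names are made up:

```python
from dataclasses import dataclass

@dataclass
class Quant:
    name: str      # quantization label, e.g. "Q3_K_S-2.70bpw"
    bpw: float     # average bits per weight
    tps: float     # measured tokens/second on the target device
    quality: float # fraction of BF16 baseline quality retained

def weight_size_gb(q: Quant, n_params: float) -> float:
    """Approximate weight memory: params * bits, converted to gigabytes."""
    return n_params * q.bpw / 8 / 1e9

def pick_best(quants, n_params, mem_budget_gb, min_quality):
    """Fastest quantization that fits the memory budget and meets quality."""
    fitting = [q for q in quants
               if weight_size_gb(q, n_params) <= mem_budget_gb
               and q.quality >= min_quality]
    return max(fitting, key=lambda q: q.tps, default=None)

candidates = [
    Quant("Q3_K_S-2.70bpw", 2.70, 8.03, 0.9418),    # Pi 5 figures
    Quant("IQ4_XS-4.67bpw", 4.67, 272.98, 0.9975),  # RTX 5090 figures
]
# Under a 16 GB budget, the 4.67 BPW model (~17.5 GB of weights) is
# filtered out, leaving the 2.70 BPW configuration.
best = pick_best(candidates, n_params=30e9, mem_budget_gb=16.0, min_quality=0.90)
```

The real method optimizes bit allocation per weight rather than choosing among fixed presets, but the fit-first, then-maximize-TPS ordering is the same.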

Key facts

  • Target model: Qwen3-30B-A3B-Instruct-2507 quantized with Shapelearn.
  • Raspberry Pi 5 (16 GB): Q3_K_S-2.70bpw achieved 8.03 TPS at 2.70 BPW and retained 94.18% of BF16 baseline quality.
  • ByteShape prioritizes fitting the model in memory first, then trading bitlength to maximize TPS and quality rather than minimizing file size alone.
  • On an Intel i7 with 64 GB RAM, ByteShape produced models occupying a high-throughput region, including configurations above 26 TPS.
  • RTX 5090 (32 GB) tests revealed a ~4-bit sweet spot where several ~4b quantizations clustered near ~302–303 TPS at ~98.4–98.9% quality.
  • Best accuracy reported on the 5090: IQ4_XS-4.67bpw (4.67 BPW) at 272.98 TPS and 99.75% accuracy (relative error 0.25%).
  • RTX 4080 (16 GB) top ByteShape model that fits: IQ4_XS-3.87bpw at 214.81 TPS and ~98.66% accuracy.
  • Comparisons in the report indicate ByteShape models generally achieved higher TPS at equal quality or higher quality at equal TPS than the Unsloth and MagicQuant entries tested.

What to watch next

  • Adoption of device-targeted bitlength learning for other large models and edge platforms — not confirmed in the source
  • Further kernel and llama.cpp improvements that change the GPU quantization sweet spots and per-device throughput behavior — not confirmed in the source
  • Wider availability or official ports of the specific ByteShape quantized files and inference recipes for consumer devices — not confirmed in the source

Quick glossary

  • Quantization: The process of reducing the precision of a model's weights (for example, using fewer bits per parameter) to lower memory use and potentially speed up inference.
  • BPW (Bits Per Weight): A measure of the average number of bits used to store each parameter in a quantized model; lower BPW reduces memory but can affect quality and performance.
  • TPS (Tokens Per Second): A runtime metric indicating how many output tokens a model generates per second; used as a proxy for responsiveness and throughput.
  • Shapelearn: ByteShape’s bitlength learning method for choosing per-weight datatypes to optimize the practical tradeoff between throughput and output quality under a memory constraint.
  • BF16: Bfloat16, a 16-bit floating-point format commonly used as a baseline for neural network quality and performance comparisons.
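The link between BPW and memory footprint is simple arithmetic. A back-of-the-envelope sketch, assuming roughly 30e9 total parameters for Qwen3-30B-A3B (an assumption; real GGUF files also carry metadata and keep some tensors at higher precision):

```python
def quantized_size_gb(n_params: float, bpw: float) -> float:
    """Approximate weight size of a quantized model: bits -> bytes -> GB."""
    return n_params * bpw / 8 / 1e9

# 2.70 BPW -> roughly 10.1 GB, comfortably inside a 16 GB Pi 5
print(quantized_size_gb(30e9, 2.70))
# 3.87 BPW -> roughly 14.5 GB, still fits a 16 GB RTX 4080
print(quantized_size_gb(30e9, 3.87))
# 4.67 BPW -> roughly 17.5 GB, too large for 16 GB of VRAM
print(quantized_size_gb(30e9, 4.67))
```

Under this rough estimate, the report's pairings line up: the 3.87 BPW file is the largest IQ4_XS variant that fits the 16 GB RTX 4080, while the higher-accuracy 4.67 BPW file needs the 32 GB RTX 5090.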

Reader FAQ

Can a 30B Qwen model run in real time on a Raspberry Pi?
Yes. ByteShape reports a 30B Qwen configuration running on a Raspberry Pi 5 (16 GB) at about 8.03 TPS with 2.70 BPW, which they call real-time.

What setting achieved the Pi result?
The report highlights Q3_K_S-2.70bpw (2.70 BPW) as the recommended Pi configuration delivering 8.03 TPS and ~94.18% of BF16 quality.

Do fewer bits always mean faster inference?
Not on every device. The report emphasizes that on GPUs, reducing BPW can change kernels and overheads so lower bits do not always increase TPS.
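TPS itself is just generated tokens divided by wall-clock time, which is why it can move non-monotonically with BPW when kernels change underneath. A minimal, engine-agnostic measurement sketch; the `generate` callable and `fake_generate` stand-in are hypothetical, not any real inference API:

```python
import time

def measure_tps(generate, prompt: str, max_tokens: int) -> float:
    """Time one generation call and return tokens per second."""
    start = time.perf_counter()
    tokens = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

def fake_generate(prompt, max_tokens):
    """Stand-in generator: sleeps briefly, returns dummy token ids."""
    time.sleep(0.05)
    return list(range(max_tokens))

tps = measure_tps(fake_generate, "hello", 64)
```

In a real benchmark the same harness would be run per quantization per device, which is how per-device sweet spots like the 5090's ~4-bit cluster show up.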

How does ByteShape compare to Unsloth and MagicQuant?
In the ByteShape report, Shapelearn-quantized models consistently offered better TPS/quality tradeoffs than the Unsloth and MagicQuant models they tested.

Are the quantized files and code publicly available?
Not confirmed in the source.

Sources

  • "A 30B Qwen Model Walks Into a Raspberry Pi… and Runs in Real Time", ByteShape Team, January 5, 2026
