TL;DR

Fabrice Bellard published ts_zip, a tool that compresses text using an RWKV 169M v4 language model plus arithmetic coding. It achieves substantially better compression ratios on text than conventional compressors, but it requires a GPU and runs far more slowly.

What happened

Fabrice Bellard released ts_zip, an experimental utility that compresses and decompresses text files by combining a compact Large Language Model with an arithmetic coder. The tool uses the RWKV 169M v4 model, quantized to 8 bits per parameter and evaluated in BF16, to predict next-token probabilities, which the arithmetic coder then encodes. Reported results show lower bits-per-byte (bpb) values than xz on common text corpora. The implementation is deterministic and reproducible, so compressed files can be decompressed on different hardware or with different thread counts. ts_zip currently supports only text files (binary files are not reduced much); it performs best on English, reflecting the model's training data, but can also handle other languages and source code. It requires a GPU for reasonable speed and at least 4 GB of RAM. Bellard notes the tool is experimental, with no guarantee of backward compatibility between versions.
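The pairing of a language model with an arithmetic coder rests on a simple accounting identity: an arithmetic coder can encode a symbol of probability p in close to -log2 p bits, so a better next-token predictor directly means a smaller output. A minimal sketch of that accounting, using a toy static probability table as a stand-in for the RWKV model (the `probs` table below is purely illustrative, not ts_zip's actual model):

```python
import math

# Toy next-symbol model: fixed probabilities over a 3-symbol alphabet.
# In ts_zip these probabilities come from the RWKV 169M model at each
# step; a static table is enough to show the bit accounting.
probs = {"a": 0.7, "b": 0.2, "c": 0.1}

def ideal_bits(text: str) -> float:
    """Shannon-optimal code length that an arithmetic coder approaches:
    the sum of -log2 P(symbol) over the sequence."""
    return sum(-math.log2(probs[ch]) for ch in text)

text = "aaabac"
bits = ideal_bits(text)
print(f"{bits:.3f} bits for {len(text)} symbols "
      f"({bits / len(text):.3f} bits/symbol)")
```

Symbols the model predicts well (here, "a") cost well under one bit each, while surprising symbols cost several bits; this is why a model trained on English text compresses English far better than a generic entropy coder.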

Why it matters

  • Demonstrates that a relatively small LLM can outperform conventional compressors on text by exploiting learned token probabilities.
  • Deterministic evaluation means decompression does not depend on the exact GPU/CPU or thread configuration, easing portability.
  • Highlights a trade-off space: higher compression ratios at the cost of GPU dependency and slower throughput compared with standard compressors.
  • Points toward new use cases for ML-based compression where text-dominant datasets benefit from learned models rather than generic entropy coders.

Key facts

  • Tool name: ts_zip — purpose: text compression using a Large Language Model.
  • Core model: RWKV 169M v4, quantized to 8 bits per parameter and evaluated using BF16.
  • Requires a GPU for reasonable speed and at least 4 GB of RAM.
  • Reported speed: up to about 1 MB/s for both compression and decompression on an RTX 4090.
  • Supports only text files; binary files are not compressed effectively.
  • Experimental project — no backward compatibility between different ts_zip versions is promised.
  • Deterministic evaluation: compression/decompression results do not depend on GPU/CPU model or number of threads.
  • Representative compression results in bits per byte (bpb; lower is better), xz vs ts_zip:
      – alice29.txt: xz 2.551, ts_zip 1.142
      – book1: xz 2.717, ts_zip 1.431
      – enwik8: xz 1.989, ts_zip 1.106
      – enwik9: xz 1.707, ts_zip 1.084
      – linux-1.2.13.tar: xz 1.441, ts_zip 1.021
  • Download packages published: ts_zip-2024-03-02.tar.gz (Linux) and ts_zip-2024-03-02-win64.zip (Windows).
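The bpb figures above are a simple ratio: compressed size in bits divided by original size in bytes. A quick sketch of the calculation (the compressed size used below is derived from the reported 1.106 bpb for the 100 MB enwik8 corpus; it is not a figure quoted in the source):

```python
def bits_per_byte(compressed_bytes: int, original_bytes: int) -> float:
    """bpb = (compressed size in bits) / (original size in bytes)."""
    return 8 * compressed_bytes / original_bytes

# enwik8 is 100,000,000 bytes; a compressed output of about 13.8 MB
# corresponds to the reported 1.106 bpb (derived, illustrative figure).
original = 100_000_000
compressed = 13_825_000
print(f"{bits_per_byte(compressed, original):.3f} bpb")
```

Lower bpb means better compression; 8.0 bpb would mean no compression at all.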

What to watch next

  • Whether future versions add broader file-type support or improved speed — not confirmed in the source.
  • Potential updates to the model (larger or differently trained RWKV variants) that could change compression/speed trade-offs — not confirmed in the source.
  • Any efforts to introduce stable, backward-compatible formats for long-term archive use — not confirmed in the source.

Quick glossary

  • Large Language Model (LLM): A neural network trained on large text corpora to predict or generate text sequences and estimate probabilities for next tokens.
  • Token: A unit of text (which can be a character, subword, or word) that a language model predicts as the next element in a sequence.
  • Arithmetic coder: An entropy coding method that converts a sequence of probabilities into a compact binary representation.
  • Quantization: A process that reduces the numerical precision of model parameters (for example to 8 bits) to lower storage and compute requirements.
  • BF16 (bfloat16): A 16-bit floating-point format often used to speed up neural network evaluation while retaining reasonable numerical range.
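To make the quantization entry concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization, one common scheme of the kind the glossary describes. ts_zip's exact quantization details are not given in the source, so the function names and scale choice below are illustrative assumptions:

```python
def quantize_int8(weights):
    """Map floats to int8 codes in [-127, 127] plus a per-tensor scale.
    Illustrative symmetric scheme; not ts_zip's documented method."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from int8 codes."""
    return [c * scale for c in codes]

w = [0.95, -0.40, 0.02, -1.27]
codes, scale = quantize_int8(w)
restored = dequantize(codes, scale)
# Reconstruction error is bounded by half a quantization step.
assert all(abs(a - b) <= scale / 2 for a, b in zip(w, restored))
```

Storing one byte per parameter instead of two (BF16) or four (FP32) roughly halves or quarters model memory, at the cost of the small rounding error bounded above.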

Reader FAQ

Can ts_zip decompress files on different hardware than was used to compress them?
Yes. The implementation is deterministic and reproducible, and Bellard states decompression does not depend on the exact GPU/CPU model or thread count.

Does ts_zip work on binary files like images or executables?
No. The source states only text files are supported and binary files won’t be compressed much.

Is ts_zip fast compared with conventional compressors?
No. It is slower; reported throughput is up to about 1 MB/s on an RTX 4090 for both compression and decompression.

Is backward compatibility guaranteed across ts_zip releases?
No. The project is experimental and the source explicitly says no backward compatibility should be expected between versions.

Where can I download ts_zip?
Linux and Windows packages are listed as ts_zip-2024-03-02.tar.gz and ts_zip-2024-03-02-win64.zip respectively.
