TL;DR

The team behind the chonkie chunking library built memchunk to test the theoretical and practical limits of delimiter-based text chunking. By combining memchr's SIMD-accelerated byte search for small delimiter sets, a 256-entry lookup table for larger ones, and backward searches from each target chunk boundary, memchunk achieves order-of-magnitude speedups over existing Rust chunkers and exposes Python and WASM bindings with zero-copy views.

What happened

While benchmarking at Wikipedia scale, the developers of the chonkie project found existing chunking tools slower than expected and set out to see how far delimiter-based chunking could be pushed. The result is memchunk, a library that picks among fast byte-search strategies: memchr/memrchr (which use a SWAR fallback and AVX2/SSE2 SIMD paths) for one to three delimiters, and a 256-entry boolean lookup table for four or more. Memchunk searches backward (using memrchr) from each target window edge, so the first match it finds is already the best split point; this avoids the bookkeeping of tracking multiple forward matches. The lookup table is constructed lazily on first use, and subsequent iterations reuse the chosen low-overhead path. Benchmarks reported in the source show memchunk processing at about 164 GB/s and outperforming other Rust chunkers by large factors; the author says this is sufficient to chunk an English Wikipedia-sized dataset (~20 GB) in roughly 120 ms. Python and WebAssembly bindings expose zero-copy slices to retain much of the performance across FFI boundaries.
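As a rough sketch of this dispatch (illustrative only, not memchunk's actual API; it assumes the memchr crate as a dependency, and rfind_delimiter is a hypothetical name):

```rust
use memchr::{memrchr, memrchr2, memrchr3};

/// Find the last delimiter position in `window`, choosing the fastest
/// primitive available for the size of the delimiter set.
fn rfind_delimiter(window: &[u8], delims: &[u8]) -> Option<usize> {
    match delims {
        [] => None,
        [a] => memrchr(*a, window),                // 1 needle: SIMD path
        [a, b] => memrchr2(*a, *b, window),        // 2 needles
        [a, b, c] => memrchr3(*a, *b, *c, window), // 3 needles (memchr's limit)
        _ => {
            // Four or more delimiters: 256-entry table, O(1) check per byte.
            let mut table = [false; 256];
            for &d in delims {
                table[d as usize] = true;
            }
            window.iter().rposition(|&b| table[b as usize])
        }
    }
}
```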

Why it matters

  • Faster chunking reduces preprocessing time for retrieval-augmented generation (RAG) pipelines operating on large corpora.
  • Choosing the right low-level strategy (SIMD vs lookup table) yields major throughput gains without adding heavyweight NLP components.
  • Zero-copy bindings make it practical to use an optimized Rust core from Python and JavaScript while keeping latency low.
  • Backward search minimizes per-window work, lowering CPU overhead and simplifying implementation.
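To make the backward-search point above concrete, here is a minimal sketch of a window-by-window chunker; it assumes a single newline delimiter and invented names, so it illustrates the technique rather than memchunk's implementation:

```rust
use memchr::memrchr;

/// Split `text` into chunks of at most `max_len` bytes, searching backward
/// from each target boundary so the first match found is already the
/// rightmost (best) split point in the window.
fn chunk_on_newlines(text: &[u8], max_len: usize) -> Vec<&[u8]> {
    assert!(max_len > 0);
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let window_end = (start + max_len).min(text.len());
        let end = if window_end == text.len() {
            window_end // final chunk: take the remainder
        } else {
            match memrchr(b'\n', &text[start..window_end]) {
                Some(i) => start + i + 1, // split just after the delimiter
                None => window_end,       // no delimiter: hard cut at max_len
            }
        };
        chunks.push(&text[start..end]);
        start = end;
    }
    chunks
}
```

Because the scan runs right to left, the first hit is by construction the last delimiter in the window, so there is no list of forward candidates to track.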

Key facts

  • Delimiter-based chunking preserves sentence boundaries better than fixed token/character splits, improving retrieval quality.
  • memchr uses SWAR (SIMD Within A Register) as a fast generic fallback and uses AVX2/SSE2 vector instructions on supported x86_64 CPUs.
  • The memchr crate provides memchr (one needle byte), memchr2 (two), and memchr3 (three) as fast search primitives, along with the reverse counterparts memrchr/memrchr2/memrchr3; the implementation stops at three needles by design.
  • For delimiter sets larger than three bytes, memchunk falls back to a 256-entry boolean lookup table for O(1) per-byte checks.
  • Memchunk searches backward from the target chunk boundary (using memrchr) to locate the best split point in one pass, avoiding repeated bookkeeping.
  • The delimiter lookup table is built lazily on first use and reused for subsequent searches to minimize overhead (a sketch of this lazy construction follows the list).
  • Reported benchmark throughput for memchunk is about 164 GB/s on the tested hardware.
  • Comparative throughput reported: kiru 4.5 GB/s (≈36× slower), langchain 0.35 GB/s (≈469× slower), semchunk 0.013 GB/s (≈12,615× slower), llama-index 0.0035 GB/s (≈46,857× slower), text-splitter 0.0017 GB/s (≈96,471× slower).
  • The author reports memchunk can chunk an English Wikipedia–scale corpus (~20 GB) in roughly 120 milliseconds.
  • Memchunk exposes Python and WASM/JS bindings that return zero-copy views (memoryview in Python, subarray in JS) to avoid unnecessary allocations.
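A minimal sketch of the lazy-table idea from the list above, using std's OnceLock; the DelimiterSet type and its fields are assumptions for illustration, not memchunk's types:

```rust
use std::sync::OnceLock;

struct DelimiterSet {
    delims: Vec<u8>,
    table: OnceLock<[bool; 256]>, // built on first search, reused afterwards
}

impl DelimiterSet {
    fn new(delims: Vec<u8>) -> Self {
        Self { delims, table: OnceLock::new() }
    }

    /// O(1) membership check; the 256-entry table is built lazily on first use.
    fn contains(&self, byte: u8) -> bool {
        self.table.get_or_init(|| {
            let mut t = [false; 256];
            for &d in &self.delims {
                t[d as usize] = true;
            }
            t
        })[byte as usize]
    }
}
```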

What to watch next

  • Integration or adoption of memchunk in major RAG toolkits and pipelines: not confirmed in the source
  • Any upstream changes to memchr (for example expanding multi-needle SIMD paths) that could affect memchunk's strategy: not confirmed in the source
  • Performance behavior across non-x86 architectures and in different runtime environments (ARM, older CPUs, constrained VMs): not confirmed in the source

Quick glossary

  • Chunking: The process of splitting large text into smaller segments suitable for embeddings, indexing, or model context windows.
  • SIMD: Single Instruction, Multiple Data — a CPU capability that applies the same operation to multiple data elements in parallel, often used for byte or vector processing.
  • SWAR: SIMD Within A Register — a technique that uses standard integer operations on wide registers to process multiple lanes simultaneously when dedicated SIMD instructions are unavailable (a worked example follows this glossary).
  • Lookup table: A precomputed array (here 256 entries) used to test membership of a byte in a delimiter set in constant time with a single array index.
  • memchr / memrchr: Low-level byte-search primitives: memchr finds a byte moving forward through memory; memrchr searches backward. Implementations may use SIMD and SWAR for speed.
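The short program below illustrates two of the entries above: word_contains is the classic SWAR zero-byte trick from Bit Twiddling Hacks (not memchr's exact code), and main contrasts memchr's forward scan with memrchr's backward scan (assumes the memchr crate as a dependency):

```rust
use memchr::{memchr, memrchr};

/// SWAR: true iff any byte of `word` equals `needle`, using only integer ops.
fn word_contains(word: u64, needle: u8) -> bool {
    const LO: u64 = 0x0101_0101_0101_0101;
    const HI: u64 = 0x8080_8080_8080_8080;
    let x = word ^ (LO * (needle as u64)); // matching bytes become 0x00
    (x.wrapping_sub(LO) & !x & HI) != 0    // classic "any zero byte?" test
}

fn main() {
    let haystack = b"alpha. beta. gamma.";
    assert_eq!(memchr(b'.', haystack), Some(5));   // first '.' scanning forward
    assert_eq!(memrchr(b'.', haystack), Some(18)); // last '.' scanning backward

    let word = u64::from_le_bytes(*b"alpha. b"); // 8 bytes in one register
    assert!(word_contains(word, b'.'));
    assert!(!word_contains(word, b'z'));
}
```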

Reader FAQ

What is memchunk?
A chunking library, developed from work on chonkie, that selects among fast byte-search strategies to split text on delimiters.

Why prefer delimiter-based chunking over fixed token/character splits?
Delimiters avoid cutting sentences in half, which preserves semantic completeness and improves retrieval quality.

Does memchr handle more than three needles with SIMD?
No — memchr supports up to three needles with SIMD-accelerated paths; beyond that memchunk uses a 256-entry lookup table.

Are language bindings available?
Yes — the project provides Python and WebAssembly/JavaScript bindings that return zero-copy views.
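One common way to keep such bindings zero-copy, sketched below as an assumption rather than memchunk's confirmed design, is to have the core return (start, end) byte offsets instead of owned strings; the binding layer can then hand out views over the caller's original buffer:

```rust
/// Chunk `text` into fixed-size windows and return byte offsets only;
/// no text is copied. (Delimiter logic is omitted to keep the focus on
/// the zero-copy shape of the API.)
fn chunk_offsets(text: &[u8], max_len: usize) -> Vec<(usize, usize)> {
    assert!(max_len > 0);
    let mut offsets = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let end = (start + max_len).min(text.len());
        offsets.push((start, end));
        start = end;
    }
    offsets
}
```

A Python wrapper can then slice with memoryview(data)[start:end] and a JS wrapper with bytes.subarray(start, end), both of which are views over the original buffer rather than copies.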
