TL;DR
Researchers behind Umbra and CedarDB developed a compact 128-bit (16-byte) string representation, nicknamed “German strings,” to speed up data-processing workloads. The format is now used in systems such as DuckDB, Apache Arrow, Polars and Facebook Velox. It gives up in-place mutability and a capacity field in exchange for inline storage of short strings, a stored prefix, and immutable, exact-sized payloads.
What happened
Engineers behind the Umbra research project and CedarDB designed a custom string type optimized for common database workloads and published the details. Their “German-style” strings use a fixed 128-bit struct with two forms. The short form stores strings of 12 characters or fewer (12 bytes, the 96 bits left after the 32-bit length field) directly inside the struct. The long form holds a 32-bit length, a four-character prefix, and a pointer to an exact-sized, immutable payload. Because the prefix lives in the struct, many comparisons can be decided early without dereferencing the pointer. The design drops the separate capacity field, reduces per-string overhead, and lets string values be passed in registers. The format has since been adopted by several data projects, including DuckDB, Apache Arrow, Polars and Facebook Velox. The authors acknowledge tradeoffs: appends are more expensive, and the 32-bit length field caps string length at 4 GiB.
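The two forms described above can be sketched with Python's struct module. This is only an illustration under stated assumptions: the helper name pack_german, the little-endian field order, and the integer standing in for a real pointer are hypothetical and not taken from the source.

```python
import struct

INLINE_MAX = 12  # 16 bytes total minus the 4-byte length field


def pack_german(data: bytes, pointer: int = 0) -> bytes:
    """Pack a byte string into a 16-byte header (hypothetical helper).

    `pointer` is a stand-in for the address of the out-of-line payload;
    a real engine would store the actual buffer address there.
    """
    length = len(data)
    if length > 0xFFFFFFFF:
        # The length must fit in 32 bits, hence the 4 GiB limit.
        raise ValueError("length does not fit in a 32-bit field")
    if length <= INLINE_MAX:
        # Short form: 4-byte length + up to 12 content bytes (zero padded).
        return struct.pack("<I12s", length, data)
    # Long form: 4-byte length + 4-byte prefix + 8-byte pointer.
    return struct.pack("<I4sQ", length, data[:4], pointer)


short = pack_german(b"hello")
long_ = pack_german(b"a" * 100, pointer=0x1000)
print(len(short), len(long_))  # both headers occupy 16 bytes
```

Both forms come out at exactly 16 bytes, which is what allows a value to travel in two 64-bit registers.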
Why it matters
- Reduces per-string memory overhead compared with traditional std::string-like layouts, improving packing and cache behavior.
- Inline short storage and stored prefixes speed common operations like prefix checks and early inequality detection.
- Immutable payloads simplify concurrent reads because data can be accessed without read locks while the string exists.
- Dropping the capacity field and allocating exact-sized buffers means no string carries unused reserved space, which reduces wasted memory and the amount of data copied when many strings are stored.
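The early-exit behavior of the stored prefix can be illustrated with a small model. This is a sketch, not any engine's implementation: the Header class, the make and equal helpers, and the derefs counter (which simulates pointer dereferences) are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Header:
    length: int
    prefix: bytes    # first 4 bytes, zero padded
    payload: bytes   # stands in for the out-of-line buffer
    derefs: int = 0  # counts simulated pointer dereferences

    def load(self) -> bytes:
        """Simulate following the pointer to the payload."""
        self.derefs += 1
        return self.payload


def make(data: bytes) -> Header:
    return Header(len(data), data[:4].ljust(4, b"\0"), data)


def equal(a: Header, b: Header) -> bool:
    # Length and prefix live in the header itself, so a mismatch in
    # either is detected without touching the payload.
    if a.length != b.length or a.prefix != b.prefix:
        return False
    # Only when both match do we pay for the dereference.
    return a.load() == b.load()


a = make(b"database systems")
c = make(b"warehouse engine")
print(equal(a, c), a.derefs)  # rejected on the prefix, no dereference
```

In this model, two strings of equal length that differ within their first four bytes are rejected from the header alone; the payload is read only when length and prefix both match.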
Key facts
- The format originates from Umbra research and the CedarDB project.
- Several projects have implemented the format: DuckDB, Apache Arrow, Polars, and Facebook Velox.
- Each string is represented as a fixed 128-bit struct to reduce overhead and enable passing in registers.
- Short-string representation stores content inline for strings of 12 characters or fewer.
- Long-string representation contains a 32-bit length, a four-character prefix, and a pointer to the payload.
- The length field is 32 bits, which imposes a 4 GiB maximum representable string size.
- Pointers reference immutable, exact-sized buffers (no separate capacity field), enabling tighter packing.
- Two bits of the pointer are repurposed to encode a storage class (persistent, transient, or temporary).
- Appending to a long string requires allocating a new buffer and copying data, making appends relatively expensive.
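Storing a storage class in two pointer bits can be sketched as follows. The source states only that two bits are repurposed; which bits are used, the numeric values of the classes, and the helper names tag_pointer and untag_pointer are assumptions made for this example. Here the top two bits of a 64-bit word are used, assuming real addresses never set them.

```python
from enum import IntEnum


class StorageClass(IntEnum):
    # Numeric values are an assumption for illustration.
    PERSISTENT = 0
    TRANSIENT = 1
    TEMPORARY = 2


TAG_SHIFT = 62                       # assumed: tag lives in the top two bits
ADDR_MASK = (1 << TAG_SHIFT) - 1     # low 62 bits hold the address


def tag_pointer(addr: int, cls: StorageClass) -> int:
    """Combine an address and a storage class into one 64-bit word."""
    if not 0 <= addr <= ADDR_MASK:
        raise ValueError("address would collide with the tag bits")
    return (int(cls) << TAG_SHIFT) | addr


def untag_pointer(word: int) -> tuple[int, StorageClass]:
    """Split a tagged word back into (address, storage class)."""
    return word & ADDR_MASK, StorageClass(word >> TAG_SHIFT)


word = tag_pointer(0x7F00DEADBEEF, StorageClass.TRANSIENT)
addr, cls = untag_pointer(word)
print(hex(addr), cls.name)
```

The tagged word still fits in 64 bits, so the storage class costs no extra space in the 16-byte struct.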
What to watch next
- Broader adoption of the format in other database and data-processing systems (not confirmed in the source)
- Comparative benchmarks vs. other string implementations (std::string and language runtimes) in production workloads (not confirmed in the source)
- Any language-standard or runtime-level support for 16-byte string representations to enable cross-language ABI optimizations (not confirmed in the source)
Quick glossary
- Short string optimization: A technique that stores short string bytes directly inside the string object to avoid heap allocation and pointer indirection.
- Immutable payload: A memory region that once allocated and populated is never modified, allowing safe concurrent reads without locking while the data is alive.
- Prefix: A small fixed-size sequence of the initial characters of a longer string stored separately to accelerate comparisons and early exits.
- Storage class: A lifetime category for string data (for example persistent, transient, temporary) that determines how and when the backing memory is managed.
- Capacity field: A stored value in many string implementations that records the size of the allocated buffer, allowing in-place growth without reallocation.
Reader FAQ
What are German strings?
A custom 128-bit string representation developed in the Umbra/CedarDB work that stores short strings inline and long strings with a 32-bit length, a four-character prefix, and a pointer to an exact-sized immutable payload.
Are German strings mutable?
Not in place. The design favors immutability: the payload a long string points to is never modified. Temporary strings can be allocated and freed, but mutating a pointed-to payload in place is not part of the model.
Do German strings have a size limit?
Yes. The length field is 32 bits, which limits the maximum representable string size to 4 GiB.
Where are German strings used?
The source reports implementations in DuckDB, Apache Arrow, Polars, and Facebook Velox.
Are appends cheap with this design?
No. Appending to a long string requires allocating a new buffer and copying the payload, so appends are relatively expensive.
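The cost can be made concrete by counting bytes copied. This is a rough model, assuming every append allocates a fresh exact-sized buffer and copies both the old payload and the suffix into it; the function names append_exact and build are hypothetical.

```python
def append_exact(payload: bytes, suffix: bytes) -> tuple[bytes, int]:
    """Return (new buffer, bytes copied) for one append.

    With no spare capacity, the result needs a new buffer that holds
    the old payload plus the suffix, so both are copied.
    """
    new_buf = payload + suffix  # bytes are immutable: this allocates anew
    return new_buf, len(new_buf)


def build(chunks: list[bytes]) -> tuple[bytes, int]:
    """Append chunks one at a time and total the bytes copied."""
    result, copied = b"", 0
    for chunk in chunks:
        result, n = append_exact(result, chunk)
        copied += n
    return result, copied


text, copied = build([b"ab"] * 4)
print(text, copied)  # 8 bytes of content, 2 + 4 + 6 + 8 = 20 bytes copied
```

Under this model the copying grows quadratically with the number of appends, which is why workloads that build strings incrementally usually collect the pieces first and materialize the final value once.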
Why German Strings are Everywhere (July 16, 2024, 11 minutes): Many data processing systems have adapted our custom string format. Find…
Sources
- Why German Strings are Everywhere
- Das Problem mit German Strings
- German Strings: The 16-Byte Secret to Faster Analytics
- A Deep Dive into German Strings