TL;DR

A header-only CUDA library implements a GPU-accelerated Cuckoo Filter optimized for high-throughput batch insert, lookup and delete operations. Benchmarks on an NVIDIA GH200 (H100 HBM3) show large speedups versus CPU and several GPU alternatives; source, build instructions and a thesis with fuller evaluation are available in the public repository.

What happened

A public repository delivers a CUDA-based implementation of the Cuckoo Filter designed for high-throughput batch workloads. The library exposes configurable parameters such as fingerprint size, bucket size and eviction limits, and supports multiple eviction strategies (DFS, BFS), sorted insertion for improved memory coalescing, and IPC for cross-process sharing. It is distributed as a header-only C++20 library with Meson build support and requires CUDA Toolkit 12.9 or newer. The project also includes a multi-GPU variant that uses a gossip-style approach for workloads exceeding a single device.

The repository reports benchmark comparisons on an NVIDIA GH200 (H100 HBM3, 3.4 TB/s memory bandwidth) at two working-set sizes, one fitting in the GPU's L2 cache and one resident in DRAM, showing large throughput gains versus a CPU cuckoo filter and several other GPU-resident probabilistic filters; the author points readers to an accompanying thesis for a more comprehensive evaluation.
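
For readers unfamiliar with the underlying data structure, the sketch below illustrates the general partial-key cuckoo hashing technique that those parameters control: each key is reduced to a short fingerprint, the fingerprint can live in one of two candidate buckets, and inserting into a full pair of buckets triggers a bounded chain of evictions. This is a minimal host-side C++ illustration of the textbook algorithm, not code from the repository; the 16-bit fingerprints, four-slot buckets, eviction limit and hash function are arbitrary example choices.

    #include <cstddef>
    #include <cstdint>
    #include <cstdlib>
    #include <utility>
    #include <vector>

    // Minimal partial-key cuckoo table: 4 fingerprint slots per bucket, 16-bit
    // fingerprints, bounded random-walk evictions. Illustrative only -- not the
    // repository's implementation. numBuckets must be a power of two.
    struct MiniCuckoo {
        static constexpr int kBucketSize   = 4;
        static constexpr int kMaxEvictions = 128;

        std::vector<std::uint16_t> slots;  // numBuckets * kBucketSize, 0 = empty
        std::size_t numBuckets;

        explicit MiniCuckoo(std::size_t buckets)
            : slots(buckets * kBucketSize, 0), numBuckets(buckets) {}

        static std::uint64_t mix(std::uint64_t x) {       // simple 64-bit mixer
            x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
            return x;
        }
        std::uint16_t fingerprint(std::uint64_t key) const {
            auto f = static_cast<std::uint16_t>(mix(key) >> 48);
            return f == 0 ? 1 : f;                        // reserve 0 for "empty"
        }
        std::size_t bucketOf(std::uint64_t h) const { return h & (numBuckets - 1); }
        // The second candidate bucket depends only on the first bucket and the
        // fingerprint, so it can be recomputed while relocating evicted entries.
        std::size_t altBucket(std::size_t b, std::uint16_t f) const {
            return bucketOf(b ^ mix(f));
        }

        bool tryPlace(std::size_t b, std::uint16_t f) {
            for (int i = 0; i < kBucketSize; ++i)
                if (slots[b * kBucketSize + i] == 0) {
                    slots[b * kBucketSize + i] = f;
                    return true;
                }
            return false;
        }

        bool insert(std::uint64_t key) {
            std::uint16_t f = fingerprint(key);
            std::size_t b1 = bucketOf(mix(key));
            std::size_t b2 = altBucket(b1, f);
            if (tryPlace(b1, f) || tryPlace(b2, f)) return true;

            // Both candidate buckets full: evict a random victim, move it to its
            // alternate bucket, and repeat up to kMaxEvictions times.
            std::size_t b = b1;
            for (int n = 0; n < kMaxEvictions; ++n) {
                int victim = std::rand() % kBucketSize;
                std::swap(f, slots[b * kBucketSize + victim]);
                b = altBucket(b, f);
                if (tryPlace(b, f)) return true;
            }
            return false;  // table too full for this key
        }

        bool contains(std::uint64_t key) const {
            std::uint16_t f = fingerprint(key);
            std::size_t b1 = bucketOf(mix(key));
            std::size_t b2 = altBucket(b1, f);
            for (int i = 0; i < kBucketSize; ++i)
                if (slots[b1 * kBucketSize + i] == f ||
                    slots[b2 * kBucketSize + i] == f) return true;
            return false;
        }
    };

A GPU implementation runs many such operations in parallel over a batch of keys, which is where the block-size parameter and memory-coalescing concerns come in.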

Why it matters

  • GPU acceleration can dramatically increase throughput for set membership workloads used in databases, networking, and streaming systems.
  • Configurable fingerprint and bucket parameters let engineers trade space and false-positive behavior against performance (see the worked example after this list).
  • Multi-GPU and IPC support broaden applicability to larger or multi-process deployments.
  • Header-only design and standard build tooling simplify integration into C++20 projects.
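
A concrete way to see the trade-off in the second point above is the standard cuckoo-filter analysis from the original paper by Fan et al. (not something specific to this repository): with bucket size b and f-bit fingerprints, the false-positive rate is bounded by roughly 2b / 2^f, and space is about f / α bits per stored item at load factor α. The snippet below evaluates that rule of thumb for the fingerprint widths the library exposes, assuming a bucket size of 4 and the 80% load factor used in the benchmarks.

    #include <cstdio>

    // Rule-of-thumb cuckoo-filter trade-off (standard analysis, not taken from
    // the repository): FPR <= ~2*b / 2^f for bucket size b and f-bit
    // fingerprints; space is ~f / loadFactor bits per stored item.
    int main() {
        const int bucketSize = 4;       // slots per bucket (example value)
        const double loadFactor = 0.8;  // 80% occupancy, as in the benchmarks
        for (int f : {8, 16, 32}) {
            double fpr  = 2.0 * bucketSize / static_cast<double>(1ULL << f);
            double bits = f / loadFactor;
            std::printf("f=%2d bits -> FPR <= ~%.2e, ~%.1f bits/item\n", f, fpr, bits);
        }
        return 0;
    }

Roughly, 8-bit fingerprints give a false-positive bound near 3%, 16-bit near 0.01% and 32-bit near 2e-9, at about 10, 20 and 40 bits per item respectively.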

Key facts

  • Library implements batch insert, lookup and delete operations accelerated by CUDA.
  • Configurable template parameters include key type, fingerprint bits (8/16/32), max evictions, CUDA block size and bucket size.
  • Supports multiple eviction policies (DFS, BFS) and a sorted insertion mode for better memory coalescing (a coalesced batch-lookup kernel is sketched after this list).
  • Multi-GPU support is provided via a gossip mechanism; IPC support enables cross-process sharing.
  • Benchmarks run at 80% load factor on an NVIDIA GH200 (H100 HBM3, 3.4 TB/s).
  • Reported speedups (L2-resident, ~4M items): GPU vs CPU cuckoo filter, insert 360× and query 973×; comparisons against other GPU filters show varying improvements.
  • Reported speedups (DRAM-resident, ~268M items): GPU vs CPU cuckoo filter, insert 583× and query 1504×.
  • Some comparisons show the GPU Cuckoo Filter slower for insert versus a blocked Bloom filter (0.6×) but faster for queries (1.4×) in the L2-resident case.
  • Repository is public on GitHub and licensed under the MIT license.
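
To illustrate what batch operations and memory coalescing mean in practice (the kernel referenced from the eviction-policy item above), here is a generic CUDA lookup kernel in which each thread handles one queried key. The key loads and result stores coalesce because consecutive threads touch consecutive array elements, while the bucket probes scatter; sorting a batch by bucket index, the idea behind the library's sorted-insertion mode, makes those table accesses more regular as well. This is a simplified sketch consistent with the toy layout shown earlier, not the repository's kernel.

    #include <cstddef>
    #include <cstdint>

    __device__ std::uint64_t mix(std::uint64_t x) {   // same mixer as the CPU sketch
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
        return x;
    }

    // One thread per queried key; numBuckets must be a power of two and
    // slots holds numBuckets * 4 fingerprints (0 = empty slot).
    __global__ void lookupKernel(const std::uint64_t* keys, bool* found,
                                 std::size_t n, const std::uint16_t* slots,
                                 std::size_t numBuckets) {
        constexpr int kBucketSize = 4;
        std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
        if (i >= n) return;

        std::uint64_t key = keys[i];                  // coalesced load
        auto f = static_cast<std::uint16_t>(mix(key) >> 48);
        if (f == 0) f = 1;                            // 0 marks an empty slot
        std::size_t b1 = mix(key) & (numBuckets - 1);
        std::size_t b2 = (b1 ^ mix(f)) & (numBuckets - 1);

        bool hit = false;
        for (int s = 0; s < kBucketSize; ++s)
            hit = hit || slots[b1 * kBucketSize + s] == f
                      || slots[b2 * kBucketSize + s] == f;
        found[i] = hit;                               // coalesced store
    }

    // Host-side launch, assuming dKeys, dFound and dSlots are device buffers:
    //   lookupKernel<<<(n + 255) / 256, 256>>>(dKeys, dFound, n, dSlots, numBuckets);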

What to watch next

  • Scaling characteristics and throughput when using the multi-GPU gossip mode (documented in the repository).
  • Full experimental details, additional systems and deeper analysis available in the accompanying thesis (referenced in the repo).
  • Real-world application performance and integration patterns with database or streaming systems, which are not covered in the source material.

Quick glossary

  • Cuckoo Filter: A space-efficient probabilistic data structure that supports insert, lookup and delete operations with a tunable false positive rate using fingerprints and cuckoo-style relocations.
  • Fingerprint: A compact hash-derived identifier stored in a filter's bucket slot; its size (bits) affects space use and false positive rate.
  • Eviction policy: The strategy used to relocate existing entries during insertions when target buckets are full; examples include depth-first search (DFS) and breadth-first search (BFS).
  • Memory coalescing: A GPU memory-access optimization where adjacent threads access contiguous memory addresses to improve bandwidth utilization.
  • IPC (Inter-Process Communication): Mechanisms that allow different processes to share data; here used to enable cross-process sharing of the filter (a generic CUDA IPC sketch follows below).
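
As a pointer for the IPC entry above, CUDA's standard cross-process sharing mechanism is the cudaIpcGetMemHandle / cudaIpcOpenMemHandle pair: one process exports an opaque handle to a device allocation (for example, the array backing the filter) and another process maps the same memory through that handle. The sketch below shows only this generic CUDA runtime pattern; how the repository actually wires up IPC is documented in the repo itself.

    #include <cuda_runtime.h>
    #include <cstddef>
    #include <cstdint>

    // Generic CUDA IPC pattern for sharing one device allocation between
    // processes. Error handling omitted for brevity; not repository code.

    // Process A: owns the allocation and exports a handle for it.
    cudaIpcMemHandle_t exportTable(std::uint16_t** dSlots, std::size_t bytes) {
        cudaMalloc(reinterpret_cast<void**>(dSlots), bytes);
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, *dSlots);
        // The opaque handle is then sent to the other process over any ordinary
        // IPC channel (pipe, socket, shared file, ...).
        return handle;
    }

    // Process B: maps the same device memory from the received handle.
    std::uint16_t* importTable(cudaIpcMemHandle_t handle) {
        void* dSlots = nullptr;
        cudaIpcOpenMemHandle(&dSlots, handle, cudaIpcMemLazyEnablePeerAccess);
        // ... use the shared table ...
        // cudaIpcCloseMemHandle(dSlots);  // when done
        return static_cast<std::uint16_t*>(dSlots);
    }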

Reader FAQ

Is the project open source?
Yes — the repository is public on GitHub and the code is released under the MIT license.

What hardware was used for the benchmarks?
Benchmarks reported in the repository used an NVIDIA GH200 (H100 HBM3) with 3.4 TB/s memory bandwidth.

Does the library support multi-GPU and cross-process use?
Yes — the project includes a multi-GPU implementation (gossip) and IPC support for cross-process sharing.

How do I build and run the code?
The repo requires CUDA Toolkit >= 12.9, a C++20-compatible compiler and Meson >= 1.3.0; build via meson setup and meson compile as shown in the repository.

The repository README describes the project as a high-performance CUDA implementation of the Cuckoo Filter data structure, developed as part of the thesis "Design and Evaluation of a GPU-Accelerated Cuckoo Filter".
