GPU-Accelerated Cuckoo Filter: Faster Queries and Deletions on CUDA

TL;DR

A header-only CUDA implementation of the Cuckoo Filter provides batch insert, lookup and delete operations optimized for GPUs. Benchmarks on an NVIDIA GH200 show large speedups versus CPU and several GPU-resident alternatives for many workloads, and the project includes multi-GPU and IPC support.

What happened

A public repository publishes a CUDA implementation of the Cuckoo Filter developed for a thesis titled "Design and Evaluation of a GPU-Accelerated Cuckoo Filter." The library targets high-throughput batch operations and exposes configurable parameters such as fingerprint size, bucket size, eviction policy (DFS or BFS) and maximum eviction attempts. It is header-only and includes features aimed at GPU performance: sorted insertion for better memory coalescing, multiple eviction strategies, multi-GPU operation via a gossip mechanism, and IPC support for cross-process sharing. Performance results at 80% load on an NVIDIA GH200 (H100 HBM3, 3.4 TB/s) compare the GPU Cuckoo Filter to CPU Cuckoo, Bulk Two-Choice Filter, GPU Counting Quotient Filter and GPU Blocked Bloom Filter; the repo also contains benchmarks, tests, example applications and build scripts. Requirements listed include CUDA Toolkit >= 12.9, a C++20 compiler and Meson >= 1.3.0.

Why it matters

Delivers high-throughput membership operations on GPUs, which can accelerate analytics and streaming workloads that rely on approximate set membership.
Supports deletion, a capability not available in traditional Bloom filters, widening applicability for dynamic datasets.
Multi-GPU and IPC features enable scaling and sharing across processes and devices without redesigning the data structure.
Header-only design and configurable parameters make the implementation adaptable to different GPU environments and application needs.

Key facts

Project implements a GPU-accelerated Cuckoo Filter optimized for batch insert, lookup and delete.
Benchmarks run at 80% load factor on an NVIDIA GH200 (H100 HBM3, 3.4 TB/s).
L2-resident (≈4M items, ~8 MiB) comparisons: GPU vs CPU Cuckoo — insert 360× faster, query 973× faster.
DRAM-resident (≈268M items, ~512 MiB) comparisons: GPU vs CPU Cuckoo — insert 583× faster, query 1504× faster.
Compared to a GPU Blocked Bloom Filter: insert performance is slower (0.6–0.7×) while query performance is similar or faster (1.0–1.4×) depending on residency.
Configurable options exposed through a template CuckooConfig include key type, fingerprint bits (8/16/32), max evictions, CUDA block size and bucket size.
Eviction policies implemented: DFS and BFS; alternate bucket policies available (e.g., XorAltBucketPolicy).
Supports multi-GPU operation (CuckooFilterMultiGPU) and IPC for cross-process sharing; repository includes benchmarks, tests and examples.
Build requirements: CUDA Toolkit >= 12.9, C++20-compatible compiler, Meson build system >= 1.3.0; Meson commands provided to build and disable tests/benchmarks.

What to watch next

Read the accompanying thesis for a more comprehensive evaluation of systems and detailed analysis (linked from the repo).
Future adoption or integration with production systems is not confirmed in the source.
Support for non-CUDA GPUs or alternative backends (e.g., ROCm) is not confirmed in the source.

Quick glossary

Cuckoo Filter: A probabilistic data structure for approximate set membership that supports insertion, lookup and deletion using small fingerprints and multiple candidate buckets per item.
Bloom Filter: A space-efficient probabilistic structure for testing set membership that supports inserts and lookups but typically does not support deletion without extensions.
Fingerprint: A short, fixed-size hash-derived value stored in a filter slot representing an inserted key; smaller fingerprints reduce memory but increase false positives.
Bucket: A fixed-size set of slots in a filter where fingerprints can reside; bucket size affects collision handling and capacity.
IPC (Inter-Process Communication): Mechanisms that allow data structures or memory to be shared across different processes, enabling reuse without duplicate copies.

Reader FAQ

Does this implementation support deletions?
Yes. The GPU Cuckoo Filter exposes delete operations in batch form.

What platforms and toolchains are required to build it?
The repository lists CUDA Toolkit >= 12.9, a C++20-compatible compiler and the Meson build system >= 1.3.0.

Is the code header-only and easy to integrate?
Yes. The implementation is provided as a header-only library with example applications and usage snippets.

Are the performance claims validated across many systems?
The repo includes benchmarks and notes that a more comprehensive evaluation is available in the accompanying thesis.

GPU-Accelerated Cuckoo Filter A high-performance CUDA implementation of the Cuckoo Filter data structure, developed as part of the thesis "Design and Evaluation of a GPU-Accelerated Cuckoo Filter". Overview This library…

Sources

Show HN: GPU Cuckoo Filter – faster queries than Blocked Bloom, with deletion

GPU-Accelerated Cuckoo Filter: Faster Queries and Deletions on CUDA

By

TL;DR

What happened

Why it matters

Key facts

What to watch next

Quick glossary

Reader FAQ

Sources

Related posts

By

Related Post

vamos: a virtual AmigaOS runtime for running m68k CLI tools on macOS and PC

Musashi: Portable ANSI C Motorola 680×0-family emulator on GitHub

AMD teases MI500 GPUs, Helios racks and Venice Epyc to challenge Nvidia

Leave a Reply Cancel reply

You missed

Waymo Rebrands Zeekr RT Robotaxi as ‘Ojai’ Ahead of Public Rollout

How to Push Back Against Biometric Surveillance at Wegmans and Protect Privacy

Study: Linux kernel bugs lurk 2.1 years on average; some persist 20+ years

Navee shows Wavefly seaplane and premium e-scooter lineup at CES 2026