40-Line Fix Cuts Java Thread User-Time Call Latency by Up to 400×

TL;DR

An OpenJDK change replaced a /proc-based user-time implementation with a clock_gettime-based approach, removing file I/O and parsing. In the author's microbenchmark run, average latency fell from ~11.2 µs to ~0.279 µs per call (about a 40× improvement), with a much cleaner syscall profile. The 400× in the headline is the upper end of the 30×–400× range cited in the original bug report.

What happened

OpenJDK maintainers replaced code that read /proc/self/task/<tid>/stat and parsed its text to compute a thread's user CPU time with a small change that uses a thread-specific clockid and clock_gettime(). The removed implementation performed file I/O, string parsing and multiple syscalls; the new version obtains a clockid from pthread_getcpuclockid(), flips low bits in the returned clockid to select the user-time-only clock on Linux, and calls clock_gettime() directly. The patch added a JMH benchmark and removed the complex parsing code. In a local JMH run with 16 threads, the author measured average latency dropping from about 11.186 µs per call to 0.279 µs, and the profile of the fixed version showed far fewer kernel interactions. The change relies on a Linux kernel clockid encoding that is documented in kernel sources rather than in POSIX man pages.
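To make the cost of the removed path concrete, here is a minimal Java sketch of the kind of parsing it implied once the stat file had been read. This is not the HotSpot source (which is native code using sscanf); the class name, the sample line, and the fixed rate of 100 clock ticks per second are assumptions for illustration only.

```java
final class ProcStatUserTime {
    // Assumption: 100 clock ticks per second. Native code would query
    // sysconf(_SC_CLK_TCK) instead of hard-coding this value.
    static final long TICKS_PER_SECOND = 100;

    // Extracts utime (field 14 of the stat line) and converts it to nanoseconds.
    static long userTimeNanos(String statLine) {
        // Field 2 (comm) is wrapped in parentheses and may itself contain
        // spaces or ')', so the parser skips to the LAST closing parenthesis.
        int close = statLine.lastIndexOf(')');
        if (close < 0) {
            throw new IllegalArgumentException("no comm field found");
        }
        String[] rest = statLine.substring(close + 1).trim().split("\\s+");
        // rest[0] is field 3 (state), so field 14 (utime) is rest[11].
        long utimeTicks = Long.parseLong(rest[11]);
        return utimeTicks * (1_000_000_000L / TICKS_PER_SECOND);
    }

    public static void main(String[] args) {
        // Hypothetical stat line for a thread named "my worker" with utime = 37 ticks.
        String sample = "4242 (my worker) S 1 4242 4242 0 -1 4194304 120 0 0 0 37 5 0 0 20 0 1 0 100 0 0";
        System.out.println(userTimeNanos(sample) + " ns");
    }
}
```

Even in this simplified form, each call needs an open, a read, and a close on top of the text handling, which is the work the clock_gettime() path avoids.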

Why it matters

Calls to ThreadMXBean.getCurrentThreadUserTime() become far cheaper, reducing overhead in tooling and monitoring that sample thread CPU time frequently.
Replacing /proc reads with a single clock_gettime() syscall lowers kernel and VFS activity and reduces lock contention under concurrency.
A small, targeted change in native code produced a substantial run-time improvement, showing the value of platform-specific APIs where they are safe to use.
Because the fix depends on Linux-specific clockid encoding, portability and behavior on non-Linux systems need consideration.
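For readers who have not used the affected API, the following is a minimal sketch of sampling user time around a piece of work with the standard java.lang.management classes. The class name, helper method, and busy loop are hypothetical; only the ThreadMXBean calls come from the JDK.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

final class UserTimeSample {
    // Returns the user-mode CPU nanoseconds the calling thread spent in work,
    // or -1 if the JVM does not support or has disabled CPU time measurement.
    static long measureUserNanos(Runnable work) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isCurrentThreadCpuTimeSupported() || !bean.isThreadCpuTimeEnabled()) {
            return -1;
        }
        long before = bean.getCurrentThreadUserTime();
        work.run();
        long after = bean.getCurrentThreadUserTime();
        return after - before;
    }

    public static void main(String[] args) {
        long[] sink = new long[1]; // keeps the loop result observable
        long nanos = measureUserNanos(() -> {
            long acc = 0;
            for (int i = 0; i < 50_000_000; i++) {
                acc += i * 31L;
            }
            sink[0] = acc;
        });
        System.out.println("user CPU time: " + nanos + " ns (sink=" + sink[0] + ")");
    }
}
```

A profiler or metrics agent that wraps many small operations this way pays the per-call cost twice per measurement, which is why the latency of this one method matters.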

Key facts

The OpenJDK change replaced a /proc-based implementation with a clock_gettime-based approach using a modified clockid.
The removed code read /proc/self/task/<tid>/stat, used sscanf to parse fields, and converted clock ticks to nanoseconds.
The new code obtains a clockid via pthread_getcpuclockid() and flips low bits to select the VIRT clock (user time only) before calling clock_gettime().
Linux encodes the clock type into clockid_t: one low bit distinguishes a thread clock from a process clock, and the two lowest bits select the clock type (00 PROF, 01 VIRT, 10 SCHED, 11 FD).
Linux kernels have used this clockid encoding since 2.6.12 (2005); documentation is sparse and primarily in kernel sources.
The author ran a JMH benchmark (16 threads) included with the patch; average latency fell from ~11.186 µs/op to ~0.279 µs/op.
Measured improvement in that run is about 40× on average; the original bug report cited a 30×–400× gap depending on setup.
Both before and after runs still show rare high-tail outliers (~1.2 ms), but the fixed version shows a much cleaner syscall profile.
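The bit manipulation behind the new code can be modeled in plain Java. This is only an arithmetic sketch, not the native patch: the class and method names are hypothetical, and the layout (inverted thread ID in the upper bits, per-thread flag in bit 2, clock type in bits 1-0) is an assumption based on the Linux kernel's CPU clock macros rather than on anything stated in the source article.

```java
final class CpuClockId {
    static final int CLOCK_TYPE_MASK = 0b011; // bits 1-0: clock type
    static final int PER_THREAD_BIT = 0b100;  // bit 2: thread (1) vs process (0)
    static final int PROF = 0, VIRT = 1, SCHED = 2, FD = 3;

    // Assumed kernel-style encoding: ~tid in the upper bits, then flag and type.
    static int threadClockId(int tid, int type) {
        return (~tid << 3) | PER_THREAD_BIT | type;
    }

    // Replaces only the clock type and leaves the target thread untouched.
    static int withType(int clockId, int type) {
        return (clockId & ~CLOCK_TYPE_MASK) | type;
    }

    static int type(int clockId) {
        return clockId & CLOCK_TYPE_MASK;
    }

    static boolean isPerThread(int clockId) {
        return (clockId & PER_THREAD_BIT) != 0;
    }

    static int tid(int clockId) {
        return ~(clockId >> 3);
    }

    public static void main(String[] args) {
        // Assumption: the clockid from pthread_getcpuclockid() carries the SCHED
        // type (user + system time); 4242 is a made-up thread ID.
        int total = threadClockId(4242, SCHED);
        int userOnly = withType(total, VIRT);
        System.out.println("type before: " + type(total) + ", after: " + type(userOnly));
        System.out.println("same thread: " + (tid(total) == tid(userOnly)));
    }
}
```

In the native code, the equivalent of userOnly is what gets passed to clock_gettime(), so the whole query stays a single call with no text parsing.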

What to watch next

Whether this change is backported or propagated across OpenJDK release branches (not confirmed in the source).
Impact on high-concurrency workloads and real-world monitoring tools that frequently sample per-thread user time.
Potential kernel or libc changes that could alter clockid encoding or its stability (not confirmed in the source).

Quick glossary

clock_gettime: A POSIX/C function that returns the current value of a specified clock, typically used to read elapsed, process, or thread CPU time.
/proc filesystem: A virtual filesystem on Linux that exposes kernel and process information as text files; reading it can involve kernel string synthesis and VFS operations.
pthread_getcpuclockid: A POSIX function that returns a clockid associated with a specific thread; on Linux the returned clockid encodes additional type and target information.
JMH: Java Microbenchmark Harness, a toolkit for building and running microbenchmarks on the JVM.
clockid_t: An integer type representing a clock identifier used by clock_gettime and related APIs; on Linux parts of its bits encode clock type and target thread/process.

Reader FAQ

What did the patch change in OpenJDK?
It replaced a /proc-based user-time reader with a clock_gettime-based implementation that uses a thread-specific clockid.

How big was the performance improvement?
In the author's JMH run the average latency dropped from ~11.186 µs to ~0.279 µs per call, roughly a 40× reduction; the original bug reported a 30×–400× range.
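The 40× figure follows directly from the two reported averages. A short sketch of the arithmetic, using a hypothetical helper class and only the numbers quoted above:

```java
import java.util.Locale;

final class LatencySpeedup {
    // Ratio of the old average latency to the new one.
    static double speedup(double beforeMicros, double afterMicros) {
        return beforeMicros / afterMicros;
    }

    // Share of the original per-call latency that was removed, in percent.
    static double reductionPercent(double beforeMicros, double afterMicros) {
        return (1.0 - afterMicros / beforeMicros) * 100.0;
    }

    public static void main(String[] args) {
        double before = 11.186; // µs/op, reported average before the fix
        double after = 0.279;   // µs/op, reported average after the fix
        System.out.println(String.format(Locale.ROOT,
                "speedup: %.1fx, latency removed: %.1f%%",
                speedup(before, after), reductionPercent(before, after)));
        // prints "speedup: 40.1x, latency removed: 97.5%"
    }
}
```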

Why wasn't clock_gettime used originally?
POSIX clock types normally expose total CPU time (user + system); obtaining user-time-only required a Linux-specific tweak to the clockid.

Is this change portable to other operating systems?
The approach depends on Linux-specific clockid encoding; portability to non-Linux systems is not confirmed in the source.

Source: How a 40-Line Fix Eliminated a 400x Performance Gap, by Jaromir Hamala (QuestDB Team), January 13, 2026. Tags: jvm linux performance engineering. Excerpt: "I have a habit of skimming the OpenJDK commit…"
