How a 40-line fix closed a 400x JVM performance gap on Linux systems

TL;DR

On Linux, OpenJDK replaced a procfs-based routine behind ThreadMXBean.getCurrentThreadUserTime() with a single clock_gettime() call that relies on a clockid bit trick. In the author's benchmark the change cut mean latency from ~11.2 microseconds to ~0.28 microseconds, largely by removing the /proc-related syscall and parsing overhead.

What happened

An OpenJDK commit changed how the JVM computes per-thread user CPU time on Linux. The removed implementation opened /proc/self/task/<tid>/stat, read it, and used sscanf to extract the user and system tick fields before converting ticks to nanoseconds. The new approach calls pthread_getcpuclockid() to obtain the thread's clockid, rewrites its low bits to request the VIRT (user-time-only) clock, then calls clock_gettime() on that id. This removes the file I/O, the string parsing and several syscalls.

The commit also included a JMH microbenchmark. On the author’s test rig the average latency of getCurrentThreadUserTime() dropped from about 11.186 µs to about 0.279 µs, a ~40x reduction in mean latency, and profiles show the new path makes a single syscall.
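The three steps of the new approach can be sketched in a few lines. This is a minimal illustration in Python, whose time module wraps the same Linux calls, and not the OpenJDK code itself. The function name current_thread_user_time_ns is hypothetical, the constants are the clock-type values from the Linux kernel headers, and the sketch assumes Linux with Python 3.7 or later.

```python
import threading
import time

# Clock-type values the Linux kernel stores in the two low bits of a
# CPU-time clockid_t.
CPUCLOCK_PROF = 0        # user + system time
CPUCLOCK_VIRT = 1        # user time only
CPUCLOCK_SCHED = 2       # scheduler-accounted time
CPUCLOCK_CLOCK_MASK = 3  # mask covering the clock-type bits


def current_thread_user_time_ns() -> int:
    """User-mode CPU time of the calling thread in nanoseconds (Linux only)."""
    # Step 1: ask pthreads for the calling thread's CPU-time clock id.
    clock_id = time.pthread_getcpuclockid(threading.get_ident())
    # Step 2: keep the bits that identify the thread, switch the type to VIRT.
    user_clock_id = (clock_id & ~CPUCLOCK_CLOCK_MASK) | CPUCLOCK_VIRT
    # Step 3: a single clock_gettime() call, with no files and no text parsing.
    return time.clock_gettime_ns(user_clock_id)


# Burn a little CPU in user mode, then read the clock.
checksum = sum(i * i for i in range(200_000))
print("thread user time (ns):", current_thread_user_time_ns())
```

The value may advance in coarse steps on some kernels, so a short burst of work does not always show up immediately.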

Why it matters

The large reduction in per-call latency for fetching thread user CPU time lowers the overhead of instrumentation and monitoring code.
The new path eliminates multiple syscalls and kernel VFS work (open/read/close), lowering kernel contention under concurrency.
The simpler implementation reduces the surface area for parsing bugs and avoids the cost of string handling in userspace.
The fix relies on a long-standing Linux kernel encoding for clockid_t, so it needs no change to POSIX APIs, though it depends on a Linux-specific feature.

Key facts

Commit referenced: 858d2e434dd, OpenJDK issue 8372584 (title: Replace reading proc to get thread CPU time with clock_gettime).
Old implementation read /proc/self/task/<tid>/stat and used sscanf to extract user/system ticks.
New implementation obtains pthread clockid and flips low bits to CPUCLOCK_VIRT to get user-time-only.
Linux encodes clock type and target PID/TID inside clockid_t; the low bits indicate clock type (VIRT, PROF, SCHED, FD).
Author included a 55-line JMH benchmark in the changeset and ran it with 16 threads on a Ryzen 9950X and JDK main branch commit 8ab7d3b89f656e5c.
Benchmark results: mean ~11.186 µs/op before, ~0.279 µs/op after (≈40x lower mean latency).
Original bug report (2018) observed getCurrentThreadUserTime() being 30x–400x slower than getCurrentThreadCpuTime().
Profiles show the old path spent most time in syscalls and VFS/file handling; the new path mostly runs in JVM with a single syscall.
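The clockid_t encoding mentioned in the key facts can be inspected directly. The sketch below is in Python and assumes Linux. The helper decode_cpu_clockid is hypothetical, and the layout it decodes (two low bits for the clock type, one per-thread flag bit, and the bitwise NOT of the target id in the remaining bits) is the Linux kernel's convention and not part of POSIX.

```python
import threading
import time

CPUCLOCK_CLOCK_MASK = 3      # two low bits: clock type
CPUCLOCK_PERTHREAD_MASK = 4  # bit 2: set for a thread clock, clear for a process clock
CPUCLOCK_VIRT = 1            # user-time-only clock type
CLOCK_TYPES = {0: "PROF", 1: "VIRT", 2: "SCHED", 3: "FD"}


def decode_cpu_clockid(clock_id: int) -> dict:
    """Split a Linux CPU-time clockid_t into its packed fields."""
    return {
        "type": CLOCK_TYPES[clock_id & CPUCLOCK_CLOCK_MASK],
        "per_thread": bool(clock_id & CPUCLOCK_PERTHREAD_MASK),
        # The remaining bits hold the bitwise NOT of the target PID/TID.
        "target_id": ~(clock_id >> 3),
    }


# Clock id that pthreads hands out for the calling thread.
thread_clock = time.pthread_getcpuclockid(threading.get_ident())
print(decode_cpu_clockid(thread_clock))

# Same thread, with the type bits rewritten to request user time only.
user_only_clock = (thread_clock & ~CPUCLOCK_CLOCK_MASK) | CPUCLOCK_VIRT
print(decode_cpu_clockid(user_only_clock))
```

Only the type field differs between the two printed dictionaries, which is why rewriting the low bits is enough to switch clocks without another lookup.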

What to watch next

Whether this change is backported or adopted across other OpenJDK branches and downstream distributions: not confirmed in the source.
Potential risks if kernel internals or the clockid_t encoding change in future kernels: not confirmed in the source.
Any portability concerns for non-Linux platforms that lack the clockid bit encoding: not confirmed in the source.

Quick glossary

clock_gettime: A POSIX API that returns the current value of a specified clock (seconds and nanoseconds).
/proc filesystem (procfs): A virtual filesystem on Linux that exposes process and kernel information as text files.
clockid_t: An integer type used to identify a clock for clock_gettime and related APIs; on Linux it can encode clock type and target id.
pthread_getcpuclockid: A POSIX function that returns a clockid_t associated with a pthread, typically indicating a thread CPU-time clock.
JMH: Java Microbenchmark Harness, a toolkit for building, running and analyzing Java microbenchmarks.

Reader FAQ

What specifically was changed?
The code that read and parsed /proc/self/task/<tid>/stat to compute thread user time was replaced with a clock_gettime() call using a modified clockid_t obtained from pthread_getcpuclockid().

Why was /proc used previously?
POSIX-standard clocks typically report total CPU time (user + system), so parsing /proc was a way to extract user-only time. The source suggests these POSIX constraints led to the prior approach.
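For comparison, the general shape of the procfs route looks like the sketch below. It is written in Python and is not the removed OpenJDK code, which the source describes as using sscanf. The function name is hypothetical, the field positions follow the proc(5) layout of the stat file, and it assumes Linux with Python 3.8 or later.

```python
import os
import threading


def thread_cpu_times_ns_via_procfs() -> tuple:
    """Return (user_ns, system_ns) for the calling thread by parsing procfs."""
    tid = threading.get_native_id()
    # Opening, reading and closing the file costs several syscalls per call.
    with open(f"/proc/self/task/{tid}/stat") as stat_file:
        stat = stat_file.read()
    # Field 2 (comm) is wrapped in parentheses and may contain spaces,
    # so split only the text after the last ')'.
    fields = stat[stat.rindex(")") + 2:].split()
    # fields[0] is field 3 (state); utime is field 14 and stime is field 15.
    utime_ticks = int(fields[11])
    stime_ticks = int(fields[12])
    # Ticks are counted in units of 1 / sysconf(_SC_CLK_TCK) seconds.
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    ns_per_tick = 1_000_000_000 // ticks_per_second
    return utime_ticks * ns_per_tick, stime_ticks * ns_per_tick


user_ns, system_ns = thread_cpu_times_ns_via_procfs()
print("user ns:", user_ns, "system ns:", system_ns)
```

The file keeps user and system ticks in separate fields, which is what made it usable for a user-only reading.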

How much faster is the new approach?
On the author’s benchmark run the mean latency dropped from about 11.186 microseconds to about 0.279 microseconds, roughly a 40x reduction in average latency.
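The ratio follows directly from the two reported means. The short Python calculation below uses only those two figures from the source. The one-million-call workload is a hypothetical number chosen for illustration.

```python
before_us = 11.186  # mean latency before the change, as reported in the source
after_us = 0.279    # mean latency after the change, as reported in the source

speedup = before_us / after_us
saved_us_per_call = before_us - after_us

# Hypothetical workload, chosen only to make the saving tangible.
calls = 1_000_000
saved_seconds = saved_us_per_call * calls / 1_000_000

print(f"speedup: {speedup:.1f}x")                                 # speedup: 40.1x
print(f"saved per call: {saved_us_per_call:.3f} µs")              # saved per call: 10.907 µs
print(f"saved over {calls:,} calls: {saved_seconds:.2f} s")       # saved over 1,000,000 calls: 10.91 s
```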

Is this portable to non-Linux systems?
The change relies on Linux kernel encoding of clockid_t; portability to other OSes is not confirmed in the source.

Original post: How a 40-Line Fix Eliminated a 400x Performance Gap, by Jaromir Hamala (QuestDB Team), January 13, 2026. Tags: jvm, linux, performance, engineering. Excerpt: "I have a habit of skimming the OpenJDK commit…"

Sources

A 40-line fix eliminated a 400x performance gap
