TL;DR
An engineer investigating intermittent HTTP 499 errors on an Envoy network proxy used OpenTelemetry eBPF Instrumentation (OBI) to capture zero-code traces across Envoy and backend hops. The experiment used a minimal Docker Compose setup and a production-like stack (OBI -> OpenTelemetry Collector -> Jaeger/Prometheus/Grafana) to capture per-hop spans, latencies, addresses and traceparent IDs.
What happened
Facing sporadic HTTP 499 responses originating somewhere in cloud infrastructure, the author found Envoy access logs and built-in tracing insufficient for end-to-end debugging. To get actionable telemetry without changing application code, they tested OpenTelemetry eBPF Instrumentation (OBI, formerly Grafana Beyla). In a reproducible Docker Compose lab the author deployed an Envoy TCP proxy (listener on 8000 forwarding to a backend on 8080), a simple Go HTTP server, and an OBI autoinstrumenter container configured to print traces to stdout. OBI produced per-hop spans with timestamps, per-span latencies, source/destination host:port pairs, content and response lengths, service labels, and traceparent IDs. The author then extended the test to chain two Envoy instances, adjusting OBI to monitor a port range and run in the host PID namespace; the resulting multi-span traces show the request path through the client-facing Envoy, the internal Envoy hop, and the backend. A separate production-like setup used Incus containers, an OpenTelemetry Collector, Jaeger, Prometheus and Grafana to filter and visualize traces, with the Collector filtering telemetry by PID because OBI alone could not.
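For readers who want to reproduce the lab, a rough sketch of such a Compose file is below. The OBI image and environment variables are the ones named in the write-up; the Envoy image tag, service names, and backend build path are illustrative assumptions, not details from the source.

```yaml
services:
  backend:
    build: ./backend                        # assumed path; the Go HTTP server listens on :8080

  envoy:
    image: envoyproxy/envoy:v1.31-latest    # assumed tag; any recent Envoy release works
    volumes:
      - ./envoy.yaml:/etc/envoy/envoy.yaml:ro
    ports:
      - "8000:8000"                         # TCP proxy listener, forwarded to backend:8080
    depends_on:
      - backend

  autoinstrumenter:
    image: otel/ebpf-instrument:main
    privileged: true                        # OBI needs elevated privileges to load eBPF programs
    pid: "host"                             # added for the chained-Envoy test so OBI can see processes across containers
    environment:
      OTEL_EBPF_TRACE_PRINTER: text         # print spans to stdout
      OTEL_EBPF_OPEN_PORT: "8000"           # widened to "8000-9000" for the multi-Envoy chain
```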
Why it matters
- Zero-code eBPF instrumentation can reveal per-hop spans and latencies without modifying application code or proxy configuration.
- OBI captures both server-side and client-side spans for proxy-to-backend chains, which helps pinpoint where request time is spent.
- Using an OpenTelemetry Collector lets teams filter noisy telemetry (for example by PID) and forward traces to Jaeger and metrics to Prometheus/Grafana.
- OBI can instrument network-level components, such as an Envoy TCP proxy, that lack application-level tracing in some deployment types.
Key facts
- The author investigated intermittent HTTP 499 errors that occurred roughly every 10 minutes.
- Envoy access logs lacked the request-tracing detail required to locate the latency bottleneck.
- OpenTelemetry tracing built into Envoy was noted to be available only for Application Load Balancers, not the Network Load Balancer scenario under investigation.
- The minimal reproducible setup used Docker Compose, Envoy, a Go HTTP server, and OBI (image otel/ebpf-instrument:main).
- Envoy was configured as a TCP proxy listening on port 8000 and forwarding to a backend at port 8080 (a minimal listener sketch follows this list).
- OBI was run privileged and configured to print traces to stdout (OTEL_EBPF_TRACE_PRINTER=text) and to inspect an open port (OTEL_EBPF_OPEN_PORT=8000, later 8000-9000 for the multi-Envoy chain).
- OBI trace lines include timestamp, total response time and internal execution time, protocol/method/path, source->destination addresses, contentLen, responseLen, svc label, and traceparent identifiers.
- To instrument a chain of Envoy instances, the autoinstrumenter was placed in the host PID namespace (pid: host) so it could observe processes across containers.
- OBI could filter metrics and traces by attribute values but not by process PID; the OpenTelemetry Collector was used to filter telemetry by PID in the production-like setup.
- The production-like observability stack included Jaeger for traces and Prometheus/Grafana for metrics and dashboards.
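As a reference for the listener fact above, a minimal Envoy static configuration for the port 8000 -> 8080 TCP proxy could look like the following. Names such as listener_0, ingress_tcp, and the backend hostname are illustrative assumptions; only the ports and the TCP-proxy role come from the source.

```yaml
static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address: { address: 0.0.0.0, port_value: 8000 }   # client-facing TCP listener
      filter_chains:
        - filters:
            - name: envoy.filters.network.tcp_proxy
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
                stat_prefix: ingress_tcp
                cluster: backend                                  # forward raw TCP to the backend cluster

  clusters:
    - name: backend
      type: STRICT_DNS
      load_assignment:
        cluster_name: backend
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: backend, port_value: 8080 }   # the Go HTTP server
```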
What to watch next
- Whether OBI integrations expand filtering options so PID-based exclusion is available without an external Collector (not confirmed in the source).
- How teams route OBI telemetry through an OpenTelemetry Collector to Jaeger and Prometheus/Grafana for visualization, as shown in the production setup and sketched after this list.
- Operational requirements for OBI such as privileged mode and using the host PID namespace when observing multi-container proxy chains.
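A minimal sketch of how such a Collector pipeline could drop spans by PID and fan telemetry out to Jaeger and Prometheus. It assumes OBI exports over OTLP and exposes the process ID as a resource attribute; the attribute name process.pid, the example PID, and the endpoints/ports are assumptions, not details from the source.

```yaml
receivers:
  otlp:
    protocols:
      grpc:                            # OBI sends OTLP to the Collector

processors:
  filter/drop_noisy_pid:
    error_mode: ignore
    traces:
      span:
        # assumed attribute name; spans matching the condition are dropped
        - 'resource.attributes["process.pid"] == 12345'

exporters:
  otlp/jaeger:
    endpoint: "jaeger:4317"            # Jaeger accepts OTLP natively
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"           # scraped by Prometheus, visualized in Grafana

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [filter/drop_noisy_pid]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```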
Quick glossary
- eBPF: A Linux kernel technology that allows safe, programmable instrumentation of kernel and user-space events for observability and security without changing application code.
- Envoy: A high-performance open-source edge and service proxy for cloud-native applications, commonly used for load balancing, routing and observability.
- OpenTelemetry Collector: A component that receives, processes and exports telemetry data (traces, metrics, logs) to backend systems; it can apply processors such as filters.
- Span / Trace: A trace is a tree of spans that represent operations in a distributed request; each span records timing, metadata and relationships to other spans.
Reader FAQ
Does OBI require code changes to instrument Envoy and backend services?
No. The author demonstrates zero-code automatic instrumentation using eBPF.
Can OBI produce per-hop traces for Envoy TCP proxy chains?
Yes. The experiment produced multi-span traces showing the client-facing Envoy, internal Envoy hops, and the backend, including per-span latency and addresses.
Can OBI filter telemetry by process PID on its own?
OBI could not filter by PID in the author’s tests; the OpenTelemetry Collector was used to perform PID-based filtering.
Is the production readiness and performance overhead of OBI discussed?
Not confirmed in the source.
Sources
- Zero-Code Instrumentation of an Envoy TCP Proxy Using eBPF
- Zero-code Instrumentation
- Zero-Code Observability: Using eBPF to Auto-Instrument …
- Exploring OpenTelemetry Go Instrumentation via eBPF