TL;DR

A founder with ten years in observability examined real customer data and built tooling to automatically identify low-value telemetry. In tests across multiple services he found that roughly 40% of logs, on average, could be classified as waste, and surfacing that analysis helped teams cut noise and cost.

What happened

The author, who left an engineering role in 2016 to build a hosted logging product that evolved into Vector, says a decade in observability revealed persistent cost and noise problems. After Vector’s acquisition and three more years with the project, he began receiving requests from former users to help reduce vendor bills. Given access to a complex Vector deployment, he found extensive sampling, aggregation, regex drop lists and storage tiering already in use. Focusing on the regex bottleneck, he adopted a high-speed pattern engine (Hyperscan) to compile and match very large pattern sets at line rate. He then built a system that compresses billions of log lines into thousands of semantic events and evaluates them with service and failure context. In tests across several services he reported waste estimates of roughly 30–60% per service, averaging about 40%. After validating the results manually and comparing them against existing drop patterns, he presented the findings and worked with teams to gradually reduce noisy telemetry, simplifying pipelines and lowering bills.
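
To make the pattern-matching step concrete, here is a minimal sketch of the idea, not the author's actual tooling: it uses Python's standard re module in place of Hyperscan and an invented three-entry drop list, and simply reports what fraction of a log stream matches any known low-value pattern.

    import re

    # Illustrative drop patterns only. Real deployments carry hundreds or
    # thousands of these, which is why a high-speed engine like Hyperscan
    # is needed to keep matching at line rate.
    DROP_PATTERNS = [
        r"GET /healthz",              # load-balancer health checks
        r"\bDEBUG\b",                 # debug-level chatter left on in prod
        r"connection reset by peer",  # transient client disconnects
    ]

    # A single pre-compiled alternation keeps per-line matching cheap.
    drop_re = re.compile("|".join(f"(?:{p})" for p in DROP_PATTERNS))

    def estimate_waste(lines):
        """Return the fraction of log lines matching any drop pattern."""
        total = matched = 0
        for line in lines:
            total += 1
            if drop_re.search(line):
                matched += 1
        return matched / total if total else 0.0

    sample = [
        '10.0.0.5 - - "GET /healthz HTTP/1.1" 200',
        "DEBUG retrying request id=42",
        "ERROR payment failed order=991",
    ]
    print(f"estimated waste: {estimate_waste(sample):.0%}")

At real volumes the same idea runs against a compiled database of thousands of patterns rather than a small Python alternation; that is the role Hyperscan plays in the account above.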

Why it matters

  • High volumes of low-value telemetry drive most observability platform costs and operational overhead.
  • Teams spend engineering time policing and triaging billing spikes instead of improving software or observability signal.
  • Vendor reluctance to quantify customer data quality can leave buyers managing expensive, noisy systems alone.
  • Automating identification of low-value events can reduce noise, simplify pipelines and improve the utility of remaining data.

Key facts

  • The author left an engineering role in 2016 to found a hosted logging platform that became Vector.
  • Vector saw mass adoption and was later acquired; the author remained with the project for three years after acquisition.
  • Complex customer configurations included sampling, aggregation, storage tiering, archiving and long regex drop lists.
  • Hyperscan was used to compile very large pattern sets and match at line rate as part of the analysis.
  • A system was built to compress billions of logs into thousands of semantic events evaluated with service and failure context (see the sketch after this list).
  • Test results across multiple services showed per-service waste estimates of about 30%, 40% and 60%, averaging roughly 40%.
  • Manual review and comparison to existing handcrafted patterns supported the automated findings before rollout.
  • Customers who applied the findings cleaned logging, simplified pipelines and reduced bills over time.
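
For the semantic-event compression mentioned above, one common shape of the idea (the source does not describe the author's exact algorithm) is to mask the variable parts of each line so that structurally identical lines collapse onto a single template, then count occurrences per template. A toy sketch:

    import re
    from collections import Counter

    # Illustrative masking rules: replace obviously variable tokens so
    # structurally identical lines collapse onto one template.
    MASKS = [
        (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),   # IPv4 addresses
        (re.compile(r"\b[0-9a-f]{8,}\b"), "<HEX>"),             # ids, hashes
        (re.compile(r"\b\d+(?:\.\d+)?\b"), "<NUM>"),            # counts, durations
    ]

    def to_template(line):
        for pattern, token in MASKS:
            line = pattern.sub(token, line)
        return line

    def compress(lines):
        """Collapse raw lines into (template, count) pairs."""
        return Counter(to_template(line) for line in lines)

    logs = [
        "user 1041 logged in from 10.0.0.7",
        "user 2203 logged in from 10.0.0.9",
        "payment failed order=77 code=502",
    ]
    for template, count in compress(logs).most_common():
        print(count, template)

The per-template counts, joined with service and failure context, are what then gets judged as signal or waste.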

What to watch next

  • Whether major observability vendors begin providing systematic waste analysis for customer data — not confirmed in the source.
  • How quickly teams can adopt automated pattern-based filtering and what percentage of bills they’ll actually recover — not confirmed in the source.
  • The roadmap and public product releases from Tero (the team mentioned building this approach) and their reported impact — not confirmed in the source.

Quick glossary

  • Observability: The practice of instrumenting systems to collect telemetry (logs, metrics, traces) that helps engineers understand system behavior and troubleshoot issues.
  • Log: A timestamped record of events or messages emitted by software, used for debugging, auditing, or monitoring system behavior.
  • Cardinality: In telemetry, the number of unique values a label or tag can take; high cardinality increases storage and processing costs.
  • Sampling: A technique that reduces the volume of collected telemetry by selecting a subset of events to ingest or retain (see the sketch after this list).
  • Regex (regular expression): A pattern-matching syntax used to identify, classify or drop lines of text such as log entries.
  • Hyperscan: A high-performance regular expression matching library capable of compiling and running many patterns at line rate.
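
As a small illustration of the sampling entry above (not from the source), a deterministic hash-based sampler keeps a fixed fraction of lines while making the same keep/drop decision for identical input everywhere:

    import hashlib

    def keep(line, rate=0.1):
        """Keep roughly `rate` of lines, deterministically per line."""
        digest = hashlib.sha1(line.encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64 < rate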

Reader FAQ

How much of observability data was found to be waste?
In the sample analysis described, individual services showed about 30% to 60% waste, with an average near 40%.

Did vendors help customers identify and cut waste?
The author reports vendors frequently declined to quantify customer waste and often did not provide proactive help.

Is it safe to drop large portions of logs?
The author’s approach emphasized automated classification and phased rollouts; teams reviewed results and progressively cleaned logging rather than dropping data recklessly.

Will this method eliminate all observability problems?
Not confirmed in the source.

This year marks a decade for me in observability. I left my engineering job in 2016 to start Timber.io, a hosted logging platform, because I thought logs could be simple…
