TL;DR

A solo founder built AudioText Live, a real-time audio intelligence platform, using Go, microservices, and LLMs to cut per-hour voice intelligence costs from roughly $1.50 to about $0.30. The project relied on a distributed pipeline (Twilio input, NATS JetStream, recorder, async STT with Soniox, Gemini summarization) and required painful operational work on message bus backoff and consumer configs.

What happened

The author, a developer experienced in functional languages, chose Go to build AudioText Live, a real-time audio intelligence platform, despite a personal dislike for the language. The priorities were developer velocity (fast compile times) and predictable, AI-assisted code generation. Rather than a single monolith, the system was built as more than 15 microservices deployed on k3s, because audio ingestion, recording, and transcription have distinct runtime needs. In the pipeline, incoming audio from telephony providers is buffered into NATS JetStream, recorded as raw PCM, converted to WAV, and uploaded to S3. It is then processed asynchronously: Soniox handles diarization, Gemini generates summaries, and embeddings are stored in Qdrant. The author ran into a subtle NATS/Watermill interaction in which backoff settings unintentionally shortened the effective AckWait, so messages were redelivered and work was duplicated; resolving it required custom migration scripts and longer timeouts. The design emphasized idempotency, resiliency, and cost arbitrage against bundled vendor services.

Why it matters

  • Shows how language choice can be driven by engineering velocity and toolchain ergonomics rather than developer preference.
  • Demonstrates cost savings by unbundling telephony intelligence and using targeted STT/LLM providers.
  • Highlights operational complexity of message buses and the need to tune consumer/ack settings in production.
  • Illustrates that a distributed microservice approach can be necessary when components have divergent performance and GC behaviors.

Key facts

  • The platform ingests live audio from telephony providers such as Twilio, Telnyx, or SignalWire.
  • AudioText Live runs over 15 microservices on k3s (Kubernetes).
  • NATS JetStream was used as the persistent message bus; Watermill was used as an abstraction layer.
  • Recorder service holds WebSocket connections and writes raw PCM to disk; post_recorder converts audio to WAV and uploads to S3.
  • stt_async_service sends WAV files to Soniox for diarization, triggers Gemini for summarization, and imports embeddings into Qdrant.
  • The author reports raw service costs around $0.30/hour vs Twilio’s $1.50/hour for comparable voice intelligence, achieved by unbundling services.
  • A misconfiguration with Watermill backoff caused AckWait timeouts and message redelivery; the author had to migrate JetStream consumer configs and increase timeouts.
  • Idempotency keys are used in transcription billing logic to avoid double-charging on retries or crashes.
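
The idempotency-key idea in the last point can be illustrated with a small sketch. Everything here is hypothetical (the `Ledger` type, the key format, the amounts); the article does not describe the actual billing schema. An in-memory map stands in for what would presumably be a durable store with a unique constraint on the key.

```go
package main

import (
	"fmt"
	"sync"
)

// Ledger is a hypothetical in-memory billing ledger. A production version
// would persist entries durably and enforce key uniqueness in the store.
type Ledger struct {
	mu      sync.Mutex
	charges map[string]int64 // idempotency key -> amount in cents
}

// NewLedger returns an empty ledger.
func NewLedger() *Ledger {
	return &Ledger{charges: make(map[string]int64)}
}

// Charge records amountCents under key at most once. It reports whether this
// call applied the charge (true) or found an earlier one and did nothing (false).
func (l *Ledger) Charge(key string, amountCents int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, seen := l.charges[key]; seen {
		return false
	}
	l.charges[key] = amountCents
	return true
}

// Total sums all recorded charges in cents.
func (l *Ledger) Total() int64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	var sum int64
	for _, c := range l.charges {
		sum += c
	}
	return sum
}

func main() {
	l := NewLedger()
	key := "transcription:call-123" // hypothetical key derived from the job

	fmt.Println(l.Charge(key, 30)) // true: first delivery is billed
	fmt.Println(l.Charge(key, 30)) // false: redelivery is a no-op
	fmt.Println(l.Total())         // 30
}
```

The key has to be derived from the job itself rather than from the delivery, so that a redelivered message maps to the same key as the original.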

What to watch next

  • NATS JetStream consumer settings and AckWait values — incorrect backoff configuration can cause duplicate processing.
  • Idempotency and billing ledger correctness when retries or crashes occur.
  • Stability and latency of third-party STT (Soniox) and LLM (Gemini) integrations, since the architecture relies on async post-processing for cost savings.

Quick glossary

  • NATS JetStream: A persistent messaging and streaming system used to deliver and store messages between services.
  • Microservice: A small, independently deployable service that performs a specific function within a larger application.
  • Idempotency: A property of operations that ensures repeating the same action has no additional effect after the first success.
  • Diarization: The process of determining who spoke when in an audio recording.
  • k3s: A lightweight Kubernetes distribution designed for resource-constrained environments or edge deployments.

Reader FAQ

Why did the author use Go if they dislike it?
They prioritized developer velocity: Go compiles quickly and LLMs produce reliable Go code, which accelerated development.

How did AudioText Live cut costs compared to Twilio?
By unbundling voice intelligence, recording audio, using Soniox for diarization, and running async LLM summarization; reported raw cost is about $0.30/hour versus $1.50/hour.
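
As a quick check on those figures, the arithmetic can be worked through in a few lines. The per-hour prices are the ones reported in the article; the 1,000-hour volume and the function name are hypothetical, and the calculation ignores infrastructure and engineering costs.

```go
package main

import "fmt"

// Prices in cents per audio hour, as reported in the article.
const (
	bundledCentsPerHour   = 150
	unbundledCentsPerHour = 30
)

// savings returns the absolute saving in cents and the percentage saving
// for the given number of audio hours.
func savings(hours int) (savedCents, percent int) {
	bundled := hours * bundledCentsPerHour
	unbundled := hours * unbundledCentsPerHour
	if bundled == 0 {
		return 0, 0
	}
	savedCents = bundled - unbundled
	percent = savedCents * 100 / bundled
	return savedCents, percent
}

func main() {
	saved, pct := savings(1000) // hypothetical monthly volume
	fmt.Printf("$%d.%02d saved (%d%%)\n", saved/100, saved%100, pct)
	// prints "$1200.00 saved (80%)"
}
```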

Was the system built as a monolith?
No — the project uses a distributed architecture with over 15 microservices to isolate different runtime needs like recording and transcription.

Was Watermill effective as a messaging abstraction?
Watermill simplified some aspects but hid crucial knobs for NATS, contributing to a backoff/AckWait issue that required custom fixes.

Source: "I Hate Go, But It Saved My Startup: An Architectural Autopsy", AudioText Team, 1 January 2026 (6 min read).
