TL;DR
A DigitalOcean customer reported a production outage after a managed PostgreSQL update broke private VPC connectivity to their managed Kubernetes cluster. The customer traced the failure to a Cilium bug; support applied a temporary workaround, and the merged upstream fix had not yet been deployed to DigitalOcean Kubernetes (DOKS) at the time of the report.
What happened
A Hacker News user described a production outage that began when a DigitalOcean managed PostgreSQL update broke private VPC connectivity between the database and the user's DigitalOcean managed Kubernetes cluster. The database's public endpoint remained reachable while connections to the private endpoint timed out. The commenter identified the root cause as a Cilium bug (issue #34503) that leaves ARP entries stale after infrastructure changes. DigitalOcean support responded in under 12 hours and applied a temporary mitigation: a DaemonSet, sourced from a third-party GitHub user, that pings stale ARP entries every 10 seconds. The upstream Cilium fix has been merged, but according to the report it has not yet been rolled out to DigitalOcean Kubernetes Service (DOKS), and no ETA was provided. The poster said they remain a customer but now see managed services as trading one set of failure modes for another, not as removing operational issues.
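To make the workaround concrete, here is a minimal Python sketch of a loop that pings stale ARP entries every 10 seconds. It is not the DaemonSet that support deployed, whose contents are not given in the report. The function names, the set of states treated as needing a refresh, and the use of the Linux `ip` and `ping` commands are all assumptions made for illustration.

```python
import subprocess
import time

# Neighbour states treated as candidates for a refresh. This set is an
# assumption for illustration; the real workaround may choose differently.
REFRESH_STATES = {"STALE", "FAILED"}


def parse_neighbors(ip_neigh_output):
    """Return (ip, state) pairs parsed from `ip neigh show` output."""
    entries = []
    for line in ip_neigh_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        # The address is the first field and the state is the last one, e.g.
        # "10.110.0.5 dev eth1 lladdr aa:bb:cc:dd:ee:ff STALE".
        entries.append((fields[0], fields[-1]))
    return entries


def addresses_to_refresh(ip_neigh_output, states=REFRESH_STATES):
    """Return the IPs whose neighbour entry is in one of the given states."""
    return [ip for ip, state in parse_neighbors(ip_neigh_output) if state in states]


def ping_once(ip):
    """Send a single echo request so the kernel re-resolves the neighbour."""
    subprocess.run(
        ["ping", "-c", "1", "-W", "1", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )


def refresh_loop(interval_seconds=10, rounds=None):
    """Ping matching neighbours every interval; rounds=None runs forever."""
    done = 0
    while rounds is None or done < rounds:
        result = subprocess.run(
            ["ip", "-4", "neigh", "show"],
            capture_output=True,
            text=True,
            check=False,
        )
        for ip in addresses_to_refresh(result.stdout):
            ping_once(ip)
        done += 1
        time.sleep(interval_seconds)
```

Calling `refresh_loop()` on a Linux host would repeat the scan every 10 seconds. In a cluster, a script like this would run in a host-networked pod on every node, which is the role a DaemonSet plays.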
Why it matters
- Managed service updates can introduce cross-service failures that affect customer workloads.
- Private VPC networking failures can leave services reachable publicly but inaccessible internally, complicating incident response.
- Temporary vendor workarounds may rely on third-party artifacts instead of the upstream fix itself.
- Delays between upstream fixes and provider rollouts can extend outage risk for affected customers.
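The split between public and private reachability can be checked directly during an incident. The sketch below opens a TCP connection to each endpoint with a timeout and labels the result. The function names are hypothetical, and the hostnames and port in the usage note are placeholders, since the report does not give the real ones.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS failures.
        return False


def classify(public_ok, private_ok):
    """Summarize which network path is failing."""
    if public_ok and private_ok:
        return "both paths reachable"
    if public_ok:
        return "private path down, public path up"
    if private_ok:
        return "public path down, private path up"
    return "both paths down"
```

A check run from inside the cluster might look like `classify(can_connect("db-public.example.com", 5432), can_connect("db-private.example.com", 5432))`. A result of "private path down, public path up" would match the symptom in the report and point at VPC networking, not at the database itself.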
Key facts
- Production application experienced an outage after a DigitalOcean managed PostgreSQL update.
- Public endpoint continued to function; private endpoint connections timed out.
- Root cause identified as Cilium bug #34503 causing ARP entries to become stale after infrastructure changes.
- DigitalOcean support responded in less than 12 hours, per the report.
- Applied mitigation involved deploying a DaemonSet from a GitHub user to ping stale ARP entries every 10 seconds.
- Upstream Cilium fix has been merged but had not been deployed to DigitalOcean Kubernetes (DOKS) at the time of the report.
- No estimated time of arrival (ETA) for the provider deployment was given in the report.
- The commenter described themselves as a small startup that paid for managed services to avoid hands-on ops work.
What to watch next
- Whether DigitalOcean will deploy the merged upstream Cilium fix to DOKS and the timeline for that deployment (not confirmed in the source).
- If DigitalOcean issues a formal incident report or postmortem explaining the update sequence and remediation steps (not confirmed in the source).
- Whether other customers report similar outages tied to the same Cilium issue or provider update (not confirmed in the source).
Quick glossary
- Managed service: A cloud offering where the provider operates and maintains infrastructure or software on behalf of customers.
- VPC (Virtual Private Cloud): A logically isolated virtual network in a cloud environment used to host resources privately.
- Cilium: An open-source networking and security layer for cloud-native environments, often used with Kubernetes.
- DaemonSet: A Kubernetes workload that ensures a copy of a pod runs on selected nodes across a cluster.
- ARP (Address Resolution Protocol): A protocol that maps IP addresses to hardware (MAC) addresses on a local network. A stale entry can point traffic at a MAC address that no longer serves that IP.
Reader FAQ
Was the outage linked to a specific update?
Yes. The commenter attributed the outage to a DigitalOcean managed PostgreSQL update that broke private VPC connectivity.
Did DigitalOcean provide a fix?
Support implemented a temporary workaround by deploying a DaemonSet that pings stale ARP entries; the upstream Cilium fix was merged but not yet rolled out to DOKS.
How quickly did support respond?
The commenter said DigitalOcean support responded in under 12 hours.
Are other customers affected or has a timeline for the permanent fix been announced?
Not confirmed in the source.
Sources
- Tell HN: DigitalOcean's managed services broke each other after update
- DigitalOcean Status – Incident History
- Quick Fixes: Common DigitalOcean Issues and How to …
- DigitalOcean Status