DigitalOcean managed services disrupted each other after update

TL;DR

A DigitalOcean customer reported a production outage after a managed PostgreSQL update broke private VPC connectivity to their managed Kubernetes cluster. The customer traced the problem to a Cilium bug; support applied a temporary workaround, and the merged upstream fix had not yet been deployed to DOKS at the time of the report.

What happened

A user on Hacker News described a production outage that began when DigitalOcean's managed PostgreSQL update broke private VPC connectivity to their managed Kubernetes service. The application's public endpoint remained reachable while the private endpoint timed out. The commenter identified the root cause as a Cilium bug (issue #34503) that leaves ARP entries stale after infrastructure changes. DigitalOcean support responded in under 12 hours and implemented a temporary mitigation: a DaemonSet, obtained from a third-party GitHub user, that pings stale ARP entries every 10 seconds. The upstream Cilium fix has been merged, but according to the report it had not yet been rolled out to DigitalOcean Kubernetes Service (DOKS), and no ETA was provided. The poster said they remain a customer but now regard managed services as a trade of one set of failure modes for another rather than an absence of operational issues.
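To make the workaround concrete, here is a minimal Python sketch of what "ping stale ARP entries every 10 seconds" could look like on a Linux node. It is not the DaemonSet that support deployed, and the source does not show that script. The function names (stale_neighbors, refresh_once, run_forever) are hypothetical, and the sketch assumes the iproute2 `ip` command and the iputils `ping` command are available on the node.

```python
import subprocess
import time


def stale_neighbors(ip_neigh_output: str) -> list[tuple[str, str]]:
    """Return (ip, device) pairs whose neighbor state is STALE.

    Expects text in the format printed by `ip -4 neigh show`, for example:
    "10.0.0.5 dev eth1 lladdr aa:bb:cc:dd:ee:ff STALE"
    """
    stale = []
    for line in ip_neigh_output.splitlines():
        fields = line.split()
        # Minimal shape: <ip> dev <device> ... <STATE>
        if len(fields) >= 4 and fields[1] == "dev" and fields[-1] == "STALE":
            stale.append((fields[0], fields[2]))
    return stale


def refresh_once() -> None:
    """Read the IPv4 neighbor table and send one ping to each STALE entry."""
    output = subprocess.run(
        ["ip", "-4", "neigh", "show"],
        capture_output=True, text=True, check=True,
    ).stdout
    for ip, dev in stale_neighbors(output):
        # One packet, one-second wait, sent out of the entry's own interface.
        # A failed ping is ignored; the goal is only to trigger re-resolution.
        subprocess.run(
            ["ping", "-c", "1", "-W", "1", "-I", dev, ip],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False,
        )


def run_forever(interval_seconds: float = 10.0) -> None:
    """Repeat the refresh on a fixed interval (10 seconds in the report)."""
    while True:
        refresh_once()
        time.sleep(interval_seconds)
```

Calling run_forever() from a privileged pod on each node would approximate the behavior described in the report. Whether that actually clears the condition behind Cilium issue #34503 is not something this sketch can confirm.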

Why it matters

Managed service updates can introduce cross-service failures that affect customer workloads.
Private VPC networking failures can leave services reachable publicly but inaccessible internally, complicating incident response.
Temporary vendor workarounds may rely on third-party artifacts rather than on the upstream fix itself.
Delays between upstream fixes and provider rollouts can extend outage risk for affected customers.
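The split failure described above (public path healthy, private path timing out) is easy to miss if monitoring only checks one endpoint. Below is a minimal sketch of a probe that tests both paths and names the pattern. The function names and result strings are hypothetical, and the hostnames and port in the usage comment are placeholders that do not come from the source.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and return 'ok', 'timeout', or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        # No answer within the deadline, as with the private endpoint in the report.
        return "timeout"
    except OSError:
        # Refused connection, DNS failure, unreachable network, and similar.
        return "error"


def diagnose(public_result: str, private_result: str) -> str:
    """Summarize which network path is failing from two probe results."""
    if public_result == "ok" and private_result == "ok":
        return "healthy"
    if public_result == "ok":
        return f"private path failing ({private_result})"
    if private_result == "ok":
        return f"public path failing ({public_result})"
    return "both paths failing"


# Example usage with placeholder hostnames and port (assumptions, not from the source):
# public = probe("db-public.example.com", 5432)
# private = probe("db-private.example.com", 5432)
# print(diagnose(public, private))
```

Keeping the classification in a pure function separate from the network call makes the logic testable without real endpoints, and it lets an alert distinguish a private-path failure from a full outage.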

Key facts

Production application experienced an outage after a DigitalOcean managed PostgreSQL update.
Public endpoint continued to function; private endpoint connections timed out.
Root cause identified by the commenter as Cilium issue #34503, which leaves ARP entries stale after infrastructure changes.
DigitalOcean support responded in less than 12 hours, per the report.
Mitigation applied by support: a DaemonSet, obtained from a third-party GitHub user, that pings stale ARP entries every 10 seconds.
Upstream Cilium fix has been merged but had not been deployed to DigitalOcean Kubernetes (DOKS) at the time of the report.
No estimated time of arrival (ETA) for the provider deployment was given in the report.
The commenter described themselves as a small startup that paid for managed services to avoid hands-on ops work.

What to watch next

Whether DigitalOcean will deploy the merged upstream Cilium fix to DOKS and the timeline for that deployment (not confirmed in the source).
If DigitalOcean issues a formal incident report or postmortem explaining the update sequence and remediation steps (not confirmed in the source).
Whether other customers report similar outages tied to the same Cilium issue or provider update (not confirmed in the source).

Quick glossary

Managed service: A cloud offering where the provider operates and maintains infrastructure or software on behalf of customers.
VPC (Virtual Private Cloud): A logically isolated virtual network in a cloud environment used to host resources privately.
Cilium: An open-source networking and security layer for cloud-native environments, often used with Kubernetes.
DaemonSet: A Kubernetes workload that ensures a copy of a pod runs on selected nodes across a cluster.
ARP (Address Resolution Protocol): A protocol used to map network addresses (IP) to physical machine addresses (MAC) on a local network.

Reader FAQ

Was the outage linked to a specific update?
Yes. The commenter attributed the outage to a DigitalOcean managed PostgreSQL update that broke private VPC connectivity to their managed Kubernetes cluster.

Did DigitalOcean provide a fix?
Support implemented a temporary workaround by deploying a DaemonSet that pings stale ARP entries; the upstream Cilium fix was merged but not yet rolled out to DOKS.

How quickly did support respond?
The commenter said DigitalOcean support responded in under 12 hours.

Are other customers affected or has a timeline for the permanent fix been announced?
Not confirmed in the source.

