TL;DR

Atlassian spent four years untangling dangerous internal dependencies, and a 2023 tabletop disaster-recovery exercise showed how widely they could block recovery. The company re-architected its platform into layered components, migrated a key registry off its orchestration system, and built a new low-dependency deployer, although some circular dependencies remain.

What happened

Atlassian disclosed a multiyear program to reduce internal service dependencies after discovering recovery blockers in its platform. Its custom orchestration system, Micros, runs more than 2,000 services and handles over 5,000 deploys per day, and works with some 40,000 DynamoDB tables, 80,000+ RDS tables and about three million Lambda functions. In 2021 the private Docker registry Artifactory was deployed using Micros, creating a circular dependency that could prevent either system from being recovered if the other failed. The company launched a Continuous PaaS Recovery (CPR) effort and staged a 2023 tabletop disaster-recovery exercise that simulated 6.5 days of recovery work; the results showed that many services remained down because of dependency tangles. To address the problem, Atlassian reorganized its cloud platform into layered tiers with strict rules on which direction dependencies may point, migrated Artifactory from Micros to Kubernetes, and built Atlassian Platform Deployer (APD) on AWS CloudFormation. The effort removed hundreds of circular dependencies, though some internal cycles persist.
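
To make the circular-dependency problem concrete, here is a minimal Python sketch of how a cycle such as the Micros/Artifactory loop could be detected in a dependency map. The graph contents and the `find_cycle` helper are illustrative assumptions, not Atlassian's tooling; the source does not describe how its cycles were actually found.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if acyclic.

    `deps` maps each service to the services it needs in order to start.
    Uses a depth-first search with three node states.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {name: UNSEEN for name in deps}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in deps.get(node, []):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:
                # We walked back into a service still on the current path.
                return path[path.index(dep):] + [dep]
            if dep_state == UNSEEN:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for name in deps:
        if state[name] == UNSEEN:
            found = visit(name)
            if found:
                return found
    return None


# Hypothetical model of the 2021 situation: the registry is deployed by the
# orchestrator, and the orchestrator pulls images from the registry.
before = {"micros": ["artifactory"], "artifactory": ["micros"]}
# Hypothetical model after the registry moved to Kubernetes.
after = {"micros": ["artifactory"], "artifactory": ["kubernetes"], "kubernetes": []}

print(find_cycle(before))
print(find_cycle(after))
```

In this toy model the first graph yields a cycle and the second does not, which mirrors the effect the article attributes to moving Artifactory off Micros.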

Why it matters

  • Dependency loops can block recovery and expand the impact of outages for SaaS providers.
  • Re-architecting into dependency-aware layers reduces the risk surface and clarifies recovery order.
  • Deployment and provisioning tooling (like APD) is central to platform resilience for cloud migrations.
  • Customers being moved to cloud-only offerings will depend on the robustness of these changes.
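
The point about recovery order can be sketched with a topological sort: if the dependency graph has no cycles, there is an order in which every service comes back only after the things it needs. The example below uses Python's standard `graphlib` module; the service names and the `recovery_order` helper are hypothetical and are not taken from the source.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List


def recovery_order(deps: Dict[str, List[str]]) -> List[str]:
    """Return services in an order where dependencies are recovered first.

    `deps` maps each service to the services it depends on.
    Raises ValueError if a circular dependency makes ordering impossible.
    """
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        # graphlib reports the detected cycle as the second element of args.
        raise ValueError(f"no valid recovery order; cycle: {err.args[1]}") from err


# Hypothetical three-service platform.
deps = {
    "app": ["deployer", "registry"],
    "deployer": ["registry"],
    "registry": [],
}
print(recovery_order(deps))
```

With a loop in the graph, `recovery_order` raises instead of returning a list, which is the programmatic analogue of a recovery that cannot start anywhere.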

Key facts

  • Atlassian ran a four-year effort focused on reducing internal dependencies.
  • Micros, the company's custom orchestration system, manages over 2,000 services and 5,000+ daily deploys.
  • Micros works with more than 40,000 DynamoDB tables, 80,000+ RDS tables and about three million Lambda functions.
  • In 2021 Artifactory (a private Docker registry) was deployed via Micros, creating a critical circular dependency.
  • The Continuous PaaS Recovery (CPR) project prioritized unpicking dependencies that blocked service recovery.
  • A 2023 tabletop DR exercise simulated 6.5 days of recovery and revealed that many services remained down because of dependency tangles.
  • Atlassian re-architected its platform into a 'layer cake' with rules limiting allowed hard dependencies between layers.
  • Artifactory was migrated from Micros to Kubernetes to remove a key circular dependency.
  • Atlassian built Atlassian Platform Deployer (APD), which uses AWS CloudFormation as its orchestration engine.
  • Hundreds of circular dependencies were eliminated, though some internal cycles remain.
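
The 'layer cake' rule can be sketched as a simple check over a dependency map. The source does not spell out Atlassian's actual layer definitions or rules, so this example assumes numbered layers where lower numbers are more foundational and a hard dependency may only point to a strictly lower layer. The layer assignments, service names and `layer_violations` helper are all hypothetical.

```python
from typing import Dict, List, Tuple


def layer_violations(
    layer_of: Dict[str, int],
    hard_deps: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """List (service, dependency) pairs that break the assumed layering rule.

    Assumed rule: a service may hard-depend only on services in a strictly
    lower-numbered (more foundational) layer.
    """
    violations = []
    for service, deps in hard_deps.items():
        for dep in deps:
            if layer_of[dep] >= layer_of[service]:
                violations.append((service, dep))
    return violations


# Hypothetical layer assignments: 0 is the most foundational layer.
layer_of = {"registry": 0, "orchestrator": 1, "product": 2}
hard_deps = {
    "product": ["orchestrator"],
    "orchestrator": ["registry"],
    "registry": ["orchestrator"],  # points upward, so it is flagged
}
print(layer_violations(layer_of, hard_deps))
```

A check like this could run in CI so that an upward dependency is flagged when it is introduced, before a recovery exercise has to find it.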

What to watch next

  • Whether the remaining circular dependencies trigger real-world outages or recovery failures — not confirmed in the source.
  • Results of future disaster-recovery exercises or metrics demonstrating improved recovery times after the changes — not confirmed in the source.
  • How the platform handles load and incidents as Atlassian proceeds with its cloud-only migration for customers — not confirmed in the source.

Quick glossary

  • PaaS (Platform as a Service): A cloud model that provides a managed platform allowing developers to build, run and manage applications without handling underlying infrastructure.
  • Docker registry / Artifactory: A repository service that stores and distributes container images and other build artifacts used in deployment pipelines.
  • Kubernetes: An open-source system for automating deployment, scaling and management of containerized applications.
  • AWS CloudFormation: An AWS service that models and provisions resources and application stacks using templates for automated infrastructure deployment.
  • Tabletop disaster-recovery exercise: A simulated, discussion-based drill where teams walk through incident scenarios to identify gaps in recovery plans and coordination.

Reader FAQ

What was the primary problem Atlassian faced?
Circular and tangled internal dependencies that made it hard to recover services after failures.

How long did the cleanup effort take?
The company says the effort spanned four years.

Did Atlassian fully eliminate circular dependencies?
No — hundreds were removed but some internal circular dependencies remain.

Was the private registry Artifactory moved off Micros?
Yes; Artifactory was migrated from Micros to Kubernetes.

Will the changes prevent all future outages?
Not confirmed in the source.
