TL;DR

In 2011 Netflix described a suite of tools — the "Simian Army" — that intentionally injects or detects failures across its cloud environment to validate fault-tolerance. Individual components (Chaos Monkey, Latency Monkey, Security Monkey, etc.) either simulate outages or find misconfigurations so engineers can build automated recovery and harden production systems.

What happened

Netflix outlined a programmatic approach to validating cloud reliability by intentionally provoking and detecting failures in production. The effort began with Chaos Monkey, a tool that randomly disables production instances to ensure systems and teams can recover without customer impact. Engineers run Chaos Monkey during business hours, with monitoring in place and staff ready to respond, so that failures reveal systemic weaknesses in a controlled way.

From that foundation Netflix described a growing “Simian Army” of focused tools: Latency Monkey adds artificial delays to simulate degradation; Conformity Monkey shuts down instances that don’t follow deployment best practices; Doctor Monkey detects unhealthy instances and takes them out of service; Janitor Monkey reclaims unused resources; Security Monkey hunts for security misconfigurations; 10–18 Monkey checks for localization and internationalization issues; and Chaos Gorilla simulates an entire availability-zone outage. The blog post noted that parts of the Simian Army were already implemented while other ideas remained aspirations, and it invited engineers to help expand the effort.
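The Chaos Monkey behavior described above (pick a random instance, but only while engineers are at work) can be sketched in a few lines of Python. This is an illustrative sketch, not Netflix's implementation: the 9-to-17 weekday window, the function names, and the instance IDs are all assumptions, and nothing here calls a real cloud API.

```python
import random
from datetime import datetime
from typing import Optional, Sequence

# Hypothetical window; the source only says "business hours".
BUSINESS_START_HOUR = 9
BUSINESS_END_HOUR = 17


def in_business_hours(now: datetime) -> bool:
    """True on weekdays between the assumed start and end hours."""
    return now.weekday() < 5 and BUSINESS_START_HOUR <= now.hour < BUSINESS_END_HOUR


def pick_victim(instances: Sequence[str], now: datetime,
                rng: random.Random) -> Optional[str]:
    """Choose one instance to disable, or None outside the window."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)


if __name__ == "__main__":
    fleet = ["i-aaa", "i-bbb", "i-ccc"]  # made-up instance IDs
    victim = pick_victim(fleet, datetime.now(), random.Random())
    print(f"would disable: {victim}")
```

Passing the clock and the random generator in as arguments keeps the selection logic testable without touching real infrastructure.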

Why it matters

  • Force-tests fault tolerance so that induced failures expose design or operational gaps before a real outage affects customers.
  • Automated checks and removals (Conformity, Doctor, Security, Janitor) reduce human toil and help enforce deployment and security best practices.
  • Simulated zone-level failures (Chaos Gorilla) validate system-level redundancy and automatic rebalancing.
  • The approach treats cloud infrastructure as a dynamic environment that must be continuously tested, not a one-time setup.
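The zone-level check in the list above can be pictured with a toy simulation: mark one zone as failed and confirm that every instance ends up in a healthy zone. This is only a sketch under stated assumptions. The function name, the zone labels, and the least-loaded placement rule are hypothetical; the source does not describe how Netflix rebalances.

```python
from typing import Dict, List


def simulate_zone_outage(placement: Dict[str, str], failed_zone: str,
                         zones: List[str]) -> Dict[str, str]:
    """Return a new placement with the failed zone's instances moved
    to whichever healthy zone currently holds the fewest instances."""
    healthy = [z for z in zones if z != failed_zone]
    if not healthy:
        raise ValueError("no healthy zones left")

    # Count the instances that already live in each healthy zone.
    load = {z: 0 for z in healthy}
    for zone in placement.values():
        if zone in load:
            load[zone] += 1

    rebalanced = {}
    for instance, zone in placement.items():
        if zone == failed_zone:
            target = min(healthy, key=lambda z: load[z])
            load[target] += 1
            rebalanced[instance] = target
        else:
            rebalanced[instance] = zone
    return rebalanced


if __name__ == "__main__":
    before = {"web-1": "zone-a", "web-2": "zone-a", "web-3": "zone-b"}
    print(simulate_zone_outage(before, "zone-a", ["zone-a", "zone-b", "zone-c"]))
```

A real exercise would also have to verify that traffic, data replicas, and capacity limits follow the instances, which this toy model ignores.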

Key facts

  • Chaos Monkey randomly disables production instances to validate recovery mechanisms.
  • Chaos Monkey runs during business hours with engineers on hand to observe system behavior and learn from failures.
  • Latency Monkey injects delays into client-server calls to simulate degraded services or dependencies without taking them fully offline.
  • Conformity Monkey finds instances that don’t follow best practices (for example, not being in an auto-scaling group) and shuts them down.
  • Doctor Monkey monitors instance health (including CPU load and health checks) and removes unhealthy instances from service for remediation.
  • Janitor Monkey locates and disposes of unused cloud resources to reduce clutter and waste.
  • Security Monkey finds security violations such as misconfigured AWS security groups and terminates the offending instances; it also checks that certificates are valid.
  • 10–18 Monkey (the name is short for l10n-i18n, that is, localization-internationalization) targets localization and internationalization configuration and runtime issues across regions and character sets.
  • Chaos Gorilla simulates an entire availability-zone outage to confirm automatic rebalancing to healthy zones.
  • The initiative was described in a Netflix Technology Blog post published July 19, 2011, by Yury Izrailevsky and Ariel Tseitlin; parts were implemented and other ideas were presented as aspirations.
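The Latency Monkey idea from the list above, delaying calls rather than killing them, can be illustrated with a small Python decorator. This is a hedged sketch, not the tool itself: the decorator name, its parameters, and the injectable sleep function are assumptions made for illustration.

```python
import functools
import random
import time


def with_injected_latency(min_delay_s, max_delay_s, probability=1.0,
                          rng=None, sleep=time.sleep):
    """Wrap a call so that, with the given probability, it is delayed
    by a random amount between min_delay_s and max_delay_s seconds."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                sleep(rng.uniform(min_delay_s, max_delay_s))
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_injected_latency(0.05, 0.2, probability=0.5)
def fetch_recommendations(user_id):
    # Stand-in for a real client-server call (hypothetical).
    return [f"title-for-{user_id}"]


if __name__ == "__main__":
    start = time.monotonic()
    print(fetch_recommendations("u1"))
    print(f"took {time.monotonic() - start:.3f}s")
```

Callers can then be checked for sensible timeouts and fallbacks while the dependency is slow but still reachable.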

What to watch next

  • Expansion and implementation status of additional Simian Army tools (source says parts are built and others remain aspirations).
  • How routinely the team runs larger-scale experiments such as Chaos Gorilla and what automation is added to handle zone-level failures — not confirmed in the source.
  • Metrics showing reduced customer impact or faster recovery as a result of these tools — not confirmed in the source.

Quick glossary

  • Fault-tolerance: Design and operational practices that allow a system to continue functioning when components fail.
  • Auto-scaling group: A set of compute instances managed together so the group can scale up or down automatically according to defined policies.
  • Availability zone: A discrete data center or group of data centers within a cloud region that provides isolated power, networking, and connectivity.
  • Latency: The time delay between a request and a response in a networked system.

Reader FAQ

What is Chaos Monkey?
A tool that randomly disables production instances to test whether services and automated recovery mechanisms keep user impact to a minimum.

Does Netflix run these tools in production?
Yes; the blog describes running Chaos Monkey in production during business hours with engineers monitoring and ready to respond.

Is the Simian Army fully implemented?
Parts of the Simian Army had been built as of the post, while other tools and ideas were described as aspirations waiting for engineers.

Is the Simian Army open source or available to others?
Not confirmed in the source.

The Netflix Simian Army, Netflix Technology Blog, Jul 19, 2011: "We’ve talked a bit in the past about our move to the cloud, and…"
