I recently came across a great post by John Ciancutti at Netflix on how to run a reliable production service on Amazon Web Services (AWS). My favourite bit of advice is “the best way to avoid failure is to fail constantly”, where he writes:
We’ve sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.
If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.
One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
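To make the degradation idea in the quote concrete, here is a minimal Python sketch of falling back to popular titles when the recommendations system fails. It is my own illustration, not Netflix's code: `recommend`, `POPULAR_TITLES`, and the two stand-in recommenders are hypothetical names.

```python
from typing import Callable, List

# Hypothetical static fallback list; a real service would likely cache this.
POPULAR_TITLES = ["Popular Title A", "Popular Title B", "Popular Title C"]


def recommend(user_id: str, personalized: Callable[[str], List[str]]) -> List[str]:
    """Return personalized picks, or popular titles if the dependency fails."""
    try:
        return personalized(user_id)
    except Exception:
        # Degrade the quality of the response instead of failing outright.
        return list(POPULAR_TITLES)


def broken_recommender(user_id: str) -> List[str]:
    # Stand-in for a recommendations system that is down.
    raise ConnectionError("recommendations system is down")


def healthy_recommender(user_id: str) -> List[str]:
    # Stand-in for a working recommendations system.
    return [f"Pick for {user_id}"]


if __name__ == "__main__":
    print(recommend("u1", healthy_recommender))
    print(recommend("u1", broken_recommender))
```

The caller still gets an answer either way. Catching every exception is a deliberate simplification here; a real service would probably also want a timeout so that a slow dependency is treated like a dead one.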
A Chaos Monkey sounds like a terrific idea for anyone deploying production infrastructure on the internet, whether on AWS or anywhere else. I can’t imagine a better way to test for reliability than this.
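For a feel of how this might work, here is a toy, in-memory Python sketch of the idea: pick a random healthy instance, kill it, then check that the service still answers. It is not Netflix's tool and does not call any AWS API; `Fleet` and `chaos_monkey` are hypothetical names invented for this example.

```python
import random
from typing import List, Optional


class Fleet:
    """A pretend pool of instances that can serve requests."""

    def __init__(self, names: List[str]) -> None:
        self.alive = {name: True for name in names}

    def kill(self, name: str) -> None:
        self.alive[name] = False

    def healthy(self) -> List[str]:
        return [name for name, up in self.alive.items() if up]

    def handle_request(self) -> str:
        up = self.healthy()
        if not up:
            raise RuntimeError("no healthy instances left")
        # Route to any surviving instance.
        return f"served by {up[0]}"


def chaos_monkey(fleet: Fleet, rng: random.Random) -> Optional[str]:
    """Kill one randomly chosen healthy instance and return its name."""
    up = fleet.healthy()
    if not up:
        return None
    victim = rng.choice(up)
    fleet.kill(victim)
    return victim


if __name__ == "__main__":
    fleet = Fleet(["web-1", "web-2", "web-3"])
    victim = chaos_monkey(fleet, random.Random())
    print(f"killed {victim}; {fleet.handle_request()}")
```

The point of running something like this constantly is the check at the end: if a request fails after a single kill, you have found the weakness before a real outage finds it for you.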
Via Jeff Atwood.