
No, three-nines reliability for a single server.

1 - 0.001^3 = 0.999999999, which works out to well under a second of expected downtime per year; the client will never notice that even with good monitoring tools, and therefore will never invoke the contract.
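The back-of-the-napkin math above can be sketched as follows. This is only an illustration of the independence assumption being made: each server is taken to fail independently with probability 0.001 (three nines), so all three must be down at once for the system to be down.

```python
# Sketch of the availability math, assuming fully independent failures.
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

def combined_availability(per_server_failure: float, n_servers: int) -> float:
    """System is down only if every server fails simultaneously."""
    return 1 - per_server_failure ** n_servers

avail = combined_availability(0.001, 3)
downtime_seconds = (1 - avail) * SECONDS_PER_YEAR

print(avail)             # 0.999999999 (nine nines)
print(downtime_seconds)  # roughly 0.03 seconds of expected downtime per year
```

Of course, as the replies below point out, this only holds if the failures really are independent.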



You're assuming independence to a degree that does not exist. Consider a Y2K-style bug in the OS that could take down all servers for an extended period of time. Or someone could write a virus that uses a zero-day exploit, etc.


I think it's more likely that the programmer screws up in keeping the separate servers independent through database migrations.


I am making no such assumption. See [1] in my original post. I already addressed the intersection of failures. Feel free to add Y3K to the list alongside nuclear war, Chinese hackers, etc. The intersection is incredibly small, and not something that I am going to include in my back-of-the-napkin calculation.


Your failover code never has bugs?

See the linked discussion: only 3 of the 20 top sites had five nines.


With enough time and money spent on code auditing and hiring smart people, no. Not that most companies should do that, but it is possible if you want to pay for it. Most companies (rightly) prioritize innovation, scalability, and profit margins over absolute reliability.

How many of those top sites actually prioritized reliability? Is it even justifiable for their business models? I bet you can find much better reliability engineering in banking, credit, and stock systems. For example, when was the last time the Visa credit network crashed (as a whole, not localized outages)? Nasdaq?


Nasdaq states that their system has four nines: http://www.nasdaqtrader.com/trader.aspx?id=tradingusequities
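For a sense of scale, here is a quick check of what "four nines" (99.99% availability) allows in annual downtime; the figure is simple arithmetic, not anything stated by Nasdaq:

```python
# What a 99.99% availability budget means per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

downtime_minutes = (1 - 0.9999) * MINUTES_PER_YEAR
print(round(downtime_minutes, 1))  # about 52.6 minutes of downtime per year
```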



