Skype Outage, Conspiracy Theories, and More Robust Testing Methods
The Skype blog attributes the recent outage to a bug in the Skype software that was:
triggered by a massive restart of our users’ computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update.
Before jumping to conspiracy theories or assuming this is the whole story, I have to admit I think the culprit is the complexity of the distributed application. The solution isn't more testing (although that won't hurt) but more simulation modeling to study the impact of different network load scenarios.
We probably can't stage a worldwide reboot like the one that purportedly took down Skype against real software, but we can build simulations.
Let's start at the beginning. Bruce Stewart at O'Reilly considers:
While it does seem plausible that a massive concurrent restart of Skype clients could cause some grief for Skype’s network, that doesn’t explain why it took 2 days to restore service. And I’m also left wondering why previous Windows Updates haven’t caused similar problems.
Network dynamics are complex. Whether we are talking about a VoIP system or chemical networks in biological systems, small localized changes can have dramatic global impacts, but the conditions must be just right. We haven't seen this kind of failure before because we haven't seen the same conditions timed and correlated as they were last week.
Engineers use simulations to test complex systems. Monte Carlo methods are especially appropriate in this case because they run lots and lots of simulations with randomly varying conditions. The more simulations you run, the more likely you are to uncover unlikely events like a full-blown Skype meltdown. Maybe we need to start adding simulation modeling to software development practices along with the good old regression tests.
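To make the idea concrete, here is a minimal Monte Carlo sketch in Python. It is a toy model, not a description of Skype's actual architecture: the function names, the single shared login capacity, and every number in it (client counts, restart window, capacity per tick, meltdown threshold) are assumptions I made up for illustration. Each trial picks a random number of rebooting clients and a random restart window, then tracks how large the login backlog grows.

```python
import random


def peak_backlog(num_clients, window_ticks, capacity_per_tick, rng):
    """Run one trial and return the largest login backlog seen.

    Hypothetical model: each client reboots at a uniformly random tick
    inside the restart window, and the network can log in at most
    capacity_per_tick clients per tick. Unserved clients wait in a backlog.
    """
    arrivals = [0] * window_ticks
    for _ in range(num_clients):
        arrivals[rng.randrange(window_ticks)] += 1

    backlog = 0
    peak = 0
    # After the window closes there are no new arrivals, so the backlog
    # can only shrink; the peak must occur inside the window.
    for arrived in arrivals:
        backlog = max(0, backlog + arrived - capacity_per_tick)
        peak = max(peak, backlog)
    return peak


def estimate_meltdown_probability(trials, seed=0):
    """Estimate how often randomly drawn conditions produce a 'meltdown'.

    All ranges and thresholds below are illustrative assumptions.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    meltdowns = 0
    for _ in range(trials):
        rebooting = rng.randint(1_000, 10_000)  # clients restarting
        window = rng.randint(5, 120)            # restart window, in ticks
        peak = peak_backlog(rebooting, window, capacity_per_tick=200, rng=rng)
        if peak > 2_000:                        # assumed meltdown threshold
            meltdowns += 1
    return meltdowns / trials


if __name__ == "__main__":
    print(estimate_meltdown_probability(trials=2_000))
```

Even in a model this crude, most randomly drawn scenarios are harmless and only the combination of many clients and a short restart window overwhelms capacity, which is the "conditions must be just right" point. A realistic study would also need retries, supernode election, and correlated failures, none of which are modeled here.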



