Skype Outage, Conspiracy Theories, and More Robust Testing Methods

The Skype blog explains the recent outage as due to a bug in the Skype software and:

triggered by a massive restart of our users’ computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update.

Before jumping to conspiracy theories or assuming this is the whole story, I have to admit I think the culprit is the complexity of the distributed application. The solution isn't more testing (although that won't hurt) but more simulation modeling to study the impact of different network load scenarios.

We probably can't stage a worldwide reboot like the one that purportedly took down Skype against real software, but we can build simulations.

Let's start at the beginning. Bruce Stewart at O'Reilly considers:

While it does seem plausible that a massive concurrent restart of Skype clients could cause some grief for Skype’s network, that doesn’t explain why it took 2 days to restore service. And I’m also left wondering why previous Windows Updates haven’t caused similar problems.

Network dynamics are complex. Whether we are talking about a VoIP system or chemical networks in biological systems, small localized changes can have dramatic global impacts, but the conditions must be just right. We haven't seen this kind of failure before because we haven't seen the same conditions timed and correlated as they were last week.

Engineers use simulations to test complex systems. Monte Carlo methods are especially appropriate in this case because they run lots and lots of simulations with randomly varying conditions. The more simulations you run, the more likely you are to uncover unlikely events like a full-blown Skype meltdown. Maybe we need to start adding simulation modeling to software development practices along with the good old regression tests.
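
To make the idea concrete, here is a minimal Monte Carlo sketch in Python. It is a toy model, not a description of how Skype's network actually works: the function names, the client counts, the login capacity, and the rule that an overloaded server loses most of its throughput to retries are all assumptions invented for illustration. Each trial draws a random number of clients and a random reboot window, then checks whether the login backlog ever clears.

```python
import random


def simulate_restart(rng, clients, window, capacity=50, max_ticks=300):
    """Run one hypothetical trial of a mass client restart.

    `clients` machines reboot at random ticks inside `window` and then
    try to log in. Returns the tick count at which everyone is back
    online, or None if the backlog never clears (a "meltdown").
    """
    # Spread the reboots at random across the window.
    arrivals = [0] * max_ticks
    for _ in range(clients):
        arrivals[rng.randrange(window)] += 1

    backlog = 0
    for tick in range(max_ticks):
        backlog += arrivals[tick]
        # Assumed overload rule: once the queue is very long, most of the
        # capacity is wasted on retries, so throughput collapses.
        effective = capacity if backlog <= 10 * capacity else capacity // 10
        backlog = max(0, backlog - effective)
        if backlog == 0 and tick >= window - 1:
            return tick + 1
    return None


def estimate_meltdown_rate(trials=1000, seed=42):
    """Estimate how often a restart storm ends in a meltdown."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        # Randomly varying conditions for each trial (made-up ranges).
        clients = rng.randint(500, 3000)
        window = rng.randint(1, 120)
        if simulate_restart(rng, clients, window) is None:
            failures += 1
    return failures / trials


if __name__ == "__main__":
    print("Estimated meltdown rate:", estimate_meltdown_rate())
```

In this toy model most trials recover quickly, and only the rare combination of many clients and a tightly clustered reboot pushes the server past the point where it can catch up. That is the kind of "conditions must be just right" event that a handful of hand-picked test cases would be unlikely to hit.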

Dan Sullivan's Bio:

Dan Sullivan is a systems architect with 20 years of IT experience that includes engagements in enterprise security, application design, and systems architecture. His experience includes a broad range of industries, including financial services, manufacturing, government, retail, gas and oil production, power generation, and education. Dan’s security-related project work has ranged from requirements analysis for enterprise information security to designing and implementing security for database applications and enterprise portals. Dan has written about information security and other enterprise information management topics for Business Security Advisor, DM Review, Intelligent Enterprise, and E-Business Advisor. You can contact Dan at: dan_sullivan@realtimepublishers.net