Why do Internet services fail, and what can be done about it?

In this article from the Department of Electrical Engineering and Computer Science at the University of California at Berkeley, the researchers analyze a few case studies in which service components failed and assess how well various failure mitigation techniques would have applied to those failures. By studying individual problem reports, the researchers were able to attribute the majority of failures to operator errors, which took the form of misconfiguration rather than procedural errors. They also found that reported Time to Repair values can be misleading, because they are affected by the priority assigned to the problem. In my opinion, the most interesting finding in this report is that checking the correctness of software before release is the most successful way of reducing system failures. At a couple of points in our studies, I have found it interesting that companies tend to release products and technology before they are fully ready, hoping that users will find issues that can be fixed later. While this has long been the trend, it would be interesting to see more research on how many failures could be prevented if organizations released only complete, corrected programs.
David Oppenheimer, Archana Ganapathi, and David A. Patterson
University of California at Berkeley, EECS Computer Science Division
Domains-Issue Area: 
Industry Focus: 
Internet & Cyberspace
Bibliographies & Reports