Tut5: Software Rejuvenation - Modeling and Analysis
Room: Argos
Presenters:
Kishor S. Trivedi, Duke University, Durham, NC, USA
Kalyan Vaidyanathan, Sun Microsystems CTO Labs, San Diego, CA, USA
Abstract
Software reliability is one of the weakest links in system reliability, even for applications with relatively simple software. In this tutorial, we will first give an overview of software fault classification and briefly discuss software reliability in the testing/debugging and operational phases. We will then describe the phenomenon of software aging, which has been reported in widely used software as well as in high-availability and safety-critical systems. To counteract this phenomenon, a proactive technique called "software rejuvenation" has been proposed. It essentially involves gracefully terminating an application or a system and restarting it in a clean internal state. Software rejuvenation has attracted considerable interest not only from the academic community but also from the computer industry.
First, we will discuss methods for evaluating the effectiveness of software rejuvenation in operational software systems and for determining optimal times to perform rejuvenation. This is done by developing stochastic models that trade off the cost of unexpected failures due to software aging against the overhead of proactive fault management. Given a sample of failure times, statistical non-parametric algorithms based on the total time on test (TTT) transform will be described for obtaining the optimal rejuvenation interval. We will also present a framework for adaptive estimation and rejuvenation of software systems in the presence of aging sources. We will then describe measurement-based models, which are constructed from workload and resource usage data collected from the UNIX operating system over a period of time. These measurement-based models are a first step towards predicting aging-related failures and are intended to help develop strategies for software rejuvenation triggered by actual measurements. Rejuvenation has also been extended to cluster systems, where our analyses show that it results in a significant increase in system availability. Finally, we will discuss the implementation of a software rejuvenation agent in a major commercial server.
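As a rough illustration of the TTT-based idea mentioned in the abstract, the sketch below estimates a rejuvenation interval from a complete sample of failure times. It is not the presenters' model: it assumes the classical age-replacement cost formulation, in which an unplanned failure costs `cost_failure`, a planned rejuvenation costs `cost_rejuvenation`, and the best interval is the order statistic that maximizes the ratio of the scaled TTT statistic to `i/n + a`, where `a = cost_rejuvenation / (cost_failure - cost_rejuvenation)`. The function names and the cost parameters are hypothetical and chosen only for this example.

```python
from typing import List, Optional, Tuple


def scaled_ttt(failure_times: List[float]) -> List[float]:
    """Return the scaled TTT statistics phi_0..phi_n for a complete sample."""
    xs = sorted(failure_times)
    n = len(xs)
    if n == 0 or xs[-1] <= 0:
        raise ValueError("need at least one positive failure time")
    totals = [0.0]
    prev = 0.0
    for j, x in enumerate(xs, start=1):
        # (n - j + 1) units are still running during the interval (x_(j-1), x_(j)]
        totals.append(totals[-1] + (n - j + 1) * (x - prev))
        prev = x
    return [t / totals[-1] for t in totals]


def optimal_interval(
    failure_times: List[float],
    cost_failure: float,
    cost_rejuvenation: float,
) -> Tuple[Optional[float], int]:
    """Estimate a rejuvenation interval under the assumed age-replacement cost model.

    Returns (interval, index). An interval of None means the estimate is
    "never rejuvenate proactively" (the maximum is at the last order statistic).
    """
    if cost_failure <= cost_rejuvenation:
        raise ValueError("rejuvenation must be cheaper than an unplanned failure")
    xs = sorted(failure_times)
    n = len(xs)
    phi = scaled_ttt(xs)
    a = cost_rejuvenation / (cost_failure - cost_rejuvenation)
    # Maximize the slope of the line from (-a, 0) to the point (i/n, phi_i)
    best_i = max(range(1, n + 1), key=lambda i: phi[i] / (i / n + a))
    interval = None if best_i == n else xs[best_i - 1]
    return interval, best_i


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0]  # made-up failure times, for illustration only
    print(optimal_interval(sample, cost_failure=10.0, cost_rejuvenation=1.0))
```

When planned rejuvenation is only slightly cheaper than a failure, the maximum moves to the last order statistic and the function returns `None` for the interval, meaning that under these assumptions proactive rejuvenation is not estimated to pay off.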
Presenters
Kishor S. Trivedi holds the Hudson Chair in the Department of Electrical and Computer Engineering at Duke University, Durham, NC. His research interests are in reliability and performance assessment of computer and communication systems. He has published over 300 articles, lectured extensively on these topics and supervised 38 Ph.D. dissertations. He is the Duke-Site Director of an NSF Industry-University Cooperative Research Center between NC State University and Duke University. He is a co-designer of the HARP, SAVE, SHARPE, SPNP and SREPT modeling packages which have been widely circulated. He is the author of Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd edition, published by Wiley. He is a Fellow of the IEEE and a Golden Core Member of IEEE Computer Society.
Kalyan Vaidyanathan received his Ph.D. degree in Electrical and Computer Engineering from Duke University in 2002. He is currently a software engineer at Sun Microsystems CTO Labs, San Diego, CA. His interests include proactive fault monitoring, software reliability, and performance and dependability evaluation of computer systems.
