Presentation
Using Benford's Law to Identify Unusual Failure Regions
Description
Fault tolerance remains a key challenge for current high performance computing systems. Scheduling mitigation methods effectively and efficiently remains a critical issue because error rates on many systems are dynamic and difficult to predict. Using failure data from the Astra supercomputer, we examine the efficacy of a simple method for determining whether a sliding window of recent failures contains an unusual pattern of errors. Specifically, we investigate using Benford’s Law to estimate the likelihood that the system is currently in a period of unusual failure occurrences. While still in its initial stages, this work provides critical analysis of failure status for extreme-scale systems and a simple form of prediction for determining when the current scheduling of failure mitigation may be suboptimal and should be reevaluated because of the unusual pattern of errors.
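The sliding-window idea can be illustrated with a minimal Python sketch. The abstract does not say which quantity is tested against Benford's Law or which test statistic is used, so everything specific here is an assumption: the sketch takes the gaps between consecutive failure timestamps, compares their leading-digit counts with the Benford frequencies P(d) = log10(1 + 1/d) using a chi-square statistic, and flags a window when the statistic exceeds 15.507 (the 5% critical value for 8 degrees of freedom). The function names, the window size of 50, and the threshold are hypothetical choices, not the presenters' method.

```python
import math
from collections import Counter

# Benford's Law: expected frequency of each leading digit d = 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Assumed cutoff: chi-square critical value for 8 degrees of freedom at 5%.
CHI2_CRITICAL_8DF_5PCT = 15.507


def leading_digit(x):
    """Return the first significant digit (1-9) of a positive number."""
    if x <= 0:
        raise ValueError("leading digit is only defined here for x > 0")
    # Scientific notation puts the first significant digit at index 0,
    # which avoids floating-point trouble with log10/division tricks.
    return int(f"{x:.15e}"[0])


def benford_chi_square(values):
    """Chi-square distance between observed leading digits and Benford's Law."""
    digits = [leading_digit(v) for v in values if v > 0]  # skip zero gaps
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one positive value")
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD.items()
    )


def flag_unusual_windows(failure_times, window=50,
                         threshold=CHI2_CRITICAL_8DF_5PCT):
    """Slide a window over inter-failure gaps and flag poor Benford fits.

    failure_times: sorted failure timestamps (assumed to be in seconds).
    Returns a list of (window_end_index, statistic, is_unusual) tuples.
    """
    gaps = [b - a for a, b in zip(failure_times, failure_times[1:])]
    results = []
    for end in range(window, len(gaps) + 1):
        stat = benford_chi_square(gaps[end - window:end])
        results.append((end, stat, stat > threshold))
    return results
```

In a setting like the one described, a flagged window would be a cue to reevaluate the mitigation schedule rather than a prediction of any specific failure. The window size trades responsiveness against statistical noise: with only a few dozen gaps per window, the expected counts for digits 8 and 9 are small, so the chi-square approximation is rough.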