"A comprehensive analysis of performance quality of large supercomputer complexes"
Voevodin V.V.

Currently, the problem of low performance of supercomputer complexes is largely due to the fact that administrators of such complexes cannot always timely detect and eliminate the root causes of reduced efficiency. This largely concerns not the equipment failure (such cases can usually be detected using monitoring systems), but an implicit performance decrease of certain supercomputer components, provided that they seems to continue working correctly. Such a situation arises because there are no sufficiently flexible and convenient software tools for prompt and comprehensive analysis of all the performance quality characteristics of computer systems at the moment. The existing solutions either allow analyzing only a small part of such characteristics or are made as non-universal solutions that satisfy only a small set of specific needs provided by administrators of a particular system. This paper describes a systematic approach to solving this issue, which will allow one to perform a comprehensive analysis of various aspects of supercomputer functioning, primarily related to the execution of supercomputer applications. A software tool developed on the basis of this approach will collect, within a single model, all the most important data on the properties and quality of jobs running on the supercomputer - data on their execution performance, size and duration, presence of specific or abnormal behavior scenarios, the usage of application packages and libraries, etc. Using flexible aggregation capabilities, the required level of detail will be specified - individual users, projects, application packages, subject areas, supercomputer partitions, time ranges, etc. This will allow one to create hundreds and thousands of different views for analyzing the state of the supercomputer, which will help administrators to choose the most suitable option for them.

Keywords: supercomputer, parallel computing, supercomputer applications, performance, efficiency analysis, monitoring data.

  • Voevodin V.V. – Lomonosov Moscow State University, Research Computing Center; Leninskie Gory, Moscow, 119991, Russia; Ph.D., Senior Scientist, e-mail: vadim@parallel.ru