SUPReMM is a comprehensive open-source tool chain that provides resource monitoring capabilities to users and managers of HPC systems.
The SUPReMM architecture comprises three major components:
- Software that runs directly on HPC compute nodes and periodically collects performance information.
- Software that uses the node-level performance data to generate job-level data.
- An Open XDMoD module that enables the job-level information to be viewed and analyzed.
This repository contains the software that combines the node-level performance data to generate job-level summary data.
Full details of the SUPReMM project are available on the SUPReMM overview page in the Open XDMoD documentation.
This work was sponsored by NSF under grant numbers ACI 1203560, ACI 1025159 and ACI 1445806 for the XD Metrics Service (XMS).
For more information, questions, feedback or bug reports, send email to ccr-xdmod-help at buffalo.edu.
Want to be notified about SUPReMM package releases and news? Subscribe to the XDMoD mailing list.
Install the PCP repository configuration following the instructions on the pcp packages page. Install the EPEL repository configuration:
yum install epel-release
Install the build dependencies:
yum install rpm-build pcp-libs-devel gcc python-devel
This project uses Python distutils for package creation. To install from source into a chosen directory:
python setup.py install --prefix=PATH_TO_INSTALL_DIR
RPM packages are created using:
python setup.py bdist_rpm
We accept contributions via standard GitHub pull requests.
This project is under active development with new features planned.
Please contact us via the ccr-xdmod-help at buffalo.edu email address before you get started so that we can co-operate and avoid duplication of effort.
Full details of how to install and use the software are available on the SUPReMM overview page in the Open XDMoD documentation. This section gives a very brief overview of the summarization software for software developers. As always, the definitive reference is the source code itself.
The summarization software processing flow is approximately as follows:
- Initial setup including parsing configuration files, opening database connections, etc.
- Query an accounting database to get the list of jobs to process and the list of PCP archives containing data.
- For each job:
    - retrieve the PCP archives that cover the time period in which the job ran;
    - extract the relevant datapoints from the PCP archives;
    - run the data through the preprocessors;
    - run the data through the plugins;
    - collect the output of the preprocessors and plugins and store it in an output database.
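The flow above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual SUPReMM code: the names `summarize_job`, `PointCounter` and `MaxValue` are hypothetical, and PCP archives are stood in for by plain dictionaries holding `(timestamp, value)` pairs.

```python
# Illustrative sketch only. Every name here is hypothetical and is not part
# of the SUPReMM API; real PCP archives are replaced by simple dictionaries.


class PointCounter(object):
    """Toy preprocessor: counts the datapoints seen for the job."""
    name = "point_counter"

    def __init__(self):
        self.count = 0

    def process(self, timestamp, value):
        self.count += 1

    def results(self):
        return {"count": self.count}


class MaxValue(object):
    """Toy plugin: records the largest value seen for the job."""
    name = "max_value"

    def __init__(self, preprocessor_output):
        # Preprocessors run first, so their output is available to plugins.
        self.preprocessor_output = preprocessor_output
        self.maximum = None

    def process(self, timestamp, value):
        if self.maximum is None or value > self.maximum:
            self.maximum = value

    def results(self):
        return {"max": self.maximum}


def summarize_job(job, archives, preprocessor_classes, plugin_classes):
    """Build a summary dictionary for one job."""
    # Retrieve the archives that overlap the time period of the job.
    relevant = [a for a in archives
                if a["end"] >= job["start"] and a["start"] <= job["end"]]

    # Extract the datapoints that fall inside the job's time window.
    points = sorted(
        (ts, value)
        for a in relevant
        for ts, value in a["data"]
        if job["start"] <= ts <= job["end"]
    )

    # Run the data through the preprocessors.
    pre_out = {}
    for cls in preprocessor_classes:
        pre = cls()
        for ts, value in points:
            pre.process(ts, value)
        pre_out[pre.name] = pre.results()

    # Run the data through the plugins and collect all of the output.
    summary = {"job_id": job["id"]}
    summary.update(pre_out)
    for cls in plugin_classes:
        plugin = cls(pre_out)
        for ts, value in points:
            plugin.process(ts, value)
        summary[plugin.name] = plugin.results()
    return summary
```

In the real software the final step writes the summary to the configured output database rather than returning it to the caller.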
Preprocessors and plugins are both Python modules that implement a defined interface. The main difference between a preprocessor and a plugin is that the preprocessors run first and their output is available to the plugin code.
Each plugin is typically responsible for generating a job-level summary for a PCP metric or group of PCP metrics. Each module defines:
- an identifier for the output data;
- a list of PCP metrics;
- a mode of operation (either only process the first and last datapoints or process all data);
- an implementation of a processing function that will be called by the framework with the requested datapoints;
- an implementation of a function that will be called at the end to return the results of the analysis.
An example of a plugin is one that records the mean and maximum memory usage for the job. Another example is a plugin that checks the temporal variance of the L1D cache load rate to determine if the job failed prematurely.
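As a rough illustration of the five elements listed above, a plugin of the first kind (mean and maximum memory usage) might be laid out like this. The class below is a standalone sketch and does not use the real SUPReMM plugin base class: the attribute names, the mode labels `"all"` and `"firstlast"`, the `process()` signature and the choice of the `mem.util.used` metric are all assumptions made for the example. The existing plugins in supremm/plugins show the actual interface.

```python
class MemoryUsageSketch(object):
    """Hypothetical plugin layout; not the actual SUPReMM interface."""

    # Identifier for the output data in the job summary.
    name = "memory_usage_sketch"

    # List of PCP metrics the framework should supply (illustrative choice).
    required_metrics = ["mem.util.used"]

    # Mode of operation. "all" stands for "process every datapoint"; a
    # label such as "firstlast" would stand for first and last only.
    # Both labels are placeholders chosen for this sketch.
    mode = "all"

    def __init__(self):
        self._count = 0
        self._total = 0.0
        self._maximum = None

    def process(self, timestamp, values):
        """Called by the framework with each requested datapoint.

        `values` is assumed to hold one number per required metric.
        """
        used = float(values[0])
        self._count += 1
        self._total += used
        if self._maximum is None or used > self._maximum:
            self._maximum = used
        return True

    def results(self):
        """Called at the end to return the results of the analysis."""
        if self._count == 0:
            return {"error": "no data"}
        return {"mean": self._total / self._count, "max": self._maximum}
```

Keeping only a running count, total and maximum means the plugin never has to hold the full time series in memory, which matters for long jobs with many datapoints.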
The software that retrieves the job information from the accounting database and writes to the output database is configurable. For example, you can set up the software to write the job summary records to stdout for testing purposes. The accounting database interface supports multiple accounting databases (Open XDMoD being the main one).
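One way to picture a configurable output is a small factory that picks a writer based on a configuration value. The sketch below is an assumption about how such a design could look, not a description of the SUPReMM implementation: `StdoutOutput`, `make_output` and the `{"type": "stdout"}` configuration shape are hypothetical names invented for the example.

```python
import json
import sys


class StdoutOutput(object):
    """Hypothetical output backend: prints one JSON document per job."""

    def __init__(self, stream=None):
        # Default to stdout; tests can pass any file-like object instead.
        self.stream = stream if stream is not None else sys.stdout

    def write(self, summary):
        # sort_keys gives stable output that is easy to compare in tests.
        self.stream.write(json.dumps(summary, sort_keys=True) + "\n")


def make_output(config):
    """Select an output backend from a (hypothetical) config dictionary."""
    backends = {"stdout": StdoutOutput}
    try:
        return backends[config["type"]]()
    except KeyError:
        raise ValueError("unknown output type: %r" % config.get("type"))
```

A database-backed writer with the same `write()` method could be added to the `backends` mapping without changing the code that produces the summaries.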
If you are interested in doing plugin development, then a suggested starting
point is to look at some of the existing plugins. The simplest plugins, such as
the block device plugin (supremm/plugins/Block.py), use the framework-provided
implementation. A more complex example is the Slurm cgroup memory processor
(supremm/plugins/SlurmCgroupMemory.py) that contains logic to selectively
ignore certain datapoints and to do some non-trivial statistics on the data.
If you are interested in understanding the full processing workflow, then the
starting point is the main() function in the summarize_jobs.py script.
The SUPReMM processing tools package is an open source project released under the GNU Lesser General Public License ("LGPL") Version 3.0.