Add MLPerf logging #831

Merged: hanlint merged 63 commits into dev from hanlin/mlperf on May 3, 2022

@hanlint hanlint commented Mar 25, 2022

This PR adds a callback, MLPerfCallback, that creates a submission directory and results file compliant with the MLPerf rules for Training v1.1 (e.g., the output passes the package checker from the mlperf logging package).

When the callback is used, it creates a submission folder structure rooted at root_folder with the following directories:

    root_folder/
        results/
            [system_name]/
                [benchmark]/
                    results_0.txt
                    results_1.txt
                    ...
        systems/
            [system_name].json
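
As a rough illustration of the layout above, the following sketch builds the same tree with the standard library. It is not the code in composer/callbacks/mlperf.py: the helper name `make_submission_layout`, its parameters, and the placeholder contents of the system JSON file are assumptions made for this example only.

```python
import json
from pathlib import Path


def make_submission_layout(root_folder, system_name, benchmark, num_runs=2):
    """Create the results/ and systems/ tree and return the results file paths.

    Hypothetical helper for illustration; not part of Composer.
    """
    root = Path(root_folder)
    results_dir = root / "results" / system_name / benchmark
    systems_dir = root / "systems"
    results_dir.mkdir(parents=True, exist_ok=True)
    systems_dir.mkdir(parents=True, exist_ok=True)

    # Placeholder system description. The real schema is defined by the
    # MLPerf submission rules and is not reproduced here.
    system_file = systems_dir / f"{system_name}.json"
    system_file.write_text(json.dumps({"system_name": system_name}, indent=2))

    # One empty results file per training run: results_0.txt, results_1.txt, ...
    result_files = []
    for index in range(num_runs):
        path = results_dir / f"results_{index}.txt"
        path.touch()
        result_files.append(path)
    return result_files
```

Calling `make_submission_layout("my_submission", "my_system", "resnet")` would produce `my_submission/results/my_system/resnet/results_0.txt`, `results_1.txt`, and `my_submission/systems/my_system.json`; all three argument values are made up for the example.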

Each training run creates a results file, e.g.:

:::MLLOG {"namespace": "", "time_ms": 1648183840286, "event_type": "INTERVAL_START", "key": "cache_clear", "value": null, "metadata": {"file": "/composer/composer/callbacks/mlperf.py", "lineno": 109}}
:::MLLOG {"namespace": "", "time_ms": 1648183840438, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/composer/composer/callbacks/mlperf.py", "lineno": 110}}
...
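
Each results line is the literal prefix `:::MLLOG` followed by a JSON object, so a results file can be inspected with a few lines of Python. The sketch below is a minimal reader written for this example, not the official checker; the function name `parse_mllog_line` is an assumption, and the sample line is the first one shown above.

```python
import json

MLLOG_PREFIX = ":::MLLOG"


def parse_mllog_line(line):
    """Return the JSON payload of an ':::MLLOG' line as a dict, or None for other lines."""
    line = line.strip()
    if not line.startswith(MLLOG_PREFIX):
        return None
    # Everything after the prefix is a single JSON object.
    return json.loads(line[len(MLLOG_PREFIX):])


sample = (
    ':::MLLOG {"namespace": "", "time_ms": 1648183840286, '
    '"event_type": "INTERVAL_START", "key": "cache_clear", "value": null, '
    '"metadata": {"file": "/composer/composer/callbacks/mlperf.py", "lineno": 109}}'
)
record = parse_mllog_line(sample)
print(record["key"], record["event_type"], record["time_ms"])
# prints: cache_clear INTERVAL_START 1648183840286
```

This only reads the fields; validating that a run is compliant is the job of the package checker mentioned in the PR description.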

The entire directory can then be submitted to MLPerf and checked with the package_checker in https://github.com/mlcommons/logging.

Currently this callback only supports the OPEN division benchmark.

This PR is gated by:

TODO:

  • Generate and validate a ResNet-50 run on 8x A100 GPUs.
  • Create an Hparams object.

@hanlint hanlint requested a review from a team as a code owner March 25, 2022 22:02
@ravi-mosaicml ravi-mosaicml left a comment

Overall looks good; left some comments for now. Will re-review once the open TODOs are addressed.

Resolved review threads:

  • composer/callbacks/mlperf.py (7 threads)
  • tests/callbacks/test_mlperf_callback.py (3 threads)

@ravi-mosaicml ravi-mosaicml marked this pull request as draft April 18, 2022 20:23
@hanlint hanlint marked this pull request as ready for review April 18, 2022 20:30
@hanlint hanlint requested a review from ravi-mosaicml May 3, 2022 19:07
@ravi-mosaicml ravi-mosaicml left a comment

Really close! Just a few nits, plus one comment re: binding the train dataloader and evaluator to the state after Event.INIT runs. The rationale is that these parameters will not need to be specified in Trainer.__init__(...) after #948 lands (and algorithms / callbacks are already OK with no dataloaders or evaluators during Event.INIT).

Resolved review threads:

  • Makefile (1 thread)
  • composer/core/state.py (2 threads)
  • composer/trainer/trainer.py (2 threads)
  • composer/callbacks/mlperf.py (4 threads)

@ravi-mosaicml ravi-mosaicml left a comment

LGTM 💯! See comments

Resolved review threads:

  • composer/core/state.py (1 thread)
  • composer/trainer/trainer.py (1 thread)
  • composer/callbacks/mlperf.py (1 thread)

@hanlint hanlint merged commit be8ebcf into dev May 3, 2022
@hanlint hanlint deleted the hanlin/mlperf branch May 3, 2022 23:18
ravi-mosaicml pushed a commit that referenced this pull request May 3, 2022
Adds an experimental logger to create MLperf compliant submission files
@bandish-shah bandish-shah restored the hanlin/mlperf branch May 4, 2022 22:21
@ravi-mosaicml ravi-mosaicml deleted the hanlin/mlperf branch May 6, 2022 03:40