Add MLPerf logging #831

hanlint · 2022-03-25T22:02:11Z

This PR adds a callback, MLPerfCallback, that creates a submission directory and results file compliant with the MLPerf Training v1.1 rules (for example, the output passes the package checker in the mlperf logging package).

When the callback is used, it creates a submission folder structure rooted at root_folder with the following layout:

    root_folder/
        results/
            [system_name]/
                [benchmark]/
                    results_0.txt
                    results_1.txt
                    ...
        systems/
            [system_name].json
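The layout above can be sketched with the standard library alone. This is a minimal illustration, not the callback's actual implementation: the function name `make_submission_layout`, its parameters, and the placeholder contents of the system JSON file are all assumptions made for this example.

```python
import json
from pathlib import Path


def make_submission_layout(root_folder, system_name, benchmark, num_runs=1):
    """Create the folder tree shown above (hypothetical helper, not composer's API)."""
    root = Path(root_folder)
    results_dir = root / "results" / system_name / benchmark
    systems_dir = root / "systems"
    results_dir.mkdir(parents=True, exist_ok=True)
    systems_dir.mkdir(parents=True, exist_ok=True)

    # Placeholder system description; a real submission needs the fields
    # required by the MLPerf rules, which are not reproduced here.
    system_file = systems_dir / f"{system_name}.json"
    if not system_file.exists():
        system_file.write_text(json.dumps({"system_name": system_name}, indent=2))

    # One empty results file per training run: results_0.txt, results_1.txt, ...
    result_files = []
    for index in range(num_runs):
        result_file = results_dir / f"results_{index}.txt"
        result_file.touch()
        result_files.append(result_file)
    return result_files
```

Calling `make_submission_layout("submission", "my_system", "resnet", num_runs=2)` would create `results_0.txt` and `results_1.txt` under `submission/results/my_system/resnet/`, plus `submission/systems/my_system.json`.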

For each training run, a results file is created, for example:

    :::MLLOG {"namespace": "", "time_ms": 1648183840286, "event_type": "INTERVAL_START", "key": "cache_clear", "value": null, "metadata": {"file": "/composer/composer/callbacks/mlperf.py", "lineno": 109}}
    :::MLLOG {"namespace": "", "time_ms": 1648183840438, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/composer/composer/callbacks/mlperf.py", "lineno": 110}}
    ...
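Each line of the results file shown above is the prefix `:::MLLOG` followed by a JSON object. As a rough sketch of how such a file could be inspected, the snippet below parses one line with the standard `json` module. The helper name `parse_mllog_line` is an assumption for this example and is not part of composer or the mlperf logging package.

```python
import json

MLLOG_PREFIX = ":::MLLOG"


def parse_mllog_line(line):
    """Return the JSON payload of an MLLOG line as a dict, or None for other lines."""
    line = line.strip()
    if not line.startswith(MLLOG_PREFIX):
        return None
    # Everything after the prefix is a single JSON object.
    return json.loads(line[len(MLLOG_PREFIX):])


# First sample line from the results file above.
sample = (
    ':::MLLOG {"namespace": "", "time_ms": 1648183840286, '
    '"event_type": "INTERVAL_START", "key": "cache_clear", "value": null, '
    '"metadata": {"file": "/composer/composer/callbacks/mlperf.py", "lineno": 109}}'
)
record = parse_mllog_line(sample)
print(record["key"], record["event_type"], record["time_ms"])
# prints: cache_clear INTERVAL_START 1648183840286
```

Lines without the prefix, such as the elided `...` in the excerpt, return `None`, so a caller can skip them.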

The entire directory can then be submitted to MLPerf and checked with the package_checker in https://github.com/mlcommons/logging.

Currently this callback only supports the OPEN division benchmark.

This PR is gated by:

TODO:

generate and validate ResNet-50 run on 8x A100 GPUs.
create Hparams object

ravi-mosaicml

Overall looks good; left some comments for now. Will re-review once the open TODOs are addressed.

composer/callbacks/mlperf.py

tests/callbacks/test_mlperf_callback.py

composer/callbacks/mlperf.py

Co-authored-by: ravi-mosaicml <[email protected]>

ravi-mosaicml

Really close! Just a few nits, plus one comment re: binding the train dataloader and evaluator to the state after Event.INIT runs. The rationale is that these parameters will no longer need to be specified in Trainer.init(...) after #948 lands (and algorithms / callbacks are already OK with no dataloaders or evaluators during Event.INIT).

Makefile

composer/core/state.py

composer/trainer/trainer.py

composer/callbacks/mlperf.py

composer/trainer/trainer.py

Co-authored-by: ravi-mosaicml <[email protected]>

composer/callbacks/callback_hparams.py

ravi-mosaicml

LGTM 💯! See comments

composer/core/state.py

composer/trainer/trainer.py

composer/callbacks/mlperf.py

Adds an experimental logger to create MLPerf-compliant submission files

hanlint added 6 commits March 18, 2022 14:40

draft mlperf logger

c52f47b

add to callbacks module

1d75c3a

add mlperf logging callback

caab2e9

add submission directory structure

8f2fee6

add mlperf to setup

69bb806

fix duplicate logging

0813bb6

hanlint requested a review from a team as a code owner March 25, 2022 22:02

Merge branch 'dev' into hanlin/mlperf

43c74cd

ravi-mosaicml reviewed Mar 28, 2022

hanlint and others added 2 commits March 28, 2022 12:59

Apply suggestions from code review

d2153d2

Co-authored-by: ravi-mosaicml <[email protected]>

Merge branch 'dev' into hanlin/mlperf

8bbd7cd

ravi-mosaicml marked this pull request as draft April 18, 2022 20:23

update with current_metrics

e010476

hanlint requested a review from ravi-mosaicml April 18, 2022 20:30

hanlint marked this pull request as ready for review April 18, 2022 20:30

hanlint added 10 commits April 18, 2022 17:11

fix setup

9d588f7

fix docstrings

bee409f

add hparams object

f70406b

fix error

ba8652f

skip callback in asset test

7ac866b

Merge branch 'dev' into hanlin/mlperf

03758b1

cleanup

689d84c

try removing world_size

f02eef3

restore world_size

6491e8c

Merge branch 'dev' into hanlin/mlperf

5fe7957

hanlint mentioned this pull request Apr 19, 2022

Attempt to destroy process groups after each test runs #918

Closed

hanlint added 4 commits April 19, 2022 11:51

Merge branch 'dev' into hanlin/mlperf

b1b6004

Merge branch 'hanlin/mlperf' of github.com:mosaicml/composer into hanlin/mlperf

465f76f

trying removing mlperf tag

d80e39d

cleanup

99bb2ab

hanlint added 7 commits April 29, 2022 14:50

Merge branch 'hanlin/mlperf' of github.com:mosaicml/composer into hanlin/mlperf

a80a208

restore dataloaders to state

cc4d9be

cleanup

a431577

move items to init

4cbd163

Merge branch 'dev' into hanlin/mlperf

b7fd11e

fix pyright

cdeac03

clean up tests

212b089

hanlint requested a review from ravi-mosaicml May 3, 2022 19:07

use code block because cannot automate testcode

13066df

ravi-mosaicml reviewed May 3, 2022

Apply suggestions from code review

f8c9732

Co-authored-by: ravi-mosaicml <[email protected]>

bandish-shah reviewed May 3, 2022

composer/callbacks/callback_hparams.py (outdated, resolved)

bandish-shah reviewed May 3, 2022

composer/callbacks/callback_hparams.py (outdated, resolved)

bandish-shah reviewed May 3, 2022

composer/callbacks/callback_hparams.py (outdated, resolved)

hanlint added 4 commits May 3, 2022 15:09

address comments

b689f86

Merge branch 'dev' into hanlin/mlperf

1705a61

cleanup

051035b

type ignore until logging pypi is done

afe5313

hanlint requested review from ravi-mosaicml and bandish-shah May 3, 2022 22:19

hanlint added 3 commits May 3, 2022 15:24

Merge branch 'dev' into hanlin/mlperf

ec2a578

cleanup

2480a2b

Merge branch 'hanlin/mlperf' of github.com:mosaicml/composer into hanlin/mlperf

0136d3d

ravi-mosaicml approved these changes May 3, 2022

composer/core/state.py (resolved)

composer/trainer/trainer.py (outdated, resolved)

composer/callbacks/mlperf.py (outdated, resolved)

cleanup

621d12e

hanlint merged commit be8ebcf into dev May 3, 2022

hanlint deleted the hanlin/mlperf branch May 3, 2022 23:18

ravi-mosaicml pushed a commit that referenced this pull request May 3, 2022

Add MLPerf logging (#831)

02a5414

Adds an experimental logger to create MLPerf-compliant submission files

bandish-shah restored the hanlin/mlperf branch May 4, 2022 22:21

ravi-mosaicml deleted the hanlin/mlperf branch May 6, 2022 03:40
