
Improve and refine MLP tests for extensibility and A/B testing #8561

Merged: 10 commits, Jan 15, 2025

Conversation

@rpsilva-aws (Contributor) commented Jan 13, 2025

In this PR, we include various fixes, improvements, and extensions, namely:

  • Exposing the MLP test to other tests (to allow A/B testing of convergence: a requirement for the grad acc tests)
  • Improving the asserts and extending them to cover losses and outputs
  • Using the appropriate flags for the LR and log steps
  • Improving the model layout with nn.Sequential
  • Enhancing coverage by actually utilizing the checkpointing flag and adding a sanity test for CPU
  • Decoupling the logic to simplify the A/B coverage
  • Fixing imports
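The nn.Sequential refactor mentioned above might look roughly like this. This is a hypothetical sketch: the class name, layer sizes, and constructor parameters are illustrative, not the actual test's.

```python
import torch
import torch.nn as nn


class SimpleMLP(nn.Module):
  # Illustrative layout: stack Linear/ReLU pairs in an nn.Sequential
  # instead of hand-writing per-layer forward code.
  def __init__(self, input_size=128, hidden_size=64, output_size=10,
               num_layers=2):
    super().__init__()
    layers = []
    in_features = input_size
    for _ in range(num_layers):
      layers += [nn.Linear(in_features, hidden_size), nn.ReLU()]
      in_features = hidden_size
    layers.append(nn.Linear(in_features, output_size))
    self.net = nn.Sequential(*layers)

  def forward(self, x):
    return self.net(x)
```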

@rpsilva-aws rpsilva-aws marked this pull request as ready for review January 13, 2025 21:41
@rpsilva-aws (Contributor, Author) commented Jan 13, 2025

@tengyifei @ManfeiBai I ended up substantially improving the MLP tests, since I want to reuse them for A/B convergence validation across three configurations:

  • Gradient accumulation without the loop (batch size = 128)
  • Gradient accumulation with the loop (experimental grad acc) (batch size = 128)
  • Higher DP degree, no grad acc (batch size = 128)

PTAL.
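For context, plain gradient accumulation (the "without the loop" variant above) can be sketched as follows. The function name and signature are illustrative, not the test's actual API.

```python
import torch
import torch.nn as nn


def train_step_grad_acc(model, optimizer, loss_fn, micro_batches):
  # Accumulate gradients over several micro-batches, then take a single
  # optimizer step, so the effective batch size is the sum of the
  # micro-batch sizes (e.g. 128 in the scenarios above).
  optimizer.zero_grad()
  total_loss = 0.0
  for inputs, targets in micro_batches:
    # Scale each loss so the accumulated gradient matches one big batch.
    loss = loss_fn(model(inputs), targets) / len(micro_batches)
    loss.backward()
    total_loss += loss.item()
  optimizer.step()
  return total_loss
```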

@rpsilva-aws rpsilva-aws force-pushed the rpsilva_improve_mlp branch 2 times, most recently from fee509f to 30718c9 on January 13, 2025 23:30
@rpsilva-aws rpsilva-aws changed the title from "Improve the simple MLP tests" to "Improve and refine MLP tests for extensibility and A/B testing" on Jan 13, 2025
@rpsilva-aws rpsilva-aws force-pushed the rpsilva_improve_mlp branch 3 times, most recently from 3cea463 to 4d92118 on January 14, 2025 02:20
test/neuron/run_tests.sh (outdated, resolved)
test/run_tests.sh (outdated, resolved)
@tengyifei tengyifei self-requested a review January 15, 2025 00:22
@tengyifei (Collaborator) left a comment

Synced offline. Let's add a --skip-gradient-checkpointing CLI arg to the train testing script or similar to skip the gradient checkpointing on CPU, in order to avoid the confusing test_train_spmd_linear_model_grad_checkpointing name.

@rpsilva-aws (Contributor, Author)

> Synced offline. Let's add a --skip-gradient-checkpointing CLI arg to the train testing script or similar to skip the gradient checkpointing on CPU, in order to avoid the confusing test_train_spmd_linear_model_grad_checkpointing name.

I was just about to write this, done!
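A minimal sketch of such a flag, assuming a standard argparse-based training script; the exact flag name and defaults in the merged script may differ.

```python
import argparse


def build_arg_parser():
  # Hypothetical argument parser for the training script; only the flag
  # discussed above is shown.
  parser = argparse.ArgumentParser(description="SPMD linear model training")
  parser.add_argument(
      "--skip-gradient-checkpointing",
      action="store_true",
      help="Disable gradient checkpointing (e.g. for the CPU sanity run).",
  )
  return parser


args = build_arg_parser().parse_args(["--skip-gradient-checkpointing"])
use_checkpointing = not args.skip_gradient_checkpointing
```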

@lsy323 (Collaborator) left a comment

Looks like the test file is duplicated under test/utils?

@rpsilva-aws (Contributor, Author) commented Jan 15, 2025

> Looks like the test file is duplicated under test/utils?

I don't see it. The core functionality was moved to utils as a standalone training script. The test_ file runs multiple simultaneous executions of it, and this is intentional: that is how the A/B testing works. The test/utils file is not meant for direct testing, but both users and tests can invoke it as a standalone script.

@rpsilva-aws rpsilva-aws requested a review from lsy323 January 15, 2025 01:03
@tengyifei tengyifei merged commit 1d61556 into pytorch:master Jan 15, 2025
12 checks passed
@lsy323 (Collaborator) commented Jan 15, 2025

> Looks like the test file is duplicated under test/utils?
>
> I don't see it. The core of the functionality was moved to utils - it's a standalone training script. The test_ file can run simultaneous executions of it, to, intentionally, run A/B testing.

Hi @rpsilva-aws thank you for the explanation. I still don't get the A/B testing part. But if the body can be reused in other tests then it's fine

@rpsilva-aws (Contributor, Author)

> Hi @rpsilva-aws thank you for the explanation. I still don't get the A/B testing part. But if the body can be reused in other tests then it's fine

Thanks. You can see it in this PR here: https://github.com/pytorch/xla/pull/8561/files#diff-09c6a280d1c6fc8053d5a17e919e29fe51a75fd593f6c2012496b6e970312c25R44

Essentially, it lets us A/B test different functionality against each other. That adds more value than running each test standalone and merely checking that the losses/outputs are nonzero; the only alternative would be asserting against a hardcoded set of expected values. We want to reuse this for testing gradient accumulation with and without XLA's while loop.
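An illustrative version of such an A/B check, comparing two runs step by step instead of asserting that each run's losses are merely nonzero. All names here are assumptions for the sketch, not the PR's actual helpers.

```python
import torch


def assert_run_equivalence(losses_a, losses_b, outputs_a, outputs_b,
                           atol=1e-5):
  # Compare two training runs (e.g. grad acc with vs. without the XLA
  # while loop): per-step losses must track each other, and the final
  # model outputs must match within tolerance.
  assert len(losses_a) == len(losses_b), "runs have different step counts"
  for step, (la, lb) in enumerate(zip(losses_a, losses_b)):
    assert abs(la - lb) <= atol, (
        f"loss diverged at step {step}: {la} vs {lb}")
  torch.testing.assert_close(outputs_a, outputs_b, atol=atol, rtol=0)
```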
