
Improve and refine MLP tests for extensibility and A/B testing #8561

Merged: 10 commits, Jan 15, 2025

Conversation

@rpsilva-aws (Contributor) commented Jan 13, 2025

In this PR, we include various fixes, improvements, and extensions, namely:

  • Exposing the MLP test to other tests (to allow A/B testing of convergence: a requirement for the grad acc tests)
  • Improving the asserts and extending them to cover losses and outputs
  • Using the appropriate flags for the LR and log steps
  • Improving the model layout with nn.Sequential
  • Enhancing coverage by actually utilizing the checkpointing flag and adding a sanity test for CPU
  • Decoupling the logic to simplify the A/B coverage
  • Fixing imports
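The nn.Sequential refactor mentioned above might look roughly like this. This is a hypothetical sketch: the class name, layer sizes, and constructor parameters are illustrative, not the actual test's.

```python
import torch
import torch.nn as nn


class SimpleMLP(nn.Module):
  # Illustrative layout: stack Linear/ReLU pairs in an nn.Sequential
  # instead of hand-writing per-layer forward code.
  def __init__(self, input_size=128, hidden_size=64, output_size=10,
               num_layers=2):
    super().__init__()
    layers = []
    in_features = input_size
    for _ in range(num_layers):
      layers += [nn.Linear(in_features, hidden_size), nn.ReLU()]
      in_features = hidden_size
    layers.append(nn.Linear(in_features, output_size))
    self.net = nn.Sequential(*layers)

  def forward(self, x):
    return self.net(x)
```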

@rpsilva-aws rpsilva-aws marked this pull request as ready for review January 13, 2025 21:41
@rpsilva-aws (Contributor, Author) commented Jan 13, 2025

@tengyifei @ManfeiBai I ended up substantially improving the MLP tests, since I want to reuse them for A/B convergence validation across three configurations:

  • Gradient accumulation without the loop (batch size = 128)
  • Gradient accumulation with the loop (experimental grad acc) (batch size = 128)
  • Higher DP degree, no grad acc (batch size = 128)

PTAL.
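For context, plain gradient accumulation (the "without the loop" variant above) can be sketched as follows. The function name and signature are illustrative, not the test's actual API.

```python
import torch
import torch.nn as nn


def train_step_grad_acc(model, optimizer, loss_fn, micro_batches):
  # Accumulate gradients over several micro-batches, then take a single
  # optimizer step, so the effective batch size is the sum of the
  # micro-batch sizes (e.g. 128 in the scenarios above).
  optimizer.zero_grad()
  total_loss = 0.0
  for inputs, targets in micro_batches:
    # Scale each loss so the accumulated gradient matches one big batch.
    loss = loss_fn(model(inputs), targets) / len(micro_batches)
    loss.backward()
    total_loss += loss.item()
  optimizer.step()
  return total_loss
```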

@rpsilva-aws rpsilva-aws force-pushed the rpsilva_improve_mlp branch 2 times, most recently from fee509f to 30718c9 on January 13, 2025 23:30
@rpsilva-aws rpsilva-aws changed the title from "Improve the simple MLP tests" to "Improve and refine MLP tests for extensibility and A/B testing" on Jan 13, 2025
@rpsilva-aws rpsilva-aws force-pushed the rpsilva_improve_mlp branch 3 times, most recently from 3cea463 to 4d92118 on January 14, 2025 02:20
test/neuron/run_tests.sh (outdated, resolved)
test/run_tests.sh (outdated, resolved)
@tengyifei tengyifei self-requested a review January 15, 2025 00:22
@tengyifei (Collaborator) left a comment

Synced offline. Let's add a --skip-gradient-checkpointing CLI arg to the train testing script or similar to skip the gradient checkpointing on CPU, in order to avoid the confusing test_train_spmd_linear_model_grad_checkpointing name.

@rpsilva-aws (Contributor, Author)

> Synced offline. Let's add a --skip-gradient-checkpointing CLI arg to the train testing script or similar to skip the gradient checkpointing on CPU, in order to avoid the confusing test_train_spmd_linear_model_grad_checkpointing name.

I was just about to write this, done!
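A minimal sketch of such a flag, assuming a standard argparse-based training script; the exact flag name and defaults in the merged script may differ.

```python
import argparse


def build_arg_parser():
  # Hypothetical argument parser for the training script; only the flag
  # discussed above is shown.
  parser = argparse.ArgumentParser(description="SPMD linear model training")
  parser.add_argument(
      "--skip-gradient-checkpointing",
      action="store_true",
      help="Disable gradient checkpointing (e.g. for the CPU sanity run).",
  )
  return parser


args = build_arg_parser().parse_args(["--skip-gradient-checkpointing"])
use_checkpointing = not args.skip_gradient_checkpointing
```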

@lsy323 (Collaborator) left a comment

Looks like the test file is duplicated under test/utils?

@rpsilva-aws (Contributor, Author) commented Jan 15, 2025

> Looks like the test file is duplicated under test/utils?

I don't see it. The core functionality was moved to utils as a standalone training script. The test_ file runs multiple simultaneous executions of it, and this is intentional: that is how the A/B testing works. The test/utils file is not meant for direct testing, but both users and tests can invoke it as a standalone script.

@rpsilva-aws rpsilva-aws requested a review from lsy323 January 15, 2025 01:03
@tengyifei tengyifei merged commit 1d61556 into pytorch:master Jan 15, 2025
12 checks passed
@lsy323 (Collaborator) commented Jan 15, 2025

> Looks like the test file is duplicated under test/utils?
>
> I don't see it. The core of the functionality was moved to utils - it's a standalone training script. The test_ file can run simultaneous executions of it, to, intentionally, run A/B testing.

Hi @rpsilva-aws thank you for the explanation. I still don't get the A/B testing part. But if the body can be reused in other tests then it's fine

@rpsilva-aws (Contributor, Author)

> Hi @rpsilva-aws thank you for the explanation. I still don't get the A/B testing part. But if the body can be reused in other tests then it's fine

Thanks. You can see it in this PR here: https://github.com/pytorch/xla/pull/8561/files#diff-09c6a280d1c6fc8053d5a17e919e29fe51a75fd593f6c2012496b6e970312c25R44

Essentially, it lets us A/B test different functionality against each other. That adds more value than running each test standalone and merely checking that the losses/outputs are nonzero; the only alternative would be asserting against a hardcoded set of expected values. We want to reuse this for testing gradient accumulation with and without XLA's while loop.
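An illustrative version of such an A/B check, comparing two runs step by step instead of asserting that each run's losses are merely nonzero. All names here are assumptions for the sketch, not the PR's actual helpers.

```python
import torch


def assert_run_equivalence(losses_a, losses_b, outputs_a, outputs_b,
                           atol=1e-5):
  # Compare two training runs (e.g. grad acc with vs. without the XLA
  # while loop): per-step losses must track each other, and the final
  # model outputs must match within tolerance.
  assert len(losses_a) == len(losses_b), "runs have different step counts"
  for step, (la, lb) in enumerate(zip(losses_a, losses_b)):
    assert abs(la - lb) <= atol, (
        f"loss diverged at step {step}: {la} vs {lb}")
  torch.testing.assert_close(outputs_a, outputs_b, atol=atol, rtol=0)
```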
