Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for mlflow #77

Merged
merged 330 commits into from
Jan 27, 2025
Merged

Add support for mlflow #77

merged 330 commits into from
Jan 27, 2025

Conversation

khintz
Copy link
Contributor

@khintz khintz commented Oct 3, 2024

Describe your changes

Add support for mlflow logger by utilising pytorch_lightning.loggers
The native wandb module is replaced with pytorch_lightning wandb logger and introducing pytorch_lightning mlflow logger.
https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/pytorch/loggers/logger.py

This will allow people to choose between wandb and mlflow.

Builds upon #66 although this is not strictly necessary for this change, but I am working with this feature to work with our dataset.

Issue Link

Closes #76

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog (and designated the change as added, changed or fixed)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@khintz khintz requested a review from SimonKamuk January 21, 2025 16:17
Copy link
Contributor

@SimonKamuk SimonKamuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I think it all looks good now! This is ready as far as I am concerned

@khintz
Copy link
Contributor Author

khintz commented Jan 21, 2025

Thanks @SimonKamuk. I just noticed I didn't restore the defaults in the pytorch lightning trainer call in train_model.py. I just changed that back. This will not change anything related to this PR.

@khintz
Copy link
Contributor Author

khintz commented Jan 21, 2025

A small final touch.

  1. --logger-url is removed in favor of the env variable MLFLOW_TRACKING_URI

  2. I noticed that mlflow created empty runs which wasn't used later on, because the logger was initialised on all ranks. I have used the function pytorch_lightningl.utilities.rank_zero.rank_zero_only as decorator on the _setup_training_logger function to fix this: https://pytorch-lightning.readthedocs.io/en/1.7.7/api/pytorch_lightning.utilities.rank_zero.html#pytorch_lightning.utilities.rank_zero.rank_zero_only
    I tested with both wandb and mlflow afterwards.

  3. MLFlow is now setting the same run_name as is used in wandb.

  4. Updated the README with instructions on mlflow.

From my perspective this is now ready.

@joeloskarsson
Copy link
Collaborator

I've looked over this at a high-level and I think it is looking very good! I don't have the capacity for some in-depth testing right now, but I see from the discussions above that there have been multiple pairs of eyes looking at this, so I don't think it requires my detailed input again 😄

Only thing that I think would be good would be to move the CustomMLFlowLogger to its own file, and possibly _setup_training_logger to utils.py. It is quite nice to keep the list of argparse argument definitions close to the top of train_model.py. And CustomMLFlowLogger is substantial enough that it deserves its own file anyhow. Maybe we could even have a loggers sub-directory, although I am not sure if we expect any more to be implemented.

Apart from the point above I think this can be merged.

@khintz
Copy link
Contributor Author

khintz commented Jan 22, 2025

Sounds like a plan @joeloskarsson. I will do that right away.
For now, I don't think a sub-directory is needed.

Copy link
Member

@leifdenby leifdenby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really great! I tested both logging on wandb and mlflow and both work fine. Although I did find I had to include the protocol (i.e. https://) in the URL otherwise the logging failed silently. You have put the protocol in the README example though :) so that's great! I think there might be a bug in mlflow here...

Anyway, I only made a few minor suggestions. The main one is documenting in the README how the credentials for AWS S3 can be defined using a shared credentials files (rather than setting environment variables for the key and secret)

README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
neural_lam/custom_loggers.py Outdated Show resolved Hide resolved
neural_lam/custom_loggers.py Show resolved Hide resolved
neural_lam/datastore/plot_example.py Outdated Show resolved Hide resolved
neural_lam/models/ar_model.py Show resolved Hide resolved
neural_lam/models/ar_model.py Show resolved Hide resolved
neural_lam/models/ar_model.py Outdated Show resolved Hide resolved
pyproject.toml Show resolved Hide resolved
@khintz khintz requested a review from leifdenby January 27, 2025 10:30
Copy link
Member

@leifdenby leifdenby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ready to go! 🚀

@khintz khintz merged commit 698bed7 into mllam:main Jan 27, 2025
8 checks passed
@khintz khintz deleted the feat/mlflow branch January 27, 2025 11:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support mlflow logger (and other loggers from pytorch-lightning)
5 participants