Add support for mlflow #77

khintz · 2024-10-03T12:15:33Z

Describe your changes

Add support for an mlflow logger by utilising pytorch_lightning.loggers.
The native wandb module is replaced with the pytorch_lightning wandb logger, and the pytorch_lightning mlflow logger is introduced.
https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/pytorch/loggers/logger.py

This will allow people to choose between wandb and mlflow.
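
A minimal sketch of how such a switch could look, assuming a `--logger` argument restricted to the two backends. The `--project` flag, the default values, and the `make_logger` helper are hypothetical names used for illustration and are not taken from this PR. The Lightning import is done lazily so the argument parsing can be tried without Lightning installed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a logger switch (flag names are illustrative)."""
    parser = argparse.ArgumentParser(description="Logger selection sketch")
    parser.add_argument(
        "--logger",
        choices=["wandb", "mlflow"],  # any other value is rejected by argparse
        default="wandb",
        help="Experiment tracking backend",
    )
    # Hypothetical flag, used here only to name the project/experiment.
    parser.add_argument("--project", default="my_project")
    return parser


def make_logger(name: str, project: str, run_name: str):
    """Return a pytorch_lightning logger for the chosen backend."""
    # Lazy import: only needed when a logger is actually created.
    from pytorch_lightning.loggers import MLFlowLogger, WandbLogger

    if name == "wandb":
        return WandbLogger(project=project, name=run_name)
    if name == "mlflow":
        # No tracking_uri is passed here; recent Lightning versions then
        # default to the MLFLOW_TRACKING_URI environment variable.
        return MLFlowLogger(experiment_name=project, run_name=run_name)
    raise ValueError(f"Unknown logger: {name}")
```

The returned object would then be passed to the trainer via its `logger` argument, so the rest of the training code does not need to know which backend is in use.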

Builds upon #66. That is not strictly necessary for this change, but I am using that feature to work with our dataset.

Issue Link

Closes #76

Type of change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
I have performed a self-review of my code
For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
I have updated the README to cover introduced code changes
I have added tests that prove my fix is effective or that my feature works
I have given the PR a name that clearly describes the change, written in imperative form (context).
I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

the code is readable
the code is well tested
the code is documented (including return types and parameters)
the code is easy to maintain

Author checklist after completed review

I have added a line to the CHANGELOG describing this change, in a section
reflecting type of change (add section where missing):
- added: when you have added new functionality
- changed: when default behaviour of the code has been changed
- fixes: when your contribution fixes a bug

Checklist for assignee

PR is up to date with the base branch
the tests pass
author has added an entry to the changelog (and designated the change as added, changed or fixed)
Once the PR is ready to be merged, squash commits and merge the PR.

SimonKamuk

Alright, I think it all looks good now! This is ready as far as I am concerned

khintz · 2025-01-21T16:52:38Z

Thanks @SimonKamuk. I just noticed I didn't restore the defaults in the pytorch lightning trainer call in train_model.py. I just changed that back. This will not change anything related to this PR.

khintz · 2025-01-21T19:01:48Z

A small final touch.

--logger-url is removed in favor of the environment variable MLFLOW_TRACKING_URI.
I noticed that mlflow created empty runs that were never used later on, because the logger was initialised on all ranks. To fix this, I have used pytorch_lightning.utilities.rank_zero.rank_zero_only as a decorator on the _setup_training_logger function: https://pytorch-lightning.readthedocs.io/en/1.7.7/api/pytorch_lightning.utilities.rank_zero.html#pytorch_lightning.utilities.rank_zero.rank_zero_only
I tested with both wandb and mlflow afterwards.
MLFlow now sets the same run_name as is used in wandb.
Updated the README with instructions on mlflow.
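
To illustrate why the rank-zero guard avoids the empty runs, here is a simplified stand-in for the decorator. It is not Lightning's implementation: the environment variables it checks and the `setup_training_logger` function are assumptions made for this sketch, and the real code should keep using `rank_zero_only` from `pytorch_lightning.utilities.rank_zero`.

```python
import functools
import os
from typing import Any, Callable, Optional


def _global_rank() -> int:
    """Best-effort rank lookup from common launcher environment variables."""
    for key in ("RANK", "LOCAL_RANK", "SLURM_PROCID"):
        value = os.environ.get(key)
        if value is not None:
            return int(value)
    return 0  # no launcher variables set: treat as a single-process run


def rank_zero_only_sketch(fn: Callable[..., Any]) -> Callable[..., Optional[Any]]:
    """Run fn only on rank 0; every other rank gets None back."""

    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Optional[Any]:
        if _global_rank() == 0:
            return fn(*args, **kwargs)
        return None

    return wrapper


created_runs = []  # stands in for runs registered on the tracking server


@rank_zero_only_sketch
def setup_training_logger(run_name: str) -> str:
    """Hypothetical setup function that 'creates' one run per call."""
    created_runs.append(run_name)
    return run_name
```

With the guard in place, only the rank-0 process registers a run. Callers on other ranks receive `None` and have to handle that case.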

From my perspective this is now ready.

joeloskarsson · 2025-01-22T07:53:59Z

I've looked over this at a high-level and I think it is looking very good! I don't have the capacity for some in-depth testing right now, but I see from the discussions above that there have been multiple pairs of eyes looking at this, so I don't think it requires my detailed input again 😄

Only thing that I think would be good would be to move the CustomMLFlowLogger to its own file, and possibly _setup_training_logger to utils.py. It is quite nice to keep the list of argparse argument definitions close to the top of train_model.py. And CustomMLFlowLogger is substantial enough that it deserves its own file anyhow. Maybe we could even have a loggers sub-directory, although I am not sure if we expect any more to be implemented.

Apart from the point above I think this can be merged.

khintz · 2025-01-22T12:16:53Z

Sounds like a plan @joeloskarsson. I will do that right away.
For now, I don't think a sub-directory is needed.

leifdenby

Looking really great! I tested logging on both wandb and mlflow, and both work fine. I did find that I had to include the protocol (i.e. https://) in the URL, otherwise the logging failed silently. You have put the protocol in the README example though :) so that's great! I think there might be a bug in mlflow here...
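
One way to catch the missing protocol early is a small check before training starts. This is only a sketch for the remote-server case: the function name is hypothetical, the host names are placeholders, and MLflow also accepts other kinds of tracking URIs (for example local paths) that this check would reject.

```python
from urllib.parse import urlparse


def require_http_scheme(uri: str) -> str:
    """Return uri unchanged if it starts with http:// or https://, else raise."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(
            f"Tracking URI {uri!r} has no http(s) protocol; "
            "use e.g. https://mlflow.example.com"
        )
    return uri
```

It could be called on the value of MLFLOW_TRACKING_URI, for example `require_http_scheme(os.environ["MLFLOW_TRACKING_URI"])`, so that a URL without protocol fails loudly instead of silently.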

Anyway, I only made a few minor suggestions. The main one is documenting in the README how the credentials for AWS S3 can be defined using a shared credentials file (rather than setting environment variables for the key and secret).
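
As a rough illustration of the shared credentials file approach: the AWS SDKs read an INI-style file (by default `~/.aws/credentials`, or the path in `AWS_SHARED_CREDENTIALS_FILE`) with one section per profile. The helper names below are hypothetical, and the sketch only checks that a profile defines both keys; it does not validate them.

```python
import configparser
import os
from pathlib import Path


def credentials_path() -> Path:
    """Location of the shared credentials file, honouring the env override."""
    default = Path.home() / ".aws" / "credentials"
    return Path(os.environ.get("AWS_SHARED_CREDENTIALS_FILE", default))


def profile_has_keys(path: Path, profile: str = "default") -> bool:
    """True if the profile section defines both the key id and the secret."""
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is skipped without raising
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section
```

A file for this check would contain a `[default]` section with `aws_access_key_id` and `aws_secret_access_key` entries, and `profile_has_keys(credentials_path())` reports whether it is in place.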

README.md

neural_lam/custom_loggers.py

neural_lam/datastore/plot_example.py

neural_lam/models/ar_model.py

pyproject.toml

leifdenby

Ready to go! 🚀

leifdenby and others added 30 commits June 3, 2024 21:02

install py39 on cirun runner

6fff3fc

cleanup: boundary_mask, zarr-opening, utils

74b4a10

Merge remote-tracking branch 'origin/main' into feature_dataset_yaml

0a041d1

change ami image to gpu

8054e9e

Merge remote-tracking branch 'upstream/main' into maint/deps-in-pypro…

39fbf3a

…ject-toml

use cheaper gpu instance

97aeb2e

adapted tests for zarr-analysis data

425123c

Readme adapted for yaml zarr analysis workflow

4dcf671

samller bugfixes and improvements

6d384f0

Added fixed data config file for testing on Danra

12ff4f2

reducing runtime of tests with smaller sample

03f7769

download danra data for test and example (streaming not possible)

26f069c

bugfixes after real-life testcase

1f1cbcc

Merge remote-tracking branch 'origin/main' into feature_dataset_yaml

b369306

organize .zarr in /data

0cdc361

cleanup

23ca7b3

linter

81422f1

static dataset doesn't have time dim

124541b

making two complex functions more modular

6140fdb

chunk dataset by time

db6a912

create list first for performance

1aaa8dc

converting to_array is very slow

81856b2

the behavior of xr.Dataset.dims will change soon

allow for forcings to not be normalized

b3da818

allow non_normalized_vars to be null

7ee5398

fixed coastlines using new xy_extent function

4782103

Some projections return inverted axes (rotatedPole)

e0ffc5b

Docstrings added

c1f43b7

wip

21fd929

npy mllam nearly done

c52f98e

minor adjustment

80f3639

khintz added 4 commits January 21, 2025 16:00

add choices to logger argument

e1dac04

remove unused command line arguments

5fe9f84

Merge branch 'main' into feat/mlflow

97e7bb3

correct changelog after merging in main

7a90c57

khintz requested a review from SimonKamuk January 21, 2025 16:17

SimonKamuk approved these changes Jan 21, 2025

restore defaults for pl.Trainer

2ca3959

khintz added 3 commits January 21, 2025 18:11

use MLFLOW_TRACKING_URI environment variable instead of command-line

757b502

Update README on MLFlow

05e1be0

init logger on rank0 and use run_name for mlflow

f7f90c4

khintz added 4 commits January 22, 2025 12:47

move custom_logger to its own file

3187dc3

correct typo in custom_loggers

d88c23c

Merge branch 'main' into feat/mlflow

d54e3d8

improve code to pass linting

90b48ff

leifdenby reviewed Jan 27, 2025

khintz and others added 7 commits January 27, 2025 11:01

Apply suggestions from code review

d5c5dcc

Apply suggestions from review by leifdenby
Co-authored-by: Leif Denby

Satisfy linting after review

1c0d121

elaborate docstring for save_dir

42dafa1

remove unnecesary change

1b9b7f0

add inline comment for logger handling

f743890

warn if logger does not support image logging

2d2aa23

expand on inline comment on logger key

d2ffefc

khintz requested a review from leifdenby January 27, 2025 10:30

leifdenby approved these changes Jan 27, 2025

khintz merged commit 698bed7 into mllam:main Jan 27, 2025
8 checks passed

khintz deleted the feat/mlflow branch January 27, 2025 11:11
