
Move to experiment-based Hydra config in L-MGN example #771

Open · wants to merge 8 commits into base: main
Conversation

@Alexey-Kamenev (Collaborator) commented Jan 28, 2025

Modulus Pull Request

Description

  • Moved to a Hydra experiment config, which simplifies configuring different dataset experiments in L-MGN.
  • The example now supports 15+ datasets that are used in the original paper.
  • Unified the training and inference script configuration; both are now experiment-driven, which keeps them consistent (see the launch sketch after this list).
  • Refactored logging, since W&B is not available anymore.
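As a rough sketch of the experiment-driven workflow (the `inference.py` command appears verbatim in the review below; the `train.py` form is my assumption based on the unified configuration):

```bash
# Switching datasets is a single experiment override; train and inference
# read the same experiment config (e.g. ./conf/experiment/sand.yaml).
python train.py +experiment=water       # assumed to mirror inference.py
python inference.py +experiment=water   # shown in the review below
```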

Results on different datasets:

  • Water
  • Water 3D
  • Sand
  • Water ramps

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

@Alexey-Kamenev added the "2 - In Progress (Currently a work in progress)" label on Jan 28, 2025
@Alexey-Kamenev self-assigned this on Jan 28, 2025
@Alexey-Kamenev (Collaborator, Author): /blossom-ci

@Alexey-Kamenev (Collaborator, Author): /blossom-ci

@Alexey-Kamenev (Collaborator, Author): /blossom-ci

@Alexey-Kamenev (Collaborator, Author): /blossom-ci

@ktangsali (Collaborator) left a comment:

Great organization! Thanks!

modulus/datapipes/gnn/lagrangian_dataset.py (review thread resolved)
This is an example of MeshGraphNet for particle-based simulation, based on the
[Learning to Simulate](https://sites.google.com/view/learning-to-simulate/)
work. It demonstrates how to use Modulus to train a Graph Neural Network (GNN)
to simulate Lagrangian fluids, solids, and deformable materials.

Collaborator: Do you have an experiment included to simulate solids?

Author (@Alexey-Kamenev): Yes, it's `./conf/experiment/sand.yaml`; the material is sand in this case.

@hakhondzadeh (Collaborator) commented Feb 5, 2025: It might be worthwhile to also add an experiment with a deformable solid, as opposed to a discrete granular material (sand). I mean in the future, not in this PR.

- **Node target**: acceleration ($d$)
- **Node features**:
  - position ($d$)
  - historical velocity ($t \times d$)

Collaborator: Does `t` here equal `num_history`? Maybe say this explicitly?

Author (@Alexey-Kamenev): Done.
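For concreteness (an illustrative example rather than the README's wording): with spatial dimension $d = 2$ and history length $t$ (= `num_history`) of 5, each node carries $d = 2$ position features plus $t \times d = 10$ historical-velocity features, and the target acceleration has $d = 2$ components.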

Progress and loss logs can be monitored using Weights & Biases. To activate that,
...
The URL to the dashboard will be displayed in the terminal after the run is launched.
Alternatively, the logging utility in `train.py` can be switched to MLFlow.

Collaborator: Maybe suggest monitoring alternatives in case acquiring a W&B license is not possible?

Author (@Alexey-Kamenev): For now we don't have any, but it should be easy to add MLFlow (we already have it in Modulus; it just needs to be plugged into this code properly) or TensorBoard. Feel free to take this as a task for yourself.
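A minimal sketch of what a TensorBoard alternative could look like (hypothetical; the metric name, log directory, and loop are placeholders, not this PR's code):

```python
# Hypothetical TensorBoard logging sketch using PyTorch's built-in writer.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="outputs/tb")  # view with: tensorboard --logdir outputs/tb
for step in range(100):
    loss = 1.0 / (step + 1)  # stand-in for the real training loss
    writer.add_scalar("loss/train", loss, global_step=step)
writer.close()
```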

```yaml
dim: 2

# Main output directory.
output: outputs
```

Collaborator: Maybe make `output` a runtime argument to enable running multiple training jobs? Or maybe add a job id, like `output: outputs/${job.name}`?

Author (@Alexey-Kamenev) commented Feb 4, 2025: With Hydra you can specify it pretty flexibly. For example, `output` can default to `outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}`, which creates a new directory on each run based on the current time; users can do the same via the command line. `job.name` does not provide a unique name either: it defaults to the script name (e.g. `train`), so users would still have to override the output via the command line or config. Also, `./outputs` is more or less the default for the other Modulus examples (whether that is good or not is a separate discussion).
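For example (illustrative command lines; `output` is the config key shown in the excerpt above):

```bash
# Per-run override from the command line (standard Hydra syntax):
python train.py output=outputs/run_001

# Or keep a timestamped default in the config, as suggested above:
#   output: outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}
python train.py
```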


```yaml
test:
  batch_size: 1
  device: cuda
```

Collaborator: Do you need to specify the CUDA device for training as well?

Author (@Alexey-Kamenev): No, for training the DistributedManager is used to set the device properly. If needed, the device can be set through other mechanisms, like `CUDA_VISIBLE_DEVICES`.
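A minimal sketch of the pattern described above (Modulus's `DistributedManager` selects the per-rank device; the exact usage in `train.py` may differ):

```python
# DistributedManager reads the environment set up by the launcher (e.g. torchrun)
# and assigns each rank its device, so no explicit "cuda" entry is needed in the
# training config.
from modulus.distributed import DistributedManager

DistributedManager.initialize()
dist = DistributedManager()
device = dist.device  # per-rank device, e.g. cuda:0 for local rank 0
```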

Once the model is trained, run the following command:

```bash
python inference.py +experiment=water \
    ...
```

Collaborator: Is there a way to specify a sample id for inference, so that we can run the model on different data (including the data used for training) and see the output?

Author (@Alexey-Kamenev): Not at the moment, but it's a great feature! I'm going to add it to the list of tasks.

```python
self.dim = cfg.dim
self.gravity = torch.zeros(self.dim, device=self.dist.device)
self.gravity[-1] = -9.8
```

Collaborator: How is gravity considered now?

Author (@Alexey-Kamenev): It's currently not used, and I don't think we need it. But I need to confirm with the original author of the code to understand why it was added in the first place.

In `main(cfg: DictConfig)`:

```python
mean_loss_pos = sum(loss_pos_list) / len(loss_pos_list)
mean_loss_vel = sum(loss_vel_list) / len(loss_vel_list)
mean_loss_acc = sum(loss_acc_list) / len(loss_acc_list)
```

Collaborator: Why not use `np` here?

Author (@Alexey-Kamenev): `sum` is faster on lists than `np.sum` (which is faster on NumPy arrays, but that's not what we have here).
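A quick way to check this claim (a micro-benchmark sketch; absolute numbers depend on the machine):

```python
import timeit

import numpy as np

xs = [float(i) for i in range(10_000)]  # plain Python list, as in the code above

# np.sum on a list pays a list-to-array conversion on every call.
t_builtin = timeit.timeit(lambda: sum(xs), number=1_000)
t_numpy = timeit.timeit(lambda: np.sum(xs), number=1_000)
print(f"builtin sum: {t_builtin:.3f}s  np.sum: {t_numpy:.3f}s")
```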

On the same lines:

```python
mean_loss_pos = sum(loss_pos_list) / len(loss_pos_list)
```

Collaborator: `mean_loss` is only used for logging and not for training, right? Then maybe rename it so it doesn't get confused with the loss used for training?

Author (@Alexey-Kamenev): This code is refactored in my upcoming PR; please wait for that.

```python
def main(cfg: DictConfig) -> None:
    # initialize distributed manager
    DistributedManager.initialize()
    dist = DistributedManager()

    init_python_logging(cfg, dist.rank)
    logger.info(f"Config summary:\n{OmegaConf.to_yaml(cfg, sort_keys=True)}")
```

Collaborator: Maybe also add a `log_gpu_info` function to dump the GPU info: number of devices, model, CUDA version, etc.?

Author (@Alexey-Kamenev): Done.
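A sketch of what such a helper might look like (hypothetical; the PR's actual `log_gpu_info` may differ):

```python
import logging

import torch

logger = logging.getLogger(__name__)

def log_gpu_info() -> None:
    """Log the number of visible GPUs, their names, and the CUDA version."""
    if not torch.cuda.is_available():
        logger.info("CUDA is not available; running on CPU.")
        return
    logger.info("CUDA version: %s", torch.version.cuda)
    logger.info("Number of GPUs: %d", torch.cuda.device_count())
    for i in range(torch.cuda.device_count()):
        logger.info("GPU %d: %s", i, torch.cuda.get_device_name(i))
```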

@Alexey-Kamenev (Collaborator, Author): /blossom-ci


@Alexey-Kamenev (Collaborator, Author): /blossom-ci

Labels: 2 - In Progress (Currently a work in progress)
Projects: None yet
3 participants