Move to experiment-based Hydra config in L-MGN example #771
base: main
Conversation
/blossom-ci
/blossom-ci
/blossom-ci
This reverts commit 5c13a1f.
/blossom-ci
Great organization! Thanks!
This is an example of MeshGraphNet for particle-based simulation, based on the
[Learning to Simulate](https://sites.google.com/view/learning-to-simulate/)
work. It demonstrates how to use Modulus to train a Graph Neural Network (GNN)
to simulate Lagrangian fluids, solids, and deformable materials.
Do you have an experiment included to simulate solids?
Yes, it's `./conf/experiment/sand.yaml` - the material is sand in this case.
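Assuming the experiment configs are selected the same way as the water example later in this README, the sand run would presumably be launched with `python train.py +experiment=sand` (command not taken from this PR, just the expected Hydra override given the config path above).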
- **Node target**: acceleration ($d$)
- **Node features**:
  - position ($d$)
  - historical velocity ($t \times d$)
Does $t$ here equal `num_history`? Maybe say this explicitly?
Done.
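For concreteness, the feature layout reads like this (a standalone sketch with made-up shapes, not the dataset code itself):

```python
import torch

# Standalone sketch: num_history (t) past velocities per node, each of
# dimension d, are flattened and concatenated with the node position.
num_nodes, d, num_history = 1000, 2, 5

position = torch.rand(num_nodes, d)                       # (N, d)
velocity_history = torch.rand(num_nodes, num_history, d)  # (N, t, d)

node_features = torch.cat(
    [position, velocity_history.reshape(num_nodes, -1)], dim=-1
)  # (N, d + t * d)
```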
The URL to the dashboard will be displayed in the terminal after the run is launched.
Alternatively, the logging utility in `train.py` can be switched to MLFlow.
Progress and loss logs can be monitored using Weights & Biases. To activate that,
Maybe suggest monitoring alternatives if acquiring a W&B license is not possible?
For now we don't have any - but it should be easy to add MLFlow (we already have it in Modulus, we just need to plug it into this code properly) or TensorBoard. Feel free to take this as a task for yourself.
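A minimal TensorBoard sketch, assuming only scalar loss curves are needed (the log directory and metric names here are illustrative, not part of this PR):

```python
from torch.utils.tensorboard import SummaryWriter

# Illustrative only: one way the W&B calls in train.py could be swapped for
# TensorBoard. Paths and metric names are made up for this sketch.
writer = SummaryWriter(log_dir="outputs/tensorboard")

def log_metrics(step: int, loss: float, lr: float) -> None:
    writer.add_scalar("train/loss", loss, step)
    writer.add_scalar("train/lr", lr, step)
```

The curves could then be viewed with `tensorboard --logdir outputs/tensorboard`.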
dim: 2

# Main output directory.
output: outputs
Maybe make `outputs` a runtime argument to enable running multiple training jobs?
Or maybe add a job id, like this: `output: outputs/${job.name}`?
With Hydra you can specify it pretty flexibly. For example, `output` can be set by default to `output: outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}`. This will create a new directory on each run based on the current time. Users can do the same via command line.
`job.name` does not provide a unique name either - it defaults to the script name (e.g. `train`), so users will still have to override the output via command line/config.
Also, using `./outputs` is kind of the default for other Modulus examples (whether it's good or not is a separate discussion).
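For completeness: with that default in place, a specific run directory can still be picked at launch time with a plain override such as `python train.py output=outputs/run_01` (sketch only; it assumes `output` stays a top-level key, as in the config hunk above).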
test:
  batch_size: 1
  device: cuda
Do you need to specify the CUDA device for train as well?
No, for train the `DistributedManager` is used to set the device properly. If needed, the device can be set using other mechanisms, like `CUDA_VISIBLE_DEVICES`.
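For example, `CUDA_VISIBLE_DEVICES=1 python train.py` would pin the training process to GPU 1; that is standard CUDA behavior rather than anything specific to this example.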
Once the model is trained, run the following command:

```bash
python inference.py +experiment=water \
```
Is there a way to specify a sample id for inference, so that we can try running the model on different data (including the data used for training) and see the output?
Not at the moment, but it's a great feature! I'm going to add it to the list of tasks.
self.dim = cfg.dim
self.gravity = torch.zeros(self.dim, device=self.dist.device)
self.gravity[-1] = -9.8
How is gravity considered now?
It's currently not used and I don't think we need it. But I need to confirm with the original author of the code to understand why it was added in the first place.
@@ -248,27 +203,30 @@ def main(cfg: DictConfig) -> None:
    mean_loss_pos = sum(loss_pos_list) / len(loss_pos_list)
    mean_loss_vel = sum(loss_vel_list) / len(loss_vel_list)
    mean_loss_acc = sum(loss_acc_list) / len(loss_acc_list)
Why not use np?
`sum` is faster on lists than `np.sum` (which is faster on NumPy arrays, but that's not what we have here).
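A quick way to check this claim on a plain Python list (illustrative micro-benchmark; timings vary by machine):

```python
import timeit
import numpy as np

# On a plain Python list, the builtin sum avoids the list -> ndarray
# conversion that np.sum performs before reducing.
losses = [float(i) for i in range(1_000)]

t_builtin = timeit.timeit(lambda: sum(losses) / len(losses), number=10_000)
t_numpy = timeit.timeit(lambda: np.sum(losses) / len(losses), number=10_000)
print(f"builtin sum: {t_builtin:.3f}s  np.sum: {t_numpy:.3f}s")
```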
@@ -248,27 +203,30 @@ def main(cfg: DictConfig) -> None:
    mean_loss_pos = sum(loss_pos_list) / len(loss_pos_list)
`mean_loss` is only used for logging and not for training, right? Then maybe rename it so it doesn't get confused with the loss used for training?
This code is refactored in my upcoming PR, please wait.
def main(cfg: DictConfig) -> None:
    # initialize distributed manager
    DistributedManager.initialize()
    dist = DistributedManager()

    init_python_logging(cfg, dist.rank)
    logger.info(f"Config summary:\n{OmegaConf.to_yaml(cfg, sort_keys=True)}")
Maybe add a `log_gpu_info` function as well to dump the GPU info, namely the number of devices, model, CUDA version, etc.?
Done.
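One possible shape for such a helper, using only standard `torch.cuda` queries (the version actually added in the PR may differ):

```python
import logging
import torch

def log_gpu_info(logger: logging.Logger) -> None:
    """Dump basic GPU/CUDA information to the given logger."""
    if not torch.cuda.is_available():
        logger.info("CUDA is not available; running on CPU.")
        return
    logger.info(f"CUDA version: {torch.version.cuda}")
    logger.info(f"Number of GPUs: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        logger.info(f"GPU {i}: {torch.cuda.get_device_name(i)}")
```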
/blossom-ci
Modulus Pull Request
Description
Results on different datasets:
Checklist
Dependencies