Move to experiment-based Hydra config in L-MGN example #771

Open
wants to merge 8 commits into
base: main
127 changes: 84 additions & 43 deletions examples/cfd/lagrangian_mgn/README.md
@@ -1,11 +1,9 @@
# MeshGraphNet with Lagrangian mesh

This is an example of MeshGraphNet for particle-based simulation, based on the
[Learning to Simulate](https://sites.google.com/view/learning-to-simulate/)
work. It demonstrates how to use Modulus to train a Graph Neural Network (GNN)
to simulate Lagrangian fluids, solids, and deformable materials.
**Comment (Collaborator):** Do you have an experiment included to simulate solids?

**Reply (Collaborator, Author):** Yes, it's `./conf/experiment/sand.yaml` - the material is sand in this case.

**Comment (Collaborator, @hakhondzadeh, Feb 5, 2025):** Maybe worthwhile to add another experiment with a deformable solid example too, as opposed to discrete granular material (sand) - in a future PR, not this one.


## Problem overview

@@ -22,38 +20,47 @@ steps to maintain physically valid prediction.

## Dataset

For this example, we use [DeepMind's particle physics datasets](https://sites.google.com/view/learning-to-simulate).
Some of these datasets contain particle-based simulations of fluid splashing and bouncing
within a box or cube, while others use materials such as sand or goop.
There are a total of 17 datasets, with some of them listed below:

| Datasets | Num Particles | Num Time Steps | dt | Ground Truth Simulator |
|--------------|---------------|----------------|----------|------------------------|
| Water-3D | 14k | 800 | 5ms | SPH |
| Water-2D | 2k | 1000 | 2.5ms | MPM |
| WaterRamp | 2.5k | 600 | 2.5ms | MPM |
| Sand | 2k | 320 | 2.5ms | MPM |
| Goop | 1.9k | 400 | 2.5ms | MPM |

See section **B.1** of the [original paper](https://arxiv.org/abs/2002.09405) for details.

## Model overview and architecture

This model uses MeshGraphNet to capture the dynamics of the fluid system.
The system is represented as a graph, where vertices correspond to fluid particles,
and edges represent their interactions. The model is autoregressive,
utilizing historical data to predict future states. Input features for the vertices
include current position, velocity, node type (e.g., fluid, sand, boundary),
and historical velocity. The model's output is acceleration, defined as the difference
between current and next velocity. Both velocity and acceleration are derived from
the position sequence and normalized to a standard Gaussian distribution
for consistency.
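The target derivation described above can be sketched with finite differences. This is an illustrative toy, not the datapipe's actual code; the 1-D trajectory, `dt` value, and helper names are invented for the example:

```python
# Toy sketch: derive velocity and acceleration targets from a particle
# position sequence. Velocity is the backward difference of positions;
# acceleration is the difference between the next and current velocity,
# matching the target definition above.
def velocities(positions, dt):
    return [(p1 - p0) / dt for p0, p1 in zip(positions, positions[1:])]

def accelerations(vels):
    return [v1 - v0 for v0, v1 in zip(vels, vels[1:])]

def normalize(xs, mean, std):
    # Normalize to a standard Gaussian using dataset statistics.
    return [(x - mean) / std for x in xs]

positions = [0.0, 0.1, 0.25, 0.45]  # 1-D toy trajectory
v = velocities(positions, dt=0.0025)  # approx. [40.0, 60.0, 80.0]
a = accelerations(v)                  # approx. [20.0, 20.0]
```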

For computational efficiency, we do not explicitly construct wall nodes for
square or cubic domains. Instead, we assign a wall feature to each interior
particle node, representing its distance from the domain boundaries. For a
system dimensionality of $d = 2$ or $d = 3$, the features are structured
as follows:

- **Node features**:
  - position ($d$)
  - historical velocity ($t \times d$), where the number of history steps $t$
    can be set via the `data.num_history` config parameter
  - one-hot encoding of node type (6 types by default)
  - wall feature ($2 \times d$)
- **Edge features**: displacement ($d$), distance (1)
- **Node target**: acceleration ($d$)
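The feature sizes above determine the model's input widths. A hypothetical helper (the function names are illustrative, not from the example code) makes the arithmetic explicit:

```python
def node_feature_dim(d, t, num_node_types):
    # position (d) + historical velocity (t * d)
    # + one-hot node type + wall feature (2 * d)
    return d + t * d + num_node_types + 2 * d

def edge_feature_dim(d):
    # displacement (d) + distance (1)
    return d + 1

# 2-D system, 5 history steps, 9 node types (as in the Goop experiment):
print(node_feature_dim(2, 5, 9))  # 25, matching input_dim_nodes in goop.yaml
```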

We construct edges based on a predefined radius, connecting pairs of particle
nodes if their pairwise distance is within this radius. During training, we
@@ -65,54 +72,88 @@ a small amount of noise is added during training.
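The radius-based edge construction can be sketched as a brute-force pairwise check. The actual datapipe presumably uses a spatial data structure for efficiency; this toy version (function name invented here) only illustrates the rule:

```python
import itertools
import math

def radius_edges(points, radius):
    # Connect every pair of particles whose Euclidean distance is within
    # the given radius; add both directions for message passing.
    edges = []
    for (i, p), (j, q) in itertools.combinations(enumerate(points), 2):
        if math.dist(p, q) <= radius:
            edges.append((i, j))
            edges.append((j, i))
    return edges

points = [(0.0, 0.0), (0.01, 0.0), (1.0, 1.0)]
print(radius_edges(points, 0.015))  # [(0, 1), (1, 0)]
```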

The model uses a hidden dimensionality of 128 for the encoder, processor, and
decoder. The encoder and decoder each contain two hidden layers, while the
processor consists of ten message-passing layers. We use a batch size of
20 per GPU (for the Water dataset), and summation aggregation is applied for
message passing in the processor. The learning rate is set to 0.0001 and decays
using a cosine annealing schedule. These hyperparameters can be configured via
the command line or in the config file.
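The decay behavior can be written out directly. A minimal sketch of cosine annealing, using the README's 0.0001 initial rate (`total_steps` here is arbitrary, chosen only for illustration):

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=1e-4, lr_min=0.0):
    # The learning rate follows half a cosine wave from lr_max to lr_min.
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * step / total_steps)
    )

print(cosine_annealing_lr(0, 1000))     # 1e-4 at the start
print(cosine_annealing_lr(1000, 1000))  # ~0.0 at the end
```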

## Getting Started

This example requires the `tensorflow` library to load the data in the `.tfrecord`
format. Install with:

```bash
pip install "tensorflow<=2.17.1"
```

To download the data from DeepMind's repo, run:

```bash
cd raw_dataset
bash download_dataset.sh Water /data/
```

This example uses [Hydra](https://hydra.cc/docs/intro/) for [experiment](https://hydra.cc/docs/patterns/configuring_experiments/)
configuration. Hydra offers a convenient way to modify nearly any experiment parameter,
such as dataset settings, model configurations, and optimizer options,
either through the command line or config files.

To view the full set of training script options, run the following command:

```bash
python train.py --help
```

If you encounter issues with the Hydra config, you may receive an error message
that isn’t very helpful. In that case, set the `HYDRA_FULL_ERROR=1` environment
variable for more detailed error information:

```bash
HYDRA_FULL_ERROR=1 python train.py ...
```

To train the model with the Water dataset, run:

```bash
python train.py +experiment=water data.data_dir=/data/Water
```

Progress and loss logs can be monitored using Weights & Biases. To activate that,
set `loggers.wandb.mode` to `online` in the command line:

```bash
python train.py +experiment=water data.data_dir=/data/Water loggers.wandb.mode=online
```

**Comment (Collaborator):** Maybe suggest monitoring alternatives if acquiring a W&B license is not possible?

**Reply (Collaborator, Author):** For now we don't have any - but it should be easy to add MLFlow (we already have it in Modulus, just need to properly plug into this code) or TensorBoard. Feel free to take this as a task for yourself.

An active Weights & Biases account is required. You will also need to set your
API key either through the command line option `loggers.wandb.wandb_key`
or by using the `WANDB_API_KEY` environment variable:

```bash
export WANDB_API_KEY=key
python train.py ...
```

## Inference

The inference script, `inference.py`, also supports Hydra configuration, ensuring
consistency between training and inference runs.

Once the model is trained, run the following command:

```bash
python inference.py +experiment=water \
    data.data_dir=/data/Water \
    data.test.num_samples=1 \
    resume_dir=/data/models/lmgn/water \
    output=/data/models/lmgn/water/inference
```

**Comment (Collaborator):** Is there a way to specify a sample id for the inference? So that we can try running the model on different data (including the ones used for training) and see the output?

**Reply (Collaborator, Author):** Not at the moment, but it's a great feature! I'm going to add it to the list of tasks.

Use the `resume_dir` parameter to specify the location of the model checkpoints.

This will save the predictions for the test dataset as `.gif` files in the
`/data/models/lmgn/water/inference/animations` directory.

## References

101 changes: 101 additions & 0 deletions examples/cfd/lagrangian_mgn/conf/config.yaml
@@ -0,0 +1,101 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

defaults:
- /logging/python: default
- override hydra/job_logging: disabled # We use rank-aware logger configuration instead.
- _self_

hydra:
run:
dir: ${output}
output_subdir: hydra # Default is .hydra which causes files not being uploaded in W&B.

# Dimensionality of the problem (2D or 3D).
dim: 2

# Main output directory.
output: outputs
**Comment (Collaborator):** Maybe make outputs a runtime argument to enable running multiple training jobs? Or maybe add a job id like this? `output: outputs/${job.name}`

**Reply (Collaborator, Author, @Alexey-Kamenev, Feb 4, 2025):** With Hydra you can specify it pretty flexibly. For example, output can be set by default to `output: outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}`. This will create a new directory on each run based on the current time. Users can do the same via command line. `job.name` does not provide a unique name either - it defaults to the script name (e.g. `train`), so users will still have to override the output via command line/config. Also, using `./outputs` is kind of a default for other Modulus examples (whether it's good or not is a separate discussion).

# The directory to search for checkpoints to continue training.
resume_dir: ${output}

# The dataset directory must be set either in command line or config.
data:
data_dir: ???
num_history: 5
num_node_types: 6
train:
split: train
valid:
split: valid
test:
split: test

# The loss should be set in the experiment.
loss: ???

# The optimizer should be set in the experiment.
optimizer: ???

# The scheduler should be set in the experiment.
lr_scheduler: ???

train:
batch_size: 20
epochs: 20
checkpoint_save_freq: 5
dataloader:
batch_size: ${..batch_size}
shuffle: true
num_workers: 8
pin_memory: true
drop_last: true

test:
batch_size: 1
device: cuda
**Comment (Collaborator):** Do you need to specify cuda device on train as well?

**Reply (Collaborator, Author):** No, for train the DistributedManager is used to set the device properly. If needed, the device can be set using other mechanisms, like `CUDA_VISIBLE_DEVICES`.

dataloader:
batch_size: ${..batch_size}
shuffle: false
num_workers: 1
pin_memory: true
drop_last: false

compile:
enabled: false
args:
backend: inductor

amp:
enabled: false

loggers:
wandb:
_target_: loggers.WandBLogger
project: meshgraphnet
entity: modulus
name: l-mgn
group: l-mgn
mode: disabled
dir: ${output}
id:
wandb_key:
watch_model: false

inference:
frame_skip: 1
frame_interval: 1
29 changes: 29 additions & 0 deletions examples/cfd/lagrangian_mgn/conf/data/lagrangian_dataset.yaml
@@ -0,0 +1,29 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

_target_: modulus.datapipes.gnn.lagrangian_dataset.LagrangianDataset
_convert_: all

name: ${data.name}
data_dir: ${data.data_dir}
split: ???
num_samples: ???
num_history: ${..num_history}
num_steps: 600
num_node_types: ${..num_node_types}
noise_std: 0.0003
radius: 0.015
dt: 0.0025
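The `noise_std` parameter above controls training-noise injection. A hedged sketch of random-walk position noise in the spirit of the original Learning to Simulate setup follows; the function name and seed are invented here, and the actual datapipe implementation may differ:

```python
import random

def add_walk_noise(positions, noise_std, seed=0):
    # Accumulate per-step Gaussian noise so perturbations compound over
    # the history window, mimicking rollout drift during training.
    rng = random.Random(seed)
    noisy, accum = [], 0.0
    for p in positions:
        accum += rng.gauss(0.0, noise_std)
        noisy.append(p + accum)
    return noisy

clean = [0.0, 0.1, 0.2, 0.3]  # 1-D toy trajectory
noisy = add_walk_noise(clean, noise_std=0.0003)
```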
42 changes: 42 additions & 0 deletions examples/cfd/lagrangian_mgn/conf/experiment/goop.yaml
@@ -0,0 +1,42 @@
# @package _global_

# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

defaults:
- /[email protected]: lagrangian_dataset
- /[email protected]: lagrangian_dataset
- /[email protected]: lagrangian_dataset
- /model: mgn_2d
- /loss: mseloss
- /optimizer: fused_adam
- /lr_scheduler: cosine

data:
name: Goop
num_node_types: 9
train:
num_samples: 1000
num_steps: 395 # 400 - ${num_history}
valid:
num_samples: 30
num_steps: 100
test:
num_samples: 30
num_steps: 100

model:
input_dim_nodes: 25 # 9 node types instead of 6.