Move to experiment-based Hydra config in L-MGN example #771
@@ -1,11 +1,9 @@
# MeshGraphNet with Lagrangian mesh

This is an example of MeshGraphNet for particle-based simulation, based on the
[Learning to Simulate](https://sites.google.com/view/learning-to-simulate/)
work. It demonstrates how to use Modulus to train a Graph Neural Network (GNN)
to simulate Lagrangian fluids, solids, and deformable materials.

## Problem overview
@@ -22,38 +20,46 @@ steps to maintain physically valid prediction.

## Dataset

For this example, we use [DeepMind's particle physics datasets](https://sites.google.com/view/learning-to-simulate).
Some of these datasets contain particle-based simulations of fluid splashing and bouncing
within a box or cube, while others use materials such as sand or goop.
There are 17 datasets in total, some of which are listed below:

| Dataset   | Num Particles | Num Time Steps | dt     | Ground Truth Simulator |
|-----------|---------------|----------------|--------|------------------------|
| Water-3D  | 14k           | 800            | 5 ms   | SPH                    |
| Water-2D  | 2k            | 1000           | 2.5 ms | MPM                    |
| WaterRamp | 2.5k          | 600            | 2.5 ms | MPM                    |
| Sand      | 2k            | 320            | 2.5 ms | MPM                    |
| Goop      | 1.9k          | 400            | 2.5 ms | MPM                    |

See section **B.1** of the [original paper](https://arxiv.org/abs/2002.09405) for details.

## Model overview and architecture

This model uses MeshGraphNet to capture the dynamics of the fluid system.
The system is represented as a graph, where vertices correspond to fluid particles
and edges represent their interactions. The model is autoregressive,
using historical data to predict future states. Input features for the vertices
include the current position, velocity, node type (e.g., fluid, sand, boundary),
and historical velocity. The model's output is acceleration, defined as the difference
between the current and next velocity. Both velocity and acceleration are derived from
the position sequence and normalized to a standard Gaussian distribution
for consistency.
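
A minimal sketch of this derivation, with hypothetical array and function names
(the actual datapipe may compute the normalization statistics differently):

```python
import numpy as np

def velocity_acceleration_targets(positions: np.ndarray, dt: float):
    """positions: (num_steps, num_particles, d) particle position sequence."""
    velocity = (positions[1:] - positions[:-1]) / dt  # finite-difference velocity
    # Acceleration is the difference between the current and next velocity.
    acceleration = velocity[1:] - velocity[:-1]
    # Normalize both to a standard Gaussian for consistency.
    velocity = (velocity - velocity.mean()) / (velocity.std() + 1e-8)
    acceleration = (acceleration - acceleration.mean()) / (acceleration.std() + 1e-8)
    return velocity, acceleration

vel, acc = velocity_acceleration_targets(np.random.rand(10, 100, 2), dt=0.0025)
```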

For computational efficiency, we do not explicitly construct wall nodes for
square or cubic domains. Instead, we assign a wall feature to each interior
particle node, representing its distance from the domain boundaries. For a
system dimensionality of $d = 2$ or $d = 3$, the features are structured
as follows:

- **Node features**:
  - position ($d$)
  - historical velocity ($t \times d$, where $t$ is the number of history steps, `num_history`)
  - one-hot encoding of node type (e.g., 6)
  - wall feature ($2 \times d$)
- **Edge features**: displacement ($d$), distance (1)
- **Node target**: acceleration ($d$)

> **Review:** Does $t$ here equal `num_history`? Maybe say this explicitly?
>
> **Author:** Done.

We construct edges based on a predefined radius, connecting pairs of particle
nodes if their pairwise distance is within this radius.
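
A minimal sketch of radius-based edge construction using a k-d tree (assuming
`scipy` is available; illustrative only, not necessarily how the example's
datapipe implements it):

```python
import numpy as np
from scipy.spatial import cKDTree

def build_edges(positions: np.ndarray, radius: float) -> np.ndarray:
    """Connect particle pairs whose pairwise distance is within `radius`.

    positions: (num_particles, d) array; returns a (num_edges, 2) index array.
    """
    tree = cKDTree(positions)
    pairs = tree.query_pairs(r=radius, output_type="ndarray")
    # Add reversed pairs so message passing runs in both directions.
    return np.concatenate([pairs, pairs[:, ::-1]], axis=0)

edges = build_edges(np.random.rand(100, 2), radius=0.015)
```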
During training, we
@@ -65,54 +71,88 @@ a small amount of noise is added during training.

The model uses a hidden dimensionality of 128 for the encoder, processor, and
decoder. The encoder and decoder each contain two hidden layers, while the
processor consists of ten message-passing layers. We use a batch size of
20 per GPU (for the Water dataset), and summation aggregation is applied for
message passing in the processor. The learning rate is set to 0.0001 and decays
using a cosine annealing schedule. These hyperparameters can be configured from
the command line or in the config file.
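
For reference, this kind of schedule corresponds roughly to PyTorch's
`CosineAnnealingLR` (a standalone sketch with a stand-in model; in this example
the optimizer and scheduler are actually configured through Hydra):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the GNN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# The learning rate anneals from 1e-4 towards ~0 over T_max scheduler steps.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

for step in range(1000):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 8)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
```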

## Getting Started

This example requires the `tensorflow` library to load the data in the `.tfrecord`
format. Install with:

```bash
pip install "tensorflow<=2.17.1"
```

To download the data from DeepMind's repo, run:

```bash
cd raw_dataset
bash download_dataset.sh Water /data/
```

This example uses [Hydra](https://hydra.cc/docs/intro/) for [experiment](https://hydra.cc/docs/patterns/configuring_experiments/)
configuration. Hydra offers a convenient way to modify nearly any experiment parameter,
such as dataset settings, model configurations, and optimizer options,
either through the command line or config files.
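
For illustration, the same configuration can also be composed programmatically
through Hydra's compose API (a sketch; the `conf` directory and `config` name
here are assumptions about this example's layout):

```python
from hydra import compose, initialize

# Compose the same config that `python train.py +experiment=water ...` builds.
with initialize(version_base=None, config_path="conf"):
    cfg = compose(
        config_name="config",
        overrides=["+experiment=water", "data.data_dir=/data/Water"],
    )
    print(cfg.data.data_dir)  # inspect resolved parameters
```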

To view the full set of training script options, run the following command:

```bash
python train.py --help
```

If you encounter issues with the Hydra config, you may receive an error message
that isn't very helpful. In that case, set the `HYDRA_FULL_ERROR=1` environment
variable for more detailed error information:

```bash
HYDRA_FULL_ERROR=1 python train.py ...
```

To train the model with the Water dataset, run:

```bash
python train.py +experiment=water data.data_dir=/data/Water
```

Progress and loss logs can be monitored using Weights & Biases. To activate that,
set `loggers.wandb.mode` to `online` on the command line:

```bash
python train.py +experiment=water data.data_dir=/data/Water loggers.wandb.mode=online
```

> **Review:** Maybe suggest monitoring alternatives if acquiring a W&B license is not possible?
>
> **Author:** For now we don't have any, but it should be easy to add MLFlow (we already have it in Modulus, just need to properly plug it into this code) or TensorBoard. Feel free to take this as a task for yourself.

An active Weights & Biases account is required. You will also need to set your
API key, either through the command-line option `loggers.wandb.wandb_key`
or by using the `WANDB_API_KEY` environment variable:

```bash
export WANDB_API_KEY=key
python train.py ...
```

## Inference

The inference script, `inference.py`, also supports Hydra configuration, ensuring
consistency between training and inference runs.

Once the model is trained, run the following command:

```bash
python inference.py +experiment=water \
    data.data_dir=/data/Water \
    data.test.num_samples=1 \
    resume_dir=/data/models/lmgn/water \
    output=/data/models/lmgn/water/inference
```

> **Review:** Is there a way to specify a sample id for the inference? So that we can try running the model on different data (including the ones used for training) and see the output?
>
> **Author:** Not at the moment, but it's a great feature! I'm going to add it to the list of tasks.

Use the `resume_dir` parameter to specify the location of the model checkpoints.

This will save the predictions for the test dataset as `.gif` files in the
`/data/models/lmgn/water/inference/animations` directory.
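
As a rough illustration of how such GIF animations can be produced (a
self-contained `matplotlib` sketch on synthetic positions, not what
`inference.py` itself does):

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.animation import FuncAnimation, PillowWriter

# Synthetic (num_steps, num_particles, 2) positions: a random walk in the unit box.
positions = np.cumsum(np.random.randn(100, 50, 2) * 0.01, axis=0) + 0.5

fig, ax = plt.subplots()
scat = ax.scatter(positions[0, :, 0], positions[0, :, 1], s=5)
ax.set(xlim=(0, 1), ylim=(0, 1))

def update(frame):
    scat.set_offsets(positions[frame])  # move particles to the current frame
    return (scat,)

anim = FuncAnimation(fig, update, frames=positions.shape[0])
anim.save("animation.gif", writer=PillowWriter(fps=20))
```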

## References

@@ -0,0 +1,100 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

defaults:
  - /logging/python: default
  - override hydra/job_logging: disabled # We use rank-aware logger configuration instead.
  - _self_

hydra:
  run:
    dir: ${output}
  output_subdir: hydra # Default is .hydra which causes files not being uploaded in W&B.

# Dimensionality of the problem (2D or 3D).
dim: 2

# Main output directory.
output: outputs

> **Review:** Maybe make `outputs` a runtime argument to enable running multiple training jobs?
>
> **Author:** With Hydra you can specify it pretty flexibly, for example by interpolating the current time into `output`. This will create a new directory on each run based on the current time. Users can do the same via the command line.

# The directory to search for checkpoints to continue training.
resume_dir: ${output}

# The dataset directory must be set either in command line or config.
data:
  data_dir: ???
  num_node_types: 6
  train:
    split: train
  valid:
    split: valid
  test:
    split: test

# The loss should be set in the experiment.
loss: ???

# The optimizer should be set in the experiment.
optimizer: ???

# The scheduler should be set in the experiment.
lr_scheduler: ???

train:
  batch_size: 20
  epochs: 20
  checkpoint_save_freq: 5
  dataloader:
    batch_size: ${..batch_size}
    shuffle: true
    num_workers: 8
    pin_memory: true
    drop_last: true

test:
  batch_size: 1
  device: cuda

> **Review:** Do you need to specify the cuda device on train as well?
>
> **Author:** No, for train the …

  dataloader:
    batch_size: ${..batch_size}
    shuffle: false
    num_workers: 1
    pin_memory: true
    drop_last: false

compile:
  enabled: false
  args:
    backend: inductor

amp:
  enabled: false

loggers:
  wandb:
    _target_: loggers.WandBLogger
    project: meshgraphnet
    entity: modulus
    name: l-mgn
    group: l-mgn
    mode: disabled
    dir: ${output}
    id:
    wandb_key:
    watch_model: false

inference:
  frame_skip: 1
  frame_interval: 1

@@ -0,0 +1,29 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

_target_: modulus.datapipes.gnn.lagrangian_dataset.LagrangianDataset
_convert_: all

name: ${data.name}
data_dir: ${data.data_dir}
split: ???
num_samples: ???
num_history: 5
num_steps: 600
num_node_types: ${..num_node_types}
noise_std: 0.0003
radius: 0.015
dt: 0.0025
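
For context, a config with a `_target_` like this is typically built with
`hydra.utils.instantiate`. A sketch, with placeholder values for the mandatory
(`???`) and interpolated fields (the file path is an assumption, and `modulus`
plus the downloaded data must be available):

```python
from hydra.utils import instantiate
from omegaconf import OmegaConf

# Load the dataset config standalone and fill in fields that would normally
# come from the parent config (values below are illustrative placeholders).
cfg = OmegaConf.load("conf/data/lagrangian_dataset.yaml")  # hypothetical path
cfg.name = "Water"
cfg.data_dir = "/data/Water"
cfg.split = "train"
cfg.num_samples = 10
cfg.num_node_types = 6

# instantiate() imports `_target_` and calls it with the remaining keys as kwargs.
dataset = instantiate(cfg)
```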

@@ -0,0 +1,42 @@
# @package _global_

# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

defaults:
  - /data@data.train: lagrangian_dataset
  - /data@data.valid: lagrangian_dataset
  - /data@data.test: lagrangian_dataset
  - /model: mgn_2d
  - /loss: mseloss
  - /optimizer: fused_adam
  - /lr_scheduler: cosine

data:
  name: Goop
  num_node_types: 9
  train:
    num_samples: 1000
    num_steps: 395 # 400 - ${num_history}
  valid:
    num_samples: 30
    num_steps: 100
  test:
    num_samples: 30
    num_steps: 100

model:
  input_dim_nodes: 25 # 9 node types instead of 6.
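
As a sanity check on `input_dim_nodes`, the node-feature layout from the README
(position, historical velocity, node-type one-hot, wall feature) gives, assuming
$d = 2$ and `num_history` $= 5$ as in the dataset config:

```python
d = 2            # spatial dimension
num_history = 5  # history steps (t), from lagrangian_dataset.yaml
num_node_types = 9

# position (d) + historical velocity (num_history * d)
# + one-hot node type + wall feature (2 * d)
input_dim_nodes = d + num_history * d + num_node_types + 2 * d
assert input_dim_nodes == 25  # 2 + 10 + 9 + 4, matching the config above
```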

> **Review:** Do you have an experiment included to simulate solids?
>
> **Author:** Yes, it's `./conf/experiment/sand.yaml`; the material is sand in this case.
>
> **Review:** Maybe worthwhile to add another experiment with a deformable solid example too, as opposed to discrete granular material (sand). I mean in the future, not this PR.