Move to experiment-based Hydra config in L-MGN example #771
base: main
Conversation
/blossom-ci
/blossom-ci
/blossom-ci
This reverts commit 5c13a1f.
/blossom-ci
Great organization! Thanks!
This is an example of MeshGraphNet for particle-based simulation, based on the
[Learning to Simulate](https://sites.google.com/view/learning-to-simulate/)
work. It demonstrates how to use Modulus to train a Graph Neural Network (GNN)
to simulate Lagrangian fluids, solids, and deformable materials.
Do you have an experiment included to simulate solids?
Yes, it's `./conf/experiment/sand.yaml` - the material is sand in this case.
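Assuming the experiment configs are selected the same way as the water example later in this README, the sand run would presumably be launched with `python train.py +experiment=sand` (command not taken from this PR, just the expected Hydra override given the config path above).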
- **Node target**: acceleration ($d$)
- **Node features**:
  - position ($d$)
  - historical velocity ($t \times d$)
Does $t$ here equal `num_history`? Maybe say this explicitly?
Done.
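For concreteness, the feature layout reads like this (a standalone sketch with made-up shapes, not the dataset code itself):

```python
import torch

# Standalone sketch: num_history (t) past velocities per node, each of
# dimension d, are flattened and concatenated with the node position.
num_nodes, d, num_history = 1000, 2, 5

position = torch.rand(num_nodes, d)                       # (N, d)
velocity_history = torch.rand(num_nodes, num_history, d)  # (N, t, d)

node_features = torch.cat(
    [position, velocity_history.reshape(num_nodes, -1)], dim=-1
)  # (N, d + t * d)
```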
The URL to the dashboard will be displayed in the terminal after the run is launched.
Alternatively, the logging utility in `train.py` can be switched to MLFlow.
Progress and loss logs can be monitored using Weights & Biases. To activate that,
Maybe suggest monitoring alternatives if acquiring a W&B license is not possible?
For now we don't have any - but it should be easy to add MLFlow (we already have it in Modulus, we just need to plug it into this code properly) or TensorBoard. Feel free to take this as a task for yourself.
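A minimal TensorBoard sketch, assuming only scalar loss curves are needed (the log directory and metric names here are illustrative, not part of this PR):

```python
from torch.utils.tensorboard import SummaryWriter

# Illustrative only: one way the W&B calls in train.py could be swapped for
# TensorBoard. Paths and metric names are made up for this sketch.
writer = SummaryWriter(log_dir="outputs/tensorboard")

def log_metrics(step: int, loss: float, lr: float) -> None:
    writer.add_scalar("train/loss", loss, step)
    writer.add_scalar("train/lr", lr, step)
```

The curves could then be viewed with `tensorboard --logdir outputs/tensorboard`.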
dim: 2

# Main output directory.
output: outputs
Maybe make `outputs` a runtime argument to enable running multiple training jobs?
Or maybe add a job id, like this: `output: outputs/${job.name}`?
With Hydra you can specify it pretty flexibly. For example, `output` can be set by default to `output: outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}`. This will create a new directory on each run based on the current time. Users can do the same via command line.
`job.name` does not provide a unique name either - it defaults to the script name (e.g. `train`), so users will still have to override the output via command line/config.
Also, using `./outputs` is kind of the default for other Modulus examples (whether it's good or not is a separate discussion).
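For completeness: with that default in place, a specific run directory can still be picked at launch time with a plain override such as `python train.py output=outputs/run_01` (sketch only; it assumes `output` stays a top-level key, as in the config hunk above).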
test:
  batch_size: 1
  device: cuda
Do you need to specify the CUDA device for train as well?
No, for train the `DistributedManager` is used to set the device properly. If needed, the device can be set using other mechanisms, like `CUDA_VISIBLE_DEVICES`.
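For example, `CUDA_VISIBLE_DEVICES=1 python train.py` would pin the training process to GPU 1; that is standard CUDA behavior rather than anything specific to this example.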
Once the model is trained, run the following command:

```bash
python inference.py +experiment=water \
```
Is there a way to specify a sample id for inference, so that we can try running the model on different data (including the data used for training) and see the output?
Not at the moment, but it's a great feature! I'm going to add it to the list of tasks.
self.dim = cfg.dim
self.gravity = torch.zeros(self.dim, device=self.dist.device)
self.gravity[-1] = -9.8
How is gravity considered now?
It's currently not used and I don't think we need it. But I need to confirm with the original author of the code to understand why it was added in the first place.
@@ -248,27 +203,30 @@ def main(cfg: DictConfig) -> None:
    mean_loss_pos = sum(loss_pos_list) / len(loss_pos_list)
    mean_loss_vel = sum(loss_vel_list) / len(loss_vel_list)
    mean_loss_acc = sum(loss_acc_list) / len(loss_acc_list)
Why not use np?
`sum` is faster on lists than `np.sum` (which is faster on NumPy arrays, but that's not what we have here).
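A quick way to check this claim on a plain Python list (illustrative micro-benchmark; timings vary by machine):

```python
import timeit
import numpy as np

# On a plain Python list, the builtin sum avoids the list -> ndarray
# conversion that np.sum performs before reducing.
losses = [float(i) for i in range(1_000)]

t_builtin = timeit.timeit(lambda: sum(losses) / len(losses), number=10_000)
t_numpy = timeit.timeit(lambda: np.sum(losses) / len(losses), number=10_000)
print(f"builtin sum: {t_builtin:.3f}s  np.sum: {t_numpy:.3f}s")
```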
@@ -248,27 +203,30 @@ def main(cfg: DictConfig) -> None:
    mean_loss_pos = sum(loss_pos_list) / len(loss_pos_list)
`mean_loss` is only used for logging and not for training, right? Then maybe rename it so it doesn't get confused with the loss used for training?
This code is refactored in my upcoming PR, please wait.
def main(cfg: DictConfig) -> None:
    # initialize distributed manager
    DistributedManager.initialize()
    dist = DistributedManager()

    init_python_logging(cfg, dist.rank)
    logger.info(f"Config summary:\n{OmegaConf.to_yaml(cfg, sort_keys=True)}")
Maybe add a `log_gpu_info` function as well to dump the GPU info, namely the number of devices, model, CUDA version, etc.?
Done.
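One possible shape for such a helper, using only standard `torch.cuda` queries (the version actually added in the PR may differ):

```python
import logging
import torch

def log_gpu_info(logger: logging.Logger) -> None:
    """Dump basic GPU/CUDA information to the given logger."""
    if not torch.cuda.is_available():
        logger.info("CUDA is not available; running on CPU.")
        return
    logger.info(f"CUDA version: {torch.version.cuda}")
    logger.info(f"Number of GPUs: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        logger.info(f"GPU {i}: {torch.cuda.get_device_name(i)}")
```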
/blossom-ci
Modulus Pull Request
Description
Results on different datasets:
Checklist
Dependencies