Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Entire coordinate field is sent to one MPI rank for NetCDF outputting. #556

Open
JHopeCollins opened this issue Sep 25, 2024 · 7 comments
Open
Labels
parallel Pull requests or issues relating to parallel capability

Comments

@JHopeCollins
Copy link
Collaborator

JHopeCollins commented Sep 25, 2024

Python parallel NetCDF does not support NetCDF4 features. To get around this, the entire coordinate field must be communicated to a single MPI rank to use NetCDF outputting. This happens in Coordinates.register_space:

def register_space(self, domain, space_name):

This is ok for smaller jobs, but is impractical/impossible for larger resolutions or processor counts.
However, the Domain will always register the DG space:

self.coords.register_space(self, 'DG1_equispaced')

This means that even if you don't use the NetCDF outputting, the coordinate field will still be collected on one rank.
Ideally this collection would be optional. Either by telling the Domain explicitly, or even better by delaying the register_space call until it is actually needed. Then if dump_nc=False is passed to the OutputParameters then register_space is never called. This might go in this method, but I'm not familiar enough to really tell.

def create_nc_dump(self, filename, space_names):

@JHopeCollins JHopeCollins added the parallel Pull requests or issues relating to parallel capability label Sep 25, 2024
@tommbendall
Copy link
Contributor

Thinking about longer term solutions to parallel outputting to solve this, here are some first thoughts on different options:

  1. Having each rank write out to its own netCDF file. For plotting we'd presumably then need a routine to stitch these files back together.
  2. Send data to the first rank in batches, e.g. for data on an extruded mesh, send it level-by-level. The aim would be to avoid the first rank needing to allocate memory for the whole data array. But this doesn't necessarily help with any communications bottlenecks.
  3. Using pnetcdf. This would likely involve restructuring the contents of the output file as pnetcdf doesn't support several HDF5 features. It would feel odd not to me not use the HDF5 standard...

This is also relevant to #378 but posting here as this feels a more relevant place right now. Option 1 seems the most sensible to me ...

@colinjcotter
Copy link
Contributor

I think that it would be a lot more sustainable (for hpc installation etc) to move Gusto over to checkpoint files for IO, and then develop offline tools for transferring to and from netcdf as needed.

@JHopeCollins
Copy link
Collaborator Author

JHopeCollins commented Sep 26, 2024

Of the three options that Tom identified, I agree that option 1 is the best. Option 2 gets around the out-of-memory errors but will scale poorly for larger jobs.

But, I do think that Colin's suggestion that checkpoint files are more sustainable. It would also by more Firedrake-onic (is that a word? Pythonic is. Fire-draconic maybe?), as well as being the more efficient and standard approach - dump data to disk as quickly as possible during the compute job when you're using lots of nodes, then do the postprocessing later on a smaller number of nodes.

It would also make it easier a) for other people who might have existing postprocessing workflows that use plain Firedrake rather than Gusto, and b) to compare to results from other Firedrake libraries.

As an aside, postprocessing large files from previous runs is one of the reasons most HPC machines have high memory nodes, which usually have at least twice the amount of RAM of the usual nodes. So that's one option for doing data processing that relies on reductions onto a single rank.

@JDBetteridge
Copy link
Member

I'm going to stick my oar in here and disagree with @tommbendall and @JHopeCollins: I think option 1 is actually a bad idea as I don't think it will scale well on many systems, and it doesn't leverage any of the features of MPI-IO that enable the scalable writing of files. I think the "right" solution for scalability is either:

  1. Use the existing parallel IO available in netcdf4.
  2. Use Firedrake HDF5 checkpointing as Colin suggested.

Both of these options will utilise MPI-IO and hopefully prevent performance and scalability issues. Happy to expand on these points if you want more information.

@colinjcotter
Copy link
Contributor

colinjcotter commented Sep 27, 2024

Only downside of Firedrake hd5 is that it creates errors on non affine meshes. So we should really work out how to fix that.

@colinjcotter
Copy link
Contributor

although we can keep a hedgehog mesh around for such situations.

@tommbendall
Copy link
Contributor

Thanks for the discussion all!

I hadn't realised that we might be able to use the existing parallel option with netCDF4. That looks like it should be straightforward so I'll give that a try.

Since we already have checkpointing capability, it should also be straightforward to allow people to only write diagnostic output to Firedrake checkpoint files if they would prefer.

I would definitely want to keep some form of online netCDF outputting, rather than post-processing to do this, as setting that up sounds like more work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parallel Pull requests or issues relating to parallel capability
Projects
None yet
Development

No branches or pull requests

4 participants