Runaway memory usage; surely my own bug! #1241
Replies: 14 comments · 10 replies
-
I'm confused, because when I run the same experiment using the standard OceanDrift model and the reader_ROMS_native reader, the constant increase in memory usage continues. Some increase might be expected as particles spread out during a run, but the growth continues well past what can be explained by having to load forcing data for the entire domain. I've tried running OceanDrift with several different configurations; in all cases, memory usage continues to rise monotonically. For a current test run with 10,000 particles, after 3 months of simulation (using daily forcing files and an hourly calculation time step), memory usage has risen from 3.5 GB to 10 GB.
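For reference, a minimal sketch of the kind of setup described above; the file pattern, seeding position, and duration are placeholders, not the actual experiment script:

```python
# Minimal sketch of the setup described above; file pattern, seeding
# position and duration are placeholders, not the actual experiment.
from datetime import timedelta

from opendrift.models.oceandrift import OceanDrift
from opendrift.readers import reader_ROMS_native

o = OceanDrift(loglevel=20)
r = reader_ROMS_native.Reader('roms_daily_*.nc')  # hypothetical daily forcing files
o.add_reader(r)
o.seed_elements(lon=4.0, lat=60.0, radius=1000, number=10000, time=r.start_time)
o.run(duration=timedelta(days=90), time_step=3600, outfile='out.nc')
```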
-
Hi, Yes, OpenDrift should not use much memory: the history array is flushed to disk every 100 time steps by default (export_buffer_length), and readers keep/cache only the previous and upcoming time steps in memory.
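For concreteness, a hedged sketch of where that buffer length is set (assuming `o` is a configured simulation object as in the sketch above):

```python
# export_buffer_length is a keyword of run(); the history buffer is flushed
# to the output file every this-many output time steps, so only a bounded
# window of history should sit in memory. (Assumes `o` is a configured
# OpenDrift simulation object and timedelta is imported.)
o.run(duration=timedelta(days=90), time_step=3600,
      outfile='out.nc', export_buffer_length=100)  # 100 is the default
```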
-
I have run with a million particles on a laptop without problems, and have never bothered to monitor memory usage as it has never been an issue for me.
-
For example, changing the number of particles in example_generic from 3000 to 1 million gives me a memory usage of about 4 GB.
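In example_generic the change amounts to roughly this one line; the coordinates and the reader name are illustrative, not the script's actual values:

```python
# Bumping the particle count from 3000 to 1 million; lon/lat/radius and
# the reader variable are illustrative placeholders, not the actual
# values in example_generic.
o.seed_elements(lon=4.5, lat=60.0, radius=1000,
                number=1_000_000, time=reader.start_time)
```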
-
Could you make one or two of these ROMS files available for testing? It would also be useful to see the full script that leads to the memory problems.
-
Hi Knut, Thanks for your continued help. Please let me know if the following link doesn't work: https://drive.google.com/drive/folders/1SFrCu0EgelI1FoKelTSyaOqxje6Rp0wQ?usp=drive_link It should take you to a Google Drive folder containing my script, a single ".p" (pickle) file from which the initial lon/lat coordinates of the floats are read, and 10 netCDF forcing files from our circulation model. Otherwise, I can easily paste the script here if that's best. Thank you,
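For reference, a hedged sketch of reading seed positions from such a pickle file; the filename and the assumed two-array structure are guesses:

```python
# Hedged sketch: load initial float positions from the ".p" (pickle) file.
# The filename and the assumed (lon, lat) structure are guesses.
import pickle

with open('initial_positions.p', 'rb') as f:  # hypothetical filename
    lon, lat = pickle.load(f)
```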
-
Hi, were you able to look at those files? Everything should run if it's all in the same directory. I'm still unable to get a single seeding of 12,000 floats to run for 90 days; see the attached plot of memory usage before the crash.
-
The script (unchanged) worked well on my laptop, taking about 2 minutes and producing an output file of 269 MB.
-
I am not familiar with Slurm and do not presently have access to a cluster. After downloading all your forcing files and adding some memory logging, I can confirm: yes, there seems to be a steady increase in memory usage. I am not sure what causes this, as the history/data array is flushed to disk every 100 time steps. Maybe it is Xarray consuming more memory after opening more of the forcing files?
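The exact instrumentation used isn't shown in this thread; as a sketch, one way to log memory per step, assuming the psutil package:

```python
# One way to log resident memory per time step (an assumption about the
# instrumentation; the thread does not show the exact code). Needs psutil.
import os
import psutil

process = psutil.Process(os.getpid())

def log_memory(step):
    rss_gb = process.memory_info().rss / 1e9  # resident set size in GB
    print(f'step {step}: {rss_gb:.2f} GB')
```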
-
I did some more testing with Memray (https://github.com/bloomberg/memray), which was quite easy to install and use; however, I was not able to identify any single source of the leak that way. I then found a page hinting that Xarray has a memory leak when using its default netCDF4 backend. After installing an alternative netCDF backend, memory usage stabilized for me.
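A sketch of the workaround, assuming the alternative backend was h5netcdf (a common fix for this known Xarray issue); since OpenDrift's readers open the files internally, where the engine gets selected is also an assumption here:

```python
# Hedged sketch: open a forcing file with the h5netcdf engine instead of
# Xarray's default netCDF4 backend. Install first: pip install h5netcdf
import xarray as xr

ds = xr.open_dataset('forcing_file.nc', engine='h5netcdf')  # hypothetical filename
```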
-
You solved the problem! Thank you so much!! Memory usage rises to 2 GB and then remains stable; what a sight for sore eyes!
-
As the netCDF file is written with time as the open/unlimited dimension, it is not strictly CF-compliant as-is. The file is therefore reorganized at the end of the run, which naturally consumes a lot of memory. For very long simulations with many elements, this step may crash. But the output file is still usable, just not strictly CF-compliant.
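A small sketch for checking this on an output file, using the netCDF4 package; the filename is a placeholder:

```python
# Check whether the time dimension of an output file is unlimited;
# 'out.nc' is a placeholder filename.
from netCDF4 import Dataset

with Dataset('out.nc') as nc:
    print(nc.dimensions['time'].isunlimited())
```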
-
Hi @knutfrode, thanks for explaining the memory spike at the end of OpenDrift runs. That makes sense! It also gave me a chance to look into CF compliance. (Sorry for the delayed response; just back from spring break.)
-
Hello,
I'm experiencing runaway memory usage, which kills runs once they reach the maximum available memory on my system.
I've made some minor customizations to a model and a reader, and I'm confident that I have not introduced any variables that grow continuously throughout the run (i.e., there is no list appending happening, etc.).
I'm plotting memory usage of my runs, and it's ever-increasing in a regular staircase pattern. I am using daily model output; if I run for 10 days, I see 10 steps in the plot of memory usage. So, there seems to be a connection between when a new model output file is loaded and when memory usage jumps. The jumps are all roughly the same size throughout the run.
I installed OpenDrift about three months ago, following the instructions on the main webpage (https://opendrift.github.io/install.html), so I think my version of OpenDrift is somewhat up-to-date.
What could be causing this ever-increasing memory usage?
I'm looking into the "run" method to see how the history object is modified. It seems like it's initialized with a fixed size and then flushed when the number of model (or output?) time steps equals <export_buffer_length>. I haven't yet found anything in the source code that would keep growing throughout the run, but I continue to look.
I've also run with various configurations: adjusting <export_buffer_length>, running with default settings (i.e., only specifying an output file), and adjusting the model and output time steps, as sketched below. The staircase pattern of memory usage does not change.
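Roughly, the variations I tried look like this; values are illustrative placeholders, and each line stands for a separate run:

```python
# Illustrative configuration variations (each a separate run; values are
# placeholders, not the exact ones used):
o.run(outfile='out.nc')                           # defaults only
o.run(outfile='out.nc', export_buffer_length=10)  # smaller flush buffer
o.run(outfile='out.nc', time_step=1800,
      time_step_output=3600)                      # different time steps
```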
Please advise,
Thank you,
Bruce