VCS and template object clean up to aid iteration speed and memory footprint #242
@durack1 this is misleading,
@doutriaux1 feel free to edit the top #242 (comment) above; I was just trying to set down succinctly what the problem was and what I thought was a reasonable solution. My aim was to reduce the repetitive object-deletion tweaking (see below) and the other kludges, to make the software more user friendly. FYI, my code issues explicit object-deletion calls every iteration; I would also note that a user cannot be expected to know that such calls need to be issued at all.
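As a rough illustration of the per-iteration cleanup pattern being described here (a hypothetical sketch with placeholder data and object names, not the original script):

```python
# Illustrative sketch only (not the original snippet): per-iteration cleanup
# of vcs objects so that they do not accumulate across a long plotting loop.
import numpy
import vcs

nsteps = 10                               # placeholder loop length
field = numpy.random.rand(180, 360)       # stand-in for the real data slice
x = vcs.init(bg=True)
for step in range(nsteps):
    iso = x.createisofill()               # a new graphics method every step
    tmpl = x.createtemplate()             # a new template every step
    x.plot(field, tmpl, iso)
    x.png('frame_%05d.png' % step)
    x.clear()
    # manual cleanup, repeated every iteration; without it the created
    # objects linger in the session and accumulate
    vcs.removeobject(iso)
    vcs.removeobject(tmpl)
    del iso, tmpl
```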
@downiec sounds good, I am also onsite tomorrow and this would certainly benefit from a screen demo. The script diagnoses the issue really well (plot timing starts at 3s and blows out to ~900+s after ~1000 loop steps), so it should be relatively easy to ascertain what the issue is and squash this long-standing bug once and for all.
@downiec that demo file for testing is at https://github.com/PCMDI/amipbcs/pull/22/files
@forsyth2 @muryanto1 this was the VCS issue that I was having problems with; @downiec ping.
@durack1 Did you try your test with vcs nightly?
@danlipsa the test was with cdat/label/v82 which pulled down
Are there other versions that have changes I should be looking for?
@danlipsa I just left my code running for weeks and on the 1858th loop, it's now taking 2027 seconds to plot and save a 500Kb PNG.
Wow! How long does it take for the first loop? Does the memory increase as well? How much?
@danlipsa 3 secs; the memory footprint goes from 1.3 -> 2.3 Gb, so that bug is solved, but there is still a MAJOR issue. If you take a peek at #242 (comment) and the issues linked in this issue, you'll see some of the logs. The latest run looks similar (I only have a couple of timesteps, as I didn't pipe the terminal output to a file).
The file being used here is found at https://github.com/PCMDI/amipbcs/pull/22/files#diff-db7ce91f4ea32772ae1ab4ad8df7e998; it depends on data accessed via the LLNL /p/ mounts. If you have access to an LLNL machine, you should be able to reproduce this easily.
@danlipsa if you don’t have access, I can hang the two (or four) input netcdf files off the LLNL ftp server.
@danlipsa my guess here is that the VCS-VTK implementation needs a tweak, as the python objects created just continue to accumulate and their handles remain intact.
@durack1 If you could put the netcdf files on an ftp server that would be great. Thanks.
@danlipsa there should be an archive there for you now. Just extract the archive, and run the script.
@durack1 I get that
I don't think it is that; maybe I cannot access this from outside.
@danlipsa I am outside, try connecting to ftp.llnl.gov first (user: anonymous; password: email address), then cd directly to the target directory. The configuration of the system is pretty weird.
@danlipsa any more luck?
OK, I am getting it right now. I am using an ftp client (FileZilla) to fetch it. Maybe the issue is that the setup does not allow listing a directory, only cd-ing in and getting a file.
@danlipsa yeah, I have raised queries about the config a number of times; they claim it's set up for heightened security, but in reality it rather limits the utility of having such a service when standard commands etc are blocked. Let me know if you have any problem running the example script. I think the demo should work as simply as extracting the archive to an empty directory, loading up a CDAT conda env and running the script.
@danlipsa any luck? And just to be complete, this is what the script is currently up to:
@durack1 Yes, I got the data. Thanks! I'll look at this as soon as I can. It may not be before the end of the year though, as I am busy with two other projects.
@danlipsa no problem. Just to make sure that you can reproduce this issue (as in a couple of months I will likely have forgotten the details): can you reproduce my experience that, over the first 1 to 100 steps of the loop, the time increases from ~1.3 to 7.4 s and the python object count grows from 178534 to 197982?
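For reference, per-step timing and object counts of this kind can be gathered with instrumentation along the following lines (a generic sketch, not the instrumentation from the original script):

```python
# Generic diagnostic sketch (not the original script): time each loop step,
# track peak memory, and count live Python objects per iteration.
import gc
import resource
import time

for step in range(1, 101):
    t0 = time.time()
    # ... create the vcs objects, plot, and write the PNG here ...
    elapsed = time.time() - t0
    # on Linux, ru_maxrss is reported in kilobytes
    max_mem_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0 ** 2
    n_objs = len(gc.get_objects())
    print('%05d Time: %06.3f secs; Max mem: %.3f GB PyObj#: %07d;'
          % (step, elapsed, max_mem_gb, n_objs))
```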
@danlipsa this is great; I've just managed to break the 3000 s limit, next stop 1 hr to generate a ~500 Kb png.
@durack1 Attached is your script, modified so that it does not seem to leak memory, and the time to save a PNG stays constant. I only tried the first 100 steps. The rule I followed is: every time there is an x.createXXX there should be a corresponding removeobject. EzTemplate is different: you have to remove all created templates as well as the template passed as a parameter. You were already doing all that, but the code was commented out and you were doing something similar by accessing vcs.elements directly - I think you missed some of the objects that way. I removed that and also removed all the del commands, as they should not be needed. Take a look and let me know how this works for you.
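In other words, the pairing rule looks roughly like the following sketch (the EzTemplate.Multi usage is an assumed API pattern, and the data and object names are purely illustrative):

```python
# Minimal sketch of the pairing rule: every create* call gets a matching
# removeobject, and the templates EzTemplate hands back are removed too.
import numpy
import vcs
import EzTemplate

x = vcs.init(bg=True)
field = numpy.random.rand(180, 360)

layout = EzTemplate.Multi(rows=1, columns=3)   # if a template is passed in
                                               # here, it must be removed too
iso = x.createisofill()
panel_templates = []
for col in range(3):
    t = layout.get(row=0, column=col)          # template for this panel
    panel_templates.append(t)
    x.plot(field, t, iso)
x.png('three_panel.png')
x.clear()

vcs.removeobject(iso)                          # pairs with createisofill
for t in panel_templates:                      # templates created by EzTemplate
    vcs.removeobject(t)
```

The point is simply that every object registered by a create call, or handed back by EzTemplate, is explicitly deregistered before the next iteration.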
user error 🤣
@doutriaux1 I’d suggest that if such issues cause a reasonably advanced user problems, then it’s a problem with the software and documentation, rather than user error. Defaults should be far cleaner and self-clearing, rather than accumulating at the rate they do.
@danlipsa thanks for this, I’ll test it out today and get back to you. Is there a way we can clean up the default behavior of EzTemplate so such issues don’t recur, i.e. so it cleans up after itself?
@durack1 I think the issue is that createXXX, as well as EzTemplate, stores the created graphics methods and templates in the global map vcs.elements. I never fully understood why that is the case. Is it only to access those graphics methods later on by their name? That does not seem that compelling to me. @doutriaux1? So I think we can change this behavior, but it is not a backward-compatible change and also a bigger change. The way it is implemented now is consistent, but may be surprising to python users not accustomed to deleting objects. I do think we have to improve the documentation on this.
@danlipsa yes, it is so that you can refer to these objects by name (and save them in the init attributes json file).
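For context, that name-based lookup works along these lines (a short sketch; the 'my_diffs' name is just an example):

```python
# Sketch of name-based access: created objects are registered in the global
# vcs.elements map and can be fetched later by name.
import vcs

x = vcs.init(bg=True)
iso = x.createisofill('my_diffs')              # registered under 'my_diffs'
print('my_diffs' in vcs.elements['isofill'])   # True: it lives in the registry
same = vcs.getisofill('my_diffs')              # retrieved by name later on
vcs.removeobject(iso)                          # drops it from the registry
```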
@danlipsa if my fading memory doesn't betray me, I think there's a vcs function to remove all objects not used on any canvas (search for something like that).
@danlipsa one should probably implement that.
@doutriaux1 Yes, that is a good idea.
@danlipsa @doutriaux1 I would vote for the more logical behavior rather than maintaining backward compatibility; it seems to me that if there are default behaviors that trip a user up, then these should be changed for ease of use and logic, rather than preserved. On this subject, I was curious about the object use; 180k objects seems extreme to me, and the example below shows how many (redundant?) objects are created:
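A generic way to produce such a count (illustrative only, not the original snippet) is to compare the live object totals around an import:

```python
# Illustrative only: count live Python objects before and after an import
# to see how many objects the import itself creates.
import gc

before = len(gc.get_objects())
import vcs                                     # module whose import is measured
after = len(gc.get_objects())
print('objects before: %d, after: %d, added: %d'
      % (before, after, after - before))
```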
For comparison, I was curious about
@danlipsa I also note that there is still the active line x.backend.renWin = None in the script.
@danlipsa I just tried your script and can report that it solves the problem; here are the first 10 and last 10 timesteps over the first loop (running make_newVsOldDiffs_test_dan.py from ~/tmp):
tosList: tos_input4MIPs_SSTsAndSeaIce_CMIP_PCMDI-AMIP-1-1-6_gn_187001-201812.nc
sicbcList2: siconcbcs_input4MIPs_SSTsAndSeaIce_CMIP_PCMDI-AMIP-1-1-5_gn_187001-201806.nc
UV-CDAT version: 8.1
UV-CDAT prefix: ~anaconda3/envs/cdat82MesaPy3
Background graphics: True
donotstoredisplay: True
00001 processing: 1870-01 sic Time: 01.408 secs; Max mem: 1.766 GB PyObj#: 0183591;
00002 processing: 1870-02 sic Time: 01.266 secs; Max mem: 1.766 GB PyObj#: 0183550;
00003 processing: 1870-03 sic Time: 01.228 secs; Max mem: 1.766 GB PyObj#: 0183590;
00004 processing: 1870-04 sic Time: 01.269 secs; Max mem: 1.766 GB PyObj#: 0183590;
00005 processing: 1870-05 sic Time: 01.153 secs; Max mem: 1.766 GB PyObj#: 0183590;
00006 processing: 1870-06 sic Time: 01.141 secs; Max mem: 1.766 GB PyObj#: 0183590;
00007 processing: 1870-07 sic Time: 01.149 secs; Max mem: 1.766 GB PyObj#: 0183590;
00008 processing: 1870-08 sic Time: 01.184 secs; Max mem: 1.766 GB PyObj#: 0183590;
00009 processing: 1870-09 sic Time: 01.203 secs; Max mem: 1.766 GB PyObj#: 0183590;
00010 processing: 1870-10 sic Time: 01.165 secs; Max mem: 1.766 GB PyObj#: 0183590;
...
03543 processing: 2017-03 sic bc Time: 01.157 secs; Max mem: 2.351 GB PyObj#: 0178489;
03544 processing: 2017-04 sic bc Time: 01.153 secs; Max mem: 2.351 GB PyObj#: 0178489;
03545 processing: 2017-05 sic bc Time: 01.146 secs; Max mem: 2.351 GB PyObj#: 0178489;
03546 processing: 2017-06 sic bc Time: 01.154 secs; Max mem: 2.351 GB PyObj#: 0178489;
03547 processing: 2017-07 sic bc Time: 01.169 secs; Max mem: 2.351 GB PyObj#: 0178489;
03548 processing: 2017-08 sic bc Time: 01.249 secs; Max mem: 2.351 GB PyObj#: 0178489;
03549 processing: 2017-09 sic bc Time: 01.155 secs; Max mem: 2.351 GB PyObj#: 0178489;
03550 processing: 2017-10 sic bc Time: 01.155 secs; Max mem: 2.351 GB PyObj#: 0178489;
03551 processing: 2017-11 sic bc Time: 01.144 secs; Max mem: 2.351 GB PyObj#: 0178489;
03552 processing: 2017-12 sic bc Time: 01.153 secs; Max mem: 2.351 GB PyObj#: 0178489;
It weirds me out a little that the object count jumps in steps now and then.
@danlipsa @doutriaux1 to be a little clearer: I personally find it confusing that we have several overlapping cleanup functions.
Some of them should be removed (clean_auto_generated_objects); some of them should be private (removeP, onclosing).
@durack1 Seems like x.backend.renWin = None is not needed; I think that issue and workaround applied only to the old OpenGL1 backend. We have now ported everything to the new OpenGL 3.2 backend.
I think the reason for this is the default graphics methods and templates that are loaded from a json file at initialization.
@durack1 Can you close this issue then?
There have been a number of issues that describe code slowdowns and considerable memory growth when generating PNGs in loops. Some of the problems are highlighted in these issues:
#241
#237
#236
#235
PCMDI/amipbcs#10
CDAT/cdat#1424
CDAT/cdat#1397
It appears the primary problem is that vcs and EzTemplate objects are being created and never deleted; over ~8000 iterations this can lead to a huge memory footprint (~1 -> 100 Gb) and a very marked slowdown (~3 to 2000 s) in the time taken to generate a single 3-panel png output file.
There are a number of ways to resolve this, and I believe the vcs.removeobject function should be augmented to give a user the ability to purge all vcs (and EzTemplate, or all used graphics) objects in the current session with a single command (see the sketch below). This example function would purge ALL objects, even if there are current handles to them (the force=True option). It would be useful to think about other use cases and then incorporate these into the single function, rather than having multiple functions that do variants of the same thing. Oh yes, and documentation of this functionality will be key.
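As a purely hypothetical sketch of the kind of purge helper being proposed (purge_vcs_objects and its force= handling are illustrative, not an existing vcs API):

```python
# Hypothetical sketch of the proposed purge behaviour -- not an existing vcs
# API.  It walks the global vcs.elements registry and removes every
# user-created object, keeping the defaults loaded at initialization.
import vcs

def purge_vcs_objects(force=False):
    """Remove user-created vcs objects from the current session.

    force=True means objects are dropped even if the caller still holds
    handles to them (in this sketch: drop the registry entry regardless).
    """
    for kind, registry in vcs.elements.items():
        for name in list(registry):
            if name.startswith('default'):     # keep built-in defaults
                continue
            try:
                vcs.removeobject(registry[name])
            except Exception:
                if force:
                    registry.pop(name, None)   # drop the entry anyway
```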
@lee1043 @doutriaux1 @danlipsa pinging you all here