
Ostrich process hangs for larger calibrations #219

Closed
richardarsenault opened this issue Mar 18, 2020 · 18 comments

@richardarsenault
Contributor

richardarsenault commented Mar 18, 2020

Ostrich seems to hang if we ask for a larger number of model evaluations. For example:

Calibrating on 100 model evaluations = 38 seconds.
Calibrating on 1000 model evaluations = 440 seconds.
Calibrating on 10000 model evaluations = still incomplete (no error, just running) after 16 hours.

This leads me to believe that some configuration is limiting the duration of processes, perhaps a timeout of some sort?

@huard
Contributor

huard commented Mar 18, 2020

Are you using `progress=True`?

@huard
Contributor

huard commented Mar 18, 2020

Synchronous mode
    The client sends the `Execute` request to the server and waits, with the
    server connection open, until the process has finished and the final
    response is returned. This is useful for fast calculations that do not take
    longer than a couple of seconds (the `Apache2 httpd server uses 300 seconds <https://httpd.apache.org/docs/2.4/mod/core.html#timeout>`_ as the default value for its connection timeout).

Asynchronous mode
    The client sends the `Execute` request with an explicit request for
    asynchronous mode. If supported by the process (in PyWPS, we have a
    configuration option for that), the server immediately returns a
    `ProcessAccepted` response with a URL where the client can regularly check
    the *process execution status*.
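The asynchronous mode described above boils down to a poll loop on the status URL. Here is a minimal, self-contained sketch of that loop; the `fetch_status` stub and the URL are hypothetical stand-ins, since a real client (e.g. birdy, where `progress=True` enables this mode) would do an HTTP GET and parse the status out of the returned XML document:

```python
import time
from itertools import chain, repeat

def wait_for_completion(fetch_status, status_url, poll_interval=0.01, max_polls=1000):
    """Poll a WPS status document until the process reaches a final state.

    `fetch_status` stands in for an HTTP GET of the status URL; a real
    client would parse ProcessAccepted/ProcessStarted/ProcessSucceeded
    out of the returned XML status document.
    """
    for _ in range(max_polls):
        status = fetch_status(status_url)
        if status in ("ProcessSucceeded", "ProcessFailed"):
            return status
        time.sleep(poll_interval)  # no server connection held open between checks
    raise TimeoutError("process did not finish within max_polls checks")

# Stubbed server-side lifecycle: accepted, started, then succeeded from then on.
_lifecycle = chain(["ProcessAccepted", "ProcessStarted"], repeat("ProcessSucceeded"))
final = wait_for_completion(lambda url: next(_lifecycle),
                            "https://example.org/wps/status/xyz.xml")
print(final)  # ProcessSucceeded
```

Because the connection is only open for each short status check, a long-running process never trips the 300-second connection timeout that can kill a synchronous request.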

@huard
Contributor

huard commented Mar 18, 2020

This should be better documented in our docs.

@richardarsenault
Contributor Author

Hmmm, I thought I did in all my notebooks; it turns out this one did not. I'll relaunch with a larger size and keep you posted. Thanks!

@richardarsenault
Contributor Author

I just get this error in the notebook when I try to run the process with `progress=True`:

The save operation succeeded, but the notebook does not appear to be valid. The validation error was:

Notebook validation failed: {'version_major': 2, 'version_minor': 0, 'model_id': '817877db674542ad8586f18de20711ba'} is not valid under any of the given schemas:
{
"version_major": 2,
"version_minor": 0,
"model_id": "817877db674542ad8586f18de20711ba"
}

@richardarsenault
Contributor Author

After running the code on the Ouranos JupyterLab instance (pavics.ouranos.ca/jupyter) with `progress=True`, OSTRICH behaves the same way: it does not crash, but it has been running for over 12 hours with no response, whereas I expected the run to take about 1 h 20 min.

@huard
Contributor

huard commented Mar 19, 2020

OK. Could you confirm that Raven works if you run it on your own machine (from the terminal, with no Python wrapper or WPS server)?

@julemai
Collaborator

julemai commented Mar 19, 2020

Hi @huard. I was testing Richard's setup on my machine. For a 4747-day simulation, OSTRICH took the following with different budgets:

budget 100 iterations: 10 s
budget 1000 iterations: 78 s
budget 10000 iterations: 8123.419 s

I will contact Shawn Matott (the developer of OSTRICH) to ask whether he has an idea why the runtime is not linear.
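The non-linearity is easy to quantify from the timings above. A quick check (the `scaling_exponent` helper below is mine, not part of OSTRICH) fits an empirical exponent p with t ∝ n^p between consecutive budgets; the 100 → 1000 step is roughly linear, while the 1000 → 10000 step is roughly quadratic:

```python
import math

# Timings reported above: (budget in iterations, wall time in seconds)
timings = [(100, 10.0), (1000, 78.0), (10000, 8123.419)]

def scaling_exponent(n1, t1, n2, t2):
    """Empirical exponent p such that t ~ n**p between two budgets."""
    return math.log(t2 / t1) / math.log(n2 / n1)

p_low = scaling_exponent(*timings[0], *timings[1])   # 100 -> 1000
p_high = scaling_exponent(*timings[1], *timings[2])  # 1000 -> 10000
print(round(p_low, 2), round(p_high, 2))  # 0.89 2.02
```

An exponent near 2 would be consistent with some per-iteration cost that grows with the number of previous runs, e.g. bookkeeping over the full run history.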

@richardarsenault
Contributor Author

OK, glad to see I'm not going crazy. Thanks for the info!

@julemai
Collaborator

julemai commented Mar 19, 2020

Yeah, I'm sorry about that. I normally don't use such large budgets and never realized. I am guessing it has something to do with OSTRICH's increased memory allocation to hold all the statistics etc. of the previous runs. But let's see what Shawn says. I just sent out the email with the runtime stats and the example setup. :)

@richardarsenault
Contributor Author

Follow-up: it would seem that the hanging also affects other birds that require long run times.

@huard
Contributor

huard commented Apr 6, 2021

There is a known issue with PyWPS queue management. I'm hoping to make some progress on this front over the coming months.
@julemai Any news from Shawn?

@julemai
Collaborator

julemai commented Apr 6, 2021

I think this actually has nothing to do with OSTRICH or Raven. Didn't we find that it is hanging in the WPS?

@richardarsenault
Contributor Author

I think the comment here refers to the Raven build that includes DDS internally, so we can speed up calibrations considerably and avoid this problem altogether.

@julemai
Collaborator

julemai commented Apr 6, 2021

Ok. James has implemented DDS functionality in Raven.

But:

  • it only works for a subset of parameters we want to calibrate
  • it is only faster if ALL data can be read at once (no NetCDF chunks)

Whether we really want to make use of this is, I think, a longer discussion, since every calibration setup would then need to distinguish between "Is Raven doing the calibration internally?" and "Is Ostrich doing the calibration?"

The runtime of ALL Raven runs, and hence also of calibration runs, can be improved significantly when the input data (forcings) are aggregated from the grid to HRU aggregates using aggregate-forcings-to-hrus, as described here.

@huard
Contributor

huard commented Apr 6, 2021

Thanks for the update. I suggest we close this issue here, since the PyWPS problem is described elsewhere.

@richardarsenault
Contributor Author

@huard Can you link to the PyWPS issue for posterity please? Then we can close this one. Thanks!

@huard
Contributor

huard commented Aug 13, 2021

geopython/pywps#600

@huard huard closed this as completed Aug 13, 2021