
Ostrich process hangs for larger calibrations #219

Closed
richardarsenault opened this issue Mar 18, 2020 · 18 comments

@richardarsenault
Contributor

richardarsenault commented Mar 18, 2020

Ostrich seems to hang if we ask for a larger number of model evaluations. For example:

Calibrating on 100 model evaluations = 38 seconds.
Calibrating on 1000 model evaluations = 440 seconds.
Calibrating on 10000 model evaluations = still incomplete (no error, just running) after 16 hours.

This leads me to believe that some configuration is limiting the duration of processes, perhaps a timeout of some sort?

@huard
Contributor

huard commented Mar 18, 2020

Are you using `progress=True`?

@huard
Contributor

huard commented Mar 18, 2020

Synchronous mode
    The client sends the `Execute` request to the server and waits, with the
    server connection open, until the process has finished and the final
    response is returned. This is useful for fast calculations that do not take
    longer than a couple of seconds (the `Apache2 httpd server uses 300 seconds <https://httpd.apache.org/docs/2.4/mod/core.html#timeout>`_ as the default value for its connection timeout).

Asynchronous mode
    The client sends the `Execute` request with an explicit request for
    asynchronous mode. If supported by the process (in PyWPS, we have a
    configuration option for that), the server immediately returns a
    `ProcessAccepted` response with a URL where the client can regularly check
    the *process execution status*.
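The asynchronous mode described above boils down to a poll loop on the status URL. Here is a minimal, self-contained sketch of that loop; the `fetch_status` stub and the URL are hypothetical stand-ins, since a real client (e.g. birdy, where `progress=True` enables this mode) would do an HTTP GET and parse the status out of the returned XML document:

```python
import time
from itertools import chain, repeat

def wait_for_completion(fetch_status, status_url, poll_interval=0.01, max_polls=1000):
    """Poll a WPS status document until the process reaches a final state.

    `fetch_status` stands in for an HTTP GET of the status URL; a real
    client would parse ProcessAccepted/ProcessStarted/ProcessSucceeded
    out of the returned XML status document.
    """
    for _ in range(max_polls):
        status = fetch_status(status_url)
        if status in ("ProcessSucceeded", "ProcessFailed"):
            return status
        time.sleep(poll_interval)  # no server connection held open between checks
    raise TimeoutError("process did not finish within max_polls checks")

# Stubbed server-side lifecycle: accepted, started, then succeeded from then on.
_lifecycle = chain(["ProcessAccepted", "ProcessStarted"], repeat("ProcessSucceeded"))
final = wait_for_completion(lambda url: next(_lifecycle),
                            "https://example.org/wps/status/xyz.xml")
print(final)  # ProcessSucceeded
```

Because the connection is only open for each short status check, a long-running process never trips the 300-second connection timeout that can kill a synchronous request.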

@huard
Contributor

huard commented Mar 18, 2020

This should be better documented in our docs.

@richardarsenault
Contributor Author

Hmmm, I thought I did in all my notebooks; it turns out this one did not. I'll relaunch with a larger size and keep you posted. Thanks!

@richardarsenault
Contributor Author

I just get this error in the notebook when I try to run the process with `progress=True`:

The save operation succeeded, but the notebook does not appear to be valid. The validation error was:

Notebook validation failed: {'version_major': 2, 'version_minor': 0, 'model_id': '817877db674542ad8586f18de20711ba'} is not valid under any of the given schemas:
{
"version_major": 2,
"version_minor": 0,
"model_id": "817877db674542ad8586f18de20711ba"
}

@richardarsenault
Contributor Author

After running the code on the Ouranos JupyterLab instance (pavics.ouranos.ca/jupyter) with `progress=True`, OSTRICH behaves the same way: it does not crash, but it has been running for over 12 hours with no response, whereas I expected the run to take about 1 h 20 min.

@huard
Contributor

huard commented Mar 19, 2020

OK. Could you confirm that Raven works if you run it on your own machine (from the terminal, with no Python wrapper or WPS server)?

@julemai
Collaborator

julemai commented Mar 19, 2020

Hi @huard. I was testing Richard's setup on my machine. For a 4747-day simulation, OSTRICH took the following with different budgets:

budget 100 iterations: 10 s
budget 1000 iterations: 78 s
budget 10000 iterations: 8123.419 s

I will contact Shawn Matott (the developer of OSTRICH) to ask whether he has an idea why the runtime is not linear.
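The non-linearity is easy to quantify from the timings above. A quick check (the `scaling_exponent` helper below is mine, not part of OSTRICH) fits an empirical exponent p with t ∝ n^p between consecutive budgets; the 100 → 1000 step is roughly linear, while the 1000 → 10000 step is roughly quadratic:

```python
import math

# Timings reported above: (budget in iterations, wall time in seconds)
timings = [(100, 10.0), (1000, 78.0), (10000, 8123.419)]

def scaling_exponent(n1, t1, n2, t2):
    """Empirical exponent p such that t ~ n**p between two budgets."""
    return math.log(t2 / t1) / math.log(n2 / n1)

p_low = scaling_exponent(*timings[0], *timings[1])   # 100 -> 1000
p_high = scaling_exponent(*timings[1], *timings[2])  # 1000 -> 10000
print(round(p_low, 2), round(p_high, 2))  # 0.89 2.02
```

An exponent near 2 would be consistent with some per-iteration cost that grows with the number of previous runs, e.g. bookkeeping over the full run history.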

@richardarsenault
Contributor Author

OK, glad to see I'm not going crazy. Thanks for the info!

@julemai
Collaborator

julemai commented Mar 19, 2020

Yeah, I'm sorry about that. I normally don't use such large budgets and never realized. I am guessing it has something to do with OSTRICH's increased memory allocation to hold all the statistics etc. of the previous runs. But let's see what Shawn says. I just sent out the email with the runtime stats and the example setup. :)

@richardarsenault
Contributor Author

Follow-up: it would seem that the hanging also affects other birds that require long run times.

@huard
Contributor

huard commented Apr 6, 2021

There is a known issue with PyWPS queue management. I'm hoping to make some progress on this front over the coming months.
@julemai Any news from Shawn?

@julemai
Collaborator

julemai commented Apr 6, 2021

I think this actually has nothing to do with OSTRICH or Raven. Didn't we find that it is hanging in the WPS?

@richardarsenault
Contributor Author

I think the comment here refers to the Raven build that includes DDS internally, so we can speed up calibrations considerably and avoid this problem altogether.

@julemai
Collaborator

julemai commented Apr 6, 2021

Ok. James has implemented DDS functionality in Raven.

But:

  • it only works for a subset of parameters we want to calibrate
  • it is only faster if ALL data can be read at once (no NetCDF chunks)

Whether we really want to make use of this is, I think, a longer discussion, since every calibration setup would then need to distinguish between "Is Raven doing the calibration internally?" and "Is Ostrich doing the calibration?"

The runtime of ALL Raven runs, and hence also of calibration runs, can be improved significantly when the input data (forcings) are aggregated from the grid to HRU aggregates using aggregate-forcings-to-hrus, as described here.

@huard
Contributor

huard commented Apr 6, 2021

Thanks for the update. I suggest we close this issue here, since the PyWPS problem is described elsewhere.

@richardarsenault
Contributor Author

@huard Can you link to the PyWPS issue for posterity please? Then we can close this one. Thanks!

@huard
Contributor

huard commented Aug 13, 2021

geopython/pywps#600

@huard huard closed this as completed Aug 13, 2021