Improve performance #48

masawdah · 2023-08-08T15:10:35Z

Thank you for providing this valuable package; we are currently integrating it into our workflow. We would like to suggest incorporating the following functionalities and improvements to enhance its speed and efficiency, particularly for larger datasets. These enhancements would also enable the package to handle NaNs values effectively and provide support for various statistics.

Introduce an automatic ESMF environment setup upon importing the 'xagg' package. This implementation would mitigate the occurrence of ImportError related to ESMF.
Optimize the 'subset_find' function by implementing a dictionary comprehension method to identify common subsets. This approach would significantly enhance performance, especially when dealing with big datasets, reducing processing time from hours to few seconds.
Expand the statistical capabilities by incorporating additional stats such as median, sum, etc.
Enhance NaNs handling by introducing interpolation techniques for filling missing values.

Overall, these proposed functionalities are aimed at augmenting the package's efficiency, allowing it to handle big datasets with greater speed and accuracy while providing a broader range of statistical capabilities.

Let us know if you would like to merge the proposed functionalities.

ks905383 · 2023-08-08T21:55:30Z

Thanks for the PR and your work on this! I'm especially excited for the performance updates, and I'm glad that this package is useful to folks.

Just a couple of quick questions on my first read through the proposed changes:

New statistical capabilities

could you confirm what the count evaluation does?
are the median / sum weighted the way the mean is?
I'm trying to think whether including non-weighted statistics is a good idea in terms of the legibility of the package (since the primary benefit over, e.g., rasterstats is the weighting functionality) - but happy to hear your thoughts on this!
Tagging Option to return sum instead of mean? #15 here as well for the discussion on the preservation of the sum over weighted overlaps...

Enhanced NaNs

What would you say is the main benefit from folding this into xagg itself, instead of encouraging folks to gap-fill the raster dataset on their own before sticking into xagg? I'm just worried, since gap-filling and interpolation is such a complex and variable-dependent issue (that has its own packages and ecosystems out there), that having one simple built-in may not be as useful or cause some unneeded dependency issues down the line (or encourage bad practices if it can be too easily outsourced to a subfunction here).

Introduce an automatic ESMF environment setup upon importing the 'xagg' package. This implementation would mitigate the occurrence of ImportError related to ESMF.

That's great! From the discussion in #47 I think it would still be good to better isolate ESMF to only when it's needed (and only import it if regridding is actually used) - but this is definitely a good fix until then, and probably after as well.

Also, you should be able to have PRs run through the tests now. I think the reason they've failed is the additional variable names from the new statistical options. That's probably another thing to discuss - changing the way that variable names are output is going to affect existing workflows, so maybe there's a way to adapt that through the existing structure - maybe keeping the old workflow of running one statistical calculation per xa.aggregate() run? (Though I understand why you'd want the ease of multiple calculations easily put into the same output).

Once we've converged on what's best to include, we can work on writing new tests for the new capabilities.

That's my first thoughts on this - I'm about to be a bit hard to reach for a few days, but I'll keep thinking about this.

masawdah · 2023-08-10T13:49:59Z

Thanks for your valid points. We would like to clarify the following:

statistical capabilities

You are right. We focused on non-weighted statistics as that what we need in our workflow. However, it would certainly be advantageous to expand it to have both options, so the user does not need to use many libraries. We can add the following options:

SUPPORTED_STATISTICS_METHODS = ["weighted_mean", "weighted_sum","weighted_median", "mean", “sum”, "max", "min", "sum", "median", "count", "std"]

We’ve already now added "weighted_sum","weighted_median" in addition to the other statistics.

Output structure

We have now adapted the code to generate output data in the new structure, including all relevant statistics and their corresponding meaningful names. We need to adjust the test code to ensure that the pull request (PR) can pass through it. We did the test locally and it passed it successfully. It is similar to the previous approach, the test is still using a single statistic as before which is the default option.

Enhanced NaNs

You are correct; indeed gap-filling is not a simple issue to handle. However, we have approached this issue from an operational standpoint, aiming to provide end users with a rapid overview of potential results through gap-filling. Integrate the package into a backend process like openEO or any other EO platform, it becomes advantageous to offer users the choice of conducting a basic gap-filling procedure (which could be meaningful with densely time-series datasets) or handle it outside the package. Also, it is important from the educational aspect as an option can also serve as a valuable learning tool for users who are still in the process of developing their understanding of gap-filling techniques.

The PR won’t pass the test for now until adjust the test code. So we’ve added a notebook shows the test locally.

let us know what you think about that.

masawdah · 2023-08-11T08:31:37Z

Hi again, We've made some modifications to the code. Now, the default option for calculations is set to weighted_mean, which is in alignment with the current output workflow. Additionally, we've included a test_core and test_export in the notebook. After modifying the attribute names to be consistent with 'weighted_mean', the code successfully passed all tests.

carolinegoehner · 2023-08-11T16:43:00Z

Hello, we realised that the weighted median could not be calculated for a single time image, so we changed the function accordingly. Regarding the count, it is dividing the sum of the pixel area covered by the polygon by the maximum pixel size to get an accurate count of covered pixels in float.
We are sorry for these many comments in the last two days. As we are not in the office for the next week, we improved as much as we could. Take your time to review our additions! We are happy to answer any questions and comments as soon as we´re back in office!

masawdah · 2023-08-29T13:05:43Z

Hi @ks905383 , I hope you are doing well. I understand that everyone's schedule can be quite busy, and appreciate your time. If you've had a chance to review our changes to the PR, we're happy to hear your thoughts and feedback.

Thanks :)

bhmiel-cdphe · 2024-10-29T22:02:14Z

Hello, wanted to comment on here about the status of this branch. I'm finding myself in the position of needing to use xagg but "sum" the values of the raster within the shapefile regions (accounting for partial overlap), not calculate a weighted average. Is this feature available, at least in code somewhere?

masawdah and others added 10 commits July 24, 2023 12:31

esmf setup

96455fe

skipna for aggregation

b1f6402

notebook

2a22a51

fix & test the esmf setup

da1aef6

Added more statistics for aggregate

3d2e838

assert statement for supported stats

6b0fbc8

Added more aggregation statistics

d9495da

Added more aggregation statistics

6ad653f

Aggregation statistics example using openEO data

b34088a

clean

726c5a8

masawdah added 4 commits August 10, 2023 12:27

Handle the output workflow and weighted sum

4956754

the output workflow & the test code

1818b6c

add weighted median

99e8ed2

run the test locally

a1693d5

masawdah added 2 commits August 10, 2023 15:51

clean the code

6218cc6

default options and run test core

f34c1e8

masawdah added 2 commits August 11, 2023 18:32

update the stats

003d135

support one timestep

db55aca

clausmichele mentioned this pull request Aug 22, 2023

Roadmap towards UC8 Open-EO/openeo-processes-dask#124

Closed

fix exporting nans

92f6854

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance #48

Improve performance #48

masawdah commented Aug 8, 2023

ks905383 commented Aug 8, 2023

masawdah commented Aug 10, 2023 •

edited

Loading

masawdah commented Aug 11, 2023

carolinegoehner commented Aug 11, 2023

masawdah commented Aug 29, 2023

bhmiel-cdphe commented Oct 29, 2024

Improve performance #48

Are you sure you want to change the base?

Improve performance #48

Conversation

masawdah commented Aug 8, 2023

ks905383 commented Aug 8, 2023

masawdah commented Aug 10, 2023 • edited Loading

masawdah commented Aug 11, 2023

carolinegoehner commented Aug 11, 2023

masawdah commented Aug 29, 2023

bhmiel-cdphe commented Oct 29, 2024

masawdah commented Aug 10, 2023 •

edited

Loading