
perf: vectorize yield uncertainty calculation #316

Merged
merged 4 commits into master from perf/yield-uncertainty-vectorization on Feb 11, 2022

Conversation

alexander-held
Member

@alexander-held alexander-held commented Jan 29, 2022

This vectorizes the previously loop-based calculations when determining yield uncertainties. Thanks a lot to @agoose77 for the invaluable help!

For a moderately complex workspace with roughly 20 samples in 50 bins and 200 (non-expanded) parameters, the runtime of the diagonal contributions goes from 0.7 s to below 0.01 s (including setting up an array that is re-used for off-diagonal terms). The off-diagonal contribution goes from 180 s to 0.4 s.

After these changes, the bottleneck in yield_stdev is now the initial setup of the yields for all variations (4 s in the above example). There is probably some room to optimize this a bit further as well (example in #315 (comment)). This topic will be tracked in #325.
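The core of the speedup can be sketched in plain numpy (hypothetical sizes and random inputs, not the actual cabinetry code): the double loop over parameter pairs collapses into a single einsum over the variation matrix and the parameter correlations.

```python
import numpy as np

rng = np.random.default_rng(0)

n_par, n_bins = 50, 50  # hypothetical sizes
v = rng.normal(size=(n_par, n_bins))  # per-parameter yield variations
M = rng.normal(size=(n_par, n_par))
M = (M + M.T) / 2  # symmetrize to mimic a correlation matrix

# old approach: explicit double loop over parameter pairs
total_loop = np.zeros(n_bins)
for i in range(n_par):
    for j in range(n_par):
        total_loop += v[i] * v[j] * M[i, j]

# vectorized equivalent: one einsum over both parameter indices
total_vec = np.einsum("ib,jb,ij->b", v, v, M)

assert np.allclose(total_loop, total_vec)
```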

The vectorization of the off-diagonal contribution does not take advantage of two things that were previously used: the symmetry of the correlation matrix, and the fact that staterror modifiers are orthogonal in their yield effects on bins (but not on channels, see #323). Given the performance improvement, this may not be crucial anymore, but perhaps there is a way to also take advantage of that in the future.
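For reference, the symmetry trick mentioned here can be illustrated with a small numpy sketch (hypothetical arrays, not cabinetry code): because the correlation matrix is symmetric, the off-diagonal sum only needs the pairs with i < j, doubled.

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.normal(size=(4, 6))  # 4 parameters, 6 bins (hypothetical)
M = rng.normal(size=(4, 4))
M = (M + M.T) / 2  # symmetric correlation-like matrix

# off-diagonal sum over all pairs i != j
M_off = M.copy()
np.fill_diagonal(M_off, 0.0)
full = np.einsum("ib,jb,ij->b", v, v, M_off)

# exploiting symmetry: sum each pair with i < j once and double the result
iu, ju = np.triu_indices(len(M), k=1)
half = 2 * np.sum(v[iu] * v[ju] * M[iu, ju][:, None], axis=0)

assert np.allclose(full, half)
```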

The implementation uses ak.flatten to sum over one dimension due to scikit-hep/awkward#1283. The subsequent summation with a three-dimensional array works fine. There is an assert in the implementation to ensure we are not affected by the upstream bug, and after an upstream fix this can be changed in cabinetry. This is tracked via #326.

resolves #315

* vectorized calculation of yield uncertainties
* speedup is two orders of magnitude for moderately complex model

@alexander-held alexander-held force-pushed the perf/yield-uncertainty-vectorization branch from ec7d227 to 4dc73f6 Compare February 2, 2022 17:56
@alexander-held alexander-held force-pushed the perf/yield-uncertainty-vectorization branch from 8fb5ea7 to 5fc7413 Compare February 9, 2022 11:57
@codecov

codecov bot commented Feb 9, 2022

Codecov Report

Merging #316 (16c071d) into master (ea3ee7d) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##            master      #316   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           23        23           
  Lines         1886      1878    -8     
  Branches       305       299    -6     
=========================================
- Hits          1886      1878    -8     
Impacted Files Coverage Δ
src/cabinetry/model_utils.py 100.00% <100.00%> (ø)


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@alexander-held
Member Author

Here is an example comparing three approaches: a numpy solution using matrix multiplications, the old loop-based approach, and the new approach supporting awkward arrays with additional depth:

import awkward as ak
import numpy as np


def calc(op):
    # helper: print the shape and value of the expression named by the string `op`
    print(f"{op} has shape {eval(op).shape}:")
    print(f"{eval(op)}\n")


# setup: 3 variations, 5 bin yields
v = np.asarray([[1, 2, 3, 4, 5], [3, 4, 5, 6, 7], [5, 6, 7, 8, 9]])
M = np.asarray([[1, 1.4, -0.3], [1.4, 1, 1.2], [-0.3, 1.2, 1]])


calc("v")
calc("v.T")
calc("M")
calc("v.T @ M @ v")
calc("np.diag(v.T @ M @ v)")

# loop-based calculation
res = 0
for i in range(3):
    for j in range(3):
        res += v[i] * v[j] * M[i][j]
print(f"loop-based calculation: {res}\n")

# variation with awkward and extra depth in v
# elements of v are now split e.g. by channel
v = ak.from_iter([[[1, 2, 3], [4, 5]], [[3, 4, 5], [6, 7]], [[5, 6, 7], [8, 9]]])
R = M[..., np.newaxis, np.newaxis] * v[np.newaxis, ...]
L = v[:, np.newaxis, ...] * R
print(f"awkward version with extra depth: {np.sum(ak.flatten(L, axis=1), axis=0)}")

output:

v has shape (3, 5):
[[1 2 3 4 5]
 [3 4 5 6 7]
 [5 6 7 8 9]]

v.T has shape (5, 3):
[[1 3 5]
 [2 4 6]
 [3 5 7]
 [4 6 8]
 [5 7 9]]

M has shape (3, 3):
[[ 1.   1.4 -0.3]
 [ 1.4  1.   1.2]
 [-0.3  1.2  1. ]]

v.T @ M @ v has shape (5, 5):
[[ 76.4  98.8 121.2 143.6 166. ]
 [ 98.8 128.8 158.8 188.8 218.8]
 [121.2 158.8 196.4 234.  271.6]
 [143.6 188.8 234.  279.2 324.4]
 [166.  218.8 271.6 324.4 377.2]]

np.diag(v.T @ M @ v) has shape (5,):
[ 76.4 128.8 196.4 279.2 377.2]

loop-based calculation: [ 76.4 128.8 196.4 279.2 377.2]

awkward version with extra depth: [[76.4, 129, 196], [279, 377]]

The awkward calculation matches the other two; the awkward result is simply printed with fewer digits.

@alexander-held
Member Author

@lhenkelm I updated the comments based on your feedback to try and clarify the calculation. If you get the chance to take a look, I'd be happy to incorporate any additional feedback; otherwise, from my side this looks ready.

The code uses ak.flatten instead of a sum due to scikit-hep/awkward#1283, which looks like it might be related to scikit-hep/awkward#1266. As far as I can tell, the current implementation is not affected by this bug for any of the potential shapes that can be encountered in the remaining sum.

@lhenkelm
Contributor

@alexander-held this looks good and clear to me now :)

@agoose77
Contributor

agoose77 commented Feb 10, 2022

@alexander-held maybe it's worth adding an assert with a message to catch the case where/if the assumption fails? :)

@alexander-held
Member Author

@agoose77 by "the assumption", do you mean that we only sum over an array with three or less dimensions (after flattening the original 4d array)? I cannot think of a way in which this would break, but it also does not hurt to have that in. I'll add an assert, and we can refactor and remove it in the future.

@agoose77
Contributor

agoose77 commented Feb 11, 2022

> @agoose77 by "the assumption", do you mean that we only sum over an array with three or less dimensions (after flattening the original 4d array)? I cannot think of a way in which this would break, but it also does not hurt to have that in. I'll add an assert, and we can refactor and remove it in the future.

Yes :) Although you have a 4D array right now, my thought process is that making this ==4D (before flatten) assumption explicit in an assert is safer than a comment longer term. Ideally you can remove this assert / flatten once we fix the bug upstream :)

@alexander-held
Member Author

I added the assert, #326 will track the task of changing this after an upstream fix.

Thanks to you both for the help with this PR!

@alexander-held alexander-held merged commit 67cc176 into master Feb 11, 2022
@alexander-held alexander-held deleted the perf/yield-uncertainty-vectorization branch February 11, 2022 11:58
Successfully merging this pull request may close these issues.

model_utils.yield_stdev is very slow