
perf: vectorize yield uncertainty calculation #316

Merged
merged 4 commits into master from perf/yield-uncertainty-vectorization on Feb 11, 2022

Conversation

alexander-held
Member

@alexander-held alexander-held commented Jan 29, 2022

This vectorizes the previously loop-based calculations when determining yield uncertainties. Thanks a lot to @agoose77 for the invaluable help!

For a moderately complex workspace with roughly 20 samples in 50 bins and 200 (non-expanded) parameters, the runtime of the diagonal contributions goes from 0.7 s to below 0.01 s (including setting up an array that is re-used for off-diagonal terms). The off-diagonal contribution goes from 180 s to 0.4 s.

After these changes, the bottleneck in yield_stdev is now the initial setup of the yields for all variations (4 s in the above example). There is probably some room to optimize this a bit further as well (example in #315 (comment)). This topic will be tracked in #325.
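The core of the speedup can be sketched in plain numpy (hypothetical sizes and random inputs, not the actual cabinetry code): the double loop over parameter pairs collapses into a single einsum over the variation matrix and the parameter correlations.

```python
import numpy as np

rng = np.random.default_rng(0)

n_par, n_bins = 50, 50  # hypothetical sizes
v = rng.normal(size=(n_par, n_bins))  # per-parameter yield variations
M = rng.normal(size=(n_par, n_par))
M = (M + M.T) / 2  # symmetrize to mimic a correlation matrix

# old approach: explicit double loop over parameter pairs
total_loop = np.zeros(n_bins)
for i in range(n_par):
    for j in range(n_par):
        total_loop += v[i] * v[j] * M[i, j]

# vectorized equivalent: one einsum over both parameter indices
total_vec = np.einsum("ib,jb,ij->b", v, v, M)

assert np.allclose(total_loop, total_vec)
```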

The vectorization of the off-diagonal contribution does not take advantage of two things that were previously used: the symmetry of the correlation matrix, and the fact that staterror modifiers are orthogonal in their yield effects on bins (but not on channels, see #323). Given the performance improvement, this may not be crucial anymore, but perhaps there is a way to also take advantage of that in the future.
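For reference, the symmetry trick mentioned here can be illustrated with a small numpy sketch (hypothetical arrays, not cabinetry code): because the correlation matrix is symmetric, the off-diagonal sum only needs the pairs with i < j, doubled.

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.normal(size=(4, 6))  # 4 parameters, 6 bins (hypothetical)
M = rng.normal(size=(4, 4))
M = (M + M.T) / 2  # symmetric correlation-like matrix

# off-diagonal sum over all pairs i != j
M_off = M.copy()
np.fill_diagonal(M_off, 0.0)
full = np.einsum("ib,jb,ij->b", v, v, M_off)

# exploiting symmetry: sum each pair with i < j once and double the result
iu, ju = np.triu_indices(len(M), k=1)
half = 2 * np.sum(v[iu] * v[ju] * M[iu, ju][:, None], axis=0)

assert np.allclose(full, half)
```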

The implementation uses ak.flatten to sum over one dimension due to scikit-hep/awkward#1283. The subsequent summation with a three-dimensional array works fine. There is an assert in the implementation to ensure we are not affected by the upstream bug, and after an upstream fix this can be changed in cabinetry. This is tracked via #326.

resolves #315

* vectorized calculation of yield uncertainties
* speedup is two orders of magnitude for moderately complex model

@alexander-held alexander-held force-pushed the perf/yield-uncertainty-vectorization branch from ec7d227 to 4dc73f6 Compare February 2, 2022 17:56
@alexander-held alexander-held force-pushed the perf/yield-uncertainty-vectorization branch from 8fb5ea7 to 5fc7413 Compare February 9, 2022 11:57
@codecov

codecov bot commented Feb 9, 2022

Codecov Report

Merging #316 (16c071d) into master (ea3ee7d) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##            master      #316   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           23        23           
  Lines         1886      1878    -8     
  Branches       305       299    -6     
=========================================
- Hits          1886      1878    -8     
Impacted Files Coverage Δ
src/cabinetry/model_utils.py 100.00% <100.00%> (ø)


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@alexander-held
Member Author

Here is an example comparing three approaches: a numpy solution using matrix multiplications, the old loop-based approach, and the new approach supporting awkward arrays with additional depth:

import awkward as ak
import numpy as np


def calc(op):
    # helper: print the shape and value of the expression named by the string `op`
    print(f"{op} has shape {eval(op).shape}:")
    print(f"{eval(op)}\n")


# setup: 3 variations, 5 bin yields
v = np.asarray([[1, 2, 3, 4, 5], [3, 4, 5, 6, 7], [5, 6, 7, 8, 9]])
M = np.asarray([[1, 1.4, -0.3], [1.4, 1, 1.2], [-0.3, 1.2, 1]])


calc("v")
calc("v.T")
calc("M")
calc("v.T @ M @ v")
calc("np.diag(v.T @ M @ v)")

# loop-based calculation
res = 0
for i in range(3):
    for j in range(3):
        res += v[i] * v[j] * M[i][j]
print(f"loop-based calculation: {res}\n")

# variation with awkward and extra depth in v
# elements of v are now split e.g. by channel
v = ak.from_iter([[[1, 2, 3], [4, 5]], [[3, 4, 5], [6, 7]], [[5, 6, 7], [8, 9]]])
R = M[..., np.newaxis, np.newaxis] * v[np.newaxis, ...]
L = v[:, np.newaxis, ...] * R
print(f"awkward version with extra depth: {np.sum(ak.flatten(L, axis=1), axis=0)}")

output:

v has shape (3, 5):
[[1 2 3 4 5]
 [3 4 5 6 7]
 [5 6 7 8 9]]

v.T has shape (5, 3):
[[1 3 5]
 [2 4 6]
 [3 5 7]
 [4 6 8]
 [5 7 9]]

M has shape (3, 3):
[[ 1.   1.4 -0.3]
 [ 1.4  1.   1.2]
 [-0.3  1.2  1. ]]

v.T @ M @ v has shape (5, 5):
[[ 76.4  98.8 121.2 143.6 166. ]
 [ 98.8 128.8 158.8 188.8 218.8]
 [121.2 158.8 196.4 234.  271.6]
 [143.6 188.8 234.  279.2 324.4]
 [166.  218.8 271.6 324.4 377.2]]

np.diag(v.T @ M @ v) has shape (5,):
[ 76.4 128.8 196.4 279.2 377.2]

loop-based calculation: [ 76.4 128.8 196.4 279.2 377.2]

awkward version with extra depth: [[76.4, 129, 196], [279, 377]]

The awkward calculation matches the other two; the awkward result is simply printed with fewer digits.

@alexander-held
Member Author

@lhenkelm I updated the comments based on your feedback to try and clarify the calculation. If you get the chance to take a look, I'd be happy to incorporate any additional feedback; otherwise, from my side this looks ready.

The code uses ak.flatten instead of a sum due to scikit-hep/awkward#1283, which looks like it might be related to scikit-hep/awkward#1266. As far as I can tell, the current implementation is not affected by this bug for any of the potential shapes that can be encountered in the remaining sum.

@lhenkelm
Contributor

@alexander-held this looks good and clear to me now :)

@agoose77
Contributor

agoose77 commented Feb 10, 2022

@alexander-held maybe it's worth adding an assert with a message to catch the case where/if the assumption fails? :)

@alexander-held
Member Author

@agoose77 by "the assumption", do you mean that we only sum over an array with three or less dimensions (after flattening the original 4d array)? I cannot think of a way in which this would break, but it also does not hurt to have that in. I'll add an assert, and we can refactor and remove it in the future.

@agoose77
Contributor

agoose77 commented Feb 11, 2022

> @agoose77 by "the assumption", do you mean that we only sum over an array with three or less dimensions (after flattening the original 4d array)? I cannot think of a way in which this would break, but it also does not hurt to have that in. I'll add an assert, and we can refactor and remove it in the future.

Yes :) Although you have a 4D array right now, my thought process is that making this ==4D (before flatten) assumption explicit in an assert is safer than a comment longer term. Ideally you can remove this assert / flatten once we fix the bug upstream :)

@alexander-held
Member Author

I added the assert, #326 will track the task of changing this after an upstream fix.

Thanks to you both for the help with this PR!

@alexander-held alexander-held merged commit 67cc176 into master Feb 11, 2022
@alexander-held alexander-held deleted the perf/yield-uncertainty-vectorization branch February 11, 2022 11:58
Successfully merging this pull request may close these issues.

model_utils.yield_stdev is very slow