User experience and performance improvements for pipeline demonstrator #64

alexander-held · 2022-04-27T13:24:51Z

This collects various user experience and performance-related aspects that the CMS Open Data pipeline demonstration at the AGC 2022 workshop revealed.

Completeness of pipeline

add a machine learning component (e.g. ttbar reconstruction), frequently requested and relevant for many analyses being done in practice

User experience

ServiceX+`coffea`

schema configuration with ServiceX processors in coffea: feat: support custom schema specification with ServiceX executor (scikit-hep/coffea#707)
naming transformations: Coffea Naming transformers when using ServiceX (ssl-hep/ServiceX#407)
understand differences between auto_schema and AGC schema with similarly named columns: ServiceX auto_schema and similar column names (scikit-hep/coffea#665)
auto_schema for non-jagged columns: auto_schema with ServiceX processor does not work for non-jagged data with underlines in names (scikit-hep/coffea#664)
non-async method to run ServiceX processor for easier debugging (ideally, NanoEventsFactory.from_root-like method): Get all URLs synchronously from a request (ssl-hep/ServiceX_frontend#243), Synchronous Coffea Integration (ssl-hep/ServiceX#432), ServiceX Year 5 (ssl-hep/ServiceX#430) -> Added functions to get URLs sync from request + corresponding tests (ssl-hep/ServiceX_frontend#245)
single-letter error messages: Error propagation going wrong somewhere (scikit-hep/coffea#666), Single-letter transform errors (ssl-hep/ServiceX#408)
progress bar for overall progress, similar to how that is shown when using coffea without ServiceX: Progress bar for ServiceX executor coffea (ssl-hep/ServiceX#419)
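
The request for a non-async entry point can be illustrated with a generic pattern: wrap an async source in a blocking function so it can be stepped through in a debugger. This is only a sketch using the standard library. `stream_file_urls`, `get_file_urls_sync`, and the example.org URLs are hypothetical stand-ins, not part of the ServiceX or coffea APIs.

```python
import asyncio
from typing import AsyncIterator, List


async def stream_file_urls(n_files: int) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async source that yields file URLs as they become ready."""
    for i in range(n_files):
        await asyncio.sleep(0)  # placeholder for waiting on a transform
        yield f"https://example.org/output_{i}.parquet"


async def _collect(n_files: int) -> List[str]:
    # Drain the async generator into a plain list.
    return [url async for url in stream_file_urls(n_files)]


def get_file_urls_sync(n_files: int) -> List[str]:
    """Blocking wrapper: returns only once all URLs are available."""
    return asyncio.run(_collect(n_files))


urls = get_file_urls_sync(3)
print(urls)
```

Note that `asyncio.run` raises a `RuntimeError` when an event loop is already running, which is the case inside Jupyter notebooks, so a notebook-friendly variant would need a different way of driving the coroutine.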

ServiceX

it can take a long time for transforms to report how many files are to be processed in total
limiting number of files when querying a rucio dataset: Controlling DID finder lookup (ssl-hep/ServiceX#395), works via file name suffix (cell 2 in 03_atlas_xAOD.ipynb) but currently seems broken? -> Properly implement delayed sending of batch file updates (ssl-hep/ServiceX_DID_Finder_lib#24)
MinIO filling up: automatic cleanup? Minio Retention Policy (ssl-hep/ServiceX#133)
uproot transformer returning ROOT files instead of only parquet: Transformer Output File Format Options (ssl-hep/ServiceX#225) -> Allow uproot transformer to produce ROOT files directly instead of parquet (ssl-hep/ServiceX#475) -> demo at AGC demo day #1
dataset grouping for efficient non-async transforms: Dataset Group (func_adl_servicex#56)
skipping need for dummy ds to create query (mentioned in Dataset Group, func_adl_servicex#56)

`coffea`

metadata caching: Surprising behavior of metadata caching in subsequent "fresh" runs (scikit-hep/coffea#662)
objects changing in surprising ways in systematic variations: Object shape and values changing with nanoevents weight systematics (scikit-hep/coffea#661)
allow attaching per-object systematic variations to the full event (to enable running over copies of events)? not great for performance, but convenient for usability
weight-based systematics that use object properties but are attached to events: Weight-based systematic variations depending on object kinematics attached to events (scikit-hep/coffea#667)
bug in bytesread: bytesread in metrics varies depending on file source and disagrees with pure uproot (scikit-hep/coffea#717)
fileset format when handling parquet inputs: Fileset format for parquet inputs (scikit-hep/coffea#734)
investigate possibility of "Added support for multiple systematic variations of a single weight" (scikit-hep/coffea#749) being useful in the AGC setup
column overtouching in ML input variable calculation: Column overtouching in quantities derived from combined object types (scikit-hep/coffea#892)

coffea-casa

dask manual scaling settings seem to not be accepted
ServiceX dashboard

`func_adl`

find ways to format queries in a way that helps understand the "layer" at which a given operation acts

processor design

avoid stacking masks of different shapes together (when built after initial filtering), hard to keep track of shapes (perhaps keepdims=True, or masking with None)
improve systematics loop, potentially streamline everything to use the same pattern, or find a way to automatically track which columns change when, and automatically expand observable with systematics dimensions, avoid scaling of jet properties via helper array
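
The mask-stacking concern above can be made concrete with a small sketch. It uses flat NumPy arrays as a stand-in for the jagged Awkward Arrays of a real processor, and the variable names and cut values are illustrative assumptions. The idea is to build every mask at full event length and combine them before filtering once, rather than building a second mask on already-filtered data.

```python
import numpy as np

# Toy per-event quantities (assumed values, one entry per event).
jet_pt = np.array([50.0, 20.0, 35.0, 80.0, 10.0])
n_btag = np.array([2, 0, 1, 2, 1])

# Pattern to avoid: filter first, then build a second mask on the filtered array.
pt_mask = jet_pt > 25                       # shape (5,), aligned with events
btag_mask_filtered = n_btag[pt_mask] >= 2   # shape (3,), no longer aligned with events

# Easier to track: keep every mask at full length, combine, then filter once.
btag_mask = n_btag >= 2                     # shape (5,)
selection = pt_mask & btag_mask             # shape (5,)
selected_pt = jet_pt[selection]

print(pt_mask.shape, btag_mask_filtered.shape, selection.shape)
print(selected_pt)
```

For jagged data, masking with None (for example via `ak.mask` in Awkward Array) plays a similar role: rejected entries become missing values instead of being dropped, so array lengths stay aligned across selection steps.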

Performance

ServiceX+`coffea`

dask scaling: ServiceX DaskExecutor is crasing with TypeError: Cannot convert OpenFile to pyarrow.lib.NativeFile (scikit-hep/coffea#611), NanoEvents can handle a parquet file sourced from an http request (scikit-hep/coffea#671)

ServiceX

DID finder becomes a bottleneck when running over a large amount of files

`coffea`

consider splitting out pre-processing: gist / Possibility of skipping pre-processing (scikit-hep/coffea#675), or merge input files to avoid bottleneck

`servicex-databinder` approach

avoid bottleneck with file conversion / copying (feed data straight to Skyhook?)

coffea-casa

understand issues showing up in dask task stream (file access?)
possibility of guaranteeing fixed number of workers for performance benchmarking

`func_adl`

implement full query with proper b-tagging of jets with pT > 25 GeV -> done in feat: benchmarking setup (#85)

`cabinetry`

cabinetry.templates.collect takes a long time as more channels are introduced (e.g. 45.3 seconds for 20 channels)
cabinetry.model_utils.prediction(model, fit_results=fit_results) causes the notebook to crash due to memory issues for a model with many channels -> potentially related: Memory requirement of ak.sum vs np.sum (scikit-hep/awkward#2480)
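
To track how runtime and memory grow with the number of channels, a small standard-library helper along the following lines could be used. This is a sketch only. `profile_call` and `build_dummy_templates` are hypothetical names, and the dummy workload merely stands in for the actual cabinetry calls listed above.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def profile_call(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float, int]:
    """Run func once and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


def build_dummy_templates(n_channels: int, n_bins: int = 1000) -> list:
    # Hypothetical stand-in workload, NOT a cabinetry function.
    return [[0.0] * n_bins for _ in range(n_channels)]


for n_channels in (5, 10, 20):
    templates, seconds, peak_bytes = profile_call(build_dummy_templates, n_channels)
    print(f"{n_channels} channels: {seconds:.4f} s, peak {peak_bytes / 1e6:.2f} MB")
```

`tracemalloc` only sees allocations made through Python's memory allocators, so memory used inside compiled extensions may be underreported. A process-level monitor would be a useful cross-check for the crash described above.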

ekauffma · 2023-04-18T11:11:47Z

Validation model which causes notebook to crash when getting post-fit prediction (cabinetry.model_utils.prediction(model, fit_results=fit_results)): https://gist.github.com/ekauffma/b9fcbba5bb6f1ba411b6be37d8586db6

alexander-held mentioned this issue Jul 18, 2022: Highest priority UX improvements for next demonstration #68 (closed, 3 tasks)

alexander-held added the bug, enhancement, and help wanted labels Oct 5, 2022

alexander-held added the implementation concerns analysis implementation label May 1, 2023

alexander-held pinned this issue May 1, 2023
