User experience and performance improvements for pipeline demonstrator #64
Labels
bug (Something isn't working), enhancement (New feature or request), help wanted (Extra attention is needed), implementation (concerns analysis implementation)
This issue collects user experience and performance aspects revealed by the CMS Open Data pipeline demonstration at the AGC 2022 workshop.
Completeness of pipeline
User experience
- ServiceX + coffea
  - auto_schema and AGC schema with similarly named columns: "ServiceX auto_schema and similar column names" (scikit-hep/coffea#665)
  - auto_schema for non-jagged columns: "auto_schema with ServiceX processor does not work for non-jagged data with underlines in names" (scikit-hep/coffea#664)
  - (NanoEventsFactory.from_root-like method): "Get all URLs synchronously from a request" (ssl-hep/ServiceX_frontend#243), "Synchronous Coffea Integration" (ssl-hep/ServiceX#432), "ServiceX Year 5" (ssl-hep/ServiceX#430) -> "Added functions to get URLs sync from request + corresponding tests" (ssl-hep/ServiceX_frontend#245)
- ServiceX
- coffea
  - bytesread: "bytesread in metrics varies depending on file source and disagrees with pure uproot" (scikit-hep/coffea#717)
- coffea-casa
- func_adl
- processor design
  - (keepdims=True, or masking with None)
Performance
- ServiceX + coffea
  - TypeError: Cannot convert OpenFile to pyarrow.lib.NativeFile (scikit-hep/coffea#611), "NanoEvents can handle a parquet file sourced from an http request" (scikit-hep/coffea#671)
- ServiceX
- coffea
  - servicex-databinder approach
- coffea-casa
- func_adl
- cabinetry
  - cabinetry.templates.collect method takes a lot of time when introducing more channels (e.g. 45.3 seconds for 20 channels)
  - cabinetry.model_utils.prediction(model, fit_results=fit_results) causes the notebook to crash due to memory issues on a model with many channels. -> potentially related: "Memory requirement of ak.sum vs np.sum" (scikit-hep/awkward#2480)
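Regarding the two auto_schema items: a minimal sketch of how similarly named or flat columns can end up in the same collection. It assumes, as a simplification and not as coffea's actual implementation, that columns are grouped by the text before the first underscore. The helper `group_columns` and all column names are hypothetical.

```python
from collections import defaultdict


def group_columns(columns):
    """Group column names by the prefix before the first underscore.

    Hypothetical stand-in for prefix-based schema inference; not coffea's code.
    """
    collections = defaultdict(list)
    for name in columns:
        prefix, sep, field = name.partition("_")
        # A name without an underscore becomes a collection with an empty field.
        collections[prefix].append(field if sep else "")
    return dict(collections)


# Hypothetical columns: jagged per-jet quantities plus flat per-event values.
columns = ["jet_pt", "jet_eta", "jet_count", "event_weight_nominal"]
grouped = group_columns(columns)
print(grouped)
```

Under this assumption the flat `jet_count` lands next to the jagged `jet_pt` and `jet_eta`, and `event_weight_nominal` is split at its first underscore only, which illustrates the kind of clash the linked issues describe.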
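Regarding the synchronous URL retrieval item (ssl-hep/ServiceX_frontend#243 and #245): the general pattern of exposing an asynchronous request through a blocking call can be sketched with asyncio. The names `fetch_urls_async` and `fetch_urls` and the example URLs are hypothetical; this is not the ServiceX frontend API.

```python
import asyncio


async def fetch_urls_async(n_files):
    """Hypothetical coroutine standing in for an asynchronous URL request."""
    urls = []
    for i in range(n_files):
        await asyncio.sleep(0)  # yield to the event loop, as a real request would
        urls.append(f"https://example.invalid/file_{i}.parquet")
    return urls


def fetch_urls(n_files):
    """Blocking wrapper: run the coroutine to completion and return its result."""
    # asyncio.run creates a fresh event loop, so this must not be called
    # from code that is already running inside an event loop.
    return asyncio.run(fetch_urls_async(n_files))


urls = fetch_urls(3)
print(urls)
```

One caveat for notebook use: `asyncio.run` raises a RuntimeError when an event loop is already running, which is the case inside Jupyter, so a wrapper like this would need a different strategy there.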
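Regarding the two cabinetry items: a small harness along these lines could record how wall time and peak memory grow with the number of channels. `build_templates` is a hypothetical stand-in workload, not a cabinetry function, and would be replaced by the call under study. Note that tracemalloc only sees allocations reported through Python's allocator, so memory allocated directly by compiled extensions may not be counted.

```python
import time
import tracemalloc


def build_templates(n_channels, n_bins=1000):
    """Hypothetical workload: one list of bin contents per channel."""
    return [[float(b) for b in range(n_bins)] for _ in range(n_channels)]


def profile(func, *args):
    """Return (result, elapsed seconds, peak traced bytes) for one call."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


measurements = {}
for n_channels in (1, 5, 20):
    _, elapsed, peak = profile(build_templates, n_channels)
    measurements[n_channels] = (elapsed, peak)
    print(f"{n_channels:>2} channels: {elapsed:.4f} s, peak {peak / 1e6:.2f} MB")
```

Comparing the measurements across channel counts shows whether the cost grows roughly linearly or faster, which may help narrow down where the time and memory go.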