v0.6.0 - Train and Converge Until it is Done

@GilesStrong released this 09 Sep 14:22 · 282 commits to master since this release

Important changes

  • auto_filter_on_linear_correlation now examines all features within correlated clusters, rather than just the most correlated pair. The function therefore only needs to be run once, rather than the previously recommended multiple reruns.
  • Moved to Scikit-learn 0.22.2 and, where possible, to keyword-argument calls for sklearn methods, in preparation for the enforcement of keyword arguments in 0.25
  • Fixed an error in patience when using cyclical LR callbacks: users now specify the number of cycles to go without improvement, rather than one plus that number
  • Matrix data is no longer passed through np.nan_to_num in FoldYielder. Users should ensure that matrix data contains no NaN or Inf values.
  • Tensor data:
    • df2foldfile, fold2foldfile, and `add_meta_data` can now support the saving of arbitrary matrices as matrix inputs
    • Pass a numpy.array whose first dimension matches the length of the DataFrame to the tensor_data argument of df2foldfile and a name to tensor_name.
      The array will be split along the first dimension and the sub-arrays will be saved as matrix inputs in the resulting foldfile
    • The matrices may also be passed in sparse format and will be densified on loading by FoldYielder
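The splitting described above can be sketched with plain NumPy (toy shapes; the real work happens inside df2foldfile, which also handles shuffling, stratification, and writing the foldfile):

```python
import numpy as np

# Toy stand-in for a DataFrame of 6 events, each carrying a 4x3 matrix input
n_rows, n_folds = 6, 2
tensor_data = np.arange(n_rows * 4 * 3, dtype=np.float32).reshape(n_rows, 4, 3)

# The array is split along its first dimension so that each fold's matrix
# rows stay aligned with its DataFrame rows (illustrative split only)
folds = np.array_split(tensor_data, n_folds, axis=0)
```

Each element of `folds` keeps the per-event matrix shape (4, 3) while the events themselves are divided between folds.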

Breaking

  • plot_rank_order_dendrogram now returns the sets of all features in clusters whose distance is over the threshold, rather than just the closest features in each cluster

Additions

  • Addition of a batch-size parameter to Ensemble.predict*
  • Lorentz Boost Network (https://arxiv.org/abs/1812.09722):
    • LorentzBoostNet: a basic implementation which learns boosted particles from existing particles and extracts features from them using fixed kernel functions
    • AutoExtractLorentzBoostNet, which also learns the kernel functions during training
  • Classification Eval classes:
    • BinaryAccuracy: Computes and returns the accuracy of a single-output model for binary classification tasks.
    • RocAucScore: Computes and returns the area under the Receiver Operating Characteristic curve (ROC AUC) of a classifier model.
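As a rough sketch of what these metrics compute, here are pure-NumPy stand-ins (not the LUMIN classes themselves): accuracy thresholds single-output predictions, and ROC AUC can be derived from prediction ranks (ties ignored):

```python
import numpy as np

def binary_accuracy(preds, targs, threshold=0.5):
    # Fraction of predictions falling on the correct side of the threshold
    return float(np.mean((preds >= threshold).astype(int) == targs))

def roc_auc(preds, targs):
    # Rank-based AUC: probability that a random positive outranks a
    # random negative (ties ignored for brevity)
    order = np.argsort(preds)
    ranks = np.empty(len(preds))
    ranks[order] = np.arange(1, len(preds) + 1)
    n_pos = int(targs.sum())
    n_neg = len(targs) - n_pos
    pos_rank_sum = ranks[targs == 1].sum()
    return float((pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

preds = np.array([0.1, 0.4, 0.35, 0.8])
targs = np.array([0, 0, 1, 1])
acc = binary_accuracy(preds, targs)  # 0.75
auc = roc_auc(preds, targs)          # 0.75
```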
  • plot_binary_sample_feat: a version of plot_sample_pred designed for plotting feature histograms with background contributions stacked by sample
  • Added compression arguments to df2foldfile, fold2foldfile, and save_to_grp
  • Tensor data:
    • df2foldfile, fold2foldfile, and `add_meta_data` can now support the saving of arbitrary matrices as matrix inputs
    • Pass a numpy.array whose first dimension matches the length of the DataFrame to the tensor_data argument of df2foldfile and a name to tensor_name.
      The array will be split along the first dimension and the sub-arrays will be saved as matrix inputs in the resulting foldfile
    • The matrices may also be passed in sparse format and will be densified on loading by FoldYielder
  • plot_lr_finders now has a log_y argument for a logarithmic y-axis. By default, log_y is set automatically if the maximum fractional difference between losses is greater than 50.
  • Added new rescaling options to ClassRegMulti using linear outputs and scaling by mean and std of targets
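The idea behind this rescaling can be sketched as follows (illustrative only; hypothetical targets, with the inverse transform standing in for what a linear output head would apply to its predictions):

```python
import numpy as np

# Hypothetical regression targets
targs = np.array([10.0, 12.0, 14.0, 20.0])
mean, std = targs.mean(), targs.std()

# The network trains against standardised targets via a linear output...
standardised = (targs - mean) / std
# ...and predictions are mapped back by scaling with the targets' mean and std
recovered = standardised * std + mean
```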
  • LsuvInit now applies scaling to nn.Conv3d layers
  • plot_lr_finders and fold_lr_find now have options to save the resulting LR finder plot (currently limited to png due to problems with pdf)
  • Addition of AdamW as an optimiser, thanks to @kiryteo
  • Contribution guide, thanks to @kiryteo
  • OneCycle lr_range now supports a non-zero final LR; just supply a three-tuple to the lr_range argument.
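A hedged sketch of the schedule shape (a toy linear ramp, not LUMIN's implementation; the assumption that a 2-tuple anneals back to zero is illustrative):

```python
import numpy as np

def one_cycle(lr_range, n_steps):
    # Illustrative one-cycle schedule: ramp start -> max for the first half,
    # then anneal max -> final. A 2-tuple is assumed to imply a final LR of 0.
    start, peak = lr_range[0], lr_range[1]
    final = lr_range[2] if len(lr_range) == 3 else 0.0
    half = n_steps // 2
    up = np.linspace(start, peak, half)
    down = np.linspace(peak, final, n_steps - half)
    return np.concatenate([up, down])

# Three-tuple: the schedule finishes at a non-zero LR
lrs = one_cycle((1e-4, 1e-2, 1e-5), 100)
```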
  • Ensemble.from_models classmethod for combining in-memory models into an Ensemble.
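Conceptually, combining in-memory models amounts to averaging their predictions; a minimal stand-in, not the LUMIN Ensemble API (the real classmethod also carries model weights and metadata):

```python
import numpy as np

class ToyEnsemble:
    # Minimal stand-in: the ensemble predicts the mean of its members'
    # predictions. Models here are plain callables, not LUMIN Model objects.
    def __init__(self, models):
        self.models = models

    def predict(self, x):
        return np.mean([m(x) for m in self.models], axis=0)

# Three toy "models" standing in for trained networks already in memory
models = [lambda x: 0.9 * x, lambda x: 1.0 * x, lambda x: 1.1 * x]
ens = ToyEnsemble(models)
out = ens.predict(np.array([1.0, 2.0]))  # ≈ [1.0, 2.0]
```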

Removals

  • FeatureSubsample
  • plots keyword in fold_train_ensemble

Fixes

  • Docs bug for nn.training due to missing ipython in requirements
  • Bug in LSUV init when running on CUDA
  • Bug in TF export caused by searching for full stops
  • Bug in model_bar update during fold training
  • Silent bug in `MultiHead` when matrix features were not listed first: map construction indexed self.matrix_feats rather than self.feats
  • Slowdown in ensemble.predict_array which caused the array to be sent to the device during each model evaluation
  • Model.get_param_count now includes non-trainable params when requested
  • Bug in fold_lr_find where LR finders would use different LR steps, leading to NaNs when plotting
  • plot_feat used to coerce NaN and Inf values via np.nan_to_num prior to plotting, potentially impacting distributions, plotting scales, moments, etc. Fixed so that NaN and Inf values are now removed rather than coerced.
  • Fixed early-stopping statement in fold_train_ensemble to state the number as "sub-epochs" (previously said "epochs")
  • Fixed an error in patience when using cyclical LR callbacks: users now specify the number of cycles to go without improvement, rather than one plus that number
  • Unnecessary warning in df2foldfile when no strat_key is passed
  • Saved matrices in fold2foldfile are now in float32
  • Fixed return type of get_layers methods in RNNs_CNNs_and_GNNs_for_matrix_data example
  • Bug in model.predict_array when predicting matrix data with a batch size
  • Added missing indexing in AbsMatrixHead to use torch.bool if the PyTorch version is >= 1.2 (was uint8, which is now deprecated for indexing)
  • Errors when running in terminal due to trying to call .show on fastprogress bars
  • Bug due to the encoding of the readme when installing with ASCII as the default encoding
  • Bug when running Model.predict in batches when the data contains less than one full batch
  • Include missing files in sdist, thanks to @thatch
  • Test path correction in example notebook, thanks to @kiryteo
  • Doc links in hep_proc
  • Error in MultiHead._set_feats when matrix_head does not contain 'vecs' or 'feats_per_vec' keywords
  • Compatibility error in numpy >= 1.18 in bin_binary_class_pred due to float instead of int
  • Unnecessary second loading of fold data in fold_lr_find
  • Compatibility error when working in PyTorch 1.6 based on integer and true division
  • SWA not evaluating in batches when running in non-bulk-move mode
  • Moved from normed to density keywords for matplotlib

Changes

  • ParametrisedPrediction now accepts lists of parametrisation features
  • plot_sample_pred now ensures that signal and background have the same binning
  • PlotSettings now coerces string arguments for savepath to Path
  • Added default value for targ_name in EvalMetric
  • plot_rank_order_dendrogram:
    • Now uses "optimal ordering" for improved presentation
    • Now returns the sets of all features in clusters whose distance is over the threshold, rather than just the closest features in each cluster
  • auto_filter_on_linear_correlation now examines all features within correlated clusters, rather than just the most correlated pair. The function therefore only needs to be run once, rather than the previously recommended multiple reruns.
  • Improved data shuffling in BatchYielder; now runs much more quickly
  • Slight speedup when loading data from foldfiles
  • Matrix data is no longer passed through np.nan_to_num in FoldYielder. Users should ensure that matrix data contains no NaN or Inf values.

Deprecations

Comments

  • RFPImp still imports from sklearn.ensemble.forest, which is deprecated and possibly part of the private API. Hopefully the package will remedy this before the deprecated module is removed. For now, future warnings are displayed.