v0.6.0 - Train and Converge Until it is Done

@GilesStrong released this 09 Sep 14:22 · 282 commits to master since this release

Important changes

  • auto_filter_on_linear_correlation now examines all features within correlated clusters, rather than just the most correlated pair. The function therefore only needs to be run once, rather than the previously recommended multiple reruns.
  • Moved to Scikit-learn 0.22.2 and, where possible, to keyword-argument calls for sklearn methods, in preparation for the enforcement of keyword arguments in 0.25
  • Fixed an error in patience when using cyclical LR callbacks: users now specify the number of cycles to go without improvement, rather than one plus that number
  • Matrix data is no longer passed through np.nan_to_num in FoldYielder. Users should ensure that matrix data contains no NaN or Inf values.
  • Tensor data:
    • df2foldfile, fold2foldfile, and `add_meta_data` can now support the saving of arbitrary matrices as matrix inputs
    • Pass a numpy.array whose first dimension matches the length of the DataFrame to the tensor_data argument of df2foldfile and a name to tensor_name.
      The array will be split along the first dimension and the sub-arrays will be saved as matrix inputs in the resulting foldfile
    • The matrices may also be passed in sparse format and will be densified on loading by FoldYielder
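The splitting described above can be sketched with plain NumPy (toy shapes; the real work happens inside df2foldfile, which also handles shuffling, stratification, and writing the foldfile):

```python
import numpy as np

# Toy stand-in for a DataFrame of 6 events, each carrying a 4x3 matrix input
n_rows, n_folds = 6, 2
tensor_data = np.arange(n_rows * 4 * 3, dtype=np.float32).reshape(n_rows, 4, 3)

# The array is split along its first dimension so that each fold's matrix
# rows stay aligned with its DataFrame rows (illustrative split only)
folds = np.array_split(tensor_data, n_folds, axis=0)
```

Each element of `folds` keeps the per-event matrix shape (4, 3) while the events themselves are divided between folds.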

Breaking

  • plot_rank_order_dendrogram now returns the sets of all features in clusters whose distance is over the threshold, rather than just the closest features in each cluster

Additions

  • Addition of a batch-size parameter to Ensemble.predict*
  • Lorentz Boost Network (https://arxiv.org/abs/1812.09722):
    • LorentzBoostNet: a basic implementation which learns boosted particles from existing particles and extracts features from them using fixed kernel functions
    • AutoExtractLorentzBoostNet, which also learns the kernel functions during training
  • Classification Eval classes:
    • BinaryAccuracy: Computes and returns the accuracy of a single-output model for binary classification tasks.
    • RocAucScore: Computes and returns the area under the Receiver Operating Characteristic curve (ROC AUC) of a classifier model.
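As a rough sketch of what these metrics compute, here are pure-NumPy stand-ins (not the LUMIN classes themselves): accuracy thresholds single-output predictions, and ROC AUC can be derived from prediction ranks (ties ignored):

```python
import numpy as np

def binary_accuracy(preds, targs, threshold=0.5):
    # Fraction of predictions falling on the correct side of the threshold
    return float(np.mean((preds >= threshold).astype(int) == targs))

def roc_auc(preds, targs):
    # Rank-based AUC: probability that a random positive outranks a
    # random negative (ties ignored for brevity)
    order = np.argsort(preds)
    ranks = np.empty(len(preds))
    ranks[order] = np.arange(1, len(preds) + 1)
    n_pos = int(targs.sum())
    n_neg = len(targs) - n_pos
    pos_rank_sum = ranks[targs == 1].sum()
    return float((pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

preds = np.array([0.1, 0.4, 0.35, 0.8])
targs = np.array([0, 0, 1, 1])
acc = binary_accuracy(preds, targs)  # 0.75
auc = roc_auc(preds, targs)          # 0.75
```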
  • plot_binary_sample_feat: a version of plot_sample_pred designed for plotting feature histograms with background contributions stacked by sample
  • Added compression arguments to df2foldfile, fold2foldfile, and save_to_grp
  • Tensor data:
    • df2foldfile, fold2foldfile, and `add_meta_data` can now support the saving of arbitrary matrices as matrix inputs
    • Pass a numpy.array whose first dimension matches the length of the DataFrame to the tensor_data argument of df2foldfile and a name to tensor_name.
      The array will be split along the first dimension and the sub-arrays will be saved as matrix inputs in the resulting foldfile
    • The matrices may also be passed in sparse format and will be densified on loading by FoldYielder
  • plot_lr_finders now has a log_y argument for a logarithmic y-axis. By default, log_y is set automatically if the maximum fractional difference between losses is greater than 50.
  • Added new rescaling options to ClassRegMulti using linear outputs and scaling by mean and std of targets
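The idea behind this rescaling can be sketched as follows (illustrative only; hypothetical targets, with the inverse transform standing in for what a linear output head would apply to its predictions):

```python
import numpy as np

# Hypothetical regression targets
targs = np.array([10.0, 12.0, 14.0, 20.0])
mean, std = targs.mean(), targs.std()

# The network trains against standardised targets via a linear output...
standardised = (targs - mean) / std
# ...and predictions are mapped back by scaling with the targets' mean and std
recovered = standardised * std + mean
```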
  • LsuvInit now applies scaling to nn.Conv3d layers
  • plot_lr_finders and fold_lr_find now have options to save the resulting LR finder plot (currently limited to png due to problems with pdf)
  • Addition of AdamW as an optimiser, thanks to @kiryteo
  • Contribution guide, thanks to @kiryteo
  • OneCycle lr_range now supports a non-zero final LR; just supply a three-tuple to the lr_range argument.
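A hedged sketch of the schedule shape (a toy linear ramp, not LUMIN's implementation; the assumption that a 2-tuple anneals back to zero is illustrative):

```python
import numpy as np

def one_cycle(lr_range, n_steps):
    # Illustrative one-cycle schedule: ramp start -> max for the first half,
    # then anneal max -> final. A 2-tuple is assumed to imply a final LR of 0.
    start, peak = lr_range[0], lr_range[1]
    final = lr_range[2] if len(lr_range) == 3 else 0.0
    half = n_steps // 2
    up = np.linspace(start, peak, half)
    down = np.linspace(peak, final, n_steps - half)
    return np.concatenate([up, down])

# Three-tuple: the schedule finishes at a non-zero LR
lrs = one_cycle((1e-4, 1e-2, 1e-5), 100)
```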
  • Ensemble.from_models classmethod for combining in-memory models into an Ensemble.
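Conceptually, combining in-memory models amounts to averaging their predictions; a minimal stand-in, not the LUMIN Ensemble API (the real classmethod also carries model weights and metadata):

```python
import numpy as np

class ToyEnsemble:
    # Minimal stand-in: the ensemble predicts the mean of its members'
    # predictions. Models here are plain callables, not LUMIN Model objects.
    def __init__(self, models):
        self.models = models

    def predict(self, x):
        return np.mean([m(x) for m in self.models], axis=0)

# Three toy "models" standing in for trained networks already in memory
models = [lambda x: 0.9 * x, lambda x: 1.0 * x, lambda x: 1.1 * x]
ens = ToyEnsemble(models)
out = ens.predict(np.array([1.0, 2.0]))  # ≈ [1.0, 2.0]
```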

Removals

  • FeatureSubsample
  • plots keyword in fold_train_ensemble

Fixes

  • Docs bug for nn.training due to missing ipython in requirements
  • Bug in LSUV init when running on CUDA
  • Bug in TF export caused by searching for full stops
  • Bug in model_bar update during fold training
  • Silent bug in `MultiHead` when matrix features were not listed first: map construction indexed self.matrix_feats rather than self.feats
  • Slowdown in ensemble.predict_array which caused the array to be sent to the device during each model evaluation
  • Model.get_param_count now includes non-trainable params when requested
  • Bug in fold_lr_find where LR finders would use different LR steps, leading to NaNs when plotting
  • plot_feat used to coerce NaN and Inf values via np.nan_to_num prior to plotting, potentially impacting distributions, plotting scales, moments, etc. Fixed so that NaN and Inf values are now removed rather than coerced.
  • Fixed early-stopping statement in fold_train_ensemble to state the number as "sub-epochs" (previously said "epochs")
  • Fixed an error in patience when using cyclical LR callbacks: users now specify the number of cycles to go without improvement, rather than one plus that number
  • Unnecessary warning in df2foldfile when no strat_key is passed
  • Saved matrices in fold2foldfile are now in float32
  • Fixed return type of get_layers methods in RNNs_CNNs_and_GNNs_for_matrix_data example
  • Bug in model.predict_array when predicting matrix data with a batch size
  • Added missing indexing in AbsMatrixHead to use torch.bool if the PyTorch version is >= 1.2 (was uint8, which is now deprecated for indexing)
  • Errors when running in terminal due to trying to call .show on fastprogress bars
  • Bug due to the encoding of the readme when installing with ASCII as the default encoding
  • Bug when running Model.predict in batches when the data contains less than one full batch
  • Include missing files in sdist, thanks to @thatch
  • Test path correction in example notebook, thanks to @kiryteo
  • Doc links in hep_proc
  • Error in MultiHead._set_feats when matrix_head does not contain 'vecs' or 'feats_per_vec' keywords
  • Compatibility error in numpy >= 1.18 in bin_binary_class_pred due to float instead of int
  • Unnecessary second loading of fold data in fold_lr_find
  • Compatibility error when working in PyTorch 1.6 based on integer and true division
  • SWA not evaluating in batches when running in non-bulk-move mode
  • Moved from normed to density keywords for matplotlib

Changes

  • ParametrisedPrediction now accepts lists of parametrisation features
  • plot_sample_pred now ensures that signal and background have the same binning
  • PlotSettings now coerces string arguments for savepath to Path
  • Added default value for targ_name in EvalMetric
  • plot_rank_order_dendrogram:
    • Now uses "optimal ordering" for improved presentation
    • Now returns the sets of all features in clusters whose distance is over the threshold, rather than just the closest features in each cluster
  • auto_filter_on_linear_correlation now examines all features within correlated clusters, rather than just the most correlated pair. The function therefore only needs to be run once, rather than the previously recommended multiple reruns.
  • Improved data shuffling in BatchYielder; now runs much more quickly
  • Slight speedup when loading data from foldfiles
  • Matrix data is no longer passed through np.nan_to_num in FoldYielder. Users should ensure that matrix data contains no NaN or Inf values.

Deprecations

Comments

  • RFPImp still imports from sklearn.ensemble.forest, which is deprecated and possibly part of the private API. Hopefully the package will remedy this before the deprecated module is removed. For now, future warnings are displayed.