"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." (Tom Mitchell)
- overview
- theory
- methods
- representation learning
- program synthesis
- meta-learning
- automated machine learning
- weak supervision
- interesting papers
deep learning
reinforcement learning
bayesian inference and learning
probabilistic programming
causal inference
artificial intelligence
knowledge representation and reasoning
natural language processing
recommender systems
information retrieval
"Machine Learning is The New Algorithms" by Hal Daume
"When is Machine Learning Worth It?" by Ferenc Huszar
Any source code for an expression y = f(x), where f has some parameters and is used to make a decision, prediction, or estimate, can potentially be replaced by a machine learning algorithm.
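A minimal sketch of this point (a hypothetical spam-filter rule; the data, thresholds, and feature names are made up for illustration): the hand-tuned parameters inside f(x) become parameters fit from labeled examples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def spam_rule(num_links, num_caps):
    # hand-coded f(x): the parameters 5 and 20 were picked by a human
    return num_links > 5 or num_caps > 20

# the machine learning replacement: fit the parameters from labeled examples
X = np.array([[1, 3], [7, 25], [2, 30], [0, 1], [9, 40], [1, 2]])
y = np.array([0, 1, 1, 0, 1, 0])  # 1 = spam; labels are made up
clf = LogisticRegression().fit(X, y)
print(clf.predict([[6, 22]]))  # the decision now comes from learned parameters
```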
http://metacademy.org
http://en.wikipedia.org/wiki/Machine_learning (guide)
http://machinelearning.ru in russian
"Machine Learning Basics" by Ian Goodfellow, Yoshua Bengio, Aaron Courville
"A Few Useful Things to Know about Machine Learning" by Pedro Domingos
"Expressivity, Trainability, and Generalization in Machine Learning" by Eric Jang
"Clever Methods of Overfitting" by John Langford
"Common Pitfalls in Machine Learning" by Daniel Nee
"Classification vs. Prediction" by Frank Harrell
"Causality in Machine Learning" by Muralidharan et al.
"Are ML and Statistics Complementary?" by Max Welling
"Introduction to Information Theory and Why You Should Care" by Gil Katz
"Ideas on Interpreting Machine Learning" by Hall et al.
"Mathematics for Machine Learning" by Marc Peter Deisenroth, A Aldo Faisal, Cheng Soon Ong
"Rules of Machine Learning: Best Practices for ML Engineering" by Martin Zinkevich
course by Nando de Freitas video
course by Nando de Freitas video
course by Pedro Domingos video
course by Alex Smola video
course by Trevor Hastie and Rob Tibshirani video
course by Jeff Miller video
course by Sergey Nikolenko video, in russian (2019-2020)
course by Sergey Nikolenko video, in russian (2020)
course by Sergey Nikolenko video, in russian (2018)
course by Konstantin Vorontsov video, in russian (2021)
course by Konstantin Vorontsov video, in russian (2020)
course by Konstantin Vorontsov video, in russian (2019/2020)
course by Konstantin Vorontsov video, in russian (2014)
course by Igor Kuralenok video, in russian (2017)
course by Igor Kuralenok video, in russian (2016)
course by Igor Kuralenok video, in russian (2015)
course by Igor Kuralenok video, in russian (2013/2014)
course (1, 2) by Igor Kuralenok video, in russian (2012/2013)
course by Evgeny Sokolov video, in russian (2019)
course from Yandex video, in russian
course from OpenDataScience video, in russian
deep learning courses
reinforcement learning courses
bayesian inference and learning courses
"A First Encounter with Machine Learning" by Max Welling
"Model-Based Machine Learning" by John Winn, Christopher Bishop and Thomas Diethe
"Deep Learning" by Ian Goodfellow, Yoshua Bengio, Aaron Courville
"Reinforcement Learning: An Introduction"
(second edition) by Richard Sutton and Andrew Barto
"Machine Learning" by Tom Mitchell
"Understanding Machine Learning: From Theory to Algorithms" by Shai Shalev-Shwartz and Shai Ben-David
"Pattern Recognition and Machine Learning" by Chris Bishop
"Computer Age Statistical Inference" by Bradley Efron and Trevor Hastie
"The Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, Jerome Friedman
"Machine Learning - A Probabilistic Perspective" by Kevin Murphy
"Information Theory, Inference, and Learning Algorithms" by David MacKay
"Bayesian Reasoning and Machine Learning" by David Barber
"Foundations of Machine Learning" by Mehryar Mohri
"Scaling Up Machine Learning: Parallel and Distributed Approaches" by Ron Bekkerman, Mikhail Bilenko, John Langford
http://offconvex.org
http://argmin.net
http://inference.vc
http://blog.shakirm.com
http://machinethoughts.wordpress.com
http://hunch.net
http://machinedlearnings.com
http://nlpers.blogspot.com
http://timvieira.github.io
http://ruder.io
http://danieltakeshi.github.io
http://lilianweng.github.io
https://twimlai.com https://thetalkingmachines.com https://lexfridman.com/ai
https://jack-clark.net/import-ai
https://newsletter.ruder.io
https://getrevue.co/profile/seungjaeryanlee
https://getrevue.co/profile/wildml
https://reddit.com/r/MachineLearning
- NeurIPS 2019 [videos] [videos] [notes]
- RLDM 2019 [notes]
- ICML 2019 [videos] [videos] [videos] [notes]
- ICLR 2019 [videos] [videos] [notes]
- NeurIPS 2018 [videos] [videos]
- ICML 2018 [videos] [videos] [notes]
- ICLR 2018 [videos]
- NeurIPS 2017 [videos] [videos] [notes] [summary] [summary] [summary]
- ICML 2017 [videos]
- ICLR 2017 [videos]
- NeurIPS 2016 [videos] [videos] [videos] [videos] [videos] [videos] [summary]
- ICML 2016 [videos]
- ICLR 2016 [videos]
- NeurIPS 2015 [videos] [videos] [summary]
- ICML 2015 [videos] [videos]
- ICLR 2015 [videos]
- NeurIPS 2014 [videos]
machine learning has become alchemy by Ali Rahimi video (post)
statistics in machine learning by Michael I. Jordan video
theory in machine learning by Michael I. Jordan video
"Learning Theory: Purely Theoretical?" by Jonathan Huggins
problems:
- What does it mean to learn?
- When is a concept/function learnable?
- How much data do we need to learn something?
- How can we make sure what we learn will generalize to future data?
theory helps to:
- design algorithms
- understand behaviour of algorithms
- quantify knowledge/uncertainty
- identify new and refine old challenges
frameworks:
- statistical learning theory
- computational learning theory (PAC learning or PAC-Bayes)
ingredients:
- distributions
- i.i.d. samples
- learning algorithms
- predictors
- loss functions
A priori analysis: How well will a learning algorithm perform on new data?
- (Vapnik's learning theory) Can we compete with best hypothesis from a given set of hypotheses?
- (statistics) Can we match the best possible loss assuming data generating distribution belongs to known family?
A posteriori analysis: How well is a learning algorithm doing on the given data? Quantify the uncertainty that remains.
Fundamental theorem of statistical learning theory:
In binary classification, to match the loss of the best hypothesis in class H up to accuracy ε, one needs O(VC(H)/ε^2) observations.
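A standard way to state this bound (a sketch of the agnostic PAC sample-complexity result, constants omitted):

```latex
% With probability at least 1-\delta over an i.i.d. sample of size m,
% empirical risk minimization over hypothesis class H returns \hat{h} with
% L_D(\hat{h}) \le \min_{h \in H} L_D(h) + \epsilon, provided
\[
m = O\!\left( \frac{\mathrm{VC}(H) + \log(1/\delta)}{\epsilon^{2}} \right)
\]
```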
"Machine Learning Theory" by Mostafa Samir
"Crash Course on Learning Theory" by Sebastien Bubeck
"Statistical Learning Theory" by Percy Liang
course by Tomaso Poggio and others (videos, videos, videos)
course by Yaser Abu-Mostafa video
course by Sebastien Bubeck video
"Computational Learning Theory, AI and Beyond" chapter of "Mathematics and Computation" book by Avi Wigderson
"Probably Approximately Correct - A Formal Theory of Learning" by Jeremy Kun
"A Problem That is Not (Properly) PAC-learnable" by Jeremy Kun
"Occam’s Razor and PAC-learning" by Jeremy Kun
bayesian inference and learning
challenges
- How to decide which representation is best for target knowledge?
- How to tell genuine regularities from chance occurrences?
- How to exploit pre-existing domain knowledge?
- How to learn with limited computational resources?
- How to learn with limited data?
- How to make learned results understandable?
- How to quantify uncertainty?
- How to take into account the costs of decisions?
- How to handle non-independent and non-stationary data?
"The Three Cultures of Machine Learning" by Jason Eisner
"Algorithmic Dimensions" by Justin Domke
"All Models of Learning Have Flaws" by John Langford
algorithms (Wikipedia)
algorithms (Metacademy)
algorithms (scikit-learn)
"Representation is a formal system which makes explicit certain entities and types of information, and which can be operated on by an algorithm in order to achieve some information processing goal. Representations differ in terms of what information they make explicit and in terms of what algorithms they support. As example, Arabic and Roman numerals - the fact that operations can be applied to particular columns of Arabic numerals in meaningful ways allows for simple and efficient algorithms for addition and multiplication."
"In representation learning, our goal isn’t to predict observables, but to learn something about the underlying structure. In cognitive science and AI, a representation is a formal system which maps to some domain of interest in systematic ways. A good representation allows us to answer queries about the domain by manipulating that system. In machine learning, representations often take the form of vectors, either real- or binary-valued, and we can manipulate these representations with operations like Euclidean distance and matrix multiplication."
"In representation learning, the goal isn’t to make predictions about observables, but to learn a representation which would later help us to answer various queries. Sometimes the representations are meant for people, such as when we visualize data as a two-dimensional embedding. Sometimes they’re meant for machines, such as when the binary vector representations learned by deep Boltzmann machines are fed into a supervised classifier. In either case, what’s important is that mathematical operations map to the underlying relationships in the data in systematic ways."
"What is representation learning?" by Roger Grosse
"Predictive learning vs. representation learning" by Roger Grosse
deep learning
probabilistic programming
knowledge representation
programmatic representations:
- well-specified: Unlike sentences in natural language, programs are unambiguous, although two distinct programs can be precisely equivalent.
- compact: Programs allow us to compress data on the basis of their regularities.
- combinatorial: Programs can access the results of running other programs, as well as delete, duplicate, and rearrange these results.
- hierarchical: Programs have an intrinsic hierarchical organization and may be decomposed into subprograms.
challenges:
- open-endedness: In contrast to other knowledge representations in machine learning, programs may vary in size and shape, and there is no obvious problem-independent upper bound on program size. This makes it difficult to represent programs as points in a fixed-dimensional space, or to learn programs with algorithms that assume such a space.
- over-representation: Often syntactically distinct programs will be semantically identical (i.e. represent the same underlying behavior or functional mapping). Lacking prior knowledge, many algorithms will inefficiently sample semantically identical programs repeatedly (see the sketch after this list).
- chaotic execution: Programs that are very similar syntactically may be very different semantically. This presents difficulty for many heuristic search algorithms, which require syntactic and semantic distance to be correlated.
- high resource-variance: Programs in the same space may vary greatly in the space and time they require to execute.
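A small sketch of the over-representation challenge (the toy expression set is chosen for illustration): enumerate tiny arithmetic programs and fingerprint them by input-output behavior; syntactically distinct programs collapse into the same semantic class.

```python
# tiny "programs" over one variable, given as (source, function) pairs
ops = [("x + x", lambda x: x + x),
       ("2 * x", lambda x: 2 * x),
       ("x * x", lambda x: x * x),
       ("x + 0", lambda x: x + 0),
       ("x",     lambda x: x)]

probe = [0, 1, 2, 3, 5]  # evaluate each program on a few probe inputs

semantics = {}
for src, f in ops:
    signature = tuple(f(x) for x in probe)  # behavioral fingerprint
    semantics.setdefault(signature, []).append(src)

for sig, programs in semantics.items():
    print(programs)  # e.g. ['x + x', '2 * x'] collapse to one behavior
```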
"Program Synthesis Explained" by James Bornholt
"Inductive Programming Meets the Real World" by Gulwani et al. paper
"Program Synthesis" by Gulwani, Polozov, Singh paper
"Program Synthesis in 2017-18" by Alex Polozov
"Recent Advances in Neural Program Synthesis" by Neel Kant paper
"The Future of Deep Learning" by Francois Chollet (talk video
)
overview by Rishabh Singh video
overview by Rishabh Singh video
overview by Alex Polozov video
overview by Scott Reed video
overview by Alex Gaunt video
"Neural Abstract Machines & Program Induction" workshop (NIPS 2016 videos, ICML 2018 videos)
interesting recent papers
selected papers
course by Chelsea Finn (videos)
overview by Yee Whye Teh video
overview by Chelsea Finn and Sergey Levine video
overview by Pieter Abbeel video
overview by Oriol Vinyals video
overview by Chelsea Finn video
overview by Nando de Freitas video
Metalearning symposium video
Metalearning symposium panel video
RNN symposium panel video
"Meta-learning in Natural and Artificial Intelligence" by Jane Wang paper
meta-learning overview by Lilian Weng
meta reinforcement learning overview by Lilian Weng
overview by Juergen Schmidhuber
overview by Juergen Schmidhuber video (meta-learning vs transfer learning)
overview by Juergen Schmidhuber video
overview by Juergen Schmidhuber video
overview by Juergen Schmidhuber
overview by Tom Schaul and Juergen Schmidhuber
"Current commercial AI algorithms are still missing something fundamental. They are no self-referential general purpose learning algorithms. They improve some system’s performance in a given limited domain, but they are unable to inspect and improve their own learning algorithm. They do not learn the way they learn, and the way they learn the way they learn, and so on (limited only by the fundamental limits of computability)."
(Juergen Schmidhuber)
"On GPT-3: Meta-Learning, Scaling, Implications, And Deep Theory" by Gwern Branwen
"The Future of Deep Learning" by Francois Chollet (talk video
)
AutoML aims to automate many different stages of the machine learning process:
- model selection, hyper-parameter optimization, and model search (see the sketch after this list)
- meta learning and transfer learning
- representation learning and automatic feature extraction / construction
- automatic generation of workflows / workflow reuse
- automatic problem "ingestion" (from raw data and miscellaneous formats)
- automatic feature transformation to match algorithm requirements
- automatic detection and handling of skewed data and/or missing values
- automatic acquisition of new data (active learning, experimental design)
- automatic report writing (providing insight on automatic data analysis)
- automatic selection of evaluation metrics / validation procedures
- automatic selection of algorithms under time/space/power constraints
- automatic prediction post-processing and calibration
- automatic leakage detection
- automatic inference and differentiation
- user interfaces for AutoML
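A minimal sketch of the first stage above, hyper-parameter optimization via plain random search (the dataset, search space, and budget are illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)

best_score, best_params = -np.inf, None
for _ in range(10):  # sample 10 random configurations
    params = {"n_estimators": int(rng.integers(10, 200)),
              "max_depth": int(rng.integers(2, 20))}
    score = cross_val_score(RandomForestClassifier(**params, random_state=0),
                            X, y, cv=3).mean()
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)  # the selected configuration
```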
problems:
- different data distributions: the intrinsic/geometrical complexity of the dataset
- different tasks: regression, binary classification, multi-class classification, multi-label classification
- different scoring metrics: AUC, BAC, MSE, F1, etc.
- class balance: balanced or unbalanced class proportions
- sparsity: full or sparse matrices
- missing values: presence or absence of missing values
- categorical variables: presence or absence of categorical variables
- irrelevant variables: presence or absence of additional irrelevant variables (distractors)
- number Ptr of training examples: small or large number of training examples
- number N of variables/features: small or large number of variables
- aspect ratio Ptr/N of the training data matrix: Ptr >> N, Ptr = N or Ptr << N
"AutoML: Methods, Systems, Challenges" book by Frank Hutter, Lars Kotthoff, Joaquin Vanschoren
"Automated Machine Learning: A Short History" by Thomas Dinsmore
"AutoML at Google and Future Directions" by Jeff Dean video
tutorial by Frank Hutter and Joaquin Vanschoren video
"Automated Machine Learning" by Andreas Mueller video
"AutoML and How To Speed It Up" by Frank Hutter video
auto-sklearn project
- overview by Feurer et al.
TPOT project
- overview by Olson and Moore
auto_ml project
H2O AutoML project
The Automatic Statistician project
- overview by Steinruecken et al.
- overview by Zoubin Ghahramani video
- overview by Zoubin Ghahramani video
- overview by Zoubin Ghahramani video
AlphaD3M project
Google AutoML at Kaggle challenge
"Benchmarking Automatic Machine Learning Frameworks" by Balaji and Allen paper
"CleanNet: Transfer Learning for Scalable Image Classifier Training with Label Noise" by Lee et al. paper
(post, post)
Snorkel website
Snorkel project
Snorkel blog
overview by Chris Re video
overview by Chris Re video
overview by Chris Re video
overview by Chris Re video
overview by Chris Re video
overview by Stephen Bach video
overview by Alex Ratner video
overview by Alex Ratner video
overview by Alex Ratner video
overview by Alex Ratner audio
"Data Programming: ML with Weak Supervision" post
"Socratic Learning: Debugging ML Models" post
"SLiMFast: Assessing the Reliability of Data" post
"Data Programming + TensorFlow Tutorial" post
"Babble Labble: Learning from Natural Language Explanations" post
(overview video)
"Structure Learning: Are Your Sources Only Telling You What You Want to Hear?" post
"HoloClean: Weakly Supervised Data Repairing" post
"Scaling Up Snorkel with Spark" post
"Weak Supervision: The New Programming Paradigm for Machine Learning" post
"Learning to Compose Domain-Specific Transformations for Data Augmentation" post
"Exploiting Building Blocks of Data to Efficiently Create Training Sets" post
"Programming Training Data: The New Interface Layer for ML" post
"Accelerating Machine Learning with Training Data Management" paper
"Data Programming: Creating Large Training Sets, Quickly" paper
summary
(video)
"Socratic Learning: Empowering the Generative Model" paper
summary
(video)
"Data Programming with DDLite: Putting Humans in a Different Part of the Loop" paper
"Snorkel: A System for Lightweight Extraction" paper
(talk video)
"Snorkel: Fast Training Set Generation for Information Extraction" paper
(talk video)
"Learning the Structure of Generative Models without Labeled Data" paper
(talk video)
"Learning to Compose Domain-Specific Transformations for Data Augmentation" paper
(video)
"Inferring Generative Model Structure with Static Analysis" paper
(video)
"Snorkel: Rapid Training Data Creation with Weak Supervision" paper
summary
(talk video)
"Training Complex Models with Multi-Task Weak Supervision" paper
"Snorkel MeTaL: Weak Supervision for Multi-Task Learning" paper
"Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale" paper
(post)
"Snorkel is a system for rapidly creating, modeling, and managing training data, currently focused on accelerating the development of structured or "dark" data extraction applications for domains in which large labeled training sets are not available or easy to obtain.
Today's state-of-the-art machine learning models require massive labeled training sets--which usually do not exist for real-world applications. Instead, Snorkel is based around the new data programming paradigm, in which the developer focuses on writing a set of labeling functions, which are just scripts that programmatically label data. The resulting labels are noisy, but Snorkel automatically models this process - learning, essentially, which labeling functions are more accurate than others - and then uses this to train an end model (for example, a deep neural network in TensorFlow).
Surprisingly, by modeling a noisy training set creation process in this way, we can take potentially low-quality labeling functions from the user, and use these to train high-quality end models. We see Snorkel as providing a general framework for many weak supervision techniques, and as defining a new programming model for weakly-supervised machine learning systems."
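A deliberately simplified sketch of this idea (not Snorkel's actual API; the labeling functions are hypothetical, and a plain majority vote stands in for the generative label model that Snorkel actually learns):

```python
ABSTAIN, NEG, POS = -1, 0, 1

def lf_contains_won(text):   # hypothetical labeling function
    return POS if "won" in text else ABSTAIN

def lf_contains_lost(text):  # hypothetical labeling function
    return NEG if "lost" in text else ABSTAIN

def lf_short(text):          # weak, noisy heuristic
    return NEG if len(text) < 10 else ABSTAIN

def majority_vote(text, lfs):
    # stand-in for the learned generative model that weights labeling functions
    votes = [lf(text) for lf in lfs if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_contains_won, lf_contains_lost, lf_short]
corpus = ["our team won the cup", "we lost badly", "ok"]
noisy_labels = [majority_vote(t, lfs) for t in corpus]
print(noisy_labels)  # these labels would then train a discriminative end model
```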
interesting papers - deep learning
interesting papers - reinforcement learning
interesting papers - bayesian inference and learning
interesting papers - probabilistic programming
"A Theory of the Learnable" Valiant
"Humans appear to be able to learn new concepts without needing to be programmed explicitly in any conventional sense. In this paper we regard learning as the phenomenon of knowledge acquisition in the absence of explicit programming. We give a precise methodology for studying this phenomenon from a computational viewpoint. It consists of choosing an appropriate information gathering mechanism, the learning protocol, and exploring the class of concepts that can be learned using it in a reasonable (polynomial) number of steps. Although inherent algorithmic complexity appears to set serious limits to the range of concepts that can be learned, we show that there are some important nontrivial classes of propositional concepts that can be learned in a realistic sense."
"Proof that if you have a finite number of functions, say N, then every training error will be close to every test error once you have more than log N training cases by a small constant factor. Clearly, if every training error is close to its test error, then overfitting is basically impossible (overfitting occurs when the gap between the training and the test error is large)."
"There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools."
"Machine learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled. As a result, machine learning is widely used in computer science and other fields. However, developing successful machine learning applications requires a substantial amount of “black art” that is hard to find in textbooks. This article summarizes twelve key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions."
"This paper introduces an advanced setting of machine learning problem in which an Intelligent Teacher is involved. During training stage, Intelligent Teacher provides Student with information that contains, along with classification of each example, additional privileged information (explanation) of this example. The paper describes two mechanisms that can be used for significantly accelerating the speed of Student’s training: (1) correction of Student’s concepts of similarity between examples, and (2) direct Teacher-Student knowledge transfer."
"During last fifty years a strong machine learning theory has been developed. This theory includes: 1. The necessary and sufficient conditions for consistency of learning processes. 2. The bounds on the rate of convergence which in general cannot be improved. 3. The new inductive principle (SRM) which always achieves the smallest risk. 4. The effective algorithms, (such as SVM), that realize consistency property of SRM principle. It looked like general learning theory has been complied: it answered almost all standard questions that is asked in the statistical theory of inference. Meantime, the common observation was that human students require much less examples for training than learning machine. Why? The talk is an attempt to answer this question. The answer is that it is because the human students have an Intelligent Teacher and that Teacher-Student interactions are based not only on the brute force methods of function estimation from observations. Speed of learning also based on Teacher-Student interactions which have additional mechanisms that boost learning process. To learn from smaller number of observations learning machine has to use these mechanisms. In the talk I will introduce a model of learning that includes the so called Intelligent Teacher who during a training session supplies a Student with intelligent (privileged) information in contrast to the classical model where a student is given only outcomes y for events x. Based on additional privileged information x* for event x two mechanisms of Teacher-Student interactions (special and general) are introduced: 1. The Special Mechanism: To control Student's concept of similarity between training examples. and 2. The General Mechanism: To transfer knowledge that can be obtained in space of privileged information to the desired space of decision rules. Both mechanisms can be considered as special forms of capacity control in the universally consistent SRM inductive principle. Privileged information exists for almost any inference problem and can make a big difference in speed of learning processes."
video https://video.ias.edu/csdm/2015/0330-VladimirVapnik (Vapnik)
video https://youtube.com/watch?v=UP5JvzMzCoc (Grabovoy) in russian
press http://learningtheory.org/learning-has-just-started-an-interview-with-prof-vladimir-vapnik/
"The use of compression algorithms in machine learning tasks such as clustering and classification has appeared in a variety of fields, sometimes with the promise of reducing problems of explicit feature selection. The theoretical justification for such methods has been founded on an upper bound on Kolmogorov complexity and an idealized information space. An alternate view shows compression algorithms implicitly map strings into implicit feature space vectors, and compression-based similarity measures compute similarity within these feature spaces. Thus, compression-based methods are not a “parameter free” magic bullet for feature selection and data representation, but are instead concrete similarity measures within defined feature spaces, and are therefore akin to explicit feature vector models used in standard machine learning algorithms. To underscore this point, we find theoretical and empirical connections between traditional machine learning vector models and compression, encouraging cross-fertilization in future work."
"AlphaD3M: Machine Learning Pipeline Synthesis" Drori et al.
AlphaD3M
meta-learning
ICML 2018
"Population Based Training of Neural Networks" Jaderberg et al.
"Neural networks dominate the modern machine learning landscape, but their training and success still suffer from sensitivity to empirical choices of hyperparameters such as model architecture, loss function, and optimisation algorithm. In this work we present Population Based Training, a simple asynchronous optimisation algorithm which effectively utilises a fixed computational budget to jointly optimise a population of models and their hyperparameters to maximise performance. Importantly, PBT discovers a schedule of hyperparameter settings rather than following the generally sub-optimal strategy of trying to find a single fixed set to use for the whole course of training. With just a small modification to a typical distributed hyperparameter training framework, our method allows robust and reliable training of models. We demonstrate the effectiveness of PBT on deep reinforcement learning problems, showing faster wall-clock convergence and higher final performance of agents by optimising over a suite of hyperparameters. In addition, we show the same method can be applied to supervised learning for machine translation, where PBT is used to maximise the BLEU score directly, and also to training of Generative Adversarial Networks to maximise the Inception score of generated images. In all cases PBT results in the automatic discovery of hyperparameter schedules and model selection which results in stable training and better final performance."
"Two common tracks for the tuning of hyperparameters exist: parallel search and sequential optimisation, which trade-off concurrently used computational resources with the time required to achieve optimal results. Parallel search performs many parallel optimisation processes (by optimisation process we refer to neural network training runs), each with different hyperparameters, with a view to finding a single best output from one of the optimisation processes – examples of this are grid search and random search. Sequential optimisation performs few optimisation processes in parallel, but does so many times sequentially, to gradually perform hyperparameter optimisation using information obtained from earlier training runs to inform later ones – examples of this are hand tuning and Bayesian optimisation. Sequential optimisation will in general provide the best solutions, but requires multiple sequential training runs, which is often unfeasible for lengthy optimisation processes."
"In this work, we present a simple method, Population Based Training which bridges and extends parallel search methods and sequential optimisation methods. Advantageously, our proposal has a wallclock run time that is no greater than that of a single optimisation process, does not require sequential runs, and is also able to use fewer computational resources than naive search methods such as random or grid search. Our approach leverages information sharing across a population of concurrently running optimisation processes, and allows for online propagation/transfer of parameters and hyperparameters between members of the population based on their performance."
"Furthermore, unlike most other adaptation schemes, our method is capable of performing online adaptation of hyperparameters – which can be particularly important in problems with highly non-stationary learning dynamics, such as reinforcement learning settings, where the learning problem itself can be highly non-stationary (e.g. dependent on which parts of an environment an agent is currently able to explore). As a consequence, it might be the case that the ideal hyperparameters for such learning problems are themselves highly non-stationary, and should vary in a way that precludes setting their schedule in advance."
post https://deepmind.com/blog/population-based-training-neural-networks/
post https://deepmind.com/blog/article/how-evolutionary-selection-can-train-more-capable-self-driving-cars/
video https://vimeo.com/250399261 (Jaderberg)
video https://youtube.com/watch?v=pEANQ8uau88 (Shorten)
video https://youtube.com/watch?v=uuOoqAiB2g0 (Sazanovich) in russian
paper https://arxiv.org/abs/1902.01894 by Li et al.
"Data Programming: Creating Large Training Sets, Quickly" Ratner, Sa, Wu, Selsam, Re
"Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users provide a set of labeling functions, which are programs that heuristically label large subsets of data points, albeit noisily. By viewing these labeling functions as implicitly describing a generative model for this noise, we show that we can recover the parameters of this model to “denoise” the training set. Then, we show how to modify a discriminative loss function to make it noise-aware. We demonstrate our method over a range of discriminative models including logistic regression and LSTMs. We establish theoretically that we can recover the parameters of these generative models in a handful of settings. Experimentally, on the 2014 TAC-KBP relation extraction challenge, we show that data programming would have obtained a winning score, and also show that applying data programming to an LSTM model leads to a TAC-KBP score almost 6 F1 points over a supervised LSTM baseline (and into second place in the competition). Additionally, in initial user studies we observed that data programming may be an easier way to create machine learning models for non-experts."
"In the data programming approach to developing a machine learning system, the developer focuses on writing a set of labeling functions, which create a large but noisy training set. Snorkel then learns a generative model of this noise - learning, essentially, which labeling functions are more accurate than others - and uses this to train a discriminative classifier. At a high level, the idea is that developers can focus on writing labeling functions - which are just (Python) functions that provide a label for some subset of data points - and not think about algorithms or features!"
video https://youtube.com/watch?v=iSQHelJ1xxU
video https://youtube.com/watch?v=HmocI2b5YfA (Re)
post http://hazyresearch.github.io/snorkel/blog/weak_supervision.html
post http://hazyresearch.github.io/snorkel/blog/dp_with_tf_blog_post.html
audio https://soundcloud.com/nlp-highlights/28-data-programming-creating-large-training-sets-quickly (Ratner)
notes https://github.com/b12io/reading-group/blob/master/data-programming-snorkel.md
code https://github.com/HazyResearch/snorkel
- Snorkel project
summary
"Socratic Learning: Empowering the Generative Model" Varma et al.
"A challenge in training discriminative models like neural networks is obtaining enough labeled training data. Recent approaches have leveraged generative models to denoise weak supervision sources that a discriminative model can learn from. These generative models directly encode the users' background knowledge. Therefore, these models may be incompletely specified and fail to model latent classes in the data. We present Socratic learning to systematically correct such generative model misspecification by utilizing feedback from the discriminative model. We prove that under mild conditions, Socratic learning can recover features from the discriminator that informs the generative model about these latent classes. Experimentally, we show that without any hand-labeled data, the corrected generative model improves discriminative performance by up to 4.47 points and reduces error for an image classification task by 80% compared to a state-of-the-art weak supervision modeling technique."
video https://youtube.com/watch?v=0gRNochbK9c
post http://hazyresearch.github.io/snorkel/blog/socratic_learning.html
code https://github.com/HazyResearch/snorkel
- Snorkel project
summary
"Snorkel: Rapid Training Data Creation with Weak Supervision" Ratner, Bach, Ehrenberg, Fries, Wu, Re
"Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train stateof-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8× faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8× speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets."
video https://youtube.com/watch?v=HmocI2b5YfA (Re)
notes https://blog.acolyer.org/2018/08/22/snorkel-rapid-training-data-creation-with-weak-supervision
- Snorkel project
summary
"Hidden Technical Debt in Machine Learning Systems" Sculley et al.
"Machine learning offers a fantastically powerful toolkit for building useful complexprediction systems quickly. This paper argues it is dangerous to think ofthese quick wins as coming for free. Using the software engineering frameworkof technical debt, we find it is common to incur massive ongoing maintenancecosts in real-world ML systems. We explore several ML-specific risk factors toaccount for in system design. These include boundary erosion, entanglement,hidden feedback loops, undeclared consumers, data dependencies, configurationissues, changes in the external world, and a variety of system-level anti-patterns."
notes https://blog.acolyer.org/2016/02/29/machine-learning-the-high-interest-credit-card-of-technical-debt
post http://john-foreman.com/blog/the-perilous-world-of-machine-learning-for-fun-and-profit-pipeline-jungles-and-hidden-feedback-loops
"TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous “parameter server” designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with particularly strong support for training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model in contrast to existing systems, and demonstrate the compelling performance that TensorFlow achieves for several real-world applications."
"A Reliable Effective Terascale Linear Learning System" Agarwal, Chapelle, Dudik, Langford
Vowpal Wabbit
"We present a system and a set of techniques for learning linear predictors with convex losses on terascale data sets, with trillions of features, 1 billions of training examples and millions of parameters in an hour using a cluster of 1000 machines. Individually none of the component techniques are new, but the careful synthesis required to obtain an efficient implementation is. The result is, up to our knowledge, the most scalable and efficient linear learning system reported in the literature. We describe and thoroughly evaluate the components of the system, showing the importance of the various design choices."
"
- Online by default
- Hashing, raw text is fine
- Most scalable public algorithm
- Reduction to simple problems
- Causation instead of correlation
- Learn to control based on feedback
"
- https://github.com/JohnLangford/vowpal_wabbit/wiki
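The "Hashing, raw text is fine" item in the quoted list above refers to the feature-hashing trick; here is a self-contained sketch of the idea (the hash function, dimensionality, and update rule are chosen for illustration and are not VW's actual internals): raw tokens index a fixed-size weight vector directly, with no vocabulary, updated online one example at a time.

```python
import zlib

DIM = 2 ** 20  # fixed parameter vector size, independent of vocabulary

def hashed_features(text):
    # each raw token indexes a weight slot via a hash; collisions are accepted
    return [zlib.crc32(tok.encode()) % DIM for tok in text.split()]

weights = [0.0] * DIM

def predict(text):
    return sum(weights[i] for i in hashed_features(text))

def sgd_update(text, label, lr=0.1):
    # online squared-loss update, one example at a time
    error = predict(text) - label
    for i in hashed_features(text):
        weights[i] -= lr * error

sgd_update("cheap pills now", 1.0)
print(predict("cheap pills now"))
```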
video http://youtube.com/watch?v=wwlKkFhEhxE (Langford)
paper "Bring The Noise: Embracing Randomness Is the Key to Scaling Up Machine Learning Algorithms" by Brian Dalessandro
"Making Contextual Decisions with Low Technical Debt" Agarwal et al.
"CatBoost: Gradient Boosting with Categorical Features Support" Dorogush, Ershov, Gulin
"In this paper we present CatBoost, a new open-sourced gradient boosting library that successfully handles categorical features and outperforms existing publicly available implementations of gradient boosting in terms of quality on a set of popular publicly available datasets. The library has a GPU implementation of learning algorithm and a CPU implementation of scoring algorithm, which are significantly faster than other gradient boosting libraries on ensembles of similar sizes."
"Two critical algorithmic advances introduced in CatBoost are the implementation of ordered boosting, a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features. Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms."
- https://catboost.yandex
video https://youtube.com/watch?v=jLU6kNRiZ5o
video https://youtube.com/watch?v=BgDmuvPaUBo (Dorogush)
video https://youtube.com/watch?v=8o0e-r0B5xQ (Dorogush)
video https://youtube.com/watch?v=usdEWSDisS0 (Dorogush)
video https://youtube.com/watch?v=db-iLhQvcH8 (Prokhorenkova)
video https://youtube.com/watch?v=ZAGXnXmDCT8 (Ershov)
video https://youtube.com/watch?v=UYDwhuyWYSo (Dorogush) in russian
video https://youtube.com/watch?v=9ZrfErvm97M (Dorogush) in russian
video https://youtube.com/watch?v=Q_xa4RvnDcY (Dorogush) in russian
video https://youtube.com/watch?v=ZaP5qFSIcIw (Dmitriev, Lyzhin, Peshaya) in russian
code https://github.com/catboost/catboost
paper "CatBoost: Unbiased Boosting with Categorical Features" by Prokhorenkova, Gusev, Vorobev, Dorogush, Gulin
"Consistent Individualized Feature Attribution for Tree Ensembles" Lundberg, Erion, Lee
"Interpreting predictions from tree ensemble methods such as gradient boosting machines and random forests is important, yet feature attribution for trees is often heuristic and not individualized for each prediction. Here we show that popular feature attribution methods are inconsistent, meaning they can lower a feature's assigned importance when the true impact of that feature actually increases. This is a fundamental problem that casts doubt on any comparison between features. To address it we turn to recent applications of game theory and develop fast exact tree solutions for SHAP (SHapley Additive exPlanation) values, which are the unique consistent and locally accurate attribution values. We then extend SHAP values to interaction effects and define SHAP interaction values. We propose a rich visualization of individualized feature attributions that improves over classic attribution summaries and partial dependence plots, and a unique "supervised" clustering (clustering based on feature attributions). We demonstrate better agreement with human intuition through a user study, exponential improvements in run time, improved clustering performance, and better identification of influential features. An implementation of our algorithm has also been merged into XGBoost and LightGBM."