This repository contains links to references (books, courses, etc.) that are useful for learning statistics and machine learning (as well as some neighboring topics). References for background material such as linear algebra, calculus/analysis/measure theory, probability theory, etc., are usually not included.
The level of the references starts from advanced undergraduate stats/math/CS and in some cases goes up to the research level. The books are often standard references and textbooks, used at leading institutions. In particular, several of the books are used in the standard curriculum of the PhD program in Statistics at Stanford University (where I learned from them as well), as well as at the University of Pennsylvania (where I work). The goal is to benefit students, researchers seeking to enter new areas, and lifelong learners.
For each topic, materials are listed in rough order from basic to advanced.
The list is highly subjective and incomplete, reflecting my own preferences, interests, and biases. For instance, there is an emphasis on theoretical material. Most of the references included here are ones that I have at least partially (and sometimes extensively) studied and found helpful. Others are on my to-read list. Several topics are omitted due to lack of expertise (e.g., causal inference, Bayesian statistics, time series, sequential decision-making, functional data analysis, biostatistics, ...).
The links are to freely available author copies where those exist, and to online marketplaces otherwise (you are encouraged to search for the best price).
How to use these materials to learn: To be an efficient researcher, certain core material must be mastered. However, there is far too much specialized knowledge for anyone to know it all. Fortunately, it is often enough to know what types of results/methods/tools are available, and where to find them. When they are needed, they can be recalled and used.
Please feel free to contact me with suggestions.
- Casella & Berger: Statistical Inference (2nd Edition) - Possibly the best introduction to the principles of statistical inference at an advanced undergraduate level. Mathematically rigorous but not technical. Covers key ideas and tools for constructing and evaluating estimators:
- Data reduction (sufficiency, likelihood principle),
- Methods for finding estimators (method of moments, maximum likelihood estimation, Bayes estimators), methods for evaluating estimators (mean squared error, bias and variance, best unbiased estimators, loss function optimality),
- Hypothesis testing (likelihood ratio tests, power), confidence intervals (pivotal quantities, coverage),
- Asymptotics (consistency, efficiency, bootstrap, robustness).
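As a small illustration of the estimator-evaluation ideas above (bias, variance, mean squared error), here is a hedged NumPy sketch, with arbitrary constants, comparing the biased maximum likelihood estimator of a normal variance to the unbiased sample variance:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0          # true variance
n, reps = 10, 20000   # small n makes the bias visible

# Draw many samples of size n and compute both estimators on each.
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
mle = x.var(axis=1, ddof=0)        # divides by n:   biased, E[mle] = (n-1)/n * sigma2
unbiased = x.var(axis=1, ddof=1)   # divides by n-1: unbiased

# The MLE's bias is -sigma2/n = -0.4 here; the unbiased estimator's is ~0.
print(mle.mean() - sigma2, unbiased.mean() - sigma2)
```

The simulation recovers the textbook fact that the MLE trades a small downward bias for slightly lower variance.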
- Wasserman: All of Statistics: A Concise Course in Statistical Inference - A panoramic overview of statistics; mathematical but proofs are omitted. Covers material overlapping with ESL, TSH, TPE (abbreviations defined below), and other books in this list.
- Cox: Principles of Statistical Inference - Covers a number of classical principles and ideas such as pivotal inference, ancillarity, conditioning, including famous paradoxes. Light on math, but containing deep thoughts.
- Hastie, Tibshirani & Friedman: The Elements of Statistical Learning - The bible of modern statistical methodology, with comprehensive coverage from linear methods to kernels, basis expansions, trees/forests, model selection, high-dimensional methods, etc. Emphasizes ideas over math. Free on the authors' website. Known as "ESL".
- Lehmann & Casella: Theory of Point Estimation, 2nd Edition - Solid mathematically rigorous overview of point estimation theory. Known as "TPE".
- Lehmann & Romano: Testing Statistical Hypotheses, 4th Edition - A complement to TPE; covers the theory of inference (hypothesis tests and confidence intervals). Known as "TSH".
- van der Vaart: Asymptotic Statistics - Covers classical fixed-dimensional asymptotics.
- Candes: Theory of Statistics, STAT 300C Lecture Notes - Modern statistical theory: sparsity, detection thresholds, multiple testing, false discovery rate control, Benjamini-Hochberg procedure, model selection, conformal prediction, etc.
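As a concrete taste of one topic from these notes, here is a sketch of the Benjamini-Hochberg step-up procedure (the function name and example p-values are made up for illustration):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.1):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest k with p_(k) <= k*q/m, then reject the k smallest p-values.
    below = ranked <= np.arange(1, m + 1) * q / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max() + 1
        reject[order[:k]] = True
    return reject

# Illustrative p-values: the step-up rule rejects the six smallest here.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.32, 0.9]))
```

Note the step-up character of the rule: 0.039 exceeds its own threshold 3(0.1)/8, but is still rejected because a larger p-value further down the sorted list falls below its threshold.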
This section is the most detailed one, as it is the closest to my research.
- Tsybakov: Introduction to Nonparametric Estimation - The first two chapters contain many core results and techniques in nonparametric estimation, including lower bounds (Le Cam, Fano, Assouad).
- Weissman, Ozgur, Han: Stanford EE 378 Course Materials. Lecture Notes - Possibly the most comprehensive set of materials on information theoretic lower bounds, including estimation and testing (Ingster's method) with examples given in high-dimensional problems, optimization, etc.
- Johnstone: Gaussian estimation: Sequence and wavelet models - Beautiful overview of estimation in Gaussian noise (shrinkage, wavelet thresholding, optimality). Rigorous and deep, has challenging exercises.
- Duchi: Lecture Notes on Statistics and Information Theory - Eclectic topics in modern statistical learning, at the interface of stats and ML: intro to information theory tools, PAC-Bayes, minimax lower bounds (estimation and testing), probabilistic prediction, calibration, online game playing, online optimization, etc.
- Bach: Learning Theory from First Principles
- RJ Tibshirani: Lecture Notes on Advanced Topics in Statistical Learning: Spring 2023 - Overview of a variety of important and modern topics in statistical machine learning. Some topics are advanced and hard to find summarized elsewhere, e.g., conformal prediction under distribution shift and calibration.
- van der Vaart: Semiparametric Statistics, Chapter III of Lectures at Ecole d'Ete de Probabilites de Saint-Flour XXIX, 1999 - Concise and mathematically rigorous introduction to key ideas in semiparametrics. Defines tangent sets and spaces, differentiable paths and score functions, differentiable maps and influence functions, efficiency, etc.
- Kosorok: Introduction to Empirical Processes and Semiparametric Inference - Detailed and rigorous introduction to semiparametrics, also containing the required background from empirical process theory (and necessary math background, such as topics from functional analysis). A number of detailed examples are presented, which greatly aid appreciating the power of the theory.
- Bickel, Klaassen, Ritov, Wellner: Efficient and Adaptive Estimation for Semiparametric Models - Thorough and rigorous, but also heavy, treatise on semiparametrics, including some required background on local asymptotic normality. The first few chapters present the general theory and can be focused on during a first reading.
- Anderson: An Introduction to Multivariate Statistical Analysis - Standard reference on multivariate statistical analysis (OLS, LDA, PCA, factor analysis, MANOVA). Describes practical methods with mathematical rigor. Beautifully written.
- Politis, Romano, Wolf: Subsampling - Canonical reference for the powerful resampling methodology of subsampling.
- van der Vaart, Wellner: Weak convergence and empirical processes - Thorough and mathematically fully rigorous (sometimes technically heavy) book on empirical processes; key reference when working in the area.
High-dimensional (mean field, proportional limit) asymptotics; random matrix theory (RMT) for stats+ML
- Mei: Lecture Notes for Mean Field Asymptotics in Statistical Learning - Good overview of various techniques in the area: replica methods, Gaussian comparison inequalities/Convex Gaussian Minimax Theorem, Stieltjes transforms for random matrices, and approximate message passing (AMP). Several applications to stats+ML are presented.
- Couillet & Debbah: Random Matrix Methods for Wireless Communications - The first section is a good overview of the most commonly used RMT techniques and results for stats+ML. Strikes an ideal balance between rigor and clarity (statements are rigorous and detailed proof sketches are presented, but some of the most technical proof components are omitted, with references to papers given).
- Bai & Silverstein: Spectral Analysis of Large Dimensional Random Matrices - A standard reference in the field, with citable results stated at full generality, and with proofs. Nonetheless, requires filling in details of calculations/arguments, which can take a lot of effort for students.
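A quick way to see the kind of result these books prove is to check the edges of the Marchenko-Pastur law numerically; here is a minimal sketch with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 4000, 1000                 # aspect ratio gamma = p/n = 1/4
gamma = p / n

# Sample covariance matrix of i.i.d. standard normal data.
X = rng.standard_normal((n, p))
eig = np.linalg.eigvalsh(X.T @ X / n)

# Marchenko-Pastur law: for identity population covariance, the spectrum
# concentrates on [(1 - sqrt(gamma))^2, (1 + sqrt(gamma))^2].
lo, hi = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
print(eig.min(), eig.max(), (lo, hi))   # extreme eigenvalues near 0.25 and 2.25
```

Even at these moderate dimensions, the extreme eigenvalues sit within a few percent of the asymptotic edges, which is the bulk-spectrum phenomenon the proportional-limit asymptotics describe.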
- Peck et al: Statistics - A Guide to the Unknown - Engaging essays about applications of statistics in diverse areas: public policy, science & tech, bio & medicine, business & industry, hobbies & recreation. Elementary (minimal to no prerequisites), and written in a way that "draws you in".
- Morton et al: Public Policy and Statistics: Case Studies from RAND
- Peck et al.: Statistical Case Studies: A Collaboration Between Academe and Industry Student Edition. Instructor Edition.
- Shalev-Shwartz & Ben-David: Understanding Machine Learning: From Theory to Algorithms - Good single reference source of core machine learning theory ideas and results.
- Srebro: Computational and Statistical Learning Theory - Great course materials on Statistical/PAC learning, online learning, crypto lower bounds.
- Orabona: A Modern Introduction to Online Learning
- Courses at DeepLearning.AI. Their intro deep learning course with Andrew Ng is great.
- Andrej Karpathy's Neural Networks: Zero to Hero video lectures. 100% coding-based, hands-on tutorial on implementing basic autodiff, neural nets, language models, and a small version of GPT.
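In the spirit of those lectures (though this is not Karpathy's actual code), here is a minimal sketch of reverse-mode autodiff on scalars, the core idea implemented in the first video:

```python
class Value:
    """A scalar that tracks its gradient via reverse-mode autodiff."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._local_grads = local_grads  # d(self)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topologically sort the computation graph, then apply the chain rule in reverse.
        topo, seen = [], set()
        def build(v):
            if id(v) not in seen:
                seen.add(id(v))
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad

x, y = Value(2.0), Value(3.0)
z = x * y + x          # z = 8;  dz/dx = y + 1 = 4,  dz/dy = x = 2
z.backward()
print(z.data, x.grad, y.grad)
```

The same two ingredients, local derivatives at each node plus a reverse topological sweep, underlie autograd in PyTorch and JAX.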
- DeepMind x UCL | Deep Learning Lecture Series 2021 Videos
- Prince: Understanding Deep Learning. Free PDF book, code notebooks, and slides available on author website.
- ML Safety Course at Center for AI Safety. See video lectures on Youtube.
- Elad Hazan's AI Safety Course at Princeton
This is subject to active development and research. There is no complete reference.
- The corresponding sections in the Understanding Deep Learning book. See also the associated tutorial posts: LLMs; Transformers 1, 2, 3; Training and fine-tuning; Inference
- Raschka: LLMs from Scratch
- UC Berkeley course Understanding Large Language Models: Foundations and Safety
- Transformer Mechanistic Interpretability: Transformer Circuits. Additional notes.
- Dobriban's course materials for STAT 991 - Contains detailed references to materials on uncertainty quantification for ML, including conformal prediction/predictive inference and calibration.
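As a concrete taste of the conformal prediction material referenced there, here is a sketch of split conformal prediction intervals on synthetic data (the model and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + noise; any fitted model could be plugged in below.
x = rng.uniform(-1, 1, size=2000)
y = 2 * x + rng.normal(scale=0.5, size=2000)

# Split into a proper training half and a calibration half.
x_tr, y_tr, x_cal, y_cal = x[:1000], y[:1000], x[1000:], y[1000:]
slope = np.sum(x_tr * y_tr) / np.sum(x_tr ** 2)   # least squares through the origin
predict = lambda t: slope * t

# Conformity scores on the calibration set and their adjusted quantile.
alpha = 0.1
scores = np.abs(y_cal - predict(x_cal))
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
qhat = np.sort(scores)[k - 1]

# The interval [predict(x) - qhat, predict(x) + qhat] has marginal
# coverage >= 1 - alpha for exchangeable data, regardless of the model fit.
x_new = rng.uniform(-1, 1, size=5000)
y_new = 2 * x_new + rng.normal(scale=0.5, size=5000)
coverage = np.mean(np.abs(y_new - predict(x_new)) <= qhat)
print(qhat, coverage)
```

The distribution-free coverage guarantee holds even if the fitted model is badly misspecified; a poor fit only inflates the interval width, not the coverage.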
- Boyd and Vandenberghe: Convex Optimization - Good user- and algorithm-focused book on convex optimization. Mathematically rigorous and clean, but does not go deep in the theory.
- Nesterov: Introductory Lectures on Convex Optimization: A Basic Course - A deep dive into convex optimization theory, including optimality results.
- Bottou, Curtis, Nocedal: Optimization Methods for Large-Scale Machine Learning - Good introductory review focusing on scalable first order methods, such as SGD and variance-reduced methods. Has some proofs.
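A minimal sketch of plain SGD, the workhorse method this review analyzes, on a synthetic least-squares problem (constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: minimize the average of (x_i . w - y_i)^2 / 2.
n, d = 5000, 10
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
lr = 0.01
for _ in range(20000):
    i = rng.integers(n)                  # one uniformly sampled example per step
    grad = (X[i] @ w - y[i]) * X[i]      # unbiased estimate of the full gradient
    w -= lr * grad

print(np.linalg.norm(w - w_star))        # small: SGD hovers near the optimum
```

With a constant step size, the iterates contract toward the optimum and then fluctuate in a neighborhood whose radius scales with the step size, which is one of the basic trade-offs the review quantifies.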
- Duchi: Introductory Lectures on Stochastic Optimization
- Boucheron, Lugosi, Massart: Concentration Inequalities: A Nonasymptotic Theory of Independence - Standard reference on concentration inequalities, used often in proofs in stats/ML.
- Vershynin: High-Dimensional Probability: An Introduction with Applications in Data Science - Another standard reference in the area, with citable and usable results. Also has some example applications to covariance estimation, graph estimation, etc.
- Talagrand: Upper and Lower Bounds for Stochastic Processes - Chaining is a theoretical tool developed by Talagrand, which can often give optimal bounds on the tail behavior of stochastic processes (even when standard concentration inequalities fail to do so). This is a readable, yet rigorous and complete, reference by the inventor of the theory.