Neural network theory: learning & generalisation
Neural networks are a powerful tool in modern machine learning, driving progress in areas ranging from protein folding to natural language processing. This half of the course will cover theoretical results with a direct bearing on machine learning practice, tackling questions such as:
- How should I wire up my neural network?
- What class of functions does my network realise?
- If my wide 2-layer network can fit any dataset, why try anything else?
- How far is it safe to perturb my neural network during learning?
- Why does my network with more parameters than training data still generalise?
- Why is VC dimension not a relevant complexity measure for my network?
- How much information did my neural network extract from the training data?
Health warning: these questions are still the subject of active research. This part of the course will present the instructor’s best understanding of the issues and their resolutions.
Instructor
This part of the course will be taught by Jeremy Bernstein (bernstein@caltech.edu).
Homeworks
| # | Date set | Date due | Resources |
|---|----------|----------|-----------|
| 3 | 4/22 | 4/29 | hw3.zip |
| 4 | 4/29 | 5/06 | hw4.pdf |
Lectures
| # | Date | Subject | Resources |
|---|------|---------|-----------|
| Main Lectures | | | |
| 7 | 4/20 | Neural Architecture Design | pdf / vid |
| 8 | 4/22 | Network Function Spaces | pdf / vid / ipynb |
| 9 | 4/27 | Network Optimisation | pdf / vid |
| 10 | 4/29 | Statistical Learning Theory | pdf / vid |
| 11 | 5/04 | PAC-Bayesian Theory | pdf / vid |
| 12 | 5/06 | Project Ideas | pdf / vid |
| Guest Lectures | | | |
| 14 | 5/13 | Yasaman Bahri | pdf / vid |
| 18 | 5/27 | Guillermo Valle-Pérez & Ard Louis | pdf / vid |
| 19 | 6/01 | SueYeon Chung | pdf / vid |
Lecture 7 references
Network topologies:
- Backpropagation applied to handwritten zip code recognition—CNNs;
- Attention is all you need—transformers.
Neural architecture search:
Local network design:
- Centered weight normalization—weight constraints (see the sketch after this list);
- Deep hyperspherical learning;
- Batch normalization;
- Normalization propagation—nonlinearity design.
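As a concrete illustration of the weight constraints listed above, here is a minimal numpy sketch of the centred-weight-normalisation idea: each neuron’s incoming weight vector is re-parameterised to have zero mean and unit norm, scaled by a per-neuron gain. This is an illustrative sketch of the constraint only, not the reference implementation from the paper.

```python
import numpy as np

def centered_weight_norm(V, g):
    """Re-parameterise a weight matrix row-wise (one row per neuron):
    centre each incoming weight vector to zero mean, scale it to unit
    norm, then multiply by a per-neuron gain."""
    V_centered = V - V.mean(axis=1, keepdims=True)
    V_unit = V_centered / np.linalg.norm(V_centered, axis=1, keepdims=True)
    return g[:, None] * V_unit

rng = np.random.default_rng(0)
V = rng.standard_normal((5, 100))   # 5 neurons, 100 inputs each
g = np.ones(5)                      # per-neuron gains
W = centered_weight_norm(V, g)
print(W.mean(axis=1))               # ~0 for every neuron
print(np.linalg.norm(W, axis=1))    # equals g for every neuron
```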
Perturbation theory:
- Perturbation theory for the singular value decomposition;
- Perturbation theory for multilayer perceptrons.
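Relating to the question in the introduction about how far it is safe to perturb a network, here is a minimal numpy sketch (written for this page, not taken from either paper above) that checks a basic perturbation bound for a relu multilayer perceptron: if every weight matrix is perturbed by at most a relative amount eps in spectral norm, the output moves by at most ((1 + eps)^L - 1) times the product of the layer spectral norms times the input norm.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def forward(weights, x):
    """Forward pass of a bias-free relu MLP."""
    h = x
    for W in weights:
        h = relu(W @ h)
    return h

depth, width, eps = 5, 50, 0.01
weights = [rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(depth)]
x = rng.standard_normal(width)

# Perturb each layer by a relative amount eps in spectral norm.
perturbed = []
for W in weights:
    Delta = rng.standard_normal(W.shape)
    Delta *= eps * np.linalg.norm(W, 2) / np.linalg.norm(Delta, 2)
    perturbed.append(W + Delta)

change = np.linalg.norm(forward(perturbed, x) - forward(weights, x))
bound = (((1 + eps) ** depth - 1)
         * np.prod([np.linalg.norm(W, 2) for W in weights])
         * np.linalg.norm(x))
print(f"observed output change {change:.4f} <= bound {bound:.4f}")
```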
Lecture 8 references
Universal function approximation:
- Approximation by superpositions of a sigmoidal function;
- Approximation capabilities of multilayer feedforward networks.
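As a toy illustration of these approximation results (and of the question in the introduction about wide two-layer networks), the sketch below fits a wide two-layer relu network to a 1-d target by fixing random first-layer weights and solving a least-squares problem for the output layer. The construction is an illustrative shortcut, not the argument used in the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Target function sampled on [-1, 1].
x = np.linspace(-1, 1, 200)
y = np.sin(3 * np.pi * x)

# Wide two-layer relu network: random first layer, least-squares output layer.
width = 1000
a = rng.standard_normal(width)        # first-layer weights
b = rng.uniform(-1, 1, width)         # first-layer biases
features = relu(np.outer(x, a) + b)   # hidden activations, shape (200, width)
c, *_ = np.linalg.lstsq(features, y, rcond=None)

print(f"max fit error at width {width}: {np.max(np.abs(features @ c - y)):.2e}")
```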
NNGP correspondence:
- Radford Neal’s PhD thesis—introduces NNGP;
- Kernel methods for deep learning—works out the relu kernel (checked numerically in the sketch below);
- Deep neural networks as Gaussian processes—NNGP for many layers.
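A quick numerical check of the relu kernel from the Cho & Saul paper: for weights w drawn from N(0, I), the expectation of relu(w·x) relu(w·x') equals (1/2π) ||x|| ||x'|| (sin θ + (π - θ) cos θ), where θ is the angle between x and x'. The Monte Carlo sketch below was written for this page as an illustration of the correspondence.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

d = 10
x, xp = rng.standard_normal(d), rng.standard_normal(d)

# Monte Carlo estimate of E_w[relu(w.x) relu(w.x')] for w ~ N(0, I).
W = rng.standard_normal((1_000_000, d))
mc = np.mean(relu(W @ x) * relu(W @ xp))

# Analytic arc-cosine kernel of degree one (Cho & Saul), divided by two.
cos_theta = x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp))
theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
analytic = (np.linalg.norm(x) * np.linalg.norm(xp) / (2 * np.pi)
            * (np.sin(theta) + (np.pi - theta) * np.cos(theta)))

print(f"Monte Carlo: {mc:.4f}   analytic: {analytic:.4f}")
```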
Lecture 9 references
“Classic” deep learning optimisers:
Optimisation models:
Relative optimisers:
- LARS, LAMB and Fromage—per-layer relative updates;
- Madam—per-synapse relative updates;
- Nero and NFNets—per-neuron relative updates with weight constraints.
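A minimal sketch of the per-layer relative update idea, roughly in the spirit of LARS: every layer moves by the same fixed fraction of its own norm, whatever the raw gradient magnitudes happen to be. Momentum, weight decay and trust-ratio clipping are omitted for clarity, so this is an illustration rather than any of the optimisers above.

```python
import numpy as np

def relative_update(weights, grads, eta=0.01, eps=1e-12):
    """Per-layer relative update: each layer changes by a fraction eta of
    its own Frobenius norm, in the direction of the negative gradient."""
    updated = []
    for W, G in zip(weights, grads):
        scale = eta * np.linalg.norm(W) / (np.linalg.norm(G) + eps)
        updated.append(W - scale * G)
    return updated

rng = np.random.default_rng(0)
weights = [rng.standard_normal((20, 20)), 100.0 * rng.standard_normal((20, 20))]
grads = [rng.standard_normal((20, 20)), rng.standard_normal((20, 20))]
for W, W_new in zip(weights, relative_update(weights, grads)):
    print(np.linalg.norm(W_new - W) / np.linalg.norm(W))   # ~0.01 for both layers
```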
Lecture 10 references
Function counting and generalisation:
- Cover’s function-counting theorem (worked example after this list);
- Vapnik and Chervonenkis’s 1971 paper;
- A textbook chapter on VC theory (chapter 5).
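As a worked example of the function-counting result referenced above, Cover’s theorem states that a linear threshold unit through the origin can realise C(n, d) = 2 ∑_{k=0}^{d-1} (n-1 choose k) of the 2^n dichotomies of n points in general position in d dimensions; in particular all 2^n when n ≤ d, and exactly half when n = 2d. A small script, written for this page:

```python
from math import comb

def cover_count(n, d):
    """Number of linearly realisable dichotomies of n points in general
    position in d dimensions (Cover, 1965)."""
    return 2 * sum(comb(n - 1, k) for k in range(d))

for n in (5, 10, 20, 40):
    d = 10
    frac = cover_count(n, d) / 2 ** n
    print(f"n={n}, d={d}: {cover_count(n, d)} of {2 ** n} dichotomies realisable ({frac:.3f})")
```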
VC theory and neural networks:
- Probable networks and plausible predictions (section 10.4);
- Understanding deep learning requires rethinking generalization.
Lecture 11 references
Bayesian data analysis and model comparison:
- David MacKay’s book (chapter 28).
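For orientation, the central object in MacKay’s treatment is the model evidence, which under a Laplace approximation factorises into a best-fit likelihood multiplied by an Occam factor. The summary below is a paraphrase from memory, so see chapter 28 for the precise statement:

$$
P(D \mid \mathcal{H}_i) \;=\; \int P(D \mid \theta, \mathcal{H}_i)\, P(\theta \mid \mathcal{H}_i)\, \mathrm{d}\theta
\;\approx\; P(D \mid \hat{\theta}, \mathcal{H}_i) \times \underbrace{P(\hat{\theta} \mid \mathcal{H}_i)\, \sigma_{\theta \mid D}}_{\text{Occam factor}},
$$

where $\hat{\theta}$ is the most probable parameter setting and $\sigma_{\theta \mid D}$ is the posterior width around it. A model that spreads its prior over a larger parameter space pays a larger Occam penalty, which is the mechanism driving Bayesian model comparison.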
PAC-Bayes derivation:
PAC-Bayes for NNs:
- Sharpness & flatness of minimisers;
- Taking the prior “local” to the initialisation;
- Function space at infinite width.
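For orientation, one commonly quoted form of the PAC-Bayesian theorem (the Langford-Seeger/Maurer version; stated from memory, so defer to the lecture notes for the exact constants): for any prior $P$ fixed before seeing the data and any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over an i.i.d. sample of size $n$, simultaneously for all posteriors $Q$,

$$
\mathrm{kl}\!\left(\hat{L}(Q) \,\Big\|\, L(Q)\right) \;\le\; \frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{n},
$$

where $\hat{L}(Q)$ and $L(Q)$ are the $Q$-averaged empirical and population error rates and $\mathrm{kl}$ is the relative entropy between Bernoulli distributions. Pinsker’s inequality then gives the more readable consequence $L(Q) \le \hat{L}(Q) + \sqrt{\big(\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}\big) / (2n)}$.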
Lecture 12 references
Understanding the distribution of solutions that SGD samples from:
Using PAC-Bayes for architecture design:
- Comparing bounds for different architectures;
- Analytic lower bound on the NN evidence;
- Learning GP invariances using the marginal likelihood (a.k.a. evidence).
Adversarial examples:
Combining NN architectural properties with control:
- Neural lander—NNs with a Lipschitz constraint.
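The Lipschitz constraint mentioned above is commonly enforced by controlling per-layer spectral norms, since the product of the layer spectral norms upper-bounds the Lipschitz constant of a relu network. Below is a minimal numpy sketch of that idea, a generic illustration rather than the Neural Lander training code.

```python
import numpy as np

def project_spectral_norm(W, limit):
    """Rescale W so that its largest singular value is at most `limit`."""
    sigma = np.linalg.norm(W, 2)   # spectral norm
    return W if sigma <= limit else W * (limit / sigma)

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# A relu MLP whose layers are projected to spectral norm <= 1.2.
limit = 1.2
weights = [project_spectral_norm(rng.standard_normal((32, 32)), limit) for _ in range(3)]

def net(x):
    for W in weights:
        x = relu(W @ x)
    return x

# The network's Lipschitz constant is then at most limit ** depth.
x, y = rng.standard_normal(32), rng.standard_normal(32)
lhs = np.linalg.norm(net(x) - net(y))
rhs = limit ** len(weights) * np.linalg.norm(x - y)
print(f"{lhs:.3f} <= {rhs:.3f}")
```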