Neural network theory: learning & generalisation
Neural networks are a powerful tool in modern machine learning, driving progress in areas ranging from protein folding to natural language processing. This half of the course will cover theoretical results with a direct bearing on machine learning practice. The course will tackle questions such as:
- How should I wire up my neural network?
- What class of functions does my network realise?
- If my wide 2-layer network can fit any dataset, why try anything else?
- How far is it safe to perturb my neural network during learning?
- Why does my network with more parameters than training data still generalise?
- Why is VC dimension not a relevant complexity measure for my network?
- How much information did my neural network extract from the training data?
Health warning: these questions are still the subject of active research. This course will present the instructor’s best understanding of the issues and their resolutions.
Lecture 7 references
- Backpropagation applied to handwritten zip code recognition—CNNs;
- Attention is all you need—transformers.
Neural architecture search:
Local network design:
- Centered weight normalization—weight constraints;
- Batch normalization;
- Normalization propagation—nonlinearity design.
Perturbation theory:
- Perturbation theory for the singular value decomposition;
- Perturbation theory for multilayer perceptrons.
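To make the last group concrete: for a deep network $f_W(x) = W_L\,\phi(W_{L-1}\cdots\phi(W_1 x))$ with a 1-Lipschitz nonlinearity satisfying $\phi(0)=0$, a short operator-norm induction bounds the effect of perturbing every weight matrix at once. This is a sketch of the style of result in the perturbation-theory references above; the notation and normalisation are mine, not theirs:
$$
\frac{\lVert f_{W+\Delta W}(x) - f_W(x)\rVert}{\prod_{l=1}^{L}\lVert W_l\rVert_*\,\lVert x\rVert}
\;\le\; \prod_{l=1}^{L}\left(1 + \frac{\lVert \Delta W_l\rVert_*}{\lVert W_l\rVert_*}\right) - 1,
$$
where $\lVert\cdot\rVert_*$ denotes the spectral norm. The right-hand side stays small only when every layer's relative perturbation $\lVert\Delta W_l\rVert_*/\lVert W_l\rVert_*$ is small, which is one motivation for the relative update rules appearing in Lecture 9.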
Lecture 8 references
Universal function approximation:
- Approximation by superpositions of a sigmoidal function;
- Approximation capabilities of multilayer feedforward networks.
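The headline statement of these two papers, in the notation of Cybenko's result: for a continuous sigmoidal $\sigma$, finite sums of the form
$$
G(x) \;=\; \sum_{i=1}^{N} \alpha_i\,\sigma\!\left(w_i^\top x + b_i\right)
$$
are dense in $C([0,1]^d)$ under the sup norm, so a single hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units. The second paper relaxes the conditions on the nonlinearity.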
Neural networks as Gaussian processes:
- Radford Neal’s PhD thesis—introduces NNGP;
- Kernel methods for deep learning—works out the relu kernel;
- Deep neural networks as Gaussian processes—NNGP for many layers.
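The NNGP correspondence can be computed in closed form for relu networks. Below is a minimal numpy sketch of the kernel recursion described in the "Deep neural networks as Gaussian processes" paper, using the closed-form relu expectation worked out in "Kernel methods for deep learning". The function name and the default hyperparameters (weight variance 2, bias variance 0) are illustrative choices of mine:

```python
import numpy as np

def nngp_relu_kernel(X, depth, sigma_w2=2.0, sigma_b2=0.0):
    """NNGP kernel of a fully connected relu network after `depth`
    applications of the layer-to-layer recursion.
    X has shape (n, d): one input per row."""
    # Covariance of the first layer's pre-activations.
    K = sigma_b2 + sigma_w2 * (X @ X.T) / X.shape[1]
    for _ in range(depth):
        std = np.sqrt(np.diag(K))
        norm = np.outer(std, std)
        theta = np.arccos(np.clip(K / norm, -1.0, 1.0))
        # Closed-form E[relu(u) relu(v)] for jointly Gaussian (u, v).
        expected = norm * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
        K = sigma_b2 + sigma_w2 * expected
    return K
```

With weight variance 2 and no bias, the diagonal of the kernel is preserved from layer to layer, which is the usual "He" scaling for relu networks.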
Lecture 9 references
“Classic” deep learning optimisers:
- LARS, LAMB and Fromage—per layer relative updates;
- Madam—per synapse relative updates;
- Nero and NFnets—per neuron relative updates with weight constraints.
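A minimal numpy sketch of the idea shared by these optimisers, shown here in its per-layer (LARS-style) form: each layer moves by a fixed fraction of its own weight norm rather than by a raw gradient step. It omits momentum, weight decay and the per-synapse and per-neuron refinements, and the function name is mine:

```python
import numpy as np

def layerwise_relative_update(weights, grads, lr=0.02, eps=1e-12):
    """One step of a per-layer relative update.
    `weights` and `grads` are lists of numpy arrays, one entry per layer."""
    for W, g in zip(weights, grads):
        # Trust ratio: how large the layer is relative to its gradient.
        trust = np.linalg.norm(W) / (np.linalg.norm(g) + eps)
        # The step size is then lr * ||W||, a controlled relative perturbation.
        W -= lr * trust * g
    return weights
```

The point is that every layer satisfies $\lVert\Delta W\rVert \approx \text{lr}\cdot\lVert W\rVert$, so no layer is perturbed by more than a controlled relative amount, connecting back to the perturbation bound sketched in the Lecture 7 references.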
Lecture 10 references
Function counting and generalisation:
- Cover’s function-counting theorem;
- Vapnik and Chervonenkis’s 1971 paper;
- A textbook chapter on VC theory (chapter 5).
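Cover's theorem counts dichotomies: $n$ points in general position in $\mathbb{R}^d$ can be labelled by a homogeneous linear threshold function in
$$
C(n, d) \;=\; 2\sum_{k=0}^{d-1}\binom{n-1}{k}
$$
ways. For $n \le d$ every one of the $2^n$ labellings is realisable; the realisable fraction $C(n,d)/2^n$ equals one half at $n = 2d$ and collapses towards zero beyond that, which is what makes function counting a useful generalisation argument.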
VC theory and neural networks:
- Probable networks and plausible predictions (section 10.4);
- Understanding deep learning requires rethinking generalization.
Lecture 11 references
Bayesian data analysis and model comparison:
- David MacKay’s book (chapter 28).
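The central quantities in that chapter are the evidence (marginal likelihood) of a model $M$ with parameters $\theta$, and the posterior odds used to compare two models:
$$
p(\mathcal{D}\mid M) = \int p(\mathcal{D}\mid\theta, M)\,p(\theta\mid M)\,d\theta,
\qquad
\frac{p(M_1\mid\mathcal{D})}{p(M_2\mid\mathcal{D})} = \frac{p(\mathcal{D}\mid M_1)}{p(\mathcal{D}\mid M_2)}\cdot\frac{p(M_1)}{p(M_2)}.
$$
The evidence automatically penalises models that spread their prior over many functions the data rule out, which is the sense in which Bayesian model comparison encodes Occam's razor.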
PAC-Bayes for NNs:
- Sharpness & flatness of minimisers;
- Taking the prior “local” to the initialisation;
- Function space at infinite width.
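For reference, one common form of the PAC-Bayes bound that these papers build on (the exact logarithmic term varies between versions): for a prior $P$ fixed before seeing the $n$ training examples, with probability at least $1-\delta$, simultaneously for every posterior $Q$ over networks,
$$
\mathbb{E}_{Q}\!\left[L\right] \;\le\; \mathbb{E}_{Q}\!\left[\hat{L}\right] + \sqrt{\frac{\mathrm{KL}(Q\,\Vert\,P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}},
$$
where $\hat{L}$ is training error and $L$ is test error. The three references above differ mainly in how they choose $P$ and $Q$ so that the KL term is small for the networks that training actually finds.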
Lecture 12 references
Understanding the distribution of solutions that SGD samples from:
Using PAC-Bayes for architecture design:
- Comparing bounds for different architectures;
- Analytic lower bound on the NN evidence;
- Learning GP invariances using the marginal likelihood (a.k.a. evidence).
Combining NN architectural properties with control:
- Neural lander—NNs with a Lipschitz constraint.
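As a concrete illustration of the kind of constraint involved, here is a numpy sketch of spectral normalisation (not the Neural lander authors' code; the function name and power-iteration count are my choices). Capping each layer's largest singular value caps that layer's Lipschitz constant, and with 1-Lipschitz nonlinearities the whole network's Lipschitz constant is then bounded by the product of the per-layer caps:

```python
import numpy as np

def spectrally_normalise(W, max_sn=1.0, n_iter=20):
    """Rescale W so its spectral norm (largest singular value) is at most
    `max_sn`, estimating the top singular value by power iteration."""
    v = np.random.randn(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    sigma = float(u @ (W @ v))  # estimate of the largest singular value
    return W * min(1.0, max_sn / sigma)
```

A known Lipschitz bound on the learned dynamics is what allows the network to be composed with the controller's stability analysis.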