Neural network theory: learning & generalisation

Neural networks are a powerful tool in modern machine learning, driving progress in areas ranging from protein folding to natural language processing. This half of the course covers theoretical results with a direct bearing on machine learning practice. It tackles questions such as:

  • How should I wire up my neural network?
  • What class of functions does my network realise?
  • If my wide 2-layer network can fit any dataset, why try anything else?
  • How far is it safe to perturb my neural network during learning?
  • Why does my network with more parameters than training data still generalise?
  • Why is VC dimension not a relevant complexity measure for my network?
  • How much information did my neural network extract from the training data?

Health warning: these questions are still the subject of active research. This course will present the instructor’s best understanding of the issues and their resolutions.


This part of the course will be taught by Jeremy Bernstein.

Lecture 7 references

Network topologies:

Neural architecture search:

Local network design:

Perturbation theory:

Lecture 8 references

Universal function approximation:

NNGP correspondence:

Lecture 9 references

“Classic” deep learning optimisers:

Optimisation models:

Relative optimisers:

  • LARS, LAMB and Fromage: per-layer relative updates;
  • Madam: per-synapse relative updates;
  • Nero and NFNets: per-neuron relative updates with weight constraints.
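
To make the idea of a per-layer relative update concrete, here is a minimal NumPy sketch. It is an illustration under simplifying assumptions, not the exact algorithm of LARS, LAMB or Fromage: momentum, weight decay, trust coefficients and any post-update normalisation are all omitted. The function name `relative_layer_update` and its parameters `lr` and `eps` are hypothetical, chosen only for this example. The point it shows is that each layer's step is rescaled so that its size is a fixed fraction of that layer's own weight norm, regardless of the raw gradient scale.

```python
import numpy as np


def relative_layer_update(weights, grads, lr=0.01, eps=1e-12):
    """Apply one simplified per-layer relative update (hypothetical helper).

    Each layer w is moved along its gradient g by a step whose norm is
    approximately lr * ||w||, so the relative change ||dw|| / ||w|| is
    about lr for every layer, whatever the gradient magnitude.
    """
    updated = []
    for w, g in zip(weights, grads):
        w_norm = np.linalg.norm(w)  # Frobenius norm for matrices
        g_norm = np.linalg.norm(g)
        # Rescale the gradient so the step has norm ~ lr * ||w||.
        # eps only guards against division by zero (an assumption of this sketch).
        step = lr * w_norm / (g_norm + eps) * g
        updated.append(w - step)
    return updated


# Two "layers" whose gradients differ in scale by five orders of magnitude.
weights = [np.ones((2, 2)), 10.0 * np.ones(3)]
grads = [100.0 * np.ones((2, 2)), 0.001 * np.ones(3)]

new_weights = relative_layer_update(weights, grads, lr=0.1)

# Relative change per layer: both come out close to lr.
relative_changes = [
    np.linalg.norm(n - w) / np.linalg.norm(w)
    for n, w in zip(new_weights, weights)
]
```

In this toy run both layers change by roughly 10% of their norm, even though their gradients differ enormously in scale. A per-synapse or per-neuron variant would apply the same kind of rescaling at a finer granularity; this sketch does not attempt to reproduce those.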

Lecture 10 references

Function counting and generalisation:

VC theory and neural networks:

Lecture 11 references

Bayesian data analysis and model comparison:

PAC-Bayes derivation:

PAC-Bayes for NNs:

Lecture 12 references

Understanding the distribution of solutions that SGD samples from:

Using PAC-Bayes for architecture design:

Adversarial examples:

Combining NN architectural properties with control: