## Neural network theory: learning & generalisation

Neural networks are a powerful tool in modern machine learning, driving progress in areas ranging from protein folding to natural language processing. This half of the course will cover theoretical results with a direct bearing on machine learning practice. It will tackle questions such as:

- How should I wire up my neural network?
- What class of functions does my network realise?
- If my wide 2-layer network can fit any dataset, why try anything else?
- How far is it safe to perturb my neural network during learning?
- Why does my network with more parameters than training data still generalise?
- Why is VC dimension not a relevant complexity measure for my network?
- How much information did my neural network extract from the training data?

*Health warning: these questions are still the subject of active research. This course will present the instructor’s best understanding of the issues and their resolutions.*

### Instructor

This part of the course will be taught by Jeremy Bernstein (bernstein@caltech.edu).

### Lectures

| # | Date | Subject | Resources |
|---|---|---|---|
| Main Lectures | | | |
| 7 | 4/20 | Neural Architecture Design | pdf / vid |
| 8 | 4/22 | Network Function Spaces | pdf / vid / ipynb |
| 9 | 4/27 | Network Optimisation | pdf / vid |
| 10 | 4/29 | Statistical Learning Theory | pdf / vid |
| 11 | 5/04 | PAC-Bayesian Theory | pdf / vid |
| 12 | 5/06 | Project Ideas | pdf / vid |
| Guest Lectures | | | |
| 14 | 5/13 | Yasaman Bahri | pdf / vid |
| 18 | 5/27 | Guillermo Valle-Pérez & Ard Louis | pdf / vid |
| 19 | 6/01 | SueYeon Chung | pdf / vid |

### Lecture 7 references

Network topologies:

- Backpropagation applied to handwritten zip code recognition—CNNs;
- Attention is all you need—transformers.

Neural architecture search:

Local network design:

- Centered weight normalization—weight constraints;
- Deep hyperspherical learning;
- Batch normalization;
- Normalization propagation—nonlinearity design.

Perturbation theory:

- Perturbation theory for the singular value decomposition;
- Perturbation theory for multilayer perceptrons.

### Lecture 8 references

Universal function approximation:

- Approximation by superpositions of a sigmoidal function;
- Approximation capabilities of multilayer feedforward networks.

NNGP correspondence:

- Radford Neal’s PhD thesis—introduces NNGP;
- Kernel methods for deep learning—works out the relu kernel;
- Deep neural networks as Gaussian processes—NNGP for many layers.
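
As a rough illustration of the relu kernel mentioned above, the sketch below evaluates the degree-1 arc-cosine kernel in closed form, k(x, y) = (1/π) ‖x‖ ‖y‖ (sin θ + (π − θ) cos θ) with θ the angle between x and y, and compares it with a Monte Carlo average over random relu features. The function names `relu_kernel` and `relu_kernel_monte_carlo` are hypothetical, and the normalisation (a factor of 2 on the expectation, so that k(x, x) = ‖x‖²) is one common convention that may differ from the one used in the lecture.

```python
import numpy as np

def relu_kernel(x, y):
    """Closed-form degree-1 arc-cosine kernel between two nonzero vectors."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    # Clip guards against rounding pushing the cosine slightly outside [-1, 1].
    cos_theta = np.clip(np.dot(x, y) / (nx * ny), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return nx * ny * (np.sin(theta) + (np.pi - theta) * cos_theta) / np.pi

def relu_kernel_monte_carlo(x, y, num_samples=200_000, seed=0):
    """Estimate 2 * E[relu(w.x) * relu(w.y)] for w ~ N(0, I) by sampling."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((num_samples, x.shape[0]))
    return 2.0 * np.mean(np.maximum(w @ x, 0.0) * np.maximum(w @ y, 0.0))

if __name__ == "__main__":
    x = np.array([1.0, 2.0])
    y = np.array([0.5, -1.0])
    print("closed form:", relu_kernel(x, y))
    print("monte carlo:", relu_kernel_monte_carlo(x, y))
```

The two numbers should agree up to sampling noise, which is the single-hidden-layer case of the correspondence: averaging over random hidden weights turns a wide relu layer into a fixed kernel.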

### Lecture 9 references

“Classic” deep learning optimisers:

Optimisation models:

Relative optimisers:

- LARS, LAMB and Fromage—per layer relative updates;
- Madam—per synapse relative updates;
- Nero and NFnets—per neuron relative updates with weight constraints.
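
To make "per layer relative update" concrete, here is a minimal sketch in which each layer's step is rescaled so that its norm is a fixed fraction `lr` of that layer's weight norm. It is not the exact update rule of LARS, LAMB or Fromage, which add further ingredients; the function name `relative_update` and its parameters are assumptions made for illustration.

```python
import numpy as np

def relative_update(weights, grads, lr=0.01, eps=1e-12):
    """Apply one per-layer relative step to a list of weight arrays.

    Each layer moves by (approximately) lr * ||W|| in Frobenius norm,
    regardless of the raw scale of its gradient.
    """
    updated = []
    for w, g in zip(weights, grads):
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        # Normalise the gradient, then scale the step by the weight norm.
        step = lr * w_norm * g / (g_norm + eps)
        updated.append(w - step)
    return updated

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
    grads = [rng.standard_normal((4, 3)), 1000.0 * rng.standard_normal((2, 4))]
    new_weights = relative_update(weights, grads, lr=0.01)
    for w, w_new in zip(weights, new_weights):
        print(np.linalg.norm(w_new - w) / np.linalg.norm(w))
```

Both layers report a relative change close to `lr` even though their gradients differ in scale by a factor of about 1000. The per neuron and per synapse variants listed above apply the same idea at a finer granularity.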

### Lecture 10 references

Function counting and generalisation:

- Cover’s function-counting theorem;
- Vapnik and Chervonenkis’s 1971 paper;
- A textbook chapter on VC theory (chapter 5).
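
For a feel of the numbers behind function counting, the sketch below evaluates Cover's formula C(N, d) = 2 Σ_{k=0}^{d−1} binom(N − 1, k), the number of labellings of N points in general position in d dimensions that a hyperplane through the origin can realise. The function names `cover_count` and `separable_fraction` are hypothetical, and the general-position and through-the-origin assumptions are stated here rather than taken from the lecture.

```python
from math import comb

def cover_count(num_points, dim):
    """Count dichotomies of num_points (>= 1) points in general position
    in `dim` dimensions realisable by a hyperplane through the origin."""
    return 2 * sum(comb(num_points - 1, k) for k in range(dim))

def separable_fraction(num_points, dim):
    """Fraction of all 2**num_points labellings that are linearly separable."""
    return cover_count(num_points, dim) / 2 ** num_points

if __name__ == "__main__":
    dim = 5
    for num_points in (3, 5, 10, 20, 40):
        print(num_points, separable_fraction(num_points, dim))
```

When N ≤ d every labelling is realisable, so the fraction is 1. At N = 2d exactly half of the labellings are realisable, and beyond that the fraction falls towards zero.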

VC theory and neural networks:

- Probable networks and plausible predictions (section 10.4);
- Understanding deep learning requires rethinking generalization.

### Lecture 11 references

Bayesian data analysis and model comparison:

- David MacKay’s book (chapter 28).

PAC-Bayes derivation:

PAC-Bayes for NNs:

- Sharpness & flatness of minimisers;
- Taking the prior “local” to the initialisation;
- Function space at infinite width.

### Lecture 12 references

Understanding the distribution of solutions that SGD samples from:

Using PAC-Bayes for architecture design:

- Comparing bounds for different architectures;
- Analytic lower bound on the NN evidence;
- Learning GP invariances using the marginal likelihood (a.k.a. evidence).

Adversarial examples:

Combining NN architectural properties with control:

- Neural lander—NNs with a Lipschitz constraint.