Understanding Modern Deep Learning from First Principles: Training Dynamics and Neural Scaling Laws

The 2024 Nobel Prizes in Physics and Chemistry were awarded to pioneers of AI for their groundbreaking work on deep learning and its applications. Yet a long-standing debate persists: is deep learning alchemy or science? Despite its remarkable successes, deep learning often relies on engineering heuristics and lacks a solid theoretical foundation. The underlying mechanisms remain elusive, hindering deeper understanding and principled advancement.

In this talk, I will present our recent efforts to understand deep learning from first principles, and demonstrate how theoretical insights can guide the principled design of models and algorithms. Our work primarily focuses on the mathematical analysis of training dynamics in deep learning—covering gradient descent (GD), stochastic gradient descent (SGD), and Adam—as well as the scaling behaviour of large language models (LLMs). 1) We analyse the implicit regularisation effects of gradient-based optimisers (SGD, momentum-based GD, Adam), and characterise the difference between supervised fine-tuning (SFT) and reinforcement learning (RL) fine-tuning of pretrained models, in the contexts of standard neural networks and Transformers (ICML ’19, ’20, ’22; NeurIPS ’23, ’25). 2) We derive an analytical Neural Scaling Law for the attention mechanism—an essential component of the Transformer architecture (ICLR ’25). Our formulation predicts how attention scales with data size, model size, and compute, offering a tractable alternative to large-scale empirical fitting.
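
As background for point 2), empirical neural scaling laws are commonly summarised by power-law fits of the form

L(N, D) = E + A / N^α + B / D^β,

where N denotes model size (number of parameters), D dataset size (number of training tokens), E an irreducible loss term, and A, B, α, β fitted constants. This is a generic illustrative template of the kind used in empirical fitting, not the attention-specific analytical law derived in the talk.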

These findings provide theoretical insights into the learning dynamics of deep models, help illuminate the black box of deep learning, and offer guidance for future algorithmic design. I will conclude with a discussion of emerging directions in the theoretical understanding of deep learning.

Speaker
Zhanxing Zhu
Venue
Meston G05 and MS Teams