
Why does Adam work so well for LLMs? And can we find optimal per-variable step sizes?

Lecture / Panel
 
For NYU Community


Part of the Special ECE Seminar Series

Modern Artificial Intelligence

Title:

Why does Adam work so well for LLMs? And can we find optimal per-variable step sizes?

Speaker:

Mark Schmidt, University of British Columbia

Abstract:

The success of the Adam optimizer on a wide array of architectures has made it the default in settings where stochastic gradient descent (SGD) performs poorly. However, it is unclear why the gap between Adam and SGD is often large for large language models (LLMs) but small for computer vision benchmarks. Recent work proposed that Adam works better for LLMs due to heavy-tailed noise in the stochastic gradients. We show evidence that the noise is not a major factor in the performance gap between SGD and Adam. Instead, we show that a key factor is the class imbalance found in language tasks. In particular, the large number of low-frequency classes causes SGD to converge slowly but has a smaller effect on Adam and sign descent. We show that a gap between SGD and Adam can be induced by adding a large number of low-frequency classes to computer vision models or even to linear models. We further prove in a simple setting that gradient descent converges slowly while sign descent does not.
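To make the mechanism concrete, here is a minimal toy sketch (not taken from the talk or the underlying papers): a separable quadratic whose per-coordinate weight plays the role of a class frequency. Under plain gradient descent the low-frequency coordinates receive tiny gradients and barely move, while sign descent updates every coordinate at the same rate. The frequencies, targets, and step size below are arbitrary choices made for illustration.

import numpy as np

# Toy separable loss f(w) = 0.5 * sum_c p_c * (w_c - 1)^2, where the
# per-coordinate weight p_c stands in for a class frequency.  Gradient descent
# scales each coordinate's update by p_c, so rare "classes" move slowly;
# sign descent moves every coordinate at the same rate regardless of p_c.

freqs = np.array([0.9, 0.09, 0.009, 0.001])   # imbalanced "class frequencies"
target = np.ones_like(freqs)                  # each coordinate should reach 1.0

def grad(w):
    return freqs * (w - target)               # gradient of the toy loss

def run(update, steps=200, lr=0.5):
    w = np.zeros_like(freqs)
    for _ in range(steps):
        w = update(w, lr)
    return w

gd_result   = run(lambda w, lr: w - lr * grad(w))           # gradient descent
sign_result = run(lambda w, lr: w - lr * np.sign(grad(w)))  # sign descent (Adam-like)

print("gradient descent:", np.round(gd_result, 3))   # low-frequency coords lag far behind 1.0
print("sign descent:    ", np.round(sign_result, 3)) # all coords reach 1.0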

A key component of the Adam optimizer's success is its use of per-variable step sizes. However, neither Adam nor any other "adaptive" algorithm is known to perform within any provable factor of the optimal fixed per-variable step sizes for the textbook problem of minimizing a smooth strongly-convex function. We propose the first method for updating per-variable step sizes that provably performs within a known factor of the optimal step sizes. The method is based on a multi-dimensional backtracking procedure that adaptively uses hyper-gradients to generate cutting planes that reduce the search space for the optimal step sizes. Since black-box cutting-plane approaches like the ellipsoid method are computationally prohibitive, we develop practical linear-time variants for this setting.
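For readers unfamiliar with the terminology, the sketch below (an illustration under assumed choices, not the proposed algorithm) shows the two objects the abstract refers to: a step taken with per-variable step sizes, and the hyper-gradient of the resulting loss with respect to those step sizes, which is the quantity the cutting planes are built from. The quadratic, the point x, and the candidate step sizes p are made up for the example.

import numpy as np

# Illustration of per-variable step sizes and the hyper-gradient they induce;
# the quadratic, the point x, and the step sizes p are arbitrary choices.

A = np.diag([10.0, 1.0, 0.1])            # ill-conditioned quadratic f(x) = 0.5 x^T A x

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

x = np.array([1.0, 1.0, 1.0])
p = np.array([0.15, 0.15, 0.15])         # candidate per-variable step sizes

g = grad(x)
x_next = x - p * g                       # preconditioned step with diag(p)

# Chain rule: d f(x_next) / d p_i = -g_i * [grad f(x_next)]_i
hyper_grad = -g * grad(x_next)

print("f before:", f(x), "f after:", f(x_next))
print("hyper-gradient w.r.t. step sizes:", hyper_grad)
# A positive coordinate of the hyper-gradient says that shrinking that step size
# would further reduce f(x_next); this sign information is what a cutting plane
# can use to discard part of the step-size search space.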

 

Bio:

Mark Schmidt is a professor in the Department of Computer Science at the University of British Columbia. His research focuses on developing faster algorithms for large-scale machine learning and exploring applications of machine learning. He is a Canada Research Chair, an Alfred P. Sloan Fellow, an NSERC Arthur B. McDonald Fellow, and a Canada CIFAR AI Chair with the Alberta Machine Intelligence Institute (Amii). Along with Nicolas Le Roux and Francis Bach, Mark was awarded the 2018 SIAM/MOS Lagrange Prize in Continuous Optimization.