Theoretical Explorations of Deep Learning

Why does machine learning work as well as it usually does?

In particular, machine learning uses zillions of parameters, way more than the number of data points.  This can’t possibly work!  But in practice it does (at least a lot of the time).

“Deep learning should not work as well as it seems to”

([1], p. 19)

This summer, Don Monroe discusses recent theorizing that suggests a new understanding of machine learning and other statistical methods [1].  (Caveat:  the details of this stuff are way past my meager grasp.  I’m working from Monroe’s “baby” version.  See the article for pointers to the details.)

Basically, the idea starts from the well-known picture: as the number of parameters approaches the number of data points, the solution will overfit.  It matches every data point exactly, with a fit that often has little predictive ability for new data.

The interesting thing is that beyond that point, more parameters generate solutions that are more generalizable.  Actually, the space beyond this peak is described as “a huge, complex manifold of solutions” that match the input data perfectly.  So the search will settle on one of these many possible solutions.  When ML works, it somehow chooses one of the solutions that generalizes well.  (“Somehow” includes the choice of input data and how the training procedure adjusts the parameters.)
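To convince myself this “double descent” picture is real, here is a tiny sketch I put together (not from Monroe’s article; it assumes NumPy, a toy random-Fourier-features model, and a minimum-norm least-squares fit as a rough stand-in for what training tends to find).  Train error hits zero once the feature count reaches the number of training points; test error typically spikes right around that threshold and then improves again as the parameter count keeps growing, though the exact numbers depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy 1-D regression problem.
n_train, n_test = 20, 200
x_train = rng.uniform(-1, 1, n_train)
x_test = rng.uniform(-1, 1, n_test)
y_train = np.sin(3 * x_train) + 0.1 * rng.standard_normal(n_train)
y_test = np.sin(3 * x_test)

def features(x, k, freqs, phases):
    # First k random Fourier features of the scalar input x.
    return np.cos(np.outer(x, freqs[:k]) + phases[:k])

# Pre-draw plenty of random features so the models are nested.
max_k = 200
freqs = rng.normal(0, 5, max_k)
phases = rng.uniform(0, 2 * np.pi, max_k)

for k in [2, 5, 10, 15, 20, 25, 50, 100, 200]:
    Phi_train = features(x_train, k, freqs, phases)
    Phi_test = features(x_test, k, freqs, phases)
    # Minimum-norm least-squares fit: once k > n_train this picks the
    # smallest-norm solution among the many that interpolate the data.
    w = np.linalg.pinv(Phi_train) @ y_train
    train_err = np.mean((Phi_train @ w - y_train) ** 2)
    test_err = np.mean((Phi_test @ w - y_test) ** 2)
    print(f"k={k:4d}  train MSE={train_err:8.4f}  test MSE={test_err:8.4f}")
```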

Even cooler, in this framework, other methods work the same way.  This means that simpler, easier-to-understand kernel methods (e.g., support vector machines) are theoretically related to machine learning.  (A key difference is that machine learning learns features, which kernel methods do not.)
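To illustrate that last parenthetical, here is another toy sketch of my own (again assuming NumPy; the RBF “kernel-style” model and the tiny two-layer network are illustrative stand-ins, not anything from the article).  The kernel-style model fixes its features up front and only fits a linear readout; the network’s first-layer weights, i.e., its features, actually move during training.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D regression data.
x = rng.uniform(-1, 1, (40, 1))
y = np.sin(3 * x)

# Kernel-style model: features are fixed up front; only the linear readout is fit.
centers = rng.uniform(-1, 1, (1, 30))         # fixed feature locations
Phi = np.exp(-((x - centers) ** 2) / 0.1)     # RBF features, never updated
w_kernel = np.linalg.pinv(Phi) @ y            # fit readout weights only

# Tiny two-layer network: the first-layer weights (the features) are trained too.
W1 = rng.normal(0.0, 1.0, (1, 30)); b1 = np.zeros(30)
w2 = rng.normal(0.0, 0.1, (30, 1))
W1_init = W1.copy()
lr = 0.1
for _ in range(2000):
    h = np.tanh(x @ W1 + b1)                  # learned features
    err = h @ w2 - y
    grad_w2 = h.T @ err / len(x)              # backprop through the readout...
    grad_h = (err @ w2.T) * (1 - h ** 2)
    W1 -= lr * (x.T @ grad_h / len(x))        # ...and into the features themselves
    b1 -= lr * grad_h.mean(axis=0)
    w2 -= lr * grad_w2

print("kernel model train MSE:", np.mean((Phi @ w_kernel - y) ** 2))
print("network train MSE:", np.mean((np.tanh(x @ W1 + b1) @ w2 - y) ** 2))
print("mean shift in the network's feature weights:", np.abs(W1 - W1_init).mean())
print("the kernel model's feature centers never moved at all")
```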

These results suggest that it may be possible to combine machine learning and simpler methods to get the best of both worlds: the feature learning of ML and the simplicity of SVMs.

For me, the cool thing is a vision of the algorithm breaking through the “overfitting barrier”, out into this complex ocean of “perfect fits”.  Now we are not looking for one solution; we are searching among many good solutions for the most useful, i.e., the best predictors.

This picture makes the magic of machine learning a little more believable.  It also helps me imagine how some of those perfect fits may be pathological (as in psychedelic toasters).

Interesting, even if the details are more than I can manage.


  1. Don Monroe. A deeper understanding of deep learning. Communications of the ACM, 65(6):19–20, 2022. https://doi.org/10.1145/3530686
