In the summer of 2015, a recent Google hire secured a 15-minute meeting with Ilya Sutskever. The “Noogler”, as new hires were called, came prepared with a set of questions about deep learning, the mysterious new buzz that had taken Google by storm.
In an overcrowded open office humming with keystrokes, a coveted new team, called Google Brain, developed predictive models aspiring to resemble the human neural circuitry. Interns wore T-shirts with “my other brain is a datacenter” printed on the back and a cartoon brain in the corporate colors on the front.
Among those familiar with the new stuff at Google, Sutskever ranked second only to the godfather Geoff Hinton. Hinton, Krizhevsky, and Sutskever had kicked off the deep learning revolution with the AlexNet architecture, a deep convolutional neural network that defeated the competition on the new and challenging ImageNet benchmark. This breakthrough prompted Google to go all in on deep learning.
One of the Noogler’s questions was about the learning rate of the optimization algorithm used to train a deep neural network. Training neural networks was notoriously finicky, with many knobs to turn. Experts call these knobs hyperparameters to hint at their elevated significance over the more mundane model parameters that change with each step of the optimizer. Among the hyperparameters, the learning rate stands as an archetype.
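For readers who have never turned these knobs, a minimal sketch of where the learning rate enters may help; the plain-NumPy setting and the name sgd_step are illustrative, not taken from any particular codebase.

```python
import numpy as np

def sgd_step(weights, gradient, lr=0.1):
    # The model parameters (weights) change with every step of the optimizer;
    # the learning rate lr is fixed ahead of time -- a hyperparameter.
    return weights - lr * gradient

# Hypothetical usage: one update with whatever gradient the backward pass produced.
w = np.zeros(3)
g = np.array([0.5, -1.0, 2.0])
w = sgd_step(w, g, lr=0.1)
```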
Pick too large a learning rate and the computer will “NaN out”. Numbers on your screen turn to letters of defeat. Floating-point arithmetic crumbles under the GPU’s indifferent onslaught. At any given time, at least one screen in the office had a terminal window open, printing “NaN” over and over. It meant the model builder had checked in, hoping to see the error decrease, only to experience the sinking feeling of a failed training run. A single NaN propagates like wildfire through the computation graph, torching every number it touches. There is only one thing to do about it.
Try again.
Pick too small a learning rate and the algorithm will converge to disappointing performance on the task, wasting a precious run on the cluster. What’s worse, you won’t know whether your model is bad or whether you merely dialed the knobs the wrong way. The only recourse is to turn the knobs some other way and try again.
Researchers traded heuristics like Pokemon cards. Pick the largest learning rate that doesn’t NaN out. Decay the learning rate smoothly over time. Drop it by a factor of 10 every so often, three times in total. Warm it up. Make it oscillate. Pick different learning rates for different layers of the neural net. These choices seemed to hold the key to unlocking the riches of deep learning.
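As a rough illustration of that folklore, here is what two such schedules might look like in code; the function names, the base rate of 0.1, and the drop points are made up for the sketch, not a recipe from any particular team.

```python
import math

def step_decay(step, total_steps, base_lr=0.1):
    # Drop the rate by a factor of 10 every so often, three times in total.
    drops = sum(step >= total_steps * frac for frac in (0.25, 0.5, 0.75))
    return base_lr / (10 ** drops)

def warmup_cosine(step, total_steps, base_lr=0.1, warmup_steps=1000):
    # Warm the rate up linearly, then let it decay smoothly along a cosine.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```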
The time had come for the meeting.
“So, Ilya, what learning rate do you use?”
Ilya met the Noogler’s eyes with his characteristically stern look. After a deliberate pause, he replied:
“.1”
The tone was serious. There was no room to mistake the answer for a joke. Asking for clarification was not an option.
The Noogler had expected a sophisticated technical answer revealing some of the genius and inside knowledge he had come here for. Instead, he found himself unable to grasp the significance of an arbitrary number. Did the prophet of deep learning really believe that the universal answer to the learning rate problem was .1?
Sutskever would soon leave Google to co-found OpenAI. OpenAI would go on to develop the artificial intelligence many now fear. The Noogler never learned what the prophet had meant to say. With time he nonetheless found an answer to Sutskever’s riddle. The secret was not in the learning rate. Nor in the specifics of the optimizer. Not even in the model architecture.
The lesson was something else.