
Isn't double descent explained by the following?

The network contains many more parameters than data points.

Therefore there is an entire region of parameter space that achieves the lowest training error.

A random point somewhere in the middle of that region probably generalises better than a point at the edge.

Once SGD enters this region at its edge, generalisation can keep improving because the randomness will most likely cause it to random-walk towards the interior of the region.

To test this hypothesis, you could run gradient descent with line search instead of SGD; if the hypothesis is correct, you should not see this extra generalisation. If you then add a bit of randomness to the gradient descent, the extra generalisation should reappear. The hypothesis also predicts that the speed at which generalisation improves depends on the batch size.
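A rough sketch of what that experiment could look like, assuming a toy overparameterised net, a simple backtracking line search, and an arbitrary noise scale (all illustrative choices, not from the comment):

    # Sketch: full-batch gradient descent with backtracking line search,
    # optionally with Gaussian noise injected into the descent direction
    # to mimic SGD's randomness. Model and data sizes are placeholders.
    import torch

    torch.manual_seed(0)
    X = torch.randn(64, 10)   # 64 data points
    y = torch.randn(64, 1)

    def make_model():
        # many more parameters than data points
        return torch.nn.Sequential(
            torch.nn.Linear(10, 512), torch.nn.ReLU(), torch.nn.Linear(512, 1))

    def loss_fn(model):
        return torch.nn.functional.mse_loss(model(X), y)

    def flat_grad(model):
        return torch.cat([p.grad.flatten() for p in model.parameters()])

    def step(model, direction, lr):
        # apply lr * direction to the flattened parameter vector
        with torch.no_grad():
            offset = 0
            for p in model.parameters():
                n = p.numel()
                p.add_(lr * direction[offset:offset + n].view_as(p))
                offset += n

    def train(noise_scale=0.0, steps=2000):
        model = make_model()
        for _ in range(steps):
            model.zero_grad()
            loss = loss_fn(model)
            loss.backward()
            d = -flat_grad(model)
            if noise_scale > 0:
                d = d + noise_scale * torch.randn_like(d)
            # backtracking line search: halve lr until the loss decreases
            lr = 1.0
            while lr > 1e-8:
                step(model, d, lr)
                if loss_fn(model) < loss:
                    break
                step(model, d, -lr)   # undo the step and retry smaller
                lr *= 0.5
        return model

    # Under the hypothesis: train(0.0) stops improving validation loss
    # once train loss hits ~0, while train(0.01) keeps drifting inside
    # the zero-loss region and generalisation keeps improving.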



Might be a stupid question, but can we skip the random walk and just pick the middle?


Not a stupid question at all, but one problem is that the boundaries of a zero-train-loss region are not well characterized, and evaluating the validation loss even at a single point is computationally expensive. The centroid of one of these regions might not even lie inside it (e.g. a donut shape, but in higher dimensions; see the toy sketch below). Interesting discussion though -- probably worth a few papers if someone were to investigate further.
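A tiny illustration of that geometric caveat: for points sampled from an annulus ("donut"), the centroid lands in the hole, outside the set itself. The dimensions and radii here are arbitrary assumptions.

    # Centroid of a donut-shaped set is not in the set.
    import numpy as np

    rng = np.random.default_rng(0)
    theta = rng.uniform(0, 2 * np.pi, 10_000)
    r = rng.uniform(0.9, 1.1, 10_000)    # thin ring around radius 1
    points = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)

    centroid = points.mean(axis=0)
    print(np.linalg.norm(centroid))      # ~0: centre of the hole,
                                         # nowhere near the ring itself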


This just sounds like Stochastic Weight Averaging, which works quite well: https://arxiv.org/abs/1803.05407
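For reference, PyTorch ships the paper's technique in torch.optim.swa_utils; a minimal sketch, with a placeholder model, data loader, and schedule lengths:

    # Stochastic Weight Averaging via PyTorch's built-in utilities.
    import torch
    from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

    model = torch.nn.Linear(10, 1)                        # stand-in model
    loader = [(torch.randn(32, 10), torch.randn(32, 1))]  # stand-in data
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    swa_model = AveragedModel(model)      # keeps a running weight average
    swa_scheduler = SWALR(optimizer, swa_lr=0.05)

    for epoch in range(20):
        for xb, yb in loader:
            optimizer.zero_grad()
            torch.nn.functional.mse_loss(model(xb), yb).backward()
            optimizer.step()
        if epoch >= 15:                          # start averaging late
            swa_model.update_parameters(model)   # average the SGD iterates
            swa_scheduler.step()

    update_bn(loader, swa_model)   # recompute BatchNorm stats (no-op here,
                                   # since the stand-in model has no BN)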



