A Brief (and Comprehensive) Guide to Stochastic Gradient Descent Algorithms

Stochastic Gradient Descent (SGD) is a very powerful technique, currently employed to optimize virtually all deep learning models. However, the vanilla algorithm has several limitations, in particular when the problem is ill-conditioned, and it may never reach the global minimum. In this post, we're going to analyze how it works and the most important variations that can speed up convergence in deep models.

First of all, it's necessary to standardize the naming. In some books, the expression "Stochastic Gradient Descent" refers to an algorithm that operates on a batch size equal to 1, while "Mini-batch Gradient Descent" is adopted when the batch size is greater than 1. In this context, we assume that Stochastic Gradient Descent operates on batch sizes equal to or greater than 1. In particular, if we define the loss function for a single sample as:

$$L(x, y; \theta)$$

where x is the input sample, y the label(s), and θ is the parameter vector […]
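Before moving on, here is a minimal sketch of the mini-batch SGD loop described above, applied to a simple squared-error loss on a linear model. The toy data, the learning rate, and the function names are illustrative assumptions, not part of the original post; setting `batch_size = 1` recovers the "pure" SGD of some textbooks.

```python
import numpy as np

# Toy regression data (illustrative only): y = X @ w_true + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))
w_true = np.array([1.5, -2.0, 0.5, 3.0])
Y = X @ w_true + rng.normal(scale=0.1, size=256)

def grad_L(theta, x_batch, y_batch):
    # Gradient of the mean squared-error loss over one mini-batch
    residual = x_batch @ theta - y_batch
    return 2.0 * x_batch.T @ residual / len(x_batch)

theta = np.zeros(4)   # parameter vector θ
eta = 0.05            # learning rate
batch_size = 16       # batch_size = 1 gives single-sample SGD

for epoch in range(100):
    # Shuffle once per epoch, then sweep over mini-batches
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        theta -= eta * grad_L(theta, X[batch], Y[batch])

print(theta)  # should be close to w_true
```

Note the design choice: the gradient is averaged over the mini-batch, so the same learning rate behaves comparably across different batch sizes; summing instead would couple the effective step size to `batch_size`.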