A concise overview of cost function optimizers!

Photo by Jon Tyson on Unsplash

Optimizers are needed to find the optimal solution for the given task. Optimizers associate themselves with cost function and model parameters together by updating the model. i.e. when you want to identify weights that minimize your mean squared error in linear regression, you need to use some function to find parameters such that mean squared error is minimum, this function is called optimizer. So you use the optimizer function to reach global minima with respect to the cost function.

Types of optimizers:

1. Gradient Descent

2. Momentum

3. Nesterov Momentum

4. Adagrad

5. RMSProp

6. Adam

Gradient Descent

Gradient Descent, is one of the simplest optimization algorithms. It uses just one static learning rate for all parameters during the entire training phase.

The static learning rate does not imply an equal update after every minibatch. As the optimizers approach an (sub)optimal value, their gradients start to decrease.

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.

Types of gradient descent method:

a. Batch Gradient Descent

b. Stochastic Gradient Descent

c. Mini-batch gradient Descent

Challenges:

1. It does not guarantee convergence and slower than other newer methods.

2. It can stuck in local minima.

3. Choosing a right learning rate is difficult.

Momentum

This method has literal meaning. Its a method that helps accelerate SGD in the relavant direction and dampen oscillations.

Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way. The same thing happens to our parameter updates: The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation. The momentum helps to avoid local minima.

Read more about momentum [here](https://distill.pub/2017/momentum/)

Nesterov Momentum

It is same as Momentum but with one additional information of notional momentum.

However, a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We’d like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.

Adagrad

Adagrad is a gradient based algorithm that adapts the learning rate to the parameters. In momentum based optimizers we adapted our updates to the slope of error function and speed up SGD. While adagrad updates based on learning rate.

It adapts the learning rate to the parameters, performing smaller updates

(i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, it is well-suited for dealing with sparse data.

Adagrad’s main weakness is its accumulation of the squared gradients in the denominator: Since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge.

RMSProp

RMSProp was developed to resolve the weakness of adagrad’s radically diminishing learning rates.

To combat that problem, RMSprop decay the past accumulated gradient, so only a portion of past gradients are considered. Now, instead of considering all of the past gradients, RMSprop behaves like moving average. RMSprop as well divides the learning rate by an exponentially decaying average of squared gradients.

Adam

Adam is the latest state of the art of first order optimization method that’s widely used in the real world. It’s a modification of RMSprop. Loosely speaking, Adam is RMSprop with momentum. So, Adam tries to combine the best of both world of momentum and adaptive learning rate.

Which optimizer to use?

So, which optimizer should you now use? If your input data is sparse, then you likely achieve the best results using one of the adaptive learning-rate methods. An additional benefit is that you won’t need to tune the learning rate but likely achieve the best results with the default value.

In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates.Adam, finally, adds bias-correction and momentum to RMSprop. Insofar, RMSprop, and Adam are very similar algorithms that do well in similar circumstances. its bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice.

Interestingly, many recent papers use vanilla SGD without momentum and a simple learning rate annealing schedule. As has been shown, SGD usually achieves to find a minimum, but it might take significantly longer than with some of the optimizers, is much more reliant on a robust initialization and annealing schedule, and may get stuck in saddle points rather than local minima. Consequently, if you care about fast convergence and train a deep or complex neural network, you should choose one of the adaptive learning rate methods.

Conclusion

In this post we looked at the optimization algorithms beyond SGD. We looked at two classes of algorithms: momentum based and adaptive learning rate methods.

Further Reading

https://cs231n.github.io/neural-networks-3/#sgd

https://ruder.io/optimizing-gradient-descent/