What after Gradient Descent

by thinkman · Published May 17, 2019 · Updated June 1, 2019

In Batch Gradient Descent, the parameters(weights) are updated after a pass over the entire data (one epoch). So, in case of very large number of records, updates happen less, and large number of epochs are required to achieve minimal loss.

In Stochastic Gradient Descent, the parameters are updated after pass over single data point. It leads to large number of parameter updates in a single run over the entire data (one epoch). This eventually causes a lot of oscillations in the loss, as each data point directs the gradient as per its understanding without considering other points. Also, Stochastic gradient descent is not true but an approximate of the gradient value as the true value can be derived only considering the whole data.

In Mini Batch Gradient Descent, the parameters are updated after one pass over the mini batch (usually a power of 2). This reduces the number of oscillations and also serves to update the parameters more often than the Batch Gradient Descent. The size of mini batch is usually 32,64,128.

Here N = No. of data points, B = size of Mini Batch

So, if we substitute $ \frac{\partial y}{\partial x} $ in the gradient descent algorithm, we observe that the rate of decrease of weights(optimization) is higher in case of steep slopes, while the same is very less in case of gentle slope.

So, in scenario where gentle slope is encountered the gradient is very small and algorithm learns very slowly, consuming large number of epochs.

This limitation of Gradient Descent can be overcome by the implementation of Momentum Based Gradient Descent.

Momentum Based Gradient Descent (MBGD):

In Momentum Based Gradient Descent, the history component is taken into consideration to compute the gradient:

Since the value of γ lies between 0 & 1. Higher the power of γ, lower the value of its term. This is called as Exponentially Decaying Average.

So, the value of current gradient is higher than the previous one.

Advantages: Even in regions having gentle slopes, momentum based Gradient Descent is able to take larger steps because of its momentum.

Disadvantage: Due to higher momentum, the algorithm runs past the minima and takes a lot of u turns to achieve there. It overshoots the target due to momentum, showing lots of oscillation.

But still converges faster than the plain Gradient Descent.

To overcome the limitation of Momentum Based Gradient Descent, Nesterov Accelerated Gradient Descent can be implemented.

To overcome this shortcoming, NAG first moves as per the history component, calculates the gradient at the temporary point and then moves forward or backward on the present gradient. This reduces the overshooting and hence the oscillations.

Advantages: Looking ahead helps NAG in correcting its course quicker than MBG. Hence oscillations are smaller and chances for escaping the minima is also smaller.

In case of data with many features, Plain Gradient Descent, Momentum Based Gradient Descent and Nesterov Accelerated Gradient Descent fail to reach the global minima. This is because of the similar learning rate for all the features be it either dense or sparse.

Now what is a Dense Features: are the ones which receive a lot of updates in an epoch while Sparse features receive very a smaller number of updates (smaller number of records exhibiting the feature)

So, the algorithm would need to update the learning rate differently for dense and sparse features.

It decays the learning rate of parameters in proportion to their update history (fewer updates, lesser decay). Just as we needed to handle learning rate for Dense and Sparse features.

The denominator term is high for features which receives a lot of updates in the past (dense features) while smaller number of updates for sparse features.

Advantages: Parameters corresponding to Sparse features get better updates

Disadvantages: The learning rate decays very aggressively as the denominator grows (not good for parameters corresponding to dense features). Adagrad is unable to reach the minimal loss point due to rapid decay of learning rate.

What after Gradient Descent

You may also like...

Leave a Reply Cancel reply

Recent Posts

Recent Comments

Archives

Categories

Meta

What after Gradient Descent

You may also like...

Zealandia – new place to explore

ISRO reaching new heights

Principles of a Good Visualization

Leave a Reply Cancel reply

Recent Posts

Recent Comments

Archives

Categories

Meta