3/9/18
Gradient Descent Optimization methods from Deep Learning Perspective
By Santanu Pattanayak
It is important to understand and appreciate few key points regarding full batch Gradient Descent and Stochastic gradient descent methods along with their shortcomings so that one can appreciate the need of using different variants of gradient based optimizers in Deep Learning.
Elliptical Contours
The cost function for a linear neuron with least square error is quadratic. When the cost function is quadratic the direction of gradient in full batch gradient descent method gives the best direction for cost reduction in a linear sense but it doesn’t point to the minimum unless the different elliptical contours of the cost function are circles. Incase of long elliptical contours the gradient components might be large in directions where less change is required and less in directions where more change is required to move to the minimum point. As we can see in the Figure 1 below the gradient at doesn’t point to the direction of the minimum i.e. at point . The problem with this condition is if we take small steps by making the learning rate small then the gradient descent would take a while to converge whereas if we take a big learning rate then the gradients would change directions rapidly in directions where the cost function has curvature, leading to oscillations. The cost function for a multi-layer neural network is not quadratic but mostly a smooth function. However locally such non-quadratic cost function can be approximated by quadratic functions and hence the problems of gradient descent inherent to elliptical contours still prevails for non-quadratic cost functions.
Figure 1. Contour plot for a quadratic cost function with elliptical contours.
The best way to get around this problem is to take larger steps in directions in which gradients are small but consistent and take smaller steps in directions with big but inconsistent gradients. This can be achieved if instead of having a fixed learning rate for all the dimensions we have separate learning rate for each dimension.
Figure 2. Gradient descent for a cost function in one variable.
In the Figure 2 above the cost function between A to C is almost linear and hence gradient descent does well. However, from the point C the curvature of the cost function takes over and hence the gradient at C is not able to keep up with the direction of the change in cost function. Based on the gradient if we take a small learning rate at C we will end up at D which is reasonable enough since it doesn’t overshoot the point of minima. However, a larger step size at C will get us to D’ which is not desirable since it's on the other side of the minima. Again, a large step size at D’ would get us to E and if the learning rate is not reduced the algorithm tends to toggle between points on either side of the minima, leading to oscillations. When this happens one way to stop this oscillation and achieve convergence is to look at the sign of the gradient or in successive iterations and if they are having opposite signs reduce the learning so that the oscillations are reduced. Similarly, if the successive gradients are having the same sign then the learning rate can be accordingly increased. When the cost function is a function of multiple weights then the cost function might have curvatures in some dimensions of the weights while along other dimensions the cost function might be linear. Hence for multivariate cost functions the partial derivative of the cost function with respect to each weight can be similarly analyzed to update the learning rate for each weight or dimension of the cost function.
Non-Convexity of Cost functions
The other big problem with neural networks is that the cost functions are mostly non-convex and hence the gradient descent method might get stuck at local minimum points leading to sub-optimal solution. The non-convex nature of the neural network is due to the hidden layer units with non-linear activation functions such as sigmoid. Full batch gradient descent uses the full dataset for the gradient computation. While the same is good for convex cost surfaces it has its own problems in case of non-convex cost functions. For non-convex cost surfaces with full batch gradient the model is going to end up in the minima in its basin of attraction. For if the initialized parameters are in the basic of attraction of a local minima that doesn’t provide good generalization full batch gradient would give a sub-optimal solution.
With stochastic gradient descent, the noisy gradients computed may force the model out of basin of attraction of the bad local minima i.e. one that doesn’t provide good generalization and place it in a more optimal region. Stochastic gradient descent with single data points produces very random and noisy gradients. Gradients with the mini-batches tend to produce much stable estimates of gradients when compared to that of single data points but still noisier than those produced by the full batches. Ideally the mini-batch size should be carefully chosen such that the gradients are noisy enough to avoid or escape bad local minima points but stable enough to converge at global minima or a local minimum that provides good generalization.
Figure 3. Contour plot showing basins of attraction for Global and Local minima and traversal of paths for gradient descent and Stochastic gradient descent.
In the Figure 3 above the dotted arrows correspond to the path taken by Stochastic gradient descent(SGD) while the continuous arrows correspond to the path taken by full batch gradient descent. Full batch gradient descent computes the actual gradient at a point and hence if it is in the basin of attraction of a poor local minimum, gradient descent almost surely ensures that the local minima L is reached. However, in the case of Stochastic gradient because the gradient is based on only portion of the data and not the full batch hence the gradient direction is only a rough estimate. Since the noisy rough estimate doesn’t always point to the actual gradient at the point C hence stochastic gradient descent may escape the basin of attraction of the local minima and fortunately land in the basin of a global minima. Stochastic gradient descent may escape the global minima basin of attraction too but generally if the basin of attraction is large and the mini-batch size are carefully chosen so that the gradients they produce are moderately noisy Stochastic gradient descent is most likely to reach the global minima G (as in this case) or in general some other optimal minima having a large basin of attraction. For non-convex optimization, there are other heuristics as well such as momentum which when adopted along with Stochastic gradient descent increases the chances of the SGD avoiding shallow local minima. Momentum generally keeps track of the previous gradients through the velocity component. So, if the gradients are steadily pointing towards a good local minimum having a large basin of attraction then the velocity component would be high in the direction of the good local minimum. If the new gradient is noisy and points towards a bad local minimum the velocity component will provide momentum to continue in the same direction and not get influenced by the new gradient too much.
Saddle points in the High Dimensional Cost functions
Another impediment to optimizing non-convex cost functions is the presence of saddle points. The number of saddle points increases exponentially with the dimensionality increase of the parameter space of a cost function. Saddle points are stationary points (i.e. points where the gradient is zero) but are neither a local minimum or a local maximum point. Since the saddle points are associated with long plateau of points with same cost as that of the saddle point the gradient in the plateau region is either zero or very close to zero. Because of this near zero gradient in all directions gradient based optimizers have a hard time coming out of these saddle points. Mathematically to determine whether a point is a saddle point the eigen values of the Hessian matrix of the cost function can be computed at the given point. If there are both positive and negative Eigen values, then it is a saddle point. Just to refresh our memory of local and global minima test, if all the Eigen values of the Hessian matrix are positive at a stationary point then the point is a global minimum whereas if all the Eigen values of the Hessian Matrix are negative at the stationary point then the point is a global maximum. The Eigen vectors of the Hessian matrix for a cost function gives the directions of change in curvature of the cost function whereas the eigen values denotes the magnitude of the curvature changes along those directions. Also for cost functions with continuous second derivatives the Hessian matrix is symmetrical and hence would always produce orthogonal set of Eigen vector giving mutually orthogonal directions for cost curvature changes. If in all such directions given by Eigen vectors the values of the curvature changes (Eigen values) are positive then the point must be a local minimum, whereas if all the values of curvature changes are negative then the point is a local maximum. This generalization works for cost functions with any input dimensionality whereas the determinant rules for determining extremum points varies with the dimensionality of the input to the cost function. Coming back to saddle point, since the Eigen values are positive for some directions while negative for other directions means the curvature of the cost function increases in directions of Eigen values having positive Eigen value while decreases in those directions of Eigen vectors having negative co-efficient. This nature of the cost surface around a saddle point generally leads to a region of long plateau with near to zero gradient and hence makes life tough for gradient descent methods to escape the plateau of this low gradient. The point (0,0) is a saddle point for the function f (x, y) = x2 - y2 as we can see from the evaluation below:
So, (x, y) = (0, 0) is a stationary point. The next thing to compute is the Hessian matrix and evaluate its Eigen values at (x, y) = (0, 0). The Hessian matrix H f(x, y) is as below:
So, the Hessian H f(x, y) at all points including (x, y) = (0, 0) is
The two eigen values of the H f(x, y) are 2 and -2 corresponding to the Eigen vectors and which are nothing but the directions along X and Y axis. Since one Eigen value is positive and the other negative hence (x, y) = (0, 0) is a saddle point. Since the Eigen value is positive along X direction hence with respect to X axis the point (0,0) is a local minima whereas with respect to Y direction it is a local maxima.
Figure 4. Plot of f(x,y) = x2 - y2
The non-convex function f(x,y) = x2 - y2 is plotted in the Figure 4 above where S is the saddle point at x, y = (0, 0)
For more information kindly refer to the author’s book Pro Deep Learning with TensorFlow
About the Author
Santanu Pattanayak currently works at GE, Digital as a Senior Data Scientist. He has 10 years of overall work experience with six of years of experience in the data analytics/data science field and also has a background in development and database technologies. Prior to joining GE, Santanu worked in companies such as RBS, Capgemini, and IBM. He graduated with a degree in electrical engineering from Jadavpur University, Kolkata and is an avid math enthusiast. Santanu is currently pursuing a master's degree in data science from Indian Institute of Technology (IIT), Hyderabad. He also devotes his time to data science hackathons and Kaggle competitions where he ranks within the top 500 across the globe. Santanu was born and brought up in West Bengal, India and currently resides in Bangalore, India with his wife.
For more, check out Santanu's book Pro Deep Learning with TensorFlow.