Neural Network Algorithms

Optimization algorithms, also called optimizers, are used to train neural networks. There are various types of optimizers, and each type has its own characteristics. The performance of an optimization algorithm depends on its processing speed, memory requirements, and computational accuracy. The optimization process can be either one-dimensional or multidimensional.

Gradient Descent

Gradient descent, also known as steepest descent, is the simplest training algorithm. It is classified as a first-order method because it only uses information from the gradient vector. In gradient descent, the training rate can either be a fixed value or determined at each step by one-dimensional optimization along the training direction. An optimum training rate is generally obtained by line minimization at each iteration, although many applications employ a fixed training rate. At each training step, the direction is computed first and then an appropriate training rate is applied. Because gradient descent stores only the gradient vector and does not require the Hessian matrix, it is useful for massive neural networks containing many thousands of parameters. One drawback is that it requires many iterations for functions with long, narrow valley shapes: the loss function decreases fastest in the downhill gradient direction, but this does not necessarily produce the fastest convergence.
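
As an illustration, here is a minimal sketch of gradient descent with a fixed training rate, written in plain NumPy. The names gradient_descent, loss_grad, and learning_rate are illustrative placeholders rather than part of any particular library.

import numpy as np

def gradient_descent(params, loss_grad, learning_rate=0.01, n_iterations=1000):
    # Move the parameters against the gradient at every iteration.
    for _ in range(n_iterations):
        g = loss_grad(params)                 # gradient vector of the loss
        params = params - learning_rate * g   # step in the steepest-descent direction
    return params

# Example: minimize f(w) = ||w - 3||^2, whose gradient is 2 * (w - 3)
w = gradient_descent(np.zeros(5), lambda w: 2.0 * (w - 3.0))
print(w)  # approaches [3, 3, 3, 3, 3]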

Newton's Method

Newton's method is a second-order algorithm that employs the Hessian matrix during the training process. The objective is to find better training directions by using the second derivatives of the loss function. One drawback of Newton's method is the computational cost of evaluating the Hessian matrix and its inverse. As with gradient descent, the training rate can be a fixed value or calculated by line minimization.
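
For comparison, a minimal sketch of a Newton step is shown below, assuming the caller can supply both the gradient and the Hessian of the loss; grad_fn and hess_fn are hypothetical names, not a specific library API.

import numpy as np

def newton_method(params, grad_fn, hess_fn, n_iterations=20):
    # Newton step: solve H d = -g, then move the parameters by d.
    for _ in range(n_iterations):
        g = grad_fn(params)
        H = hess_fn(params)
        d = np.linalg.solve(H, -g)   # avoids forming the inverse Hessian explicitly
        params = params + d          # unit training rate for simplicity
    return params

# Example: quadratic loss f(w) = 0.5 * w^T A w - b^T w, gradient A w - b, Hessian A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w = newton_method(np.zeros(2), grad_fn=lambda w: A @ w - b, hess_fn=lambda w: A)
print(w)  # matches np.linalg.solve(A, b)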

Conjugate Gradient

The conjugate gradient method can be considered an intermediate choice between gradient descent and Newton's method. It tries to address the slow convergence rate generally associated with gradient descent. In training neural networks, this approach has proved to be more efficient than gradient descent. Because the Hessian matrix is not required, the conjugate gradient method is also a good choice for massive neural networks. The training rate is typically found by line minimization, and the search direction is periodically reset to the negative of the gradient.
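
A minimal sketch of one common variant, the Fletcher-Reeves nonlinear conjugate gradient with a crude backtracking line search and a periodic reset of the search direction, is shown below; the loss_fn and grad_fn callables and the reset interval are illustrative assumptions rather than a definitive implementation.

import numpy as np

def conjugate_gradient(params, loss_fn, grad_fn, n_iterations=100):
    g = grad_fn(params)
    d = -g                                     # start in the steepest-descent direction
    for k in range(n_iterations):
        # crude backtracking line search for the training rate
        rate, f0 = 1.0, loss_fn(params)
        while loss_fn(params + rate * d) > f0 and rate > 1e-10:
            rate *= 0.5
        params = params + rate * d
        g_new = grad_fn(params)
        if (k + 1) % len(params) == 0:
            d = -g_new                         # periodic reset to the negative gradient
        else:
            beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves coefficient
            d = -g_new + beta * d
        g = g_new
    return params

# Example: same quadratic loss as above
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w = conjugate_gradient(np.zeros(2),
                       loss_fn=lambda w: 0.5 * w @ A @ w - b @ w,
                       grad_fn=lambda w: A @ w - b)
print(w)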

Quasi-Newton method

Newton's method is slow because calculating the Hessian matrix and its inverse is computationally intensive. Quasi-Newton methods address this issue by building up an approximation of the inverse Hessian at each iteration of the algorithm. The Davidon-Fletcher-Powell (DFP) formula and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula are two of the most commonly used methods for approximating the inverse Hessian matrix. In practice, this often makes the quasi-Newton method faster than both gradient descent and the conjugate gradient method.
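
The sketch below shows one way a BFGS-style quasi-Newton step can be written, maintaining an approximate inverse Hessian that is updated from parameter and gradient differences; it reuses the same illustrative loss_fn and grad_fn callables and is a teaching sketch, not a production implementation.

import numpy as np

def bfgs(params, loss_fn, grad_fn, n_iterations=50):
    n = len(params)
    H = np.eye(n)                              # initial inverse-Hessian approximation
    g = grad_fn(params)
    for _ in range(n_iterations):
        d = -H @ g                             # quasi-Newton search direction
        rate, f0 = 1.0, loss_fn(params)        # crude backtracking line search
        while loss_fn(params + rate * d) > f0 and rate > 1e-10:
            rate *= 0.5
        s = rate * d                           # change in parameters
        params = params + s
        g_new = grad_fn(params)
        y = g_new - g                          # change in gradient
        if s @ y > 1e-12:                      # BFGS update of the inverse Hessian
            rho = 1.0 / (s @ y)
            I = np.eye(n)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
        g = g_new
    return params

In practice, a tested implementation such as SciPy's scipy.optimize.minimize with method="BFGS" would normally be preferred over a hand-rolled update like this one.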

Levenberg-Marquardt Algorithm

The Levenberg-Marquardt algorithm is a method designed for sum-of-squared-error loss functions. It operates without computing the exact Hessian matrix, working instead with the gradient vector and the Jacobian matrix. However, the Jacobian matrix becomes enormous for big data sets and large neural networks and requires a tremendous amount of memory during training, so this algorithm is not recommended for massive data sets or deep neural networks. Another limitation is that it cannot be applied to loss functions such as the root mean squared error or the cross-entropy error.
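
A minimal sketch of a Levenberg-Marquardt step for a sum-of-squared-error loss is given below; it approximates the Hessian by J^T J and adapts a damping factor, and the residual_fn and jacobian_fn callables are illustrative assumptions.

import numpy as np

def levenberg_marquardt(params, residual_fn, jacobian_fn, n_iterations=50, damping=1e-3):
    for _ in range(n_iterations):
        r = residual_fn(params)            # vector of errors
        J = jacobian_fn(params)            # Jacobian of the errors w.r.t. the parameters
        g = J.T @ r                        # gradient of 0.5 * sum of squared errors
        H = J.T @ J                        # Gauss-Newton approximation of the Hessian
        step = np.linalg.solve(H + damping * np.eye(len(params)), -g)
        if np.sum(residual_fn(params + step) ** 2) < np.sum(r ** 2):
            params = params + step         # accept: trust the quadratic model more
            damping *= 0.5
        else:
            damping *= 2.0                 # reject: behave more like gradient descent
    return params

# Example: fit y = a * x + b to data by least squares
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0
w = levenberg_marquardt(np.zeros(2),
                        residual_fn=lambda w: (w[0] * x + w[1]) - y,
                        jacobian_fn=lambda w: np.column_stack([x, np.ones_like(x)]))
print(w)  # approaches [2, 1]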

Comparison of Different Methods and Conclusion

Gradient descent is typically the slowest training algorithm. The Levenberg-Marquardt algorithm may be the fastest, but it usually requires a lot of memory. The quasi-Newton method could be a reasonable compromise. If we have many neural networks to train, the Levenberg-Marquardt algorithm might be the best choice; in the majority of scenarios, the quasi-Newton method fits well.
