# Strong error analysis for stochastic gradient descent optimization algorithms

@article{Jentzen2018StrongEA, title={Strong error analysis for stochastic gradient descent optimization algorithms}, author={Arnulf Jentzen and Benno Kuckuck and Ariel David Neufeld and Philippe von Wurstemberger}, journal={IMA Journal of Numerical Analysis}, year={2018} }

Stochastic gradient descent (SGD) optimization algorithms are key ingredients in a series of machine learning applications. In this article we perform a rigorous strong error analysis for SGD optimization algorithms. In particular, we prove for every arbitrarily small $\varepsilon \in (0,\infty)$ and every arbitrarily large $p\in (0,\infty)$ that the considered SGD optimization algorithm converges in the strong $L^p$-sense with order $\frac{1}{2}-\varepsilon$ to the global minimum of the…
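As a rough illustration of what a strong $L^p$ convergence rate means, the sketch below estimates $(\mathbb{E}|\Theta_n - \vartheta|^p)^{1/p}$ by Monte Carlo for SGD on a one-dimensional quadratic problem. The objective $\theta \mapsto \tfrac{1}{2}\mathbb{E}|\theta - Z|^2$ with $Z \sim \mathcal{N}(\mu, 1)$, the learning rates $c/n^{\nu}$, and all function names are assumptions chosen for illustration; they are not taken from the article, whose exact setting is cut off in the abstract above.

```python
import random


def sgd_quadratic(n_steps, mu=1.0, theta0=0.0, c=1.0, nu=1.0, rng=None):
    """Run SGD on theta -> 0.5 * E[(theta - Z)^2] with Z ~ N(mu, 1).

    Hypothetical toy problem: the global minimum is theta = mu.
    Learning rates c / n**nu are an assumed schedule.
    """
    if rng is None:
        rng = random.Random()
    theta = theta0
    for n in range(1, n_steps + 1):
        z = rng.gauss(mu, 1.0)           # one noisy sample per step
        lr = c / n ** nu                 # decaying learning rate
        theta -= lr * (theta - z)        # stochastic gradient is (theta - z)
    return theta


def strong_lp_error(n_steps, p=2.0, n_paths=2000, mu=1.0, seed=0, **sgd_kwargs):
    """Monte Carlo estimate of (E|theta_n - mu|^p)^(1/p)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        theta = sgd_quadratic(n_steps, mu=mu, rng=rng, **sgd_kwargs)
        total += abs(theta - mu) ** p
    return (total / n_paths) ** (1.0 / p)


if __name__ == "__main__":
    for n in (100, 400, 1600):
        print(n, strong_lp_error(n))
```

With the defaults `c=1, nu=1` the iterate coincides with the sample mean of the first `n` samples, so in this toy case the estimated $L^2$ error should roughly halve each time `n` is quadrupled, which corresponds to order $\frac{1}{2}$.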

#### 27 Citations

Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowly and fast decaying learning rates

- Computer Science, Mathematics
- J. Complex.
- 2020

This article establishes for every $\gamma, \nu \in (0,\infty)$ essentially matching lower and upper bounds for the mean square error of the SGD process with learning rates associated to a simple quadratic stochastic optimization problem.

Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis

- Computer Science, Mathematics
- ArXiv
- 2021

It is shown that the learning rate in SGD with machine learning noise can be chosen to be small, but uniformly positive for all times if the energy landscape resembles that of overparametrized deep learning problems.

A proof of convergence for gradient descent in the training of artificial neural networks for constant target functions

- Computer Science, Mathematics
- ArXiv
- 2021

This article proves in the training of rectified fully-connected feedforward ANNs with one hidden layer that the risk function of the gradient descent method does indeed converge to zero in the special situation where the target function under consideration is a constant function.

Uniform-in-Time Weak Error Analysis for Stochastic Gradient Descent Algorithms via Diffusion Approximation

- Computer Science, Mathematics
- Communications in Mathematical Sciences
- 2020

New tools motivated by the backward error analysis of numerical stochastic differential equations into the theoretical framework of diffusion approximation are introduced, extending the validity of the weak approximation from finite to infinite time horizon.

A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions

- Computer Science, Mathematics
- ArXiv
- 2021

This article proves the conjecture that the risk of the GD optimization method converges in the training of such ANNs to zero as the width of the ANNs, the number of independent random initializations, and the number of GD steps increase to infinity in the situation where the probability distribution of the input data is equivalent to the continuous uniform distribution on a compact interval.

Overall error analysis for the training of deep neural networks via stochastic gradient descent with random initialisation

- Mathematics, Computer Science
- ArXiv
- 2020

This article provides a mathematically rigorous full error analysis of deep learning based empirical risk minimisation with quadratic loss function in the probabilistically strong sense, where the underlying deep neural networks are trained using stochastic gradient descent with random initialisation.

High-dimensional approximation spaces of artificial neural networks and applications to partial differential equations

- Mathematics, Computer Science
- ArXiv
- 2020

The developed theory is employed to prove that ANNs have the capacity to overcome the curse of dimensionality in the numerical approximation of certain first order transport partial differential equations (PDEs).

Existence, uniqueness, and convergence rates for gradient flows in the training of artificial neural networks with ReLU activation

- Computer Science, Mathematics
- ArXiv
- 2021

Two basic results for GF differential equations are established in the training of fully-connected feedforward ANNs with one hidden layer and ReLU activation under the assumption that the probability distribution of the input data of the considered supervised learning problem is absolutely continuous with a bounded density function.

Analysis of Stochastic Gradient Descent in Continuous Time

- Computer Science, Mathematics
- Stat. Comput.
- 2021

This work introduces the stochastic gradient process as a continuous-time representation of stochastic gradient descent, and shows that it converges weakly to the gradient flow with respect to the full target function as the learning rate approaches zero.

Full error analysis for the training of deep neural networks

- Computer Science, Mathematics
- ArXiv
- 2019

The main contribution of this work is to provide a full error analysis which covers each of the three different sources of errors usually emerging in deep learning algorithms and which merges these three sources of errors into one overall error estimate for the considered deep learning algorithm.

#### References

Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowly and fast decaying learning rates

- Computer Science, Mathematics
- J. Complex.
- 2020

This article establishes for every $\gamma, \nu \in (0,\infty)$ essentially matching lower and upper bounds for the mean square error of the SGD process with learning rates associated to a simple quadratic stochastic optimization problem.

Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

- Computer Science, Mathematics
- NIPS
- 2011

This work provides a non-asymptotic analysis of the convergence of two well-known algorithms: stochastic gradient descent and a simple modification in which the iterates are averaged. The analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate, is not robust to the lack of strong convexity or to the setting of the proportionality constant.
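The averaged variant mentioned in the summary above can be sketched as follows. This is a minimal illustration on an assumed toy problem (a one-dimensional quadratic objective with Gaussian noise), with an assumed learning-rate schedule $c/n^{\nu}$ and a hypothetical function name; it is not the setting or code of the cited work.

```python
import random


def averaged_sgd(n_steps, mu=1.0, theta0=0.0, c=1.0, nu=0.5, rng=None):
    """SGD with a running average of the iterates on a toy quadratic.

    Assumed objective: theta -> 0.5 * E[(theta - Z)^2], Z ~ N(mu, 1).
    Returns (last iterate, average of iterates theta_1..theta_n).
    """
    if rng is None:
        rng = random.Random()
    theta, avg = theta0, 0.0
    for n in range(1, n_steps + 1):
        z = rng.gauss(mu, 1.0)
        theta -= (c / n ** nu) * (theta - z)   # plain SGD step
        avg += (theta - avg) / n               # incremental mean of iterates
    return theta, avg


if __name__ == "__main__":
    rng = random.Random(0)
    last, avg = averaged_sgd(1000, rng=rng)
    print("last iterate:", last, "averaged iterate:", avg)
```

The slowly decaying schedule `nu=0.5` is a design choice for the sketch: in this toy problem the last iterate stays noticeably noisy under such a schedule, while the averaged iterate smooths the noise out.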

Stochastic Gradient Descent in Continuous Time

- Mathematics, Computer Science
- SIAM J. Financial Math.
- 2017

It is proved that $\lim_{t \rightarrow \infty} \nabla \bar g(\theta_t) = 0$, where $\bar g$ is a natural objective function for the estimation of the continuous-time dynamics.

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

- Mathematics, Computer Science
- ICML
- 2012

This paper investigates the optimality of SGD in a stochastic setting and shows that for smooth problems the algorithm attains the optimal O(1/T) rate. For non-smooth problems, however, the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis.

Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization

- Mathematics, Computer Science
- Math. Program.
- 2016

A randomized stochastic projected gradient (RSPG) algorithm is proposed, in which a proper mini-batch of samples is taken at each iteration depending on the total budget of stochastic samples allowed; the algorithm is shown to have nearly optimal complexity for convex stochastic programming.

Stochastic approximation with averaging of the iterates: Optimal asymptotic rate of convergence for

- Mathematics
- 1993

Consider the stochastic approximation algorithm \[X_{n+1} = X_n + a_n g(X_n, \xi_n).\] In an important paper, Polyak and Juditsky [SIAM J. Control Optim., 30 (1992), pp. 838–855] showed that…
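The recursion $X_{n+1} = X_n + a_n g(X_n, \xi_n)$ quoted above can be written as a short generic loop. The sketch below is only an illustration: the function name, the argument names, and the convention that the step index starts at $n = 1$ are assumptions, and the example update direction is hypothetical.

```python
from typing import Callable


def stochastic_approximation(
    x0: float,
    g: Callable[[float, float], float],
    a: Callable[[int], float],
    sample_noise: Callable[[], float],
    n_steps: int,
) -> float:
    """Iterate X_{n+1} = X_n + a_n * g(X_n, xi_n) for n = 1..n_steps."""
    x = x0
    for n in range(1, n_steps + 1):
        xi = sample_noise()          # draw the noise xi_n
        x = x + a(n) * g(x, xi)      # one stochastic approximation step
    return x


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    # Hypothetical example: g pushes x toward the noisy target 2 + xi.
    result = stochastic_approximation(
        x0=0.0,
        g=lambda x, xi: -(x - (2.0 + xi)),
        a=lambda n: 1.0 / n,
        sample_noise=lambda: rng.gauss(0.0, 1.0),
        n_steps=10_000,
    )
    print(result)
```

Passing `g`, `a`, and the noise sampler as callables keeps the loop independent of any particular problem, so the same function covers SGD (with `g` a negative stochastic gradient) and other root-finding schemes of this form.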

Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)

- Computer Science, Mathematics
- NIPS
- 2013

We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which…

Pegasos: primal estimated sub-gradient solver for SVM

- Mathematics, Computer Science
- Math. Program.
- 2011

A simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines, which is particularly well suited for large text classification problems, and demonstrates an order-of-magnitude speedup over previous SVM learning methods.

Stochastic Gradient Descent in Continuous Time: A Central Limit Theorem

- Mathematics, Economics
- Stochastic Systems
- 2020

Stochastic gradient descent in continuous time (SGDCT) provides a computationally efficient method for the statistical learning of continuous-time models, which are widely used in science,…

Robust Stochastic Approximation Approach to Stochastic Programming

- Mathematics, Computer Science
- SIAM J. Optim.
- 2009

It is intended to demonstrate that a properly modified SA approach can be competitive and even significantly outperform the SAA method for a certain class of convex stochastic problems.