Gradient descent variants for dummies

Gradient descent is an optimization algorithm used in machine learning and deep learning to minimize a loss function, i.e. to find the model parameters that produce the smallest error on the data.

The basic idea of gradient descent is to compute the gradient of the loss (the direction of the steepest ascent) and take a step in the opposite direction, iteratively adjusting the model parameters until the error stops decreasing.
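The update rule described above can be sketched in a few lines of Python. The function, learning rate, and toy objective below are illustrative, not from the article:

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Minimize a function by repeatedly stepping against its gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)  # step opposite the gradient, i.e. downhill
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
# The minimum is at x = 3, so the iterates should approach 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

The learning rate `lr` controls the step size: too small and convergence is slow, too large and the iterates can overshoot and diverge.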

In this article, we’ll discuss three popular variants of gradient descent: Batch Gradient Descent, Stochastic Gradient Descent, and Mini-batch Gradient Descent. Each of these has its own unique characteristics and advantages, so let’s dive in!

Batch Gradient Descent

Batch gradient descent, also known as “vanilla” gradient descent, is the simplest form of the algorithm. It calculates the gradient of the entire dataset before taking a single step in the direction of the steepest descent.
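As a minimal sketch of the full-dataset update (the linear-regression objective and toy data are illustrative assumptions, not from the article):

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, epochs=500):
    """One parameter update per pass, using the gradient of the WHOLE dataset."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Gradient of mean squared error over all samples at once.
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Toy data generated from y = 2x with no noise, so w should recover ~2.0.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
w = batch_gradient_descent(X, y)
```

Note that every update requires a full pass over `X`, which is exactly the cost the cons below refer to.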


Pros:

  • It provides a stable and accurate estimate of the gradient, since it uses the entire dataset.
  • It's deterministic: given the same starting point and learning rate, it always follows the same path to the minimum.


Cons:

  • It can be computationally expensive, since it requires a full pass over the dataset before taking a single step.
  • It can be slow to converge, especially on large datasets.

Stochastic Gradient Descent

Stochastic gradient descent (SGD) is an alternative to batch gradient descent that addresses some of its shortcomings.

Instead of computing the gradient using the entire dataset, SGD calculates the gradient using only one randomly selected data point per iteration. This introduces noise into the updates, which can help the optimizer escape shallow local minima.
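The single-sample update can be sketched by modifying the batch version so each step uses one random row (again, the toy data and names are illustrative):

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=200, seed=0):
    """One parameter update per randomly chosen sample."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        i = rng.integers(len(y))              # pick ONE sample at random
        grad = 2 * X[i] * (X[i] @ w - y[i])   # gradient for that sample only
        w -= lr * grad
    return w

# Same noiseless toy data as before; w should land close to 2.0,
# though the path there is noisy.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
w = sgd(X, y)
```

Because each step sees only one sample, the learning rate here is smaller than in the batch version; with a large step size the noisy updates can oscillate or diverge.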


Pros:

  • It's computationally efficient, as it processes one data point at a time.
  • It can converge faster than batch gradient descent, as it updates the parameters far more frequently.


Cons:

  • It's less stable and less accurate in estimating the gradient, due to the inherent randomness.
  • It may not always converge to the global minimum, but often settles at a "good enough" solution.

Mini-batch Gradient Descent

Mini-batch gradient descent combines the best of both worlds: it uses a subset (or mini-batch) of the dataset to calculate the gradient, rather than the entire dataset or just one data point. This makes it a good compromise between computational efficiency and accuracy.
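A sketch of the mini-batch update, shuffling the data each epoch and averaging the gradient over each small batch (batch size, learning rate, and toy data are illustrative assumptions):

```python
import numpy as np

def minibatch_gd(X, y, lr=0.02, epochs=100, batch_size=2, seed=0):
    """One parameter update per mini-batch of samples."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)            # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Average the gradient over just this mini-batch.
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)
            w -= lr * grad
    return w

# Noiseless data from y = 2x; w should converge close to 2.0.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
w = minibatch_gd(X, y)
```

In practice, batch sizes such as 32 to 256 are common; they also let the per-batch computation be vectorized efficiently on modern hardware.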


Pros:

  • It provides a balance between the stability of batch gradient descent and the efficiency of stochastic gradient descent.
  • It can converge faster than batch gradient descent, while maintaining a more stable and accurate gradient estimate than SGD.


Cons:

  • It requires tuning the mini-batch size, which can affect convergence speed and stability.
  • It may still get stuck in local minima, although this is less likely than with batch gradient descent thanks to the noise in its updates.


Conclusion

Understanding the differences between gradient descent variants is crucial when choosing the right optimization algorithm for a specific machine learning problem.

Batch gradient descent is more stable but slower, while stochastic gradient descent is faster but less stable.

Mini-batch gradient descent strikes a balance between the two, offering a good compromise in many cases. Choosing the right variant will depend on your dataset, computational resources, and desired level of accuracy.
