Abstract
Optimizers determine how model parameters are updated during training, and they play a central role in the speed, stability, and final quality of machine learning models. In deep learning especially, the optimizer can strongly influence convergence behavior, sensitivity to initialization, robustness to noisy gradients, and generalization. This whitepaper provides a detailed technical explanation of three foundational optimizers: Stochastic Gradient Descent (SGD), RMSprop, and Adam.
Most modern machine learning models are trained by minimizing an objective function with iterative gradient-based optimization. However, plain gradient descent is often too slow, unstable, or poorly scaled for high-dimensional, noisy, and non-convex problems such as neural network training. Optimizers such as SGD, RMSprop, and Adam modify the update rule to improve convergence. SGD introduces stochasticity and, often, momentum. RMSprop rescales updates using an exponentially weighted average of squared gradients. Adam combines momentum and adaptive learning-rate scaling through first- and second-moment estimates. This paper explains their mathematical formulations, interpretations, bias-variance behavior in optimization, convergence dynamics, practical trade-offs, and common tuning considerations. All formulas are embedded inline in HTML-friendly format for direct use in WordPress or similar editors.
1. Introduction
Let a model be parameterized by θ ∈ ℝ^m, and let the objective function be J(θ). Training seeks to find:
θ* = argmin_θ J(θ).
In supervised learning, this objective is typically the empirical average loss over a dataset:
J(θ) = (1/n) Σ_{i=1}^{n} L(y_i, f(x_i; θ)).
The gradient ∇_θ J(θ) indicates the direction of steepest increase of the objective. Therefore, gradient-based optimizers update parameters in the negative gradient direction. The simplest update is:
θ := θ - η ∇_θ J(θ),
where η is the learning rate.
This basic rule is the starting point for nearly all optimizers discussed in this whitepaper.
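As a concrete illustration, this basic rule can be sketched in a few lines of Python on a toy one-dimensional quadratic. The objective, learning rate, and step count below are illustrative choices, not part of the formulation above.

```python
# Plain gradient descent on the toy objective J(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3); the minimizer is theta* = 3.

def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter
eta = 0.1     # learning rate
for _ in range(100):
    theta = theta - eta * grad_J(theta)   # theta := theta - eta * grad

print(theta)  # close to 3.0 after 100 steps
```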
2. Why Optimizers Matter
Optimization in machine learning is not merely a mathematical formality. Real-world objectives are often:
- high-dimensional
- non-convex
- noisy due to minibatching
- ill-conditioned, with different curvature in different directions
- sensitive to initialization and scaling
Because of this, the naive update rule may converge slowly, oscillate, or get stuck in flat or unstable regions. Optimizers modify how gradients are used so that training becomes more stable and efficient.
3. Full-Batch Gradient Descent
Before discussing SGD and its descendants, it is useful to distinguish full-batch gradient descent. In full-batch training, the gradient is computed over the entire dataset:
g = ∇_θ J(θ) = (1/n) Σ_{i=1}^{n} ∇_θ L(y_i, f(x_i; θ)).
The update is then:
θ := θ - η g.
This gives an exact empirical gradient but can be expensive on large datasets and often lacks the beneficial noise properties of stochastic methods.
4. Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent replaces the full-dataset gradient with a gradient computed from one example or a small minibatch. If B is the minibatch, then the stochastic gradient estimate is:
g_B = (1/|B|) Σ_{i∈B} ∇_θ L(y_i, f(x_i; θ)).
The update becomes:
θ := θ - η g_B.
4.1 Why SGD Works
The minibatch gradient is an unbiased or approximately unbiased estimator of the full gradient:
E[g_B] ≈ ∇_θ J(θ).
Although noisy, SGD is computationally cheaper per step and can make rapid progress. The noise also helps the optimizer explore the landscape and potentially escape shallow local minima or saddle regions.
4.2 Mini-Batch SGD
In practice, “SGD” usually refers to mini-batch SGD rather than single-example updates. Mini-batches offer a good trade-off between stable gradient estimates and hardware-efficient parallel computation.
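A minimal mini-batch SGD sketch in Python, fitting a one-parameter linear model y ≈ w·x by least squares. The data, batch size, learning rate, and step count are illustrative assumptions, not taken from the text.

```python
import random

# Mini-batch SGD for least squares: minimize (1/n) sum (w*x_i - y_i)^2.
# The synthetic data follow y = 2x exactly, so the optimal weight is w = 2.
random.seed(0)
data = [(float(x), 2.0 * x) for x in range(1, 21)]

w, eta, batch_size = 0.0, 0.001, 4
for _ in range(1000):
    batch = random.sample(data, batch_size)
    # gradient of the mini-batch mean squared error w.r.t. w
    g = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
    w -= eta * g   # w := w - eta * g_B

print(round(w, 3))  # approaches the true weight 2.0
```

Each step uses only 4 of the 20 examples, so the gradient estimate is noisy, yet the iterates still settle at the minimizer.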
4.3 SGD Geometry
SGD takes steps directly proportional to the current gradient. If one parameter dimension consistently has much larger gradients than another, SGD may oscillate in steep directions and progress slowly in shallow directions. This sensitivity to curvature motivates momentum and adaptive scaling methods.
5. SGD with Momentum
A major extension of SGD is momentum, which accumulates a running velocity in the direction of past gradients:
v_t = β v_{t-1} + g_t
and
θ_{t+1} = θ_t - η v_t.
Here, β is the momentum coefficient, usually between 0 and 1.
Momentum smooths noisy updates, accelerates movement along consistent directions, and reduces oscillation across directions with steep curvature.
5.1 Interpretation of Momentum
Momentum can be viewed as a physical analogy: the optimizer behaves like a particle with inertia. Instead of reacting only to the current gradient, it carries forward part of its previous motion.
This is especially helpful in ravine-like loss landscapes where one direction has steep curvature and another has shallow curvature. Momentum dampens oscillation in the steep direction and speeds progress in the shallow one.
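The heavy-ball update above can be sketched on the same toy quadratic used earlier; the learning rate, momentum coefficient, and step count are illustrative choices.

```python
# Heavy-ball momentum sketch on J(theta) = (theta - 3)^2:
# v_t = beta * v_{t-1} + g_t, then theta := theta - eta * v_t.

def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta, v = 0.0, 0.0
eta, beta = 0.05, 0.9
for _ in range(400):
    v = beta * v + grad_J(theta)   # accumulate velocity
    theta -= eta * v               # step along the velocity

print(theta)  # close to 3.0
```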
5.2 Nesterov Momentum
A refined version is Nesterov Accelerated Gradient (NAG), which computes the gradient at a look-ahead position:
v_t = β v_{t-1} + ∇J(θ_t - ηβ v_{t-1}),
then
θ_{t+1} = θ_t - η v_t.
Nesterov momentum can offer better anticipatory correction than classical momentum.
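In code, the look-ahead variant changes only where the gradient is evaluated. An illustrative sketch on the same toy quadratic:

```python
# Nesterov momentum sketch: the gradient is taken at the look-ahead
# point theta - eta * beta * v rather than at theta itself.

def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta, v = 0.0, 0.0
eta, beta = 0.05, 0.9
for _ in range(400):
    v = beta * v + grad_J(theta - eta * beta * v)  # look-ahead gradient
    theta -= eta * v

print(theta)  # close to 3.0
```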
6. Limitations of Plain SGD
Despite its strengths, plain SGD has several challenges:
- requires careful learning-rate tuning
- uses the same global learning rate for all parameters
- can struggle when gradients vary greatly in scale across dimensions
- may converge slowly on ill-conditioned objectives
These limitations motivated adaptive methods such as RMSprop and Adam.
7. Adaptive Learning Rates
The main idea of adaptive optimization is to scale the update differently for each parameter dimension based on past gradients. Parameters that consistently receive large gradients should often get smaller step sizes, while parameters with small gradients can receive larger effective updates.
This creates coordinate-wise adaptation rather than relying on a single scalar learning rate.
8. RMSprop
RMSprop is an adaptive optimizer that maintains an exponentially weighted moving average of squared gradients. Let the gradient at step t be g_t. RMSprop computes:
s_t = ρ s_{t-1} + (1 - ρ) g_t^2,
where the square is applied elementwise.
The update is then:
θ_{t+1} = θ_t - η g_t / (√(s_t) + ε).
Here:
- ρ is the decay factor, often around 0.9
- ε is a small constant for numerical stability
8.1 Interpretation of RMSprop
RMSprop divides each gradient component by the root mean square of recent gradients in that dimension. If a parameter has consistently large gradients, its denominator grows, reducing the effective step size. If a parameter has smaller gradients, its updates are relatively amplified.
This helps stabilize training on objectives with uneven curvature and makes learning rates less sensitive to raw gradient scale.
8.2 Why RMSprop Helps
Consider a loss surface with steep curvature in one direction and shallow curvature in another. Plain SGD may zigzag inefficiently. RMSprop rescales updates in each direction separately, often reducing oscillations and improving convergence speed.
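The ill-conditioned scenario above can be sketched numerically. The 2-D quadratic, its curvatures, and the hyperparameters below are illustrative assumptions.

```python
import math

# RMSprop on J(theta) = 50*theta0^2 + 0.5*theta1^2, whose gradient is
# (100*theta0, theta1). Per-coordinate scaling lets one learning rate
# serve both the steep and the shallow direction.

theta = [1.0, 1.0]
s = [0.0, 0.0]
eta, rho, eps = 0.01, 0.9, 1e-8
for _ in range(2000):
    g = [100.0 * theta[0], 1.0 * theta[1]]
    for j in range(2):
        s[j] = rho * s[j] + (1.0 - rho) * g[j] ** 2        # EWMA of g^2
        theta[j] -= eta * g[j] / (math.sqrt(s[j]) + eps)   # scaled step

print(theta)  # both coordinates end up near 0 (small oscillations remain)
```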
8.3 RMSprop Limitations
RMSprop adapts learning rates but does not explicitly track a first-moment momentum term in its simplest form. It may still require careful tuning and, depending on the problem, may generalize differently from SGD-based methods.
9. Adam
Adam, short for Adaptive Moment Estimation, combines ideas from momentum and RMSprop. It tracks:
- the first moment (mean) of gradients
- the second moment (uncentered variance proxy) of gradients
At step t, Adam computes:
m_t = β_1 m_{t-1} + (1 - β_1) g_t
and
v_t = β_2 v_{t-1} + (1 - β_2) g_t^2.
Here:
- m_t is the exponentially weighted mean of gradients
- v_t is the exponentially weighted mean of squared gradients
9.1 Bias Correction
Because m_0 = 0 and v_0 = 0, the moving averages are biased toward zero early in training. Adam corrects this by using:
m̂_t = m_t / (1 - β_1^t)
and
v̂_t = v_t / (1 - β_2^t).
9.2 Adam Update Rule
The parameter update is:
θ_{t+1} = θ_t - η m̂_t / (√(v̂_t) + ε).
Default hyperparameters are often:
β_1 = 0.9, β_2 = 0.999, and ε = 10^-8.
9.3 Interpretation of Adam
Adam behaves like momentum because it uses m_t to smooth gradients over time.
It behaves like RMSprop because it divides by a running scale estimate √(v̂_t).
This often leads to fast convergence, especially in noisy and sparse-gradient settings.
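Putting the pieces together, a full Adam loop with bias correction can be sketched on the toy quadratic used earlier (the learning rate and step count are illustrative; the moment decays follow the common defaults above).

```python
import math

# Adam sketch with bias correction on J(theta) = (theta - 3)^2.

def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta, m, v = 0.0, 0.0, 0.0
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 5001):                     # t starts at 1 for bias correction
    g = grad_J(theta)
    m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)           # bias-corrected moments
    v_hat = v / (1.0 - beta2 ** t)
    theta -= eta * m_hat / (math.sqrt(v_hat) + eps)

print(theta)  # near the minimizer 3.0
```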
10. Adam vs RMSprop
RMSprop uses only a second-moment-like scaling term, s_t. Adam extends this with a first-moment term, m_t.
As a result, Adam can be viewed as RMSprop plus momentum plus bias correction. In many practical settings, Adam is a strong default optimizer because it tends to converge quickly and with relatively little tuning.
11. AdamW and Weight Decay Note
In practice, a commonly used variant is AdamW, which decouples weight decay from the adaptive gradient update. While this goes beyond vanilla Adam, it is often preferred in modern deep learning because naive L2 regularization and adaptive updates can interact in undesirable ways.
In AdamW-style updates, one may write:
θ_{t+1} = θ_t - η [m̂_t / (√(v̂_t) + ε)] - η λ θ_t.
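A single AdamW-style step can be written as a small helper; the hyperparameter values here are illustrative, and the key point is that the decay term η·λ·θ is applied outside the adaptive rescaling rather than folded into the gradient.

```python
import math

# AdamW-style step (illustrative): decoupled weight decay is applied
# directly to the parameter, not mixed into the adaptive gradient term.

def adamw_step(theta, m, v, g, t,
               eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, lam=0.01):
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps) - eta * lam * theta
    return theta, m, v

# One step from theta = 1.0 with gradient g = 1.0:
theta, m, v = adamw_step(1.0, 0.0, 0.0, g=1.0, t=1)
```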
12. Convergence Behavior and Generalization
Different optimizers do not just affect speed; they can also affect the final solution and its generalization behavior. In some empirical studies, SGD with momentum has shown better final generalization than Adam on certain deep learning benchmarks, even when Adam reaches low training loss faster.
One possible interpretation is that SGD’s noisier and less adaptive updates may bias the optimizer toward flatter minima, while adaptive methods can sometimes settle into sharper regions. This is still an area of active research and depends heavily on the architecture, dataset, and training regime.
13. Sparse Gradients
Adaptive methods such as RMSprop and Adam are often especially effective when gradients are sparse. If only a few parameters receive updates frequently, coordinate-wise scaling can help these parameters learn effectively without requiring a globally tuned learning rate.
This is one reason Adam has been widely used in NLP and embedding-heavy models.
14. Learning Rate Sensitivity
SGD is highly sensitive to the learning rate and often requires schedules such as step decay, cosine decay, or warmup. Adam and RMSprop are generally more forgiving, but they still depend on an appropriate η.
A learning rate that is too high can destabilize any optimizer; a learning rate that is too low can make training unnecessarily slow.
15. Optimization in Ill-Conditioned Problems
If the Hessian of the objective has eigenvalues with very different magnitudes, the landscape is ill-conditioned. SGD may zigzag across steep directions while moving slowly along shallow ones. Adaptive methods approximately normalize gradient magnitudes per coordinate, which can make progress more balanced.
16. Noise in Stochastic Optimization
Stochastic gradients are inherently noisy because they are estimated from minibatches. This noise can be beneficial:
- it reduces the cost per update
- it helps exploration of the loss landscape
- it may provide an implicit regularization effect
However, too much noise can destabilize training. Batch size, optimizer choice, and learning rate together determine the effective noise level.
17. Practical Hyperparameters
17.1 SGD
Key hyperparameters:
- learning rate η
- momentum coefficient β
- weight decay λ
- learning rate schedule
17.2 RMSprop
Key hyperparameters:
- learning rate η
- decay factor ρ
- stability constant ε
17.3 Adam
Key hyperparameters:
- learning rate η
- first-moment decay β_1
- second-moment decay β_2
- stability constant ε
18. Typical Use Cases
18.1 When SGD Is Preferred
SGD with momentum is often preferred when:
- generalization is the top priority
- the practitioner can tune learning-rate schedules carefully
- large-scale vision models are being trained from scratch
18.2 When RMSprop Is Useful
RMSprop is often useful in recurrent or nonstationary settings, especially where gradient magnitudes vary strongly over time.
18.3 When Adam Is Preferred
Adam is often preferred when:
- fast convergence is important
- the model is complex and hard to tune
- gradients are sparse or noisy
- the user wants a strong practical default optimizer
19. Limitations of Adaptive Methods
Although Adam and RMSprop can converge quickly, they may sometimes generalize worse than SGD with momentum on certain tasks. Adaptive step scaling may also lead to unusual training dynamics if hyperparameters are not well chosen.
This does not make adaptive optimizers inferior — it means optimizer choice must be aligned with the problem and training regime.
20. Common Training Heuristics
- Use learning-rate warmup when training deep or sensitive models.
- Combine SGD with momentum and a decay schedule for strong generalization-focused training.
- Use Adam as a baseline optimizer when rapid prototyping.
- Monitor validation metrics, not just training loss.
- Retune learning rates when changing batch size or optimizer family.
21. Optimizer Comparison Summary
SGD is simple, effective, and often strong for generalization, but requires more careful learning-rate tuning. RMSprop rescales updates using recent squared gradients and is useful when gradient magnitudes vary substantially. Adam combines momentum and adaptive scaling, often making it the most convenient and robust practical default for many deep learning tasks.
22. Best Practices
- Start with Adam when fast and stable optimization is needed quickly.
- Switch to SGD with momentum if final generalization becomes the priority and tuning budget is available.
- Use learning-rate schedules regardless of optimizer choice.
- Track gradient norms and optimizer state behavior when debugging training.
- Benchmark multiple optimizers because architecture and data strongly influence which works best.
23. Conclusion
Optimizers are not merely implementation details — they are central algorithmic choices that shape the trajectory of learning. SGD provides the foundational stochastic first-order method and remains a strong baseline, especially when enhanced with momentum and schedules. RMSprop addresses scale sensitivity by adapting updates using recent squared gradients. Adam unifies momentum and adaptive scaling, making it one of the most widely used optimizers in modern deep learning.
A mature understanding of optimization requires recognizing the distinct strengths of these methods. SGD often rewards careful tuning with strong generalization. RMSprop offers stability in uneven gradient landscapes. Adam delivers practical efficiency and ease of use. The right optimizer is therefore not chosen by habit alone, but by understanding the geometry, noise, scaling, and generalization demands of the learning problem.