Abstract
Optimizers determine how model parameters are updated during training, and they play a central role in the speed, stability, and final quality of machine learning models. In deep learning especially, the optimizer can strongly influence convergence behavior, sensitivity to initialization, robustness to noisy gradients, and generalization. This whitepaper provides a detailed technical explanation of three foundational optimizers: Stochastic Gradient Descent (SGD), RMSprop, and Adam.
Most modern machine learning models are trained by minimizing an objective function with iterative gradient-based optimization. However, plain gradient descent is often too slow, unstable, or poorly scaled for high-dimensional, noisy, and non-convex problems such as neural network training. Optimizers such as SGD, RMSprop, and Adam modify the update rule to improve convergence. SGD introduces stochasticity and, often, momentum. RMSprop rescales updates using an exponentially weighted average of squared gradients. Adam combines momentum and adaptive learning-rate scaling through first- and second-moment estimates. This paper explains their mathematical formulations, interpretations, bias-variance behavior in optimization, convergence dynamics, practical trade-offs, and common tuning considerations. All formulas are embedded inline in HTML-friendly format for direct use in WordPress or similar editors.
1. Introduction
Let a model be parameterized by θ ∈ ℝ^m, and let the objective function be J(θ). Training seeks to find:
θ* = argmin_θ J(θ).
In supervised learning, this objective is typically the empirical average loss over a dataset:
J(θ) = (1/n) Σ_{i=1}^{n} L(y_i, f(x_i; θ)).
The gradient ∇_θ J(θ) indicates the direction of steepest increase of the objective. Therefore, gradient-based optimizers update parameters in the negative gradient direction. The simplest update is:
θ := θ - η ∇_θ J(θ),
where η is the learning rate.
This basic rule is the starting point for nearly all optimizers discussed in this whitepaper.
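As a concrete illustration, this basic rule can be sketched in a few lines of Python on a toy one-dimensional quadratic. The objective, learning rate, and step count below are illustrative choices, not part of the formulation above.

```python
# Plain gradient descent on the toy objective J(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3); the minimizer is theta* = 3.

def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter
eta = 0.1     # learning rate
for _ in range(100):
    theta = theta - eta * grad_J(theta)   # theta := theta - eta * grad

print(theta)  # close to 3.0 after 100 steps
```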
2. Why Optimizers Matter
Optimization in machine learning is not merely a mathematical formality. Real-world objectives are often:
- high-dimensional
- non-convex
- noisy due to minibatching
- ill-conditioned, with different curvature in different directions
- sensitive to initialization and scaling
Because of this, the naive update rule may converge slowly, oscillate, or get stuck in flat or unstable regions. Optimizers modify how gradients are used so that training becomes more stable and efficient.
3. Full-Batch Gradient Descent
Before discussing SGD and its descendants, it is useful to distinguish full-batch gradient descent. In full-batch training, the gradient is computed over the entire dataset:
g = ∇_θ J(θ) = (1/n) Σ_{i=1}^{n} ∇_θ L(y_i, f(x_i; θ)).
The update is then:
θ := θ - η g.
This gives an exact empirical gradient but can be expensive on large datasets and often lacks the beneficial noise properties of stochastic methods.
4. Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent replaces the full-dataset gradient with a gradient computed from one example or a small minibatch. If B is the minibatch, then the stochastic gradient estimate is:
g_B = (1/|B|) Σ_{i∈B} ∇_θ L(y_i, f(x_i; θ)).
The update becomes:
θ := θ - η g_B.
4.1 Why SGD Works
The minibatch gradient is an unbiased or approximately unbiased estimator of the full gradient:
E[g_B] ≈ ∇_θ J(θ).
Although noisy, SGD is computationally cheaper per step and can make rapid progress. The noise also helps the optimizer explore the landscape and potentially escape shallow local minima or saddle regions.
4.2 Mini-Batch SGD
In practice, “SGD” usually refers to mini-batch SGD rather than single-example updates. Mini-batches offer a good trade-off between stable gradient estimates and hardware-efficient parallel computation.
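A minimal mini-batch SGD sketch in Python, fitting a one-parameter linear model y ≈ w·x by least squares. The data, batch size, learning rate, and step count are illustrative assumptions, not taken from the text.

```python
import random

# Mini-batch SGD for least squares: minimize (1/n) sum (w*x_i - y_i)^2.
# The synthetic data follow y = 2x exactly, so the optimal weight is w = 2.
random.seed(0)
data = [(float(x), 2.0 * x) for x in range(1, 21)]

w, eta, batch_size = 0.0, 0.001, 4
for _ in range(1000):
    batch = random.sample(data, batch_size)
    # gradient of the mini-batch mean squared error w.r.t. w
    g = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
    w -= eta * g   # w := w - eta * g_B

print(round(w, 3))  # approaches the true weight 2.0
```

Each step uses only 4 of the 20 examples, so the gradient estimate is noisy, yet the iterates still settle at the minimizer.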
4.3 SGD Geometry
SGD takes steps directly proportional to the current gradient. If one parameter dimension consistently has much larger gradients than another, SGD may oscillate in steep directions and progress slowly in shallow directions. This sensitivity to curvature motivates momentum and adaptive scaling methods.
5. SGD with Momentum
A major extension of SGD is momentum, which accumulates a running velocity in the direction of past gradients:
v_t = β v_{t-1} + g_t
and
θ_{t+1} = θ_t - η v_t.
Here, β is the momentum coefficient, usually between 0 and 1.
Momentum smooths noisy updates, accelerates movement along consistent directions, and reduces oscillation across directions with steep curvature.
5.1 Interpretation of Momentum
Momentum can be viewed as a physical analogy: the optimizer behaves like a particle with inertia. Instead of reacting only to the current gradient, it carries forward part of its previous motion.
This is especially helpful in ravine-like loss landscapes where one direction has steep curvature and another has shallow curvature. Momentum dampens oscillation in the steep direction and speeds progress in the shallow one.
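The heavy-ball update above can be sketched on the same toy quadratic used earlier; the learning rate, momentum coefficient, and step count are illustrative choices.

```python
# Heavy-ball momentum sketch on J(theta) = (theta - 3)^2:
# v_t = beta * v_{t-1} + g_t, then theta := theta - eta * v_t.

def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta, v = 0.0, 0.0
eta, beta = 0.05, 0.9
for _ in range(400):
    v = beta * v + grad_J(theta)   # accumulate velocity
    theta -= eta * v               # step along the velocity

print(theta)  # close to 3.0
```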
5.2 Nesterov Momentum
A refined version is Nesterov Accelerated Gradient (NAG), which computes the gradient at a look-ahead position:
v_t = β v_{t-1} + ∇J(θ_t - ηβ v_{t-1}),
then
θ_{t+1} = θ_t - η v_t.
Nesterov momentum can offer better anticipatory correction than classical momentum.
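In code, the look-ahead variant changes only where the gradient is evaluated. An illustrative sketch on the same toy quadratic:

```python
# Nesterov momentum sketch: the gradient is taken at the look-ahead
# point theta - eta * beta * v rather than at theta itself.

def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta, v = 0.0, 0.0
eta, beta = 0.05, 0.9
for _ in range(400):
    v = beta * v + grad_J(theta - eta * beta * v)  # look-ahead gradient
    theta -= eta * v

print(theta)  # close to 3.0
```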
6. Limitations of Plain SGD
Despite its strengths, plain SGD has several challenges:
- requires careful learning-rate tuning
- uses the same global learning rate for all parameters
- can struggle when gradients vary greatly in scale across dimensions
- may converge slowly on ill-conditioned objectives
These limitations motivated adaptive methods such as RMSprop and Adam.
7. Adaptive Learning Rates
The main idea of adaptive optimization is to scale the update differently for each parameter dimension based on past gradients. Parameters that consistently receive large gradients should often get smaller step sizes, while parameters with small gradients can receive larger effective updates.
This creates coordinate-wise adaptation rather than relying on a single scalar learning rate.
8. RMSprop
RMSprop is an adaptive optimizer that maintains an exponentially weighted moving average of squared gradients. Let the gradient at step t be g_t. RMSprop computes:
s_t = ρ s_{t-1} + (1 - ρ) g_t^2,
where the square is applied elementwise.
The update is then:
θ_{t+1} = θ_t - η g_t / (√(s_t) + ε).
Here:
- ρ is the decay factor, often around 0.9
- ε is a small constant for numerical stability
8.1 Interpretation of RMSprop
RMSprop divides each gradient component by the root mean square of recent gradients in that dimension. If a parameter has consistently large gradients, its denominator grows, reducing the effective step size. If a parameter has smaller gradients, its updates are relatively amplified.
This helps stabilize training on objectives with uneven curvature and makes learning rates less sensitive to raw gradient scale.
8.2 Why RMSprop Helps
Consider a loss surface with steep curvature in one direction and shallow curvature in another. Plain SGD may zigzag inefficiently. RMSprop rescales updates in each direction separately, often reducing oscillations and improving convergence speed.
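The ill-conditioned scenario above can be sketched numerically. The 2-D quadratic, its curvatures, and the hyperparameters below are illustrative assumptions.

```python
import math

# RMSprop on J(theta) = 50*theta0^2 + 0.5*theta1^2, whose gradient is
# (100*theta0, theta1). Per-coordinate scaling lets one learning rate
# serve both the steep and the shallow direction.

theta = [1.0, 1.0]
s = [0.0, 0.0]
eta, rho, eps = 0.01, 0.9, 1e-8
for _ in range(2000):
    g = [100.0 * theta[0], 1.0 * theta[1]]
    for j in range(2):
        s[j] = rho * s[j] + (1.0 - rho) * g[j] ** 2        # EWMA of g^2
        theta[j] -= eta * g[j] / (math.sqrt(s[j]) + eps)   # scaled step

print(theta)  # both coordinates end up near 0 (small oscillations remain)
```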
8.3 RMSprop Limitations
RMSprop adapts learning rates but does not explicitly track a first-moment momentum term in its simplest form. It may still require careful tuning and, depending on the problem, may generalize differently from SGD-based methods.
9. Adam
Adam, short for Adaptive Moment Estimation, combines ideas from momentum and RMSprop. It tracks:
- the first moment (mean) of gradients
- the second moment (uncentered variance proxy) of gradients
At step t, Adam computes:
m_t = β_1 m_{t-1} + (1 - β_1) g_t
and
v_t = β_2 v_{t-1} + (1 - β_2) g_t^2.
Here:
- m_t is the exponentially weighted mean of gradients
- v_t is the exponentially weighted mean of squared gradients
9.1 Bias Correction
Because m_0 = 0 and v_0 = 0, the moving averages are biased toward zero early in training. Adam corrects this by using:
m̂_t = m_t / (1 - β_1^t)
and
v̂_t = v_t / (1 - β_2^t).
9.2 Adam Update Rule
The parameter update is:
θ_{t+1} = θ_t - η m̂_t / (√(v̂_t) + ε).
Default hyperparameters are often:
β_1 = 0.9, β_2 = 0.999, and ε = 10^-8.
9.3 Interpretation of Adam
Adam behaves like momentum because it uses m_t to smooth gradients over time.
It behaves like RMSprop because it divides by a running scale estimate √(v̂_t).
This often leads to fast convergence, especially in noisy and sparse-gradient settings.
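Putting the pieces together, a full Adam loop with bias correction can be sketched on the toy quadratic used earlier (the learning rate and step count are illustrative; the moment decays follow the common defaults above).

```python
import math

# Adam sketch with bias correction on J(theta) = (theta - 3)^2.

def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta, m, v = 0.0, 0.0, 0.0
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 5001):                     # t starts at 1 for bias correction
    g = grad_J(theta)
    m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)           # bias-corrected moments
    v_hat = v / (1.0 - beta2 ** t)
    theta -= eta * m_hat / (math.sqrt(v_hat) + eps)

print(theta)  # near the minimizer 3.0
```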
10. Adam vs RMSprop
RMSprop uses only a second-moment-like scaling term, s_t. Adam extends this with a first-moment term, m_t.
As a result, Adam can be viewed as RMSprop plus momentum plus bias correction. In many practical settings, Adam is a strong default optimizer because it tends to converge quickly and with relatively little tuning.
11. AdamW and Weight Decay Note
In practice, a commonly used variant is AdamW, which decouples weight decay from the adaptive gradient update. While this goes beyond vanilla Adam, it is often preferred in modern deep learning because naive L2 regularization and adaptive updates can interact in undesirable ways.
In AdamW-style updates, one may write:
θ_{t+1} = θ_t - η [m̂_t / (√(v̂_t) + ε)] - η λ θ_t.
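A single AdamW-style step can be written as a small helper; the hyperparameter values here are illustrative, and the key point is that the decay term η·λ·θ is applied outside the adaptive rescaling rather than folded into the gradient.

```python
import math

# AdamW-style step (illustrative): decoupled weight decay is applied
# directly to the parameter, not mixed into the adaptive gradient term.

def adamw_step(theta, m, v, g, t,
               eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, lam=0.01):
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps) - eta * lam * theta
    return theta, m, v

# One step from theta = 1.0 with gradient g = 1.0:
theta, m, v = adamw_step(1.0, 0.0, 0.0, g=1.0, t=1)
```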
12. Convergence Behavior and Generalization
Different optimizers do not just affect speed; they can also affect the final solution and its generalization behavior. In some empirical studies, SGD with momentum has shown better final generalization than Adam on certain deep learning benchmarks, even when Adam reaches low training loss faster.
One possible interpretation is that SGD’s noisier and less adaptive updates may bias the optimizer toward flatter minima, while adaptive methods can sometimes settle into sharper regions. This is still an area of active research and depends heavily on the architecture, dataset, and training regime.
13. Sparse Gradients
Adaptive methods such as RMSprop and Adam are often especially effective when gradients are sparse. If only a few parameters receive updates frequently, coordinate-wise scaling can help these parameters learn effectively without requiring a globally tuned learning rate.
This is one reason Adam has been widely used in NLP and embedding-heavy models.
14. Learning Rate Sensitivity
SGD is highly sensitive to the learning rate and often requires schedules such as step decay, cosine decay, or warmup. Adam and RMSprop are generally more forgiving, but they still depend on an appropriate η.
A learning rate that is too high can destabilize any optimizer; a learning rate that is too low can make training unnecessarily slow.
15. Optimization in Ill-Conditioned Problems
If the Hessian of the objective has eigenvalues with very different magnitudes, the landscape is ill-conditioned. SGD may zigzag across steep directions while moving slowly along shallow ones. Adaptive methods approximately normalize gradient magnitudes per coordinate, which can make progress more balanced.
16. Noise in Stochastic Optimization
Stochastic gradients are inherently noisy because they are estimated from minibatches. This noise can be beneficial:
- it reduces the cost per update
- it helps exploration of the loss landscape
- it may provide an implicit regularization effect
However, too much noise can destabilize training. Batch size, optimizer choice, and learning rate together determine the effective noise level.
17. Practical Hyperparameters
17.1 SGD
Key hyperparameters:
- learning rate η
- momentum coefficient β
- weight decay λ
- learning rate schedule
17.2 RMSprop
Key hyperparameters:
- learning rate η
- decay factor ρ
- stability constant ε
17.3 Adam
Key hyperparameters:
- learning rate η
- first-moment decay β_1
- second-moment decay β_2
- stability constant ε
18. Typical Use Cases
18.1 When SGD Is Preferred
SGD with momentum is often preferred when:
- generalization is the top priority
- the practitioner can tune learning-rate schedules carefully
- large-scale vision models are being trained from scratch
18.2 When RMSprop Is Useful
RMSprop is often useful in recurrent or nonstationary settings, especially where gradient magnitudes vary strongly over time.
18.3 When Adam Is Preferred
Adam is often preferred when:
- fast convergence is important
- the model is complex and hard to tune
- gradients are sparse or noisy
- the user wants a strong practical default optimizer
19. Limitations of Adaptive Methods
Although Adam and RMSprop can converge quickly, they may sometimes generalize worse than SGD with momentum on certain tasks. Adaptive step scaling may also lead to unusual training dynamics if hyperparameters are not well chosen.
This does not make adaptive optimizers inferior — it means optimizer choice must be aligned with the problem and training regime.
20. Common Training Heuristics
- Use learning-rate warmup when training deep or sensitive models.
- Combine SGD with momentum and a decay schedule for strong generalization-focused training.
- Use Adam as a baseline optimizer when rapid prototyping.
- Monitor validation metrics, not just training loss.
- Retune learning rates when changing batch size or optimizer family.
21. Optimizer Comparison Summary
SGD is simple, effective, and often strong for generalization, but requires more careful learning-rate tuning. RMSprop rescales updates using recent squared gradients and is useful when gradient magnitudes vary substantially. Adam combines momentum and adaptive scaling, often making it the most convenient and robust practical default for many deep learning tasks.
22. Best Practices
- Start with Adam when fast and stable optimization is needed quickly.
- Switch to SGD with momentum if final generalization becomes the priority and tuning budget is available.
- Use learning-rate schedules regardless of optimizer choice.
- Track gradient norms and optimizer state behavior when debugging training.
- Benchmark multiple optimizers because architecture and data strongly influence which works best.
23. Conclusion
Optimizers are not merely implementation details — they are central algorithmic choices that shape the trajectory of learning. SGD provides the foundational stochastic first-order method and remains a strong baseline, especially when enhanced with momentum and schedules. RMSprop addresses scale sensitivity by adapting updates using recent squared gradients. Adam unifies momentum and adaptive scaling, making it one of the most widely used optimizers in modern deep learning.
A mature understanding of optimization requires recognizing the distinct strengths of these methods. SGD often rewards careful tuning with strong generalization. RMSprop offers stability in uneven gradient landscapes. Adam delivers practical efficiency and ease of use. The right optimizer is therefore not chosen by habit alone, but by understanding the geometry, noise, scaling, and generalization demands of the learning problem.