Dimensionality Reduction: PCA, t-SNE, LDA

Dimensionality reduction is a core technique in machine learning, statistics, signal processing, and data mining. Its goal is to transform high-dimensional data into a lower-dimensional representation that preserves as much useful structure as possible. This whitepaper provides a detailed technical explanation of three influential methods: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Linear Discriminant Analysis (LDA). Although often grouped together, these methods solve fundamentally different problems and rely on different mathematical principles.

Abstract

High-dimensional data poses challenges such as computational cost, noise accumulation, redundancy, multicollinearity, poor visualization, and the curse of dimensionality. Dimensionality reduction addresses these issues by projecting data into a lower-dimensional space. PCA is an unsupervised linear projection method that maximizes variance. t-SNE is a nonlinear manifold-learning and visualization technique that preserves local neighborhoods. LDA is a supervised linear method that seeks directions maximizing class separability. This paper explains the mathematical foundations, optimization objectives, interpretations, use cases, limitations, and practical guidelines for all three methods, with formulas given inline.

1. Introduction

Let the dataset be X ∈ ℝ^{n×p}, where n is the number of observations and p is the number of original features. When p is large, several issues arise:

distances become less meaningful
models become harder to interpret
noise can dominate signal
training may become computationally expensive
direct visualization is impossible beyond 2D or 3D

Dimensionality reduction constructs a mapping from the original space ℝ^p into a lower-dimensional space ℝ^d, where typically d << p. In general, this means finding a transformation f: ℝ^p → ℝ^d such that the new representation preserves important structure for visualization, compression, denoising, or downstream learning.

2. Why Dimensionality Reduction Matters

The main motivations for dimensionality reduction include improved computational efficiency, reduced storage, noise filtering, feature compression, visualization, and mitigation of overfitting. In exploratory data analysis, dimensionality reduction reveals latent structure. In modeling, it can improve downstream classifiers or regressors by removing irrelevant variation and multicollinearity. In signal processing and recommendation systems, it can compress data while preserving dominant patterns.

3. Taxonomy of Dimensionality Reduction

Dimensionality reduction methods can be grouped along several axes:

Linear vs nonlinear: PCA and LDA are linear; t-SNE is nonlinear.
Supervised vs unsupervised: PCA and t-SNE are typically unsupervised; LDA is supervised.
Projection vs embedding: PCA and LDA provide explicit linear projections; t-SNE creates a low-dimensional embedding but not a stable global projection in the classical sense.

These distinctions are critical. PCA preserves global variance, t-SNE preserves local neighborhood probabilities, and LDA preserves discriminative class structure.

4. Principal Component Analysis (PCA)

PCA is one of the most widely used dimensionality reduction techniques. It finds orthogonal directions in the data that capture maximum variance. These directions are called principal components.

4.1 Data Centering

PCA assumes the data is centered. If x_i is the original feature vector, the centered version is x'_i = x_i - μ, where μ = (1/n) Σ_{i=1}^{n} x_i is the sample mean vector.

Centering ensures that the first principal component captures variance around the mean rather than absolute location.

4.2 Covariance Matrix

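The covariance matrix of the centered data is S = (1/n) X^T X or sometimes S = (1/(n-1)) X^T X, depending on convention. Here, X denotes the centered data matrix. The two conventions differ only by a constant factor, so they yield the same principal directions.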

The diagonal entries of S represent feature variances, and the off-diagonal entries represent pairwise covariances.
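The centering and covariance steps can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not a reference implementation; the toy data, the seed, and the variable names are assumptions made for the example, and the 1/(n-1) convention is used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: n = 200 observations, p = 3 features
# with different scales and non-zero means.
X_raw = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.5]) + np.array([10.0, -2.0, 0.0])

mu = X_raw.mean(axis=0)      # sample mean vector
X = X_raw - mu               # centered data matrix
n = X.shape[0]

# Covariance matrix under the 1/(n-1) convention.
S = X.T @ X / (n - 1)

feature_variances = np.diag(S)   # diagonal entries: per-feature variances
```

The result agrees with `np.cov(X_raw, rowvar=False)`, which uses the same 1/(n-1) normalization by default.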

4.3 Principal Components as Variance Maximizers

PCA seeks a direction w ∈ ℝ^p such that the projected data z_i = w^T x_i (with x_i centered) has maximum variance under the constraint ||w|| = 1.

The variance of the projection is Var(z) = w^T S w. Therefore, the first principal component solves:

maximize w^T S w subject to w^T w = 1

Introducing a Lagrange multiplier λ gives L(w, λ) = w^T S w - λ(w^T w - 1). Setting the gradient with respect to w to zero yields 2 S w - 2 λ w = 0, that is, S w = λ w. Thus the principal components are eigenvectors of the covariance matrix. At any such solution w^T S w = λ, so the maximizer is the eigenvector with the largest eigenvalue, and each eigenvalue λ equals the variance explained along its direction.
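As a numerical check of the eigenvector characterization, the following sketch computes the leading eigenvector of S and compares its projected variance with that of random unit directions. The synthetic data and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical correlated data, centered before analysis.
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)
S = X.T @ X / (X.shape[0] - 1)

# eigh is appropriate for symmetric matrices; it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

w1 = eigvecs[:, 0]            # first principal direction (unit norm)
var_w1 = w1 @ S @ w1          # projected variance w^T S w

# Projected variance along 1000 random unit directions, for comparison.
trials = rng.normal(size=(1000, 4))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
var_trials = np.einsum("ij,jk,ik->i", trials, S, trials)
```

Here `var_w1` coincides with the largest eigenvalue, and no random direction exceeds it.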

4.4 Multiple Components

The first principal component captures the maximum possible variance. The second captures the maximum remaining variance subject to orthogonality with the first, and so on. If the eigenvalues are ordered as λ₁ ≥ λ₂ ≥ ... ≥ λ_p, then the first d eigenvectors form the columns of the projection matrix W ∈ ℝ^{p×d}.

The lower-dimensional representation is Z = XW.

4.5 Explained Variance Ratio

The fraction of total variance explained by the k-th principal component is λ_k / Σ_{j=1}^{p} λ_j. The cumulative explained variance for the first d components is [Σ_{k=1}^{d} λ_k] / [Σ_{j=1}^{p} λ_j].

This is often used to choose how many components to retain.
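One common way to apply this is to keep the smallest d whose cumulative ratio reaches a chosen threshold. The sketch below assumes that rule; the function names and the 0.95 default are illustrative choices, not prescribed by the text.

```python
import numpy as np

def explained_variance_ratio(eigvals):
    """Return lambda_k / sum_j lambda_j with eigenvalues sorted in descending order."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return eigvals / eigvals.sum()

def choose_num_components(eigvals, threshold=0.95):
    """Smallest d whose cumulative explained variance is at least `threshold`."""
    cumulative = np.cumsum(explained_variance_ratio(eigvals))
    # searchsorted returns the first index whose cumulative value is >= threshold.
    idx = np.searchsorted(cumulative, threshold)
    return int(min(idx + 1, len(cumulative)))

# Hypothetical eigenvalue spectrum.
ratios = explained_variance_ratio([5.0, 3.0, 1.5, 0.5])
d = choose_num_components([5.0, 3.0, 1.5, 0.5], threshold=0.9)
```

For this spectrum the ratios are 0.5, 0.3, 0.15, and 0.05, so three components are needed to reach 90% of the variance.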

4.6 SVD View of PCA

PCA can also be computed using singular value decomposition. If the centered matrix is X = U Σ V^T, then the right singular vectors (the columns of V) are the principal directions, and the squared singular values determine the covariance eigenvalues.

Specifically, if σ_k is the k-th singular value, then the k-th eigenvalue of S is λ_k = σ_k² / n under the 1/n convention, or λ_k = σ_k² / (n - 1) under the 1/(n-1) convention. The projected data can be read off directly as Z = XV = UΣ.
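The correspondence between the SVD and the covariance eigendecomposition can be verified numerically. This sketch assumes the 1/(n-1) covariance convention and uses synthetic data; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical centered data matrix.
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)
n = X.shape[0]

# Thin SVD: X = U diag(sing) Vt, singular values in descending order.
U, sing, Vt = np.linalg.svd(X, full_matrices=False)
eig_from_svd = sing**2 / (n - 1)

# Eigenvalues of the covariance matrix, reversed into descending order.
eigvals = np.linalg.eigvalsh(X.T @ X / (n - 1))[::-1]

# Two equivalent ways to obtain the projected data (scores).
Z_svd = U * sing        # U Σ
Z_proj = X @ Vt.T       # X V
```

Working from the SVD avoids forming X^T X explicitly, which is generally better conditioned numerically.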

4.7 Geometric Interpretation

PCA rotates the coordinate system to align with the directions of maximum data spread. The first principal component defines the line of best fit in the least-squares reconstruction sense. More generally, the first d components define the best d-dimensional linear subspace for minimizing reconstruction error.

The reconstruction of a centered point x from the reduced representation is x̂ = W W^T x, and the PCA subspace minimizes Σ_i ||x_i - W W^T x_i||².
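The reconstruction property can be illustrated by comparing the PCA subspace with an arbitrary orthonormal subspace of the same dimension. The sketch below is an assumption-laden toy check (synthetic data, d = 2, 1/(n-1) convention), not a proof.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical centered data with p = 6 features.
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
X = X - X.mean(axis=0)
n, p = X.shape
d = 2

eigvals, eigvecs = np.linalg.eigh(X.T @ X / (n - 1))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
W = eigvecs[:, :d]                      # top-d principal directions

# Rows of X are observations, so x_hat = W W^T x becomes X W W^T.
X_hat = X @ W @ W.T
pca_error = np.sum((X - X_hat) ** 2)

# A random orthonormal basis of another d-dimensional subspace.
Q, _ = np.linalg.qr(rng.normal(size=(p, d)))
random_error = np.sum((X - X @ Q @ Q.T) ** 2)
```

The PCA error equals (n - 1) times the sum of the discarded eigenvalues, and it does not exceed the error of the random subspace.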

4.8 Strengths of PCA

computationally efficient
interpretable linear components
reduces multicollinearity
useful for denoising and compression
good preprocessing step for downstream models

4.9 Limitations of PCA

captures variance, not necessarily class separation or nonlinear structure
sensitive to scaling
principal components may be hard to interpret when many features mix together
linear method, so cannot model nonlinear manifolds effectively

5. t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a nonlinear dimensionality reduction technique designed mainly for visualization. Its purpose is to preserve local neighborhoods: points close in high-dimensional space should remain close in low-dimensional space.

5.1 High-Dimensional Similarities

For each pair of points x_i and x_j, t-SNE defines a conditional similarity: p_{j|i} = exp(-||x_i - x_j||² / (2σ_i²)) / Σ_{k ≠ i} exp(-||x_i - x_k||² / (2σ_i²)).

This can be viewed as the probability that point x_i would choose x_j as a neighbor under a Gaussian centered at x_i.

These similarities are symmetrized as p_ij = (p_{j|i} + p_{i|j}) / (2n).

5.2 Perplexity

The bandwidth σ_i is chosen so that the effective neighborhood size around each point matches a user-defined perplexity. Perplexity is defined as Perp(P_i) = 2^{H(P_i)}, where the Shannon entropy is H(P_i) = - Σ_j p_{j|i} log₂ p_{j|i}.

Intuitively, perplexity controls how many neighbors each point effectively pays attention to.
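The bandwidth search can be sketched as a bisection on σ_i, relying on the fact that perplexity grows as σ_i grows. The code below is a minimal, unoptimized illustration; the function names, the search bounds, and the use of bisection on a logarithmic scale are assumptions for this example rather than details given above.

```python
import numpy as np

def conditional_probs(sq_dists_i, sigma):
    """p_{j|i} for one point i, given squared distances to all other points."""
    logits = -sq_dists_i / (2.0 * sigma**2)
    logits -= logits.max()            # shift for numerical stability; ratios are unchanged
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """2^H(p), with H the Shannon entropy in bits."""
    p = p[p > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

def find_sigma(sq_dists_i, target, lo=1e-4, hi=1e4, iters=100):
    """Bisection on a log scale; assumes the target lies between the bounds."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if perplexity(conditional_probs(sq_dists_i, mid)) > target:
            hi = mid
        else:
            lo = mid
    return np.sqrt(lo * hi)

def joint_probabilities(X, target_perplexity):
    """Symmetrized p_ij = (p_{j|i} + p_{i|j}) / (2n), computed with O(n^2) memory."""
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P_cond = np.zeros((n, n))
    for i in range(n):
        others = np.arange(n) != i
        sigma_i = find_sigma(sq[i, others], target_perplexity)
        P_cond[i, others] = conditional_probs(sq[i, others], sigma_i)
    return (P_cond + P_cond.T) / (2.0 * n)
```

Because each row of conditional probabilities sums to one, the symmetrized matrix sums to one over all pairs and can be treated as a joint distribution.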

5.3 Low-Dimensional Similarities

Let the low-dimensional embedding points be y₁, ..., y_n, usually in 2D or 3D. Instead of a Gaussian, t-SNE uses a Student t-distribution with one degree of freedom in the low-dimensional space: q_ij = (1 + ||y_i - y_j||²)^{-1} / Σ_{k ≠ l} (1 + ||y_k - y_l||²)^{-1}.

This heavy-tailed distribution helps solve the crowding problem by allowing moderately distant points in the low-dimensional map to stay farther apart than a Gaussian would permit.
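The low-dimensional affinities and the heavy-tail effect can be illustrated as follows. The comparison kernel exp(-d²) is an assumption chosen here to stand in for a Gaussian alternative; the function names and sample distances are likewise illustrative.

```python
import numpy as np

def low_dim_affinities(Y):
    """Student-t (one degree of freedom) affinities q_ij for an embedding Y of shape (n, d)."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    kernel = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(kernel, 0.0)     # the normalizing sum runs over k != l only
    return kernel / kernel.sum()

def student_t_kernel(dist):
    return 1.0 / (1.0 + dist**2)

def gaussian_kernel(dist):
    # Unit-bandwidth Gaussian, used only as a point of comparison.
    return np.exp(-dist**2)

# Ratio of unnormalized similarities at increasing map distances.
distances = np.array([0.5, 1.0, 3.0])
tail_ratio = student_t_kernel(distances) / gaussian_kernel(distances)
```

The ratio grows quickly with distance, which shows how much more similarity mass the Student-t kernel leaves for moderately distant pairs.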

5.4 Objective Function

t-SNE finds the embedding by minimizing the Kullback–Leibler divergence between the high-dimensional similarity distribution P and the low-dimensional similarity distribution Q: KL(P || Q) = Σ_{i ≠ j} p_ij log(p_ij / q_ij).

The objective strongly penalizes cases where points that are close in high dimensions become far apart in the low-dimensional map.

5.5 Optimization

t-SNE uses gradient descent to minimize the KL divergence. The gradients depend on attractive forces between pairs with high p_ij and repulsive forces between points according to q_ij. The resulting dynamics resemble a force-based layout: nearby neighbors are pulled together while unrelated points are pushed apart.
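The optimization can be sketched with plain gradient descent. The gradient used below, ∂C/∂y_i = 4 Σ_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||²)^{-1}, is the standard closed form for this objective and is stated here as background rather than taken from the text above. Momentum, early exaggeration, and other refinements used in practical implementations are deliberately omitted, and the function names, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

def q_matrix(Y):
    """Return (Q, kernel) where kernel_ij = (1 + ||y_i - y_j||^2)^-1 with zero diagonal."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    kernel = 1.0 / (1.0 + sq)
    np.fill_diagonal(kernel, 0.0)
    return kernel / kernel.sum(), kernel

def kl_divergence(P, Q):
    """KL(P || Q) summed over pairs with p_ij > 0."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

def kl_gradient(P, Y):
    """Gradient of KL(P || Q) with respect to every embedding point."""
    Q, kernel = q_matrix(Y)
    coeff = (P - Q) * kernel          # positive entries attract, negative entries repel
    # sum_j coeff_ij (y_i - y_j) = y_i * sum_j coeff_ij - sum_j coeff_ij y_j
    return 4.0 * (coeff.sum(axis=1, keepdims=True) * Y - coeff @ Y)

def gradient_descent(P, Y0, learning_rate=0.1, steps=200):
    """Plain gradient descent; returns the final embedding and the KL history."""
    Y = Y0.copy()
    history = [kl_divergence(P, q_matrix(Y)[0])]
    for _ in range(steps):
        Y -= learning_rate * kl_gradient(P, Y)
        history.append(kl_divergence(P, q_matrix(Y)[0]))
    return Y, history
```

Pairs with p_ij > q_ij contribute a pull toward each other and pairs with p_ij < q_ij contribute a push apart, which matches the force-based description.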

5.6 Interpretation of t-SNE Plots

t-SNE is excellent for visualizing local neighborhoods and cluster-like structures. However, distances between clusters, cluster sizes, and global geometry in a t-SNE plot should be interpreted cautiously. t-SNE preserves local relationships much better than global distances.

Two clusters appearing far apart in a t-SNE visualization do not necessarily mean they are globally distant in the original space. Likewise, apparent empty gaps or cluster sizes may be visualization artifacts of the optimization.

5.7 Strengths of t-SNE

excellent for visualizing complex nonlinear structure
reveals local neighborhoods and manifolds
works well for embeddings, image features, text vectors, and biological data

5.8 Limitations of t-SNE

primarily a visualization tool, not ideal as a general-purpose feature extractor
nonlinear embedding is hard to interpret parametrically
sensitive to hyperparameters such as perplexity and learning rate
global distances are not reliably preserved
results can vary across random initializations

6. Linear Discriminant Analysis (LDA)

In the dimensionality reduction context, LDA refers to Linear Discriminant Analysis, also known as Fisher’s Linear Discriminant. It is a supervised method that seeks directions maximizing class separability rather than raw variance.

6.1 Problem Setup

Suppose the dataset has class labels y ∈ {1, 2, ..., K}. Let μ be the overall mean and μ_k the mean of class k. Let N_k be the number of samples in that class.

6.2 Within-Class Scatter

The within-class scatter matrix measures variation of samples around their own class means: S_W = Σ_{k=1}^{K} Σ_{x ∈ C_k} (x - μ_k)(x - μ_k)^T.

Smaller within-class scatter means points in the same class are tightly grouped.

6.3 Between-Class Scatter

The between-class scatter matrix measures how far class means are separated from the global mean: S_B = Σ_{k=1}^{K} N_k (μ_k - μ)(μ_k - μ)^T.

Larger between-class scatter means classes are more widely separated.
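Both scatter matrices can be computed directly from their definitions. The following is a minimal sketch; the function name and the synthetic three-class data mentioned in the note are assumptions for illustration.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return (S_W, S_B) for data X of shape (n, p) and integer class labels y."""
    mu = X.mean(axis=0)                       # overall mean
    p = X.shape[1]
    S_W = np.zeros((p, p))
    S_B = np.zeros((p, p))
    for k in np.unique(y):
        X_k = X[y == k]
        mu_k = X_k.mean(axis=0)               # class mean
        centered = X_k - mu_k
        S_W += centered.T @ centered          # sum of (x - mu_k)(x - mu_k)^T over class k
        diff = (mu_k - mu)[:, None]
        S_B += X_k.shape[0] * (diff @ diff.T) # N_k (mu_k - mu)(mu_k - mu)^T
    return S_W, S_B
```

A useful sanity check is the decomposition of total scatter: S_W + S_B equals the scatter of all points around the overall mean.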

6.4 Fisher Criterion

LDA seeks a projection direction w that maximizes the ratio of between-class scatter to within-class scatter. In the two-class case, the Fisher criterion is J(w) = [w^T S_B w] / [w^T S_W w].

The optimal direction satisfies a generalized eigenvalue problem: S_B w = λ S_W w.

For multiple classes, the solution consists of the top eigenvectors of S_W^{-1} S_B, provided S_W is invertible or suitably regularized.

6.5 Dimensionality Bound

An important property of LDA is that the resulting subspace has dimension at most K - 1, where K is the number of classes. This is because the rank of S_B is at most K - 1.
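The generalized eigenvalue problem and the K - 1 bound can be checked numerically. The sketch below solves S_B w = λ S_W w by first whitening S_W, which is one of several equivalent routes and is chosen here only because it keeps every step symmetric. The function names and the small ridge term are assumptions for this example.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class and between-class scatter, following the definitions above."""
    mu = X.mean(axis=0)
    p = X.shape[1]
    S_W, S_B = np.zeros((p, p)), np.zeros((p, p))
    for k in np.unique(y):
        X_k = X[y == k]
        mu_k = X_k.mean(axis=0)
        S_W += (X_k - mu_k).T @ (X_k - mu_k)
        S_B += X_k.shape[0] * np.outer(mu_k - mu, mu_k - mu)
    return S_W, S_B

def lda_directions(X, y, ridge=1e-8):
    """Return (W, all_eigenvalues); W holds the top min(p, K - 1) discriminant directions."""
    S_W, S_B = scatter_matrices(X, y)
    p = X.shape[1]
    K = np.unique(y).size
    # Hypothetical small ridge so that S_W is safely invertible.
    S_W = S_W + ridge * np.trace(S_W) / p * np.eye(p)
    evals_w, evecs_w = np.linalg.eigh(S_W)
    T = evecs_w / np.sqrt(evals_w)            # whitening transform: T^T S_W T = I
    eigvals, U = np.linalg.eigh(T.T @ S_B @ T)
    order = np.argsort(eigvals)[::-1]
    eigvals, directions = eigvals[order], (T @ U)[:, order]
    d = min(p, K - 1)
    return directions[:, :d], eigvals
```

With three classes in five dimensions, only two eigenvalues are meaningfully non-zero, so the projection Z = XW has at most two columns.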

6.6 Geometric Meaning

PCA looks for directions of maximum overall variance, even if that variance is irrelevant for class discrimination. LDA instead looks for directions where class means are far apart relative to the spread of points within each class. So LDA is often better than PCA when the downstream goal is classification and class labels are available.
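A small synthetic example makes the contrast concrete. Two classes are stretched along the first axis but separated along the second, so the direction of maximum variance is nearly useless for discrimination. The data, the seed, and the helper name are assumptions; the two-class direction w ∝ S_W^{-1}(μ₁ - μ₀) is the standard closed form for Fisher's discriminant and is used here as background knowledge.

```python
import numpy as np

rng = np.random.default_rng(4)
n_per_class = 500
spread = np.array([5.0, 0.5])   # large variance on axis 0, small on axis 1

# Hypothetical classes that differ only in their mean along axis 1.
class_0 = rng.normal(size=(n_per_class, 2)) * spread + np.array([0.0, -1.0])
class_1 = rng.normal(size=(n_per_class, 2)) * spread + np.array([0.0, 1.0])
X = np.vstack([class_0, class_1])

# PCA: leading eigenvector of the total scatter (labels ignored).
Xc = X - X.mean(axis=0)
_, eigvecs = np.linalg.eigh(Xc.T @ Xc)
w_pca = eigvecs[:, -1]

# Two-class LDA: direction proportional to S_W^{-1} (mu_1 - mu_0).
mu_0, mu_1 = class_0.mean(axis=0), class_1.mean(axis=0)
S_W = (class_0 - mu_0).T @ (class_0 - mu_0) + (class_1 - mu_1).T @ (class_1 - mu_1)
w_lda = np.linalg.solve(S_W, mu_1 - mu_0)
w_lda /= np.linalg.norm(w_lda)

def separation(w):
    """Gap between projected class means, in units of pooled projected standard deviation."""
    z0, z1 = class_0 @ w, class_1 @ w
    pooled_std = np.sqrt((z0.var(ddof=1) + z1.var(ddof=1)) / 2.0)
    return abs(z1.mean() - z0.mean()) / pooled_std
```

In this setup the PCA direction lies almost entirely along axis 0 and the LDA direction almost entirely along axis 1, with a much larger separation score for LDA.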

6.7 LDA as Both Classifier and Reducer

LDA is widely known as a classifier under Gaussian class-conditional assumptions with shared covariance. But even when viewed as a classifier, its underlying discriminant subspace gives a principled method of supervised feature extraction. In many pipelines, LDA is used first to reduce dimensionality and then another classifier is trained on the reduced representation.

6.8 Strengths of LDA

supervised dimensionality reduction aligned with class separation
often improves classification when labels exist
interpretable linear projection directions
computationally efficient on moderate data sizes

6.9 Limitations of LDA

requires class labels
assumes classes can be separated by linear boundaries, which is most appropriate when classes share a similar covariance structure
limited to at most K - 1 dimensions
can struggle when within-class covariance estimates are unstable
sensitive to class imbalance and non-Gaussian structure in some cases

7. PCA vs t-SNE vs LDA

7.1 Nature of Supervision

PCA is unsupervised and ignores labels. t-SNE is also generally unsupervised and focuses on neighborhood structure. LDA is supervised and explicitly uses class information.

7.2 Linear vs Nonlinear

PCA and LDA are linear methods: both produce transformations of the form Z = XW. t-SNE is nonlinear and does not yield a simple global projection matrix.

7.3 Objective Functions

PCA maximizes projected variance w^T S w. LDA maximizes the Fisher ratio [w^T S_B w] / [w^T S_W w]. t-SNE minimizes the divergence KL(P || Q) between neighborhood similarity distributions.

7.4 Best Use Cases

PCA is best for compression, denoising, multicollinearity reduction, and general-purpose preprocessing. t-SNE is best for 2D or 3D visualization of complex manifolds and local structure. LDA is best when class labels exist and the goal is discriminative dimensionality reduction.

7.5 Interpretability

PCA components can often be interpreted through loadings, though this may be difficult if many features mix. LDA directions are more directly tied to class separation. t-SNE is visually interpretable but not easily interpretable as a stable feature transformation.

8. Reconstruction and Information Loss

Linear methods like PCA enable explicit reconstruction through x̂ = W W^T x when the data is centered and W is orthonormal. Reconstruction error is a natural measure of information loss.

LDA is not designed for reconstruction; it is designed for discrimination. t-SNE is also not a reconstruction-based method. Its purpose is to preserve local probabilities for visualization rather than to support invertible feature compression.

9. Scaling and Preprocessing

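Feature scaling is especially important for PCA because the covariance structure depends on variable scale: a feature measured in large units can dominate the leading components. Unregularized LDA is invariant to invertible rescaling of the features, since S_W and S_B transform together, but scaling still affects numerical conditioning and any regularization applied to S_W. Standardization using x'_j = (x_j - μ_j) / σ_j is often applied when features have different units or ranges.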
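The effect of standardization on PCA can be seen with two correlated features recorded on very different scales. This is a toy sketch; the data-generating process, the scale factor of 1000, and the function names are assumptions for illustration.

```python
import numpy as np

def standardize(X):
    """Apply x'_j = (x_j - mu_j) / sigma_j column by column."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0, ddof=1)
    return (X - mu) / sigma

def first_principal_direction(X):
    """Leading eigenvector of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)
    _, eigvecs = np.linalg.eigh(Xc.T @ Xc / (Xc.shape[0] - 1))
    return eigvecs[:, -1]       # eigh sorts eigenvalues in ascending order

rng = np.random.default_rng(5)
shared = rng.normal(size=500)   # hypothetical latent factor driving both features

X = np.column_stack([
    1000.0 * (shared + 0.5 * rng.normal(size=500)),   # feature in small units, large numbers
    1.0 * (shared + 0.5 * rng.normal(size=500)),      # same kind of signal, small numbers
])

w_raw = first_principal_direction(X)
w_std = first_principal_direction(standardize(X))
```

On the raw data the first component is essentially the large-scale feature alone. After standardization both features receive equal weight, which reflects their shared signal rather than their units.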

For t-SNE, preprocessing with PCA is common. First reducing the data to an intermediate dimension, such as 30 or 50, can reduce noise and speed up t-SNE substantially.

10. Computational Considerations

PCA is computationally efficient, especially with SVD and randomized methods on large data. LDA is also efficient when the number of classes and feature dimensions are moderate, though a singular S_W can require regularization. t-SNE is computationally much heavier because it optimizes over all pairwise similarities, which costs O(n²) per iteration in the exact form, and it often requires many iterations. Approximate variants such as Barnes-Hut t-SNE reduce the per-iteration cost to roughly O(n log n) for larger datasets.

11. Practical Applications

11.1 PCA Applications

PCA is used in image compression, denoising, exploratory factor compression, genomic preprocessing, sensor data compression, finance risk analysis, and as preprocessing before regression, clustering, or anomaly detection.

11.2 t-SNE Applications

t-SNE is widely used to visualize word embeddings, image embeddings, single-cell RNA sequencing data, customer embeddings, latent spaces from neural networks, and other high-dimensional feature sets where local structure is important.

11.3 LDA Applications

LDA is useful in face recognition, medical classification support, document classification with labeled categories, biometrics, and supervised preprocessing before downstream classifiers.

12. Common Pitfalls

Using PCA without centering or scaling when features have different units.
Interpreting t-SNE global distances as if they were metric-preserving.
Using t-SNE as a drop-in replacement for predictive feature engineering without validation.
Applying LDA when labels are unavailable or when classes are heavily overlapping in nonlinear ways.
Choosing dimensionality solely by habit instead of variance, discrimination, or visualization purpose.

13. Best Practices

Use PCA when the goal is compression, denoising, or general-purpose linear reduction.
Use t-SNE primarily for visualization, not as a default modeling feature extractor.
Use LDA when class labels are known and class separation matters.
Standardize features before PCA and LDA unless domain knowledge suggests otherwise.
For t-SNE, experiment with perplexity and random seeds, and compare multiple runs.
Validate whether the reduced representation actually helps the downstream task.

14. Conclusion

PCA, t-SNE, and LDA all reduce dimensionality, but they do so for very different reasons. PCA seeks directions of maximum variance and is ideal for linear compression and noise reduction. t-SNE seeks a low-dimensional map that preserves local neighborhoods and is ideal for visualization of complex nonlinear structure. LDA seeks directions maximizing class separability and is ideal when labels are available and discrimination matters.

A mature understanding of dimensionality reduction requires recognizing that no single method is universally best. The right choice depends on whether the objective is compression, visualization, or supervised discrimination; on whether the structure is linear or nonlinear; and on whether interpretability or predictive utility is the primary concern. Used thoughtfully, these methods can transform high-dimensional complexity into representations that are computationally useful, visually meaningful, and scientifically informative.