Transformer Architecture and BERT

The Transformer architecture fundamentally changed natural language processing by replacing recurrence with attention-based sequence modeling. BERT, built on the Transformer encoder, became one of the most influential pretrained language models by introducing bidirectional contextual pretraining at scale. This whitepaper explains the Transformer architecture and BERT in technical depth, covering self-attention, positional encoding, multi-head attention, encoder stacks, pretraining objectives, fine-tuning behavior, and practical limitations.

Abstract

Traditional sequence models such as RNNs and LSTMs process tokens sequentially, which limits parallelization and often makes long-range dependency modeling difficult. The Transformer introduced a new architecture based entirely on attention mechanisms, allowing direct interaction among all sequence positions in parallel. Its core operation, scaled dot-product attention, enables flexible context aggregation without recurrence. BERT, or Bidirectional Encoder Representations from Transformers, uses a deep Transformer encoder pretrained with masked language modeling and next sentence prediction objectives to learn rich bidirectional contextual representations. This paper explains the mathematical structure of the Transformer, how BERT is built from it, the nature of bidirectional pretraining, fine-tuning strategies, strengths, limitations, and its impact on NLP. All formulas are embedded inline in HTML-friendly format for direct use in WordPress or similar editors.

1. Introduction

A natural language sequence may be written as x = (w_1, w_2, ..., w_T), where T is the sequence length. A model must often assign a representation or prediction to each token or to the sequence as a whole.

Earlier neural NLP architectures relied heavily on recurrence. Although powerful, recurrent models process tokens sequentially, which limits training efficiency and makes long-distance dependency modeling challenging. The Transformer replaced this with attention, allowing each token to interact directly with all other tokens in the sequence.

2. Motivation for the Transformer

Recurrent models compute hidden states iteratively: h_t = f(x_t, h_{t-1}).

This means:

  • computation is inherently sequential
  • long-range information must travel through many recurrent steps
  • training can suffer from vanishing or exploding gradients

The Transformer addresses these limitations by making token-token interaction explicit through attention. This allows the model to capture long-range dependencies more directly and compute all token interactions in parallel.

3. Embedding Input Tokens

Each input token is first mapped to a vector embedding: e(w_t) ∈ ℝ^d.

Because the Transformer has no recurrence, it also needs positional information. The final input representation is typically: x_t = e(w_t) + p_t, where p_t is a positional encoding or learned positional embedding.

4. Positional Encoding

Without positional information, attention would treat the input as an unordered set. The original Transformer used sinusoidal positional encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)).

Here:

  • pos is the token position
  • i indexes embedding dimensions
  • d is the model dimension

These encodings allow the model to infer relative and absolute position information.
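As an illustration, the sinusoidal encoding table can be computed directly. This is a minimal NumPy sketch; the function name and the dimensions used in the example are illustrative, and an even model dimension d is assumed.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...)."""
    positions = np.arange(max_len)[:, None]           # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))  # one angle per (pos, i) pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even embedding dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd embedding dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(128, 64)
```

At position 0 every angle is zero, so the sine dimensions are 0 and the cosine dimensions are 1, which matches the formulas above.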

5. Scaled Dot-Product Attention

The core computation in the Transformer is scaled dot-product attention. Given query, key, and value matrices Q, K, and V, attention is: Attention(Q, K, V) = softmax(QK^T / √d_k) V.

Here, d_k is the key dimension used for scaling.

5.1 Interpretation

The matrix product QK^T computes similarity scores between queries and keys. The softmax converts these into normalized attention weights. Multiplying by V produces weighted combinations of value vectors.

In sequence terms, each token builds a context-aware representation by attending to other tokens.
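The computation can be sketched in a few lines of NumPy. The example below uses the same matrix for Q, K, and V (self-attention with identity projections) purely for illustration; the random input is a stand-in for real token representations.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (T_q, T_k) similarity scores
    # Numerically stable row-wise softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights       # context vectors and attention weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))           # 5 tokens, model dimension 8
out, w = scaled_dot_product_attention(X, X, X)
```

Each row of the weight matrix sums to one, so every output token is a convex combination of the value vectors.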

6. Self-Attention

In self-attention, the queries, keys, and values all come from the same sequence representation X. These are computed through learned linear projections: Q = XW_Q, K = XW_K, and V = XW_V.

Thus each token can attend to all tokens, including itself, based on learned pairwise compatibility.

7. Multi-Head Attention

Instead of using a single attention mechanism, the Transformer uses multiple attention heads. For head h: head_h = Attention(Q W_Q^(h), K W_K^(h), V W_V^(h)).

The outputs are concatenated and projected: MultiHead(Q, K, V) = Concat(head_1, ..., head_H) W_O.

Multi-head attention allows the model to capture different relational patterns simultaneously, such as syntactic, semantic, or positional dependencies.
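A compact sketch of multi-head self-attention follows. The random projection matrices stand in for learned parameters, and the head dimension d_model / H follows the usual convention; the parameter layout is an illustrative choice, not a fixed API.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, params, num_heads):
    """Concat(head_1, ..., head_H) W_O, each head with its own projections."""
    heads = []
    for h in range(num_heads):
        Q = X @ params["W_Q"][h]          # project into this head's subspace
        K = X @ params["W_K"][h]
        V = X @ params["W_V"][h]
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ params["W_O"]

rng = np.random.default_rng(1)
d_model, H = 16, 4
d_head = d_model // H                     # per-head dimension
params = {
    "W_Q": rng.normal(size=(H, d_model, d_head)),
    "W_K": rng.normal(size=(H, d_model, d_head)),
    "W_V": rng.normal(size=(H, d_model, d_head)),
    "W_O": rng.normal(size=(d_model, d_model)),
}
X = rng.normal(size=(6, d_model))         # 6 tokens
out = multi_head_attention(X, params, H)
```

Because each head attends in its own learned subspace, the concatenated output can mix several distinct relational patterns before the final projection W_O.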

8. Feedforward Layer

Each Transformer block also contains a position-wise feedforward network applied independently to each token representation: FFN(x) = W_2 σ(W_1 x + b_1) + b_2.

The activation σ is often ReLU or GELU. This feedforward layer increases representational capacity beyond pure attention-based mixing.
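The position-wise structure is easy to see in code: the same two linear maps are applied to every token row independently. This sketch uses ReLU and random weights; the inner dimension d_ff is typically several times larger than d_model, as assumed here.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = W_2 * ReLU(W_1 x + b_1) + b_2, applied per token row."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(2)
d_model, d_ff = 8, 32                  # expand, apply nonlinearity, project back
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
X = rng.normal(size=(5, d_model))      # 5 token representations
out = ffn(X, W1, b1, W2, b2)
```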

9. Residual Connections and Layer Normalization

Each sublayer in the Transformer is wrapped with a residual connection and normalization. A simplified pattern is: y = LayerNorm(x + Sublayer(x)).

Residual connections help preserve gradient flow, while layer normalization stabilizes hidden-state distributions. Together they make very deep Transformer stacks trainable.
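The post-norm pattern y = LayerNorm(x + Sublayer(x)) can be sketched as follows. The toy sublayer here is a stand-in for attention or the feedforward network, and the layer norm omits the learnable gain and bias for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    """y = LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 8))
Y = residual_block(X, lambda z: z * 0.5)   # toy sublayer for illustration
```

Because the input is added back before normalization, the identity path is always available, which is what preserves gradient flow in deep stacks.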

10. Transformer Encoder Block

A standard encoder block consists of:

  • multi-head self-attention
  • residual connection + layer normalization
  • position-wise feedforward network
  • residual connection + layer normalization

Stacking these blocks yields a deep contextual encoder.
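The four-step recipe above composes into a single function, and stacking that function deepens the encoder. This sketch simplifies to one attention head and omits biases and the learnable layer-norm parameters; the small dimensions and random weights are illustrative only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(axis=-1, keepdims=True), x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(X, p):
    """Self-attention -> add & norm -> feedforward -> add & norm."""
    Q, K, V = X @ p["W_Q"], X @ p["W_K"], X @ p["W_V"]
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    X = layer_norm(X + attn)                        # first residual + norm
    ff = np.maximum(0, X @ p["W_1"]) @ p["W_2"]     # ReLU feedforward
    return layer_norm(X + ff)                       # second residual + norm

rng = np.random.default_rng(4)
d = 8
p = {name: rng.normal(size=shape) * 0.1 for name, shape in
     [("W_Q", (d, d)), ("W_K", (d, d)), ("W_V", (d, d)),
      ("W_1", (d, 4 * d)), ("W_2", (4 * d, d))]}
X = rng.normal(size=(6, d))            # 6 tokens
for _ in range(3):                     # a 3-block "deep" encoder stack
    X = encoder_block(X, p)
```

In a real model each block has its own parameters; sharing one parameter set across the loop here just keeps the sketch short.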

11. Transformer Decoder Block

The original Transformer architecture for sequence-to-sequence tasks includes a decoder. A decoder block contains:

  • masked self-attention
  • encoder-decoder attention
  • feedforward layer

Masked attention ensures that during generation, each position can attend only to earlier positions. BERT, however, uses only the encoder side of the Transformer.

12. Complexity and Parallelization

The Transformer’s attention mechanism enables parallel computation across tokens, unlike recurrent models. This greatly improves training efficiency on modern hardware.

However, self-attention has quadratic complexity in sequence length because QK^T produces a T×T matrix. This becomes expensive for long contexts.

13. Encoder-Only, Decoder-Only, and Encoder-Decoder Variants

Transformer-based models come in several forms:

  • Encoder-only: bidirectional contextual encoding, as in BERT
  • Decoder-only: autoregressive generation, as in GPT-style models
  • Encoder-decoder: sequence transduction, as in T5 or translation models

BERT belongs to the encoder-only family.

14. What Is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers. It is a deep stack of Transformer encoder layers pretrained on large corpora to learn contextual token representations.

The key innovation is bidirectional conditioning: each token representation is informed by both left and right context, unlike left-to-right language models that only use preceding tokens.

15. BERT Input Representation

BERT’s input embedding for each token is typically the sum of:

  • token embedding
  • segment embedding
  • position embedding

So for token t, the input can be written as: x_t = e_token(w_t) + e_segment(s_t) + e_position(t).

Segment embeddings help distinguish sentence A from sentence B in paired-input tasks.
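The three-way sum amounts to three table lookups followed by elementwise addition. In this sketch the lookup tables are random stand-ins for learned embeddings, and the token and segment ids are illustrative values, not real vocabulary indices.

```python
import numpy as np

rng = np.random.default_rng(5)
vocab_size, num_segments, max_len, d = 100, 2, 32, 16

# Three learned lookup tables (random stand-ins here)
E_token = rng.normal(size=(vocab_size, d))
E_segment = rng.normal(size=(num_segments, d))
E_position = rng.normal(size=(max_len, d))

token_ids = np.array([1, 7, 42, 2, 9, 13, 2])    # e.g. [CLS] A A [SEP] B B [SEP]
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])    # sentence A = 0, sentence B = 1
positions = np.arange(len(token_ids))

# x_t = e_token(w_t) + e_segment(s_t) + e_position(t)
X = E_token[token_ids] + E_segment[segment_ids] + E_position[positions]
```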

16. Special Tokens in BERT

BERT uses special tokens such as:

  • [CLS]: classification token placed at the beginning
  • [SEP]: separator token between sequences
  • [MASK]: token used in masked language modeling

The final hidden state of [CLS] is often used as the aggregate representation for sequence-level classification tasks.

17. Bidirectional Context in BERT

Because BERT uses self-attention without causal masking, each token can attend to tokens on both sides. Therefore the contextual representation of token w_t is: h_t = f(w_1, ..., w_T, t), where the representation depends on the full sequence context, not just the prefix.

This was a major departure from many earlier language models.

18. Masked Language Modeling (MLM)

BERT is pretrained using masked language modeling. A subset of tokens is masked or perturbed, and the model must predict the original tokens from context.

If the masked position is t, the objective is to maximize: log P(w_t | x\mask), where x\mask is the corrupted input sequence.

Over all masked positions, the MLM loss is: L_MLM = − Σ_{t ∈ M} log P(w_t | x\mask), where M is the set of masked positions.
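Given per-position vocabulary logits from the encoder, the MLM loss is a sum of negative log-probabilities restricted to the masked positions. The random logits and targets below are placeholders for real model outputs and original token ids.

```python
import numpy as np

def mlm_loss(logits, target_ids, masked_positions):
    """L_MLM = - sum over t in M of log P(w_t | corrupted input)."""
    # Numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Only masked positions contribute to the loss
    return -sum(log_probs[t, target_ids[t]] for t in masked_positions)

rng = np.random.default_rng(6)
T, vocab = 6, 10
logits = rng.normal(size=(T, vocab))       # per-position vocabulary scores
targets = rng.integers(0, vocab, size=T)   # original (pre-masking) token ids
loss = mlm_loss(logits, targets, masked_positions=[1, 4])
```

Unmasked positions are simply ignored, which is why MLM trains the model to reconstruct missing tokens from bidirectional context rather than to predict every next token.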

18.1 Why MLM Matters

MLM allows bidirectional context usage because the model predicts missing words using both left and right context. This gives BERT deeply contextual internal representations.

19. Next Sentence Prediction (NSP)

The original BERT also used Next Sentence Prediction. Given two text segments, the model predicts whether the second segment is the actual next sentence following the first in the corpus.

This is a binary classification objective: L_NSP = − [y log ŷ + (1 − y) log(1 − ŷ)].

Although NSP was part of original BERT pretraining, later research found it less essential than MLM in some settings, and some later models replaced or removed it.

20. Total BERT Pretraining Objective

The original BERT objective combines masked language modeling and next sentence prediction: L = L_MLM + L_NSP.

The model is trained on large corpora so that its encoder layers learn reusable contextual language representations.

21. Fine-Tuning BERT

After pretraining, BERT can be fine-tuned for downstream tasks by adding a lightweight task head and updating the pretrained parameters on supervised data.

21.1 Sequence Classification

For classification, the final hidden state of [CLS] is often used: h_CLS.

A classifier head produces logits: z = W h_CLS + b, followed by softmax: ŷ_k = e^(z_k) / Σ_{j=1}^{K} e^(z_j).
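The classification head is just an affine map plus a softmax over the K classes. The random h_CLS and weights below stand in for a real encoder output and fine-tuned parameters.

```python
import numpy as np

def classify(h_cls, W, b):
    """z = W h_CLS + b, then softmax over K classes."""
    z = W @ h_cls + b
    e = np.exp(z - z.max())          # stable softmax
    return e / e.sum()

rng = np.random.default_rng(7)
d, K = 16, 3
h_cls = rng.normal(size=d)           # final hidden state of [CLS]
W, b = rng.normal(size=(K, d)), np.zeros(K)
probs = classify(h_cls, W, b)        # class probabilities summing to 1
```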

21.2 Token Classification

For tasks like named entity recognition, each token representation h_t is passed to a classifier: z_t = W h_t + b.

21.3 Question Answering

For extractive QA, BERT often predicts start and end positions of an answer span: P(start = t) and P(end = t).
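A common formulation scores each position by a dot product between its hidden state and a learned start (or end) vector, then applies a softmax over positions. The random hidden states and span vectors below are placeholders for real fine-tuned values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(8)
T, d = 10, 16
H = rng.normal(size=(T, d))                 # final hidden state per token
w_start = rng.normal(size=d)                # learned start-of-span vector
w_end = rng.normal(size=d)                  # learned end-of-span vector

p_start = softmax(H @ w_start)              # P(start = t) over all positions
p_end = softmax(H @ w_end)                  # P(end = t) over all positions
start, end = int(p_start.argmax()), int(p_end.argmax())
```

At inference time the predicted span is usually the (start, end) pair maximizing the joint score subject to start ≤ end; the argmax here is a simplification.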

22. Why BERT Was Powerful

BERT was highly influential because it combined:

  • deep bidirectional contextual encoding
  • large-scale self-supervised pretraining
  • a flexible architecture reusable across many downstream tasks
  • simple fine-tuning with minimal task-specific architecture changes

This dramatically improved performance on many NLP benchmarks.

23. BERT vs Earlier Embedding Methods

Static embeddings such as Word2Vec or GloVe assign one vector per word type. BERT instead produces context-sensitive representations. Thus “bank” in different contexts receives different embeddings, resolving many limitations of static word vectors.

24. BERT vs Autoregressive Language Models

Traditional left-to-right language models optimize: P(w_{1:T}) = Π_{t=1}^{T} P(w_t | w_{<t}).

BERT does not model sequences autoregressively during pretraining. Instead, it learns from masked-token prediction, allowing bidirectional encoding. This makes BERT especially strong for understanding and representation tasks, whereas decoder-only models are naturally suited to generation.

25. BERT Limitations

Despite its power, BERT has important limitations:

  • quadratic attention cost with sequence length
  • not inherently generative in the same way as decoder-only language models
  • pretraining and fine-tuning are computationally expensive
  • fixed maximum context length in standard implementations
  • sensitivity to domain mismatch between pretraining and downstream data

26. BERT Variants and Extensions

Many models extended BERT’s ideas:

  • RoBERTa: removed NSP and changed pretraining strategy
  • ALBERT: parameter sharing and factorized embeddings
  • DistilBERT: compressed student model
  • Domain-specific variants such as BioBERT or LegalBERT

These variants improved efficiency, domain specialization, or training methodology.

27. Practical Applications

Transformer encoders and BERT-like models are widely used in:

  • search relevance and ranking
  • document and sentiment classification
  • named entity recognition
  • question answering
  • semantic similarity and retrieval
  • legal, biomedical, and enterprise NLP

28. Evaluation Metrics

Evaluation depends on the downstream task. Common metrics include: Accuracy = (TP + TN)/(TP + TN + FP + FN), Precision = TP/(TP + FP), Recall = TP/(TP + FN), and F1 = 2(Precision × Recall)/(Precision + Recall).

For QA, exact match and token-level F1 are common. For retrieval or similarity, ranking metrics may apply. For masked language modeling, cross-entropy or token prediction accuracy may be used internally.
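The four classification formulas above can be computed directly from binary predictions; this small sketch counts the confusion-matrix cells and guards against division by zero.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# 2 true positives, 1 true negative, 1 false positive, 1 false negative
acc, prec, rec, f1 = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```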

29. Strengths of the Transformer and BERT

  • strong long-range contextual modeling
  • parallelizable training
  • flexible reusable encoder representations
  • state-of-the-art results across many language understanding tasks
  • effective transfer learning via pretraining and fine-tuning

30. Best Practices

  • Use pretrained BERT-like models when labeled data is limited.
  • Choose sequence length carefully because attention cost grows quadratically.
  • Match tokenization and casing to the chosen pretrained checkpoint.
  • Use smaller learning rates for fine-tuning than for training small models from scratch.
  • Consider domain-adapted BERT variants when working in specialized text domains.

31. Conclusion

The Transformer architecture changed NLP by showing that sequence modeling could be built entirely around attention, without recurrence. Its self-attention mechanism enabled direct token-to-token interaction, deep contextual representation learning, and large-scale parallel training. BERT then demonstrated that a deep Transformer encoder, pretrained bidirectionally with masked language modeling, could become a broadly reusable language understanding model.

Understanding Transformers and BERT is essential for modern NLP because they introduced the architectural and pretraining principles that underpin much of today’s language modeling ecosystem. Even as newer models evolve beyond original BERT, the conceptual foundations of self-attention, encoder stacks, bidirectional context, and pretrain-then-fine-tune learning remain central to the field.

Uma Mahesh

The author works as an Architect at a reputed software company and has more than 21 years of experience in web development using Microsoft Technologies.
