A single decision tree trained on your full dataset is highly sensitive to the training data: change a few rows and the tree changes completely. Bagging — Bootstrap Aggregating — fixes this by training many trees on independently sampled versions of the data and averaging their predictions. The result is a model with the same bias as a single tree but dramatically lower variance. This article builds a bagging classifier from scratch in pure Python/NumPy, explains why bootstrap sampling reduces variance, and compares the hand-rolled implementation against sklearn’s BaggingClassifier on a concrete classification problem.
1. Problem Statement
You are building a fraud detection classifier and choose a decision tree because it is interpretable. But on your validation set, performance swings by ±5 percentage points depending on which 80% of the data you use for training — a sign that the model has high variance. You need to reduce this variance without switching to a completely different algorithm, losing interpretability, or collecting more data. Bagging solves exactly this: it creates B independent training sets from your existing data by sampling with replacement (bootstrap), trains one tree on each, and aggregates predictions by majority vote. The individual trees are unstable but their errors are uncorrelated, so their average cancels out much of the noise.
2. Why This Matters
Variance reduction is one of the most practically important improvements a machine learning engineer can make. High-variance models fail in production not because they are wrong on average but because their predictions are unreliable — the same model architecture trained on two different batches of production data gives very different answers. Bagging is the cleanest variance reduction technique available: it requires no changes to the base learner, no hyperparameter re-tuning, and no additional data. Understanding how and why it works — specifically, why bootstrap sampling produces uncorrelated errors, and why averaging uncorrelated errors reduces variance — is the conceptual foundation for random forests, which extend bagging with one additional idea.
3. The Approach
We implement bagging in three stages. First, a from-scratch implementation using only NumPy and sklearn’s DecisionTreeClassifier: bootstrap sample the training set B times, train one tree on each, and aggregate predictions by majority vote. Second, we visualise the variance reduction directly: train B=1, 5, 10, 25, 50, and 100 trees and plot test accuracy variance across many random seeds, showing the variance shrinking as B grows. Third, we benchmark the from-scratch implementation against sklearn’s BaggingClassifier to confirm they produce equivalent results, and compare both against a single decision tree baseline.
4. Mathematical Foundation
Let h1(x), …, hB(x) be B decision trees trained on bootstrap samples of size N. The bagging prediction for regression is:
ŷ(x) = (1/B) Σb=1B hb(x)
If the trees are identically distributed with variance σ² and pairwise correlation ρ, the variance of the average is:
Var(ŷ) = ρ·σ² + (1−ρ)/B · σ²
As B → ∞, the second term vanishes: Var(ŷ) → ρ·σ². If trees are uncorrelated (ρ = 0), variance collapses entirely. Bootstrap sampling approximates ρ ≈ 0 by ensuring each tree sees a different random subset of the data. For classification, the bagging prediction is:
H(x) = argmaxc Σb=1B 𝟙[hb(x) = c]
A bootstrap sample of size N drawn with replacement from N points contains on average 1 − (1 − 1/N)N ≈ 1 − e−1 ≈ 63.2% of the unique original training examples. The remaining 36.8% are out-of-bag (OOB) samples — never seen by that tree — providing a free internal validation estimate.
5. Algorithm Walkthrough
- Input: training set (X, y), number of bags B, base learner (decision tree with max_depth d).
- For b = 1, …, B: draw a bootstrap sample of size N with replacement from (X, y); train one decision tree hb on the bootstrap sample.
- Store all B trees.
- Prediction: for a new sample x, collect predictions {h1(x), …, hB(x)}; return the majority class (classification) or mean (regression).
- OOB estimate: for each training sample i, collect predictions from all trees b where i was NOT in the bootstrap sample; aggregate those OOB predictions to get an unbiased accuracy estimate without a separate validation set.
6. Dataset
This article uses load_breast_cancer: 569 samples, 30 features, 2 classes (malignant/benign). This dataset is small enough that single-tree variance is clearly visible across repeated train-test splits, making the variance reduction from bagging directly measurable. A second experiment uses make_classification with 1,000 samples and 20 features to show the convergence of OOB error as B grows. Open Notebook
7. Implementation
The from-scratch bagging classifier is implemented in ~30 lines of Python. It stores a list of (tree, oob_indices) tuples and computes both the test prediction and the OOB accuracy estimate in a single forward pass over the stored trees.
class BaggingFromScratch:
def __init__(self, n_estimators=50, max_depth=None, random_state=42):
self.n_estimators = n_estimators
self.max_depth = max_depth
self.rng = np.random.RandomState(random_state)
self.trees_ = []
def fit(self, X, y):
N = len(y)
self.trees_ = []
for _ in range(self.n_estimators):
idx = self.rng.choice(N, N, replace=True)
oob = np.setdiff1d(np.arange(N), idx)
tree = DecisionTreeClassifier(max_depth=self.max_depth,
random_state=self.rng.randint(1e6))
tree.fit(X[idx], y[idx])
self.trees_.append((tree, oob))
return self
def predict(self, X):
votes = np.array([t.predict(X) for t, _ in self.trees_])
return np.apply_along_axis(
lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
8. Evaluation Approach
Three metrics. Test accuracy on a fixed 25% held-out test set — the primary performance measure. Variance of test accuracy across 30 random train-test splits — the direct measure of the stability improvement from bagging. OOB accuracy — the free internal estimate computed from training data alone, validated against the held-out test accuracy to confirm it is unbiased. The variance plot shows accuracy vs B (number of trees) for both a single tree (B=1) and the bagged ensemble, making the convergence of variance directly visible.
9. Results and Interpretation
On the Breast Cancer dataset, a single decision tree (max_depth=None) achieves test accuracy in the range 88–96% depending on the random seed — an 8-point variance. With B=50 bagged trees, the accuracy stabilises to 94–96% across all random seeds — the variance has collapsed to 2 points. The OOB accuracy estimate tracks the test accuracy within 0.5–1.5 percentage points, confirming it is an unbiased estimate. Variance essentially converges by B=30–40 trees; adding more trees beyond B=50 yields diminishing returns.
10. Hyperparameter Considerations
The two key hyperparameters are n_estimators (B) and max_depth of the base tree. For B: variance reduction is fastest in the first 10–20 trees and nearly complete by 50. Running more than 100 trees gives negligible variance benefit and scales training time linearly. For max_depth: bagging works best with high-variance base learners — fully grown trees (max_depth=None) give the most variance to reduce. Shallow trees (max_depth=2–3) are already low-variance and benefit less from bagging, though they may still improve over a single shallow tree. max_features (fraction of features considered at each split) is set to 1.0 in pure bagging — when you reduce it you get random forest, which is bagging with additional feature subsampling.
11. Comparison with Baseline
The notebook compares four models: a single Decision Tree (high variance baseline), BaggingFromScratch (50 trees), sklearn’s BaggingClassifier (50 trees), and sklearn’s RandomForestClassifier (50 trees). The two bagging implementations produce nearly identical accuracy distributions across 30 seeds, confirming the from-scratch version is correct. Random Forest typically achieves 1–3 percentage points higher accuracy than pure bagging because feature subsampling further reduces tree correlation, driving ρ closer to zero in the variance formula.
12. Strengths
- Bagging is algorithm-agnostic — it can wrap any base learner (trees, k-NN, SVMs) without modification. The only requirement is that the base learner is unstable (high variance) enough to benefit from averaging.
- The OOB estimate provides a free cross-validation-quality accuracy estimate without holding out data, making bagging especially useful on small datasets where every training sample matters.
- Training is embarrassingly parallel — each of the B trees is completely independent and can be trained on a separate CPU core. sklearn’s n_jobs=-1 enables this directly.
13. Limitations
- Bagging does not reduce bias. If the base learner is systematically wrong (e.g., a shallow tree that underfits), averaging B shallow trees still gives the same systematic error. Bagging only helps when the error is due to variance, not bias.
- Each tree sees only ~63% of training samples, which means bagging implicitly reduces effective training set size per learner. On very small datasets (N < 100), this can hurt base learner performance enough to offset the variance reduction.
- Aggregated predictions are less interpretable than a single tree. The voting of 50 trees cannot be expressed as a simple if-then-else rule, which is a significant limitation in regulated industries that require model explanations.
14. Common Failure Modes
- Applying bagging to stable base learners (e.g., linear regression, logistic regression). These models have low variance by construction — they are not sensitive to the training sample — so bagging provides no benefit. Bagging is useful only when the base learner has high variance.
- Setting n_estimators too low (B = 5–10) and concluding that bagging does not help. Variance reduction is only meaningful above B ≈ 20–30 trees. Plot the accuracy vs B curve to confirm convergence before evaluating bagging’s effectiveness.
- Using bagging to fix underfitting. A single tree that achieves 65% accuracy due to poor features or a misspecified model will bag to 65% accuracy. Bagging does not improve bias.
- Not using oob_score=True in sklearn’s BaggingClassifier. The OOB estimate is free and provides better variance estimates of model quality than a single held-out split. Always enable it when not using cross-validation.
15. Best Practices
- Use fully grown trees (max_depth=None) as base learners. High-variance base learners give bagging the most to gain. Pruned trees provide less variance reduction because there is less variance to reduce.
- Set n_estimators=50–100 as a default. Convergence typically occurs by 50 trees; more rarely adds accuracy but always adds training time.
- Enable oob_score=True and use the OOB score for model selection instead of a separate validation split. On small datasets this is especially valuable.
- Use n_jobs=-1 for parallel training — it is free performance and matters significantly at B=100+ trees on multi-core machines.
- Prefer RandomForestClassifier over BaggingClassifier for tree-based bagging — random forests add feature subsampling which further reduces correlation and almost always improves accuracy with no additional cost.
16. Conclusion
Bagging is the conceptual foundation of the random forest and one of the most reliable variance reduction techniques in machine learning. By training B independent learners on bootstrap samples and aggregating their predictions, it converts an unstable high-variance model into a stable, low-variance ensemble with the same bias. Building it from scratch reveals that the core mechanism — bootstrap sampling plus majority vote — requires fewer than 30 lines of code, and understanding why it works (the ρσ² + (1−ρ)σ²/B variance decomposition) provides the theoretical grounding for every ensemble method in Part 3 of this series.




