Random Forest in Python: Classification and Regression

Random Forest extends bagging with one additional idea: at every split in every tree, only a random subset of features is considered as candidates. This feature subsampling decorrelates the trees more than bootstrap sampling alone, lowering the inter-tree correlation ρ and with it the variance of the ensemble. The result is one of the most reliable and easiest-to-tune models in practical machine learning, often competitive with gradient boosting on medium-sized datasets while requiring far less hyperparameter tuning. This article builds Random Forest from the ground up: the theory, the full sklearn implementation for both classification and regression, feature importance analysis, and a thorough comparison against baselines.

All code is in the companion notebook. It uses scikit-learn’s load_breast_cancer and load_diabetes, so no external downloads are required.

1. Problem Statement

You need a strong default model for a medium-sized tabular dataset: one that works well out of the box without extensive hyperparameter search, is robust to uninformative features, and gives you a usable feature importance ranking to guide downstream analysis. A single decision tree overfits and is unstable; gradient boosting usually needs careful tuning of learning rate, max_depth, and subsample. Random Forest addresses the out-of-the-box problem: with n_estimators=100 and default hyperparameters, it usually achieves strong performance across a wide range of tabular datasets, and the OOB score gives you a model quality estimate at little extra cost, without a separate cross-validation pass.

2. Why This Matters

Random Forest is one of the most widely deployed tree ensembles in production. Its combination of strong accuracy, robustness to hyperparameter choice, native feature importance, and built-in OOB validation makes it a sensible default baseline for nearly every tabular ML project. Understanding the mechanism (specifically how feature subsampling at each split reduces inter-tree correlation, and how that reduction translates to lower ensemble variance) lets you tune it intelligently: you know that max_features is the most important hyperparameter because it controls ρ, that n_estimators only needs to be large enough for variance to converge, and that max_depth and min_samples_leaf control the bias-variance trade-off in the same way as for a single tree.

3. The Approach

We implement two Random Forest experiments. For classification, we use the Breast Cancer dataset and compare RandomForestClassifier against a single Decision Tree, BaggingClassifier, and Logistic Regression, reporting accuracy, AUC-ROC, and OOB score. For regression, we use load_diabetes and compare RandomForestRegressor against a single Decision Tree Regressor and Linear Regression, reporting RMSE and R². Both experiments include feature importance analysis (MDI and permutation importance) and a key hyperparameter sweep over max_features and n_estimators to show convergence behaviour.

4. Mathematical Foundation

Random Forest is bagging with feature subsampling. At each node split, instead of considering all F features, only m = max_features features are randomly selected as candidates, and the best split among those m is chosen. This introduces an additional source of randomness beyond bootstrap sampling, further decorrelating the trees.

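The ensemble variance formula from bagging is Var(ŷ) = ρ·σ² + ((1−ρ)/B)·σ², where σ² is the variance of a single tree's prediction and ρ is the average pairwise correlation between trees. As B grows the second term vanishes, so the variance floors at ρ·σ²: beyond that point adding trees cannot help, and only lowering ρ can. Bagging with bootstrap samples leaves moderate correlation (illustratively ρ ≈ 0.3–0.5), because each bootstrap sample contains about 63% of the distinct training points and every tree searches the same full feature set. Random Forest’s feature subsampling reduces ρ further (illustratively to 0.05–0.15), pulling the ensemble variance down toward σ²/B, the value that averaging B fully independent models would achieve.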
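To make the formula concrete, the short sketch below evaluates it for two correlation levels. The helper name ensemble_variance, the value σ² = 1, and the two ρ values are illustrative assumptions, not measurements from the notebook.

```python
def ensemble_variance(rho: float, sigma2: float, n_trees: int) -> float:
    """Variance of the mean of n_trees predictors that each have variance
    sigma2 and average pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) / n_trees * sigma2


sigma2 = 1.0  # assumed single-tree variance, for illustration only
for rho in (0.4, 0.1):  # assumed correlations: bagging-like vs forest-like
    values = [ensemble_variance(rho, sigma2, b) for b in (1, 10, 100, 1000)]
    print(f"rho={rho}: " + ", ".join(f"{v:.3f}" for v in values))
```

In both cases the variance flattens out near ρ·σ² once B reaches a few hundred, which is why lowering ρ matters more than adding trees past that point.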

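In scikit-learn, max_features defaults to 'sqrt' (m = ⌊√F⌋) for RandomForestClassifier and, in recent releases, to 1.0 (all F features) for RandomForestRegressor. The m = ⌊F/3⌋ rule for regression comes from the original Random Forest literature and the R randomForest package, so set it explicitly if you want it. These values are empirical rules of thumb that balance tree diversity against individual tree accuracy.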

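Feature importance in Random Forest is measured by Mean Decrease in Impurity (MDI): FI_j = (1/B) · Σ_{b=1}^{B} Σ_{t in tree b: split on j} (n_t/N) · (i_t − (n_L/n_t)·i_L − (n_R/n_t)·i_R), where i_t is the impurity of node t, n_t is the number of samples reaching it, L and R are its left and right children, and N is the number of training samples. Each child impurity is weighted by the share of samples it receives. This measures how much feature j contributes to total impurity reduction across all trees and splits; scikit-learn additionally normalises the importances so that they sum to 1.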
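As a check on the MDI definition, the following sketch recomputes the importances of a single scikit-learn decision tree from its fitted node arrays and compares them with feature_importances_. The depth limit and seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
t = tree.tree_  # low-level arrays describing the fitted tree

manual = np.zeros(X.shape[1])
n_root = t.weighted_n_node_samples[0]
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, so no impurity decrease
        continue
    n_t = t.weighted_n_node_samples[node]
    n_l = t.weighted_n_node_samples[left]
    n_r = t.weighted_n_node_samples[right]
    # Sample-weighted impurity decrease produced by this split
    decrease = n_t * t.impurity[node] - n_l * t.impurity[left] - n_r * t.impurity[right]
    manual[t.feature[node]] += decrease / n_root

manual /= manual.sum()  # normalise so the importances sum to 1
print(np.allclose(manual, tree.feature_importances_))
```

A forest's importance for a feature is then an average of these per-tree values over its trees.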

5. Algorithm Walkthrough

For b = 1, …, B: draw a bootstrap sample of size N; grow a decision tree using the following split rule at each node: randomly select m features from F total; find the best split (largest impurity reduction, measured by Gini or entropy) among those m features; split the node; repeat recursively until a stopping criterion is met (min_samples_leaf, max_depth, or a pure node).
Classification prediction: for each new sample x, collect the predicted class from all B trees; return the majority class. (scikit-learn averages the trees' predicted class probabilities instead of counting hard votes, which usually gives the same answer.)
Regression prediction: for each new sample x, collect the predicted value from all B trees; return the mean.
OOB estimate: for each training sample i, collect predictions from trees that did NOT include i in their bootstrap sample; aggregate these OOB predictions to produce an approximately unbiased, typically slightly pessimistic, accuracy or R² estimate.
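The steps above can be sketched by hand for the classification case, using scikit-learn's DecisionTreeClassifier as the base learner. This is a minimal illustration, not how RandomForestClassifier is implemented internally; the variable names, seeds, and B = 100 are assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

rng = np.random.default_rng(0)
n, B = len(X_tr), 100
trees = []
oob_votes = np.zeros((n, 2))  # votes per class from trees that never saw the row

for b in range(B):
    idx = rng.integers(0, n, size=n)        # bootstrap sample, with replacement
    oob = np.setdiff1d(np.arange(n), idx)   # rows left out of this sample
    # max_features="sqrt" draws a fresh random feature subset at every split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    tree.fit(X_tr[idx], y_tr[idx])
    trees.append(tree)
    oob_votes[oob, tree.predict(X_tr[oob])] += 1

# Classification prediction: majority vote over all B trees
all_preds = np.stack([t.predict(X_te) for t in trees])   # shape (B, n_test)
majority = (all_preds.mean(axis=0) > 0.5).astype(int)
test_acc = (majority == y_te).mean()

# OOB estimate: score each training row using only trees that did not see it
has_vote = oob_votes.sum(axis=1) > 0
oob_acc = (oob_votes[has_vote].argmax(axis=1) == y_tr[has_vote]).mean()
print(f"test accuracy={test_acc:.3f}  OOB accuracy={oob_acc:.3f}")
```

The regression case differs only in the base learner (DecisionTreeRegressor) and in averaging predictions instead of voting.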

6. Dataset

Classification: load_breast_cancer (569 samples, 30 features, 2 classes). Regression: load_diabetes (442 samples, 10 features, continuous target measuring disease progression). The diabetes dataset is small enough to show clear overfitting of a single tree, making the variance reduction from Random Forest directly visible in RMSE curves across train/test.

7. Implementation

sklearn’s RandomForestClassifier and RandomForestRegressor follow the same API as all sklearn estimators, with the most important parameters being n_estimators, max_features, max_depth, min_samples_leaf, and oob_score. Both accept n_jobs=-1 for parallel training.

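from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Classification: Breast Cancer, 25% held-out test set (split seed is an assumption)
X_c, y_c = load_breast_cancer(return_X_y=True)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_c, y_c, test_size=0.25, stratify=y_c, random_state=42
)
rf_clf = RandomForestClassifier(
    n_estimators=100,
    max_features='sqrt',   # default: sqrt(F) features per split
    min_samples_leaf=1,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
rf_clf.fit(Xc_train, yc_train)
print(f'OOB accuracy: {rf_clf.oob_score_:.4f}')

# Regression: Diabetes, with its own split so the two tasks do not share data
X_r, y_r = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_r, y_r, test_size=0.25, random_state=42
)
rf_reg = RandomForestRegressor(
    n_estimators=100,
    max_features='sqrt',   # set explicitly; the regressor default is all features
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
rf_reg.fit(Xr_train, yr_train)
print(f'OOB R^2: {rf_reg.oob_score_:.4f}')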

8. Evaluation Approach

Classification: accuracy, AUC-ROC, and OOB accuracy on a 25% held-out test set. Regression: RMSE and R² on a 25% held-out test set, plus train vs test RMSE across n_estimators (convergence plot). Feature importance: MDI importance (from feature_importances_) and permutation importance (from sklearn.inspection.permutation_importance), compared side by side to expose any discrepancy due to high-cardinality or correlated features. Hyperparameter sweep: max_features ∈ {1, 2, 4, ‘sqrt’, 8, 10} and n_estimators ∈ {10, 25, 50, 100, 200}.
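A minimal sketch of the classification part of this protocol is shown below: held-out accuracy, AUC-ROC, OOB accuracy, and MDI next to permutation importance. The split seed and n_repeats=10 are assumptions, so the printed numbers will not necessarily match the notebook's.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, stratify=data.target, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X_train, y_train)

test_acc = accuracy_score(y_test, rf.predict(X_test))
test_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
print(f"accuracy={test_acc:.3f}  AUC={test_auc:.3f}  OOB={rf.oob_score_:.3f}")

# MDI comes from the training-time splits; permutation importance is measured
# on held-out data by shuffling one column at a time and re-scoring.
mdi = rf.feature_importances_
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

for i in np.argsort(mdi)[::-1][:5]:  # five features with the largest MDI
    print(f"{data.feature_names[i]:<25} MDI={mdi[i]:.3f}  "
          f"perm={perm.importances_mean[i]:.3f}")
```

On this dataset many features are strongly correlated, so permutation importances can look small even for features with high MDI: shuffling one column leaves its correlated neighbours intact.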

9. Results and Interpretation

On Breast Cancer: Random Forest achieves test accuracy 96–97%, AUC-ROC 0.997–0.999, OOB accuracy within 0.5% of test accuracy. Single Decision Tree achieves 91–93% accuracy with high variance across seeds. The gap between bagging and Random Forest is 0.5–1.5 percentage points, consistent with feature subsampling adding decorrelation beyond bootstrap sampling alone. On Diabetes: Random Forest RMSE is typically 52–58 vs single-tree RMSE of 65–75, roughly a 15–20% improvement. The RMSE convergence plot shows test RMSE stabilising by n_estimators=50, with no degradation at 200 trees.

10. Hyperparameter Considerations

max_features is the most important hyperparameter. At max_features=1 (each node considers a single randomly chosen feature), trees are maximally diverse but individually weak. At max_features=F (all features, equivalent to pure bagging), trees are individually as accurate as possible but highly correlated. The classification default sqrt(F) balances these extremes. For datasets with many irrelevant features, increasing max_features above sqrt(F) can help because there is a greater chance the random subset includes relevant features. n_estimators should be set large (100–500) and does not need fine tuning: add trees until the OOB curve flattens. min_samples_leaf provides bias-variance control: increasing it (from 1 to 5–10) reduces variance further at the cost of slightly higher bias, which is useful on noisy datasets.
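A small sweep over max_features, scored with the OOB estimate, might look like the sketch below. The grid, the 200 trees, and the seed are illustrative assumptions; None means all features are considered at every split, which corresponds to plain bagging.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

oob_by_m = {}
for m in (1, 2, "sqrt", 10, None):  # None: all 30 features at every split
    rf = RandomForestClassifier(
        n_estimators=200, max_features=m, oob_score=True, random_state=0
    )
    rf.fit(X, y)  # OOB scoring needs no separate validation split
    oob_by_m[m] = rf.oob_score_

for m, score in oob_by_m.items():
    print(f"max_features={m!s:<5} OOB accuracy={score:.4f}")
```

Differences between settings on a dataset this small can be within seed-to-seed noise, so repeating the sweep over a few seeds before picking a value is a reasonable precaution.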

11. Comparison with Baseline

The notebook compares Random Forest against four baselines: single Decision Tree, BaggingClassifier (bootstrap only, no feature subsampling), Logistic Regression, and a dummy classifier. In the reported runs Random Forest outperforms them on both classification and regression tasks, although well-regularised linear models can come close on small datasets like these, so the margin is worth checking on your own split. The comparison with BaggingClassifier isolates the effect of feature subsampling: Random Forest is 0.5–2 percentage points better, with lower variance across seeds, which is consistent with the lower inter-tree correlation produced by max_features.
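The bagging-versus-forest comparison can be reproduced in a few lines. The sketch below uses 5-fold cross-validation instead of the notebook's held-out split, which is an assumption made here to keep the example short; the scores it prints are not the notebook's.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    # Bootstrap only: every split searches all 30 features
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=100, random_state=0
    ),
    # Bootstrap plus a random feature subset at every split
    "random_forest": RandomForestClassifier(
        n_estimators=100, max_features="sqrt", random_state=0
    ),
}

cv_scores = {name: cross_val_score(m, X, y, cv=5) for name, m in models.items()}
for name, scores in cv_scores.items():
    print(f"{name:<14} accuracy={scores.mean():.3f} +/- {scores.std():.3f}")
```

Because both ensembles use the same base learner and number of trees, any gap between them comes from the feature subsampling and from fold-to-fold noise.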

12. Strengths

Strong out-of-the-box performance. Random Forest with default hyperparameters is competitive with tuned models on most medium-sized tabular datasets, making it the correct first model to try after establishing a baseline.
Built-in OOB validation. Setting oob_score=True provides a validation estimate comparable to cross-validation without holding out data, which is especially valuable on small datasets.
Parallel training. Trees are built independently of each other, so they can be trained in parallel. n_jobs=-1 uses all available cores with no other code changes.
Robust to irrelevant features. Feature subsampling at each split means irrelevant features are diluted: they are only considered at a fraction of nodes, are chosen only when nothing better is in the candidate subset, and never systematically dominate any tree.

13. Limitations

Random Forest does not reduce bias. If the base trees systematically underfit (e.g., due to very shallow max_depth), averaging 500 underfitting trees still gives the same systematic error. Bias must be addressed by changing max_depth or min_samples_leaf, not by adding more trees.
Memory usage scales linearly with n_estimators and tree depth. Very large forests on high-dimensional data can exhaust RAM. Use max_depth or min_samples_leaf to control tree size when memory is constrained.
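Prediction is slower than a single tree, and usually slower than a gradient-boosted model with the same number of trees, because all B trees must be evaluated and Random Forest trees are typically grown much deeper than the shallow trees used in boosting. For real-time inference with strict latency requirements, consider limiting tree depth, reducing the number of trees, or distilling the forest into a single model.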

14. Common Failure Modes

Setting n_estimators too small (10–20) and observing high variance. With B=10 trees, the variance term (1−ρ)/B has not converged. Always plot the OOB error vs n_estimators curve and stop where it flattens — typically at 50–100 trees for most datasets.
Using MDI feature importance on correlated features. MDI can split importance between two correlated features arbitrarily, making both appear less important than they are individually. Always cross-check with permutation importance when features are correlated.
Forgetting that Random Forest does not extrapolate. Each tree predicts an average of training targets, so when the true test values lie outside the range of the training targets, the forest's predictions still stay inside that range. Tree-based gradient boosting has the same limitation. A linear model, or a forest fitted to the residuals of a linear trend, handles extrapolation better.
Running Random Forest without n_jobs=-1 on a multi-core machine. Training 100 trees sequentially on a 16-core machine uses 1/16 of available compute. Always set n_jobs=-1 for wall-clock efficiency.
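The extrapolation failure mode is easy to reproduce on synthetic data. In the sketch below, the linear data-generating process, the noise level, and the query points are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 1))                   # inputs in [0, 10]
y_train = 3.0 * X_train[:, 0] + rng.normal(0, 0.5, size=200)  # y is about 3x
X_new = np.array([[15.0], [20.0]])                            # outside that range

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

rf_pred = rf.predict(X_new)
lin_pred = lin.predict(X_new)
print("largest training target:", round(float(y_train.max()), 2))
print("forest predictions:     ", np.round(rf_pred, 2))
print("linear predictions:     ", np.round(lin_pred, 2))
```

The forest's predictions for x = 15 and x = 20 cannot exceed the largest training target, while the linear model continues the trend toward roughly 45 and 60.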

15. Best Practices

Use Random Forest as your first non-linear model on any tabular dataset. It requires less tuning than GBM and provides reliable performance and OOB validation out of the box.
Set oob_score=True by default. It adds only a small amount of computation at fit time and provides a model quality estimate that can save a cross-validation pass.
Cross-check MDI with permutation importance. If the two rankings disagree substantially, the MDI ranking is likely biased by feature cardinality or collinearity. Use permutation importance for final feature selection decisions.
Tune max_features if accuracy matters. Try max_features ∈ {‘sqrt’, 0.3, 0.5, 1.0} and use the OOB score to select — this single hyperparameter sweep covers most of the available accuracy improvement.
For regression tasks on small datasets, use warm_start=True to grow the forest incrementally and monitor OOB RMSE, stopping when it plateaus rather than guessing n_estimators upfront.
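The incremental approach can be sketched as follows. With warm_start=True, each call to fit keeps the existing trees and only trains the additional ones; the checkpoints, max_features setting, and seed below are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)

rf = RandomForestRegressor(
    n_estimators=25,
    max_features="sqrt",
    warm_start=True,   # keep already-fitted trees between calls to fit
    oob_score=True,
    random_state=0,
)

oob_rmse = {}
for n in (25, 50, 100, 200):
    rf.set_params(n_estimators=n)
    rf.fit(X, y)  # trains only the trees added since the previous call
    # oob_prediction_ holds each row's prediction from trees that did not see it
    oob_rmse[n] = float(np.sqrt(np.mean((y - rf.oob_prediction_) ** 2)))
    print(f"n_estimators={n:<4} OOB RMSE={oob_rmse[n]:.2f}")
```

A simple stopping rule is to quit once the OOB RMSE changes by less than some small tolerance between checkpoints; the tolerance is a judgment call that depends on the scale of the target.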

16. Conclusion

Random Forest is bagging plus feature subsampling, and the second ingredient is responsible for most of its accuracy advantage over pure bagging. By limiting each split to a random subset of features, Random Forest reduces inter-tree correlation (illustratively from ~0.4 for bagging to ~0.1), which lowers the variance floor ρ·σ² that remains however many trees are added. The result is one of the most reliable models in tabular ML: it works well without tuning, scales easily to multiple cores, and provides built-in OOB validation and interpretable feature importance. For any new structured data problem, Random Forest is the right first ensemble to reach for, not because it is always optimal, but because it rarely fails catastrophically and almost always gives you an honest picture of what is achievable.