{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
  "language_info": {"name": "python", "version": "3.10.0"}
 },
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c01",
   "metadata": {},
   "source": ["# Why Boosting Often Resists Overfitting\n\nMakes the margin-maximisation effect and gradient boosting's plateau directly observable. Tracks training vs test accuracy curves across hundreds of rounds, plots margin distribution evolution, and shows how shrinkage and subsampling widen the safe training window."]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c02",
   "metadata": {},
   "outputs": [],
   "source": ["import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport warnings\nwarnings.filterwarnings('ignore')\nnp.random.seed(42)\nimport sklearn\nprint(f'sklearn {sklearn.__version__}')"]
  },
  {
   "cell_type": "markdown",
   "id": "c03",
   "metadata": {},
   "source": ["## 1. Datasets"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c04",
   "metadata": {},
   "outputs": [],
   "source": ["# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html\n# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html\nfrom sklearn.datasets import make_classification, load_breast_cancer\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\n\n# Synthetic dataset for margin analysis\nX_syn, y_syn = make_classification(\n    n_samples=3000, n_features=20, n_informative=15,\n    n_redundant=3, flip_y=0.0, random_state=42\n)\ny_syn_ada = 2*y_syn - 1  # {-1, +1} for margin computation\nX_syn_tr, X_syn_te, y_syn_tr, y_syn_te = train_test_split(\n    X_syn, y_syn, test_size=0.25, random_state=42, stratify=y_syn\n)\ny_syn_tr_ada, y_syn_te_ada = 2*y_syn_tr-1, 2*y_syn_te-1\n\n# Breast Cancer dataset\nbc = load_breast_cancer()\nXbc, ybc = bc.data, bc.target\nXbc_tr, Xbc_te, ybc_tr, ybc_te = train_test_split(\n    Xbc, ybc, test_size=0.25, random_state=42, stratify=ybc\n)\nsc = StandardScaler()\nXbc_tr_sc = sc.fit_transform(Xbc_tr)\nXbc_te_sc  = sc.transform(Xbc_te)\n\nprint(f'Synthetic: train={X_syn_tr.shape}  test={X_syn_te.shape}')\nprint(f'BreastCancer: train={Xbc_tr.shape}  test={Xbc_te.shape}')"]
  },
  {
   "cell_type": "markdown",
   "id": "c05",
   "metadata": {},
   "source": ["## 2. EDA"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c06",
   "metadata": {},
   "outputs": [],
   "source": ["fig, axes = plt.subplots(1, 3, figsize=(16, 4))\n\naxes[0].bar(['Class 0','Class 1'], np.bincount(y_syn),\n            color=['#6366f1','#22c55e'], edgecolor='k')\naxes[0].set_title('Synthetic — Class Balance')\n\nfor c, col in zip([0,1], ['#6366f1','#22c55e']):\n    axes[1].hist(X_syn[y_syn==c, 0], bins=30, alpha=0.55, color=col, label=f'Class {c}')\naxes[1].set_title('Synthetic — Feature 0 by Class'); axes[1].legend()\n\naxes[2].bar(['Malignant','Benign'], np.bincount(ybc),\n            color=['#ef4444','#22c55e'], edgecolor='k')\naxes[2].set_title('Breast Cancer — Class Balance')\n\nplt.tight_layout(); plt.show()"]
  },
  {
   "cell_type": "markdown",
   "id": "c07",
   "metadata": {},
   "source": ["## 3. AdaBoost — Train vs Test Accuracy Over 500 Rounds"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c08",
   "metadata": {},
   "outputs": [],
   "source": ["from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.metrics import accuracy_score\n\nada = AdaBoostClassifier(\n    estimator=DecisionTreeClassifier(max_depth=1),\n    n_estimators=500, algorithm='SAMME', random_state=42\n)\nada.fit(X_syn_tr, y_syn_tr)\n\ntrain_accs = [accuracy_score(y_syn_tr, p) for p in ada.staged_predict(X_syn_tr)]\ntest_accs  = [accuracy_score(y_syn_te, p) for p in ada.staged_predict(X_syn_te)]\n\nzero_err_round = next((i+1 for i,a in enumerate(train_accs) if a >= 1.0), None)\npeak_test_round = np.argmax(test_accs) + 1\n\nfig, ax = plt.subplots(figsize=(13, 5))\nrounds = range(1, 501)\nax.plot(rounds, train_accs, color='#22c55e', linewidth=1.8, label='Train Accuracy')\nax.plot(rounds, test_accs,  color='#6366f1', linewidth=1.8, label='Test Accuracy')\nif zero_err_round:\n    ax.axvline(zero_err_round, color='#f59e0b', linestyle='--',\n               label=f'Zero train error (round {zero_err_round})')\nax.axvline(peak_test_round, color='#ef4444', linestyle=':',\n           label=f'Peak test accuracy (round {peak_test_round})')\nax.set_xlabel('Boosting Round'); ax.set_ylabel('Accuracy')\nax.set_title('AdaBoost: Training vs Test Accuracy — 500 Rounds')\nax.legend(); plt.tight_layout(); plt.show()\n\nprint(f'Zero training error: round {zero_err_round}')\nprint(f'Peak test accuracy:  round {peak_test_round}  ({max(test_accs):.4f})')\nprint(f'Rounds of improvement past zero train error: {peak_test_round - (zero_err_round or 0)}')"]
  },
  {
   "cell_type": "markdown",
   "id": "c09",
   "metadata": {},
   "source": ["## 4. Margin Distribution Evolution"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c10",
   "metadata": {},
   "outputs": [],
   "source": ["# Compute margin distribution at specific rounds\ncheck_rounds = [10, 50, 100, 200, 400]\n\nfig, axes = plt.subplots(1, len(check_rounds), figsize=(18, 4))\ncolors_rnd = ['#94a3b8','#6366f1','#22c55e','#f59e0b','#ef4444']\n\nfor ax, rnd, col in zip(axes, check_rounds, colors_rnd):\n    # Get decision function at this round\n    for t, df in enumerate(ada.staged_decision_function(X_syn_tr)):\n        if t + 1 == rnd:\n            alpha_sum = ada.estimator_weights_[:rnd].sum()\n            margins = y_syn_tr_ada * df.squeeze() / alpha_sum\n            break\n    ax.hist(margins, bins=25, color=col, edgecolor='white', alpha=0.85)\n    ax.axvline(0, color='black', linewidth=1.5, linestyle='--')\n    frac_pos = (margins > 0).mean()\n    ax.set_title(f'Round {rnd}\\n{frac_pos*100:.1f}% > 0', fontsize=10)\n    ax.set_xlabel('Margin')\n\naxes[0].set_ylabel('Count')\nfig.suptitle('AdaBoost Margin Distribution at Selected Rounds', y=1.02)\nplt.tight_layout(); plt.show()\nprint('Larger margins at later rounds → tighter test error bound')"]
  },
  {
   "cell_type": "markdown",
   "id": "c11",
   "metadata": {},
   "source": ["## 5. Gradient Boosting — Accuracy Plateau"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c12",
   "metadata": {},
   "outputs": [],
   "source": ["gbm_bc = GradientBoostingClassifier(\n    n_estimators=400, learning_rate=0.05,\n    max_depth=3, subsample=0.8, random_state=42\n)\ngbm_bc.fit(Xbc_tr_sc, ybc_tr)\n\nbc_train_accs = [accuracy_score(ybc_tr, p) for p in gbm_bc.staged_predict(Xbc_tr_sc)]\nbc_test_accs  = [accuracy_score(ybc_te, p) for p in gbm_bc.staged_predict(Xbc_te_sc)]\n\npeak_bc = np.argmax(bc_test_accs) + 1\n# Find where test accuracy drops > 0.5% below peak\npeak_val = max(bc_test_accs)\nplateau_end = next((i+1 for i,a in enumerate(bc_test_accs) if i > peak_bc and a < peak_val - 0.005), 400)\n\nfig, ax = plt.subplots(figsize=(13, 5))\nrds = range(1, 401)\nax.plot(rds, bc_train_accs, color='#22c55e', linewidth=1.8, label='Train')\nax.plot(rds, bc_test_accs,  color='#6366f1', linewidth=1.8, label='Test')\nax.axvspan(peak_bc, plateau_end, alpha=0.1, color='#6366f1', label=f'Plateau ({plateau_end - peak_bc} rounds wide)')\nax.axvline(peak_bc, color='#f59e0b', linestyle='--', label=f'Peak test round {peak_bc}')\nax.set_xlabel('Boosting Round'); ax.set_ylabel('Accuracy')\nax.set_title('GBM (Breast Cancer) — Accuracy Plateau')\nax.legend(); plt.tight_layout(); plt.show()\n\nprint(f'Peak test accuracy: {peak_val:.4f} at round {peak_bc}')\nprint(f'Plateau width: ~{plateau_end - peak_bc} rounds')"]
  },
  {
   "cell_type": "markdown",
   "id": "c13",
   "metadata": {},
   "source": ["## 6. Shrinkage Effect on Plateau Width"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c14",
   "metadata": {},
   "outputs": [],
   "source": ["lrs = [0.01, 0.05, 0.1, 0.3]\ncolors_lr = ['#94a3b8','#6366f1','#22c55e','#ef4444']\n\nfig, ax = plt.subplots(figsize=(13, 5))\n\nfor lr, col in zip(lrs, colors_lr):\n    m = GradientBoostingClassifier(\n        n_estimators=400, learning_rate=lr,\n        max_depth=3, subsample=0.8, random_state=42\n    )\n    m.fit(Xbc_tr_sc, ybc_tr)\n    accs = [accuracy_score(ybc_te, p) for p in m.staged_predict(Xbc_te_sc)]\n    ax.plot(range(1,401), accs, color=col, linewidth=1.8, label=f'lr={lr}')\n\nax.set_xlabel('Boosting Round'); ax.set_ylabel('Test Accuracy')\nax.set_title('Learning Rate vs Plateau Width (Breast Cancer)')\nax.legend(); plt.tight_layout(); plt.show()\nprint('Smaller learning rate → wider plateau, later peak, more graceful degradation')"]
  },
  {
   "cell_type": "markdown",
   "id": "c15",
   "metadata": {},
   "source": ["## 7. Subsampling Widens the Plateau"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c16",
   "metadata": {},
   "outputs": [],
   "source": ["subsamples = [1.0, 0.8, 0.5]\ncolors_ss = ['#ef4444','#6366f1','#22c55e']\n\nfig, ax = plt.subplots(figsize=(13, 5))\n\nfor ss, col in zip(subsamples, colors_ss):\n    m = GradientBoostingClassifier(\n        n_estimators=400, learning_rate=0.05,\n        max_depth=3, subsample=ss, random_state=42\n    )\n    m.fit(Xbc_tr_sc, ybc_tr)\n    accs = [accuracy_score(ybc_te, p) for p in m.staged_predict(Xbc_te_sc)]\n    peak = np.argmax(accs)+1\n    ax.plot(range(1,401), accs, color=col, linewidth=2, label=f'subsample={ss}  (peak round {peak})')\n\nax.set_xlabel('Boosting Round'); ax.set_ylabel('Test Accuracy')\nax.set_title('Subsampling vs Plateau Width (Breast Cancer, lr=0.05)')\nax.legend(); plt.tight_layout(); plt.show()"]
  },
  {
   "cell_type": "markdown",
   "id": "c17",
   "metadata": {},
   "source": ["## 8. When Does GBM Actually Overfit? (Deep Trees + High lr)"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c18",
   "metadata": {},
   "outputs": [],
   "source": ["configs = [\n    {'max_depth': 1, 'learning_rate': 0.3, 'label': 'Stump + high lr'},\n    {'max_depth': 3, 'learning_rate': 0.1, 'label': 'Depth 3 + mid lr'},\n    {'max_depth': 6, 'learning_rate': 0.3, 'label': 'Deep + high lr (overfits)'},\n]\ncolors_cf = ['#22c55e','#6366f1','#ef4444']\n\nfig, axes = plt.subplots(1, 3, figsize=(18, 4))\n\nfor ax, cfg, col in zip(axes, configs, colors_cf):\n    m = GradientBoostingClassifier(\n        n_estimators=300,\n        learning_rate=cfg['learning_rate'],\n        max_depth=cfg['max_depth'],\n        subsample=1.0, random_state=42\n    )\n    m.fit(Xbc_tr_sc, ybc_tr)\n    tr_accs = [accuracy_score(ybc_tr, p) for p in m.staged_predict(Xbc_tr_sc)]\n    te_accs = [accuracy_score(ybc_te, p) for p in m.staged_predict(Xbc_te_sc)]\n    rds = range(1,301)\n    ax.plot(rds, tr_accs, color='#22c55e', linewidth=1.8, label='Train')\n    ax.plot(rds, te_accs, color=col, linewidth=1.8, label='Test')\n    ax.set_title(cfg['label'], fontsize=10)\n    ax.set_xlabel('Round'); ax.set_ylabel('Accuracy')\n    ax.legend(fontsize=8)\n\nplt.tight_layout(); plt.show()"]
  },
  {
   "cell_type": "markdown",
   "id": "c19",
   "metadata": {},
   "source": ["## 9. Comparison: AdaBoost vs GBM vs Random Forest vs Decision Tree"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c20",
   "metadata": {},
   "outputs": [],
   "source": ["from sklearn.ensemble import RandomForestClassifier\n\n# Train final models\nada_final = AdaBoostClassifier(\n    estimator=DecisionTreeClassifier(max_depth=1),\n    n_estimators=300, algorithm='SAMME', random_state=42\n)\nada_final.fit(Xbc_tr_sc, ybc_tr)\n\ngbm_final = GradientBoostingClassifier(\n    n_estimators=200, learning_rate=0.05,\n    max_depth=3, subsample=0.8, random_state=42\n)\ngbm_final.fit(Xbc_tr_sc, ybc_tr)\n\nrf_final = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)\nrf_final.fit(Xbc_tr_sc, ybc_tr)\n\ndt_final = DecisionTreeClassifier(max_depth=5, random_state=42)\ndt_final.fit(Xbc_tr_sc, ybc_tr)\n\nresults = {\n    'Decision Tree': accuracy_score(ybc_te, dt_final.predict(Xbc_te_sc)),\n    'AdaBoost (300)': accuracy_score(ybc_te, ada_final.predict(Xbc_te_sc)),\n    'Random Forest (200)': accuracy_score(ybc_te, rf_final.predict(Xbc_te_sc)),\n    'GBM (200, lr=0.05)': accuracy_score(ybc_te, gbm_final.predict(Xbc_te_sc)),\n}\n\nfig, ax = plt.subplots(figsize=(10, 4))\nnames = list(results.keys())\nvals  = list(results.values())\ncolors_bar = ['#94a3b8','#f59e0b','#22c55e','#6366f1']\nbars = ax.bar(names, vals, color=colors_bar, edgecolor='k', alpha=0.85)\nax.set_ylabel('Test Accuracy (Breast Cancer)')\nax.set_title('Final Model Comparison')\nax.set_ylim(min(vals) - 0.02, 1.0)\nfor bar, v in zip(bars, vals):\n    ax.text(bar.get_x() + bar.get_width()/2, v + 0.002, f'{v:.4f}', ha='center', fontweight='bold')\nplt.tight_layout(); plt.show()\n\nfor k, v in results.items():\n    print(f'{k:25s}: {v:.4f}')"]
  },
  {
   "cell_type": "markdown",
   "id": "c21",
   "metadata": {},
   "source": ["## 10. Margin Quantile Tracking"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c22",
   "metadata": {},
   "outputs": [],
   "source": ["# Track 10th, 25th, 50th percentile of margin distribution over rounds\nquantiles_10, quantiles_25, quantiles_50 = [], [], []\n\nfor t, df in enumerate(ada.staged_decision_function(X_syn_tr)):\n    alpha_sum = ada.estimator_weights_[:t+1].sum()\n    if alpha_sum == 0:\n        continue\n    margins = y_syn_tr_ada * df.squeeze() / alpha_sum\n    quantiles_10.append(np.percentile(margins, 10))\n    quantiles_25.append(np.percentile(margins, 25))\n    quantiles_50.append(np.percentile(margins, 50))\n\nfig, ax = plt.subplots(figsize=(13, 5))\nrds = range(1, len(quantiles_10)+1)\nax.plot(rds, quantiles_10, color='#ef4444', linewidth=1.8, label='10th percentile margin')\nax.plot(rds, quantiles_25, color='#f59e0b', linewidth=1.8, label='25th percentile margin')\nax.plot(rds, quantiles_50, color='#22c55e', linewidth=1.8, label='Median margin')\nax.axhline(0, color='black', linestyle='--', alpha=0.5)\nax.set_xlabel('Boosting Round'); ax.set_ylabel('Margin')\nax.set_title('AdaBoost — Margin Quantiles Over Rounds (Synthetic Dataset)')\nax.legend(); plt.tight_layout(); plt.show()\nprint('Rising quantiles past zero training error = continued test improvement via margin maximisation')"]
  },
  {
   "cell_type": "markdown",
   "id": "c23",
   "metadata": {},
   "source": ["## 11. Discussion\n\n1. **The training-test accuracy gap tells the complete story.** On clean data, both curves rise together until training saturates; then test accuracy continues rising while training accuracy stays at 1.0. This is margin maximisation in action — more rounds increase the margin on already-correct examples, tightening the test error bound.\n\n2. **The plateau is wide for small learning rates.** With lr=0.01, the model can be trained for hundreds of rounds past peak test accuracy without meaningful degradation. This gives practitioners a large safe window for early stopping. With lr=0.3, the plateau narrows to near zero.\n\n3. **Deep trees + high learning rate breaks the plateau.** Once base learners have high individual variance (deep trees), the gradient signal is dominated by variance rather than systematic bias correction. Combined with a high learning rate, this produces classical overfitting with a sharp training-test gap.\n\n4. **The 10th percentile margin is the key indicator.** The test error bound depends on the fraction of samples above a margin threshold θ. Watching the 10th percentile cross zero and then rise indicates that the hardest-to-classify examples are gaining confidence — the most direct signal that more rounds are still helping."]
  },
  {
   "cell_type": "markdown",
   "id": "c24",
   "metadata": {},
   "source": ["## 12. Next Steps\n\n- **Part 3: Bagging Series** — Random Forests and the variance reduction perspective on overfitting\n- **Stacking and Blending** — combining multiple boosted models with a meta-learner\n- **Cost-sensitive boosting** — modifying the weight update for asymmetric misclassification costs"]
  }
 ]
}