{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
  "language_info": {"name": "python", "version": "3.10.0"}
 },
 "cells": [
  {"cell_type":"markdown","id":"c01","metadata":{},"source":["# Why Bagging Works Better for Unstable Models Like Trees\n\nQuantifies model instability via pairwise bootstrap disagreement, decomposes test error into bias and variance for four base learners, and empirically confirms that bagging gain scales with instability."]},
  {"cell_type":"code","execution_count":null,"id":"c02","metadata":{},"outputs":[],"source":["import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport warnings\nwarnings.filterwarnings('ignore')\nnp.random.seed(42)\nimport sklearn\nprint(f'sklearn {sklearn.__version__}')"]},
  {"cell_type":"markdown","id":"c03","metadata":{},"source":["## 1. Dataset"]},
  {"cell_type":"code","execution_count":null,"id":"c04","metadata":{},"outputs":[],"source":["# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html\nfrom sklearn.datasets import make_classification\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\n\nX, y = make_classification(\n    n_samples=2000, n_features=20, n_informative=15,\n    n_redundant=3, flip_y=0.02, random_state=42\n)\nX_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)\n\nsc = StandardScaler()\nX_tr_sc = sc.fit_transform(X_tr)\nX_te_sc  = sc.transform(X_te)\n\nprint(f'Train: {X_tr.shape}  Test: {X_te.shape}')\nprint(f'Class balance: {np.bincount(y)}')"]},
  {"cell_type":"markdown","id":"c05","metadata":{},"source":["## 2. EDA"]},
  {"cell_type":"code","execution_count":null,"id":"c06","metadata":{},"outputs":[],"source":["fig, axes = plt.subplots(1, 3, figsize=(16, 4))\naxes[0].bar(['Class 0','Class 1'], np.bincount(y), color=['#6366f1','#22c55e'], edgecolor='k')\naxes[0].set_title('Class Distribution')\nfor c, col in zip([0,1],['#6366f1','#22c55e']):\n    axes[1].hist(X[y==c,0], bins=30, alpha=0.55, color=col, label=f'Class {c}')\naxes[1].set_title('Feature 0 Distribution by Class'); axes[1].legend()\ncorrs = [abs(np.corrcoef(X[:,i], y)[0,1]) for i in range(20)]\naxes[2].bar(range(20), corrs, color='#6366f1', alpha=0.8)\naxes[2].set_xlabel('Feature index'); axes[2].set_title('|Correlation| with Target')\nplt.tight_layout(); plt.show()"]},
  {"cell_type":"markdown","id":"c07","metadata":{},"source":["## 3. Instability Index"]},
  {"cell_type":"code","execution_count":null,"id":"c08","metadata":{},"outputs":[],"source":["from sklearn.tree import DecisionTreeClassifier\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.svm import SVC\nfrom sklearn.metrics import accuracy_score\n\ndef instability_index(model_factory, X_tr, y_tr, X_te, n_pairs=25, seed=42):\n    \"\"\"Average pairwise disagreement between models trained on different bootstraps.\"\"\"\n    rng = np.random.RandomState(seed)\n    N = len(y_tr)\n    disagreements = []\n    for _ in range(n_pairs):\n        idxA = rng.choice(N, N, replace=True)\n        idxB = rng.choice(N, N, replace=True)\n        mA = model_factory().fit(X_tr[idxA], y_tr[idxA])\n        mB = model_factory().fit(X_tr[idxB], y_tr[idxB])\n        disagree = (mA.predict(X_te) != mB.predict(X_te)).mean()\n        disagreements.append(disagree)\n    return np.mean(disagreements), np.std(disagreements)\n\nbase_learners = {\n    'Decision Tree (full)': lambda: DecisionTreeClassifier(max_depth=None, random_state=42),\n    'k-NN (k=5)':           lambda: KNeighborsClassifier(n_neighbors=5),\n    'SVM (RBF)':            lambda: SVC(kernel='rbf', C=1.0, probability=True, random_state=42),\n    'Logistic Regression':  lambda: LogisticRegression(max_iter=300, random_state=42),\n}\n\ninstab_results = {}\nprint('Instability Index (pairwise disagreement):')\nfor name, factory in base_learners.items():\n    # Trees use raw features; scaled learners need standardised features\n    Xtr_use = X_tr if 'Tree' in name else X_tr_sc\n    Xte_use = X_te if 'Tree' in name else X_te_sc\n    mean_i, std_i = instability_index(factory, Xtr_use, y_tr, Xte_use)\n    instab_results[name] = mean_i\n    print(f'  {name:30s}: {mean_i:.4f} ± {std_i:.4f}')"]},
  {"cell_type":"markdown","id":"c09","metadata":{},"source":["## 4. Accuracy Variance Across 30 Seeds"]},
  {"cell_type":"code","execution_count":null,"id":"c10","metadata":{},"outputs":[],"source":["N_SEEDS = 30\nsingle_accs  = {name: [] for name in base_learners}\nbagged_accs  = {name: [] for name in base_learners}\n\nfrom sklearn.ensemble import BaggingClassifier\n\nfor seed in range(N_SEEDS):\n    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=seed, stratify=y)\n    sc_ = StandardScaler()\n    Xtr_sc = sc_.fit_transform(Xtr)\n    Xte_sc  = sc_.transform(Xte)\n\n    for name, factory in base_learners.items():\n        Xtr_u = Xtr if 'Tree' in name else Xtr_sc\n        Xte_u = Xte if 'Tree' in name else Xte_sc\n\n        m = factory(); m.fit(Xtr_u, ytr)\n        single_accs[name].append(accuracy_score(yte, m.predict(Xte_u)))\n\n        bag = BaggingClassifier(estimator=factory(), n_estimators=50,\n                                random_state=seed, n_jobs=-1)\n        bag.fit(Xtr_u, ytr)\n        bagged_accs[name].append(accuracy_score(yte, bag.predict(Xte_u)))\n\nprint(f'\\n{\"Model\":30s} {\"Single Mean\":>11} {\"Single Std\":>10} {\"Bagged Mean\":>11} {\"Bagged Std\":>10} {\"Gain\":>7}')\nfor name in base_learners:\n    sm = np.mean(single_accs[name]); ss = np.std(single_accs[name])\n    bm = np.mean(bagged_accs[name]); bs = np.std(bagged_accs[name])\n    print(f'{name:30s} {sm:11.4f} {ss:10.4f} {bm:11.4f} {bs:10.4f} {bm-sm:7.4f}')"]},
  {"cell_type":"markdown","id":"c11","metadata":{},"source":["## 5. Violin Plot — Accuracy Distribution"]},
  {"cell_type":"code","execution_count":null,"id":"c12","metadata":{},"outputs":[],"source":["fig, axes = plt.subplots(1, 4, figsize=(18, 5))\nfor ax, name in zip(axes, base_learners):\n    data = [single_accs[name], bagged_accs[name]]\n    parts = ax.violinplot(data, positions=[0,1], showmedians=True)\n    for i, color in enumerate(['#ef4444','#6366f1']):\n        parts['bodies'][i].set_facecolor(color)\n        parts['bodies'][i].set_alpha(0.7)\n    ax.set_xticks([0,1]); ax.set_xticklabels(['Single','Bagged'], fontsize=9)\n    ax.set_title(name, fontsize=9)\n    ax.set_ylabel('Test Accuracy')\nplt.suptitle('Accuracy Distribution: Single vs Bagged (30 seeds)', y=1.02)\nplt.tight_layout(); plt.show()"]},
  {"cell_type":"markdown","id":"c13","metadata":{},"source":["## 6. Instability vs Bagging Gain Scatter"]},
  {"cell_type":"code","execution_count":null,"id":"c14","metadata":{},"outputs":[],"source":["gains = {name: np.mean(bagged_accs[name]) - np.mean(single_accs[name])\n         for name in base_learners}\n\nfig, ax = plt.subplots(figsize=(8, 5))\nfor name in base_learners:\n    ax.scatter(instab_results[name], gains[name], s=120, zorder=5)\n    ax.annotate(name, (instab_results[name], gains[name]),\n                xytext=(5, 5), textcoords='offset points', fontsize=9)\n\ninstab_vals = np.array(list(instab_results.values()))\ngain_vals   = np.array(list(gains.values()))\nif len(np.unique(instab_vals)) > 1:\n    fit = np.polyfit(instab_vals, gain_vals, 1)\n    xs = np.linspace(instab_vals.min(), instab_vals.max(), 50)\n    ax.plot(xs, np.polyval(fit, xs), '--', color='gray', alpha=0.7, label='Trend')\n    ax.legend()\nax.set_xlabel('Instability Index (pairwise disagreement)')\nax.set_ylabel('Bagging Gain (mean accuracy improvement)')\nax.set_title('Instability vs Bagging Gain')\nplt.tight_layout(); plt.show()"]},
  {"cell_type":"markdown","id":"c15","metadata":{},"source":["## 7. Tree Depth vs Instability"]},
  {"cell_type":"code","execution_count":null,"id":"c16","metadata":{},"outputs":[],"source":["depths = [1, 2, 3, 5, 8, None]\ninstab_by_depth, gain_by_depth = [], []\n\nfor d in depths:\n    factory_d = lambda depth=d: DecisionTreeClassifier(max_depth=depth, random_state=42)\n    mean_i, _ = instability_index(factory_d, X_tr, y_tr, X_te)\n    instab_by_depth.append(mean_i)\n\n    accs_s, accs_b = [], []\n    for seed in range(15):\n        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=seed, stratify=y)\n        m = factory_d(); m.fit(Xtr, ytr)\n        accs_s.append(accuracy_score(yte, m.predict(Xte)))\n        bag = BaggingClassifier(estimator=factory_d(), n_estimators=50, random_state=seed, n_jobs=-1)\n        bag.fit(Xtr, ytr)\n        accs_b.append(accuracy_score(yte, bag.predict(Xte)))\n    gain_by_depth.append(np.mean(accs_b) - np.mean(accs_s))\n\ndepth_labels = [str(d) if d is not None else 'None' for d in depths]\nfig, axes = plt.subplots(1, 2, figsize=(13, 4))\naxes[0].plot(depth_labels, instab_by_depth, 'o-', color='#ef4444', linewidth=2)\naxes[0].set_xlabel('max_depth'); axes[0].set_ylabel('Instability Index')\naxes[0].set_title('Tree Depth vs Instability')\naxes[1].plot(depth_labels, gain_by_depth, 's-', color='#6366f1', linewidth=2)\naxes[1].set_xlabel('max_depth'); axes[1].set_ylabel('Bagging Gain')\naxes[1].set_title('Tree Depth vs Bagging Gain')\nplt.tight_layout(); plt.show()\nprint('Deeper trees = more instability = more bagging gain')"]},
  {"cell_type":"markdown","id":"c17","metadata":{},"source":["## 8. Bias-Variance Decomposition (Classification Proxy)"]},
  {"cell_type":"code","execution_count":null,"id":"c18","metadata":{},"outputs":[],"source":["# Domingos (2000) bias-variance decomposition for classification\n# Bias: fraction of points where majority prediction ≠ true label\n# Variance: average disagreement with majority prediction\n\nN_MODELS = 40\nbv_results = {}\n\nfor name, factory in base_learners.items():\n    preds_matrix = []\n    for seed in range(N_MODELS):\n        Xtr, _, ytr, _ = train_test_split(X, y, test_size=0.25, random_state=seed, stratify=y)\n        Xtr_u = Xtr if 'Tree' in name else StandardScaler().fit_transform(Xtr)\n        Xte_u = X_te if 'Tree' in name else X_te_sc\n        m = factory(); m.fit(Xtr_u, ytr)\n        preds_matrix.append(m.predict(Xte_u))\n    preds_matrix = np.array(preds_matrix)  # (N_MODELS, N_test)\n\n    # Majority prediction for each test sample\n    majority = np.apply_along_axis(\n        lambda col: np.bincount(col.astype(int), minlength=2).argmax(), 0, preds_matrix\n    )\n    # Bias: fraction where majority ≠ true\n    bias = (majority != y_te).mean()\n    # Variance: average disagreement with majority\n    variance = np.mean(preds_matrix != majority, axis=0).mean()\n    bv_results[name] = {'bias': bias, 'variance': variance, 'total': bias + variance}\n    print(f'{name:30s}: bias={bias:.4f}  variance={variance:.4f}  total={bias+variance:.4f}')\n\ndf_bv = pd.DataFrame(bv_results).T\nfig, ax = plt.subplots(figsize=(11, 4))\nx = np.arange(len(df_bv))\nw = 0.35\nax.bar(x - w/2, df_bv['bias'],     w, color='#ef4444', label='Bias', alpha=0.85)\nax.bar(x + w/2, df_bv['variance'], w, color='#6366f1', label='Variance', alpha=0.85)\nax.set_xticks(x); ax.set_xticklabels(df_bv.index, rotation=15, ha='right')\nax.set_ylabel('Decomposed Error'); ax.set_title('Bias-Variance Decomposition by Base Learner')\nax.legend(); plt.tight_layout(); plt.show()"]},
  {"cell_type":"markdown","id":"c19","metadata":{},"source":["## 9. k-NN Instability vs k"]},
  {"cell_type":"code","execution_count":null,"id":"c20","metadata":{},"outputs":[],"source":["k_values = [1, 3, 5, 10, 20, 50]\nknn_instab, knn_gain = [], []\n\nfor k in k_values:\n    factory_k = lambda kk=k: KNeighborsClassifier(n_neighbors=kk)\n    mean_i, _ = instability_index(factory_k, X_tr_sc, y_tr, X_te_sc)\n    knn_instab.append(mean_i)\n\n    accs_s, accs_b = [], []\n    for seed in range(15):\n        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=seed, stratify=y)\n        sc_ = StandardScaler(); Xtr_sc = sc_.fit_transform(Xtr); Xte_sc_ = sc_.transform(Xte)\n        m = factory_k(); m.fit(Xtr_sc, ytr)\n        accs_s.append(accuracy_score(yte, m.predict(Xte_sc_)))\n        bag = BaggingClassifier(estimator=factory_k(), n_estimators=30, random_state=seed, n_jobs=-1)\n        bag.fit(Xtr_sc, ytr)\n        accs_b.append(accuracy_score(yte, bag.predict(Xte_sc_)))\n    knn_gain.append(np.mean(accs_b) - np.mean(accs_s))\n    print(f'k={k:3d}: instability={mean_i:.4f}  bagging_gain={knn_gain[-1]:.4f}')\n\nfig, axes = plt.subplots(1, 2, figsize=(13, 4))\naxes[0].plot(k_values, knn_instab, 'o-', color='#22c55e', linewidth=2)\naxes[0].set_xlabel('k'); axes[0].set_title('k-NN Instability vs k')\naxes[1].plot(k_values, knn_gain,   's-', color='#6366f1', linewidth=2)\naxes[1].set_xlabel('k'); axes[1].set_title('k-NN Bagging Gain vs k')\nplt.tight_layout(); plt.show()"]},
  {"cell_type":"markdown","id":"c21","metadata":{},"source":["## 10. Discussion\n\n1. **Decision trees are the most unstable base learner.** Two trees trained on different bootstrap samples have ~30% pairwise disagreement. This high instability translates directly into high bagging gain: the averaging removes variance that accounts for most of the single-tree error.\n\n2. **Logistic Regression has near-zero instability.** The stable, convex optimisation landscape means different training samples converge to nearly the same model. Bagging provides essentially zero benefit — all added computation is wasted.\n\n3. **The Bias-Variance decomposition confirms the mechanism.** Decision trees have high variance (many models disagree with the majority prediction) but relatively low bias (the majority prediction is usually correct). Bagging specifically targets variance, leaving bias unchanged — exactly what is needed for decision trees.\n\n4. **Small k in k-NN creates high instability.** At k=1, the prediction for each test point depends on exactly which training sample is nearest — a highly unstable quantity. At k=20, the prediction depends on a stable local average. The instability index correctly predicts higher bagging gain at smaller k."]},
  {"cell_type":"markdown","id":"c22","metadata":{},"source":["## 11. Next Steps\n\n- **Article 18: Random Forest** — adding feature subsampling to reduce inter-tree correlation further\n- **Article 20: How Random Forest Creates Diversity** — a deeper look at what makes random forest trees different from each other"]}
 ]
}