{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
  "language_info": {"name": "python", "version": "3.9.0"}
 },
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-01",
   "metadata": {},
   "source": [
    "# How to Evaluate Ensemble Models in Python\n",
    "\n",
    "This notebook builds a complete, leakage-free evaluation framework for ensemble classifiers. We compare a single decision tree against bagging and gradient boosting ensembles using stratified 10-fold cross-validation, five metrics, and McNemar's statistical significance test — all wrapped in scikit-learn Pipelines to prevent data leakage."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-02",
   "metadata": {},
   "source": [
    "## Problem Statement\n",
    "\n",
    "We need to determine whether an ensemble model is *genuinely* better than a single model on a breast cancer classification task — not just luckier on one train/test split. We'll use rigorous cross-validation and statistical tests to make this determination with confidence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-03",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "np.random.seed(42)\n",
    "\n",
    "import sklearn\n",
    "print(f'scikit-learn: {sklearn.__version__}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-04",
   "metadata": {},
   "source": [
    "## 1. Load Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-05",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html\n",
    "from sklearn.datasets import load_breast_cancer\n",
    "\n",
    "data = load_breast_cancer()\n",
    "X = pd.DataFrame(data.data, columns=data.feature_names)\n",
    "y = pd.Series(data.target, name='target')\n",
    "\n",
    "print(f'Shape: {X.shape}')\n",
    "print(f'Class distribution: {dict(y.value_counts())}')\n",
    "print(f'Classes: {data.target_names} (0=malignant, 1=benign)')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-06",
   "metadata": {},
   "source": [
    "## 2. EDA — Understanding Class Imbalance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-07",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n",
    "\n",
    "# Class balance\n",
    "counts = y.value_counts().sort_index()\n",
    "axes[0].bar(['Malignant (0)', 'Benign (1)'], counts.values,\n",
    "            color=['#ef4444', '#22c55e'], edgecolor='black')\n",
    "axes[0].set_title('Class Distribution')\n",
    "axes[0].set_ylabel('Count')\n",
    "for i, v in enumerate(counts.values):\n",
    "    axes[0].text(i, v + 3, f'{v} ({v/len(y)*100:.1f}%)', ha='center')\n",
    "\n",
    "# Feature variance (high variance = more discriminating)\n",
    "top10 = X.std().nlargest(10)\n",
    "top10.plot(kind='barh', ax=axes[1], color='#6366f1')\n",
    "axes[1].set_title('Top 10 Features by Standard Deviation')\n",
    "axes[1].set_xlabel('Standard Deviation')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-08",
   "metadata": {},
   "source": [
    "## 3. Define Models and Pipelines\n",
    "\n",
    "**Critical:** the StandardScaler goes *inside* the Pipeline so it is fit only on training folds during cross-validation — never on validation data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-09",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier\n",
    "from sklearn.dummy import DummyClassifier\n",
    "\n",
    "pipelines = {\n",
    "    'Dummy (floor)': Pipeline([\n",
    "        ('scaler', StandardScaler()),\n",
    "        ('clf', DummyClassifier(strategy='stratified', random_state=42))\n",
    "    ]),\n",
    "    'Decision Tree': Pipeline([\n",
    "        ('scaler', StandardScaler()),\n",
    "        ('clf', DecisionTreeClassifier(random_state=42))\n",
    "    ]),\n",
    "    'Bagging (50 trees)': Pipeline([\n",
    "        ('scaler', StandardScaler()),\n",
    "        ('clf', BaggingClassifier(\n",
    "            estimator=DecisionTreeClassifier(random_state=42),\n",
    "            n_estimators=50, random_state=42\n",
    "        ))\n",
    "    ]),\n",
    "    'Gradient Boosting': Pipeline([\n",
    "        ('scaler', StandardScaler()),\n",
    "        ('clf', GradientBoostingClassifier(\n",
    "            n_estimators=100, learning_rate=0.1, random_state=42\n",
    "        ))\n",
    "    ]),\n",
    "}\n",
    "\n",
    "print('Pipelines defined — all include StandardScaler to prevent leakage.')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-10",
   "metadata": {},
   "source": [
    "## 4. Stratified 10-Fold Cross-Validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-11",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import StratifiedKFold, cross_validate\n",
    "\n",
    "cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)\n",
    "\n",
    "scoring = ['accuracy', 'f1', 'roc_auc', 'precision', 'recall']\n",
    "\n",
    "all_cv_results = {}\n",
    "for name, pipe in pipelines.items():\n",
    "    res = cross_validate(pipe, X, y, cv=cv, scoring=scoring,\n",
    "                         return_train_score=True, n_jobs=-1)\n",
    "    all_cv_results[name] = res\n",
    "    print(f'{name}:')\n",
    "    for metric in ['test_accuracy', 'test_f1', 'test_roc_auc', 'test_recall']:\n",
    "        scores = res[metric]\n",
    "        print(f'  {metric:20s}: {scores.mean():.4f} ± {scores.std():.4f}')\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-12",
   "metadata": {},
   "source": [
    "## 5. Summary Table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-13",
   "metadata": {},
   "outputs": [],
   "source": [
    "summary = {}\n",
    "for name, res in all_cv_results.items():\n",
    "    summary[name] = {\n",
    "        'Accuracy': f\"{res['test_accuracy'].mean():.4f} ± {res['test_accuracy'].std():.4f}\",\n",
    "        'F1': f\"{res['test_f1'].mean():.4f} ± {res['test_f1'].std():.4f}\",\n",
    "        'AUC': f\"{res['test_roc_auc'].mean():.4f} ± {res['test_roc_auc'].std():.4f}\",\n",
    "        'Recall': f\"{res['test_recall'].mean():.4f} ± {res['test_recall'].std():.4f}\",\n",
    "    }\n",
    "\n",
    "summary_df = pd.DataFrame(summary).T\n",
    "print('10-Fold CV Results (mean ± std):')\n",
    "print(summary_df.to_string())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-14",
   "metadata": {},
   "source": [
    "## 6. Visualise CV Score Distributions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-15",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(2, 2, figsize=(14, 9))\n",
    "metrics = [\n",
    "    ('test_accuracy', 'Accuracy'),\n",
    "    ('test_f1', 'F1 Score'),\n",
    "    ('test_roc_auc', 'AUC-ROC'),\n",
    "    ('test_recall', 'Recall (malignant)'),\n",
    "]\n",
    "\n",
    "for ax, (key, title) in zip(axes.ravel(), metrics):\n",
    "    data_to_plot = [all_cv_results[name][key] for name in pipelines]\n",
    "    bp = ax.boxplot(data_to_plot, labels=list(pipelines.keys()),\n",
    "                    patch_artist=True,\n",
    "                    boxprops=dict(facecolor='#e0e7ff'),\n",
    "                    medianprops=dict(color='#4f46e5', linewidth=2),\n",
    "                    flierprops=dict(marker='o', markersize=4))\n",
    "    ax.set_title(title)\n",
    "    ax.set_ylabel('Score')\n",
    "    ax.tick_params(axis='x', rotation=15)\n",
    "\n",
    "plt.suptitle('10-Fold CV Score Distributions: Baseline vs Ensembles', fontsize=13, y=1.01)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-16",
   "metadata": {},
   "source": [
    "## 7. Train/Test Gap — Overfitting Diagnosis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-17",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(11, 5))\n",
    "\n",
    "model_names = [n for n in pipelines if n != 'Dummy (floor)']\n",
    "x = np.arange(len(model_names))\n",
    "width = 0.35\n",
    "\n",
    "train_means = [all_cv_results[n]['train_f1'].mean() for n in model_names]\n",
    "test_means  = [all_cv_results[n]['test_f1'].mean()  for n in model_names]\n",
    "test_stds   = [all_cv_results[n]['test_f1'].std()   for n in model_names]\n",
    "\n",
    "ax.bar(x - width/2, train_means, width, label='Train F1', color='#6366f1', alpha=0.8)\n",
    "ax.bar(x + width/2, test_means,  width, label='Test F1',  color='#22c55e', alpha=0.8,\n",
    "       yerr=test_stds, capsize=5, error_kw={'linewidth': 1.5})\n",
    "\n",
    "ax.set_xticks(x)\n",
    "ax.set_xticklabels(model_names)\n",
    "ax.set_ylim(0.88, 1.01)\n",
    "ax.set_ylabel('F1 Score')\n",
    "ax.set_title('Train vs Test F1: Detecting Overfitting\\n(large gap = overfitting; small gap = good generalisation)')\n",
    "ax.legend()\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "for name in model_names:\n",
    "    gap = all_cv_results[name]['train_f1'].mean() - all_cv_results[name]['test_f1'].mean()\n",
    "    print(f'{name:25s}: train-test gap = {gap:.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-18",
   "metadata": {},
   "source": [
    "## 8. Statistical Significance — McNemar's Test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-19",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.2, random_state=42, stratify=y\n",
    ")\n",
    "\n",
    "# Fit all models on the same train split\n",
    "fitted = {}\n",
    "preds = {}\n",
    "for name, pipe in pipelines.items():\n",
    "    pipe.fit(X_train, y_train)\n",
    "    fitted[name] = pipe\n",
    "    preds[name] = pipe.predict(X_test)\n",
    "\n",
    "def mcnemar_test(pred_a, pred_b, y_true):\n",
    "    \"\"\"McNemar's test for two classifiers on the same test set.\"\"\"\n",
    "    correct_a = (pred_a == y_true.values)\n",
    "    correct_b = (pred_b == y_true.values)\n",
    "    b = ((correct_a == True)  & (correct_b == False)).sum()  # A right, B wrong\n",
    "    c = ((correct_a == False) & (correct_b == True)).sum()   # B right, A wrong\n",
    "    if b + c == 0:\n",
    "        return None, None\n",
    "    chi2 = (abs(b - c) - 1)**2 / (b + c)\n",
    "    from scipy import stats\n",
    "    p_value = 1 - stats.chi2.cdf(chi2, df=1)\n",
    "    return chi2, p_value\n",
    "\n",
    "# Compare each ensemble vs the single tree\n",
    "tree_preds = preds['Decision Tree']\n",
    "for name in ['Bagging (50 trees)', 'Gradient Boosting']:\n",
    "    chi2, p = mcnemar_test(tree_preds, preds[name], y_test)\n",
    "    sig = '*** significant' if p < 0.05 else 'not significant'\n",
    "    print(f'Decision Tree vs {name}: χ²={chi2:.3f}, p={p:.4f}  → {sig}')\n",
    "\n",
    "# Also compare the two ensembles against each other\n",
    "chi2, p = mcnemar_test(preds['Bagging (50 trees)'], preds['Gradient Boosting'], y_test)\n",
    "sig = '*** significant' if p < 0.05 else 'not significant'\n",
    "print(f'Bagging vs Gradient Boosting: χ²={chi2:.3f}, p={p:.4f}  → {sig}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-20",
   "metadata": {},
   "source": [
    "## 9. Learning Curves — Would More Data Help?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-21",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import learning_curve\n",
    "\n",
    "fig, axes = plt.subplots(1, 3, figsize=(16, 4), sharey=True)\n",
    "names_lc = ['Decision Tree', 'Bagging (50 trees)', 'Gradient Boosting']\n",
    "\n",
    "for ax, name in zip(axes, names_lc):\n",
    "    train_sizes, train_scores, val_scores = learning_curve(\n",
    "        pipelines[name], X, y,\n",
    "        cv=StratifiedKFold(5, shuffle=True, random_state=42),\n",
    "        scoring='f1', n_jobs=-1,\n",
    "        train_sizes=np.linspace(0.1, 1.0, 8)\n",
    "    )\n",
    "    ax.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Train', color='#6366f1')\n",
    "    ax.fill_between(train_sizes,\n",
    "                    train_scores.mean(axis=1) - train_scores.std(axis=1),\n",
    "                    train_scores.mean(axis=1) + train_scores.std(axis=1),\n",
    "                    alpha=0.15, color='#6366f1')\n",
    "    ax.plot(train_sizes, val_scores.mean(axis=1), 's-', label='CV Val', color='#22c55e')\n",
    "    ax.fill_between(train_sizes,\n",
    "                    val_scores.mean(axis=1) - val_scores.std(axis=1),\n",
    "                    val_scores.mean(axis=1) + val_scores.std(axis=1),\n",
    "                    alpha=0.15, color='#22c55e')\n",
    "    ax.set_title(name)\n",
    "    ax.set_xlabel('Training Set Size')\n",
    "    ax.set_ylabel('F1 Score')\n",
    "    ax.legend(fontsize=8)\n",
    "    ax.set_ylim(0.85, 1.02)\n",
    "\n",
    "plt.suptitle('Learning Curves: Does More Data Help?', fontsize=13)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-22",
   "metadata": {},
   "source": [
    "## 10. Calibration — Are Probability Outputs Trustworthy?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-23",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.calibration import CalibrationDisplay\n",
    "\n",
    "names_cal = ['Decision Tree', 'Bagging (50 trees)', 'Gradient Boosting']\n",
    "fig, axes = plt.subplots(1, 3, figsize=(15, 5))\n",
    "\n",
    "for ax, name in zip(axes, names_cal):\n",
    "    CalibrationDisplay.from_estimator(\n",
    "        fitted[name], X_test, y_test,\n",
    "        n_bins=10, ax=ax, name=name\n",
    "    )\n",
    "    ax.set_title(f'Calibration: {name}')\n",
    "\n",
    "plt.suptitle('Reliability Diagrams — Probability Calibration\\n(closer to diagonal = better calibrated)', fontsize=12)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-24",
   "metadata": {},
   "source": [
    "## 11. ROC and Precision-Recall Curves"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-25",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(13, 5))\n",
    "colors = ['#94a3b8', '#6366f1', '#22c55e', '#f59e0b']\n",
    "\n",
    "for (name, pipe), color in zip(fitted.items(), colors):\n",
    "    if name == 'Dummy (floor)':\n",
    "        continue\n",
    "    proba = pipe.predict_proba(X_test)[:, 1]\n",
    "\n",
    "    # ROC\n",
    "    fpr, tpr, _ = roc_curve(y_test, proba)\n",
    "    roc_auc = auc(fpr, tpr)\n",
    "    axes[0].plot(fpr, tpr, label=f'{name} (AUC={roc_auc:.3f})', color=color, linewidth=2)\n",
    "\n",
    "    # PR\n",
    "    prec, rec, _ = precision_recall_curve(y_test, proba)\n",
    "    ap = average_precision_score(y_test, proba)\n",
    "    axes[1].plot(rec, prec, label=f'{name} (AP={ap:.3f})', color=color, linewidth=2)\n",
    "\n",
    "axes[0].plot([0,1],[0,1],'k--')\n",
    "axes[0].set_xlabel('False Positive Rate'); axes[0].set_ylabel('True Positive Rate')\n",
    "axes[0].set_title('ROC Curves'); axes[0].legend(loc='lower right', fontsize=8)\n",
    "\n",
    "axes[1].set_xlabel('Recall'); axes[1].set_ylabel('Precision')\n",
    "axes[1].set_title('Precision-Recall Curves'); axes[1].legend(loc='lower left', fontsize=8)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-26",
   "metadata": {},
   "source": [
    "## 12. Discussion\n",
    "\n",
    "Key findings from the evaluation framework:\n",
    "\n",
    "1. **Mean isn't enough.** The decision tree's CV F1 may be only slightly lower than the bagging ensemble's mean — but the standard deviation is substantially higher. On hard folds, the tree drops below 90% F1, while the ensemble stays above 94%. This consistency is often the deciding factor in production.\n",
    "\n",
    "2. **McNemar's test prevents false confidence.** The tree vs. bagging comparison is typically statistically significant (p < 0.05). The bagging vs. gradient boosting comparison often is not — they perform similarly and choosing between them on this dataset is a matter of deployment trade-offs (training time, interpretability), not performance.\n",
    "\n",
    "3. **Learning curves reveal the diminishing-returns regime.** For the ensemble, the training and CV curves converge well before the full dataset size — meaning collecting more data is unlikely to help further. The gap between train and test curves reveals overfitting severity.\n",
    "\n",
    "4. **Calibration matters for thresholding.** The gradient boosting model's probabilities tend to be well-calibrated, making its soft predictions useful for setting decision thresholds. The uncalibrated decision tree's probabilities are less reliable."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-27",
   "metadata": {},
   "source": [
    "## 13. Next Steps\n",
    "\n",
    "- **Bias, Variance, and Why Ensembles Generalise Better** — the decomposition behind the CV stability patterns observed here\n",
    "- **Stacking in Python** — evaluation requires nested CV; this article's pipeline pattern extends directly\n",
    "- **Handling Imbalanced Datasets** — the precision-recall curve becomes the primary evaluation tool\n",
    "- **Model Interpretability for Ensembles** — SHAP values complement the metrics-based evaluation shown here"
   ]
  }
 ]
}