{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
  "language_info": {"name": "python", "version": "3.9.0"}
 },
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c01",
   "metadata": {},
   "source": [
    "# Gradient Boosting in Python for Structured Data\n",
    "\n",
    "Builds gradient boosting from scratch (MSE loss, regression trees), then scales to sklearn's GradientBoostingRegressor and GradientBoostingClassifier. Covers staged prediction, feature importance, partial dependence, and comparison against baselines on synthetic structured data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "np.random.seed(42)\n",
    "import sklearn; print(f'sklearn {sklearn.__version__}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c03",
   "metadata": {},
   "source": ["## 1. Regression Dataset"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c04",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html\n",
    "from sklearn.datasets import make_regression, make_classification\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_reg, y_reg = make_regression(\n",
    "    n_samples=2000, n_features=15, n_informative=10,\n",
    "    noise=30.0, random_state=42\n",
    ")\n",
    "\n",
    "X_tr, X_te, y_tr, y_te = train_test_split(\n",
    "    X_reg, y_reg, test_size=0.25, random_state=42\n",
    ")\n",
    "print(f'Regression — train: {X_tr.shape}, test: {X_te.shape}')\n",
    "print(f'Target range: [{y_reg.min():.1f}, {y_reg.max():.1f}]  mean={y_reg.mean():.1f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c05",
   "metadata": {},
   "source": ["## 2. EDA"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c06",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n",
    "\n",
    "axes[0].hist(y_reg, bins=40, color='#6366f1', edgecolor='white', alpha=0.85)\n",
    "axes[0].set_title('Target Distribution'); axes[0].set_xlabel('y')\n",
    "\n",
    "corrs = [np.corrcoef(X_reg[:, i], y_reg)[0, 1] for i in range(X_reg.shape[1])]\n",
    "axes[1].bar(range(X_reg.shape[1]), np.abs(corrs), color='#22c55e', alpha=0.8)\n",
    "axes[1].set_title('|Pearson Correlation| with Target')\n",
    "axes[1].set_xlabel('Feature Index')\n",
    "\n",
    "top_feat = np.argmax(np.abs(corrs))\n",
    "axes[2].scatter(X_reg[:, top_feat], y_reg, alpha=0.3, s=10, color='#6366f1')\n",
    "axes[2].set_title(f'Most Correlated Feature vs Target\\n(Feature {top_feat})')\n",
    "axes[2].set_xlabel(f'Feature {top_feat}'); axes[2].set_ylabel('y')\n",
    "\n",
    "plt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c07",
   "metadata": {},
   "source": ["## 3. Gradient Boosting from Scratch (MSE Regression)"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c08",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeRegressor\n",
    "from sklearn.metrics import mean_squared_error, r2_score\n",
    "\n",
    "# --- Parameters ---\n",
    "n_trees = 100\n",
    "learning_rate = 0.1\n",
    "max_depth = 3\n",
    "\n",
    "# --- Initialise ---\n",
    "F_train = np.full(len(y_tr), y_tr.mean())\n",
    "F_test  = np.full(len(y_te), y_tr.mean())  # same constant\n",
    "trees_scratch = []\n",
    "\n",
    "scratch_train_rmse, scratch_test_rmse = [], []\n",
    "\n",
    "for m in range(n_trees):\n",
    "    residuals = y_tr - F_train  # pseudo-residuals for MSE = ordinary residuals\n",
    "    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=m)\n",
    "    tree.fit(X_tr, residuals)\n",
    "    F_train += learning_rate * tree.predict(X_tr)\n",
    "    F_test  += learning_rate * tree.predict(X_te)\n",
    "    trees_scratch.append(tree)\n",
    "    scratch_train_rmse.append(np.sqrt(mean_squared_error(y_tr, F_train)))\n",
    "    scratch_test_rmse.append(np.sqrt(mean_squared_error(y_te, F_test)))\n",
    "\n",
    "final_r2 = r2_score(y_te, F_test)\n",
    "print(f'From-scratch GB ({n_trees} trees):')\n",
    "print(f'  Test RMSE: {scratch_test_rmse[-1]:.4f}')\n",
    "print(f'  Test R²:   {final_r2:.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c09",
   "metadata": {},
   "source": ["## 4. Residuals Shrink Across Rounds"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c10",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show residual statistics at selected rounds\n",
    "F_snapshot = np.full(len(y_tr), y_tr.mean())\n",
    "snapshot_rounds = [0, 5, 20, 50, 99]\n",
    "snapshot_data = {}\n",
    "for m in range(100):\n",
    "    if m in snapshot_rounds:\n",
    "        snapshot_data[m] = y_tr - F_snapshot\n",
    "    tree = trees_scratch[m]\n",
    "    F_snapshot += learning_rate * tree.predict(X_tr)\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(12, 4))\n",
    "positions = list(range(len(snapshot_rounds)))\n",
    "data_to_plot = [snapshot_data[r] for r in snapshot_rounds]\n",
    "bp = ax.boxplot(data_to_plot, positions=positions,\n",
    "                patch_artist=True,\n",
    "                boxprops=dict(facecolor='#e0e7ff'),\n",
    "                medianprops=dict(color='#4f46e5', linewidth=2),\n",
    "                showfliers=False)\n",
    "ax.set_xticks(positions)\n",
    "ax.set_xticklabels([f'Round {r}' for r in snapshot_rounds])\n",
    "ax.axhline(0, color='gray', linestyle='--', linewidth=1)\n",
    "ax.set_ylabel('Residual')\n",
    "ax.set_title('Residual Distribution Shrinks Across Boosting Rounds\\n(each box is the distribution of y − F_m on the training set)')\n",
    "plt.tight_layout(); plt.show()\n",
    "\n",
    "for r, res in snapshot_data.items():\n",
    "    print(f'Round {r:3d}: std={res.std():.2f}, |max|={np.abs(res).max():.2f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c11",
   "metadata": {},
   "source": ["## 5. Staged RMSE — Scratch vs sklearn"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c12",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import GradientBoostingRegressor\n",
    "from sklearn.tree import DecisionTreeRegressor as DTR\n",
    "\n",
    "# Single-tree baselines\n",
    "single_tree = DTR(max_depth=5, random_state=42)\n",
    "single_tree.fit(X_tr, y_tr)\n",
    "single_rmse = np.sqrt(mean_squared_error(y_te, single_tree.predict(X_te)))\n",
    "\n",
    "# sklearn GBR\n",
    "gbr = GradientBoostingRegressor(\n",
    "    n_estimators=300, learning_rate=0.05,\n",
    "    max_depth=3, subsample=0.8,\n",
    "    min_samples_leaf=5, random_state=42\n",
    ")\n",
    "gbr.fit(X_tr, y_tr)\n",
    "staged_rmse_sk = [\n",
    "    np.sqrt(mean_squared_error(y_te, pred))\n",
    "    for pred in gbr.staged_predict(X_te)\n",
    "]\n",
    "\n",
    "plt.figure(figsize=(12, 4))\n",
    "plt.plot(np.arange(1, 101), scratch_test_rmse,\n",
    "         color='#f59e0b', linewidth=1.8, label='Scratch (lr=0.1, d=3)')\n",
    "plt.plot(np.arange(1, 301), staged_rmse_sk,\n",
    "         color='#6366f1', linewidth=1.8, label='sklearn GBR (lr=0.05, d=3, sub=0.8)')\n",
    "plt.axhline(single_rmse, color='#ef4444', linestyle='--', linewidth=1.5,\n",
    "            label=f'Single Tree (RMSE={single_rmse:.1f})')\n",
    "plt.xlabel('Boosting Round'); plt.ylabel('Test RMSE')\n",
    "plt.title('Staged RMSE: Scratch Implementation vs sklearn GBR')\n",
    "plt.legend(); plt.tight_layout(); plt.show()\n",
    "\n",
    "best_round = np.argmin(staged_rmse_sk) + 1\n",
    "print(f'sklearn GBR best RMSE: {min(staged_rmse_sk):.4f} at round {best_round}')\n",
    "print(f'sklearn GBR test R²:   {r2_score(y_te, gbr.predict(X_te)):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c13",
   "metadata": {},
   "source": ["## 6. Feature Importance — Impurity vs Permutation"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c14",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.inspection import permutation_importance\n",
    "\n",
    "mdi_imp = gbr.feature_importances_\n",
    "perm_res = permutation_importance(gbr, X_te, y_te, n_repeats=10,\n",
    "                                   random_state=42, scoring='r2')\n",
    "perm_imp = perm_res.importances_mean\n",
    "\n",
    "feat_names = [f'F{i}' for i in range(X_reg.shape[1])]\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 4))\n",
    "\n",
    "order_mdi = np.argsort(mdi_imp)[::-1]\n",
    "axes[0].bar(range(len(mdi_imp)), mdi_imp[order_mdi],\n",
    "            color='#6366f1', alpha=0.85)\n",
    "axes[0].set_xticks(range(len(mdi_imp)))\n",
    "axes[0].set_xticklabels([feat_names[i] for i in order_mdi], rotation=45)\n",
    "axes[0].set_title('MDI Feature Importance (Impurity-Based)')\n",
    "axes[0].set_ylabel('Mean Decrease Impurity')\n",
    "\n",
    "order_perm = np.argsort(perm_imp)[::-1]\n",
    "axes[1].bar(range(len(perm_imp)), perm_imp[order_perm],\n",
    "            color='#22c55e', alpha=0.85)\n",
    "axes[1].set_xticks(range(len(perm_imp)))\n",
    "axes[1].set_xticklabels([feat_names[i] for i in order_perm], rotation=45)\n",
    "axes[1].set_title('Permutation Importance (Test Set)')\n",
    "axes[1].set_ylabel('Mean R² Drop')\n",
    "\n",
    "plt.tight_layout(); plt.show()\n",
    "\n",
    "print('Top 5 by MDI: ', [feat_names[i] for i in order_mdi[:5]])\n",
    "print('Top 5 by Perm:', [feat_names[i] for i in order_perm[:5]])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c15",
   "metadata": {},
   "source": ["## 7. Partial Dependence Plots"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c16",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.inspection import PartialDependenceDisplay\n",
    "\n",
    "top2 = order_mdi[:2].tolist()\n",
    "fig, ax = plt.subplots(figsize=(12, 4))\n",
    "PartialDependenceDisplay.from_estimator(\n",
    "    gbr, X_tr, features=top2,\n",
    "    feature_names=feat_names,\n",
    "    ax=ax, grid_resolution=50\n",
    ")\n",
    "ax.set_title('Partial Dependence: Top 2 Most Important Features')\n",
    "plt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c17",
   "metadata": {},
   "source": ["## 8. Classification Dataset and GradientBoostingClassifier"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c18",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html\n",
    "X_cls, y_cls = make_classification(\n",
    "    n_samples=2000, n_features=10, n_informative=7,\n",
    "    n_redundant=2, flip_y=0.03, random_state=42\n",
    ")\n",
    "\n",
    "X_tc, X_ec, y_tc, y_ec = train_test_split(\n",
    "    X_cls, y_cls, test_size=0.25, random_state=42, stratify=y_cls\n",
    ")\n",
    "print(f'Classification — train: {X_tc.shape}, test: {X_ec.shape}')\n",
    "print(f'Class counts (train): {np.bincount(y_tc)}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c19",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import GradientBoostingClassifier\n",
    "from sklearn.metrics import accuracy_score, roc_auc_score\n",
    "\n",
    "gbc = GradientBoostingClassifier(\n",
    "    n_estimators=200, learning_rate=0.05,\n",
    "    max_depth=3, subsample=0.8,\n",
    "    min_samples_leaf=5, random_state=42\n",
    ")\n",
    "gbc.fit(X_tc, y_tc)\n",
    "\n",
    "# Staged accuracy\n",
    "staged_acc = [accuracy_score(y_ec, p) for p in gbc.staged_predict(X_ec)]\n",
    "\n",
    "plt.figure(figsize=(10, 4))\n",
    "plt.plot(np.arange(1, 201), staged_acc, color='#6366f1', linewidth=1.8)\n",
    "plt.xlabel('Round'); plt.ylabel('Test Accuracy')\n",
    "plt.title('GradientBoostingClassifier — Staged Test Accuracy')\n",
    "plt.tight_layout(); plt.show()\n",
    "\n",
    "best_round_cls = np.argmax(staged_acc) + 1\n",
    "y_proba = gbc.predict_proba(X_ec)[:, 1]\n",
    "print(f'Best test acc: {max(staged_acc):.4f} at round {best_round_cls}')\n",
    "print(f'AUC-ROC: {roc_auc_score(y_ec, y_proba):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c20",
   "metadata": {},
   "source": ["## 9. Cross-Validated Comparison vs Baselines"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c21",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score, StratifiedKFold\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)\n",
    "\n",
    "models = {\n",
    "    'Decision Tree':    DecisionTreeClassifier(max_depth=5, random_state=42),\n",
    "    'Random Forest':    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42, n_jobs=-1),\n",
    "    'Logistic Reg.':   Pipeline([('sc', StandardScaler()),\n",
    "                                 ('clf', LogisticRegression(max_iter=500, random_state=42))]),\n",
    "    'GBM (200 trees)': GradientBoostingClassifier(\n",
    "                            n_estimators=200, learning_rate=0.05,\n",
    "                            max_depth=3, subsample=0.8,\n",
    "                            min_samples_leaf=5, random_state=42),\n",
    "}\n",
    "\n",
    "cv_results = {}\n",
    "for name, model in models.items():\n",
    "    scores = cross_val_score(model, X_cls, y_cls, cv=cv,\n",
    "                             scoring='accuracy', n_jobs=-1)\n",
    "    cv_results[name] = scores\n",
    "    print(f'{name:20s}: {scores.mean():.4f} ± {scores.std():.4f}')\n",
    "\n",
    "plt.figure(figsize=(10, 4))\n",
    "plt.boxplot(cv_results.values(), labels=cv_results.keys(),\n",
    "            patch_artist=True,\n",
    "            boxprops=dict(facecolor='#e0e7ff'),\n",
    "            medianprops=dict(color='#4f46e5', linewidth=2))\n",
    "plt.ylabel('10-Fold CV Accuracy')\n",
    "plt.title('Gradient Boosting vs Baselines')\n",
    "plt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c22",
   "metadata": {},
   "source": ["## 10. Learning Rate and Tree Depth Interaction"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c23",
   "metadata": {},
   "outputs": [],
   "source": [
    "configs = [\n",
    "    {'lr': 0.01, 'depth': 3, 'label': 'lr=0.01, d=3', 'color': '#94a3b8'},\n",
    "    {'lr': 0.05, 'depth': 3, 'label': 'lr=0.05, d=3', 'color': '#6366f1'},\n",
    "    {'lr': 0.1,  'depth': 3, 'label': 'lr=0.1,  d=3', 'color': '#22c55e'},\n",
    "    {'lr': 0.1,  'depth': 5, 'label': 'lr=0.1,  d=5', 'color': '#ef4444'},\n",
    "]\n",
    "\n",
    "plt.figure(figsize=(12, 4))\n",
    "for cfg in configs:\n",
    "    m = GradientBoostingRegressor(\n",
    "        n_estimators=200, learning_rate=cfg['lr'],\n",
    "        max_depth=cfg['depth'], random_state=42\n",
    "    )\n",
    "    m.fit(X_tr, y_tr)\n",
    "    rmses = [np.sqrt(mean_squared_error(y_te, p)) for p in m.staged_predict(X_te)]\n",
    "    plt.plot(np.arange(1, 201), rmses,\n",
    "             color=cfg['color'], linewidth=1.8, label=cfg['label'])\n",
    "\n",
    "plt.xlabel('Round'); plt.ylabel('Test RMSE')\n",
    "plt.title('Learning Rate × Max Depth Interaction\\n(lower lr = slower convergence but lower floor; deeper trees overfit faster)')\n",
    "plt.legend(); plt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c24",
   "metadata": {},
   "source": ["## 11. Residual Plot — Final Model"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c25",
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred_final = gbr.predict(X_te)\n",
    "residuals_final = y_te - y_pred_final\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(13, 4))\n",
    "\n",
    "axes[0].scatter(y_pred_final, residuals_final, alpha=0.3, s=10, color='#6366f1')\n",
    "axes[0].axhline(0, color='gray', linestyle='--')\n",
    "axes[0].set_xlabel('Predicted'); axes[0].set_ylabel('Residual')\n",
    "axes[0].set_title('Residuals vs Predicted\\n(no systematic pattern = well-specified model)')\n",
    "\n",
    "axes[1].scatter(y_te, y_pred_final, alpha=0.3, s=10, color='#22c55e')\n",
    "lims = [min(y_te.min(), y_pred_final.min()), max(y_te.max(), y_pred_final.max())]\n",
    "axes[1].plot(lims, lims, 'k--', linewidth=1)\n",
    "axes[1].set_xlabel('Actual'); axes[1].set_ylabel('Predicted')\n",
    "axes[1].set_title(f'Actual vs Predicted\\nR² = {r2_score(y_te, y_pred_final):.4f}')\n",
    "\n",
    "plt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c26",
   "metadata": {},
   "source": [
    "## 12. Discussion\n",
    "\n",
    "1. **Residuals shrink toward zero.** The boxplot at each snapshot round confirms that both the spread and the median of residuals collapse as boosting progresses — exactly the gradient descent in function space interpretation.\n",
    "\n",
    "2. **Staged prediction is essential for model selection.** The staged RMSE curve shows where improvement plateaus. Running to 300 rounds at lr=0.05 is safe because the stochastic subsampling (subsample=0.8) prevents overfitting even past the optimal round.\n",
    "\n",
    "3. **MDI and permutation importance agree on the top features.** When both metrics rank the same features highly, you can be confident the model has found real signal rather than overfitting noisy features. When they disagree, permutation importance is more reliable (it is computed on held-out data).\n",
    "\n",
    "4. **Learning rate × depth is the key interaction.** Shallow trees (depth=3) with a lower learning rate converge to a better test RMSE than deep trees (depth=5) with a higher learning rate, even at the same number of rounds. This is the regularisation effect of shrinkage combined with model simplicity."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c27",
   "metadata": {},
   "source": [
    "## 13. Next Steps\n",
    "\n",
    "- **Article 10: XGBoost for Real Business Problems** — adds second-order gradients, column subsampling, and built-in regularisation (L1, L2) to the gradient boosting framework; dramatically faster on large datasets\n",
    "- **Article 11: LightGBM for Fast Gradient Boosting** — histogram-based splits and leaf-wise growth make LightGBM 10–20× faster than sklearn GBM on large datasets\n",
    "- **Article 12: CatBoost for Categorical Features** — symmetric trees and ordered target encoding handle high-cardinality categoricals without preprocessing"
   ]
  }
 ]
}