{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c01",
   "metadata": {},
   "source": [
    "# XGBoost for Real Business Problems\n",
    "\n",
    "Two end-to-end business scenarios: (A) churn prediction on a 15% positive-rate imbalanced dataset with scale_pos_weight and threshold optimisation; (B) revenue regression with intentional missing values showing XGBoost's native NaN handling. Both include SHAP-based explanations and comparison against sklearn GBM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "np.random.seed(42)\n",
    "\n",
    "import xgboost as xgb\n",
    "import sklearn\n",
    "print(f'sklearn  {sklearn.__version__}')\n",
    "print(f'xgboost  {xgb.__version__}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c03",
   "metadata": {},
   "source": [
    "## Scenario A \u2014 Churn Prediction (Imbalanced)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c04",
   "metadata": {},
   "source": [
    "### A1. Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c05",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_churn, y_churn = make_classification(\n",
    "    n_samples=5000, n_features=12, n_informative=8,\n",
    "    n_redundant=2, weights=[0.85, 0.15],\n",
    "    flip_y=0.02, random_state=42\n",
    ")\n",
    "\n",
    "X_tr, X_tmp, y_tr, y_tmp = train_test_split(\n",
    "    X_churn, y_churn, test_size=0.3, random_state=42, stratify=y_churn\n",
    ")\n",
    "X_val, X_te, y_val, y_te = train_test_split(\n",
    "    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp\n",
    ")\n",
    "\n",
    "# Estimate the class ratio from the training split only, so that\n",
    "# validation and test labels do not leak into scale_pos_weight\n",
    "pos_rate = y_tr.mean()\n",
    "scale_pw = (1 - pos_rate) / pos_rate\n",
    "print(f'Positive rate (churn): {pos_rate:.3f}')\n",
    "print(f'scale_pos_weight:      {scale_pw:.2f}')\n",
    "print(f'Train: {X_tr.shape}  Val: {X_val.shape}  Test: {X_te.shape}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c06",
   "metadata": {},
   "source": [
    "### A2. EDA \u2014 Class Imbalance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c07",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n",
    "\n",
    "counts = np.bincount(y_churn)\n",
    "axes[0].bar(['Not Churned (0)', 'Churned (1)'], counts,\n",
    "            color=['#6366f1','#ef4444'], edgecolor='k')\n",
    "for i, c in enumerate(counts):\n",
    "    axes[0].text(i, c + 30, str(c), ha='center', fontweight='bold')\n",
    "axes[0].set_title(f'Class Distribution (imbalance ratio \u2248 {scale_pw:.1f}:1)')\n",
    "\n",
    "# Feature separability for top feature\n",
    "corrs = [abs(np.corrcoef(X_churn[:, i], y_churn)[0, 1]) for i in range(12)]\n",
    "best_feat = np.argmax(corrs)\n",
    "for cls, col, lbl in [(0, '#6366f1', 'Not Churned'), (1, '#ef4444', 'Churned')]:\n",
    "    axes[1].hist(X_churn[y_churn == cls, best_feat], bins=30,\n",
    "                 alpha=0.6, color=col, label=lbl)\n",
    "axes[1].set_title(f'Best Separating Feature (F{best_feat})')\n",
    "axes[1].legend()\n",
    "\n",
    "plt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c08",
   "metadata": {},
   "source": [
    "### A3. Train XGBoost with scale_pos_weight and Early Stopping"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c09",
   "metadata": {},
   "outputs": [],
   "source": [
    "model_churn = xgb.XGBClassifier(\n",
    "    n_estimators=500,\n",
    "    learning_rate=0.05,\n",
    "    max_depth=4,\n",
    "    subsample=0.8,\n",
    "    colsample_bytree=0.8,\n",
    "    scale_pos_weight=scale_pw,\n",
    "    reg_alpha=0.1,\n",
    "    reg_lambda=1.0,\n",
    "    eval_metric='auc',\n",
    "    early_stopping_rounds=30,\n",
    "    random_state=42,\n",
    "    verbosity=0\n",
    ")\n",
    "model_churn.fit(\n",
    "    X_tr, y_tr,\n",
    "    eval_set=[(X_val, y_val)],\n",
    "    verbose=False\n",
    ")\n",
    "\n",
    "# best_iteration is the 0-based index of the best boosting round\n",
    "best_n = model_churn.best_iteration\n",
    "print(f'Best round (early stopping, 0-based): {best_n}')\n",
    "\n",
    "# Eval metrics\n",
    "from sklearn.metrics import roc_auc_score, average_precision_score\n",
    "y_prob_val = model_churn.predict_proba(X_val)[:, 1]\n",
    "y_prob_te  = model_churn.predict_proba(X_te)[:, 1]\n",
    "print(f'Val  AUC-ROC: {roc_auc_score(y_val, y_prob_val):.4f}')\n",
    "print(f'Val  AUC-PR:  {average_precision_score(y_val, y_prob_val):.4f}')\n",
    "print(f'Test AUC-ROC: {roc_auc_score(y_te, y_prob_te):.4f}')\n",
    "print(f'Test AUC-PR:  {average_precision_score(y_te, y_prob_te):.4f}')"
   ]
  },
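  {
   "cell_type": "markdown",
   "id": "c09a",
   "metadata": {},
   "source": [
    "The early-stopping rule used above can be sketched in plain Python. This is an illustrative sketch, not XGBoost's implementation: `best_round_with_patience` is a hypothetical helper and the validation AUC values are made up. It assumes a higher metric is better and that training stops once `patience` consecutive rounds fail to beat the best score so far. In this made-up sequence the run stops before the final, higher score is ever seen, which is the trade-off a patience setting makes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c09b",
   "metadata": {},
   "outputs": [],
   "source": [
    "def best_round_with_patience(scores, patience):\n",
    "    # Return (best_round, stop_round) for a metric where higher is better.\n",
    "    # best_round: 0-based index of the best score seen before stopping.\n",
    "    # stop_round: 0-based index of the last round that was evaluated.\n",
    "    best_round, best_score = 0, float('-inf')\n",
    "    for i, s in enumerate(scores):\n",
    "        if s > best_score:\n",
    "            best_round, best_score = i, s\n",
    "        elif i - best_round >= patience:\n",
    "            return best_round, i\n",
    "    return best_round, len(scores) - 1\n",
    "\n",
    "# Made-up validation AUC per boosting round (illustration only)\n",
    "val_auc = [0.70, 0.74, 0.78, 0.80, 0.79, 0.80, 0.79, 0.78, 0.81]\n",
    "best, stopped = best_round_with_patience(val_auc, patience=3)\n",
    "print(f'best round (0-based): {best}, stopped at round: {stopped}')\n",
    "print(f'trees to keep: {best + 1}')"
   ]
  },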
  {
   "cell_type": "markdown",
   "id": "c10",
   "metadata": {},
   "source": [
    "### A4. Precision-Recall Curve and Threshold Optimisation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c11",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import precision_recall_curve, f1_score\n",
    "\n",
    "prec, rec, thresholds = precision_recall_curve(y_val, y_prob_val)\n",
    "\n",
    "# F1 at each threshold (prec/rec have len(thresholds)+1 elements)\n",
    "p, r = prec[:-1], rec[:-1]  # align with thresholds\n",
    "f1_scores = np.where((p + r) > 0, 2 * p * r / (p + r), 0)\n",
    "best_thresh_idx = np.argmax(f1_scores)\n",
    "best_thresh = thresholds[best_thresh_idx]\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(13, 4))\n",
    "\n",
    "axes[0].plot(rec, prec, color='#6366f1', linewidth=2)\n",
    "axes[0].scatter(rec[best_thresh_idx], prec[best_thresh_idx],\n",
    "                color='#ef4444', s=100, zorder=5,\n",
    "                label=f'Best F1 thresh={best_thresh:.2f}')\n",
    "axes[0].set_xlabel('Recall'); axes[0].set_ylabel('Precision')\n",
    "axes[0].set_title('Precision-Recall Curve (Validation Set)')\n",
    "axes[0].legend()\n",
    "\n",
    "axes[1].plot(thresholds, f1_scores, color='#22c55e', linewidth=2)\n",
    "axes[1].axvline(best_thresh, color='#ef4444', linestyle='--',\n",
    "                label=f'Optimal threshold={best_thresh:.2f}')\n",
    "axes[1].set_xlabel('Threshold'); axes[1].set_ylabel('F1 Score')\n",
    "axes[1].set_title('F1 vs Threshold')\n",
    "axes[1].legend()\n",
    "\n",
    "plt.tight_layout(); plt.show()\n",
    "\n",
    "y_pred_opt = (y_prob_te >= best_thresh).astype(int)\n",
    "print(f'Optimal threshold: {best_thresh:.4f}')\n",
    "print(f'Test F1 at threshold: {f1_score(y_te, y_pred_opt):.4f}')\n",
    "print(f'Test F1 at default 0.5: {f1_score(y_te, (y_prob_te >= 0.5).astype(int)):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c12",
   "metadata": {},
   "source": [
    "### A5. SHAP Feature Importance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c13",
   "metadata": {},
   "outputs": [],
   "source": [
    "# XGBoost native SHAP values \u2014 exact tree SHAP\n",
    "booster = model_churn.get_booster()\n",
    "dmatrix = xgb.DMatrix(X_te)\n",
    "shap_values = booster.predict(dmatrix, pred_contribs=True)\n",
    "# shap_values shape: (n_samples, n_features + 1); last column is bias\n",
    "shap_main = shap_values[:, :-1]\n",
    "\n",
    "feat_names = [f'F{i}' for i in range(X_churn.shape[1])]\n",
    "\n",
    "# Mean absolute SHAP\n",
    "mean_abs_shap = np.abs(shap_main).mean(axis=0)\n",
    "order = np.argsort(mean_abs_shap)[::-1]\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 4))\n",
    "\n",
    "axes[0].barh([feat_names[i] for i in order[::-1]],\n",
    "             mean_abs_shap[order[::-1]],\n",
    "             color='#6366f1', alpha=0.85)\n",
    "axes[0].set_xlabel('Mean |SHAP Value|')\n",
    "axes[0].set_title('SHAP Feature Importance (Test Set)')\n",
    "\n",
    "# Beeswarm-style: SHAP value vs feature value for top 4 features\n",
    "top4 = order[:4]\n",
    "for rank, feat_idx in enumerate(top4):\n",
    "    sc = axes[1].scatter(\n",
    "        shap_main[:, feat_idx], [rank]*len(shap_main),\n",
    "        c=X_te[:, feat_idx], cmap='coolwarm',\n",
    "        alpha=0.3, s=8\n",
    "    )\n",
    "axes[1].set_yticks(range(4))\n",
    "axes[1].set_yticklabels([feat_names[i] for i in top4])\n",
    "axes[1].set_xlabel('SHAP Value (impact on log-odds)')\n",
    "axes[1].set_title('SHAP Beeswarm \u2014 Top 4 Features\\n(color = feature value: red=high, blue=low)')\n",
    "plt.colorbar(sc, ax=axes[1], label='Feature value')\n",
    "\n",
    "plt.tight_layout(); plt.show()"
   ]
  },
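  {
   "cell_type": "markdown",
   "id": "c13a",
   "metadata": {},
   "source": [
    "The `pred_contribs=True` output above has one column per feature plus a bias column, and each row sums to the model's raw margin. The sketch below illustrates that additivity property on a toy function using brute-force Shapley values. It is not the tree-specific algorithm XGBoost uses: `shapley_values` and `toy_model` are hypothetical names, and a single all-zeros baseline stands in for background data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c13b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from itertools import permutations\n",
    "\n",
    "def shapley_values(f, x, baseline):\n",
    "    # Exact Shapley values: average each feature's marginal contribution\n",
    "    # over every ordering in which features are switched from baseline to x.\n",
    "    n = len(x)\n",
    "    phi = [0.0] * n\n",
    "    orders = list(permutations(range(n)))\n",
    "    for order in orders:\n",
    "        z = list(baseline)\n",
    "        prev = f(z)\n",
    "        for j in order:\n",
    "            z[j] = x[j]\n",
    "            cur = f(z)\n",
    "            phi[j] += cur - prev\n",
    "            prev = cur\n",
    "    return [p / len(orders) for p in phi]\n",
    "\n",
    "def toy_model(z):\n",
    "    # Hypothetical margin function with an interaction between z[1] and z[2]\n",
    "    return 2.0 * z[0] + z[1] * z[2]\n",
    "\n",
    "x = [1.0, 2.0, 3.0]\n",
    "baseline = [0.0, 0.0, 0.0]\n",
    "phi = shapley_values(toy_model, x, baseline)\n",
    "bias = toy_model(baseline)\n",
    "print('contributions:', phi)\n",
    "print('bias + sum(contributions):', bias + sum(phi))\n",
    "print('model output:', toy_model(x))"
   ]
  },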
  {
   "cell_type": "markdown",
   "id": "c14",
   "metadata": {},
   "source": [
    "### A6. XGBoost vs sklearn GBM \u2014 Churn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c15",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import GradientBoostingClassifier\n",
    "from sklearn.metrics import average_precision_score, roc_auc_score\n",
    "\n",
    "# best_iteration is 0-based, so the matching number of trees is best_n + 1\n",
    "gbc_base = GradientBoostingClassifier(\n",
    "    n_estimators=best_n + 1, learning_rate=0.05,\n",
    "    max_depth=4, subsample=0.8, random_state=42\n",
    ")\n",
    "gbc_base.fit(X_tr, y_tr)\n",
    "y_prob_gbc = gbc_base.predict_proba(X_te)[:, 1]\n",
    "\n",
    "results = {\n",
    "    'XGBoost (scale_pos_weight)': {'auc_roc': roc_auc_score(y_te, y_prob_te),\n",
    "                                   'auc_pr':  average_precision_score(y_te, y_prob_te)},\n",
    "    'sklearn GBM (no imbalance)': {'auc_roc': roc_auc_score(y_te, y_prob_gbc),\n",
    "                                   'auc_pr':  average_precision_score(y_te, y_prob_gbc)},\n",
    "}\n",
    "\n",
    "print(f'{\"Model\":35s} {\"AUC-ROC\":>10} {\"AUC-PR\":>10}')\n",
    "for name, m in results.items():\n",
    "    print(f'{name:35s} {m[\"auc_roc\"]:10.4f} {m[\"auc_pr\"]:10.4f}')\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(10, 4))\n",
    "from sklearn.metrics import roc_curve\n",
    "for name, probs, col in [\n",
    "    ('XGBoost', y_prob_te, '#6366f1'),\n",
    "    ('sklearn GBM', y_prob_gbc, '#f59e0b')\n",
    "]:\n",
    "    fpr, tpr, _ = roc_curve(y_te, probs)\n",
    "    auc = roc_auc_score(y_te, probs)\n",
    "    ax.plot(fpr, tpr, color=col, linewidth=2, label=f'{name} (AUC={auc:.3f})')\n",
    "ax.plot([0,1],[0,1],'k--',linewidth=1)\n",
    "ax.set_xlabel('FPR'); ax.set_ylabel('TPR')\n",
    "ax.set_title('ROC Curve: XGBoost vs sklearn GBM on Imbalanced Churn')\n",
    "ax.legend(); plt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c16",
   "metadata": {},
   "source": [
    "## Scenario B \u2014 Revenue Regression with Missing Values"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c17",
   "metadata": {},
   "source": [
    "### B1. Dataset with Intentional NaNs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c18",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_regression\n",
    "\n",
    "X_rev, y_rev = make_regression(\n",
    "    n_samples=3000, n_features=10, n_informative=7,\n",
    "    noise=25.0, random_state=42\n",
    ")\n",
    "\n",
    "# Introduce 15% missing values randomly\n",
    "missing_mask = np.random.rand(*X_rev.shape) < 0.15\n",
    "X_rev_nan = X_rev.copy().astype(float)\n",
    "X_rev_nan[missing_mask] = np.nan\n",
    "\n",
    "Xr_tr, Xr_tmp, yr_tr, yr_tmp = train_test_split(\n",
    "    X_rev_nan, y_rev, test_size=0.3, random_state=42\n",
    ")\n",
    "Xr_val, Xr_te, yr_val, yr_te = train_test_split(\n",
    "    Xr_tmp, yr_tmp, test_size=0.5, random_state=42\n",
    ")\n",
    "\n",
    "nan_pct = np.isnan(X_rev_nan).mean() * 100\n",
    "print(f'Overall NaN rate: {nan_pct:.1f}%')\n",
    "print(f'Train: {Xr_tr.shape}  Val: {Xr_val.shape}  Test: {Xr_te.shape}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c19",
   "metadata": {},
   "source": [
    "### B2. XGBoost \u2014 Native NaN Handling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c20",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import mean_squared_error, r2_score\n",
    "\n",
    "reg_xgb = xgb.XGBRegressor(\n",
    "    n_estimators=400,\n",
    "    learning_rate=0.05,\n",
    "    max_depth=4,\n",
    "    subsample=0.8,\n",
    "    colsample_bytree=0.8,\n",
    "    reg_alpha=0.05,\n",
    "    reg_lambda=1.0,\n",
    "    early_stopping_rounds=30,\n",
    "    random_state=42,\n",
    "    verbosity=0\n",
    ")\n",
    "# XGBoost handles NaN natively \u2014 no imputation\n",
    "reg_xgb.fit(\n",
    "    Xr_tr, yr_tr,\n",
    "    eval_set=[(Xr_val, yr_val)],\n",
    "    verbose=False\n",
    ")\n",
    "\n",
    "y_pred_xgb = reg_xgb.predict(Xr_te)\n",
    "rmse_xgb = np.sqrt(mean_squared_error(yr_te, y_pred_xgb))\n",
    "r2_xgb   = r2_score(yr_te, y_pred_xgb)\n",
    "print(f'XGBoost (native NaN): RMSE={rmse_xgb:.4f}  R\u00b2={r2_xgb:.4f}')\n",
    "print(f'Best round: {reg_xgb.best_iteration}')"
   ]
  },
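  {
   "cell_type": "markdown",
   "id": "c20a",
   "metadata": {},
   "source": [
    "One way to picture the learned default direction: for a candidate split, try sending the rows with a missing value to the left child and then to the right child, and keep whichever gives the lower error. The sketch below is a simplified illustration under stated assumptions, not XGBoost's implementation: it scores a single fixed threshold with plain squared error instead of regularised gradient statistics, uses `None` in place of NaN, and `best_default_direction` and the data are made up for the example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c20b",
   "metadata": {},
   "outputs": [],
   "source": [
    "def sse(values):\n",
    "    # Sum of squared errors around the mean (0 for an empty group)\n",
    "    if not values:\n",
    "        return 0.0\n",
    "    m = sum(values) / len(values)\n",
    "    return sum((v - m) ** 2 for v in values)\n",
    "\n",
    "def best_default_direction(x, y, threshold):\n",
    "    # Route missing values (None) left, then right; keep the lower-error option.\n",
    "    left = [yi for xi, yi in zip(x, y) if xi is not None and xi < threshold]\n",
    "    right = [yi for xi, yi in zip(x, y) if xi is not None and xi >= threshold]\n",
    "    missing = [yi for xi, yi in zip(x, y) if xi is None]\n",
    "    err_left = sse(left + missing) + sse(right)\n",
    "    err_right = sse(left) + sse(right + missing)\n",
    "    return ('left', err_left) if err_left <= err_right else ('right', err_right)\n",
    "\n",
    "# Made-up data: rows with a missing feature tend to have high targets\n",
    "x = [1.0, 2.0, None, 8.0, 9.0, None]\n",
    "y = [10.0, 12.0, 31.0, 30.0, 32.0, 29.0]\n",
    "direction, err = best_default_direction(x, y, threshold=5.0)\n",
    "print(f'default direction for missing values: {direction} (SSE={err:.2f})')"
   ]
  },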
  {
   "cell_type": "markdown",
   "id": "c21",
   "metadata": {},
   "source": [
    "### B3. sklearn GBM Baseline (Requires Imputation)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c22",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import GradientBoostingRegressor\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.pipeline import Pipeline\n",
    "\n",
    "gbr_pipe = Pipeline([\n",
    "    ('imputer', SimpleImputer(strategy='mean')),\n",
    "    ('gbr', GradientBoostingRegressor(\n",
    "        n_estimators=reg_xgb.best_iteration + 1,  # best_iteration is 0-based\n",
    "        learning_rate=0.05,\n",
    "        max_depth=4,\n",
    "        subsample=0.8,\n",
    "        random_state=42\n",
    "    ))\n",
    "])\n",
    "gbr_pipe.fit(Xr_tr, yr_tr)\n",
    "y_pred_gbr = gbr_pipe.predict(Xr_te)\n",
    "rmse_gbr = np.sqrt(mean_squared_error(yr_te, y_pred_gbr))\n",
    "r2_gbr   = r2_score(yr_te, y_pred_gbr)\n",
    "print(f'sklearn GBM + mean impute: RMSE={rmse_gbr:.4f}  R\u00b2={r2_gbr:.4f}')\n",
    "\n",
    "# Comparison bar chart\n",
    "fig, axes = plt.subplots(1, 2, figsize=(11, 4))\n",
    "names = ['XGBoost\\n(native NaN)', 'sklearn GBM\\n+ imputation']\n",
    "rmses = [rmse_xgb, rmse_gbr]\n",
    "r2s   = [r2_xgb,   r2_gbr]\n",
    "\n",
    "axes[0].bar(names, rmses, color=['#6366f1','#f59e0b'], edgecolor='k', width=0.5)\n",
    "axes[0].set_ylabel('Test RMSE (lower is better)')\n",
    "axes[0].set_title('RMSE: XGBoost vs sklearn GBM\\n(15% missing values)')\n",
    "for i, v in enumerate(rmses):\n",
    "    axes[0].text(i, v + 0.5, f'{v:.2f}', ha='center')\n",
    "\n",
    "axes[1].bar(names, r2s, color=['#6366f1','#f59e0b'], edgecolor='k', width=0.5)\n",
    "axes[1].set_ylabel('Test R\u00b2 (higher is better)')\n",
    "axes[1].set_title('R\u00b2: XGBoost vs sklearn GBM')\n",
    "for i, v in enumerate(r2s):\n",
    "    axes[1].text(i, v - 0.02, f'{v:.3f}', ha='center', color='white', fontweight='bold')\n",
    "\n",
    "plt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c23",
   "metadata": {},
   "source": [
    "### B4. Residual and Actual vs Predicted Plots"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c24",
   "metadata": {},
   "outputs": [],
   "source": [
    "resid_xgb = yr_te - y_pred_xgb\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(13, 4))\n",
    "\n",
    "axes[0].scatter(y_pred_xgb, resid_xgb, alpha=0.3, s=10, color='#6366f1')\n",
    "axes[0].axhline(0, color='gray', linestyle='--')\n",
    "axes[0].set_xlabel('Predicted'); axes[0].set_ylabel('Residual')\n",
    "axes[0].set_title('Residuals vs Predicted (XGBoost)')\n",
    "\n",
    "lims = [min(yr_te.min(), y_pred_xgb.min()),\n",
    "        max(yr_te.max(), y_pred_xgb.max())]\n",
    "axes[1].scatter(yr_te, y_pred_xgb, alpha=0.3, s=10, color='#22c55e')\n",
    "axes[1].plot(lims, lims, 'k--', linewidth=1)\n",
    "axes[1].set_xlabel('Actual'); axes[1].set_ylabel('Predicted')\n",
    "axes[1].set_title(f'Actual vs Predicted (XGBoost)\\nR\u00b2 = {r2_xgb:.4f}')\n",
    "\n",
    "plt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c25",
   "metadata": {},
   "source": [
    "### B5. Feature Importance \u2014 Gain vs SHAP"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c26",
   "metadata": {},
   "outputs": [],
   "source": [
    "gain_imp = reg_xgb.get_booster().get_score(importance_type='gain')\n",
    "# get_score() omits features that were never used in a split, so default those to 0\n",
    "feat_names_r = [f'f{i}' for i in range(10)]\n",
    "gain_vals = [gain_imp.get(f'f{i}', 0.0) for i in range(10)]\n",
    "\n",
    "# SHAP for regressor\n",
    "dmat_te = xgb.DMatrix(Xr_te)\n",
    "shap_r = reg_xgb.get_booster().predict(dmat_te, pred_contribs=True)\n",
    "shap_r_main = shap_r[:, :-1]\n",
    "mean_abs_shap_r = np.abs(shap_r_main).mean(axis=0)\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 4))\n",
    "\n",
    "order_gain = np.argsort(gain_vals)[::-1]\n",
    "axes[0].bar(range(10), [gain_vals[i] for i in order_gain],\n",
    "            color='#6366f1', alpha=0.85)\n",
    "axes[0].set_xticks(range(10))\n",
    "axes[0].set_xticklabels([feat_names_r[i] for i in order_gain])\n",
    "axes[0].set_title('Gain-Based Feature Importance')\n",
    "axes[0].set_ylabel('Mean Gain')\n",
    "\n",
    "order_shap = np.argsort(mean_abs_shap_r)[::-1]\n",
    "axes[1].bar(range(10), mean_abs_shap_r[order_shap],\n",
    "            color='#22c55e', alpha=0.85)\n",
    "axes[1].set_xticks(range(10))\n",
    "axes[1].set_xticklabels([feat_names_r[i] for i in order_shap])\n",
    "axes[1].set_title('SHAP Mean |Value| Importance')\n",
    "axes[1].set_ylabel('Mean |SHAP|')\n",
    "\n",
    "plt.tight_layout(); plt.show()\n",
    "print('Top 5 gain: ', [feat_names_r[i] for i in order_gain[:5]])\n",
    "print('Top 5 SHAP: ', [feat_names_r[i] for i in order_shap[:5]])"
   ]
  },
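  {
   "cell_type": "markdown",
   "id": "c27",
   "metadata": {},
   "source": [
    "## Discussion\n",
    "\n",
    "1. **scale_pos_weight + threshold optimisation usually beats the default.** On imbalanced data, a model trained without reweighting and cut at the default 0.5 threshold tends to favour the majority class and miss churners. Setting scale_pos_weight to the negative-to-positive ratio (about 5.7 here) pushes predicted probabilities for the minority class upward, and choosing the F1-optimal threshold on the validation set then picks the operating point explicitly instead of relying on 0.5. The size of the gain depends on the data, so compare the two test F1 values printed in A4 rather than assuming a fixed improvement.\n",
    "\n",
    "2. **Native NaN handling is practical, not just convenient.** At each split XGBoost learns a default direction for missing values, so the fact that a value is missing can itself carry signal (as in MNAR, missing-not-at-random, patterns). The synthetic data here is MCAR (missing completely at random), so missingness carries no signal and XGBoost should land close to mean imputation; compare the RMSE and R\u00b2 values from B2 and B3. On real data where missingness is informative, native handling often outperforms simple imputation.\n",
    "\n",
    "3. **Early stopping is one of the highest-value tuning decisions.** Setting a generous upper bound on n_estimators (500 in Scenario A, 400 in Scenario B) with early_stopping_rounds=30 lets the validation metric choose the number of rounds, and the chosen round adapts automatically when learning_rate changes. Two caveats: best_iteration is 0-based, and a validation set that is also used for threshold selection (as in Scenario A) gives slightly optimistic validation scores, so report final numbers on the test set.\n",
    "\n",
    "4. **SHAP and gain importance usually agree on top features but can disagree on mid-tier ones.** Gain-based importance tends to favour features with many unique values, because continuous features offer more candidate split points. SHAP values are computed per prediction and sum, together with the bias term, to the model's raw output, which makes them easier to compare across features. For feature selection decisions, prefer SHAP over gain, and look at both when they disagree."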
   ]
  },
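  {
   "cell_type": "markdown",
   "id": "c27a",
   "metadata": {},
   "source": [
    "Point 1 pairs reweighting with threshold selection. A short sketch of why the two interact: for a model that exactly minimises log-loss with positives weighted by `w`, the predicted odds are the true odds multiplied by `w`. That is an idealised assumption (a regularised boosted model only approximates it), and `weighted_prob` and `unweighted_prob` are hypothetical helper names. Under this idealisation, a true churn probability equal to the 15% base rate maps to a reweighted score of 0.5, so the default threshold on the reweighted model acts roughly like a 0.15 threshold on the unweighted probabilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c27b",
   "metadata": {},
   "outputs": [],
   "source": [
    "def weighted_prob(p, w):\n",
    "    # Score that minimises log-loss when positives are weighted by w:\n",
    "    # the odds p / (1 - p) are multiplied by w.\n",
    "    return w * p / (w * p + (1.0 - p))\n",
    "\n",
    "def unweighted_prob(q, w):\n",
    "    # Inverse mapping: recover p from the reweighted score q.\n",
    "    return q / (q + w * (1.0 - q))\n",
    "\n",
    "w = 0.85 / 0.15  # negative-to-positive ratio for a 15% positive rate\n",
    "for p in (0.05, 0.15, 0.50):\n",
    "    q = weighted_prob(p, w)\n",
    "    print(f'true p={p:.2f} -> reweighted q={q:.3f} -> recovered p={unweighted_prob(q, w):.2f}')"
   ]
  },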
  {
   "cell_type": "markdown",
   "id": "c28",
   "metadata": {},
   "source": [
    "## Next Steps\n",
    "\n",
    "- **Article 11: LightGBM for Fast Gradient Boosting**: histogram-based splits, leaf-wise growth, and native categorical support often make LightGBM substantially faster on large datasets\n",
    "- **Article 12: CatBoost for Categorical Features**: symmetric trees and ordered target encoding handle high-cardinality categoricals without preprocessing\n",
    "- **Article 13: Stacking and Blending**: using XGBoost as a meta-learner on top of diverse base ensembles"
   ]
  }
 ]
}