{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-01",
   "metadata": {},
   "source": [
    "# Real-World Applications of Ensemble Learning\n",
    "\n",
    "This notebook demonstrates ensemble methods across five production ML domains: fraud detection, spam filtering, customer churn prediction, medical diagnosis, and demand forecasting. Each section uses a synthetic or built-in scikit-learn dataset as a stand-in for real domain data, reports domain-relevant metrics, and picks an ensemble configuration aimed at that domain's main statistical challenge."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "np.random.seed(42)\n",
    "\n",
    "from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.metrics import (\n",
    "    classification_report, confusion_matrix,\n",
    "    precision_recall_curve, average_precision_score,\n",
    "    roc_auc_score, f1_score, recall_score,\n",
    "    mean_absolute_error, mean_squared_error\n",
    ")\n",
    "\n",
    "import sklearn\n",
    "print(f'scikit-learn: {sklearn.__version__}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-03",
   "metadata": {},
   "source": [
    "---\n",
    "## DOMAIN 1: Fraud Detection\n",
    "**Challenge:** Extreme class imbalance (fraud is rare), high cost of false negatives, non-linear patterns.  \n",
    "**Ensemble strategy:** Gradient Boosting + threshold tuning for high recall. `GradientBoostingClassifier` has no `class_weight` parameter, so the imbalance is handled here by lowering the decision threshold rather than by reweighting classes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-04",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.ensemble import GradientBoostingClassifier\n",
    "\n",
    "# Simulate a 3% fraud rate (real card-fraud rates are often far lower)\n",
    "X_fraud, y_fraud = make_classification(\n",
    "    n_samples=5000, n_features=20, n_informative=10,\n",
    "    weights=[0.97, 0.03], flip_y=0.01,\n",
    "    random_state=42\n",
    ")\n",
    "\n",
    "X_tr, X_te, y_tr, y_te = train_test_split(\n",
    "    X_fraud, y_fraud, test_size=0.25, random_state=42, stratify=y_fraud\n",
    ")\n",
    "\n",
    "print(f'Train fraud rate: {y_tr.mean():.3f} ({y_tr.sum()} cases)')\n",
    "print(f'Test fraud rate:  {y_te.mean():.3f} ({y_te.sum()} cases)')\n",
    "\n",
    "# Baseline: logistic regression\n",
    "scaler = StandardScaler()\n",
    "X_tr_sc = scaler.fit_transform(X_tr)\n",
    "X_te_sc = scaler.transform(X_te)\n",
    "\n",
    "lr_fraud = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)\n",
    "lr_fraud.fit(X_tr_sc, y_tr)\n",
    "lr_preds = lr_fraud.predict(X_te_sc)\n",
    "\n",
    "print('\\nBaseline (Logistic Regression):')\n",
    "print(classification_report(y_te, lr_preds, target_names=['Legit', 'Fraud']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-05",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import precision_score\n",
    "\n",
    "# Gradient Boosting with low threshold for high recall\n",
    "gb_fraud = GradientBoostingClassifier(\n",
    "    n_estimators=200, max_depth=3, learning_rate=0.05,\n",
    "    subsample=0.8, random_state=42\n",
    ")\n",
    "gb_fraud.fit(X_tr_sc, y_tr)\n",
    "proba_fraud = gb_fraud.predict_proba(X_te_sc)[:, 1]\n",
    "\n",
    "# Threshold sweep: recall, precision, and F1 at each candidate threshold\n",
    "thresholds = np.arange(0.05, 0.55, 0.05)\n",
    "records = []\n",
    "for t in thresholds:\n",
    "    p = (proba_fraud >= t).astype(int)\n",
    "    records.append({\n",
    "        'threshold': t,\n",
    "        'recall':    recall_score(y_te, p, zero_division=0),\n",
    "        'precision': precision_score(y_te, p, zero_division=0),\n",
    "        'f1':        f1_score(y_te, p, zero_division=0),\n",
    "    })\n",
    "\n",
    "thresh_df = pd.DataFrame(records)\n",
    "print(thresh_df.round(3).to_string(index=False))\n",
    "\n",
    "# Apply the chosen threshold (balance recall/precision).\n",
    "# Caveat: 0.20 is picked by inspecting test-set metrics, which is optimistic;\n",
    "# in practice, tune the threshold on a separate validation split.\n",
    "chosen_t = 0.20\n",
    "final_preds = (proba_fraud >= chosen_t).astype(int)\n",
    "print(f'\\nEnsemble @ threshold={chosen_t}:')\n",
    "print(classification_report(y_te, final_preds, target_names=['Legit', 'Fraud']))"
   ]
  },
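  {
   "cell_type": "markdown",
   "id": "cell-05a",
   "metadata": {},
   "source": [
    "`GradientBoostingClassifier` takes no `class_weight` argument, but its `fit` method accepts `sample_weight`. As a rough sketch of how class weighting could be combined with boosting on an imbalanced problem like the one above, the next cell weights samples with `compute_sample_weight('balanced', ...)`. The smaller dataset, the hyperparameters, and the `*_w` names are assumptions made for illustration, not part of the experiment above, and whether weighting beats threshold tuning should be checked on a validation split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-05b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_classification\n",
    "from sklearn.ensemble import GradientBoostingClassifier\n",
    "from sklearn.metrics import recall_score\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.utils.class_weight import compute_sample_weight\n",
    "\n",
    "# Illustrative imbalanced data (assumption: roughly 3% positives, as above)\n",
    "X_w, y_w = make_classification(\n",
    "    n_samples=2000, n_features=20, n_informative=10,\n",
    "    weights=[0.97, 0.03], random_state=0\n",
    ")\n",
    "X_w_tr, X_w_te, y_w_tr, y_w_te = train_test_split(\n",
    "    X_w, y_w, test_size=0.25, random_state=0, stratify=y_w\n",
    ")\n",
    "\n",
    "# 'balanced' weights each sample inversely to its class frequency\n",
    "w_tr = compute_sample_weight('balanced', y_w_tr)\n",
    "\n",
    "gb_plain = GradientBoostingClassifier(n_estimators=100, random_state=0)\n",
    "gb_plain.fit(X_w_tr, y_w_tr)\n",
    "\n",
    "gb_weighted = GradientBoostingClassifier(n_estimators=100, random_state=0)\n",
    "gb_weighted.fit(X_w_tr, y_w_tr, sample_weight=w_tr)\n",
    "\n",
    "# Compare minority-class recall at the default 0.5 threshold\n",
    "for label, model in [('unweighted', gb_plain), ('balanced weights', gb_weighted)]:\n",
    "    rec = recall_score(y_w_te, model.predict(X_w_te), zero_division=0)\n",
    "    print(f'{label:17s} recall(positive) = {rec:.3f}')"
   ]
  },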
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-06",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Precision-Recall curve\n",
    "prec, rec, thresh = precision_recall_curve(y_te, proba_fraud)\n",
    "ap = average_precision_score(y_te, proba_fraud)\n",
    "\n",
    "lr_proba = lr_fraud.predict_proba(X_te_sc)[:, 1]\n",
    "prec_lr, rec_lr, _ = precision_recall_curve(y_te, lr_proba)\n",
    "ap_lr = average_precision_score(y_te, lr_proba)\n",
    "\n",
    "plt.figure(figsize=(8, 5))\n",
    "plt.plot(rec, prec, color='#6366f1', linewidth=2, label=f'Gradient Boosting (AP={ap:.3f})')\n",
    "plt.plot(rec_lr, prec_lr, color='#94a3b8', linewidth=1.5, linestyle='--',\n",
    "         label=f'Logistic Regression (AP={ap_lr:.3f})')\n",
    "# The vertical line marks the recall reached at the chosen threshold\n",
    "plt.axvline(x=recall_score(y_te, final_preds), color='#ef4444', linestyle=':',\n",
    "            label=f'Recall at threshold {chosen_t}')\n",
    "plt.xlabel('Recall (Fraud)'); plt.ylabel('Precision (Fraud)')\n",
    "plt.title('Fraud Detection: Precision-Recall Curve')\n",
    "plt.legend(); plt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-07",
   "metadata": {},
   "source": [
    "---\n",
    "## DOMAIN 2: Spam Filtering\n",
    "**Challenge:** Noisy labels, high-dimensional features, need for fast inference.  \n",
    "**Ensemble strategy:** Random Forest, which averages many decorrelated trees so that no single tree's fit to noisy labels dominates, and whose trees can be evaluated in parallel at inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-08",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_breast_cancer  # binary, high-dim proxy for spam\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html\n",
    "spam_data = load_breast_cancer()\n",
    "X_spam, y_spam = spam_data.data, spam_data.target\n",
    "\n",
    "# Split on clean labels first, then inject label noise into the training labels only\n",
    "# (simulates spam mislabelling while keeping the test labels trustworthy)\n",
    "Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(\n",
    "    X_spam, y_spam, test_size=0.2, random_state=42, stratify=y_spam\n",
    ")\n",
    "noise_idx = np.random.choice(len(ys_tr), size=int(0.08 * len(ys_tr)), replace=False)\n",
    "ys_tr = ys_tr.copy()\n",
    "ys_tr[noise_idx] = 1 - ys_tr[noise_idx]\n",
    "print(f'Label noise injected: {len(noise_idx)} training samples ({100*len(noise_idx)/len(ys_tr):.1f}%)')\n",
    "\n",
    "# Baseline: Naive Bayes\n",
    "nb = GaussianNB()\n",
    "nb.fit(Xs_tr, ys_tr)\n",
    "print('\\nNaive Bayes:')\n",
    "print(classification_report(ys_te, nb.predict(Xs_te), target_names=['Ham','Spam']))\n",
    "\n",
    "# Ensemble: Random Forest\n",
    "rf_spam = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)\n",
    "rf_spam.fit(Xs_tr, ys_tr)\n",
    "print('Random Forest:')\n",
    "print(classification_report(ys_te, rf_spam.predict(Xs_te), target_names=['Ham','Spam']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-09",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show noise robustness: F1 vs noise level\n",
    "noise_levels = [0.0, 0.02, 0.05, 0.08, 0.12, 0.15, 0.20]\n",
    "nb_f1_scores, rf_f1_scores = [], []\n",
    "\n",
    "# Split once on clean labels; noise goes into the training labels only,\n",
    "# so every noise level is scored against the same clean test set\n",
    "Xn_tr, Xn_te, yn_clean, yn_te = train_test_split(X_spam, y_spam, test_size=0.2,\n",
    "                                                  random_state=42, stratify=y_spam)\n",
    "\n",
    "for noise in noise_levels:\n",
    "    yn_tr = yn_clean.copy()\n",
    "    if noise > 0:\n",
    "        ni = np.random.choice(len(yn_tr), size=int(noise * len(yn_tr)), replace=False)\n",
    "        yn_tr[ni] = 1 - yn_tr[ni]\n",
    "    nb.fit(Xn_tr, yn_tr)\n",
    "    rf_spam.fit(Xn_tr, yn_tr)\n",
    "    nb_f1_scores.append(f1_score(yn_te, nb.predict(Xn_te)))\n",
    "    rf_f1_scores.append(f1_score(yn_te, rf_spam.predict(Xn_te)))\n",
    "\n",
    "plt.figure(figsize=(8, 4))\n",
    "plt.plot([n*100 for n in noise_levels], nb_f1_scores, 'o--', label='Naive Bayes', color='#94a3b8')\n",
    "plt.plot([n*100 for n in noise_levels], rf_f1_scores, 's-',  label='Random Forest', color='#6366f1', linewidth=2)\n",
    "plt.xlabel('Label Noise Level (%)')\n",
    "plt.ylabel('F1 Score')\n",
    "plt.title('Spam Filtering: F1 vs Label Noise\\n(Random Forest degrades more gracefully)')\n",
    "plt.legend(); plt.tight_layout(); plt.show()"
   ]
  },
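  {
   "cell_type": "markdown",
   "id": "cell-09a",
   "metadata": {},
   "source": [
    "scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than by counting hard votes. The next cell is a small check of that behaviour on the same built-in dataset: it compares the forest's `predict_proba` with the mean of the individual trees' outputs. The `*_chk` names and the choice of 25 trees are illustrative assumptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-09b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.datasets import load_breast_cancer\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "X_chk, y_chk = load_breast_cancer(return_X_y=True)\n",
    "rf_chk = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_chk, y_chk)\n",
    "\n",
    "# Average the per-tree class probabilities by hand\n",
    "tree_mean = np.mean([tree.predict_proba(X_chk) for tree in rf_chk.estimators_], axis=0)\n",
    "\n",
    "# Compare with what the forest itself reports\n",
    "matches = np.allclose(tree_mean, rf_chk.predict_proba(X_chk))\n",
    "print(f'Forest probabilities equal the mean over trees: {matches}')"
   ]
  },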
  {
   "cell_type": "markdown",
   "id": "cell-10",
   "metadata": {},
   "source": [
    "---\n",
    "## DOMAIN 3: Customer Churn Prediction\n",
    "**Challenge:** Correlated features, need for business interpretability, asymmetric costs.  \n",
    "**Ensemble strategy:** Gradient Boosting with feature importance for stakeholder communication."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-11",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html\n",
    "X_churn, y_churn = make_classification(\n",
    "    n_samples=3000, n_features=15, n_informative=8,\n",
    "    n_redundant=4, weights=[0.80, 0.20],\n",
    "    flip_y=0.02, random_state=42\n",
    ")\n",
    "\n",
    "feature_names = [\n",
    "    'tenure_months', 'monthly_charges', 'total_charges',\n",
    "    'support_tickets', 'num_products', 'last_login_days',\n",
    "    'contract_length', 'payment_method', 'avg_session_mins',\n",
    "    'downloads_per_month', 'referrals', 'plan_tier',\n",
    "    'region', 'device_type', 'promo_used'\n",
    "]\n",
    "\n",
    "Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(\n",
    "    X_churn, y_churn, test_size=0.2, random_state=42, stratify=y_churn\n",
    ")\n",
    "\n",
    "# Baseline\n",
    "lr_churn = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)\n",
    "scaler_c = StandardScaler()\n",
    "lr_churn.fit(scaler_c.fit_transform(Xc_tr), yc_tr)\n",
    "print('Logistic Regression:')\n",
    "print(classification_report(yc_te, lr_churn.predict(scaler_c.transform(Xc_te)),\n",
    "                             target_names=['Retained', 'Churned']))\n",
    "\n",
    "# Gradient Boosting\n",
    "gb_churn = GradientBoostingClassifier(\n",
    "    n_estimators=150, max_depth=4, learning_rate=0.05,\n",
    "    subsample=0.8, min_samples_leaf=20, random_state=42\n",
    ")\n",
    "gb_churn.fit(Xc_tr, yc_tr)\n",
    "print('Gradient Boosting:')\n",
    "print(classification_report(yc_te, gb_churn.predict(Xc_te),\n",
    "                             target_names=['Retained', 'Churned']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-12",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feature importance \u2014 key for stakeholder communication\n",
    "importance_df = pd.DataFrame({\n",
    "    'feature': feature_names,\n",
    "    'importance': gb_churn.feature_importances_\n",
    "}).sort_values('importance', ascending=True)\n",
    "\n",
    "plt.figure(figsize=(9, 6))\n",
    "plt.barh(importance_df['feature'], importance_df['importance'], color='#6366f1', alpha=0.85)\n",
    "plt.xlabel('Feature Importance (Gradient Boosting)')\n",
    "plt.title('Churn Prediction: Which Signals Drive the Model?')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
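  {
   "cell_type": "markdown",
   "id": "cell-12a",
   "metadata": {},
   "source": [
    "Impurity-based `feature_importances_` are computed from the training data and can split credit unevenly among correlated features. One possible cross-check is permutation importance on held-out data, which measures how much the score drops when a single column is shuffled. The next cell is a sketch on a freshly generated synthetic dataset; the `*_pi` names, the generic `x0..x7` feature labels, and all settings are assumptions for illustration, not the configuration used above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-12b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.ensemble import GradientBoostingClassifier\n",
    "from sklearn.inspection import permutation_importance\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Illustrative data with redundant (correlated) columns\n",
    "X_pi, y_pi = make_classification(\n",
    "    n_samples=1000, n_features=8, n_informative=4, n_redundant=2,\n",
    "    random_state=0\n",
    ")\n",
    "X_pi_tr, X_pi_te, y_pi_tr, y_pi_te = train_test_split(\n",
    "    X_pi, y_pi, test_size=0.25, random_state=0, stratify=y_pi\n",
    ")\n",
    "\n",
    "gb_pi = GradientBoostingClassifier(n_estimators=100, random_state=0)\n",
    "gb_pi.fit(X_pi_tr, y_pi_tr)\n",
    "\n",
    "# Shuffle each held-out column 10 times and record the mean drop in accuracy\n",
    "perm = permutation_importance(gb_pi, X_pi_te, y_pi_te, n_repeats=10, random_state=0)\n",
    "\n",
    "compare_pi = pd.DataFrame({\n",
    "    'feature': [f'x{i}' for i in range(X_pi.shape[1])],\n",
    "    'impurity_importance': gb_pi.feature_importances_,\n",
    "    'permutation_importance': perm.importances_mean,\n",
    "}).sort_values('permutation_importance', ascending=False)\n",
    "print(compare_pi.round(3).to_string(index=False))"
   ]
  },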
  {
   "cell_type": "markdown",
   "id": "cell-13",
   "metadata": {},
   "source": [
    "---\n",
    "## DOMAIN 4: Medical Diagnosis\n",
    "**Challenge:** High cost of false negatives (missed disease), need for calibrated probability output.  \n",
    "**Ensemble strategy:** Soft VotingClassifier, which averages the predicted probabilities of diverse models; calibration plots then check whether those probabilities are trustworthy enough for clinical thresholding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-14",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.svm import SVC\n",
    "from sklearn.ensemble import RandomForestClassifier, VotingClassifier\n",
    "from sklearn.calibration import CalibrationDisplay\n",
    "\n",
    "# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html\n",
    "med_data = load_breast_cancer()\n",
    "X_med, y_med = med_data.data, med_data.target  # 0=malignant, 1=benign\n",
    "\n",
    "Xm_tr, Xm_te, ym_tr, ym_te = train_test_split(\n",
    "    X_med, y_med, test_size=0.2, random_state=42, stratify=y_med\n",
    ")\n",
    "scaler_m = StandardScaler()\n",
    "Xm_tr_sc = scaler_m.fit_transform(Xm_tr)\n",
    "Xm_te_sc = scaler_m.transform(Xm_te)\n",
    "\n",
    "# Three diverse base learners\n",
    "lr_med = LogisticRegression(max_iter=1000, random_state=42)\n",
    "svm_med = SVC(probability=True, kernel='rbf', random_state=42)\n",
    "rf_med  = RandomForestClassifier(n_estimators=100, random_state=42)\n",
    "\n",
    "# Soft voting ensemble\n",
    "voting_med = VotingClassifier(\n",
    "    estimators=[('lr', lr_med), ('svm', svm_med), ('rf', rf_med)],\n",
    "    voting='soft'\n",
    ")\n",
    "\n",
    "# Compare all\n",
    "for name, model in [('Logistic Regression', lr_med),\n",
    "                     ('SVM (RBF)', svm_med),\n",
    "                     ('Random Forest', rf_med),\n",
    "                     ('Soft Voting', voting_med)]:\n",
    "    model.fit(Xm_tr_sc, ym_tr)\n",
    "    preds = model.predict(Xm_te_sc)\n",
    "    recall_mal = recall_score(ym_te, preds, pos_label=0)  # malignant=0, critical class\n",
    "    print(f'{name:22s}: recall(malignant)={recall_mal:.4f}  F1={f1_score(ym_te,preds):.4f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-15",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calibration: can we trust the probabilities for clinical thresholding?\n",
    "fig, axes = plt.subplots(1, 3, figsize=(14, 4))\n",
    "for ax, (name, model) in zip(axes, [('Logistic Regression', lr_med),\n",
    "                                      ('Random Forest', rf_med),\n",
    "                                      ('Soft Voting', voting_med)]):\n",
    "    CalibrationDisplay.from_estimator(model, Xm_te_sc, ym_te,\n",
    "                                       n_bins=8, ax=ax, name=name)\n",
    "    ax.set_title(f'Calibration: {name}')\n",
    "\n",
    "plt.suptitle('Medical Diagnosis: Probability Calibration\\n(closer to diagonal \u2192 more trustworthy for threshold setting)')\n",
    "plt.tight_layout(); plt.show()"
   ]
  },
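  {
   "cell_type": "markdown",
   "id": "cell-15a",
   "metadata": {},
   "source": [
    "If a calibration curve sits far from the diagonal, one common remedy is to wrap the model in `CalibratedClassifierCV`. The next cell is a sketch under assumptions: it reuses the built-in breast cancer dataset with its own split, uses hypothetical `*_cal` names, and compares Brier scores (lower is better) for a random forest before and after isotonic calibration. On a dataset this small, isotonic calibration can overfit, so the comparison is illustrative rather than a claim that calibration always helps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-15b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.calibration import CalibratedClassifierCV\n",
    "from sklearn.datasets import load_breast_cancer\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.metrics import brier_score_loss\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_cal, y_cal = load_breast_cancer(return_X_y=True)\n",
    "X_cal_tr, X_cal_te, y_cal_tr, y_cal_te = train_test_split(\n",
    "    X_cal, y_cal, test_size=0.2, random_state=0, stratify=y_cal\n",
    ")\n",
    "\n",
    "# Uncalibrated forest\n",
    "rf_raw = RandomForestClassifier(n_estimators=100, random_state=0)\n",
    "rf_raw.fit(X_cal_tr, y_cal_tr)\n",
    "\n",
    "# Same forest wrapped in isotonic calibration, fitted with 5-fold CV on the training data\n",
    "rf_cal = CalibratedClassifierCV(\n",
    "    RandomForestClassifier(n_estimators=100, random_state=0),\n",
    "    method='isotonic', cv=5\n",
    ")\n",
    "rf_cal.fit(X_cal_tr, y_cal_tr)\n",
    "\n",
    "for label, model in [('raw forest', rf_raw), ('isotonic-calibrated', rf_cal)]:\n",
    "    proba = model.predict_proba(X_cal_te)[:, 1]\n",
    "    print(f'{label:20s} Brier score = {brier_score_loss(y_cal_te, proba):.4f}')"
   ]
  },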
  {
   "cell_type": "markdown",
   "id": "cell-16",
   "metadata": {},
   "source": [
    "---\n",
    "## DOMAIN 5: Demand Forecasting\n",
    "**Challenge:** High variance in regression trees, need for reliable point estimates, MAE/MAPE metrics.  \n",
    "**Ensemble strategy:** Bagging Regressor \u2014 variance reduction for stable predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-17",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_regression\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor\n",
    "from sklearn.linear_model import Ridge\n",
    "\n",
    "# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html\n",
    "# Synthetic regression dataset: proxy for demand forecasting (no network needed)\n",
    "X_demand, y_demand = make_regression(\n",
    "    n_samples=5000, n_features=8, n_informative=6, noise=30.0, random_state=42\n",
    ")\n",
    "\n",
    "Xd_tr, Xd_te, yd_tr, yd_te = train_test_split(\n",
    "    X_demand, y_demand, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "def mape(y_true, y_pred):\n",
    "    \"\"\"Mean Absolute Percentage Error, in percent.\"\"\"\n",
    "    # Floor |y_true| at a small epsilon so the denominator cannot vanish.\n",
    "    # make_regression targets are centered near zero, so MAPE is unstable here;\n",
    "    # treat MAE and RMSE as the primary metrics.\n",
    "    return np.mean(np.abs(y_true - y_pred) / np.maximum(np.abs(y_true), 1e-8)) * 100\n",
    "\n",
    "demand_models = {\n",
    "    'Single Tree':         DecisionTreeRegressor(random_state=42),\n",
    "    'Ridge Regression':    Ridge(alpha=1.0),\n",
    "    # Base estimator passed positionally: the keyword is `estimator` in\n",
    "    # scikit-learn >= 1.2 and `base_estimator` in earlier releases\n",
    "    'Bagging (50 trees)':  BaggingRegressor(\n",
    "        DecisionTreeRegressor(random_state=42),\n",
    "        n_estimators=50, random_state=42\n",
    "    ),\n",
    "    'Gradient Boosting':   GradientBoostingRegressor(\n",
    "        n_estimators=200, max_depth=4, learning_rate=0.05,\n",
    "        subsample=0.8, random_state=42\n",
    "    ),\n",
    "}\n",
    "\n",
    "print(f'{\"Model\":25s}  MAE      RMSE     MAPE(%)')\n",
    "print('-' * 55)\n",
    "for name, model in demand_models.items():\n",
    "    model.fit(Xd_tr, yd_tr)\n",
    "    preds = model.predict(Xd_te)\n",
    "    mae  = mean_absolute_error(yd_te, preds)\n",
    "    rmse = mean_squared_error(yd_te, preds) ** 0.5\n",
    "    mp   = mape(yd_te, preds)\n",
    "    print(f'{name:25s}  {mae:.4f}  {rmse:.4f}  {mp:.2f}')"
   ]
  },
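  {
   "cell_type": "markdown",
   "id": "cell-17a",
   "metadata": {},
   "source": [
    "MAPE divides each error by the actual value, so a few targets near zero can dominate the average. The next cell illustrates this with made-up toy numbers. `mean_absolute_percentage_error` is scikit-learn's built-in MAPE (it returns a fraction, not a percentage); `wape` is a hypothetical helper for a weighted absolute percentage error, introduced here only for comparison and not defined elsewhere in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-17b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.metrics import mean_absolute_percentage_error\n",
    "\n",
    "# Made-up toy values: two large targets and two targets close to zero\n",
    "y_true_toy = np.array([100.0, 50.0, 0.5, -0.5])\n",
    "y_pred_toy = np.array([110.0, 45.0, 5.5, 4.5])\n",
    "\n",
    "def wape(y_true, y_pred):\n",
    "    # Hypothetical helper: total absolute error over total absolute actuals, in percent\n",
    "    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)) * 100\n",
    "\n",
    "# scikit-learn's MAPE is a fraction, so multiply by 100 for a percentage\n",
    "print(f'MAPE: {mean_absolute_percentage_error(y_true_toy, y_pred_toy) * 100:.1f}%')\n",
    "print(f'WAPE: {wape(y_true_toy, y_pred_toy):.1f}%')"
   ]
  },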
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-18",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prediction distribution: single tree vs bagging vs gradient boosting\n",
    "fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n",
    "plot_names = ['Single Tree', 'Bagging (50 trees)', 'Gradient Boosting']\n",
    "\n",
    "for ax, name in zip(axes, plot_names):\n",
    "    model = demand_models[name]\n",
    "    preds = model.predict(Xd_te)\n",
    "    ax.scatter(yd_te, preds, alpha=0.3, s=10, color='#6366f1')\n",
    "    ax.plot([yd_te.min(), yd_te.max()], [yd_te.min(), yd_te.max()],\n",
    "            'r--', linewidth=1.5, label='Perfect')\n",
    "    mae  = mean_absolute_error(yd_te, preds)\n",
    "    ax.set_title(f'{name}\\nMAE={mae:.3f}')\n",
    "    ax.set_xlabel('Actual'); ax.set_ylabel('Predicted')\n",
    "    ax.legend(fontsize=8)\n",
    "\n",
    "plt.suptitle('Demand Forecasting: Predicted vs Actual', fontsize=12)\n",
    "plt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-19",
   "metadata": {},
   "source": [
    "## Cross-Domain Summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-20",
   "metadata": {},
   "outputs": [],
   "source": [
    "summary = pd.DataFrame([\n",
    "    {'Domain': 'Fraud Detection',    'Challenge': 'Class imbalance',      'Ensemble': 'Gradient Boosting', 'Key Metric': 'Recall (fraud)'},\n",
    "    {'Domain': 'Spam Filtering',     'Challenge': 'Label noise',          'Ensemble': 'Random Forest',     'Key Metric': 'F1'},\n",
    "    {'Domain': 'Churn Prediction',   'Challenge': 'Correlated features',  'Ensemble': 'Gradient Boosting', 'Key Metric': 'Recall (churn)'},\n",
    "    {'Domain': 'Medical Diagnosis',  'Challenge': 'High recall required', 'Ensemble': 'Soft Voting',       'Key Metric': 'Recall (malignant)'},\n",
    "    {'Domain': 'Demand Forecasting', 'Challenge': 'High variance',        'Ensemble': 'Bagging',           'Key Metric': 'MAPE'},\n",
    "])\n",
    "\n",
    "print(summary.to_string(index=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-21",
   "metadata": {},
   "source": [
    "## Discussion\n",
    "\n",
    "The five domains illustrate that ensemble methods are not a one-size-fits-all tool: the right configuration is domain-specific.\n",
    "\n",
    "1. **Fraud**: The key gain is not accuracy (which is dominated by the majority class) but recall on fraud. Lowering the gradient boosting decision threshold trades precision for recall, and the threshold table shows by how much. The precision-recall curve and average precision compare the two models across all thresholds. Note that the logistic regression baseline already uses `class_weight='balanced'`, so its recall at the default threshold can be high as well.\n",
    "\n",
    "2. **Spam**: The noise-robustness experiment shows how each model's F1 changes as label noise increases. A random forest averages the class probabilities of 100 decorrelated trees, so trees that overfit particular noisy labels tend to be outweighed by the rest, which is why the forest is expected to degrade more gracefully.\n",
    "\n",
    "3. **Churn**: Feature importance gives stakeholders a ranked list of signals, which is often easier to act on than a headline metric: \"tenure_months is the top predictor\" is actionable in a way that \"the ROC-AUC improved by 2%\" is not. Two caveats apply here: the feature names are illustrative labels attached to synthetic columns, and impurity-based importances can split credit unevenly among correlated features.\n",
    "\n",
    "4. **Medical**: The calibration plots reveal which models' probability outputs can be trusted for threshold selection. A model with a recall of 0.97 but poorly calibrated probabilities cannot be reliably used for clinical decision support. Soft voting averages the base models' probabilities, which can smooth out individual miscalibration but does not guarantee a calibrated result, so the voting ensemble's curve should be checked against the diagonal like any other.\n",
    "\n",
    "5. **Forecasting**: The scatter plots make the variance reduction from bagging visible: the single tree's predictions scatter more widely around the diagonal, while the bagged ensemble and gradient boosting cluster more tightly. Because the synthetic target is centered near zero, MAPE is unstable on this dataset, so MAE and RMSE are the more reliable comparison."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-22",
   "metadata": {},
   "source": [
    "## Next Steps\n",
    "\n",
    "Part 1 of the series is complete. Part 2 begins with boosting:\n",
    "\n",
    "- **Boosting Explained Simply with Python** \u2014 the core sequential learning idea\n",
    "- **AdaBoost in Python** \u2014 the first practical boosting algorithm\n",
    "- **Gradient Boosting in Python for Structured Data** \u2014 the workhorse of modern ML\n",
    "- **XGBoost for Real Business Problems** \u2014 production-grade implementation\n",
    "- **Handling Imbalanced Datasets in Python** \u2014 extending the fraud domain from this notebook\n",
    "\n",
    "All notebooks are self-contained and build on the concepts established in this Part 1 series."
   ]
  }
 ]
}