{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
  "language_info": {"name": "python", "version": "3.10.0"}
 },
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c01",
   "metadata": {},
   "source": ["# Boosting with Noisy Data: Challenges and Fixes\n\nThis notebook demonstrates how AdaBoost concentrates sample weight on mislabelled examples, compares its test accuracy with gradient boosting and a random forest at 0%, 5%, 15%, and 25% label noise, and examines two mitigations at 15% noise: stopping boosting early and filtering high-loss samples."]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c02",
   "metadata": {},
   "outputs": [],
   "source": ["import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport warnings\nwarnings.filterwarnings('ignore')\nnp.random.seed(42)\nimport sklearn\nprint(f'sklearn {sklearn.__version__}')"]
  },
  {
   "cell_type": "markdown",
   "id": "c03",
   "metadata": {},
   "source": ["## 1. Dataset + Noise Injection"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c04",
   "metadata": {},
   "outputs": [],
   "source": ["# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html\nfrom sklearn.datasets import make_classification\nfrom sklearn.model_selection import train_test_split\n\nX, y_clean = make_classification(\n    n_samples=2000, n_features=20, n_informative=15,\n    n_redundant=3, flip_y=0.0, random_state=42\n)\n\nX_tr_raw, X_te, y_tr_clean, y_te = train_test_split(\n    X, y_clean, test_size=0.25, random_state=42, stratify=y_clean\n)\n\ndef add_label_noise(y, noise_rate, random_state=42):\n    \"\"\"Flip a random fraction of training labels.\"\"\"\n    rng = np.random.RandomState(random_state)\n    noisy_y = y.copy()\n    n_noisy = int(noise_rate * len(y))\n    noisy_idx = rng.choice(len(y), n_noisy, replace=False)\n    noisy_y[noisy_idx] = 1 - noisy_y[noisy_idx]\n    mask = np.zeros(len(y), dtype=bool)\n    mask[noisy_idx] = True\n    return noisy_y, mask\n\nprint(f'Train: {X_tr_raw.shape}  Test: {X_te.shape}')\nfor rate in [0.05, 0.15, 0.25]:\n    yn, mask = add_label_noise(y_tr_clean, rate)\n    print(f'Noise {rate*100:.0f}%: {mask.sum()} flipped labels')"]
  },
  {
   "cell_type": "markdown",
   "id": "c05",
   "metadata": {},
   "source": ["## 2. EDA — Clean vs Noisy Label Distribution"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c06",
   "metadata": {},
   "outputs": [],
   "source": ["noise_rate = 0.15\ny_tr_noisy, noise_mask = add_label_noise(y_tr_clean, noise_rate)\n\nfig, axes = plt.subplots(1, 3, figsize=(16, 4))\n\n# Class balance\nfor ax, y, title in [\n    (axes[0], y_tr_clean, 'Clean Labels'),\n    (axes[1], y_tr_noisy, f'Noisy Labels ({noise_rate*100:.0f}%)')\n]:\n    ax.bar(['Class 0','Class 1'], np.bincount(y),\n           color=['#6366f1','#22c55e'], edgecolor='k')\n    ax.set_title(title); ax.set_ylabel('Count')\n\n# Noisy sample location in feature space\naxes[2].scatter(X_tr_raw[~noise_mask, 0], X_tr_raw[~noise_mask, 1],\n                c=y_tr_clean[~noise_mask], cmap='RdYlBu',\n                s=15, alpha=0.5, label='Clean')\naxes[2].scatter(X_tr_raw[noise_mask, 0], X_tr_raw[noise_mask, 1],\n                c='black', s=50, marker='x', label='Noisy', zorder=5)\naxes[2].set_title('Noisy Samples in Feature Space (Feature 0 vs 1)')\naxes[2].legend()\nplt.tight_layout(); plt.show()"]
  },
  {
   "cell_type": "markdown",
   "id": "c07",
   "metadata": {},
   "source": ["## 3. AdaBoost Weight Concentration Under Noise"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c08",
   "metadata": {},
   "outputs": [],
   "source": ["from sklearn.ensemble import AdaBoostClassifier\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.metrics import accuracy_score\n\n# Optional helper: records the sample weights at the start of each round.\n# It overrides the private _boost hook, which may change between sklearn\n# versions, so the cells below do not depend on it.\nclass WeightTracker(AdaBoostClassifier):\n    \"\"\"AdaBoost subclass that records sample weights at each round.\"\"\"\n    def fit(self, X, y, sample_weight=None):\n        self._weight_history = []\n        return super().fit(X, y, sample_weight=sample_weight)\n\n    def _boost(self, iboost, X, y, sample_weight, random_state):\n        # Copy first: the parent may update sample_weight in place\n        self._weight_history.append(sample_weight.copy())\n        return super()._boost(iboost, X, y, sample_weight, random_state)\n\n# Use 15% noise for the demonstration\ny_noisy_15, mask_15 = add_label_noise(y_tr_clean, 0.15)\n\nada = AdaBoostClassifier(\n    estimator=DecisionTreeClassifier(max_depth=1),\n    n_estimators=100, algorithm='SAMME', random_state=42\n)\nada.fit(X_tr_raw, y_noisy_15)\n\n# Test accuracy after each boosting round. The per-sample weight trajectory\n# is reconstructed in the next section from estimators_ and estimator_errors_.\nstaged_acc_noisy = [accuracy_score(y_te, pred) for pred in ada.staged_predict(X_te)]\n\nprint(f'AdaBoost (15% noise) - final test acc: {staged_acc_noisy[-1]:.4f}')\nprint(f'Best round: {np.argmax(staged_acc_noisy)+1}  acc: {max(staged_acc_noisy):.4f}')"]
  },
  {
   "cell_type": "markdown",
   "id": "c09",
   "metadata": {},
   "source": ["## 4. Simulating Weight Concentration"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c10",
   "metadata": {},
   "outputs": [],
   "source": ["from sklearn.metrics import accuracy_score\n\n# Replay sklearn's SAMME weight update on the fitted estimators\nN = len(y_noisy_15)\nweights = np.ones(N) / N\n\nnoisy_avg_weights = []\nclean_avg_weights = []\n\nfor est, err in zip(ada.estimators_[:60], ada.estimator_errors_[:60]):\n    if err <= 0 or err >= 1:\n        break\n    # Binary SAMME with learning_rate=1 uses alpha = log((1 - err) / err) and\n    # up-weights only the misclassified samples. The textbook 0.5 factor belongs\n    # to the symmetric exp(-alpha * y * h) update and would understate growth here.\n    alpha = np.log((1 - err) / err)\n    preds = est.predict(X_tr_raw)\n    incorrect = (preds != y_noisy_15).astype(float)\n    weights *= np.exp(alpha * incorrect)\n    weights /= weights.sum()\n    noisy_avg_weights.append(weights[mask_15].mean())\n    clean_avg_weights.append(weights[~mask_15].mean())\n\nfig, ax = plt.subplots(figsize=(12, 5))\nrounds = range(1, len(noisy_avg_weights)+1)\nax.semilogy(rounds, noisy_avg_weights, color='#ef4444', linewidth=2, label='Noisy samples (avg weight)')\nax.semilogy(rounds, clean_avg_weights, color='#6366f1', linewidth=2, label='Clean samples (avg weight)')\nax.axhline(1/N, color='gray', linestyle='--', alpha=0.7, label='Uniform weight (1/N)')\nax.set_xlabel('Boosting Round'); ax.set_ylabel('Average Sample Weight (log scale)')\nax.set_title('Weight Concentration: Noisy vs Clean Samples (15% noise)')\nax.legend(); plt.tight_layout(); plt.show()\n\nif len(noisy_avg_weights) > 0:\n    ratio = noisy_avg_weights[-1] / clean_avg_weights[-1]\n    print(f'Final weight ratio (noisy/clean): {ratio:.1f}×')"]
  },
  {
   "cell_type": "markdown",
   "id": "c11",
   "metadata": {},
   "source": ["## 5. Staged Accuracy — AdaBoost with Early Stopping"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c12",
   "metadata": {},
   "outputs": [],
   "source": ["# Clean baseline\nada_clean = AdaBoostClassifier(\n    estimator=DecisionTreeClassifier(max_depth=1),\n    n_estimators=200, algorithm='SAMME', random_state=42\n)\nada_clean.fit(X_tr_raw, y_tr_clean)\nstaged_clean = [accuracy_score(y_te, p) for p in ada_clean.staged_predict(X_te)]\n\n# Noisy, no early stop\nada_noisy = AdaBoostClassifier(\n    estimator=DecisionTreeClassifier(max_depth=1),\n    n_estimators=200, algorithm='SAMME', random_state=42\n)\nada_noisy.fit(X_tr_raw, y_noisy_15)\nstaged_noisy = [accuracy_score(y_te, p) for p in ada_noisy.staged_predict(X_te)]\n\n# The peak round is read off the test curve, so it is an optimistic bound;\n# a real early-stopping rule would choose the round on a validation split.\nbest_round = np.argmax(staged_noisy) + 1\n\nfig, ax = plt.subplots(figsize=(12, 5))\n# AdaBoost can stop before n_estimators, so size the x-axis from the data\nax.plot(range(1, len(staged_clean) + 1), staged_clean, color='#22c55e', linewidth=2, label='AdaBoost (clean)')\nax.plot(range(1, len(staged_noisy) + 1), staged_noisy, color='#ef4444', linewidth=2, label='AdaBoost (15% noise)')\nax.axvline(best_round, color='#f59e0b', linestyle='--',\n           label=f'Best round = {best_round}')\nax.set_xlabel('Boosting Round'); ax.set_ylabel('Test Accuracy')\nax.set_title('AdaBoost Staged Accuracy: Clean vs 15% Noisy Labels')\nax.legend(); plt.tight_layout(); plt.show()\n\nprint(f'Clean  - final: {staged_clean[-1]:.4f}  peak: {max(staged_clean):.4f}')\nprint(f'Noisy  - final: {staged_noisy[-1]:.4f}  peak: {max(staged_noisy):.4f} at round {best_round}')"]
  },
  {
   "cell_type": "markdown",
   "id": "c13",
   "metadata": {},
   "source": ["## 6. Gradient Boosting — Inherent Noise Resistance"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c14",
   "metadata": {},
   "outputs": [],
   "source": ["from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier\n\ndef eval_model(model, X_tr, y_tr, X_te, y_te):\n    model.fit(X_tr, y_tr)\n    return accuracy_score(y_te, model.predict(X_te))\n\nnoise_rates = [0.0, 0.05, 0.15, 0.25]\n\nada_results = []\ngbm_results = []\nrf_results  = []\n\nfor rate in noise_rates:\n    if rate == 0.0:\n        y_tr_n = y_tr_clean\n    else:\n        y_tr_n, _ = add_label_noise(y_tr_clean, rate)\n\n    ada_acc = eval_model(\n        AdaBoostClassifier(n_estimators=200, algorithm='SAMME', random_state=42),\n        X_tr_raw, y_tr_n, X_te, y_te\n    )\n    gbm_acc = eval_model(\n        GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,\n                                   max_depth=3, subsample=0.7,\n                                   min_samples_leaf=10, random_state=42),\n        X_tr_raw, y_tr_n, X_te, y_te\n    )\n    rf_acc = eval_model(\n        RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),\n        X_tr_raw, y_tr_n, X_te, y_te\n    )\n    ada_results.append(ada_acc)\n    gbm_results.append(gbm_acc)\n    rf_results.append(rf_acc)\n    print(f'Noise {rate*100:4.0f}%: AdaBoost={ada_acc:.4f}  GBM={gbm_acc:.4f}  RF={rf_acc:.4f}')"]
  },
  {
   "cell_type": "markdown",
   "id": "c15",
   "metadata": {},
   "source": ["## 7. Noise Rate vs Accuracy Chart"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c16",
   "metadata": {},
   "outputs": [],
   "source": ["fig, ax = plt.subplots(figsize=(10, 5))\nnr_pct = [r*100 for r in noise_rates]\nax.plot(nr_pct, ada_results, 'o-', color='#ef4444', linewidth=2.5, markersize=8, label='AdaBoost')\nax.plot(nr_pct, gbm_results, 's-', color='#6366f1', linewidth=2.5, markersize=8, label='GBM (subsample=0.7)')\nax.plot(nr_pct, rf_results,  '^-', color='#22c55e', linewidth=2.5, markersize=8, label='Random Forest')\nax.set_xlabel('Label Noise Rate (%)')\nax.set_ylabel('Test Accuracy')\nax.set_title('Noise Robustness: AdaBoost vs GBM vs Random Forest')\nax.legend(); ax.set_ylim(0.5, 1.0)\nplt.tight_layout(); plt.show()"]
  },
  {
   "cell_type": "markdown",
   "id": "c17",
   "metadata": {},
   "source": ["## 8. Sample Filtering Strategy"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c18",
   "metadata": {},
   "outputs": [],
   "source": ["noise_rate = 0.15\ny_tr_noisy_f, true_noise_mask = add_label_noise(y_tr_clean, noise_rate)\n\n# Step 1: brief GBM to identify high-loss samples\ndetector = GradientBoostingClassifier(\n    n_estimators=30, learning_rate=0.1, max_depth=3, random_state=42\n)\ndetector.fit(X_tr_raw, y_tr_noisy_f)\n\n# Per-sample log-loss on the training set: -log p(observed label).\n# Classes are 0/1, so the label doubles as the predict_proba column index.\nprobs = detector.predict_proba(X_tr_raw)\np_observed = probs[np.arange(len(y_tr_noisy_f)), y_tr_noisy_f]\nper_sample_loss = -np.log(np.clip(p_observed, 1e-15, 1.0))\n\n# Filter the top 7% highest-loss samples. With 15% of labels flipped,\n# removing 7% of rows caps noise recall at roughly 47%.\nthreshold = np.percentile(per_sample_loss, 93)\nkeep_mask = per_sample_loss <= threshold\n\nX_tr_filtered = X_tr_raw[keep_mask]\ny_tr_filtered  = y_tr_noisy_f[keep_mask]\n\n# How many noisy samples were caught?\ndetected = true_noise_mask[~keep_mask].sum()\ntotal_removed = (~keep_mask).sum()\nprint(f'Removed {total_removed} samples ({total_removed/len(y_tr_noisy_f)*100:.1f}%)')\nprint(f'Of removed: {detected} were truly noisy ({detected/total_removed*100:.1f}% precision)')\nprint(f'Noise recall: {detected/true_noise_mask.sum()*100:.1f}%')"]
  },
  {
   "cell_type": "markdown",
   "id": "c19",
   "metadata": {},
   "source": ["## 9. Filtered vs Unfiltered GBM"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c20",
   "metadata": {},
   "outputs": [],
   "source": ["gbm_base = GradientBoostingClassifier(\n    n_estimators=200, learning_rate=0.05,\n    max_depth=3, subsample=0.7, min_samples_leaf=10, random_state=42\n)\ngbm_base.fit(X_tr_raw, y_tr_noisy_f)\nbase_acc = accuracy_score(y_te, gbm_base.predict(X_te))\n\ngbm_filt = GradientBoostingClassifier(\n    n_estimators=200, learning_rate=0.05,\n    max_depth=3, subsample=0.7, min_samples_leaf=10, random_state=42\n)\ngbm_filt.fit(X_tr_filtered, y_tr_filtered)\nfilt_acc = accuracy_score(y_te, gbm_filt.predict(X_te))\n\ngbm_oracle = GradientBoostingClassifier(\n    n_estimators=200, learning_rate=0.05,\n    max_depth=3, subsample=0.7, min_samples_leaf=10, random_state=42\n)\ngbm_oracle.fit(X_tr_raw, y_tr_clean)  # ideal: train on clean labels\noracle_acc = accuracy_score(y_te, gbm_oracle.predict(X_te))\n\nnames = ['GBM (noisy)', 'GBM (filtered)', 'GBM (oracle clean)']\naccs  = [base_acc, filt_acc, oracle_acc]\ncolors = ['#ef4444', '#6366f1', '#22c55e']\n\nfig, ax = plt.subplots(figsize=(8, 4))\nbars = ax.bar(names, accs, color=colors, edgecolor='k', alpha=0.85, width=0.5)\nax.set_ylabel('Test Accuracy')\nax.set_title(f'GBM Filtering Effect at {noise_rate*100:.0f}% Noise')\nax.set_ylim(min(accs) - 0.03, 1.0)\nfor bar, v in zip(bars, accs):\n    ax.text(bar.get_x() + bar.get_width()/2, v + 0.002, f'{v:.4f}', ha='center', fontweight='bold')\nplt.tight_layout(); plt.show()"]
  },
  {
   "cell_type": "markdown",
   "id": "c21",
   "metadata": {},
   "source": ["## 10. Subsample Effect on Noise Robustness"]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c22",
   "metadata": {},
   "outputs": [],
   "source": ["y_noisy_15_sub, _ = add_label_noise(y_tr_clean, 0.15)\n\nsubsamples = [0.3, 0.5, 0.7, 0.8, 1.0]\nsub_accs = []\nfor ss in subsamples:\n    m = GradientBoostingClassifier(\n        n_estimators=150, learning_rate=0.05,\n        max_depth=3, subsample=ss,\n        min_samples_leaf=5, random_state=42\n    )\n    m.fit(X_tr_raw, y_noisy_15_sub)\n    acc = accuracy_score(y_te, m.predict(X_te))\n    sub_accs.append(acc)\n    print(f'subsample={ss}: test acc={acc:.4f}')\n\nfig, ax = plt.subplots(figsize=(9, 4))\nax.plot(subsamples, sub_accs, 'o-', color='#6366f1', linewidth=2.5, markersize=9)\nax.set_xlabel('subsample fraction')\nax.set_ylabel('Test Accuracy (15% noise)')\nax.set_title('GBM Subsample vs Noise Robustness')\nplt.tight_layout(); plt.show()"]
  },
  {
   "cell_type": "markdown",
   "id": "c23",
   "metadata": {},
   "source": ["## 11. Discussion\n\n1. **AdaBoost's exponential loss is the root cause.** Each time a sample is misclassified, its weight is multiplied by exp(α · 𝟙[misclassified]), so the weight of a persistently misclassified sample grows geometrically relative to samples that are classified correctly. Mislabelled samples are the ones most likely to stay misclassified; the noisy/clean weight ratio printed in Section 4 measures how far this has gone after up to 60 rounds at 15% noise.\n\n2. **GBM's bounded pseudo-residuals limit noise damage.** The logistic residual r = y − σ(F(x)) lies strictly between −1 and +1 regardless of how many rounds have passed, so the pull a single noisy sample exerts on each new tree is capped rather than compounding.\n\n3. **Subsampling provides probabilistic exclusion.** At subsample=0.7, each tree is fitted on a random 70% of the training rows, so any given noisy sample is left out of about 30% of rounds. Clean samples are left out at the same rate, so this is not targeted noise removal; the benefit is usually attributed to successive trees seeing different mislabelled points and therefore fitting them less consistently. Section 10 measures the effect empirically.\n\n4. **Filtering is a data quality audit, not just a preprocessing step.** The high-loss samples identified by the detector GBM are worth reviewing manually: they often reveal systematic annotation errors, data pipeline bugs, or genuinely ambiguous examples that should be relabelled or collected with more context."]
  },
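  {
   "cell_type": "markdown",
   "id": "c23a",
   "metadata": {},
   "source": ["The contrast in points 1 and 2 can be sketched numerically. This is a minimal illustration under hypothetical assumptions that are not taken from the runs above: one sample is misclassified in every round, every weak learner has a weighted error of 0.4, and the gradient-boosting raw score `F` for that sample drifts by -0.5 per round. The first column shows the factor by which the sample's unnormalised AdaBoost weight has grown; the second shows the log-loss pseudo-residual for a sample labelled 1, which approaches but never reaches 1."]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c23b",
   "metadata": {},
   "outputs": [],
   "source": ["# Illustrative sketch with hypothetical numbers (not results from this notebook)\nimport numpy as np\n\ndef sigmoid(z):\n    return 1.0 / (1.0 + np.exp(-z))\n\n# Assumption: weighted error of 0.4 in every round, so alpha = log(0.6 / 0.4)\nalpha = np.log(0.6 / 0.4)\nrounds = np.arange(1, 11)\n\n# AdaBoost (SAMME form): the unnormalised weight of a sample that is\n# misclassified in every round is multiplied by exp(alpha) each time\nada_factor = np.exp(alpha * rounds)\n\n# Gradient boosting with log-loss: a sample labelled y = 1 whose raw score F\n# keeps drifting negative has residual y - sigmoid(F), which stays below 1\nF = -0.5 * rounds\ngbm_residual = 1.0 - sigmoid(F)\n\nfor t, w, r in zip(rounds, ada_factor, gbm_residual):\n    print(f'round {t:2d}: AdaBoost weight factor = {w:8.2f}   GBM residual = {r:.4f}')"]
  },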
  {
   "cell_type": "markdown",
   "id": "c24",
   "metadata": {},
   "source": ["## 12. Next Steps\n\n- **Article 15: Why Boosting Resists Overfitting** — why adding more rounds does not always cause overfitting, and when it does\n- **LightGBM on noisy data** — how min_child_samples interacts with noise at large scale\n- **Curriculum learning for boosting** — ordering samples from easy to hard to delay exposure to noisy examples"]
  }
 ]
}