{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c01",
   "metadata": {},
   "source": [
    "# Multi-Class Boosting in Python\n\nCompares SAMME, one-vs-rest GBM, and softmax gradient boosting on the Wine (3-class) and Digits (10-class) datasets. Demonstrates the multi-class learner weight correction and how softmax GBM produces better-calibrated probability outputs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport warnings\nwarnings.filterwarnings('ignore')\nnp.random.seed(42)\nimport sklearn; print(f'sklearn {sklearn.__version__}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c03",
   "metadata": {},
   "source": [
    "## 1. Datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c04",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html\n# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html\nfrom sklearn.datasets import load_wine, load_digits\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\n\nwine = load_wine()\nXw, yw = wine.data, wine.target\nXw_tr, Xw_te, yw_tr, yw_te = train_test_split(Xw, yw, test_size=0.25,\n                                               random_state=42, stratify=yw)\n\ndigits = load_digits()\nXd, yd = digits.data, digits.target\nXd_tr, Xd_te, yd_tr, yd_te = train_test_split(Xd, yd, test_size=0.25,\n                                               random_state=42, stratify=yd)\n\nsc = StandardScaler()\nXw_tr_sc = sc.fit_transform(Xw_tr); Xw_te_sc = sc.transform(Xw_te)\n\nprint(f'Wine   \u2014 train: {Xw_tr.shape}  test: {Xw_te.shape}  classes: {len(np.unique(yw))}')\nprint(f'Digits \u2014 train: {Xd_tr.shape}  test: {Xd_te.shape}  classes: {len(np.unique(yd))}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c05",
   "metadata": {},
   "source": [
    "## 2. EDA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c06",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n\naxes[0].bar(wine.target_names, np.bincount(yw),\n            color=['#6366f1','#22c55e','#f59e0b'], edgecolor='k')\naxes[0].set_title('Wine \u2014 Class Distribution')\n\nfor cls, col in zip([0,1,2], ['#6366f1','#22c55e','#f59e0b']):\n    axes[1].scatter(Xw[yw==cls, 0], Xw[yw==cls, 6],\n                    alpha=0.6, s=25, color=col, label=wine.target_names[cls])\naxes[1].set_xlabel('Alcohol'); axes[1].set_ylabel('Flavanoids')\naxes[1].set_title('Wine \u2014 Alcohol vs Flavanoids'); axes[1].legend()\n\naxes[2].bar(range(10), np.bincount(yd), color='#6366f1', alpha=0.8)\naxes[2].set_xlabel('Digit class'); axes[2].set_title('Digits \u2014 Class Distribution')\n\nplt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c07",
   "metadata": {},
   "source": [
    "## 3. SAMME on Wine (3-class)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c08",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.metrics import accuracy_score, f1_score, log_loss, ConfusionMatrixDisplay\n\nsamme_wine = AdaBoostClassifier(\n    estimator=DecisionTreeClassifier(max_depth=1),\n    n_estimators=200, algorithm='SAMME', random_state=42\n)\nsamme_wine.fit(Xw_tr_sc, yw_tr)\n\ny_pred_w = samme_wine.predict(Xw_te_sc)\ny_prob_w = samme_wine.predict_proba(Xw_te_sc)\nprint(f'SAMME (Wine) \u2014 Acc: {accuracy_score(yw_te, y_pred_w):.4f}')\nprint(f'               Macro-F1: {f1_score(yw_te, y_pred_w, average=\"macro\"):.4f}')\nprint(f'               Log-loss: {log_loss(yw_te, y_prob_w):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c09",
   "metadata": {},
   "source": [
    "## 4. Softmax GBM on Wine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c10",
   "metadata": {},
   "outputs": [],
   "source": [
    "softmax_wine = GradientBoostingClassifier(\n    n_estimators=200, learning_rate=0.1,\n    max_depth=3, subsample=0.8,\n    random_state=42\n)\nsoftmax_wine.fit(Xw_tr_sc, yw_tr)\n\ny_pred_sw = softmax_wine.predict(Xw_te_sc)\ny_prob_sw = softmax_wine.predict_proba(Xw_te_sc)\nprint(f'Softmax GBM (Wine) \u2014 Acc: {accuracy_score(yw_te, y_pred_sw):.4f}')\nprint(f'                     Macro-F1: {f1_score(yw_te, y_pred_sw, average=\"macro\"):.4f}')\nprint(f'                     Log-loss: {log_loss(yw_te, y_prob_sw):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c11",
   "metadata": {},
   "source": [
    "## 5. Confusion Matrices \u2014 Wine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c12",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 2, figsize=(11, 4))\nfor ax, model, title in [\n    (axes[0], samme_wine,   'SAMME'),\n    (axes[1], softmax_wine, 'Softmax GBM')\n]:\n    ConfusionMatrixDisplay.from_estimator(\n        model, Xw_te_sc, yw_te,\n        display_labels=wine.target_names,\n        cmap='Blues', ax=ax\n    )\n    ax.set_title(f'Wine Confusion Matrix \u2014 {title}')\nplt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c13",
   "metadata": {},
   "source": [
    "## 6. Staged Accuracy \u2014 Wine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c14",
   "metadata": {},
   "outputs": [],
   "source": [
    "samme_staged  = [accuracy_score(yw_te, p) for p in samme_wine.staged_predict(Xw_te_sc)]\nsoftmax_staged = [accuracy_score(yw_te, p) for p in softmax_wine.staged_predict(Xw_te_sc)]\n\nfig, ax = plt.subplots(figsize=(10, 4))\nax.plot(range(1,201), samme_staged,   color='#6366f1', linewidth=1.8, label='SAMME')\nax.plot(range(1,201), softmax_staged,  color='#22c55e', linewidth=1.8, label='Softmax GBM')\nax.set_xlabel('Boosting Round'); ax.set_ylabel('Test Accuracy')\nax.set_title('Staged Accuracy on Wine (3-class)')\nax.legend(); plt.tight_layout(); plt.show()\nprint(f'SAMME best:       {max(samme_staged):.4f} at round {np.argmax(samme_staged)+1}')\nprint(f'Softmax GBM best: {max(softmax_staged):.4f} at round {np.argmax(softmax_staged)+1}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c15",
   "metadata": {},
   "source": [
    "## 7. Scale to 10-Class Digits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c16",
   "metadata": {},
   "outputs": [],
   "source": [
    "# SAMME on digits\nsamme_digits = AdaBoostClassifier(\n    estimator=DecisionTreeClassifier(max_depth=2),\n    n_estimators=150, algorithm='SAMME', random_state=42\n)\nsamme_digits.fit(Xd_tr, yd_tr)\nsamme_d_acc = accuracy_score(yd_te, samme_digits.predict(Xd_te))\nsamme_d_f1  = f1_score(yd_te, samme_digits.predict(Xd_te), average='macro')\n\n# Softmax GBM on digits\nsoftmax_digits = GradientBoostingClassifier(\n    n_estimators=100, learning_rate=0.1,\n    max_depth=3, subsample=0.8, random_state=42\n)\nsoftmax_digits.fit(Xd_tr, yd_tr)\nsoft_d_acc = accuracy_score(yd_te, softmax_digits.predict(Xd_te))\nsoft_d_f1  = f1_score(yd_te, softmax_digits.predict(Xd_te), average='macro')\n\nprint(f'Digits (10-class):')\nprint(f'  SAMME        \u2014 Acc: {samme_d_acc:.4f}  Macro-F1: {samme_d_f1:.4f}')\nprint(f'  Softmax GBM  \u2014 Acc: {soft_d_acc:.4f}  Macro-F1: {soft_d_f1:.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c17",
   "metadata": {},
   "source": [
    "## 8. Cross-Validated Comparison"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c18",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score, StratifiedKFold\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.pipeline import Pipeline\n\ncv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)\n\nmodels_d = {\n    'Decision Tree':  DecisionTreeClassifier(max_depth=5, random_state=42),\n    'Logistic Reg.':  Pipeline([('sc', StandardScaler()),\n                                ('clf', LogisticRegression(max_iter=500, random_state=42))]),\n    'Random Forest':  RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),\n    'SAMME':          AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2),\n                                         n_estimators=100, algorithm='SAMME', random_state=42),\n    'Softmax GBM':    GradientBoostingClassifier(n_estimators=80, learning_rate=0.1,\n                                                  max_depth=3, random_state=42),\n}\n\ncv_scores = {}\nfor name, model in models_d.items():\n    scores = cross_val_score(model, Xd, yd, cv=cv, scoring='accuracy', n_jobs=-1)\n    cv_scores[name] = scores\n    print(f'{name:20s}: {scores.mean():.4f} \u00b1 {scores.std():.4f}')\n\nplt.figure(figsize=(11, 4))\nplt.boxplot(cv_scores.values(), labels=cv_scores.keys(), patch_artist=True,\n            boxprops=dict(facecolor='#e0e7ff'),\n            medianprops=dict(color='#4f46e5', linewidth=2))\nplt.ylabel('5-Fold CV Accuracy (Digits, 10-class)')\nplt.title('Multi-Class Boosting vs Baselines')\nplt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c19",
   "metadata": {},
   "source": [
    "## 9. Probability Calibration Comparison \u2014 Wine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c20",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare mean predicted probability of correct class\nfor name, probs, true in [\n    ('SAMME',        y_prob_w,  yw_te),\n    ('Softmax GBM',  y_prob_sw, yw_te)\n]:\n    correct_probs = probs[np.arange(len(true)), true]\n    print(f'{name:15s} \u2014 mean P(correct class): {correct_probs.mean():.4f}  '\n          f'min: {correct_probs.min():.4f}  log-loss: {log_loss(true, probs):.4f}')\n\nfig, axes = plt.subplots(1, 2, figsize=(12, 4))\nfor ax, name, probs in [\n    (axes[0], 'SAMME',       y_prob_w),\n    (axes[1], 'Softmax GBM', y_prob_sw)\n]:\n    correct_probs = probs[np.arange(len(yw_te)), yw_te]\n    ax.hist(correct_probs, bins=15, color='#6366f1', edgecolor='white', alpha=0.85)\n    ax.set_xlabel('P(correct class)'); ax.set_ylabel('Count')\n    ax.set_title(f'{name} \u2014 Confidence in Correct Class\\nlog-loss={log_loss(yw_te, probs):.4f}')\nplt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c21",
   "metadata": {},
   "source": [
    "## 10. Discussion\n\n1. **The ln(K\u22121) correction matters.** At K=10, SAMME's learner weight formula subtracts ln(9) \u2248 2.20 from the naive binary formula. Without this correction, weights would be inflated, causing faster weight concentration and slower convergence to the correct boundary.\n\n2. **Softmax GBM produces more confident, better-calibrated probabilities.** The histogram of P(correct class) for Softmax GBM is concentrated near 1.0 with a much lower log-loss \u2014 a direct consequence of optimising cross-entropy directly rather than the exponential loss approximation.\n\n3. **SAMME is competitive in accuracy but not in probability outputs.** For applications that need only the predicted class (not probabilities), SAMME is a reasonable choice. For threshold tuning, decision boundaries under uncertainty, or probability-weighted decisions, Softmax GBM is strongly preferred.\n\n4. **Deeper trees help on Digits.** The 8\u00d78 pixel feature space has non-linear interactions (adjacent pixels); max_depth=2\u20133 captures pairwise pixel relationships that stumps (max_depth=1) miss entirely."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c22",
   "metadata": {},
   "source": [
    "## 11. Next Steps\n\n- **Article 13: Multi-Label Boosting** \u2014 predicting multiple labels per sample simultaneously\n- **Article 14: Boosting with Noisy Data** \u2014 how label noise corrupts the multi-class weight update\n- **Article 11: LightGBM** \u2014 scaling multi-class softmax to large K efficiently"
   ]
  }
 ]
}