{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c01",
   "metadata": {},
   "source": [
    "# AdaBoost in Python with a Simple Classification Example\n\nEnd-to-end AdaBoost on the Iris dataset. We visualise decision boundary evolution across rounds (binary sub-problem), then train the full multi-class model with SAMME/SAMME and compare against baselines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport matplotlib.colors as mcolors\nimport warnings\nwarnings.filterwarnings('ignore')\nnp.random.seed(42)\nimport sklearn; print(f'sklearn {sklearn.__version__}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c03",
   "metadata": {},
   "source": [
    "## 1. Load Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c04",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html\nfrom sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\n\niris = load_iris()\nX_full = pd.DataFrame(iris.data, columns=iris.feature_names)\ny_full = pd.Series(iris.target)\n\nprint(f'Shape: {X_full.shape}')\nprint(f'Classes: {iris.target_names}')\nprint(f'Class counts:\\n{y_full.value_counts().sort_index()}')\n\n# Binary sub-problem: versicolor (1) vs virginica (2)\nmask = y_full.isin([1, 2])\nX_bin = iris.data[mask][:, 2:4]   # petal length & width only (for 2D viz)\ny_bin = iris.target[mask] - 1     # remap to {0,1}\n\nX_bin_tr, X_bin_te, y_bin_tr, y_bin_te = train_test_split(\n    X_bin, y_bin, test_size=0.25, random_state=42, stratify=y_bin\n)\nprint(f'\\nBinary problem \u2014 train: {X_bin_tr.shape}, test: {X_bin_te.shape}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c05",
   "metadata": {},
   "source": [
    "## 2. EDA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c06",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 2, figsize=(13, 4))\ncolors = ['#6366f1', '#22c55e', '#f59e0b']\n\n# All 3 classes: petal length vs petal width\nfor cls, name in enumerate(iris.target_names):\n    mask_c = iris.target == cls\n    axes[0].scatter(iris.data[mask_c, 2], iris.data[mask_c, 3],\n                    s=40, alpha=0.7, label=name, color=colors[cls])\naxes[0].set_xlabel('Petal Length (cm)'); axes[0].set_ylabel('Petal Width (cm)')\naxes[0].set_title('Iris: Petal Features by Class'); axes[0].legend()\n\n# Feature pair plot: sepal vs petal\nfor cls, name in enumerate(iris.target_names):\n    mask_c = iris.target == cls\n    axes[1].scatter(iris.data[mask_c, 0], iris.data[mask_c, 1],\n                    s=40, alpha=0.7, label=name, color=colors[cls])\naxes[1].set_xlabel('Sepal Length'); axes[1].set_ylabel('Sepal Width')\naxes[1].set_title('Iris: Sepal Features by Class'); axes[1].legend()\n\nplt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c07",
   "metadata": {},
   "source": [
    "## 3. Decision Boundary Evolution \u2014 Binary Problem"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c08",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import AdaBoostClassifier\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.metrics import accuracy_score\n\n# Mesh grid\nh = 0.02\nx0_min, x0_max = X_bin[:, 0].min() - 0.3, X_bin[:, 0].max() + 0.3\nx1_min, x1_max = X_bin[:, 1].min() - 0.3, X_bin[:, 1].max() + 0.3\nxx, yy = np.meshgrid(np.arange(x0_min, x0_max, h),\n                      np.arange(x1_min, x1_max, h))\n\nn_rounds_list = [1, 5, 15, 50, 100]\nfig, axes = plt.subplots(1, 5, figsize=(20, 4))\ncmap_bg = plt.cm.RdBu\n\nfor ax, n in zip(axes, n_rounds_list):\n    m = AdaBoostClassifier(\n        estimator=DecisionTreeClassifier(max_depth=1),\n        n_estimators=n, learning_rate=1.0,\n        algorithm='SAMME', random_state=42\n    )\n    m.fit(X_bin_tr, y_bin_tr)\n    Z = m.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)\n    acc = accuracy_score(y_bin_te, m.predict(X_bin_te))\n\n    ax.contourf(xx, yy, Z, alpha=0.3, cmap=cmap_bg)\n    ax.scatter(X_bin_te[y_bin_te==0, 0], X_bin_te[y_bin_te==0, 1],\n               color='#6366f1', s=40, edgecolor='k', linewidth=0.5, label='versicolor')\n    ax.scatter(X_bin_te[y_bin_te==1, 0], X_bin_te[y_bin_te==1, 1],\n               color='#ef4444', s=40, edgecolor='k', linewidth=0.5, label='virginica')\n    ax.set_title(f'{n} rounds\\nAcc={acc:.2f}', fontsize=9)\n    ax.set_xticks([]); ax.set_yticks([])\n\nplt.suptitle('AdaBoost Decision Boundary Evolution (petal length vs petal width)', y=1.01)\nplt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c09",
   "metadata": {},
   "source": [
    "## 4. Full 3-Class Classification \u2014 Preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c10",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n\nX_arr, y_arr = iris.data, iris.target\nX_tr, X_te, y_tr, y_te = train_test_split(\n    X_arr, y_arr, test_size=0.25, random_state=42, stratify=y_arr\n)\n\nscaler = StandardScaler()\nX_tr_sc = scaler.fit_transform(X_tr)\nX_te_sc = scaler.transform(X_te)\n\nprint(f'Train: {X_tr.shape}  Test: {X_te.shape}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c11",
   "metadata": {},
   "source": [
    "## 5. Train SAMME and SAMME"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c12",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import classification_report\n\n# SAMME (discrete)\nsamme = AdaBoostClassifier(\n    estimator=DecisionTreeClassifier(max_depth=1),\n    n_estimators=200, learning_rate=1.0,\n    algorithm='SAMME', random_state=42\n)\nsamme.fit(X_tr_sc, y_tr)\n\n# SAMME.R (real \u2014 uses probabilities)\nsamme_r = AdaBoostClassifier(\n    estimator=DecisionTreeClassifier(max_depth=1),\n    n_estimators=200, learning_rate=1.0,\n    algorithm='SAMME', random_state=42\n)\nsamme_r.fit(X_tr_sc, y_tr)\n\nfor name, model in [('SAMME', samme), ('SAMME.R', samme_r)]:\n    preds = model.predict(X_te_sc)\n    print(f'=== {name} ===')\n    print(classification_report(y_te, preds, target_names=iris.target_names))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c13",
   "metadata": {},
   "source": [
    "## 6. Staged Score: SAMME vs SAMME Convergence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c14",
   "metadata": {},
   "outputs": [],
   "source": [
    "samme_scores  = [accuracy_score(y_te, p) for p in samme.staged_predict(X_te_sc)]\nsamme_r_scores = [accuracy_score(y_te, p) for p in samme_r.staged_predict(X_te_sc)]\n\nplt.figure(figsize=(10, 4))\nrounds = np.arange(1, 201)\nplt.plot(rounds, samme_scores,   color='#6366f1', linewidth=1.5, label='SAMME')\nplt.plot(rounds, samme_r_scores, color='#f59e0b', linewidth=1.5, label='SAMME.R')\nplt.xlabel('Boosting Rounds'); plt.ylabel('Test Accuracy')\nplt.title('SAMME vs SAMME.R: Convergence Speed on Iris (3-class)')\nplt.legend(); plt.tight_layout(); plt.show()\n\nprint(f'SAMME   best: {max(samme_scores):.4f} at round {np.argmax(samme_scores)+1}')\nprint(f'SAMME.R best: {max(samme_r_scores):.4f} at round {np.argmax(samme_r_scores)+1}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c15",
   "metadata": {},
   "source": [
    "## 7. Confusion Matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c16",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import ConfusionMatrixDisplay\n\nfig, axes = plt.subplots(1, 2, figsize=(11, 4))\nfor ax, (name, model) in zip(axes, [('SAMME', samme), ('SAMME.R', samme_r)]):\n    ConfusionMatrixDisplay.from_estimator(\n        model, X_te_sc, y_te,\n        display_labels=iris.target_names,\n        cmap='Blues', ax=ax\n    )\n    ax.set_title(f'Confusion Matrix \u2014 {name}')\n\nplt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c17",
   "metadata": {},
   "source": [
    "## 8. Comparison with Baselines"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c18",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.linear_model import LogisticRegression\n\npipelines = {\n    'Single Stump':      Pipeline([('sc', StandardScaler()), ('clf', DecisionTreeClassifier(max_depth=1, random_state=42))]),\n    'Full Tree':         Pipeline([('sc', StandardScaler()), ('clf', DecisionTreeClassifier(random_state=42))]),\n    'Logistic Reg.':     Pipeline([('sc', StandardScaler()), ('clf', LogisticRegression(max_iter=500, random_state=42))]),\n    'AdaBoost SAMME':    Pipeline([('sc', StandardScaler()), ('clf', AdaBoostClassifier(\n                             estimator=DecisionTreeClassifier(max_depth=1),\n                             n_estimators=100, algorithm='SAMME', random_state=42))]),\n    'AdaBoost SAMME.R':  Pipeline([('sc', StandardScaler()), ('clf', AdaBoostClassifier(\n                             estimator=DecisionTreeClassifier(max_depth=1),\n                             n_estimators=100, algorithm='SAMME', random_state=42))]),\n}\n\ncv_results = {}\nfor name, pipe in pipelines.items():\n    scores = cross_val_score(pipe, X_arr, y_arr, cv=10, scoring='accuracy', n_jobs=-1)\n    cv_results[name] = scores\n    print(f'{name:22s}: {scores.mean():.4f} \u00b1 {scores.std():.4f}')\n\nplt.figure(figsize=(10, 4))\nplt.boxplot(cv_results.values(), labels=cv_results.keys(), patch_artist=True,\n            boxprops=dict(facecolor='#e0e7ff'),\n            medianprops=dict(color='#4f46e5', linewidth=2))\nplt.xticks(rotation=15, ha='right')\nplt.ylabel('10-Fold CV Accuracy')\nplt.title('AdaBoost vs Baselines on Iris')\nplt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c19",
   "metadata": {},
   "source": [
    "## 9. Learning Rate Sensitivity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c20",
   "metadata": {},
   "outputs": [],
   "source": [
    "lrs = [0.1, 0.5, 1.0, 2.0]\nplt.figure(figsize=(10, 4))\ncolors_lr = ['#6366f1', '#22c55e', '#f59e0b', '#ef4444']\n\nfor lr, col in zip(lrs, colors_lr):\n    m = AdaBoostClassifier(\n        estimator=DecisionTreeClassifier(max_depth=1),\n        n_estimators=200, learning_rate=lr,\n        algorithm='SAMME', random_state=42\n    )\n    m.fit(X_tr_sc, y_tr)\n    scores = [accuracy_score(y_te, p) for p in m.staged_predict(X_te_sc)]\n    plt.plot(np.arange(1, 201), scores, color=col, linewidth=1.5, label=f'lr={lr}')\n\nplt.xlabel('Rounds'); plt.ylabel('Test Accuracy')\nplt.title('SAMME.R: Learning Rate vs Rounds on Iris')\nplt.legend(); plt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c21",
   "metadata": {},
   "source": [
    "## 10. Discussion\n\n1. **Boundary evolution tells the story.** The 2D plots show that early rounds (1\u20135) create coarse boundaries that already capture the easy separation between species. Later rounds (15\u2013100) refine the boundary in the overlap region where versicolor and virginica are hardest to distinguish.\n\n2. **SAMME converges faster.** On Iris, SAMME typically reaches near-optimal accuracy in 20\u201330 rounds, while SAMME takes 60\u201380 rounds for the same accuracy. This is because probability information is a richer training signal than hard vote.\n\n3. **Learning rate controls the trade-off.** Very high learning rate (2.0) converges fast but plateaus at lower accuracy. Very low (0.1) converges slowly but may ultimately achieve comparable accuracy if rounds are sufficient. For Iris (clean data, small dataset), 1.0 is near-optimal.\n\n4. **AdaBoost outperforms baselines on this dataset.** The CV boxplot shows AdaBoost has both higher mean and lower variance than the full decision tree \u2014 the ensemble effect at work on a clean, small dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c22",
   "metadata": {},
   "source": [
    "## 11. Next Steps\n\n- **Article 8: How AdaBoost Reweights Misclassified Samples** \u2014 step-by-step weight update animation and analysis\n- **Article 9: Gradient Boosting in Python** \u2014 extends sequential correction to arbitrary differentiable losses\n- **Article 10: XGBoost for Real Business Problems** \u2014 production boosting with regularisation and parallelism"
   ]
  }
 ]
}