{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
  "language_info": {"name": "python", "version": "3.9.0"}
 },
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-01",
   "metadata": {},
   "source": [
    "# Base Learners in Python: Decision Trees, Logistic Regression, k-NN, SVM, and Naive Bayes\n",
    "\n",
    "This notebook trains five base learners on the Wine dataset, compares their individual performance and decision boundaries, examines their error patterns for complementarity, then combines them into a soft-voting ensemble to test whether that diversity improves predictions."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-02",
   "metadata": {},
   "source": [
    "## Problem Statement\n",
    "\n",
    "We need to identify wine cultivars from 13 chemical measurements. Rather than picking one best model, we want to understand the error patterns of five algorithm families and select a combination that maximises ensemble diversity, a key driver of ensemble performance alongside the accuracy of the individual members."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-03",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "np.random.seed(42)\n",
    "\n",
    "import sklearn\n",
    "print(f'scikit-learn: {sklearn.__version__}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-04",
   "metadata": {},
   "source": [
    "## 1. Load Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-05",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html\n",
    "from sklearn.datasets import load_wine\n",
    "\n",
    "data = load_wine()\n",
    "X = pd.DataFrame(data.data, columns=data.feature_names)\n",
    "y = pd.Series(data.target, name='cultivar')\n",
    "\n",
    "print(f'Shape: {X.shape}')\n",
    "print(f'Features: {list(data.feature_names)}')\n",
    "print(f'Classes: {data.target_names}')\n",
    "print(f'Class distribution:\\n{y.value_counts().sort_index()}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-06",
   "metadata": {},
   "source": [
    "## 2. Exploratory Data Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-07",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 3, figsize=(16, 4))\n",
    "\n",
    "# Class distribution\n",
    "y.value_counts().sort_index().plot(kind='bar', ax=axes[0],\n",
    "    color=['#6366f1','#22c55e','#f59e0b'], edgecolor='black')\n",
    "axes[0].set_title('Class Distribution')\n",
    "axes[0].set_xticklabels(data.target_names, rotation=0)\n",
    "axes[0].set_xlabel('')\n",
    "\n",
    "# Feature boxplots for top 2 discriminating features\n",
    "top_feats = X.corrwith(y).abs().nlargest(2).index.tolist()\n",
    "for i, feat in enumerate(top_feats[:2]):\n",
    "    for cls in [0, 1, 2]:\n",
    "        axes[1+i].hist(X.loc[y==cls, feat], bins=15, alpha=0.6,\n",
    "                       label=data.target_names[cls])\n",
    "    axes[1+i].set_title(f'{feat}')\n",
    "    axes[1+i].legend(fontsize=8)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-08",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Correlation heatmap\n",
    "plt.figure(figsize=(10, 7))\n",
    "corr = X.corr()\n",
    "mask = np.triu(np.ones_like(corr, dtype=bool))\n",
    "sns.heatmap(corr, mask=mask, cmap='coolwarm', center=0,\n",
    "            annot=False, linewidths=0.3)\n",
    "plt.title('Feature Correlation Matrix')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-09",
   "metadata": {},
   "source": [
    "## 3. Preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-10",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.2, random_state=42, stratify=y\n",
    ")\n",
    "\n",
    "scaler = StandardScaler()\n",
    "X_train_sc = scaler.fit_transform(X_train)\n",
    "X_test_sc = scaler.transform(X_test)\n",
    "\n",
    "print(f'Train: {X_train.shape}  Test: {X_test.shape}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-11",
   "metadata": {},
   "source": [
    "## 4. Train All Five Base Learners"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-12",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "from sklearn.dummy import DummyClassifier\n",
    "from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, classification_report\n",
    "\n",
    "models = {\n",
    "    'Dummy (baseline)': DummyClassifier(strategy='stratified', random_state=42),\n",
    "    'Decision Tree': DecisionTreeClassifier(max_depth=4, random_state=42),\n",
    "    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),\n",
    "    'k-NN (k=5)': KNeighborsClassifier(n_neighbors=5),\n",
    "    'SVM (RBF)': SVC(kernel='rbf', probability=True, random_state=42),\n",
    "    'Naive Bayes': GaussianNB(),\n",
    "}\n",
    "\n",
    "results = {}\n",
    "trained = {}\n",
    "for name, model in models.items():\n",
    "    model.fit(X_train_sc, y_train)\n",
    "    preds = model.predict(X_test_sc)\n",
    "    proba = model.predict_proba(X_test_sc)\n",
    "    results[name] = {\n",
    "        'Accuracy': accuracy_score(y_test, preds),\n",
    "        'F1 (macro)': f1_score(y_test, preds, average='macro'),\n",
    "        'AUC (OvR)': roc_auc_score(y_test, proba, multi_class='ovr', average='macro'),\n",
    "    }\n",
    "    trained[name] = model\n",
    "\n",
    "results_df = pd.DataFrame(results).T.round(4)\n",
    "print(results_df.to_string())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-13",
   "metadata": {},
   "source": [
    "## 5. Classification Reports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-14",
   "metadata": {},
   "outputs": [],
   "source": [
    "for name, model in trained.items():\n",
    "    if name == 'Dummy (baseline)':\n",
    "        continue\n",
    "    preds = model.predict(X_test_sc)\n",
    "    print(f'=== {name} ===')\n",
    "    print(classification_report(y_test, preds, target_names=data.target_names))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-15",
   "metadata": {},
   "source": [
    "## 6. Results Comparison Plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-16",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(13, 5))\n",
    "x = np.arange(len(results_df))\n",
    "width = 0.25\n",
    "colors = ['#6366f1', '#22c55e', '#f59e0b']\n",
    "\n",
    "for i, metric in enumerate(['Accuracy', 'F1 (macro)', 'AUC (OvR)']):\n",
    "    ax.bar(x + i * width, results_df[metric], width,\n",
    "           label=metric, color=colors[i], alpha=0.85)\n",
    "\n",
    "ax.set_xticks(x + width)\n",
    "ax.set_xticklabels(results_df.index, rotation=15, ha='right')\n",
    "ax.set_ylim(0.3, 1.05)\n",
    "ax.set_ylabel('Score')\n",
    "ax.set_title('Base Learner Comparison: Accuracy, F1, AUC on Wine Test Set')\n",
    "ax.legend()\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-17",
   "metadata": {},
   "source": [
    "## 7. Decision Boundaries on 2D PCA Projection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-18",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "\n",
    "pca = PCA(n_components=2, random_state=42)\n",
    "X_train_2d = pca.fit_transform(X_train_sc)\n",
    "X_test_2d = pca.transform(X_test_sc)\n",
    "\n",
    "# Retrain on 2D for boundary visualization\n",
    "vis_models = {\n",
    "    'Decision Tree': DecisionTreeClassifier(max_depth=4, random_state=42),\n",
    "    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),\n",
    "    'k-NN (k=5)': KNeighborsClassifier(n_neighbors=5),\n",
    "    'SVM (RBF)': SVC(kernel='rbf', probability=True, random_state=42),\n",
    "    'Naive Bayes': GaussianNB(),\n",
    "}\n",
    "\n",
    "# Mesh grid\n",
    "h = 0.05\n",
    "x_min, x_max = X_train_2d[:, 0].min() - 0.5, X_train_2d[:, 0].max() + 0.5\n",
    "y_min, y_max = X_train_2d[:, 1].min() - 0.5, X_train_2d[:, 1].max() + 0.5\n",
    "xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))\n",
    "\n",
    "fig, axes = plt.subplots(1, 5, figsize=(20, 4))\n",
    "colors_bg = ['#c7d2fe', '#bbf7d0', '#fde68a']  # light purple, green, yellow per class\n",
    "cmap_bg = plt.cm.RdYlBu\n",
    "\n",
    "for ax, (name, model) in zip(axes, vis_models.items()):\n",
    "    model.fit(X_train_2d, y_train)\n",
    "    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])\n",
    "    Z = Z.reshape(xx.shape)\n",
    "    ax.contourf(xx, yy, Z, alpha=0.3, cmap=cmap_bg)\n",
    "    scatter = ax.scatter(X_test_2d[:, 0], X_test_2d[:, 1], c=y_test,\n",
    "                         cmap=cmap_bg, edgecolor='k', s=40, linewidth=0.5)\n",
    "    acc_2d = accuracy_score(y_test, model.predict(X_test_2d))\n",
    "    ax.set_title(f'{name}\\nAcc(2D)={acc_2d:.2f}', fontsize=9)\n",
    "    ax.set_xticks([]); ax.set_yticks([])\n",
    "\n",
    "plt.suptitle('Decision Boundaries on 2D PCA Projection (test points shown)', y=1.02)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-19",
   "metadata": {},
   "source": [
    "## 8. Error Analysis — Where Do Models Disagree?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-20",
   "metadata": {},
   "outputs": [],
   "source": [
    "base_names = ['Decision Tree', 'Logistic Regression', 'k-NN (k=5)', 'SVM (RBF)', 'Naive Bayes']\n",
    "error_matrix = pd.DataFrame(index=range(len(y_test)))\n",
    "\n",
    "for name in base_names:\n",
    "    preds = trained[name].predict(X_test_sc)\n",
    "    error_matrix[name] = (preds != y_test.values).astype(int)\n",
    "\n",
    "print('Error overlap (1 = wrong, 0 = correct) — sample of test set:')\n",
    "print(error_matrix.head(20).to_string())\n",
    "print(f'\\nTotal errors per model:\\n{error_matrix.sum()}')\n",
    "print(f'\\nSamples where ALL models are wrong: {(error_matrix.sum(axis=1)==5).sum()}')\n",
    "print(f'Samples where only 1 model is wrong: {(error_matrix.sum(axis=1)==1).sum()}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-21",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Error correlation heatmap — high correlation = models make same mistakes\n",
    "plt.figure(figsize=(7, 5))\n",
    "err_corr = error_matrix.corr()\n",
    "sns.heatmap(err_corr, annot=True, fmt='.2f', cmap='Reds',\n",
    "            vmin=0, vmax=1, linewidths=0.5)\n",
    "plt.title('Error Correlation Between Base Learners\\n(high = similar mistakes, bad for ensembles)')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-22",
   "metadata": {},
   "source": [
    "## 9. Soft Voting Ensemble — Combining All Five"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-23",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import VotingClassifier\n",
    "\n",
    "voting_clf = VotingClassifier(\n",
    "    estimators=[\n",
    "        ('dt', DecisionTreeClassifier(max_depth=4, random_state=42)),\n",
    "        ('lr', LogisticRegression(max_iter=1000, random_state=42)),\n",
    "        ('knn', KNeighborsClassifier(n_neighbors=5)),\n",
    "        ('svm', SVC(kernel='rbf', probability=True, random_state=42)),\n",
    "        ('nb', GaussianNB()),\n",
    "    ],\n",
    "    voting='soft'\n",
    ")\n",
    "\n",
    "voting_clf.fit(X_train_sc, y_train)\n",
    "v_preds = voting_clf.predict(X_test_sc)\n",
    "v_proba = voting_clf.predict_proba(X_test_sc)\n",
    "\n",
    "print('=== Soft Voting Ensemble ===')\n",
    "print(classification_report(y_test, v_preds, target_names=data.target_names))\n",
    "print(f'AUC (macro OvR): {roc_auc_score(y_test, v_proba, multi_class=\"ovr\", average=\"macro\"):.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-24",
   "metadata": {},
   "source": [
    "## 10. Cross-Validation Stability"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-25",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.pipeline import Pipeline\n",
    "\n",
    "cv_models = {\n",
    "    'Decision Tree': DecisionTreeClassifier(max_depth=4, random_state=42),\n",
    "    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),\n",
    "    'k-NN (k=5)': KNeighborsClassifier(n_neighbors=5),\n",
    "    'SVM (RBF)': SVC(kernel='rbf', probability=True, random_state=42),\n",
    "    'Naive Bayes': GaussianNB(),\n",
    "    'Soft Voting': VotingClassifier(\n",
    "        estimators=[\n",
    "            ('dt', DecisionTreeClassifier(max_depth=4, random_state=42)),\n",
    "            ('lr', LogisticRegression(max_iter=1000, random_state=42)),\n",
    "            ('knn', KNeighborsClassifier(n_neighbors=5)),\n",
    "            ('svm', SVC(kernel='rbf', probability=True, random_state=42)),\n",
    "            ('nb', GaussianNB()),\n",
    "        ], voting='soft'\n",
    "    ),\n",
    "}\n",
    "\n",
    "cv_results = {}\n",
    "for name, model in cv_models.items():\n",
    "    pipe = Pipeline([('scaler', StandardScaler()), ('clf', model)])\n",
    "    scores = cross_val_score(pipe, X, y, cv=10, scoring='f1_macro', n_jobs=-1)\n",
    "    cv_results[name] = scores\n",
    "    print(f'{name:25s}: mean={scores.mean():.4f}  std={scores.std():.4f}')\n",
    "\n",
    "plt.figure(figsize=(11, 4))\n",
    "plt.boxplot(cv_results.values(), labels=cv_results.keys(), patch_artist=True,\n",
    "            boxprops=dict(facecolor='#e0e7ff'),\n",
    "            medianprops=dict(color='#4f46e5', linewidth=2))\n",
    "plt.xticks(rotation=15, ha='right')\n",
    "plt.ylabel('F1 Macro (10-Fold CV)')\n",
    "plt.title('Stability: Individual Base Learners vs Soft Voting Ensemble')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-26",
   "metadata": {},
   "source": [
    "## 11. Discussion\n",
    "\n",
    "Key observations:\n",
    "\n",
    "1. **Error correlation reveals diversity.** The error correlation heatmap shows which model pairs share mistakes. Low correlation (near 0) between two models means their errors tend to fall on different samples — ideal for ensembling. High correlation means they're essentially duplicates in the ensemble.\n",
    "\n",
    "2. **Decision boundaries tell the story.** The 2D PCA projection shows how radically different each model's partitioning of feature space is. The decision tree creates hard rectangular regions; the SVM creates smooth curves; k-NN creates jagged local boundaries. These structural differences drive error diversity.\n",
    "\n",
    "3. **Soft voting narrows variance.** The CV boxplot shows the voting ensemble has a tighter distribution than any individual model — it consistently stays near its mean F1 even on harder folds. This stability is often more valuable in production than a slightly higher mean.\n",
    "\n",
    "4. **Not all combinations are equal.** If LR and SVM have high error correlation (both are margin-based linear-ish models), including both adds less diversity than adding a decision tree."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-27",
   "metadata": {},
   "source": [
    "## 12. Next Steps\n",
    "\n",
    "- **How to Evaluate Ensemble Models in Python** — rigorous cross-validation pipelines, stratification, and avoiding leakage\n",
    "- **Bias, Variance, and Why Ensembles Generalise Better** — decomposing ensemble error mathematically\n",
    "- **Voting Classifiers: Hard Voting vs Soft Voting** — deep dive into combination rules\n",
    "- **Stacking in Python with scikit-learn** — replacing the averaging rule with a learned meta-model\n",
    "\n",
    "All subsequent notebooks assume familiarity with the five base learners introduced here."
   ]
  }
 ]
}