{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c01",
   "metadata": {},
   "source": [
    "# LightGBM in Python: Faster Gradient Boosting for Large Datasets\n\nLightGBM trains gradient boosting trees via histogram binning and leaf-wise growth, achieving large speed gains over sklearn's GBM with no accuracy loss. This notebook demonstrates speed and accuracy on a 20,000-sample classification problem and compares key hyperparameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c02",
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n    import lightgbm\nexcept ImportError:\n    import subprocess, sys\n    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'lightgbm', '-q'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c03",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport time\nimport warnings\nwarnings.filterwarnings('ignore')\nnp.random.seed(42)\nimport sklearn, lightgbm as lgb\nprint(f'sklearn   {sklearn.__version__}')\nprint(f'lightgbm  {lgb.__version__}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c04",
   "metadata": {},
   "source": [
    "## 1. Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c05",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html\nfrom sklearn.datasets import make_classification\nfrom sklearn.model_selection import train_test_split\n\nX, y = make_classification(\n    n_samples=20_000, n_features=30,\n    n_informative=20, n_redundant=7,\n    flip_y=0.03, random_state=42\n)\n\nX_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)\nX_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)\n\nprint(f'Train: {X_tr.shape}  Val: {X_val.shape}  Test: {X_te.shape}')\nprint(f'Class balance (train): {np.bincount(y_tr)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c06",
   "metadata": {},
   "source": [
    "## 2. EDA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c07",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n\naxes[0].bar(['Class 0','Class 1'], np.bincount(y), color=['#6366f1','#22c55e'], edgecolor='k')\naxes[0].set_title('Class Distribution')\n\nfor c, col in zip([0,1],['#6366f1','#22c55e']):\n    axes[1].hist(X[y==c,0], bins=30, alpha=0.55, color=col, label=f'Class {c}')\naxes[1].set_title('Feature 0 Distribution by Class'); axes[1].legend()\n\ncorrs = [abs(np.corrcoef(X[:,i], y)[0,1]) for i in range(30)]\naxes[2].bar(range(30), corrs, color='#6366f1', alpha=0.8)\naxes[2].set_xlabel('Feature index'); axes[2].set_title('|Correlation| with Target')\n\nplt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c08",
   "metadata": {},
   "source": [
    "## 3. Baseline \u2014 sklearn GradientBoosting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c09",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import GradientBoostingClassifier\nfrom sklearn.metrics import accuracy_score, roc_auc_score\n\nt0 = time.time()\ngbc = GradientBoostingClassifier(\n    n_estimators=100, learning_rate=0.1,\n    max_depth=4, subsample=0.8, random_state=42\n)\ngbc.fit(X_tr, y_tr)\nsklearn_time = time.time() - t0\n\ngbc_auc = roc_auc_score(y_te, gbc.predict_proba(X_te)[:,1])\ngbc_acc = accuracy_score(y_te, gbc.predict(X_te))\nprint(f'sklearn GBM  \u2014 time: {sklearn_time:.2f}s  AUC: {gbc_auc:.4f}  Acc: {gbc_acc:.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c10",
   "metadata": {},
   "source": [
    "## 4. LightGBM with Early Stopping"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c11",
   "metadata": {},
   "outputs": [],
   "source": [
    "t0 = time.time()\nlgbm_model = lgb.LGBMClassifier(\n    n_estimators=300,\n    learning_rate=0.05,\n    num_leaves=31,\n    min_child_samples=20,\n    subsample=0.8,\n    colsample_bytree=0.8,\n    reg_alpha=0.1,\n    reg_lambda=1.0,\n    random_state=42,\n    n_jobs=-1,\n    verbose=-1\n)\nlgbm_model.fit(\n    X_tr, y_tr,\n    eval_set=[(X_val, y_val)],\n    callbacks=[lgb.early_stopping(30, verbose=False), lgb.log_evaluation(0)]\n)\nlgbm_time = time.time() - t0\n\nlgbm_auc = roc_auc_score(y_te, lgbm_model.predict_proba(X_te)[:,1])\nlgbm_acc = accuracy_score(y_te, lgbm_model.predict(X_te))\nbest_iter = lgbm_model.best_iteration_\nprint(f'LightGBM     \u2014 time: {lgbm_time:.2f}s  AUC: {lgbm_auc:.4f}  Acc: {lgbm_acc:.4f}')\nprint(f'Best iteration: {best_iter}  |  Speedup vs sklearn GBM: {sklearn_time/lgbm_time:.1f}\u00d7')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c12",
   "metadata": {},
   "source": [
    "## 5. Speed vs Accuracy Bar Chart"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c13",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n\nnames  = ['sklearn GBM\\n(100 trees)', f'LightGBM\\n({best_iter} trees)']\ntimes_ = [sklearn_time, lgbm_time]\naucs_  = [gbc_auc, lgbm_auc]\n\naxes[0].bar(names, times_, color=['#f59e0b','#6366f1'], edgecolor='k', width=0.5)\naxes[0].set_ylabel('Training Time (s)')\naxes[0].set_title('Training Time')\nfor i,v in enumerate(times_): axes[0].text(i, v*1.02, f'{v:.2f}s', ha='center', fontweight='bold')\n\naxes[1].bar(names, aucs_, color=['#f59e0b','#6366f1'], edgecolor='k', width=0.5)\naxes[1].set_ylabel('Test AUC-ROC')\naxes[1].set_title('AUC-ROC')\naxes[1].set_ylim([min(aucs_)-0.03, 1.0])\nfor i,v in enumerate(aucs_): axes[1].text(i, v-0.007, f'{v:.4f}', ha='center', color='white', fontweight='bold')\n\nplt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c14",
   "metadata": {},
   "source": [
    "## 6. Feature Importance \u2014 Split Count vs Gain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c15",
   "metadata": {},
   "outputs": [],
   "source": [
    "feat_names = [f'F{i}' for i in range(30)]\nimp_split = lgbm_model.booster_.feature_importance(importance_type='split')\nimp_gain  = lgbm_model.booster_.feature_importance(importance_type='gain')\n\nfig, axes = plt.subplots(1, 2, figsize=(14, 4))\nfor ax, imp, title, col in [\n    (axes[0], imp_split, 'Split Count', '#6366f1'),\n    (axes[1], imp_gain,  'Gain (more reliable)', '#22c55e')\n]:\n    order = np.argsort(imp)[::-1][:12]\n    ax.bar(range(12), imp[order], color=col, alpha=0.85)\n    ax.set_xticks(range(12))\n    ax.set_xticklabels([feat_names[i] for i in order], rotation=45)\n    ax.set_title(f'Feature Importance \u2014 {title}')\n\nplt.tight_layout(); plt.show()\nprint('Top 5 gain:', [feat_names[i] for i in np.argsort(imp_gain)[::-1][:5]])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c16",
   "metadata": {},
   "source": [
    "## 7. num_leaves Sweep"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c17",
   "metadata": {},
   "outputs": [],
   "source": [
    "leaves_list = [7, 15, 31, 63]\nresults = []\nfor nl in leaves_list:\n    m = lgb.LGBMClassifier(\n        n_estimators=200, learning_rate=0.05,\n        num_leaves=nl, min_child_samples=20,\n        subsample=0.8, colsample_bytree=0.8,\n        random_state=42, n_jobs=-1, verbose=-1\n    )\n    m.fit(X_tr, y_tr,\n          eval_set=[(X_val, y_val)],\n          callbacks=[lgb.early_stopping(20, verbose=False), lgb.log_evaluation(0)])\n    val_auc  = roc_auc_score(y_val, m.predict_proba(X_val)[:,1])\n    test_auc = roc_auc_score(y_te,  m.predict_proba(X_te)[:,1])\n    results.append({'num_leaves': nl, 'val_auc': val_auc, 'test_auc': test_auc})\n    print(f'num_leaves={nl:3d}: val AUC={val_auc:.4f}  test AUC={test_auc:.4f}')\n\ndf_r = pd.DataFrame(results)\nfig, ax = plt.subplots(figsize=(8, 4))\nax.plot(df_r['num_leaves'], df_r['val_auc'],  'o-', color='#6366f1', label='Val AUC')\nax.plot(df_r['num_leaves'], df_r['test_auc'], 's-', color='#22c55e', label='Test AUC')\nax.set_xlabel('num_leaves'); ax.set_ylabel('AUC-ROC')\nax.set_title('num_leaves Sweep'); ax.legend()\nplt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c18",
   "metadata": {},
   "source": [
    "## 8. Learning Rate Convergence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c19",
   "metadata": {},
   "outputs": [],
   "source": [
    "lrs = [0.01, 0.05, 0.1, 0.2]\ncolors = ['#94a3b8','#6366f1','#22c55e','#ef4444']\nfig, ax = plt.subplots(figsize=(11, 4))\nfor lr, col in zip(lrs, colors):\n    evals = {}\n    m = lgb.LGBMClassifier(\n        n_estimators=200, learning_rate=lr,\n        num_leaves=31, min_child_samples=20,\n        subsample=0.8, colsample_bytree=0.8,\n        random_state=42, n_jobs=-1, verbose=-1\n    )\n    m.fit(X_tr, y_tr, eval_set=[(X_te, y_te)],\n          callbacks=[lgb.record_evaluation(evals), lgb.log_evaluation(0)])\n    curve = evals['valid_0']['binary_logloss']\n    ax.plot(range(1, len(curve)+1), curve, color=col, linewidth=1.8, label=f'lr={lr}')\nax.set_xlabel('Round'); ax.set_ylabel('Log-Loss (test)')\nax.set_title('Learning Rate \u00d7 Convergence'); ax.legend()\nplt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c20",
   "metadata": {},
   "source": [
    "## 9. Final Cross-Validated Comparison"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c21",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score, StratifiedKFold\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.ensemble import RandomForestClassifier\n\ncv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)\n\nbest_nl = int(df_r.loc[df_r['val_auc'].idxmax(), 'num_leaves'])\n\nmodels = {\n    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),\n    'sklearn GBM':   GradientBoostingClassifier(n_estimators=80, learning_rate=0.1,\n                                                max_depth=4, random_state=42),\n    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),\n    'LightGBM':      lgb.LGBMClassifier(n_estimators=best_iter, learning_rate=0.05,\n                                         num_leaves=best_nl, min_child_samples=20,\n                                         subsample=0.8, colsample_bytree=0.8,\n                                         random_state=42, n_jobs=-1, verbose=-1),\n}\n\ncv_scores = {}\nfor name, model in models.items():\n    scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc', n_jobs=-1)\n    cv_scores[name] = scores\n    print(f'{name:20s}: {scores.mean():.4f} \u00b1 {scores.std():.4f}')\n\nplt.figure(figsize=(10, 4))\nplt.boxplot(cv_scores.values(), labels=cv_scores.keys(), patch_artist=True,\n            boxprops=dict(facecolor='#e0e7ff'),\n            medianprops=dict(color='#4f46e5', linewidth=2))\nplt.ylabel('3-Fold CV AUC-ROC')\nplt.title('LightGBM vs Baselines')\nplt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c22",
   "metadata": {},
   "source": [
    "## 10. Discussion\n\n1. **Histogram binning is the key speed driver.** Bucketing values into \u2264255 bins converts an O(N) scan into an O(255) lookup per split, giving 5\u201320\u00d7 speedup on datasets with N > 10,000 rows.\n\n2. **Leaf-wise growth is more accurate per node count.** By always splitting the highest-gain leaf, LightGBM builds asymmetric trees that adapt depth to difficulty \u2014 deep paths for hard examples, shallow paths for easy ones.\n\n3. **Gain importance is more reliable than split count.** Features can appear in many splits with tiny gain; the gain metric weights each split by actual loss reduction.\n\n4. **num_leaves is the primary complexity knob.** Unlike max_depth, num_leaves directly bounds the maximum leaf count per tree. Start at 31 and increase to 63 for complex datasets with strong interactions."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c23",
   "metadata": {},
   "source": [
    "## 11. Next Steps\n\n- **Article 12: Multi-Class Boosting in Python** \u2014 softmax loss for K > 2 classes\n- **Article 13: Multi-Label Boosting** \u2014 predicting multiple labels simultaneously\n- **Article 14: Boosting with Noisy Data** \u2014 when label noise causes runaway weights"
   ]
  }
 ]
}