{
"cells": [
{
"cell_type": "markdown",
"id": "84dcd475",
"metadata": {},
"source": [
"# DDRA - 001 - Basic Booking Attributes\n",
"\n",
"## General Idea\n",
"The idea is to start with a very simple model with basic Booking attributes. This should serve as a first understanding of what can bring value in the data-driven risk assessment of new dash protected bookings.\n",
"\n",
"## Initial setup\n",
"This first section just ensures that the connection to DWH works correctly."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "12368ce1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🔌 Testing connection using credentials at: /home/uri/.superhog-dwh/credentials.yml\n",
"✅ Connection successful.\n"
]
}
],
"source": [
"# This script connects to a Data Warehouse (DWH) using PostgreSQL. \n",
"# This should be common for all Notebooks, but you might need to adjust the path to the `dwh_utils` module.\n",
"\n",
"import sys\n",
"import os\n",
"sys.path.append(os.path.abspath(\"../../utils\")) # Adjust path if needed\n",
"\n",
"from dwh_utils import read_credentials, create_postgres_engine, query_to_dataframe, test_connection\n",
"\n",
"# --- Connect to DWH ---\n",
"creds = read_credentials()\n",
"dwh_pg_engine = create_postgres_engine(creds)\n",
"\n",
"# --- Test Query ---\n",
"test_connection()"
]
},
{
"cell_type": "markdown",
"id": "c86f94f1",
"metadata": {},
"source": [
"## Data Extraction\n",
"In this section we extract the data for our first attempt on Basic Booking Attributes modelling.\n",
"\n",
"This SQL query retrieves a clean and relevant subset of booking data for our model. It includes:\n",
"- A **unique booking ID**\n",
"- Key **numeric features** such as number of services, time between booking creation and check-in, and number of nights\n",
"- Several **categorical (boolean) features** related to service usage\n",
"- A **target variable** (`has_resolution_incident`) indicating whether a resolution incident occurred\n",
"\n",
"Filters applied being:\n",
"1. Bookings from **\"New Dash\" users** with a valid deal ID\n",
"2. Only **protected bookings**, i.e., those with Protection or Deposit Management services\n",
"3. Bookings flagged for **risk categorisation** (excluding incomplete/rejected ones)\n",
"4. Bookings that are **already completed**\n",
"\n",
"The result is converted into a pandas DataFrame for further processing and modeling.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "3e3ed391",
"metadata": {},
"outputs": [],
"source": [
"# Initialise all imports needed for the Notebook\n",
"from sklearn.model_selection import (\n",
" train_test_split, \n",
" GridSearchCV\n",
")\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import date\n",
"from sklearn.metrics import (\n",
" roc_auc_score, \n",
" average_precision_score,\n",
" classification_report,\n",
" roc_curve, \n",
" auc,\n",
" precision_recall_curve\n",
")\n",
"import matplotlib.pyplot as plt\n",
"import shap"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "db5e3098",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" id_booking number_of_applied_services \\\n",
"0 919656 3 \n",
"1 926634 3 \n",
"2 931082 2 \n",
"3 931086 2 \n",
"4 931096 2 \n",
"\n",
" number_of_applied_upgraded_services number_of_applied_billable_services \\\n",
"0 2 2 \n",
"1 2 2 \n",
"2 1 1 \n",
"3 1 1 \n",
"4 1 1 \n",
"\n",
" booking_days_to_check_in booking_number_of_nights \\\n",
"0 87 4 \n",
"1 109 3 \n",
"2 50 7 \n",
"3 15 3 \n",
"4 8 5 \n",
"\n",
" has_verification_request has_billable_services \\\n",
"0 False True \n",
"1 False True \n",
"2 False True \n",
"3 False True \n",
"4 False True \n",
"\n",
" has_upgraded_screening_service_business_type \\\n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" has_deposit_management_service_business_type \\\n",
"0 True \n",
"1 True \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" has_protection_service_business_type has_resolution_incident \n",
"0 True False \n",
"1 True False \n",
"2 True False \n",
"3 True False \n",
"4 True False \n",
"Total Bookings: 16,193\n"
]
}
],
"source": [
"# Query to extract data\n",
"data_extraction_query = \"\"\"\n",
"select \n",
" -- Unique ID --\n",
" ibs.id_booking,\n",
" -- Numeric Features --\n",
" ibs.number_of_applied_services,\n",
" ibs.number_of_applied_upgraded_services,\n",
" ibs.number_of_applied_billable_services,\n",
" ibs.booking_check_in_date_utc - booking_created_date_utc as booking_days_to_check_in,\n",
" ibs.booking_number_of_nights,\n",
" -- Categorical (Boolean) Features --\n",
" ibs.has_verification_request,\n",
" ibs.has_billable_services,\n",
" ibs.has_upgraded_screening_service_business_type,\n",
" ibs.has_deposit_management_service_business_type,\n",
" ibs.has_protection_service_business_type,\n",
" -- Target (Boolean) --\n",
" ibs.has_resolution_incident\n",
"from intermediate.int_booking_summary ibs\n",
"where \n",
" -- 1. Bookings from New Dash users with Id Deal\n",
" ibs.is_user_in_new_dash = True and \n",
" ibs.is_missing_id_deal = False and\n",
" -- 2. Protected Bookings with a Protection or a Deposit Management service\n",
" (ibs.has_protection_service_business_type or \n",
" ibs.has_deposit_management_service_business_type) and\n",
" -- 3. Bookings with flagging categorisation (this excludes Cancelled/Incomplete/Rejected bookings)\n",
" ibs.is_booking_flagged_as_risk is not null and \n",
" -- 4. Booking is completed\n",
" ibs.is_booking_past_completion_date = True \n",
"\"\"\"\n",
"\n",
"# Retrieve Data from Query\n",
"df_extraction = query_to_dataframe(engine=dwh_pg_engine, query=data_extraction_query)\n",
"print(df_extraction.head())\n",
"print(f\"Total Bookings: {len(df_extraction):,}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Processing\n",
"Processing in this notebook is quite straight-forward: we just drop id booking, split the features and target and apply a scaling to numeric features.\n",
"Afterwards, we split the dataset between train and test and display their sizes and target distribution."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 11335 rows\n",
"Test set size: 4858 rows\n",
"\n",
"Training target distribution:\n",
"has_resolution_incident\n",
"False 0.988619\n",
"True 0.011381\n",
"Name: proportion, dtype: float64\n",
"\n",
"Test target distribution:\n",
"has_resolution_incident\n",
"False 0.988473\n",
"True 0.011527\n",
"Name: proportion, dtype: float64\n"
]
}
],
"source": [
"# Drop ID column\n",
"df = df_extraction.copy().drop(columns=['id_booking'])\n",
"\n",
"# Separate features and target\n",
"X = df.drop(columns=['has_resolution_incident'])\n",
"y = df['has_resolution_incident']\n",
"\n",
"# Scale numeric features\n",
"numeric_features = ['number_of_applied_services', \n",
" 'booking_number_of_nights', \n",
" 'number_of_applied_upgraded_services',\n",
" 'number_of_applied_billable_services',\n",
" 'booking_days_to_check_in']\n",
"X[numeric_features] = X[numeric_features].astype(float)\n",
"\n",
"# Split the data\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=123)\n",
"\n",
"print(f\"Training set size: {X_train.shape[0]} rows\")\n",
"print(f\"Test set size: {X_test.shape[0]} rows\")\n",
"\n",
"print(\"\\nTraining target distribution:\")\n",
"print(y_train.value_counts(normalize=True))\n",
"\n",
"print(\"\\nTest target distribution:\")\n",
"print(y_test.value_counts(normalize=True))"
]
},
{
"cell_type": "markdown",
"id": "d36c9276",
"metadata": {},
"source": [
"## Classification Model with Random Forest\n",
"\n",
"We define a machine learning pipeline that includes:\n",
"- **Scaling numeric features** with `StandardScaler`\n",
"- **Training a Random Forest classifier** with balanced class weights to handle the imbalanced dataset\n",
"\n",
"We then use `GridSearchCV` to perform a **grid search with cross-validation** over a range of key hyperparameters (e.g., number of trees, max depth, etc.). \n",
"The model is evaluated using **Average Precision**, which is better suited for imbalanced classification tasks.\n",
"\n",
"The best combination of parameters is selected, and the resulting model is used to make predictions on the test set.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "943ef7d6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 5 folds for each of 72 candidates, totalling 360 fits\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.9s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.8s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.9s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.9s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.8s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 7.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 6.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 7.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.4s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.6s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.6s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.1s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.1s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 2.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 2.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 2.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 2.3s\n",
"Best hyperparameters: {'model__max_depth': 10, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 2, 'model__min_samples_split': 2, 'model__n_estimators': 100}\n"
]
}
],
"source": [
"\n",
"# Define pipeline (scaling numeric features only)\n",
"pipeline = Pipeline([\n",
" ('scaler', StandardScaler()),\n",
" ('model', RandomForestClassifier(class_weight='balanced', # We have an imbalanced dataset\n",
" random_state=123))\n",
"])\n",
"\n",
"# Define parameter grid\n",
"param_grid = {\n",
" 'model__n_estimators': [100, 200, 300],\n",
" 'model__max_depth': [None, 10, 20],\n",
" 'model__min_samples_split': [2, 5],\n",
" 'model__min_samples_leaf': [1, 2],\n",
" 'model__max_features': ['sqrt', 'log2']\n",
"}\n",
"\n",
"# GridSearchCV\n",
"grid_search = GridSearchCV(\n",
" estimator=pipeline,\n",
" param_grid=param_grid,\n",
" scoring='average_precision', # For imbalanced classification\n",
" cv=5, # 5-fold cross-validation\n",
" n_jobs=-1, # Use all available cores\n",
" verbose=2 # Verbose output for progress tracking\n",
")\n",
"\n",
"# Fit the grid search on training data\n",
"grid_search.fit(X_train, y_train)\n",
"\n",
"# Best model\n",
"best_pipeline = grid_search.best_estimator_\n",
"print(\"Best hyperparameters:\", grid_search.best_params_)\n",
"\n",
"# Predict on test set\n",
"y_pred_proba = best_pipeline.predict_proba(X_test)[:, 1]\n",
"y_pred = best_pipeline.predict(X_test)\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>mean_fit_time</th>\n",
" <th>std_fit_time</th>\n",
" <th>mean_score_time</th>\n",
" <th>std_score_time</th>\n",
" <th>param_model__max_depth</th>\n",
" <th>param_model__max_features</th>\n",
" <th>param_model__min_samples_leaf</th>\n",
" <th>param_model__min_samples_split</th>\n",
" <th>param_model__n_estimators</th>\n",
" <th>params</th>\n",
" <th>split0_test_score</th>\n",
" <th>split1_test_score</th>\n",
" <th>split2_test_score</th>\n",
" <th>split3_test_score</th>\n",
" <th>split4_test_score</th>\n",
" <th>mean_test_score</th>\n",
" <th>std_test_score</th>\n",
" <th>rank_test_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>42</th>\n",
" <td>1.191664</td>\n",
" <td>0.060865</td>\n",
" <td>0.060239</td>\n",
" <td>0.003913</td>\n",
" <td>10</td>\n",
" <td>log2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>100</td>\n",
" <td>{'model__max_depth': 10, 'model__max_features'...</td>\n",
" <td>0.035431</td>\n",
" <td>0.023902</td>\n",
" <td>0.019452</td>\n",
" <td>0.022538</td>\n",
" <td>0.026337</td>\n",
" <td>0.025532</td>\n",
" <td>0.005426</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>1.295314</td>\n",
" <td>0.295965</td>\n",
" <td>0.071769</td>\n",
" <td>0.019185</td>\n",
" <td>10</td>\n",
" <td>sqrt</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>100</td>\n",
" <td>{'model__max_depth': 10, 'model__max_features'...</td>\n",
" <td>0.035431</td>\n",
" <td>0.023902</td>\n",
" <td>0.019452</td>\n",
" <td>0.022538</td>\n",
" <td>0.026337</td>\n",
" <td>0.025532</td>\n",
" <td>0.005426</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31</th>\n",
" <td>2.318125</td>\n",
" <td>0.101894</td>\n",
" <td>0.105294</td>\n",
" <td>0.009273</td>\n",
" <td>10</td>\n",
" <td>sqrt</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>200</td>\n",
" <td>{'model__max_depth': 10, 'model__max_features'...</td>\n",
" <td>0.037634</td>\n",
" <td>0.021405</td>\n",
" <td>0.018878</td>\n",
" <td>0.022386</td>\n",
" <td>0.025625</td>\n",
" <td>0.025186</td>\n",
" <td>0.006589</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43</th>\n",
" <td>2.513033</td>\n",
" <td>0.161350</td>\n",
" <td>0.120259</td>\n",
" <td>0.020841</td>\n",
" <td>10</td>\n",
" <td>log2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>200</td>\n",
" <td>{'model__max_depth': 10, 'model__max_features'...</td>\n",
" <td>0.037634</td>\n",
" <td>0.021405</td>\n",
" <td>0.018878</td>\n",
" <td>0.022386</td>\n",
" <td>0.025625</td>\n",
" <td>0.025186</td>\n",
" <td>0.006589</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44</th>\n",
" <td>3.862008</td>\n",
" <td>0.369737</td>\n",
" <td>0.170743</td>\n",
" <td>0.029734</td>\n",
" <td>10</td>\n",
" <td>log2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>300</td>\n",
" <td>{'model__max_depth': 10, 'model__max_features'...</td>\n",
" <td>0.034515</td>\n",
" <td>0.021561</td>\n",
" <td>0.019028</td>\n",
" <td>0.023610</td>\n",
" <td>0.024728</td>\n",
" <td>0.024688</td>\n",
" <td>0.005283</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>4.705051</td>\n",
" <td>1.009530</td>\n",
" <td>0.263226</td>\n",
" <td>0.106331</td>\n",
" <td>None</td>\n",
" <td>log2</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>300</td>\n",
" <td>{'model__max_depth': None, 'model__max_feature...</td>\n",
" <td>0.028740</td>\n",
" <td>0.015051</td>\n",
" <td>0.015244</td>\n",
" <td>0.018043</td>\n",
" <td>0.012987</td>\n",
" <td>0.018013</td>\n",
" <td>0.005599</td>\n",
" <td>67</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>2.778192</td>\n",
" <td>0.175340</td>\n",
" <td>0.121770</td>\n",
" <td>0.012860</td>\n",
" <td>None</td>\n",
" <td>log2</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>200</td>\n",
" <td>{'model__max_depth': None, 'model__max_feature...</td>\n",
" <td>0.030543</td>\n",
" <td>0.013419</td>\n",
" <td>0.014527</td>\n",
" <td>0.016448</td>\n",
" <td>0.012857</td>\n",
" <td>0.017559</td>\n",
" <td>0.006607</td>\n",
" <td>69</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3.294891</td>\n",
" <td>0.485518</td>\n",
" <td>0.134053</td>\n",
" <td>0.017547</td>\n",
" <td>None</td>\n",
" <td>sqrt</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>200</td>\n",
" <td>{'model__max_depth': None, 'model__max_feature...</td>\n",
" <td>0.030543</td>\n",
" <td>0.013419</td>\n",
" <td>0.014527</td>\n",
" <td>0.016448</td>\n",
" <td>0.012857</td>\n",
" <td>0.017559</td>\n",
" <td>0.006607</td>\n",
" <td>69</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.316659</td>\n",
" <td>0.108668</td>\n",
" <td>0.064057</td>\n",
" <td>0.006920</td>\n",
" <td>None</td>\n",
" <td>sqrt</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>100</td>\n",
" <td>{'model__max_depth': None, 'model__max_feature...</td>\n",
" <td>0.026317</td>\n",
" <td>0.014495</td>\n",
" <td>0.013819</td>\n",
" <td>0.014843</td>\n",
" <td>0.012623</td>\n",
" <td>0.016419</td>\n",
" <td>0.005007</td>\n",
" <td>71</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>1.497623</td>\n",
" <td>0.385128</td>\n",
" <td>0.083825</td>\n",
" <td>0.028476</td>\n",
" <td>None</td>\n",
" <td>log2</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>100</td>\n",
" <td>{'model__max_depth': None, 'model__max_feature...</td>\n",
" <td>0.026317</td>\n",
" <td>0.014495</td>\n",
" <td>0.013819</td>\n",
" <td>0.014843</td>\n",
" <td>0.012623</td>\n",
" <td>0.016419</td>\n",
" <td>0.005007</td>\n",
" <td>71</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>72 rows × 18 columns</p>\n",
"</div>"
],
"text/plain": [
" mean_fit_time std_fit_time mean_score_time std_score_time \\\n",
"42 1.191664 0.060865 0.060239 0.003913 \n",
"30 1.295314 0.295965 0.071769 0.019185 \n",
"31 2.318125 0.101894 0.105294 0.009273 \n",
"43 2.513033 0.161350 0.120259 0.020841 \n",
"44 3.862008 0.369737 0.170743 0.029734 \n",
".. ... ... ... ... \n",
"14 4.705051 1.009530 0.263226 0.106331 \n",
"13 2.778192 0.175340 0.121770 0.012860 \n",
"1 3.294891 0.485518 0.134053 0.017547 \n",
"0 1.316659 0.108668 0.064057 0.006920 \n",
"12 1.497623 0.385128 0.083825 0.028476 \n",
"\n",
" param_model__max_depth param_model__max_features \\\n",
"42 10 log2 \n",
"30 10 sqrt \n",
"31 10 sqrt \n",
"43 10 log2 \n",
"44 10 log2 \n",
".. ... ... \n",
"14 None log2 \n",
"13 None log2 \n",
"1 None sqrt \n",
"0 None sqrt \n",
"12 None log2 \n",
"\n",
" param_model__min_samples_leaf param_model__min_samples_split \\\n",
"42 2 2 \n",
"30 2 2 \n",
"31 2 2 \n",
"43 2 2 \n",
"44 2 2 \n",
".. ... ... \n",
"14 1 2 \n",
"13 1 2 \n",
"1 1 2 \n",
"0 1 2 \n",
"12 1 2 \n",
"\n",
" param_model__n_estimators \\\n",
"42 100 \n",
"30 100 \n",
"31 200 \n",
"43 200 \n",
"44 300 \n",
".. ... \n",
"14 300 \n",
"13 200 \n",
"1 200 \n",
"0 100 \n",
"12 100 \n",
"\n",
" params split0_test_score \\\n",
"42 {'model__max_depth': 10, 'model__max_features'... 0.035431 \n",
"30 {'model__max_depth': 10, 'model__max_features'... 0.035431 \n",
"31 {'model__max_depth': 10, 'model__max_features'... 0.037634 \n",
"43 {'model__max_depth': 10, 'model__max_features'... 0.037634 \n",
"44 {'model__max_depth': 10, 'model__max_features'... 0.034515 \n",
".. ... ... \n",
"14 {'model__max_depth': None, 'model__max_feature... 0.028740 \n",
"13 {'model__max_depth': None, 'model__max_feature... 0.030543 \n",
"1 {'model__max_depth': None, 'model__max_feature... 0.030543 \n",
"0 {'model__max_depth': None, 'model__max_feature... 0.026317 \n",
"12 {'model__max_depth': None, 'model__max_feature... 0.026317 \n",
"\n",
" split1_test_score split2_test_score split3_test_score \\\n",
"42 0.023902 0.019452 0.022538 \n",
"30 0.023902 0.019452 0.022538 \n",
"31 0.021405 0.018878 0.022386 \n",
"43 0.021405 0.018878 0.022386 \n",
"44 0.021561 0.019028 0.023610 \n",
".. ... ... ... \n",
"14 0.015051 0.015244 0.018043 \n",
"13 0.013419 0.014527 0.016448 \n",
"1 0.013419 0.014527 0.016448 \n",
"0 0.014495 0.013819 0.014843 \n",
"12 0.014495 0.013819 0.014843 \n",
"\n",
" split4_test_score mean_test_score std_test_score rank_test_score \n",
"42 0.026337 0.025532 0.005426 1 \n",
"30 0.026337 0.025532 0.005426 1 \n",
"31 0.025625 0.025186 0.006589 3 \n",
"43 0.025625 0.025186 0.006589 3 \n",
"44 0.024728 0.024688 0.005283 5 \n",
".. ... ... ... ... \n",
"14 0.012987 0.018013 0.005599 67 \n",
"13 0.012857 0.017559 0.006607 69 \n",
"1 0.012857 0.017559 0.006607 69 \n",
"0 0.012623 0.016419 0.005007 71 \n",
"12 0.012623 0.016419 0.005007 71 \n",
"\n",
"[72 rows x 18 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Retrieve cv results\n",
"pd.DataFrame(grid_search.cv_results_).sort_values(by='mean_test_score', ascending=False)"
]
},
{
"cell_type": "markdown",
"id": "fc2fcc89",
"metadata": {},
"source": [
"## Evaluation\n",
"This section aims to evaluate how good the new model is vs. the actual Resolution Incidents.\n",
"\n",
"We start by computing and displaying the classification report, ROC Curve, PR Curve and the respective Area Under the Curve (AUC)."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "30786f7c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" False 0.99 0.92 0.95 4802\n",
" True 0.02 0.16 0.04 56\n",
"\n",
" accuracy 0.91 4858\n",
" macro avg 0.51 0.54 0.49 4858\n",
"weighted avg 0.98 0.91 0.94 4858\n",
"\n"
]
}
],
"source": [
"# Print classification report\n",
"print(classification_report(y_test, y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interpreting the Classification Report\n",
"\n",
"The **Classification Report** provides key metrics to evaluate how well the model performed on each class.\n",
"\n",
"It includes the following metrics for each class (0 and 1):\n",
"* Metric: Meaning\n",
"* Precision: Out of all predicted positives, how many were actually positive?\n",
"* Recall: Out of all actual positives, how many did we correctly identify?\n",
"* F1-score: Harmonic mean of precision and recall (balances both)\n",
"* Support: Number of true samples of that class in the test data\n",
"\n",
"Interpretation:\n",
"* Class 0 = No incident\n",
"* Class 1 = Has resolution incident (rare, but important!)\n",
"\n",
"A few explanatory cases:\n",
"* A high recall for class 1 means we're catching most incidents.\n",
"* A high precision for class 1 means when we predict an incident, we're often correct.\n",
"* The F1-score gives a single balanced measure (good for imbalanced data).\n",
"\n",
"Special note for imbalanced data:\n",
"Since class 1 (or just True) is rare (1% in our case), metrics for that class are more critical.\n",
"We want to maximize recall to catch as many real incidents as possible — without letting precision drop too low (to avoid too many false alarms)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "4b4da914",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAhgAAAHWCAYAAAA1jvBJAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAB4k0lEQVR4nO3deVhUZf8G8HsGhn0TEVlEEcTc9yX3DUXLLTfA3co2Ld/8WWmLZotWltlblqWZWgqIW5q7lvuairu4IG6AyouKrLM9vz+IgQlQBs9wBub+XBeXZ86cc+Y7jwPcnPOc51EIIQSIiIiIJKSUuwAiIiKqfBgwiIiISHIMGERERCQ5BgwiIiKSHAMGERERSY4Bg4iIiCTHgEFERESSY8AgIiIiyTFgEBERkeQYMIiIiEhyDBhEVmDJkiVQKBSGL1tbW/j7+2Ps2LG4detWsfsIIfDrr7+ic+fO8PDwgJOTExo3boyPPvoImZmZJb7W2rVr0adPH3h5ecHOzg5+fn4YNmwY/vzzz1LVmpOTg6+//hpt27aFu7s7HBwcULduXUycOBEXL14s0/snovKn4FwkRJXfkiVLMG7cOHz00UeoXbs2cnJycOjQISxZsgSBgYE4c+YMHBwcDNvrdDoMHz4cK1euRKdOnTBo0CA4OTlh7969WLFiBRo0aIAdO3agevXqhn2EEHj++eexZMkSNG/eHEOGDIGPjw+Sk5Oxdu1aHDt2DPv370f79u1LrDM1NRW9e/fGsWPH0LdvX4SGhsLFxQXx8fGIjo5GSkoK1Gq1WduKiCQiiKjS++WXXwQAcfToUaP177zzjgAgYmJijNbPmjVLABBTpkwpcqz169cLpVIpevfubbR+zpw5AoD4z3/+I/R6fZH9li1bJg4fPvzIOp999lmhVCrFqlWrijyXk5Mj/u///u+R+5eWRqMRubm5khyLiIrHgEFkBUoKGH/88YcAIGbNmmVYl5WVJapUqSLq1q0rNBpNsccbN26cACAOHjxo2MfT01PUq1dPaLXaMtV46NAhAUCMHz++VNt36dJFdOnSpcj6MWPGiFq1ahkeX716VQAQc+bMEV9//bUICgoSSqVSHDp0SNjY2IgPP/ywyDEuXLggAIhvv/3WsO7evXti0qRJokaNGsLOzk4EBweLzz77TOh0OpPfK5E1YB8MIiuWmJgIAKhSpYph3b59+3Dv3j0MHz4ctra2xe43evRoAMAff/xh2CctLQ3Dhw+HjY1NmWpZv349AGDUqFFl2v9xfvnlF3z77bd46aWX8NVXX8HX1xddunTBypUri2wbExMDGxsbDB06FACQlZWFLl264LfffsPo0aPx3//+Fx06dMC0adMwefJks9RLVNEV/9ODiCqlBw8eIDU1FTk5OTh8+DBmzpwJe3t79O3b17DNuXPnAABNmzYt8Tj5z50/f97o38aNG5e5NimO8Sg3b97E5cuXUa1aNcO68PBwvPzyyzhz5gwaNWpkWB8TE4MuXboY+pjMnTsXV65cwYkTJxASEgIAePnll+Hn54c5c+bg//7v/xAQEGCWuokqKp7BILIioaGhqFatGgICAjBkyBA4Oztj/fr1qFGjhmGbhw8fAgBcXV1LPE7+c+np6Ub/Pmqfx5HiGI8yePBgo3ABAIMGDYKtrS1iYmIM686cOYNz584hPDzcsC42NhadOnVClSpVkJqaavgKDQ2FTqfDnj17zFIzUUXGMxhEVmT+/PmoW7cuHjx4gMWLF2PPnj2wt7c32ib/F3x+0CjOv0OIm5vbY/d5nMLH8PDwKPNxSlK7du0i67y8vNCjRw+sXLkSH3/8MYC8sxe2trYYNGiQYbtLly7h1KlTRQJKvjt37kheL1FFx4BBZEXatGmDVq1aAQAGDhyIjh07Yvjw4YiPj4eLiwsAoH79+gCAU6dOYeDAgcUe59SpUwCABg0aAADq1asHADh9+nSJ+zxO4WN06tTpsdsrFAqIYu6y1+
l0xW7v6OhY7PqIiAiMGzcOcXFxaNasGVauXIkePXrAy8vLsI1er0fPnj3x9ttvF3uMunXrPrZeImvDSyREVsrGxgazZ89GUlISvvvuO8P6jh07wsPDAytWrCjxl/WyZcsAwNB3o2PHjqhSpQqioqJK3Odx+vXrBwD47bffSrV9lSpVcP/+/SLrr127ZtLrDhw4EHZ2doiJiUFcXBwuXryIiIgIo22Cg4ORkZGB0NDQYr9q1qxp0msSWQMGDCIr1rVrV7Rp0wbz5s1DTk4OAMDJyQlTpkxBfHw83nvvvSL7bNy4EUuWLEFYWBiefvppwz7vvPMOzp8/j3feeafYMwu//fYbjhw5UmIt7dq1Q+/evbFo0SKsW7euyPNqtRpTpkwxPA4ODsaFCxdw9+5dw7qTJ09i//79pX7/AODh4YGwsDCsXLkS0dHRsLOzK3IWZtiwYTh48CC2bt1aZP/79+9Dq9Wa9JpE1oAjeRJZgfyRPI8ePWq4RJJv1apVGDp0KH744Qe88sorAPIuM4SHh2P16tXo3LkzBg8eDEdHR+zbtw+//fYb6tevj507dxqN5KnX6zF27Fj8+uuvaNGihWEkz5SUFKxbtw5HjhzBgQMH0K5duxLrvHv3Lnr16oWTJ0+iX79+6NGjB5ydnXHp0iVER0cjOTkZubm5APLuOmnUqBGaNm2KF154AXfu3MGCBQtQvXp1pKenG27BTUxMRO3atTFnzhyjgFLY8uXLMXLkSLi6uqJr166GW2bzZWVloVOnTjh16hTGjh2Lli1bIjMzE6dPn8aqVauQmJhodEmFiMCRPImsQUkDbQkhhE6nE8HBwSI4ONhokCydTid++eUX0aFDB+Hm5iYcHBxEw4YNxcyZM0VGRkaJr7Vq1SrRq1cv4enpKWxtbYWvr68IDw8Xu3btKlWtWVlZ4ssvvxStW7cWLi4uws7OToSEhIjXX39dXL582Wjb3377TQQFBQk7OzvRrFkzsXXr1kcOtFWS9PR04ejoKACI3377rdhtHj58KKZNmybq1Kkj7OzshJeXl2jfvr348ssvhVqtLtV7I7ImPINBREREkmMfDCIiIpIcAwYRERFJjgGDiIiIJMeAQURERJJjwCAiIiLJMWAQERGR5KxuLhK9Xo+kpCS4urpCoVDIXQ4REVGFIYTAw4cP4efnB6Xy0ecorC5gJCUlISAgQO4yiIiIKqwbN26gRo0aj9zG6gJG/vTSN27cMEwP/aQ0Gg22bduGXr16QaVSSXJMa8c2lR7bVFpsT+mxTaVljvZMT09HQECA4Xfpo1hdwMi/LOLm5iZpwHBycoKbmxu/KSTCNpUe21RabE/psU2lZc72LE0XA3byJCIiIskxYBAREZHkGDCIiIhIcgwYREREJDkGDCIiIpIcAwYRERFJjgGDiIiIJMeAQURERJJjwCAiIiLJMWAQERGR5GQNGHv27EG/fv3g5+cHhUKBdevWPXafXbt2oUWLFrC3t0edOnWwZMkSs9dJREREppE1YGRmZqJp06aYP39+qba/evUqnn32WXTr1g1xcXH4z3/+gxdffBFbt241c6VERERkClknO+vTpw/69OlT6u0XLFiA2rVr46uvvgIA1K9fH/v27cPXX3+NsLAwc5VJREQWQAggIQHIyZG7kopBowGuX3fF7dvAY2ZWN4sKNZvqwYMHERoaarQuLCwM//nPf0rcJzc3F7m5uYbH6enpAPJmmdNoNJLUlX8cqY5HbFNzYJtKi+0pvZLaVAhg40YFPvzQBqdOPX4WTwJsbHQIDLyGK1e6IzFRg88+k/b3XWlUqICRkpKC6tWrG62rXr060tPTkZ2dDUdHxyL7zJ49GzNnziyyftu2bXBycpK0vu3bt0t6PGKbmgPbVFpsT+nlt6kQQFxcNaxYUR+XLlWRuaqKw8kpC8OGrUStWtcQFRWJxEQFNm06J8mxs7KySr1thQoYZTFt2jRMnjzZ8Dg9PR0BAQHo1asX3NzcJHkNjU
aD7du3o2fPnlCpVJIc09qxTaXHNpUW21N6hdv04EE7fPihEvv2GXcVbNFCj6ZNZSqwAlCp7sDbOwa2tveh19ujadO
"text/plain": [
"<Figure size 600x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# ROC Curve\n",
"fpr, tpr, _ = roc_curve(y_test, y_pred_proba)\n",
"roc_auc = auc(fpr, tpr)\n",
"\n",
"plt.figure(figsize=(6, 5))\n",
"plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.4f})')\n",
"plt.plot([0, 1], [0, 1], color='gray', linestyle='--')\n",
"plt.xlabel('False Positive Rate')\n",
"plt.ylabel('True Positive Rate')\n",
"plt.title('ROC Curve')\n",
"plt.legend(loc='lower right')\n",
"plt.grid(True)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interpreting the ROC Curve\n",
"\n",
"The **Receiver Operating Characteristic (ROC) curve** shows how well the model distinguishes between the positive and negative classes across all decision thresholds.\n",
"\n",
"A quick reminder of the definitions:\n",
"* True Positive Rate (TPR) = Recall\n",
"* False Positive Rate (FPR) = Proportion of negatives wrongly classified as positives\n",
"\n",
"What we display in this plot is:\n",
"* The x-axis is False Positive Rate\n",
"* The y-axis is True Positive Rate\n",
"\n",
"The curve shows how TPR and FPR change as the threshold varies\n",
"\n",
"It's important to note that:\n",
"* A model with no skill will produce a diagonal line (AUC = 0.5)\n",
"* A model with perfect discrimination will hug the top-left corner (AUC = 1.0)\n",
"\n",
"The Area Under the Curve (ROC AUC) gives a single performance score:\n",
"* Closer to 1 means better at ranking positive cases higher than negative ones\n",
"\n",
"**Important!**\n",
"\n",
"While useful, the ROC curve can sometimes overestimate performance when the dataset is imbalanced, because it includes negatives (which dominate in our case, around 99%!). Thats why we also MUST check the Precision-Recall curve."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "6790d41d",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAhgAAAHWCAYAAAA1jvBJAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAABSaElEQVR4nO3deVxU5f4H8M/sgICgbIoormFpaiiGZmqiJKnXbqWpuZWmKV2VrDQXMks090wlvbncX5qmpVmSiqjlQrdSsXvLXRRTQTHZlxlmnt8f3JkcGRDwGUbk8369eMmc85wz3/kyMB/Pec6MQgghQERERCSR0tEFEBER0YOHAYOIiIikY8AgIiIi6RgwiIiISDoGDCIiIpKOAYOIiIikY8AgIiIi6RgwiIiISDoGDCIiIpKOAYOomhoxYgQCAwMrtM2BAwegUChw4MABu9RU3XXr1g3dunWz3L548SIUCgXWrVvnsJqIqisGDKJyWrduHRQKheXLyckJLVq0QGRkJNLS0hxd3n3P/GJt/lIqlahTpw569+6NxMRER5cnRVpaGiZPnoygoCC4uLigVq1aCA4Oxvvvv4+MjAxHl0dUpdSOLoCounnvvffQuHFjFBQU4NChQ1i5ciXi4uLw3//+Fy4uLlVWx+rVq2EymSq0zZNPPon8/HxotVo7VXV3gwYNQkREBIxGI86cOYMVK1age/fu+Pnnn9G6dWuH1XWvfv75Z0RERCAnJwcvvfQSgoODAQC//PIL5s6dix9++AF79uxxcJVEVYcBg6iCevfujfbt2wMARo0ahbp162LRokX4+uuvMWjQIJvb5ObmolatWlLr0Gg0Fd5GqVTCyclJah0V9dhjj+Gll16y3O7SpQt69+6NlStXYsWKFQ6srPIyMjLw7LPPQqVS4fjx4wgKCrJa/8EHH2D16tVS7ssezyUie+ApEqJ79NRTTwEAkpOTARTPjXB1dcX58+cREREBNzc3DBkyBABgMpmwZMkSPPLII3BycoKvry/GjBmDW7duldjvd999h65du8LNzQ3u7u7o0KEDNm7caFlvaw7Gpk2bEBwcbNmmdevWWLp0qWV9aXMwtmzZguDgYDg7O8PLywsvvfQSrly5YjXG/LiuXLmC/v37w9XVFd7e3pg8eTKMRmOl+9elSxcAwPnz562WZ2RkYOLEiQgICIBOp0OzZs0wb968EkdtTCYTli5ditatW8PJyQne3t54+umn8csvv1jGrF27Fk899RR8fHyg0+nw8MMPY+XKlZWu+U6ffPIJrly5gkWLFpUIFwDg6+uL6dOnW24rFAq8++67JcYFBgZixIgRltvm03Lff/89xo0bBx8fHzRo0ABbt261LLdVi0KhwH//+1/LslOnTuH5559HnTp14OTkhPbt22PHjh339qCJ7oJHMIjukfmFsW7dupZlRUVFCA8PxxNPPIEFCxZYTp2MGTMG69atw8iRI/GPf/wDycnJ+Pjjj3H8+HEcPnzYclRi3bp1ePnll/HII49g6tSp8PDwwPHjx7Fr1y4MHjzYZh3x8fEYNGgQevTogXnz5gEATp48icOHD2PChAml1m+up0OHDoiJiUFaWhqWLl2Kw4cP4/jx4/Dw8LCMNRqNCA8PR8eOHbFgwQLs3bsXCxcuRNOmTfHaa69Vqn8XL14EAHh6elqW5eXloWvXrrhy5QrGjBmDhg0b4siRI5g6dSquXbuGJUuWWMa+8sorWLduHXr37o1Ro0ahqKgIBw8exI8//mg50rRy5Uo88sgj6NevH9RqNb755huMGzcOJpMJ48ePr1Tdt9uxYwecnZ3x/PPP3/O+bBk3bhy8vb0xc+ZM5Obm4plnnoGrqyu++OILdO3a1Wrs5s2b8cgjj6BVq1YAgN9++w2dO3eGv78/pkyZglq1auGLL75A//798eWXX+LZZ5+1S81EEERULmvXrhUAxN69e8WNGzfE5cuXxaZNm0TdunWFs7Oz+OOPP4
QQQgwfPlwAEFOmTLHa/uDBgwKA2LBhg9XyXbt2WS3PyMgQbm5uomPHjiI/P99qrMlksnw/fPhw0ahRI8vtCRMmCHd3d1FUVFTqY9i/f78AIPbv3y+EEEKv1wsfHx/RqlUrq/v69ttvBQAxc+ZMq/sDIN577z2rfbZr104EBweXep9mycnJAoCYNWuWuHHjhkhNTRUHDx4UHTp0EADEli1bLGNnz54tatWqJc6cOWO1jylTpgiVSiVSUlKEEELs27dPABD/+Mc/Stzf7b3Ky8srsT48PFw0adLEalnXrl1F165dS9S8du3aMh+bp6enaNOmTZljbgdAREdHl1jeqFEjMXz4cMtt83PuiSeeKPFzHTRokPDx8bFafu3aNaFUKq1+Rj169BCtW7cWBQUFlmUmk0l06tRJNG/evNw1E1UUT5EQVVBYWBi8vb0REBCAF198Ea6urti2bRv8/f2txt35P/otW7agdu3a6NmzJ9LT0y1fwcHBcHV1xf79+wEUH4nIzs7GlClTSsyXUCgUpdbl4eGB3NxcxMfHl/ux/PLLL7h+/TrGjRtndV/PPPMMgoKCsHPnzhLbjB071up2ly5dcOHChXLfZ3R0NLy9veHn54cuXbrg5MmTWLhwodX//rds2YIuXbrA09PTqldhYWEwGo344YcfAABffvklFAoFoqOjS9zP7b1ydna2fJ+ZmYn09HR07doVFy5cQGZmZrlrL01WVhbc3NzueT+lGT16NFQqldWygQMH4vr161anu7Zu3QqTyYSBAwcCAP7880/s27cPAwYMQHZ2tqWPN2/eRHh4OM6ePVviVBiRLDxFQlRBy5cvR4sWLaBWq+Hr64uHHnoISqV1Vler1WjQoIHVsrNnzyIzMxM+Pj4293v9+nUAf51yMR/iLq9x48bhiy++QO/eveHv749evXphwIABePrpp0vd5tKlSwCAhx56qMS6oKAgHDp0yGqZeY7D7Tw9Pa3mkNy4ccNqToarqytcXV0tt1999VW88MILKCgowL59+/DRRx+VmMNx9uxZ/PrrryXuy+z2XtWvXx916tQp9TECwOHDhxEdHY3ExETk5eVZrcvMzETt2rXL3P5u3N3dkZ2dfU/7KEvjxo1LLHv66adRu3ZtbN68GT169ABQfHqkbdu2aNGiBQDg3LlzEEJgxowZmDFjhs19X79+vUQ4JpKBAYOogkJCQizn9kuj0+lKhA6TyQQfHx9s2LDB5jalvZiWl4+PD5KSkrB792589913+O6777B27VoMGzYM69evv6d9m935v2hbOnToYAkuQPERi9snNDZv3hxhYWEAgD59+kClUmHKlCno3r27pa8mkwk9e/bEW2+9ZfM+zC+g5XH+/Hn06NEDQUFBWLRoEQICAqDVahEXF4fFixdX+FJfW4KCgpCUlAS9Xn9PlwCXNln29iMwZjqdDv3798e2bduwYsUKpKWl4fDhw5gzZ45ljPmxTZ48GeHh4Tb33axZs0rXS1QWBgyiKtK0aVPs3bsXnTt3tvmCcfs4APjvf/9b4T/+Wq0Wffv2Rd++fWEymTBu3Dh88sknmDFjhs19NWrUCABw+vRpy9UwZqdPn7asr4gNGzYgPz/fcrtJkyZljp82bRpWr16N6dOnY9euXQCKe5CTk2MJIqVp2rQpdu/ejT///LPUoxjffPMNCgsLsWPHDjRs2NCy3HxKSoa+ffsiMTERX375ZamXKt/O09OzxBtv6fV6XLt2rUL3O3DgQKxfvx4JCQk4efIkhBCW0yPAX73XaDR37SWRbJyDQVRFBgwYAKPRiNmzZ5dYV1RUZHnB6dWrF9zc3BATE4OCggKrcUKIUvd/8+ZNq9tKpRKPPvooAKCwsNDmNu3bt4ePjw9iY2Otxnz33Xc4efIknnnmmXI9ttt17twZYWFhlq+7BQwPDw+MGTMGu3fvRlJSEoDiXiUmJmL37t0lxmdkZKCoqAgA8Nxzz0EIgVmzZpUYZ+6V+ajL7b3LzM
zE2rVrK/zYSjN27FjUq1cPb7zxBs6cOVNi/fXr1/H+++9bbjdt2tQyj8Rs1apVFb7cNywsDHXq1MHmzZuxefNmhIS
"text/plain": [
"<Figure size 600x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# PR Curve\n",
"precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)\n",
"pr_auc = average_precision_score(y_test, y_pred_proba)\n",
"\n",
"plt.figure(figsize=(6, 5))\n",
"plt.plot(recall, precision, color='green', lw=2, label=f'PR curve (AUC = {pr_auc:.4f})')\n",
"plt.xlabel('Recall')\n",
"plt.ylabel('Precision')\n",
"plt.title('Precision-Recall Curve')\n",
"plt.legend(loc='lower left')\n",
"plt.grid(True)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interpreting the Precision-Recall (PR) Curve\n",
"\n",
"The **Precision-Recall (PR) curve** helps evaluate model performance, especially on imbalanced datasets like ours (where positive cases are rare).\n",
"\n",
"A quick reminder of the definitions:\n",
"* Precision = How many of the predicted positives are actually positive\n",
"* Recall = How many of the actual positives the model correctly identifies\n",
"\n",
"What we display in this plot is:\n",
"* The x-axis is Recall \n",
"* The y-axis is Precision \n",
"\n",
"The curve shows the trade-off between them at different model thresholds\n",
"\n",
"In imbalanced datasets, accuracy can be misleading — the PR curve focuses only on the positive class, making it much more meaningful:\n",
"* A higher curve means better performance\n",
"* The area under the curve (PR AUC) summarizes this: closer to 1 is better"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature Importance\n",
"Understanding what drives the prediction is useful for future experiments and business knowledge. Here we track both the native feature importances of the trees, as well as a more heavy SHAP values analysis.\n",
"\n",
"Important! Be aware that SHAP analysis might take quite a bit of time."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "d66ffe2c",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAxkAAAHqCAYAAABoeoNhAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAC6cklEQVR4nOzdeVxO6f8/8NddabvblFQSiUqitKBkiZiEaCxZmkkmEppsCV/Sgiwj+1hmfKb4TMTYZ4pBI0O2RNmSNJL5yDQfS6axpfv+/eHX+bi1c5Oa1/PxuB+P7nOuc533dc6pzvs+13XdIqlUKgUREREREZGcKNR1AERERERE1LAwySAiIiIiIrlikkFERERERHLFJIOIiIiIiOSKSQYREREREckVkwwiIiIiIpIrJhlERERERCRXTDKIiIiIiEiumGQQEREREZFcMckgIiIiIiK5YpJBRET0EYiLi4NIJKrwNXv27Peyz1OnTiEiIgKPHj16L/W/i7Ljcf78+boO5a2tX78ecXFxdR0GUZ1QqusAiIiI6H+ioqLQqlUrmWXt27d/L/s6deoUIiMj4efnBx0dnfeyj3+y9evXo0mTJvDz86vrUIg+OCYZREREHxEPDw84OjrWdRjv5O+//4ZYLK7rMOrMkydPoK6uXtdhENUpdpciIiKqRw4ePIju3btDLBZDU1MTAwYMwNWrV2XKXLp0CX5+fjAzM4OqqioMDQ3xxRdf4P79+0KZiIgIzJw5EwDQqlUroWtWXl4e8vLyIBKJKuzqIxKJEBERIVOPSCTCtWvXMHr0aDRu3BjdunUT1n///fdwcHCAmpoadHV1MXLkSNy5c+et2u7n5wcNDQ3k5+dj4MCB0NDQgLGxMb7++msAwOXLl9G7d2+IxWK0bNkS27Ztk9m+rAvWr7/+igkTJkBPTw9aWlrw9fXFw4cPy+1v/fr1sLa2hoqKCpo1a4bJkyeX61rm6uqK9u3bIz09HT169IC6ujr+7//+D6amprh69SqOHz8uHFtXV1cAwIMHDxASEoIOHTpAQ0MDWlpa8PDwQGZmpkzdKSkpEIlE2LlzJxYtWoTmzZtDVVUVbm5uuHnzZrl4z549i/79+6Nx48YQi8WwsbHB6tWrZcpcv34dw4YNg66uLlRVVeHo6IgDBw7IlCkpKUFkZCTMzc2hqqoKPT09dOvWDUeOHKnReSIC+CSDiIjoo1JUVIT//ve/MsuaNGkCAPj3v/+NMWPGwN3dHUuXLsWTJ0+wYcMGdOvWDRcvXoSpqSkA4MiRI/jtt98wduxYGBoa4urVq/jmm29w9epVnDlzBiKRCEOGDMGNGzewfft2rFy5UtiHvr4+/vzzz1rHPXz4cJibmyM6OhpSqRQAsGjRIoSFhcHb2xvjxo3Dn3/+ibVr16JHjx64ePHiW3XRKi0thYeHB3r06IFly5YhPj4eQUFBEIvFmDt3Lnx8fDBkyBBs3LgRvr6+cHZ2Ltf9LCgoCDo6OoiIiEB2djY2bNiA27dvCzf1wKvkKTIyEn369MHEiROFcmlpaUhNTUWjRo2E+u7fvw8PDw+MHDkSn332GQwMDODq6oovv/wSGhoamDt3LgDAwMAAAPDbb79h3759GD58OFq1aoU//vgDmzZtQs+ePXHt2jU0a9ZMJt4lS5ZAQUEBISEhKCoqwrJly+Dj44OzZ88KZY4cOYKBAwfCyMgIU6ZMgaGhIbKysvDTTz9hypQpAICrV6/CxcUFxsbGmD17NsRiMXbu3AkvLy/s3r0bn376qdD2xYsXY9y4cejcuTMeP36M8+fP48KFC+jbt2+tzxn9Q0mJiIiozsXGxkoBVPiSSqXSv/76S6qjoyMdP368zHb37t2Tamtryyx/8uRJufq3b98uBSD99ddfhWVfffWVFID01q1bMmVv3bolBSCNjY0tVw8AaXh4uPA+PDxcCkA6atQomXJ5eXlSRUVF6aJFi2SWX758WaqkpFRueWXHIy
0tTVg2ZswYKQBpdHS0sOzhw4dSNTU1qUgkkiYkJAjLr1+/Xi7WsjodHBykL168EJYvW7ZMCkC6f/9+qVQqlRYWFkqVlZWln3zyibS0tFQot27dOikA6XfffScs69mzpxSAdOPGjeXaYG1tLe3Zs2e55c+ePZOpVyp9dcxVVFSkUVFRwrJjx45JAUitrKykz58/F5avXr1aCkB6+fJlqVQqlb58+VLaqlUracuWLaUPHz6UqVcikQg/u7m5STt06CB99uyZzPquXbtKzc3NhWW2trbSAQMGlIubqDbYXYqIiOgj8vXXX+PIkSMyL+DVJ9WPHj3CqFGj8N///ld4KSoqokuXLjh27JhQh5qamvDzs2fP8N///hdOTk4AgAsXLryXuAMDA2Xe79mzBxKJBN7e3jLxGhoawtzcXCbe2ho3bpzws46ODiwtLSEWi+Ht7S0st7S0hI6ODn777bdy2wcEBMg8iZg4cSKUlJSQlJQEADh69ChevHiBqVOnQkHhf7dK48ePh5aWFhITE2XqU1FRwdixY2scv4qKilBvaWkp7t+/Dw0NDVhaWlZ4fsaOHQtlZWXhfffu3QFAaNvFixdx69YtTJ06tdzTobInMw8ePMAvv/wCb29v/PXXX8L5uH//Ptzd3ZGTk4P//Oc/AF4d06tXryInJ6fGbSJ6E7tLERERfUQ6d+5c4cDvshu+3r17V7idlpaW8PODBw8QGRmJhIQEFBYWypQrKiqSY7T/82aXpJycHEilUpibm1dY/vWb/NpQVVWFvr6+zDJtbW00b95cuKF+fXlFYy3ejElDQwNGRkbIy8sDANy+fRvAq0TldcrKyjAzMxPWlzE2NpZJAqojkUiwevVqrF+/Hrdu3UJpaamwTk9Pr1z5Fi1ayLxv3LgxAAhty83NBVD1LGQ3b96EVCpFWFgYwsLCKixTWFgIY2NjREVFYfDgwbCwsED79u3Rr18/fP7557CxsalxG4mYZBAREdUDEokEwKtxGYaGhuXWKyn971+6t7c3Tp06hZkzZ6Jjx47Q0NCARCJBv379hHqq8ubNepnXb4bf9PrTk7J4RSIRDh48CEVFxXLlNTQ0qo2jIhXVVdVy6f8fH/I+vdn26kRHRyMsLAxffPEFFixYAF1dXSgoKGDq1KkVnh95tK2s3pCQELi7u1dYpk2bNgCAHj16IDc3F/v378fhw4exefNmrFy5Ehs3bpR5ikRUFSYZRERE9UDr1q0BAE2bNkWfPn0qLffw4UMkJycjMjIS8+fPF5ZX1PWlsmSi7JPyN2dSevMT/OrilUqlaNWqFSwsLGq83YeQk5ODXr16Ce+Li4tRUFCA/v37AwBatmwJAMjOzoaZmZlQ7sWLF7h161aVx/91lR3fXbt2oVevXvjXv/4ls/zRo0fCAPzaKLs2rly5UmlsZe1o1KhRjeLX1dXF2LFjMXbsWBQXF6NHjx6IiIhgkkE1xjEZRERE9YC7uzu0tLQQHR2NkpKScuvLZoQq+9T7zU+5V61aVW6bsu+yeDOZ0NLSQpMmTfDrr7/KLF+/fn2N4x0yZAgUFRURGRlZLhapVCozne6H9s0338gcww0bNuDly5fw8PAAAPTp0wfKyspYs2aNTOz/+te/UFRUhAEDBtRoP2KxuMJvU1dUVCx3TH744QdhTERt2dvbo1WrVli1alW5/ZXtp2nTpnB1dcWmTZtQUFBQro7XZxR789xoaGigTZs2eP78+VvFR/9MfJJBRERUD2hpaWHDhg34/PPPYW9vj5EjR0JfXx/5+flITEyEi4sL1q1bBy0tLWF615KSEhgbG+Pw4cO4detWuTodHBwAAHPnzsXIkSPRqFEjeHp6QiwWY9y4cViyZAnGjRsHR0dH/Prrr7hx40aN423dujUWLlyIOXPmIC8vD15eXtDU1MStW7ewd+9eBAQEICQkRG7HpzZevHgBNzc3eHt7Izs7G+vXr0e3bt0waNAgAK+m8Z0zZw4iIyPRr18/DBo0SC
jXqVMnfPbZZzXaj4ODAzZs2ICFCxeiTZs2aNq0KXr37o2BAwciKioKY8eORdeuXXH58mXEx8fLPDWpDQUFBWzYsAG
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"## BUILT-IN\n",
"\n",
"# Get feature importances from the model\n",
"importances = best_pipeline.named_steps['model'].feature_importances_\n",
"features = X.columns\n",
"\n",
"# Create a Series and sort\n",
"feat_series = pd.Series(importances, index=features).sort_values(ascending=True) # ascending=True for horizontal plot\n",
"\n",
"# Plot Feature Importances\n",
"plt.figure(figsize=(8, 5))\n",
"feat_series.plot(kind='barh', color='skyblue')\n",
"plt.title('Feature Importances')\n",
"plt.xlabel('Importance')\n",
"plt.grid(axis='x')\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interpreting the Feature Importance Plot\n",
"The **feature importance plot** shows how much each feature contributes to the models overall decision-making.\n",
"\n",
"For tree-based models like Random Forest, importance is based on how often and how effectively a feature is used to split the data across all trees.\n",
"A higher score means the feature plays a bigger role in improving prediction accuracy.\n",
"\n",
"In the graph you will see that:\n",
"* Features are ranked from most to least important.\n",
"* The values are relative and model-specific — not directly interpretable as weights or probabilities.\n",
"\n",
"This helps us identify which features the model relies on most when making predictions.\n",
"\n",
"**Important!**\n",
"Unlike SHAP values, native importance doesn't show how a feature affects predictions — only how useful it is to the model overall. For deeper interpretability (e.g., direction and context), SHAP is better (but it takes more time to run)."
]
},
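{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal, hedged sketch of the point above, using synthetic data rather than the booking dataset. It assumes scikit-learn's `RandomForestClassifier`; the names `X_demo`, `y_demo` and `demo_model` are hypothetical and exist only for illustration. It shows that impurity-based importances are normalised to sum to 1, which is why they should be read as relative shares rather than as weights or probabilities."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical, self-contained sketch: synthetic data, not the booking dataset.\n",
"import numpy as np\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"rng = np.random.default_rng(0)\n",
"X_demo = rng.normal(size=(500, 3))\n",
"# Only the first column drives the label; the other two columns are noise.\n",
"y_demo = (X_demo[:, 0] > 0).astype(int)\n",
"\n",
"demo_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)\n",
"demo_importances = demo_model.feature_importances_\n",
"\n",
"# Print the per-feature importances and their total (normalised to 1).\n",
"print(demo_importances, demo_importances.sum())"
]
},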
{
"cell_type": "code",
"execution_count": 11,
"id": "e2197cea",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"ExactExplainer explainer: 4859it [09:15, 8.73it/s] \n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAyoAAAIcCAYAAAAZnVrDAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3gUVdvA4d9sS+8FQkJC6F2EIII0pUp9aYqgICioFDv2hq9+iq8iYkFEBBGQDqGqoICA9Kb0GggJIaT3bJvvjyWbLJuQhBbA576uvWBnz8ycmZ2dnGdOU1RVVRFCCCGEEEKIW4imojMghBBCCCGEEJeTQEUIIYQQQghxy5FARQghhBBCCHHLkUBFCCGEEEIIccuRQEUIIYQQQghxy5FARQghhBBCCHHLkUBFCCGEEEIIccuRQEUIIYQQQghxy5FARQghhBBCCHHLkUBFCCGEEEKIW9x7772Hp6dnqZ/FxMSgKAqLFi0q1/avdr0bSVfRGRBCCCGEEEJcHyEhIWzdupXatWtXdFaumQQqQgghhBBC3CFcXFy49957Kzob14U0/RJCCCGEEOIOUVwTLqPRyLPPPou/vz++vr489dRTzJ07F0VRiImJcVg/Ly+PMWPG4OfnR0hICC+//DJms/kmH4WNBCpCCCGEEELcJsxms9PLarVecZ3XXnuNqVOn8uqrrzJ//nysViuvvfZasWnffPNNNBoNCxYs4Omnn+azzz7j+++/vxGHUipp+iWEEEIIIcRtIDs7G71eX+xnHh4exS5PSUlhypQpvPXWW7z66qsAdOnShY4dOxIbG+uUvkWLFkyePBmATp06sX79ehYtWsTTTz99nY6i7CRQEUIIIYS4CiaTiRkzZgAwbNiwEguQQhRL6eu8TF1yxVXc3Nz4888/nZZ/9913zJ07t9h1/vnnH/Ly8ujVq5fD8t69e/P77787pe/cubPD+/r16/PHH39cMV83igQqQgghhBBC3AY0Gg1RUVFOy1euXFniOufPnwcgKCjIYXlwcHCx6X19fR3eGwwG8vLyypnT60P6qAghhBBCCHHTKcW8rr+QkBAALl686LA8MTHxhuzvepJARQghhBBCiDtUw4YNcXV1JTo62mH5smXLKiZD5SBNv4QQQgghhLjpbkwNyuUCAgJ45pln+PDDD3F1daVJkyYsXLiQY8eOAbbmZLeqWzdnQgghhBBC3LFuTtMvgI8//piRI0fy0UcfMWDAAEwmk314Yh8fnxu232ulqKqqVnQmhBBCCCFuNzLql7gmygDnZerCm7b7xx57jM2bN3P69Ombts/ykqZfQgghhBBC3HQ3p+kXwMaNG9myZQvNmjXDarWycuVK5syZw8SJE29aHq6GBCpCCCGEEELcwTw9PVm5ciUTJkwgNzeXyMhIJk6cyPPPP1/RWbsiCVSEEEIIIYS4gzVr1oy//vqrorNRbtKZXgghhBBCCHHLkRoVIYQQQgghbrqb10fldiU1KkIIIYQQQohbjtSoCCGEEEIIcdNJjUpppEZFCCGEEEIIccuRGhUhhBBCCCFuOqlRKY0EKkIIIYQQQtx0EqiURpp+CSGEEEIIIW45UqMihBBCCCHETSc1KqWRGhUhhBBCCCHELUdqVIQQQgghhLjppEalNFKjIoQQQgghhLjlSI2KEEIIIYQQN5laTI2K1LE4khoVIYQQQgghxC1HAhUhhBBCiFvN4q1w1wtw32uw7WhF50aICiFNv4QQQgghbiVt3oDNRwrft3wdRnSE70ZVXJ7EDSANvUojNSpCCCGEuK2pqsofZy2sP2ut6KxcnbhkaPkqaPqCtp9jkFJg2jq4kHbTsyZERZIaFSGEEELctmYdMDP0l4J3Khqs7HoM7q50mxRxrFaoOxay8mzvVbXktAu2wNjuNydf4oaTzvSlkxoVIYQQQtyWrKpaJEi5tAyI+qlCsnN11u0vDFJKk5ELf/wNC/+ChNQbmy8hbgG3yeMGIY
QQQghHr260FLvcCmw+Z6V12G3wPPaLFWVP+9bcy973g/8Ovr75ETeR1J+U5jb4BQshhBBCONt4tuTPDiXdBv1VjsTC6n1Xv/4Hi+FkwnXLjhC3GglUhBBCCHFbsaoq3Rea2ZlYcpqB9W6Dp9VNX772bfT++Nq3ISqIUsxLFCVNv4QQQghxW2kyw8I/KVdOM3W/itFqYUwTBR/XW+y5bFyybW6UXNO1b+vouWvfhhC3KAlUhBBCCHFbKS1IAXjlTwCVtzarLOkN3asrZBnB360Cn1rvPgn3vw2ZZew8Xxbm26CJmyhWcaN+CUcSqAghhBDilpaWZyU2ExoEKldVtOsbXViY99TBgWFaInxuYiHRaLJN2rjn1I3Z/vF4qFXlxmxbiAokgYoQQgghbiqLVeVcpkqol4JOc+WAIXSKmfjswveNA69t31lmqPuDhdwXblIRyGKB8BFwIePG7aPPx/D7+/DFSqhdBYa0B80t1txNFENqVEpT7qt4xYoVREVFsWvXrhuRnxuqZ8+ejBw5sqKzcdVu9/wLUV67du0iKiqKFSvKMXznDRIfH09UVBRTp06tsDyMHDmSnj17Vtj+hbge5h02o5toodo0K/qJFip9ZSYjv/jmS41/cAxSAP5OuvY85FnAZLmBTaYyc6DSMFD6gm7AjQ1SAA6eg8rD4aMlMOwr0PaHlMwbu09xzdRiXsKR1Kjc5ubOnYuXl5cUXv7FMjMzmTt3Ls2aNSMqKqqisyPEVTl69CgbNmygZ8+eVKkiTVjuBKqq8upGC1P2Qa4ZqvvCjAcVHlnlmC4xD3y+tAJWIrzgQo4tkHDVQN4NjCWy8q34ud+gWgffx8BawcXO8BGQNa9i8yDENZJA5TayePFiFMWxmvDnn38mJCREApV/sczMTKZNmwZwxwUqTZs2ZcuWLeh0cqsC+Prrr1HVO/OZ27Fjx5g2bRrNmjWTQOUWsuaUhT/OqHSO1OCqszVUuXwSxZ8Pm3l9E/i5wIIeCrUCtGw7Z6blZWXk42nQ+ucrX79nilQC3MggBcDDcAOa3SzZBo99UfFBCkC2EQKHwBMd4P8eBa22onMknEjTr9LIX//biMFgqOgsCHHVzGYzFosFFxeXMq+j0WjKlf5Op9frKzoL4jaVa1KZe9jKkuMqOWZoFwY/HYKYDFuZWgHuDobmlWF9LKTnQWJuYVOUT3cXjRqs1PSGHIut9sNyKdEZoPYMFTDf1GO7Wi6TVFwUMyYVAlxhcW+Iz9Yw+6CF6rv/oaclgQ4j7yYjrDI/7cxl68Y4PHQwum9lQqp48s2fWUzfbaL6uXvocXI3pya+Re2jx2+tomdyFnwSbXu1rge9m9v6rwT7woIt8P06qBkCHw8Gb4+Kzq0QThS1nI/nVqxYwfjx45kyZQpHjhxh0aJFJCYmEhISwvDhw+nRo4c97W+//caaNWs4duwYKSkpuLu706RJE55++mlq1arlsN39+/czffp0jh49SmZmJj4+PtSqVYsRI0bQqFGjch1UQkICkyZNYuvWrYDtqexLL73EM888Q0hICN99951D+u3btzNr1iwOHjyI0WgkPDyc/v37079/f4d0PXv2JCQkhBdffJFJkyZx8OBB9Ho9bdq04bnnnsPf398hfVpaGlOnTuXPP/8kOTmZgIAA2rZty1NPPYWvr689XX5+PjNnzuTXX3/lwoUL6PV6KlWqRKtWrXjuueec9l+Q/5Keni9fvrzMTyRHjhzJ+fPnmTp1KhMnTmTXrl0oikK7du145ZVXcHV1ZebMmSxbtoykpCQiIyMZN24cTZo0sW/DarUyY8YMtm3bxtmzZ0lPTycgIIDWrVvzzDPPOBxrfHw8vXr1YsSIEdSvX59p06Zx4sQJvLy86NatG6NHj3Z4en7gwAEWLVrE33//zYULF9BqtdSsWZPHHnuM+++/3+l4du/ezVdffcWxY8fw9PSkU6dO9OnTh4cffpgRI0bw1FNP2dOqqsrixYtZtmwZp0+fRq
PRUL9+fUaMGOFwbovmuXr16syYMYMzZ84QFBTE8OHD6dWrFwkJCfbzZzabadeuHa+99hoeHo43/qSkJKZNm8bmzZt
"text/plain": [
"<Figure size 800x550 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"## SHAP VALUES\n",
"\n",
"# SHAP requires that all features passed to Explainer be numeric (floats/ints)\n",
"X_test_shap = X_test.copy()\n",
"X_test_shap = X_test_shap.astype(float)\n",
"\n",
"# Function that returns the probability of the positive class\n",
"def model_predict(data):\n",
" return best_pipeline.predict_proba(data)[:, 1]\n",
"\n",
"# Ensure input to SHAP is numeric\n",
"X_test_shap = X_test.astype(float)\n",
"\n",
"# Create SHAP explainer\n",
"explainer = shap.Explainer(model_predict, X_test_shap)\n",
"\n",
"# Compute SHAP values\n",
"shap_values = explainer(X_test_shap)\n",
"\n",
"# Plot summary\n",
"shap.summary_plot(shap_values.values, X_test_shap)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interpreting the SHAP Summary Plot\n",
"\n",
"Each point on a row represents a SHAP value for a single prediction (row = feature).\n",
"The x-axis shows how much the feature contributed to increasing or decreasing the prediction.\n",
"* Right (positive SHAP value): pushes prediction toward the positive class (i.e., higher chance of incident).\n",
"* Left (negative SHAP value): pushes prediction toward the negative class (i.e., lower chance of incident).\n",
"\n",
"Color shows the actual feature value for that point:\n",
"* Red = high value\n",
"* Blue = low value\n",
"\n",
"In other words:\n",
"* The position tells you impact.\n",
"* The color tells you feature value.\n",
"* The density (thickness) of dots shows how often a value occurs."
]
}
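,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the sign convention above concrete, the following sketch computes exact Shapley values by hand for a hypothetical two-feature model. Everything in it (`toy_model`, the baseline and the instance) is an assumption made for illustration; it is not the notebook's pipeline, and it uses a single baseline point instead of the background data that `shap.Explainer` averages over."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical toy model, not the notebook's pipeline.\n",
"from itertools import permutations\n",
"\n",
"def toy_model(x):\n",
"    # Feature 'a' raises the score, feature 'b' lowers it.\n",
"    return 0.1 + 0.3 * x['a'] - 0.2 * x['b']\n",
"\n",
"baseline = {'a': 0.0, 'b': 0.0}  # reference point (assumed)\n",
"instance = {'a': 1.0, 'b': 1.0}  # prediction to explain (assumed)\n",
"\n",
"def shapley(model, instance, baseline):\n",
"    # Average each feature's marginal contribution over all feature orderings.\n",
"    names = list(instance)\n",
"    orderings = list(permutations(names))\n",
"    contrib = {name: 0.0 for name in names}\n",
"    for order in orderings:\n",
"        current = dict(baseline)\n",
"        prev = model(current)\n",
"        for name in order:\n",
"            current[name] = instance[name]\n",
"            new = model(current)\n",
"            contrib[name] += (new - prev) / len(orderings)\n",
"            prev = new\n",
"    return contrib\n",
"\n",
"phi = shapley(toy_model, instance, baseline)\n",
"# 'a' gets a positive contribution (right of zero on the plot), 'b' a negative one (left of zero).\n",
"# Additivity: the baseline prediction plus all contributions equals the instance prediction.\n",
"print(phi, toy_model(baseline) + sum(phi.values()), toy_model(instance))"
]
}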
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}