data-jupyter-notebooks/data_driven_risk_assessment/experiments/002_contactless_full_attributes.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "84dcd475",
   "metadata": {},
   "source": [
    "# DDRA - Contactless (Full)\n",
    "\n",
    "## General Idea\n",
    "The idea is to play only with numeric features (floats, integers or booleans) that are CONTACTLESS.\n",
    "\n",
    "This considers the FULL set of features.\n",
    "\n",
    "A more readable EDA is available in Notion here: [EDA Uri: Contactless](https://www.notion.so/truvi/EDA-Uri-Contactless-2170446ff9c980909624d45a6c124ec2)\n",
    "\n",
    "## Initial setup\n",
    "This first section just ensures that the connection to DWH works correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "12368ce1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "🔌 Testing connection using credentials at: /home/uri/.superhog-dwh/credentials.yml\n",
      "✅ Connection successful.\n"
     ]
    }
   ],
   "source": [
    "# This script connects to a Data Warehouse (DWH) using PostgreSQL. \n",
    "# This should be common for all Notebooks, but you might need to adjust the path to the `dwh_utils` module.\n",
    "\n",
    "import sys\n",
    "import os\n",
    "sys.path.append(os.path.abspath(\"../../utils\"))  # Adjust path if needed\n",
    "\n",
    "from dwh_utils import read_credentials, create_postgres_engine, query_to_dataframe, test_connection\n",
    "\n",
    "# --- Connect to DWH ---\n",
    "creds = read_credentials()\n",
    "dwh_pg_engine = create_postgres_engine(creds)\n",
    "\n",
    "# --- Test Query ---\n",
    "test_connection()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c86f94f1",
   "metadata": {},
   "source": [
    "## Data Extraction\n",
    "In this section we extract the data.\n",
    "\n",
    "This SQL query retrieves a clean and relevant subset of booking data for our model. It includes:\n",
    "- A **unique booking ID**\n",
    "- Key **numeric features** such as number of services, time between booking creation and check-in, number of nights, etc.\n",
    "- Several **categorical (boolean) features** related to service usage\n",
    "- A **target variable** (`has_resolution_incident`) indicating whether a resolution incident occurred\n",
    "\n",
    "Filters applied being:\n",
    "1. Bookings from **\"New Dash\" users** with a valid deal ID\n",
    "2. Only **protected bookings**, i.e., those with Protection or Deposit Management services\n",
    "3. Bookings flagged for **risk categorisation** (excluding incomplete/rejected ones)\n",
    "4. Bookings that are **already completed**\n",
    "\n",
    "The result is converted into a pandas DataFrame for further processing and modeling.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "3e3ed391",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialise all imports needed for the Notebook\n",
    "from sklearn.model_selection import (\n",
    "    train_test_split, \n",
    "    GridSearchCV\n",
    ")\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from datetime import date\n",
    "from sklearn.metrics import (\n",
    "    roc_auc_score, \n",
    "    average_precision_score,\n",
    "    classification_report,\n",
    "    roc_curve, \n",
    "    auc,\n",
    "    precision_recall_curve,\n",
    "    precision_score,\n",
    "    recall_score,\n",
    "    fbeta_score,\n",
    "    confusion_matrix\n",
    ")\n",
    "import matplotlib.pyplot as plt\n",
    "import shap\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "db5e3098",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total Bookings: 21,384\n"
     ]
    }
   ],
   "source": [
    "# Query to extract data\n",
    "data_extraction_query = \"\"\"\n",
    "WITH \n",
    "service_information AS (\n",
    "\tSELECT\n",
    "\t\tid_booking,\n",
    "\t\tcount(DISTINCT CASE WHEN service_business_type = 'SCREENING' THEN id_booking_service_detail ELSE NULL END) AS number_of_applied_screening_services,\n",
    "\t\tcount(DISTINCT CASE WHEN service_business_type = 'DEPOSIT_MANAGEMENT' THEN id_booking_service_detail ELSE NULL END) AS number_of_applied_deposit_management_services,\n",
    "\t\tcount(DISTINCT CASE WHEN service_business_type = 'PROTECTION' THEN id_booking_service_detail ELSE NULL END) AS number_of_applied_protection_services,\n",
    "\t\tcount(DISTINCT CASE WHEN service_name = 'WAIVER PRO' THEN id_booking ELSE NULL END)>0 AS has_waiver_pro,\n",
    "\t\tcount(DISTINCT CASE WHEN service_name IN ('BASIC DAMAGE DEPOSIT','BASIC DAMAGE DEPOSIT OR BASIC WAIVER','BASIC DAMAGE DEPOSIT OR WAIVER PLUS','BASIC WAIVER','WAIVER PLUS') THEN id_booking ELSE NULL END)>0 AS has_guest_facing_waiver_or_deposit,\n",
    "\t\tcount(DISTINCT CASE WHEN service_name = 'GUEST AGREEMENT' THEN id_booking ELSE NULL END)>0 AS has_guest_agreement,\n",
    "\t\tcount(DISTINCT CASE WHEN service_name = 'BASIC PROTECTION' THEN id_booking ELSE NULL END)>0 AS has_basic_protection,\n",
    "\t\tcount(DISTINCT CASE WHEN service_name = 'PROTECTION PLUS' THEN id_booking ELSE NULL END)>0 AS has_protection_plus,\n",
    "\t\tcount(DISTINCT CASE WHEN service_name = 'PROTECTION PRO' THEN id_booking ELSE NULL END)>0 AS has_protection_pro,\n",
    "\t\tcount(DISTINCT CASE WHEN service_name = 'ID VERIFICATION' THEN id_booking ELSE NULL END)>0 AS has_id_verification,\n",
    "\t\tcount(DISTINCT CASE WHEN service_name = 'SCREENING PLUS' THEN id_booking ELSE NULL END)>0 AS has_screening_plus,\n",
    "\t\tcount(DISTINCT CASE WHEN service_name = 'SEX OFFENDER CHECK' THEN id_booking ELSE NULL END)>0 AS has_sex_offender_check\n",
    "\tFROM\n",
    "\t\tintermediate.int_core__booking_service_detail\n",
    "\tGROUP BY\n",
    "\t\t1\n",
    "),\n",
    "listing_information AS (\n",
    "SELECT \n",
    "\tica.id_accommodation,\n",
    "\t-- Defaults to 0 if null\n",
    "\tCOALESCE(ica.number_of_bedrooms, 0) AS listing_number_of_bedrooms,\n",
    "\t-- Defaults to 0 if null\n",
    "\tCOALESCE(ica.number_of_bathrooms, 0) AS listing_number_of_bathrooms\n",
    "\tFROM intermediate.int_core__accommodation ica \n",
    "),\n",
    "raw_bookings_checked_in_prior_to_TCR AS (\n",
    "\tSELECT\n",
    "\t\tb.id_booking,\n",
    "\t\t-- Using group by on check-in date to remove booking duplicates\n",
    "\t\tb2.booking_check_in_date_utc,\n",
    "\t\t-- Using min as a conservative approach to reduce outliers\n",
    "\t\tmin(b2.booking_number_of_nights) AS min_booking_number_of_nights\n",
    "\tFROM\n",
    "\t\tintermediate.int_booking_summary b\n",
    "\t-- Note that by joining with BS we're only considering New Dash bookings\n",
    "\tLEFT JOIN intermediate.int_booking_summary b2\n",
    "    ON\n",
    "\t\tb2.id_accommodation = b.id_accommodation\n",
    "\t\t-- Exclusion based on actual booking creation!\n",
    "\t\tAND b2.booking_check_in_date_utc >= b.booking_created_date_utc - INTERVAL '30 days'\n",
    "\t\tAND b2.booking_check_in_date_utc < b.booking_created_date_utc\n",
    "\t\t-- Note that since is based on TCR we can remove Cancelled\n",
    "\t\tAND b2.booking_status NOT IN ('CANCELLED')\n",
    "\tGROUP BY\n",
    "\t\tb.id_booking,\n",
    "\t\tb2.booking_check_in_date_utc\n",
    "),\n",
    "bookings_checked_in_prior_to_TCR AS (\n",
    "\tSELECT\n",
    "\t\tid_booking,\n",
    "\t\tLEAST(\n",
    "\t\t\tcount(booking_check_in_date_utc),\n",
    "\t\t\t30\n",
    "\t\t) AS listing_check_ins_prior_to_TCR_in_30_days,\n",
    "\t\t-- Capping\n",
    "\t\tLEAST(\n",
    "\t\t\tGREATEST(\n",
    "\t\t\t\tsum(min_booking_number_of_nights),\n",
    "\t\t\t\t0\n",
    "\t\t\t),\n",
    "\t\t\t30\n",
    "\t\t) AS listing_occupancy_prior_to_TCR_in_30_days\n",
    "\tFROM\n",
    "\t\traw_bookings_checked_in_prior_to_TCR\n",
    "\tGROUP BY\n",
    "\t\t1\n",
    "),\n",
    "raw_known_bookings_checking_in_prior_to_TCI AS (\n",
    "\tSELECT\n",
    "\t\tb.id_booking,\n",
    "\t\tb.booking_check_in_date_utc,\n",
    "\t\t-- Using group by on check-in date to remove booking duplicates\n",
    "\t\tb2.booking_check_in_date_utc AS other_bookings_check_in_date_utc,\n",
    "\t\t-- Using min as a conservative approach to reduce outliers\n",
    "\t\tmin(b2.booking_number_of_nights) AS min_booking_number_of_nights\n",
    "\tFROM\n",
    "\t\tintermediate.int_booking_summary b\n",
    "\t-- Note that by joining with BS we're only considering New Dash bookings\n",
    "\tLEFT JOIN intermediate.int_booking_summary b2\n",
    "    ON\n",
    "\t\tb2.id_accommodation = b.id_accommodation\n",
    "\t\t-- Exclusion based on check-in\n",
    "\t\tAND b2.booking_check_in_date_utc >= b.booking_check_in_date_utc - INTERVAL '30 days'\n",
    "\t\tAND b2.booking_check_in_date_utc < b.booking_check_in_date_utc\n",
    "\t\t-- that are known!\n",
    "\t\tAND b2.booking_created_date_utc < b.booking_created_date_utc\n",
    "\t\t-- Note that since is based on TCI we cannot remove Cancelled\n",
    "\tGROUP BY\n",
    "\t\tb.id_booking,\n",
    "\t\tb.booking_check_in_date_utc,\n",
    "\t\tb2.booking_check_in_date_utc\n",
    "),\n",
    "known_bookings_checking_in_prior_to_TCI AS (\n",
    "\tSELECT\n",
    "\t\tid_booking,\n",
    "\t\tLEAST(\n",
    "\t\t\tcount(other_bookings_check_in_date_utc),\n",
    "\t\t\t30\n",
    "\t\t) AS listing_known_check_ins_prior_to_TCI_in_30_days,\n",
    "\t\t-- Capping\n",
    "\t\tLEAST(\n",
    "\t\t\tGREATEST(\n",
    "\t\t\t\tsum(min_booking_number_of_nights),\n",
    "\t\t\t\t0\n",
    "\t\t\t),\n",
    "\t\t\t30\n",
    "\t\t) AS listing_known_occupancy_prior_to_TCI_in_30_days,\n",
    "\t\tCOALESCE(\n",
    "\t\t\tbooking_check_in_date_utc - max(other_bookings_check_in_date_utc),\n",
    "\t\t\t30\n",
    "\t\t) AS lead_time_between_prior_known_check_in_to_TCI_30_days\n",
    "\tFROM\n",
    "\t\traw_known_bookings_checking_in_prior_to_TCI\n",
    "\tGROUP BY\n",
    "\t\tid_booking, \n",
    "\t\tbooking_check_in_date_utc\n",
    "),\n",
    "incidents_prior_to_TCP AS (\n",
    "\tSELECT\n",
    "\t\tb.id_booking,\n",
    "\t\t-- Using distinct count on check-in date to remove booking duplicates\n",
    "\t\tCOUNT(DISTINCT b2.booking_check_in_date_utc) AS listing_incidents_prior_to_TCP_in_30_days\n",
    "\tFROM\n",
    "\t\tintermediate.int_booking_summary b\n",
    "\tLEFT JOIN intermediate.int_booking_summary b2\n",
    "    ON\n",
    "\t\tb2.id_accommodation = b.id_accommodation\n",
    "\t\t-- Filter on Check Out date\n",
    "\t\tAND b2.booking_completed_date_utc >= b.booking_created_date_utc - INTERVAL '30 days'\n",
    "\t\tAND b2.booking_completed_date_utc < b.booking_created_date_utc\n",
    "\t\tAND b2.has_resolution_incident = TRUE\n",
    "\tGROUP BY\n",
    "\t\tb.id_booking\n",
    ")\n",
    "SELECT\n",
    "\t-- UNIQUE BOOKING ID --\n",
    "\tbooking_summary.id_booking,\n",
    "\t\n",
    "\t-- CONTEXTUAL SERVICE INFORMATION --\n",
    "\t-- We're not including number_of_applied_services as it 1-correlates with upgraded services\n",
    "\tbooking_summary.number_of_applied_upgraded_services,\n",
    "\tbooking_summary.number_of_applied_billable_services,\n",
    "\tservice_information.number_of_applied_screening_services,\n",
    "\tservice_information.number_of_applied_deposit_management_services,\n",
    "\tservice_information.number_of_applied_protection_services,\n",
    "\tservice_information.has_waiver_pro,\n",
    "\tservice_information.has_guest_facing_waiver_or_deposit,\n",
    "\tservice_information.has_guest_agreement,\n",
    "\tservice_information.has_basic_protection,\n",
    "\tservice_information.has_protection_plus,\n",
    "\tservice_information.has_protection_pro,\n",
    "\tservice_information.has_id_verification,\n",
    "\tservice_information.has_screening_plus,\n",
    "\tservice_information.has_sex_offender_check,\n",
    "\tNOT booking_summary.has_verification_request AS is_contactless_booking,\n",
    "\t\n",
    "\t-- CONTEXTUAL LISTING INFORMATION --\n",
    "\tlisting_information.listing_number_of_bedrooms,\n",
    "\tlisting_information.listing_number_of_bathrooms,\n",
    "\t\n",
    "\t-- CONTEXTUAL TIMELINE OF OUR BOOKING\n",
    "\t-- Defaults to 0 if booking_created_date_utc > booking_check_in_date_utc\n",
    "\tGREATEST(booking_summary.booking_check_in_date_utc - booking_summary.booking_created_date_utc, 0) AS booking_lead_time,\n",
    "\tbooking_summary.booking_check_out_date_utc - booking_summary.booking_check_in_date_utc AS booking_duration,\n",
    "\t\n",
    "\t-- SAME-LISTING, OTHER BOOKING INTERACTIONS: PRIOR TO TCR\n",
    "\tbookings_checked_in_prior_to_TCR.listing_check_ins_prior_to_TCR_in_30_days,\n",
    "\tbookings_checked_in_prior_to_TCR.listing_occupancy_prior_to_TCR_in_30_days,\n",
    "\t\n",
    "\t-- SAME-LISTING, OTHER BOOKING INTERACTIONS: PRIOR TO TCI (KNOWN)\n",
    "\tknown_bookings_checking_in_prior_to_TCI.listing_known_check_ins_prior_to_TCI_in_30_days,\n",
    "\tknown_bookings_checking_in_prior_to_TCI.listing_known_occupancy_prior_to_TCI_in_30_days,\n",
    "\tknown_bookings_checking_in_prior_to_TCI.lead_time_between_prior_known_check_in_to_TCI_30_days,\n",
    "\t\n",
    "\t-- SAME-LISTING, OTHER BOOKING INTERACTIONS: INCIDENTAL BOOKINGS\n",
    "\tincidents_prior_to_TCP.listing_incidents_prior_to_TCP_in_30_days,\n",
    "\t\n",
    "\t-- TARGET (BOOLEAN) --\n",
    "\tbooking_summary.has_resolution_incident\n",
    "\n",
    "FROM\n",
    "\tintermediate.int_booking_summary booking_summary\n",
    "LEFT JOIN service_information \n",
    "\tON\n",
    "\tbooking_summary.id_booking = service_information.id_booking\n",
    "LEFT JOIN listing_information \n",
    "\tON booking_summary.id_accommodation = listing_information.id_accommodation\n",
    "LEFT JOIN bookings_checked_in_prior_to_TCR\n",
    "\tON booking_summary.id_booking = bookings_checked_in_prior_to_TCR.id_booking\n",
    "LEFT JOIN known_bookings_checking_in_prior_to_TCI\n",
    "\tON booking_summary.id_booking = known_bookings_checking_in_prior_to_TCI.id_booking\n",
    "LEFT JOIN incidents_prior_to_TCP\n",
    "\tON booking_summary.id_booking = incidents_prior_to_TCP.id_booking\n",
    "WHERE\n",
    "\t-- 1. Bookings from New Dash users with Id Deal\n",
    "\tbooking_summary.is_user_in_new_dash = TRUE\n",
    "\tAND \n",
    "    booking_summary.is_missing_id_deal = FALSE\n",
    "\tAND\n",
    "\t-- 2. Protected Bookings with a Protection or a Deposit Management service\n",
    "    (\n",
    "\t\tbooking_summary.has_protection_service_business_type\n",
    "\t\t\tOR \n",
    "    booking_summary.has_deposit_management_service_business_type\n",
    "\t)\n",
    "\tAND\n",
    "\t-- 3. Bookings with flagging categorisation (this excludes Cancelled/Incomplete/Rejected bookings)\n",
    "\tbooking_summary.is_booking_flagged_as_risk IS NOT NULL\n",
    "\tAND\n",
    "\t-- 4. Booking is completed\n",
    "\tbooking_summary.is_booking_past_completion_date = TRUE\n",
    "\n",
    "\n",
    "\"\"\"\n",
    "\n",
    "# Retrieve Data from Query\n",
    "df_extraction = query_to_dataframe(engine=dwh_pg_engine, query=data_extraction_query)\n",
    "print(f\"Total Bookings: {len(df_extraction):,}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preprocessing\n",
    "Preprocessing in this notebook is quite straight-forward: we just drop id booking and split the features and target."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop ID column\n",
    "df = df_extraction.copy().drop(columns=['id_booking'])\n",
    "\n",
    "# Separate features and target\n",
    "target_col = 'has_resolution_incident'\n",
    "X = df.drop(columns=[target_col])\n",
    "y = df[target_col]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploratory Data Analysis\n",
    "In this section we focus on explore the different features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### EDA - Dataset Overview"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shape: (21384, 25)\n",
      "has_resolution_incident\n",
      "False    98.8\n",
      "True      1.2\n",
      "Name: proportion, dtype: float64\n"
     ]
    }
   ],
   "source": [
    "# Shape and types\n",
    "print(f\"Shape: {X.shape}\")\n",
    "\n",
    "# Target distribution\n",
    "print(round(100*df[target_col].value_counts(normalize=True),2))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "      <th>mean</th>\n",
       "      <th>std</th>\n",
       "      <th>min</th>\n",
       "      <th>5%</th>\n",
       "      <th>25%</th>\n",
       "      <th>50%</th>\n",
       "      <th>75%</th>\n",
       "      <th>95%</th>\n",
       "      <th>99%</th>\n",
       "      <th>max</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>number_of_applied_upgraded_services</th>\n",
       "      <td>21384.0</td>\n",
       "      <td>2.664282</td>\n",
       "      <td>1.532038</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>7.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>number_of_applied_billable_services</th>\n",
       "      <td>21384.0</td>\n",
       "      <td>1.842780</td>\n",
       "      <td>0.946184</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>number_of_applied_screening_services</th>\n",
       "      <td>21384.0</td>\n",
       "      <td>2.007903</td>\n",
       "      <td>0.985649</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>number_of_applied_deposit_management_services</th>\n",
       "      <td>21384.0</td>\n",
       "      <td>0.620651</td>\n",
       "      <td>0.485814</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>number_of_applied_protection_services</th>\n",
       "      <td>21384.0</td>\n",
       "      <td>0.727132</td>\n",
       "      <td>0.445444</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>listing_number_of_bedrooms</th>\n",
       "      <td>21384.0</td>\n",
       "      <td>2.049476</td>\n",
       "      <td>1.755499</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>15.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>listing_number_of_bathrooms</th>\n",
       "      <td>21384.0</td>\n",
       "      <td>1.590816</td>\n",
       "      <td>1.312573</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>17.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>booking_lead_time</th>\n",
       "      <td>21384.0</td>\n",
       "      <td>18.151422</td>\n",
       "      <td>24.349579</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>9.0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>113.0</td>\n",
       "      <td>220.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>booking_duration</th>\n",
       "      <td>21384.0</td>\n",
       "      <td>4.175084</td>\n",
       "      <td>4.851055</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>28.0</td>\n",
       "      <td>116.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>listing_check_ins_prior_to_tcr_in_30_days</th>\n",
       "      <td>21384.0</td>\n",
       "      <td>2.481107</td>\n",
       "      <td>2.804436</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>25.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>listing_occupancy_prior_to_tcr_in_30_days</th>\n",
       "      <td>21384.0</td>\n",
       "      <td>8.780817</td>\n",
       "      <td>9.260855</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>16.0</td>\n",
       "      <td>27.0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>listing_known_check_ins_prior_to_tci_in_30_days</th>\n",
       "      <td>21384.0</td>\n",
       "      <td>2.661149</td>\n",
       "      <td>2.937777</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>26.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>listing_known_occupancy_prior_to_tci_in_30_days</th>\n",
       "      <td>21384.0</td>\n",
       "      <td>9.470913</td>\n",
       "      <td>9.715511</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>17.0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>lead_time_between_prior_known_check_in_to_tci_30_days</th>\n",
       "      <td>21384.0</td>\n",
       "      <td>15.287318</td>\n",
       "      <td>11.424657</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>11.0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>listing_incidents_prior_to_tcp_in_30_days</th>\n",
       "      <td>21384.0</td>\n",
       "      <td>0.013468</td>\n",
       "      <td>0.130493</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                      count       mean  \\\n",
       "number_of_applied_upgraded_services                 21384.0   2.664282   \n",
       "number_of_applied_billable_services                 21384.0   1.842780   \n",
       "number_of_applied_screening_services                21384.0   2.007903   \n",
       "number_of_applied_deposit_management_services       21384.0   0.620651   \n",
       "number_of_applied_protection_services               21384.0   0.727132   \n",
       "listing_number_of_bedrooms                          21384.0   2.049476   \n",
       "listing_number_of_bathrooms                         21384.0   1.590816   \n",
       "booking_lead_time                                   21384.0  18.151422   \n",
       "booking_duration                                    21384.0   4.175084   \n",
       "listing_check_ins_prior_to_tcr_in_30_days           21384.0   2.481107   \n",
       "listing_occupancy_prior_to_tcr_in_30_days           21384.0   8.780817   \n",
       "listing_known_check_ins_prior_to_tci_in_30_days     21384.0   2.661149   \n",
       "listing_known_occupancy_prior_to_tci_in_30_days     21384.0   9.470913   \n",
       "lead_time_between_prior_known_check_in_to_tci_3...  21384.0  15.287318   \n",
       "listing_incidents_prior_to_tcp_in_30_days           21384.0   0.013468   \n",
       "\n",
       "                                                          std  min   5%  25%  \\\n",
       "number_of_applied_upgraded_services                  1.532038  1.0  1.0  1.0   \n",
       "number_of_applied_billable_services                  0.946184  0.0  1.0  1.0   \n",
       "number_of_applied_screening_services                 0.985649  1.0  1.0  1.0   \n",
       "number_of_applied_deposit_management_services        0.485814  0.0  0.0  0.0   \n",
       "number_of_applied_protection_services                0.445444  0.0  0.0  0.0   \n",
       "listing_number_of_bedrooms                           1.755499  0.0  0.0  1.0   \n",
       "listing_number_of_bathrooms                          1.312573  0.0  0.0  1.0   \n",
       "booking_lead_time                                   24.349579  0.0  0.0  2.0   \n",
       "booking_duration                                     4.851055  0.0  1.0  2.0   \n",
       "listing_check_ins_prior_to_tcr_in_30_days            2.804436  0.0  0.0  0.0   \n",
       "listing_occupancy_prior_to_tcr_in_30_days            9.260855  0.0  0.0  0.0   \n",
       "listing_known_check_ins_prior_to_tci_in_30_days      2.937777  0.0  0.0  0.0   \n",
       "listing_known_occupancy_prior_to_tci_in_30_days      9.715511  0.0  0.0  0.0   \n",
       "lead_time_between_prior_known_check_in_to_tci_3...  11.424657  1.0  2.0  5.0   \n",
       "listing_incidents_prior_to_tcp_in_30_days            0.130493  0.0  0.0  0.0   \n",
       "\n",
       "                                                     50%   75%   95%    99%  \\\n",
       "number_of_applied_upgraded_services                  2.0   4.0   5.0    6.0   \n",
       "number_of_applied_billable_services                  2.0   2.0   4.0    4.0   \n",
       "number_of_applied_screening_services                 2.0   3.0   4.0    4.0   \n",
       "number_of_applied_deposit_management_services        1.0   1.0   1.0    1.0   \n",
       "number_of_applied_protection_services                1.0   1.0   1.0    1.0   \n",
       "listing_number_of_bedrooms                           2.0   3.0   5.0    8.0   \n",
       "listing_number_of_bathrooms                          1.0   2.0   4.0    6.0   \n",
       "booking_lead_time                                    9.0  25.0  69.0  113.0   \n",
       "booking_duration                                     3.0   5.0  10.0   28.0   \n",
       "listing_check_ins_prior_to_tcr_in_30_days            2.0   4.0   8.0   11.0   \n",
       "listing_occupancy_prior_to_tcr_in_30_days            6.0  16.0  27.0   30.0   \n",
       "listing_known_check_ins_prior_to_tci_in_30_days      2.0   4.0   8.0   12.0   \n",
       "listing_known_occupancy_prior_to_tci_in_30_days      6.0  17.0  30.0   30.0   \n",
       "lead_time_between_prior_known_check_in_to_tci_3...  11.0  30.0  30.0   30.0   \n",
       "listing_incidents_prior_to_tcp_in_30_days            0.0   0.0   0.0    1.0   \n",
       "\n",
       "                                                      max  \n",
       "number_of_applied_upgraded_services                   7.0  \n",
       "number_of_applied_billable_services                   5.0  \n",
       "number_of_applied_screening_services                  4.0  \n",
       "number_of_applied_deposit_management_services         2.0  \n",
       "number_of_applied_protection_services                 1.0  \n",
       "listing_number_of_bedrooms                           15.0  \n",
       "listing_number_of_bathrooms                          17.0  \n",
       "booking_lead_time                                   220.0  \n",
       "booking_duration                                    116.0  \n",
       "listing_check_ins_prior_to_tcr_in_30_days            25.0  \n",
       "listing_occupancy_prior_to_tcr_in_30_days            30.0  \n",
       "listing_known_check_ins_prior_to_tci_in_30_days      26.0  \n",
       "listing_known_occupancy_prior_to_tci_in_30_days      30.0  \n",
       "lead_time_between_prior_known_check_in_to_tci_3...   30.0  \n",
       "listing_incidents_prior_to_tcp_in_30_days             3.0  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "      <th>unique</th>\n",
       "      <th>top</th>\n",
       "      <th>freq</th>\n",
       "      <th>freq/count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>has_waiver_pro</th>\n",
       "      <td>21384</td>\n",
       "      <td>2</td>\n",
       "      <td>False</td>\n",
       "      <td>19082</td>\n",
       "      <td>0.892349</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>has_guest_facing_waiver_or_deposit</th>\n",
       "      <td>21384</td>\n",
       "      <td>2</td>\n",
       "      <td>True</td>\n",
       "      <td>10970</td>\n",
       "      <td>0.513</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>has_guest_agreement</th>\n",
       "      <td>21384</td>\n",
       "      <td>2</td>\n",
       "      <td>False</td>\n",
       "      <td>14787</td>\n",
       "      <td>0.691498</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>has_basic_protection</th>\n",
       "      <td>21384</td>\n",
       "      <td>2</td>\n",
       "      <td>False</td>\n",
       "      <td>11894</td>\n",
       "      <td>0.55621</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>has_protection_plus</th>\n",
       "      <td>21384</td>\n",
       "      <td>2</td>\n",
       "      <td>False</td>\n",
       "      <td>20083</td>\n",
       "      <td>0.93916</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>has_protection_pro</th>\n",
       "      <td>21384</td>\n",
       "      <td>2</td>\n",
       "      <td>False</td>\n",
       "      <td>16626</td>\n",
       "      <td>0.777497</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>has_id_verification</th>\n",
       "      <td>21384</td>\n",
       "      <td>2</td>\n",
       "      <td>False</td>\n",
       "      <td>12438</td>\n",
       "      <td>0.58165</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>has_screening_plus</th>\n",
       "      <td>21384</td>\n",
       "      <td>2</td>\n",
       "      <td>False</td>\n",
       "      <td>11001</td>\n",
       "      <td>0.51445</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>has_sex_offender_check</th>\n",
       "      <td>21384</td>\n",
       "      <td>2</td>\n",
       "      <td>False</td>\n",
       "      <td>19158</td>\n",
       "      <td>0.895903</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>is_contactless_booking</th>\n",
       "      <td>21384</td>\n",
       "      <td>2</td>\n",
       "      <td>False</td>\n",
       "      <td>13185</td>\n",
       "      <td>0.616582</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>has_resolution_incident</th>\n",
       "      <td>21384</td>\n",
       "      <td>2</td>\n",
       "      <td>False</td>\n",
       "      <td>21127</td>\n",
       "      <td>0.987982</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                    count unique    top   freq freq/count\n",
       "has_waiver_pro                      21384      2  False  19082   0.892349\n",
       "has_guest_facing_waiver_or_deposit  21384      2   True  10970      0.513\n",
       "has_guest_agreement                 21384      2  False  14787   0.691498\n",
       "has_basic_protection                21384      2  False  11894    0.55621\n",
       "has_protection_plus                 21384      2  False  20083    0.93916\n",
       "has_protection_pro                  21384      2  False  16626   0.777497\n",
       "has_id_verification                 21384      2  False  12438    0.58165\n",
       "has_screening_plus                  21384      2  False  11001    0.51445\n",
       "has_sex_offender_check              21384      2  False  19158   0.895903\n",
       "is_contactless_booking              21384      2  False  13185   0.616582\n",
       "has_resolution_incident             21384      2  False  21127   0.987982"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Summary statistics for numerical features, with extra percentiles to expose the tails\n",
    "display(df.describe(include=['number'], percentiles=[.05, .25, .5, .75, .95, .99]).T)\n",
    "# Summary statistics for boolean features\n",
    "summary = df.describe(include=['bool']).T\n",
    "# Share of rows taken by the most frequent value of each boolean feature\n",
    "summary['freq/count'] = summary['freq'] / summary['count']\n",
    "display(summary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Figure size 1700x1300 with 2 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Correlation heatmap\n",
    "plt.figure(figsize=(17, 13))\n",
    "cmap = sns.diverging_palette(220, 20, as_cmap=True)\n",
    "sns.heatmap(df.corr(), annot=True, cmap=cmap, fmt=\".2f\", linewidths=0.5)\n",
    "plt.title(\"Correlation Matrix\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Processing for modelling\n",
    "Next, we split the dataset into training (70%) and test (30%) sets and display their sizes and target distributions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training set size: 14968 rows\n",
      "Test set size: 6416 rows\n",
      "\n",
      "Training target distribution:\n",
      "has_resolution_incident\n",
      "False    0.98744\n",
      "True     0.01256\n",
      "Name: proportion, dtype: float64\n",
      "\n",
      "Test target distribution:\n",
      "has_resolution_incident\n",
      "False    0.989246\n",
      "True     0.010754\n",
      "Name: proportion, dtype: float64\n"
     ]
    }
   ],
   "source": [
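    "# Split the data into train (70%) and test (30%) sets with a fixed seed for reproducibility.\n",
    "# The split is not stratified, so the incident rate can differ slightly between the two sets.\n",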
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)\n",
    "\n",
    "print(f\"Training set size: {X_train.shape[0]} rows\")\n",
    "print(f\"Test set size: {X_test.shape[0]} rows\")\n",
    "\n",
    "print(\"\\nTraining target distribution:\")\n",
    "print(y_train.value_counts(normalize=True))\n",
    "\n",
    "print(\"\\nTest target distribution:\")\n",
    "print(y_test.value_counts(normalize=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d36c9276",
   "metadata": {},
   "source": [
    "## Classification Model with Random Forest\n",
    "\n",
    "We define a machine learning pipeline that includes:\n",
    "- **Scaling numeric features** with `StandardScaler`\n",
    "- **Training a Random Forest classifier** with balanced class weights to handle the imbalanced dataset\n",
    "\n",
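    "We then use `GridSearchCV` to run a **grid search with cross-validation** over the key hyperparameters: number of trees, maximum depth, features considered per split, and the minimum number of samples per split and per leaf.  \n",
    "Candidates are scored with **Average Precision**, which is better suited than accuracy to imbalanced classification tasks such as this one, where only about 1% of bookings have a resolution incident.\n",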
    "\n",
    "The best combination of parameters is selected, and the resulting model is used to make predictions on the test set.\n"
   ]
  },
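  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of why Average Precision is preferred over accuracy here, the example below scores a constant \"never flag an incident\" model on synthetic labels. The labels, the 1% positive rate, and all variable names are assumptions made for illustration only; they are not taken from the DWH data or from the pipeline used in this notebook.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.metrics import accuracy_score, average_precision_score\n",
    "\n",
    "# Synthetic labels with roughly a 1% positive rate (an assumption for illustration).\n",
    "rng = np.random.default_rng(123)\n",
    "n_samples = 10_000\n",
    "y_true = rng.random(n_samples) < 0.01\n",
    "\n",
    "# A \"model\" that never flags an incident: the same score of 0 for every row.\n",
    "constant_scores = np.zeros(n_samples)\n",
    "constant_preds = constant_scores > 0.5\n",
    "\n",
    "accuracy = accuracy_score(y_true, constant_preds)\n",
    "ap = average_precision_score(y_true, constant_scores)\n",
    "\n",
    "print(f\"Accuracy of the constant model: {accuracy:.3f}\")\n",
    "print(f\"Average Precision of the constant model: {ap:.3f}\")\n",
    "print(f\"Positive rate: {y_true.mean():.3f}\")\n",
    "```\n",
    "\n",
    "The constant model reaches a high accuracy simply by predicting the majority class, while its Average Precision falls to the positive rate. A useful model therefore has to score above that baseline, which makes Average Precision a more informative target for the grid search."
   ]
  },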
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "943ef7d6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 4 folds for each of 72 candidates, totalling 288 fits\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   1.9s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.9s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   1.9s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   1.9s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.9s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   2.6s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   2.6s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.8s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.8s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.9s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.8s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.5s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.6s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.7s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.7s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   5.8s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.9s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   5.9s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   7.3s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   7.4s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.9s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.9s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   6.1s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   6.3s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   4.4s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   4.3s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   4.3s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   4.5s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   2.7s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   7.4s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   7.5s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   2.6s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   5.7s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   5.7s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   3.8s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   3.8s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   6.2s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   3.9s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   4.1s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   6.5s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   2.2s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   2.1s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   2.2s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   2.2s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.9s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.4s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.2s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.3s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   5.3s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.4s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   5.4s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   5.4s\n",
      "[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   5.8s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.6s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   5.0s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.7s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.7s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   5.4s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   5.5s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   5.5s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   2.0s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.8s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   4.0s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.9s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.9s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   4.7s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   4.9s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   4.7s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.2s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   4.9s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.9s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.3s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.3s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.9s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.3s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.9s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   1.7s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   1.7s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   2.0s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   2.1s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   3.2s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   3.2s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   3.2s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   5.6s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   3.3s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   5.6s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   5.8s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   5.6s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   4.7s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   4.6s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.6s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.6s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.1s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.1s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.6s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.6s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   5.2s\n",
      "[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   5.3s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.4s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.4s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.7s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.7s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.7s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   4.6s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   4.6s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   4.6s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.7s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.1s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.3s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.3s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   5.0s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.3s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.7s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.6s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   4.9s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   4.9s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   5.1s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   5.0s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.1s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.6s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.5s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.5s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   1.5s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   4.6s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   4.8s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   1.9s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   1.9s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   1.9s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   4.7s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   5.6s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   3.4s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   3.5s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   3.4s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   3.4s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.4s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.4s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.4s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   2.8s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.7s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   2.8s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   2.9s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   5.6s[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.3s\n",
      "\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   5.6s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   5.7s\n",
      "[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   5.8s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   4.4s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.6s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.6s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.7s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   4.9s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   4.8s[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   4.3s\n",
      "\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   2.9s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.6s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   2.9s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   2.9s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.0s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.5s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.5s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   4.4s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.5s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   4.4s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   5.0s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.1s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   4.9s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.4s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.6s[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.5s\n",
      "\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   4.3s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   4.3s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   3.0s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   2.9s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   2.9s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   3.0s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   5.1s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   5.1s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   2.1s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   2.3s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   2.1s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   2.1s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   4.5s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   5.1s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   4.5s\n",
      "[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   4.9s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   2.0s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   2.0s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   2.0s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   2.0s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.6s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.6s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.7s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.7s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.7s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   5.3s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   5.5s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.9s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.7s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   6.1s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   6.3s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   4.4s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   2.0s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   5.9s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   5.9s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   5.9s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.9s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.9s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   2.5s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.9s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.9s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   2.6s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   6.6s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   2.6s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   5.2s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   5.2s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   5.3s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   1.6s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   1.7s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time=   1.7s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   3.4s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   5.7s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   3.5s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   3.6s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   3.6s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.6s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.6s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.1s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.2s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.6s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time=   1.6s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.2s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   5.1s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time=   3.2s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   5.2s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   5.2s\n",
      "[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   6.0s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.7s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   5.0s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   5.6s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   5.5s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.4s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time=   5.6s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time=   1.8s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.1s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.4s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time=   3.2s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.5s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.5s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.7s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time=   1.5s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   4.7s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   4.8s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   5.0s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time=   4.9s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.1s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.1s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   3.0s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time=   2.9s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   4.2s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   4.4s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   4.4s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   2.6s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   2.7s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time=   4.4s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   2.6s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time=   2.6s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   3.2s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   3.3s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   3.2s\n",
      "[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time=   3.2s\n",
      "Best hyperparameters: {'model__max_depth': 10, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 2, 'model__min_samples_split': 5, 'model__n_estimators': 300}\n"
     ]
    }
   ],
   "source": [
    "\n",
    "# Define pipeline (scaling numeric features only)\n",
    "pipeline = Pipeline([\n",
    "    ('scaler', StandardScaler()),\n",
    "    ('model', RandomForestClassifier(class_weight='balanced', # We have an imbalanced dataset\n",
    "                                     random_state=123))\n",
    "])\n",
    "\n",
    "# Define parameter grid\n",
    "param_grid = {\n",
    "    'model__n_estimators': [100, 200, 300],\n",
    "    'model__max_depth': [None, 10, 20],\n",
    "    'model__min_samples_split': [2, 5],\n",
    "    'model__min_samples_leaf': [1, 2],\n",
    "    'model__max_features': ['sqrt', 'log2']\n",
    "}\n",
    "\n",
    "# GridSearchCV\n",
    "grid_search = GridSearchCV(\n",
    "    estimator=pipeline,\n",
    "    param_grid=param_grid,\n",
    "    scoring='average_precision',  # For imbalanced classification\n",
    "    cv=4, # 4-fold cross-validation\n",
    "    n_jobs=-1, # Use all available cores\n",
    "    verbose=2, # Verbose output for progress tracking,\n",
    "    refit=True # Refit the best model on the entire training set - it's already true by default\n",
    ")\n",
    "\n",
    "# Fit the grid search on training data\n",
    "grid_search.fit(X_train, y_train)\n",
    "\n",
    "# Best model\n",
    "best_pipeline = grid_search.best_estimator_\n",
    "print(\"Best hyperparameters:\", grid_search.best_params_)\n",
    "\n",
    "# Predict on test set\n",
    "y_pred_proba = best_pipeline.predict_proba(X_test)[:, 1]\n",
    "y_pred = best_pipeline.predict(X_test)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mean_fit_time</th>\n",
       "      <th>std_fit_time</th>\n",
       "      <th>mean_score_time</th>\n",
       "      <th>std_score_time</th>\n",
       "      <th>param_model__max_depth</th>\n",
       "      <th>param_model__max_features</th>\n",
       "      <th>param_model__min_samples_leaf</th>\n",
       "      <th>param_model__min_samples_split</th>\n",
       "      <th>param_model__n_estimators</th>\n",
       "      <th>params</th>\n",
       "      <th>split0_test_score</th>\n",
       "      <th>split1_test_score</th>\n",
       "      <th>split2_test_score</th>\n",
       "      <th>split3_test_score</th>\n",
       "      <th>mean_test_score</th>\n",
       "      <th>std_test_score</th>\n",
       "      <th>rank_test_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>5.492363</td>\n",
       "      <td>0.074103</td>\n",
       "      <td>0.193978</td>\n",
       "      <td>0.016560</td>\n",
       "      <td>10</td>\n",
       "      <td>sqrt</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>300</td>\n",
       "      <td>{'model__max_depth': 10, 'model__max_features'...</td>\n",
       "      <td>0.041262</td>\n",
       "      <td>0.021222</td>\n",
       "      <td>0.028958</td>\n",
       "      <td>0.058779</td>\n",
       "      <td>0.037555</td>\n",
       "      <td>0.014185</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>3.078427</td>\n",
       "      <td>0.037090</td>\n",
       "      <td>0.129033</td>\n",
       "      <td>0.003915</td>\n",
       "      <td>None</td>\n",
       "      <td>log2</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>200</td>\n",
       "      <td>{'model__max_depth': None, 'model__max_feature...</td>\n",
       "      <td>0.046899</td>\n",
       "      <td>0.023721</td>\n",
       "      <td>0.029079</td>\n",
       "      <td>0.049230</td>\n",
       "      <td>0.037232</td>\n",
       "      <td>0.011028</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>54</th>\n",
       "      <td>1.725934</td>\n",
       "      <td>0.030368</td>\n",
       "      <td>0.065814</td>\n",
       "      <td>0.002268</td>\n",
       "      <td>20</td>\n",
       "      <td>sqrt</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>100</td>\n",
       "      <td>{'model__max_depth': 20, 'model__max_features'...</td>\n",
       "      <td>0.046455</td>\n",
       "      <td>0.021084</td>\n",
       "      <td>0.030397</td>\n",
       "      <td>0.050986</td>\n",
       "      <td>0.037230</td>\n",
       "      <td>0.012059</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>4.754896</td>\n",
       "      <td>0.284760</td>\n",
       "      <td>0.197159</td>\n",
       "      <td>0.010598</td>\n",
       "      <td>None</td>\n",
       "      <td>log2</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>300</td>\n",
       "      <td>{'model__max_depth': None, 'model__max_feature...</td>\n",
       "      <td>0.045281</td>\n",
       "      <td>0.024624</td>\n",
       "      <td>0.028884</td>\n",
       "      <td>0.049424</td>\n",
       "      <td>0.037053</td>\n",
       "      <td>0.010511</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>64</th>\n",
       "      <td>3.150147</td>\n",
       "      <td>0.123393</td>\n",
       "      <td>0.133204</td>\n",
       "      <td>0.010875</td>\n",
       "      <td>20</td>\n",
       "      <td>log2</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>200</td>\n",
       "      <td>{'model__max_depth': 20, 'model__max_features'...</td>\n",
       "      <td>0.048786</td>\n",
       "      <td>0.021536</td>\n",
       "      <td>0.031982</td>\n",
       "      <td>0.045861</td>\n",
       "      <td>0.037041</td>\n",
       "      <td>0.010974</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3.655133</td>\n",
       "      <td>0.052994</td>\n",
       "      <td>0.141072</td>\n",
       "      <td>0.002776</td>\n",
       "      <td>None</td>\n",
       "      <td>sqrt</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>200</td>\n",
       "      <td>{'model__max_depth': None, 'model__max_feature...</td>\n",
       "      <td>0.044698</td>\n",
       "      <td>0.019424</td>\n",
       "      <td>0.026336</td>\n",
       "      <td>0.041751</td>\n",
       "      <td>0.033052</td>\n",
       "      <td>0.010513</td>\n",
       "      <td>68</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49</th>\n",
       "      <td>3.499403</td>\n",
       "      <td>0.044126</td>\n",
       "      <td>0.146713</td>\n",
       "      <td>0.003312</td>\n",
       "      <td>20</td>\n",
       "      <td>sqrt</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>200</td>\n",
       "      <td>{'model__max_depth': 20, 'model__max_features'...</td>\n",
       "      <td>0.043488</td>\n",
       "      <td>0.019535</td>\n",
       "      <td>0.026128</td>\n",
       "      <td>0.041667</td>\n",
       "      <td>0.032705</td>\n",
       "      <td>0.010165</td>\n",
       "      <td>69</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48</th>\n",
       "      <td>2.029998</td>\n",
       "      <td>0.085049</td>\n",
       "      <td>0.118226</td>\n",
       "      <td>0.019632</td>\n",
       "      <td>20</td>\n",
       "      <td>sqrt</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>100</td>\n",
       "      <td>{'model__max_depth': 20, 'model__max_features'...</td>\n",
       "      <td>0.040683</td>\n",
       "      <td>0.018370</td>\n",
       "      <td>0.026502</td>\n",
       "      <td>0.038585</td>\n",
       "      <td>0.031035</td>\n",
       "      <td>0.009097</td>\n",
       "      <td>70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>2.102099</td>\n",
       "      <td>0.029990</td>\n",
       "      <td>0.092719</td>\n",
       "      <td>0.007638</td>\n",
       "      <td>None</td>\n",
       "      <td>log2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>100</td>\n",
       "      <td>{'model__max_depth': None, 'model__max_feature...</td>\n",
       "      <td>0.035229</td>\n",
       "      <td>0.020518</td>\n",
       "      <td>0.024970</td>\n",
       "      <td>0.039950</td>\n",
       "      <td>0.030167</td>\n",
       "      <td>0.007769</td>\n",
       "      <td>71</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.983677</td>\n",
       "      <td>0.277025</td>\n",
       "      <td>0.091703</td>\n",
       "      <td>0.020498</td>\n",
       "      <td>None</td>\n",
       "      <td>sqrt</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>100</td>\n",
       "      <td>{'model__max_depth': None, 'model__max_feature...</td>\n",
       "      <td>0.037104</td>\n",
       "      <td>0.016652</td>\n",
       "      <td>0.023631</td>\n",
       "      <td>0.034512</td>\n",
       "      <td>0.027975</td>\n",
       "      <td>0.008264</td>\n",
       "      <td>72</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>72 rows × 17 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    mean_fit_time  std_fit_time  mean_score_time  std_score_time  \\\n",
       "35       5.492363      0.074103         0.193978        0.016560   \n",
       "22       3.078427      0.037090         0.129033        0.003915   \n",
       "54       1.725934      0.030368         0.065814        0.002268   \n",
       "23       4.754896      0.284760         0.197159        0.010598   \n",
       "64       3.150147      0.123393         0.133204        0.010875   \n",
       "..            ...           ...              ...             ...   \n",
       "1        3.655133      0.052994         0.141072        0.002776   \n",
       "49       3.499403      0.044126         0.146713        0.003312   \n",
       "48       2.029998      0.085049         0.118226        0.019632   \n",
       "12       2.102099      0.029990         0.092719        0.007638   \n",
       "0        1.983677      0.277025         0.091703        0.020498   \n",
       "\n",
       "   param_model__max_depth param_model__max_features  \\\n",
       "35                     10                      sqrt   \n",
       "22                   None                      log2   \n",
       "54                     20                      sqrt   \n",
       "23                   None                      log2   \n",
       "64                     20                      log2   \n",
       "..                    ...                       ...   \n",
       "1                    None                      sqrt   \n",
       "49                     20                      sqrt   \n",
       "48                     20                      sqrt   \n",
       "12                   None                      log2   \n",
       "0                    None                      sqrt   \n",
       "\n",
       "    param_model__min_samples_leaf  param_model__min_samples_split  \\\n",
       "35                              2                               5   \n",
       "22                              2                               5   \n",
       "54                              2                               2   \n",
       "23                              2                               5   \n",
       "64                              1                               5   \n",
       "..                            ...                             ...   \n",
       "1                               1                               2   \n",
       "49                              1                               2   \n",
       "48                              1                               2   \n",
       "12                              1                               2   \n",
       "0                               1                               2   \n",
       "\n",
       "    param_model__n_estimators  \\\n",
       "35                        300   \n",
       "22                        200   \n",
       "54                        100   \n",
       "23                        300   \n",
       "64                        200   \n",
       "..                        ...   \n",
       "1                         200   \n",
       "49                        200   \n",
       "48                        100   \n",
       "12                        100   \n",
       "0                         100   \n",
       "\n",
       "                                               params  split0_test_score  \\\n",
       "35  {'model__max_depth': 10, 'model__max_features'...           0.041262   \n",
       "22  {'model__max_depth': None, 'model__max_feature...           0.046899   \n",
       "54  {'model__max_depth': 20, 'model__max_features'...           0.046455   \n",
       "23  {'model__max_depth': None, 'model__max_feature...           0.045281   \n",
       "64  {'model__max_depth': 20, 'model__max_features'...           0.048786   \n",
       "..                                                ...                ...   \n",
       "1   {'model__max_depth': None, 'model__max_feature...           0.044698   \n",
       "49  {'model__max_depth': 20, 'model__max_features'...           0.043488   \n",
       "48  {'model__max_depth': 20, 'model__max_features'...           0.040683   \n",
       "12  {'model__max_depth': None, 'model__max_feature...           0.035229   \n",
       "0   {'model__max_depth': None, 'model__max_feature...           0.037104   \n",
       "\n",
       "    split1_test_score  split2_test_score  split3_test_score  mean_test_score  \\\n",
       "35           0.021222           0.028958           0.058779         0.037555   \n",
       "22           0.023721           0.029079           0.049230         0.037232   \n",
       "54           0.021084           0.030397           0.050986         0.037230   \n",
       "23           0.024624           0.028884           0.049424         0.037053   \n",
       "64           0.021536           0.031982           0.045861         0.037041   \n",
       "..                ...                ...                ...              ...   \n",
       "1            0.019424           0.026336           0.041751         0.033052   \n",
       "49           0.019535           0.026128           0.041667         0.032705   \n",
       "48           0.018370           0.026502           0.038585         0.031035   \n",
       "12           0.020518           0.024970           0.039950         0.030167   \n",
       "0            0.016652           0.023631           0.034512         0.027975   \n",
       "\n",
       "    std_test_score  rank_test_score  \n",
       "35        0.014185                1  \n",
       "22        0.011028                2  \n",
       "54        0.012059                3  \n",
       "23        0.010511                4  \n",
       "64        0.010974                5  \n",
       "..             ...              ...  \n",
       "1         0.010513               68  \n",
       "49        0.010165               69  \n",
       "48        0.009097               70  \n",
       "12        0.007769               71  \n",
       "0         0.008264               72  \n",
       "\n",
       "[72 rows x 17 columns]"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Retrieve cv results\n",
    "pd.DataFrame(grid_search.cv_results_).sort_values(by='mean_test_score', ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We search for the decision threshold that maximises the F2 score, rather than defaulting to 0.5. Note that the search below runs on the test set, so the resulting F2 score is an optimistic estimate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find the threshold that maximises the F-beta score (F2 by default)\n",
    "\n",
    "def find_best_threshold(y_true, y_proba, beta=2.0):\n",
    "    \"\"\"Return the (threshold, score) pair that maximises F-beta over a fixed grid of 200 thresholds.\"\"\"\n",
    "    thresholds = np.linspace(0, 1, 200)\n",
    "    scores = []\n",
    "\n",
    "    for t in thresholds:\n",
    "        preds = (y_proba >= t).astype(int)\n",
    "        # zero_division=0 scores thresholds with no positive predictions as 0, without warnings\n",
    "        score = fbeta_score(y_true, preds, beta=beta, zero_division=0)\n",
    "        scores.append(score)\n",
    "\n",
    "    best_index = np.argmax(scores)\n",
    "    return thresholds[best_index], scores[best_index]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best threshold: 38.2% — F2 score: 15.31%\n"
     ]
    }
   ],
   "source": [
    "# Predict probabilities\n",
    "y_pred_proba = best_pipeline.predict_proba(X_test)[:, 1]\n",
    "\n",
    "# Find best threshold for F2\n",
    "best_thresh, best_f2 = find_best_threshold(y_test, y_pred_proba, beta=2.0)\n",
    "print(f\"Best threshold: {100*best_thresh:.1f}% — F2 score: {100*best_f2:.2f}%\")\n",
    "\n",
    "# Use that threshold for final classification\n",
    "y_pred_opt = (y_pred_proba >= best_thresh).astype(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc2fcc89",
   "metadata": {},
   "source": [
    "## Evaluation\n",
    "This section evaluates the new model's predictions against the actual Resolution Incidents.\n",
    "\n",
    "We start by computing and displaying the classification report, ROC Curve, PR Curve and the respective Area Under the Curve (AUC)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "30786f7c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      " No Incident       0.99      0.89      0.94      6347\n",
      "    Incident       0.04      0.43      0.08        69\n",
      "\n",
      "    accuracy                           0.89      6416\n",
      "   macro avg       0.52      0.66      0.51      6416\n",
      "weighted avg       0.98      0.89      0.93      6416\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Print classification report\n",
    "print(classification_report(y_test, y_pred_opt, target_names=['No Incident', 'Incident']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Interpreting the Classification Report\n",
    "\n",
    "The **Classification Report** provides key metrics to evaluate how well the model performed on each class.\n",
    "\n",
    "It includes the following metrics for each class (0 and 1):\n",
    "* Precision: Out of all predicted positives, how many were actually positive?\n",
    "* Recall: Out of all actual positives, how many did we correctly identify?\n",
    "* F1-score: Harmonic mean of precision and recall (balances both)\n",
    "* Support: Number of true samples of that class in the test data\n",
    "\n",
    "Interpretation:\n",
    "* Class 0 = No incident\n",
    "* Class 1 = Has resolution incident (rare, but important!)\n",
    "\n",
    "A few explanatory cases:\n",
    "* A high recall for class 1 means we're catching most incidents.\n",
    "* A high precision for class 1 means when we predict an incident, we're often correct.\n",
    "* The F1-score combines both into a single number and is only high when precision and recall are both high.\n",
    "\n",
    "Special note for imbalanced data:\n",
    "Since class 1 (incident) is rare (about 1% in our case), the metrics for that class are the ones that matter; overall accuracy and the weighted averages are dominated by class 0.\n",
    "We want to maximize recall to catch as many real incidents as possible, without letting precision drop so low that false alarms become unmanageable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "4b4da914",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Figure size 600x500 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# ROC Curve\n",
    "fpr, tpr, _ = roc_curve(y_test, y_pred_proba)\n",
    "roc_auc = auc(fpr, tpr)\n",
    "\n",
    "plt.figure(figsize=(6, 5))\n",
    "plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.4f})')\n",
    "plt.plot([0, 1], [0, 1], color='gray', linestyle='--')\n",
    "plt.xlabel('False Positive Rate')\n",
    "plt.ylabel('True Positive Rate')\n",
    "plt.title('ROC Curve')\n",
    "plt.legend(loc='lower right')\n",
    "plt.grid(True)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Interpreting the ROC Curve\n",
    "\n",
    "The **Receiver Operating Characteristic (ROC) curve** shows how well the model distinguishes between the positive and negative classes across all decision thresholds.\n",
    "\n",
    "A quick reminder of the definitions:\n",
    "* True Positive Rate (TPR) = Recall\n",
    "* False Positive Rate (FPR) = Proportion of negatives wrongly classified as positives\n",
    "\n",
    "What we display in this plot is:\n",
    "* The x-axis is False Positive Rate\n",
    "* The y-axis is True Positive Rate\n",
    "\n",
    "The curve shows how TPR and FPR change as the threshold varies.\n",
    "\n",
    "It's important to note that:\n",
    "* A model with no skill will produce a diagonal line (AUC = 0.5)\n",
    "* A model with perfect discrimination will hug the top-left corner (AUC = 1.0)\n",
    "\n",
    "The Area Under the Curve (ROC AUC) gives a single performance score:\n",
    "* Closer to 1 means better at ranking positive cases higher than negative ones\n",
    "\n",
    "**Important!**\n",
    "\n",
    "While useful, the ROC curve can look overly optimistic when the dataset is imbalanced: FPR is computed over the negative class (around 99% of our cases), so even a large number of false positives barely moves it. That is why we MUST also check the Precision-Recall curve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "6790d41d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Figure size 600x500 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# PR Curve\n",
    "precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)\n",
    "pr_auc = average_precision_score(y_test, y_pred_proba)\n",
    "\n",
    "plt.figure(figsize=(6, 5))\n",
    "plt.plot(recall, precision, color='green', lw=2, label=f'PR curve (AUC = {pr_auc:.4f})')\n",
    "plt.xlabel('Recall')\n",
    "plt.ylabel('Precision')\n",
    "plt.title('Precision-Recall Curve')\n",
    "plt.legend(loc='lower left')\n",
    "plt.grid(True)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Interpreting the Precision-Recall (PR) Curve\n",
    "\n",
    "The **Precision-Recall (PR) curve** helps evaluate model performance, especially on imbalanced datasets like ours (where positive cases are rare).\n",
    "\n",
    "A quick reminder of the definitions:\n",
    "* Precision = How many of the predicted positives are actually positive\n",
    "* Recall = How many of the actual positives the model correctly identifies\n",
    "\n",
    "What we display in this plot is:\n",
    "* The x-axis is Recall\n",
    "* The y-axis is Precision\n",
    "\n",
    "The curve shows the trade-off between them at different model thresholds.\n",
    "\n",
    "In imbalanced datasets, accuracy can be misleading. The PR curve focuses only on the positive class, making it much more meaningful:\n",
    "* A higher curve means better performance\n",
    "* The area under the curve (PR AUC, computed here as average precision) summarizes this: closer to 1 is better\n",
    "* A no-skill model scores roughly the positive rate (around 0.01 in our case), so that is the baseline to compare against, not 0.5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute confusion matrix: [ [TN, FP], [FN, TP] ]\n",
    "tn, fp, fn, tp = confusion_matrix(y_test, y_pred_opt).ravel()\n",
    "\n",
    "# Total predictions\n",
    "total = tp + tn + fp + fn\n",
    "\n",
    "# Compute all requested metrics at the chosen threshold\n",
    "# Note: recall, precision and fpr are scalars here and overwrite the curve arrays from the ROC and PR cells\n",
    "recall = recall_score(y_test, y_pred_opt)\n",
    "precision = precision_score(y_test, y_pred_opt)\n",
    "f1 = fbeta_score(y_test, y_pred_opt, beta=1)\n",
    "f2 = fbeta_score(y_test, y_pred_opt, beta=2)\n",
    "f3 = fbeta_score(y_test, y_pred_opt, beta=3)\n",
    "fpr = fp / (fp + tn) if (fp + tn) != 0 else 0\n",
    "\n",
    "# Scores relative to total\n",
    "tp_score = tp / total\n",
    "tn_score = tn / total\n",
    "fp_score = fp / total\n",
    "fn_score = fn / total\n",
    "\n",
    "# Create DataFrame\n",
    "summary_df = pd.DataFrame([{\n",
    "    \"flagging_analysis_type\": \"RISK_VS_CLAIM\",\n",
    "    \"count_total\": total,\n",
    "    \"count_true_positive\": tp,\n",
    "    \"count_true_negative\": tn,\n",
    "    \"count_false_positive\": fp,\n",
    "    \"count_false_negative\": fn,\n",
    "    \"true_positive_score\": tp_score,\n",
    "    \"true_negative_score\": tn_score,\n",
    "    \"false_positive_score\": fp_score,\n",
    "    \"false_negative_score\": fn_score,\n",
    "    \"recall_score\": recall,\n",
    "    \"precision_score\": precision,\n",
    "    \"false_positive_rate_score\": fpr,\n",
    "    \"f1_score\": f1,\n",
    "    \"f2_score\": f2,\n",
    "    \"f3_score\": f3,\n",
    "    \"roc_auc_score\": roc_auc,\n",
    "    \"pr_auc_score\": pr_auc\n",
    "}])"
   ]
  },
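  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_confusion_matrix_from_df(df, flagging_analysis_type, name_of_the_experiment=\"\"):\n",
    "    \"\"\"Plot a 2x2 confusion-matrix heatmap, annotated with counts and percentages, for one row of the summary DataFrame.\"\"\"\n",
    "\n",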
    "    # Subset - just retrieve one row depending on the flagging_analysis_type\n",
    "    row = df[df['flagging_analysis_type'] == flagging_analysis_type].iloc[0]\n",
    "\n",
    "    # Define custom x-axis labels and wording\n",
    "    if flagging_analysis_type == 'RISK_VS_CLAIM':\n",
    "        x_labels = ['With Submitted Claim', 'Without Submitted Claim']\n",
    "        outcome_label = \"submitted claim\"\n",
    "    elif flagging_analysis_type == 'RISK_VS_SUBMITTED_PAYOUT':\n",
    "        x_labels = ['With Submitted Payout', 'Without Submitted Payout']\n",
    "        outcome_label = \"submitted payout\"\n",
    "    else:\n",
    "        x_labels = ['Actual Positive', 'Actual Negative']  \n",
    "        outcome_label = \"outcome\"\n",
    "\n",
    "    # Confusion matrix structure\n",
    "    cm = np.array([\n",
    "        [row['count_true_positive'], row['count_false_positive']],\n",
    "        [row['count_false_negative'], row['count_true_negative']]\n",
    "    ])\n",
    "\n",
    "    # Create annotations for the confusion matrix\n",
    "    labels = [['True Positives', 'False Positives'], ['False Negatives', 'True Negatives']]\n",
    "    counts = [[f\"{v:,}\" for v in [row['count_true_positive'], row['count_false_positive']]],\n",
    "              [f\"{v:,}\" for v in [row['count_false_negative'], row['count_true_negative']]]]\n",
    "    percentages = [[f\"{round(100*v,2):,}\" for v in [row['true_positive_score'], row['false_positive_score']]],\n",
    "                   [f\"{round(100*v,2):,}\" for v in [row['false_negative_score'], row['true_negative_score']]]]\n",
    "    annot = [[f\"{labels[i][j]}\\n{counts[i][j]} ({percentages[i][j]}%)\" for j in range(2)] for i in range(2)]\n",
    "\n",
    "    # Scores formatted as percentages\n",
    "    recall = row['recall_score'] * 100\n",
    "    precision = row['precision_score'] * 100\n",
    "    f1 = row['f1_score'] * 100\n",
    "    f2 = row['f2_score'] * 100\n",
    "    f3 = row['f3_score'] * 100\n",
    "    roc_auc = row['roc_auc_score'] * 100\n",
    "    pr_auc = row['pr_auc_score'] * 100\n",
    "\n",
    "    # Set up figure and axes manually for precise control\n",
    "    fig = plt.figure(figsize=(9, 8))\n",
    "    grid = fig.add_gridspec(nrows=3, height_ratios=[1, 15, 2])\n",
    "\n",
    "    \n",
    "    ax_main_title = fig.add_subplot(grid[0])\n",
    "    ax_main_title.axis('off')\n",
    "    ax_main_title.set_title(f\"{name_of_the_experiment} - Flagged as Risk vs. {outcome_label.title()}\", fontsize=14, weight='bold')\n",
    "\n",
    "    # Heatmap\n",
    "    ax_heatmap = fig.add_subplot(grid[1])\n",
    "    ax_heatmap.set_title(f\"Confusion Matrix – Risk vs. {outcome_label.title()}\", fontsize=12, weight='bold', ha='center', va='center', wrap=False)\n",
    "\n",
    "    cmap = sns.light_palette(\"#A73A52\", as_cmap=True)\n",
    "\n",
    "    sns.heatmap(cm, annot=annot, fmt='', cmap=cmap, cbar=False,\n",
    "                xticklabels=x_labels,\n",
    "                yticklabels=['Flagged as Risk', 'Flagged as No Risk'],\n",
    "                ax=ax_heatmap,\n",
    "                linewidths=1.0,\n",
    "                annot_kws={'fontsize': 10, 'linespacing': 1.2})\n",
    "    ax_heatmap.set_xlabel(\"Resolution Outcome (Actual)\", fontsize=11, labelpad=10)\n",
    "    ax_heatmap.set_ylabel(\"Flagging (Prediction)\", fontsize=11, labelpad=10)\n",
    "    \n",
    "    # Make borders visible\n",
    "    for _, spine in ax_heatmap.spines.items():\n",
    "        spine.set_visible(True)\n",
    "\n",
    "    # Footer with metrics and date\n",
    "    ax_footer = fig.add_subplot(grid[2])\n",
    "    ax_footer.axis('off')\n",
    "    metrics_text = f\"Total Booking Count: {row['count_total']}  |  Recall: {recall:.2f}%  |  Precision: {precision:.2f}%  |  F1 Score: {f1:.2f}%  |  F2 Score: {f2:.2f}%  |  ROC AUC: {roc_auc:.2f}%  |  PR AUC: {pr_auc:.2f}%\"\n",
    "    date_text = f\"Generated on {date.today().strftime('%B %d, %Y')}\"\n",
    "    ax_footer.text(0.5, 0.7, metrics_text, ha='center', fontsize=9)\n",
    "    ax_footer.text(0.5, 0.1, date_text, ha='center', fontsize=8, color='gray')\n",
    "\n",
    "    plt.tight_layout()\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAA5wAAAMVCAYAAAAbDfvBAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3QUVRvH8d+mB1JIAqGTAIHQe++9FwUBAZUigoIgTbFRBREQEEEElSYiVXoHBQEpgiBIb6H3UEIvybx/5M2YJZWEZYl8P+fsYffOnZlnZyfLPHvv3GsxDMMQAAAAAABPmYO9AwAAAAAA/DeRcAIAAAAAbIKEEwAAAABgEyScAAAAAACbIOEEAAAAANgECScAAAAAwCZIOAEAAAAANkHCCQAAAACwCRJOAAAAAIBNkHACSJYTJ07IYrGYj/Xr19s7pBRh6tSpVsftRbd+/Xqr43HixAl7h2RTAwYMMN9rYGBgkrfDeWQbT+vzSa6U+vk+67jbtm1r7qtKlSrJ3l5gYKC5vQEDBiR7e8CLjoQTL6yLFy/qs88+U+XKlZU+fXq5uLgoderUyp8/v958802tWLFChmHYJbbn5eKbZDJxol+cxveYOnWqvUPFU/T4RXXUw8nJSX5+fipTpowGDx6sGzdu2DvU/4xZs2apdu3aSp8+vZydneXt7a3s2bOrSpUqeu+997Rq1Sp7h/jMJDapS4nfQY8ePdKsWbPUvHlz5ciRQx4eHnJxcVGWLFlUv359jRs3TteuXbN3mAASycneAQD2MH78ePXq1Uv37t2zKn/48KH279+v/fv3a/LkyQoJCbHrr9sAUp7w8HBdvXpV27Zt07Zt2zRjxgz9+eef8vT0NOvUqlVLHh4ekiRvb297hZqivPHGG5o+fbpVWVhYmMLCwnTixAn9/vvvOnnypGrXrm2nCJ++kiVLasSIEfYO45nau3evWrRoof3798dYdvbsWZ09e1bLly/XlStXbNb6+Mknn5g/FJUrV84m+wBeJCSceOEMHz5cffr0MV87Ojqqfv36Kl68uCwWi44ePapVq1bp4sWLdowSKdnHH38sHx+fGOUlS5a0QzR4Vt5++23lzJlToaGhmjVrltkz4eDBg5oyZYq6detm1i1XrhwXsk9g5cqVVslm8eLFVbt2bXl4eOjy5cvauXOntmzZYscIbSN//vzKnz+/vcN4Zg4ePKjKlSvr6tWrZlmBAgVUp04d+fr66tKlS9q4caP++usvm8bx1ltv2XT7wAvHAF4g+/btMxwdHQ1JhiTD39/f2LlzZ4x6Dx48ML777jvj4sWLVuVnzpwxevfubRQoUMBInTq14erqagQEBBitW7c2tm3bFmM7/fv3N/cVEBBgXL9+3ejdu7eRLVs2w9nZ2ciePbsxZMgQIyIiwlwnqn5cjzZt2hiGYRgPHz40Pv30U6Nu3bpGjhw5DG9vb8PJycnw9fU1KlSoYHz99dfGgwcPYj0Op0+fNj744AOjSJEihqenp+Hq6mpkzZrVaNy4sbF69WrDMAwjICAg3jgqV65sGIZhhISEWJWvW7cuxv4WL15sNGrUyMiQIYPh7OxspEmTxqhatarx008/Wb33KBs2bDBeeuklI1OmTIazs7OROnVqIyAgwKhTp47Rv39/4/r162bdW7duGQMHDjSKFi1qeHh4GE5OTka6dOmMwoULGx06dDBWrFgR6zF4mqJ/zpKMkJCQBNeZMmWK1TrRrVu3zmjfvr1RtGhRI0OGDIaLi4vh7u5u5MyZ02jbtq2xZ8+eWLd54sQJo2XLloavr6+ROnVqo2LFisavv/4a774MwzD27NljNGjQwPD09DQ8PT2NOnXqGLt27Ypx/j7uxo0bxueff26UKlXK8PLyMpydnY2sWbMabdq0Mfbu3RtrjFeuXDE6depk+Pv7G25ubkbx4sWNWbNmGevWrXviY2gYhjF//nzjtddeMwoWLGj4+/ub50vevHmN
Ll26xLqdy5cvG7169TLy5ctnpEqVynB2djbSp09vlCxZ0ujSpYuxZcuWRO378eMa/dw/cOCA1bJOnTpZrRvfsT1x4oTRsWNHIygoyHBzczNcXV2NTJkyGeXKlTN69Ohh7N+/P84YonvvvffMcgcHB2PSpElxvpfw8HAjW7ZsZv3+/fvHqPPBBx+Yy3PlymWW79mzx2jdurUREBBguLi4GG5ubkbWrFmNqlWrGh9++KFx5syZRBzN+PXo0cPcd1BQkPHo0aMYdW7cuGFs2rTJqiy+4xzfd9fj64WFhRk9e/Y0smTJYri6uhp58+Y1xo4dG+P7q02bNlbfkYcOHTJeeuklw8vLy/Dx8TFatmxpXLhwwTAMw1i7dq1RoUIFw93d3UibNq3Rvn174+rVq1bbi+3zfTzu2B79+/c3KleuHG+dx4/HhQsXjI8++sgoXLiw4eHhYbi6uho5c+Y0OnfubJw8eTLWz+XEiRPGq6++avj4+BipUqUyKlasaKxZsybB75y4lC1b1mq9zz//PNb/I3bs2GEsWrQozuMe3aRJk4xmzZoZefLkMfz8/AwnJyfD09PTKFy4sPHBBx8Yly9fjrH96P//Rf9bePx76uDBg0a/fv2MbNmyGe7u7kbJkiXN/3MuXbpktG/f3kibNq3h5uZmlC9f3tiwYUOijwXwX0LCiRfK22+/bfWfxS+//JLodX///XfDx8cnzv+8HRwcjJEjR1qtE/2ixc/Pz8ibN2+s6/bt29dcJ7EJ582bNxOsW6NGjRgXZsuWLTM8PT3jXOe9994zDOPpJJzh4eHG66+/Hu92mjVrZhXj2rVrrX4UiO1x4MABs36VKlXirduiRYtEf8ZJ9bQTzl69esX7nlxcXIw1a9ZYrRMSEmJkyJAh1vOyfv36ce5r+/bthoeHR4z13NzcjJo1a8Z5cXr48GEjMDAwzhhdXV2NOXPmWK1z7do1I0+ePLHWfzzGxCacTZs2jfdYeXl5WSXod+/eNYKDg+Ndp0+fPonad3wJZ1hYmNWyTz75xGrduBKhixcvGunSpYs3vm+//TbOGKK8//77Zpmjo6MxY8aMBN9P3759zXVy585ttSwiIsIqIf38888Nw4j8ES9VqlTxxvs0fvTp2rWrub20adMaR48eTdR6TyPhTJ8+vVGiRIlY31vXrl2tthk98cmePXus/2cEBwcbP/74o+Hg4BBjWaVKlay296wSzs2bNxtp06aNs663t3eMZCmu7xyLxWLUq1cvzu+cuGzdutVqnYYNGyZqvceP++MJZ/HixeM9DpkzZzbOnj1rtU5iE87Ytu3g4GDMmjXLyJ49e4xlrq6uVj8YAS8KutTihfLrr7+az318fPTSSy8lar3r16+rSZMm5iAF7u7uateunby8vDRz5kydPHlSERER6t27t4oXL67KlSvH2EZoaKiuXbumN954Q5kyZdIPP/ygK1euSJLGjBmjTz/9VC4uLhoxYoSOHTumCRMmmOtG76JZoEABSZEDQeTIkUNlypRR5syZ5ePjo4cPH+rgwYOaO3euHj16pLVr1+qXX35R8+bNJUknT55Us2bNdOfOHXMbjRo1UpEiRXT58mX99ttv5j4/+eQTnThxQp9//rlZFtVlUJKyZs2a4HEbPny42Q3OYrGoadOmKly4sEJCQjR9+nQ9fPhQc+fOVZEiRfTxxx9Lkr777juFh4dLkvLkyaNmzZrJyclJp06d0t9//62dO3ea2z9w4IA5kJGDg4PeeOMN5c6dW1euXFFISIjdBjn6/vvvY+1S27t370Stnzp1alWuXFkFCxaUr6+v3N3dFRoaqmXLlunAgQN68OCBunXrZnWP07vvvqsLFy6Yr+vVq6fixYtr2bJlWrZsWaz7MQxD7du3161bt8yyli1bKkeOHJozZ47WrFkT63rh4eF6+eWXzS6j6dKlU6tWreTr66tVq1Zp8+bNun//vt544w0VL15cOXLkkCR9+umnOnjwoLmdypUrq3Ll
yvrjjz/ijDEhadKkUa1atZQ3b175+PjIxcVFFy9e1IIFC3Tq1CmFhYWpT58+Wr58uSRp3bp1OnTokCTJzc1Nb775pjJnzqw
      "text/plain": [
       "<Figure size 900x800 with 3 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Plot confusion matrix for claim scenario\n",
    "plot_confusion_matrix_from_df(summary_df, 'RISK_VS_CLAIM', 'Contactless')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Importance\n",
    "Understanding what drives the prediction is useful for future experiments and business knowledge. Here we track both the native feature importances of the trees, as well as a more heavy SHAP values analysis.\n",
    "\n",
    "Important! Be aware that SHAP analysis might take quite a bit of time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "d66ffe2c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAxkAAAMWCAYAAACdtUsqAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdeVyN6f/48deJ9lVISSQqZcqeJRQyWccyI4yR7MYYDLL8ZoxkN7IvYzBlN8Y2liwxasiWpcaSkCUzE2bsMUPq/P7o2/1xFJ1yyPJ+Ph7n8ejcy3W/r6v71P2+r+u6j0qtVqsRQgghhBBCCB3RK+wAhBBCCCGEEO8WSTKEEEIIIYQQOiVJhhBCCCGEEEKnJMkQQgghhBBC6JQkGUIIIYQQQgidkiRDCCGEEEIIoVOSZAghhBBCCCF0SpIMIYQQQgghhE5JkiGEEEIIIYTQKUkyhBBCCCGEEDolSYYQQgghAIiIiEClUuX6Gjly5Cs55oEDBwgJCeHOnTuvpPyXkd0eR48eLexQCmz+/PlEREQUdhjiPVS0sAMQQgghxJslNDSU8uXLayz74IMPXsmxDhw4wNixYwkKCsLKyuqVHON9Nn/+fEqUKEFQUFBhhyLeM5JkCCGEEEJD8+bNqVmzZmGH8VIePHiAqalpYYdRaB4+fIiJiUlhhyHeYzJcSgghhBD5sn37dho0aICpqSnm5ua0bNmS06dPa2zz+++/ExQUhJOTE0ZGRtja2tKjRw9u3rypbBMSEkJwcDAA5cuXV4ZmXb58mcuXL6NSqXId6qNSqQgJCdEoR6VScebMGT799FOKFStG/fr1lfUrVqygRo0aGBsbY21tTadOnbh69WqB6h4UFISZmRkpKSm0atUKMzMz7O3tmTdvHgAnT56kcePGmJqaUq5cOVatWqWxf/YQrN9++42+fftSvHhxLCwsCAwM5Pbt2zmON3/+fCpXroyhoSGlS5fmiy++yDG0zNfXlw8++IBjx47RsGFDTExM+H//7//h6OjI6dOniYmJUdrW19cXgFu3bjFs2DA8PDwwMzPDwsKC5s2bk5CQoFF2dHQ0KpWKtWvXMmHCBMqUKYORkRFNmjThwoULOeI9fPgwLVq0oFixYpiamuLp6cmsWbM0tjl79iyffPIJ1tbWGBkZUbNmTTZv3qyxTXp6OmPHjsXZ2RkjIyOKFy9O/fr1iYqK0ur3JAqf9GQIIYQQQsPdu3f5559/NJaVKFECgOXLl9OtWzf8/f2ZMmUKDx8+ZMGCBdSvX58TJ07g6OgIQFRUFBcvXqR79+7Y2tpy+vRpfvjhB06fPs2hQ4dQqVS0b9+ec+fOsXr1ambMmKEco2TJkvz999/5jrtDhw44OzszceJE1Go1ABMmTGD06NEEBATQq1cv/v77b+bMmUPDhg05ceJEgYZoZWRk0Lx5cxo2bMjUqVNZuXIlAwYMwNTUlK+//pouXbrQvn17vv/+ewIDA6lbt26O4WcDBgzAysqKkJAQkpKSWLBgAVeuXFEu6iEreRo7dix+fn58/vnnynZxcXHExsair6+vlHfz5k2aN29Op06d+OyzzyhVqhS+vr58+eWXmJmZ8fXXXwNQqlQpAC5evMimTZvo0KED5cuX5/r16yxcuBAfHx/OnDlD6dKlNeKdPHkyenp6DBs2jLt37zJ16lS6dOnC4cOHlW2ioqJo1aoVdnZ2DBo0CFtbWxITE9m6dSuDBg0C4PTp03h7e2Nvb8/IkSMxNTVl7dq1tG3blvXr19OuXTul7pMmTaJXr154eXlx7949jh49yvHjx2natGm+f2eiEKiFEEIIIdRqdXh4uBrI9aVWq9X3799XW1lZqXv37q2x37Vr19SWlpYayx8+fJij/NWrV6sB9W+//aYs++6779SA+tKlSxrbXrp0SQ2ow8PDc5QDqMeMGaO8HzNmjBpQd+7cWWO7y5cvq4sUKaKeMGGCxvKTJ0+qixYtmmP5
89ojLi5OWdatWzc1oJ44caKy7Pbt22pjY2O1SqVSr1mzRll+9uzZHLFml1mjRg3148ePleVTp05VA+pffvlFrVar1Tdu3FAbGBioP/zwQ3VGRoay3dy5c9WA+scff1SW+fj4qAH1999/n6MOlStXVvv4+ORY/t9//2mUq1ZntbmhoaE6NDRUWbZ37141oHZzc1M/evRIWT5r1iw1oD558qRarVarnzx5oi5fvry6XLly6tu3b2uUm5mZqfzcpEkTtYeHh/q///7TWF+vXj21s7OzsqxKlSrqli1b5ohbvD1kuJQQQgghNMybN4+oqCiNF2Tdqb5z5w6dO3fmn3/+UV5FihShdu3a7N27VynD2NhY+fm///7jn3/+oU6dOgAcP378lcTdr18/jfcbNmwgMzOTgIAAjXhtbW1xdnbWiDe/evXqpfxsZWWFq6srpqamBAQEKMtdXV2xsrLi4sWLOfbv06ePRk/E559/TtGiRYmMjARg9+7dPH78mMGDB6On97/Ltd69e2NhYcG2bds0yjM0NKR79+5ax29oaKiUm5GRwc2bNzEzM8PV1TXX30/37t0xMDBQ3jdo0ABAqduJEye4dOkSgwcPztE7lN0zc+vWLX799VcCAgK4f/++8vu4efMm/v7+nD9/nj///BPIatPTp09z/vx5resk3iwyXEoIIYQQGry8vHKd+J19wde4ceNc97OwsFB+vnXrFmPHjmXNmjXcuHFDY7u7d+/qMNr/eXZI0vnz51Gr1Tg7O+e6/dMX+flhZGREyZIlNZZZWlpSpkwZ5YL66eW5zbV4NiYzMzPs7Oy4fPkyAFeuXAGyEpWnGRgY4OTkpKzPZm9vr5EE5CUzM5NZs2Yxf/58Ll26REZGhrKuePHiObYvW7asxvtixYoBKHVLTk4GXvwUsgsXLqBWqxk9ejSjR4/OdZsbN25gb29PaGgobdq0wcXFhQ8++IBmzZrRtWtXPD09ta6jKFySZAghhBBCK5mZmUDWvAxbW9sc64sW/d9lRUBAAAcOHCA4OJiqVatiZmZGZmYmzZo1U8p5kWcv1rM9fTH8rKd7T7LjValUbN++nSJFiuTY3szMLM84cpNbWS9arv6/+SGv0rN1z8vEiRMZPXo0PXr0YNy4cVhbW6Onp8fgwYNz/f3oom7Z5Q4bNgx/f/9ct6lYsSIADRs2JDk5mV9++YVdu3axePFiZsyYwffff6/RiyTeXJJkCCGEEEIrFSpUAMDGxgY/P7/nbnf79m327NnD2LFj+fbbb5XluQ19eV4ykX2n/NknKT17Bz+veNVqNeXLl8fFxUXr/V6H8+fP06hRI+V9WloaqamptGjRAoBy5coBkJSUhJOTk7Ld48ePuXTp0gvb/2nPa99169bRqFEjlixZorH8zp07ygT8/Mg+N06dOvXc2LLroa+vr1X81tbWdO/ene7du5OWlkbDhg0JCQmRJOMtIXMyhBBCCKEVf39/LCwsmDhxIunp6TnWZz8RKvuu97N3uWfOnJljn+zvsng2mbCwsKBEiRL89ttvGsvnz5+vdbzt27enSJEijB07NkcsarVa43G6r9sPP/yg0YYLFizgyZMnNG/eHAA/Pz8MDAyYPXu2RuxLlizh7t27tGzZUqvjmJqa5vpt6kWKFMnRJj///LMyJyK/qlevTvny5Zk5c2aO42Ufx8bGBl9fXxYuXEhqamqOMp5+otizvxszMzMqVqzIo0ePChSfeP2kJ0MIIYQQWrGwsGDBggV07dqV6tWr06lTJ0qWLElKSgrbtm3D29ubuXPnYmFhoTzeNT09HXt7e3bt2sWlS5dylFmjRg0Avv76azp16oS+vj6tW7fG1NSUXr16MXnyZHr16kXNmjX57bffOHfunNbxVqhQgfHjxzNq1CguX75M27ZtMTc359KlS2zcuJE+ffowbNgwnbVPfjx+/JgmTZoQEBBAUlIS8+fPp379+nz00UdA1mN8R40axdixY2nWrBkfffSRsl2t
WrX47LPPtDpOjRo1WLBgAePHj6dixYrY2NjQuHFjWrVqRWhoKN27d6devXqcPHmSlStXavSa5Ieenh4LFiygdevWVK1ale7
      "text/plain": [
       "<Figure size 800x800 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "## BUILT-IN\n",
    "\n",
    "# Get feature importances from the model\n",
    "importances = best_pipeline.named_steps['model'].feature_importances_\n",
    "features = X.columns\n",
    "\n",
    "# Create a Series and sort\n",
    "feat_series = pd.Series(importances, index=features).sort_values(ascending=True)  # ascending=True for horizontal plot\n",
    "\n",
    "# Plot Feature Importances\n",
    "plt.figure(figsize=(8, 8))\n",
    "feat_series.plot(kind='barh', color='skyblue')\n",
    "plt.title('Feature Importances')\n",
    "plt.xlabel('Importance')\n",
    "plt.grid(axis='x')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Interpreting the Feature Importance Plot\n",
    "The **feature importance plot** shows how much each feature contributes to the model’s overall decision-making.\n",
    "\n",
    "For tree-based models like Random Forest, importance is based on how often and how effectively a feature is used to split the data across all trees.\n",
    "A higher score means the feature plays a bigger role in improving prediction accuracy.\n",
    "\n",
    "In the graph you will see that:\n",
    "* Features are ranked from most to least important.\n",
    "* The values are relative and model-specific — not directly interpretable as weights or probabilities.\n",
    "\n",
    "This helps us identify which features the model relies on most when making predictions.\n",
    "\n",
    "**Important!**\n",
    "Unlike SHAP values, native importance doesn't show how a feature affects predictions — only how useful it is to the model overall. For deeper interpretability (e.g., direction and context), SHAP is better (but it takes more time to run)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "e2197cea",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "PermutationExplainer explainer: 6417it [45:34,  2.34it/s]                           \n"
     ]
    }
   ],
   "source": [
    "## SHAP VALUES\n",
    "\n",
    "# SHAP requires that all features passed to Explainer be numeric (floats/ints)\n",
    "X_test_shap = X_test.copy()\n",
    "X_test_shap = X_test_shap.astype(float)\n",
    "\n",
    "# Function that returns the probability of the positive class\n",
    "def model_predict(data):\n",
    "    return best_pipeline.predict_proba(data)[:, 1]\n",
    "\n",
    "# Ensure input to SHAP is numeric\n",
    "X_test_shap = X_test.astype(float)\n",
    "\n",
    "# Create SHAP explainer\n",
    "explainer = shap.Explainer(model_predict, X_test_shap)\n",
    "\n",
    "# Compute SHAP values\n",
    "shap_values = explainer(X_test_shap)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "9cae1a51",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_881/3711913411.py:2: FutureWarning: The NumPy global RNG was seeded by calling `np.random.seed`. In a future version this function will no longer use the global RNG. Pass `rng` explicitly to opt-in to the new behaviour and silence this warning.\n",
      "  shap.summary_plot(shap_values.values, X_test_shap)\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAzoAAAOsCAYAAACCjsPqAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3gU1dfA8e9sS28kEEJCr1L8AQbpRbr0LoKCoAjSbKDYQfFVFBGpIiIdCdXQFdHQlC4WpPcEAiQhvW2Z949ll2w2gSQkNM/nefLATu7M3Jmd3dwz99w7iqqqKkIIIYQQQgjxENHc6woIIYQQQgghRGGTQEcIIYQQQgjx0JFARwghhBBCCPHQkUBHCCGEEEII8dCRQEcIIYQQQgjx0JFARwghhBBCCPHQkUBHCCGEEEII8dCRQEcIIYQQQgjx0JFARwghhBBCCPHQkUBHCCGEEEKIh9z48ePx9PS87e/OnTuHoiisWrUqX9sv6HpFSXevKyCEEEIIIYS4PwQFBfH7779TpUqVe12VOyaBjhBCCCGEEAIAFxcXGjRocK+rUSgkdU0IIYQQQggB5JyClpmZyejRoylWrBi+vr4MHTqUZcuWoSgK586dc1g/PT2dkSNH4ufnR1BQEGPGjMFkMt3lo7CSQEcIIYQQQoj/CJPJ5PRjsVhuuc64ceOYM2cOb775JmFhYVgsFsaNG5dj2XfeeQeNRsOKFSsYNmwYX3zxBd9++21RHMptSeqaEEIIIYQQ/wEpKSno9focf+fh4ZHj8ri4OGbPns27777Lm2++CUC7du1o3bo1Fy9edCpfv359pk2bBkCbNm349ddfWbVqFcOGDSuko8g7CXSEEEIIIQRGo5H58+cDMGjQoFwbxOIeUHrkvay6Jtdfubm5sWPHDqfl33zzDcuWLctxnb///pv09HS6dOnisLxr165s27bNqXzbtm0dXlevXp1ffvklLzUvdBLoCCGEEEII8R+g0WgIDQ11Wr5hw4Zc17l8+TIAxYsXd1heokSJHMv7+vo6vDYYDKSnp+ezpoVDxugIIYQQQgghchQUFATAtWvXHJZfvXr1XlQnXyTQEUIIIYQQ4r6m5OOncNWsWRNXV1fCw8Mdlv/www+Fvq/CJqlrQgghhBBCiBz5+/vz0ksv8fHHH+Pq6krt2rVZuXIlJ06cAKzpcPer+7dmQgghhBBCiHvu008/5cUXX+STTz6hd+/eGI1G+/TSPj4+97h2uVNUVVXvdSWEEEIIIcS9JbOu3ceUnnkvq64uunpk8eyzz7Jr1y7Onj17V/ZXEJK6JoQQQgghxH2t8Mfe5Mf27dvZvXs3jz32GBaLhQ0bNrB06VKmTJlyT+t1OxLoCCGEEEIIIXLl6enJhg0bmDRpEmlpaZQvX54pU6bwyiuv3Ouq3ZIEOkIIIYQQQohcPfbYY/z222/3uhr5JoGOEEIIIYQQ97V7m7r2oJJZ14QQQgghhBAPHQl0hBBCCCGEEA8dCXSEEEIIIYQQDx0ZoyOEEEIIIcR9TcboFIT06AghhBBCCCEeOhLoCCGEEEIIIR46EugIIYQQQgghHjoS6AghhBBCCCEeOhLoCCGEEEIIIR46MuuaEEIIIYQQ9zWZda0gpEdHCCGEEEII8dCRQEcIIYQQQgjx0JFARwghhBBCFJ7YJEjPdFz293nY/o/zciGKkIzREUIIIYQQd+5MNPzvNUhOv3W54e1h5ot3p04PDRmjUxDSoyOEEEIIIe5MQgpUHH77IAdg1hb48Y+ir5P4z5NARwghhBDiIWCyqNRfbEI32YT3Vya+/9d893beYWL+yr/8bdHUQ4gsJHVNCCGEEOIh4DHVTKbF+v8kI/TbpFLK00zzMtqi3fGPf8Bvx/O3zvHLcC0BivsUTZ0eOpK6VhDSoyOEEEII8YDbdNJkD3Ky6hmuFu2Odx2F9h8VbN2G4wq3LkJkI4GOEEIIIcQDyKKqTD1gImC6iY7h
OZeJzSiCHadnwtV4mLAcmr5T8O2cvgKvfVdo1RIiO0ldE0IIIYR4AHVcbWbLuduXm3bQxOjHCqHJZzZDmRfh0vU735bNlxsgxB9e61p423woSepaQUiPjhBCCCHEAyY5w5KnIAfg5V8LaaflhhVukGPz+kJ4dV7hb1f850mgI4QQQgjxgHhuk3VWNa/pOQzIuYVz8fkrn6PI2DvfRm6mboSd/xbd9sV/kqSuCSGEEELchz7fa+KD3yDTDOV94EwCFDRceXeHhSVd7uD+9vA5BV83r5q9C49Xgr2fFf2+xH+C9Og8JC5dukRoaChz5tyFL6K7YOXKlfTs2ZOGDRsSGhrKpUuX7nWV8uXAgQOEhoayfv16+7KifI/mzJlT6OcpNDSU8ePHF9r27rYHvf5C5Nf69esJDQ3lwIED97oqohAETjfxxk5IM4MZOHUHQQ7AxrN3WKHZP97hBvJo3yl4Ycbd2dcDRcnHj7CRHh1x3zlw4ACTJk2iefPmDBw4EJ1Oh5+f372ulngIzJkzh6pVq9KiRYt7XRVxj1y6dIn169fTokULqlateq+rI4STt3eY+GRf4W833ngHK2feycoFMO8XOBoFuz+5u/sVDx0JdMR9Z+/evQC8//77+Pg8PA8SCwoKYvfu3Wi1RfzgNgGQ47meO3cunTp1kkDnP+zSpUvMnTuXUqVKPXSBTocOHWjbti16vf5eV0XkweVklV8vWHDVqSw8AmcT4EjMnfXa3E7bFSZ+6lOApp93v8KvzO38dhzafAAb3gUXuaZFwUigIwrEZDJhNptxcXEp9G3HxMQAPFRBDoCiKEVyvkTO5FyLB1l6ejo6nQ6dLu9/prVardxIuU9dT1fZddHCoSsq607DH9egiB/jmaOtF8BkUdFp8pHeVGwAZJiLrlK38vPf4PoUtKgBvxbwoaQPDUlJKwgJdPJh/fr1TJgwgdmzZ3Ps2DFWrVrF1atXCQoKYvDgwXTq1Amw3jHs0qULQ4YMYejQoQ7bmDNnDnPnzmXdunWUKlUKgPHjx7NhwwZ+/vlnpk6dys6dOzEajdSrV4+33nqLgIAA1qxZw7Jly7h06RJBQUGMGjUq17vSW7ZsYcGCBVy4cAE/Pz+6dOnC888/7/QHMyYmhrlz57Jr1y5iY2Px9fWladOmvPTSSxQrVsypzmFhYYSHh/Pzzz8TExPDrFmzCA0NzfP5i4iIYNGiRZw4cQJFUahcuTIDBgywH4ftvNnYtl23bl2++eabPO3j2rVrLFmyhP3793P58mUyMjIIDg6mY8eOPPvssw6NANv7OXPmTA4fPsz69euJjY2lbNmyDBo0iHbt2jlsu3PnzgQFBfHaa68xdepUjhw5gl6vp2nTprz88ssO5ywnt7oufvrpJ8LCwjh58iRms5lKlSrx7LPP0rp1a4dyFouFhQsXsnbtWmJiYggJCWHQoEF5Oje5OX36NFOnTuWPP/7AYDDQqFEjXnvttVzL57WuoaGhdOrUiSeffJLZs2dz8uRJPD09adOmDcOHD8fd3d3p/MyePZu9e/eSlJREiRIlaNu2Lc8//zyurq72cgkJCXz77bfs2LGDa9eu4ebmRlBQEG3btmXAgAFO+x8/frzDtbVhwwY2bNhgL5ef8Qy2a2DMmDFMnTqVv//+G1dXVzp06MCoUaMwm83Mnj2bH3/8kYSEBGrUqMHbb79N+fLl7dtISUlh4cKF7N27l8jISFJTUwkMDKRVq1YMGTLE4VgPHDjAsGHD+OCDD1BVlSVLlnDx4kX8/f3p3bs3AwcOdKjfnj17CA8P599//yUmJga9Xk+NGjUYPHgwjz32mNPxbNu2jW+//Zbz58/j5+dH165d+d///seIESP44IMP6Ny5s71sZmYmS5YsYcuWLURGRmIwGKhTpw5Dhw6lWrVqOdY5PT2d77//nujoaEqXLs3IkSNp2rQpp06d4quvvuKvv/5Cp9PRvn17Xn31VafvqAsXLjB37lz27dtHQkICxYsX
p3Xr1rz44ou4ubnZy9m+QyMiIpg+fTq//PILKSkpVKtWjddee42aNWsCNz/zABMmTLD/Pz/fMWC9hlasWMGFCxcwmUz4+/t
      "text/plain": [
       "<Figure size 800x950 with 2 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Plot summary\n",
    "shap.summary_plot(shap_values.values, X_test_shap)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Interpreting the SHAP Summary Plot\n",
    "\n",
    "Each point on a row represents a SHAP value for a single prediction (row = feature).\n",
    "The x-axis shows how much the feature contributed to increasing or decreasing the prediction.\n",
    "* Right (positive SHAP value): pushes prediction toward the positive class (i.e., higher chance of incident).\n",
    "* Left (negative SHAP value): pushes prediction toward the negative class (i.e., lower chance of incident).\n",
    "\n",
    "Color shows the actual feature value for that point:\n",
    "* Red = high value\n",
    "* Blue = low value\n",
    "\n",
    "In other words:\n",
    "* The position tells you impact.\n",
    "* The color tells you feature value.\n",
    "* The density (thickness) of dots shows how often a value occurs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAABLwAAAPZCAYAAAAbQTNdAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd1hT1xsH8G+AsJcMARERFREXbhx14N5VsW5rVVptta3VOlpbW7dtHbXaqtVW6x6I+lOrKO4KdW8RJ8pQEUT25vz+SBOJCRCWUfL9PE+elnPPvefNzU2EN+e8VyKEECAiIiIiIiIiIion9LQdABERERERERERUWliwouIiIiIiIiIiMoVJryIiIiIiIiIiKhcYcKLiIiIiIiIiIjKFSa8iIiIiIiIiIioXGHCi4iIiIiIiIiIyhUmvIiIiIiIiIiIqFxhwouIiIiIiIiIiMoVJryIiIiIiIiIiKhcYcKLiIiIiIiIiIjKFSa8iIiIiIiIiIioXGHCi4iIiIiIiIhIA7m5uZgzZw6qV68OqVSK6tWr48cff0StWrWQm5tb5OOtXLkSVapUQUZGRhlEq9skQgih7SCIiIiIiIiIiN50y5cvx2effYZJkyahfv36sLKywsiRI7Fw4UKMHDmyyMdLT09H1apV8fXXX+Ozzz4rg4hV5eTk4Pz587hz5w4yMjJgY2ODpk2bonLlyhofIzY2FufPn8fTp0+RnZ0NS0tLeHp6om7dukUe5/nz57hw4QJiY2ORmpoKAwMDVKhQAV5eXnB1dS3282TCi4iIiIiIiIhIA40bN4adnR0CAwMBAD///DO+++47PH36FMbGxsU65tSpU7Ft2zY8ePAAEomkNMNV68iRI7h//z7q1asHKysr3L59GzExMejVqxccHR0L3T8yMhIHDx6EnZ0dqlWrBqlUisTERAgh0Lx58yKP8+jRI1y/fh0ODg4wNTVFdnY2Hjx4gCdPnqB169bw9PQs1vNkwouIiIiIiIiIqBDp6ekwNzfHzJkzMX36dACAl5cX6tevjw0bNhT7uBcuXECTJk1w5MgRtG/fvrTCVSsmJga7d++Gt7c3vLy8AADZ2dnw9/eHiYkJ3n333QL3z8zMxLZt2+Dg4IBOnTrlm6Ar6Ti5ubnYtWsXsrOzMXDgwGI8U9bwIiIiIiIiIiIq0OjRo2FiYoKcnBx88803kEgkcHJywtWrV9GxY0eV/lFRUTA2NsaoUaOU2oOCgiCVSvHFF18o2ho3bgwbGxvs2bOnzJ/H/fv3IZFIlGZNGRgYwMPDA0+fPkVycnKB+9+9exdpaWlo2rQpJBIJsrKyoG4eVUnH0dPTg5mZGTIzM4v4DF8yKPaeREREREREREQ6YOjQoZBKpVi1ahWWLl0KGxsb3Lt3D99//z0aNWqk0t/Z2Rl+fn74/fff8d1338HV1RW3bt3Ce++9h27dumHRokVK/Rs1aoTTp08XGENubq7GCSAjIyO1s6/i4uJgZWUFQ0NDpfaKFSsqtpubm+d73KioKEilUqSkpODQoUNISEiAgYEB3N3d0aJFCxgYGBR7nKysLOTk5CAzMxPh4eGIiIhA9erVNXq+6jDhRURERERERERUgPbt2+PIkSMwMzPD+PHjoaenh2+//RYA4Obmpnafr776CmvWrMEPP/yA2bNno2fPnqhatSq2bNkCPT3lBXfVqlUrdFnkkydPsG/fPo3iHTx4MCwsLFTaU1NTYWpqqtIub0tJSSnwuAkJCRBC4NChQ/Dw8ECzZs0QHR2NGzduIDMzEx06dCj2OP/++y9CQ0MBABKJBFWrVkWrVq0Keab5Y8KLiIiIiIiIiKgQV69eRZ06dRTJqri4OBgYGOQ7I8rZ2RkffvghVq9ejYsXLyItLQ0nTpyAmZmZSt8KFSogLS0t30QRANja2qJ79+4axWpiYqK2PTs7G/r6+irt8racnJwCj5uVlYXs7Gx4enoqklFubm7Izc1F
aGgomjRpAisrq2KNU69ePbi5uSE1NRX379+HEKLQeArChBcRERERERERUSGuXLmCLl26FGmfL7/8EsuXL8fVq1dx6tQpODs7q+0nr4NV0F0ajYyMULly5SKN/yoDAwO1SSR5m7ok1av7A0CNGjWU2mvUqIHQ0FA8ffoUVlZWxRrH2toa1tbWAICaNWti//79CAwMRJ8+fYp190oWrSciIiIiIiIiKsCLFy8QERGBevXqKdpsbW2RnZ2NpKSkfPebO3cuANnMKhsbm3z7xcfHw9TUNN+ZWYAsWZSamqrRIzc3V+0xTE1NkZqaqtIub1M3++zV/QHVGWTynzMyMkplHEC2zPPZs2dISEgotK86nOFFRERERERERFSAq1evAgDq16+vaKtVqxYA4MGDB0rtcj/99BPWrFmD5cuXY/LkyZg7dy7WrFmj9vgPHjxQuqOhOk+fPi1xDS9bW1tER0cjMzNTqaB8TEyMYntB7O3tERUVhZSUFMVsLOBlTS554quk4wCyJCGAYt+pkQkvIiIiIiIiIqICXLlyBYBywqtFixYAgPPnz6skvHbv3o1p06Zh9uzZGDduHO7cuYPffvsN06dPV1vk/uLFixg6dGiBMZRGDa9q1arh6tWrCA0NhZeXFwDZzLGwsDBUrFhRUY8sOzsbycnJMDY2hrGxsdL+ly9fRlhYmNLyzFu3bkEikcDJyalI4wBAWlqaSry5ubm4c+cO9PX1UaFCBY2e86uY8CIiIiIiIiIiKsDVq1fh7OystCyxWrVqqFu3LoKCgjBq1ChF+4ULFzB06FAMHToU06dPBwBMmTIFK1euVDvL68KFC3j+/DnefffdAmMojRpeFStWRLVq1XD27FmkpaXBysoKt2/fRlJSEtq2bavoFxMTg3379qFRo0Zo0qSJot3Ozg4eHh4ICwtDbm4unJyc8PjxY9y/fx8NGjRQLFXUdBwAOHXqFDIzM+Hk5AQzMzOkpqbi7t27ePHiBZo3bw6pVFqs58qEFxERERERERFRAa5evap22eKoUaMwY8YMxSylyMhI9OrVCw0bNsTq1asV/SpVqoRRo0ZhzZo1KrO8duzYgSpVqqB9+/av5bm0a9cO5ubmuHPnDjIzM2FjY4OuXbsqZmcVpnXr1jA3N0dYWBjCw8Nhbm6OFi1aKNU3K8o41apVQ1hYGG7evIn09HQYGhrCzs4OzZo1Q9WqVYv9PCVCfisAIiIiIiIiIiLSWEJCAqpVq4Yff/wRo0ePLvL+GRkZqFq1KqZNm4bPP/+8DCLUXbxLIxERERERERFRMVhZWWHKlCn46aef8r0zYkHWrl0LqVSKsWPHlkF0uo0zvIiIiIiIiIiIqFzhDC8iIiIiIiIiIipXmPAiIiIiIiIiIqJyhQkvIiIiIiIiIiIqV5jwIiIiIiIiIiKicoUJLyIiIiIiIiIiKleY8CIiIiIiIiIiKiWJiYlo164dEhMTtR2KTmPCi4iIiIiIiIiolCQmJuLEiRNMeGkZE15ERERERERERFSuMOFFRERERERERETlChNeRERERERERERUrjDhRURERERERERUSiwtLdGyZUtYWlpqOxSdJhFCCG0HQURERERERERUXly+fBkNGjTQdhg6jTO8iIiIiIiIiIioXOEMLyIiIiIiIiKiUpSeng5jY2Nth6HTOMOLiIiIiIiIiKgURUVFaTsEnceEFxERERERERFRKUpKStJ2CDqPCS8iIiIiIiIiolJkZGSk7RB0Hmt4ERERERERERGVopycHOjr62s7DJ3GGV5ERERERERERKXo2rVr2g5B5zHhRURERERERERE5QoTXkREREREREREpahixYraDkHnMeFFRERERERERFSKjI2NtR2CzmPCi4iIiIiIiIioFD169EjbIeg8JryIiIiIiIiIiKhckQghhLaDICIiIiIiIiIqL1JTU2FqaqrtMHQaZ3gREREREREREZWip0+fajsEnceEFxERERERERFRKUpISNB2
CDqPCS8iIiIiIiIiolIklUq1HYLOYw0vIiIiIiIiIiIqVzjDi4iIiIiIiIioFF2+fFnbIeg8JryIiIiIiIiIiKhcYcKLiIi
      "text/plain": [
       "<Figure size 800x1150 with 3 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Show the individual prediction for the highest predicted instance\n",
    "highest_pred_index = np.argmax(shap_values.values[:, 0]) \n",
    "\n",
    "# Use waterfall plot for a single instance\n",
    "shap.plots.waterfall(shap_values[highest_pred_index], max_display=20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAABNsAAAPZCAYAAAAoeixUAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd1hT59sH8O8JhLBlykZwATLELe69Zyu4W7cdjlfraK3Wra22ttVOrauuqqh1r+KsWkFREReKExVERBmyc94/+CUSEzAgEMDv57q4NM95znnuc3ISyJ1nCKIoiiAiIiIiIiIiIqK3JtF1AERERERERERERBUFk21ERERERERERETFhMk2IiIiIiIiIiKiYsJkGxERERERERERUTFhso2IiIiIiIiIiKiYMNlGRERERERERERUTJhsIyIiIiIiIiIiKiZMthERERERERERERUTJtuIiIiIiIiIiIiKCZNtRERERERERERExYTJNiIiIiIiIiIiomLCZBsRERERFRu5XI558+ahWrVqkEqlqFatGhYtWgRPT0/I5fJCH++3336Dq6srMjIySiBaIiIiouIniKIo6joIIiIiIqoYfvrpJ4wbNw6fffYZ/Pz8UKlSJQwdOhTffvsthg4dWujjpaenw83NDdOmTcO4ceNKIGJ1OTk5OHfuHG7evImMjAxYWVmhQYMGcHZ2LrH9w8PDce7cOVhaWiIwMFBt+4sXLxAWFoa4uDikp6fD1NQU1atXR+3ataGvr1/kcyUiIqLix55tRERERFRsVq9ejfbt22Px4sUYPHgwbt++jezsbPTv379IxzM0NMSHH36IJUuWoLS+Iz527BgiIiJQvXp1NGnSBBKJBPv370dsbGyJ7J+SkoKLFy/mmzRLSUnBjh078OTJE3h7e6NJkyaws7PD+fPnERISUuTzJCIiopLBZBsRERERFYv09HRcunQJLVq0UJatXr0aPXr0gKGhYZGPGxQUhHv37uHo0aPFEWaBnjx5gujoaDRs2BCNGzeGl5cXunbtCjMzM5w9e7ZE9v/vv/9QuXJl2Nraatx+8+ZNZGZmolOnTvD394eXlxdatWqFGjVq4N69exxiS0REVMYw2UZEREREb2348OEwMjJCTk4Opk+fDkEQ4ODggIiICLRr106t/sOHD2FoaIhhw4aplP/zzz+QSqWYMGGCsqxevXqwsrLCzp07S/w8bt++DUEQ4OXlpSzT19eHh4cH4uLikJKSUqz7P378GHfu3EGTJk3yPWZmZiYAwNjYWKXc2NgYgiBAIuGf9ERERGUJfzMTERER0VsbOHAgRo8eDQD48ccfsW7dOnz00UcAgLp166rVd3JywogRI7B+/Xrcu3cPAHD9+nUEBgaic+fO+O6771Tq161bF6dOnSowBrlcjvT0dK1+8huSmpCQgEqVKsHAwEClvHLlysrtBSnM/nK5HKdOnYKnpyesrKzyPaajoyMA4Pjx43j69ClSUlIQHR2Nq1evwtvbG1KptMCYiIiIqHRxNlUiIiIiemtt2rRBSEgITExMMGbMGEgkEsyYMQMA4O7urnGfL774An/88Qe++eYbzJ07F926dYObmxs2bdqk1luratWqWLduXYExxMbGYs+ePVrF279/f5iZmamVv3z5Uq0HGfCqV1lqamqBxy3M/teuXUNKSgq6du1a4DFdXFxQv359XLhwQZmYBIA6deqgQYMGBe5LREREpY/JNiIiIiIqFhEREfD29lYmyhISEqCvrw9TU1ON9Z2cnDBy5EisWLEC4eHhSEtLw/Hjx2FiYqJW19LSEmlpafkmswDA2toaXbp00SpWIyMjjeXZ2dnQ09NTK1eU5eTkFHhcbfdPT0/HuXPnULdu3XxjycvMzAwODg5wd3eHoaEh7t+/jwsXLsDIyAg+Pj5v3J+IiIhKD5NtRERERFQsLl26hI4dOxZqn0mTJuGnn35CREQETp48CScnJ431FMM+
BUHI91gymQzOzs6Fav91+vr6GhNqijJNibSi7B8WFgaZTAZvb+83xnTr1i2cOHECffv2VSYu3d3dIYoiQkNDUb169bdagIKIiIiKF5NtRERERPTWnj9/jgcPHsDX11dZZm1tjezsbCQnJ2scsgkA8+fPB5DbI6ygecsSExNhbGxcYC+wnJwcrVfmNDQ01LiwgLGxscahoi9fvgQAjb3uCrv/ixcvcP36dQQEBCjLFfHL5XIkJydDKpUqE2hXr16FjY2NWg/BKlWqICoqCk+fPn3rJCMREREVHybbiIiIiOitRUREAAD8/PyUZZ6engCAO3fuqJQrLF68GH/88Qd++uknTJ48GfPnz8cff/yh8fh37txRWeFTk7i4uLees83a2hqPHj1CZmamyiIHT548UW4viDb7JyUlQRRFnD59GqdPn1Y7xqZNm+Dj46NcoTQtLQ0ymUytnlwuB4B8F3sgIiIi3WCyjYiIiIje2qVLlwCoJtsCAgIAAOfOnVNLtv3999/4/PPPMXfuXHz66ae4efMmfvnlF3z55ZcaF1QIDw/HwIEDC4yhOOZsq1q1KiIiInDt2jXUrl0bQG6Psxs3bqBy5crK3mXZ2dlISUmBoaGhyhBObfbX19dHhw4d1NoOCwtDVlYWmjRpAnNzc2V5pUqVEBMTg+fPn8PCwkJZHh0dDUEQCuwRSERERKWPyTYiIiIiemsRERFwcnJSSfxUrVoVPj4++OeffzBs2DBl+fnz5zFw4EAMHDgQX375JQBgypQp+O233zT2bjt//jyePXuGnj17FhhDcczZVrlyZVStWhWhoaFIS0tDpUqVEBUVheTkZLRs2VJZ78mTJ9izZw/q1q2L+vXrF2p/Q0NDuLm5qbV9+fJlAFDbVrt2bTx48AC7d++Gt7c3ZDIZ7t+/jwcPHsDT0/ONQ1uJiIiodKlPVEFEREREVEgREREah4oOGzYMu3fvRlpaGgAgJiYG3bt3R506dbBixQplPUdHRwwbNgx//vkn7ty5o3KMrVu3wtXVFW3atCnZk/ifVq1awdfXFzdv3sTp06chl8vRqVMnODg4lMr+r3NwcEDPnj1hY2ODK1eu4MyZM0hKSkKDBg3QrFmzIh2TiIiISo4gcpIHIiIiIiohL168QNWqVbFo0SIMHz680PtnZGTAzc0Nn3/+OcaPH18CERIREREVL/ZsIyIiIqISU6lSJUyZMgWLFy9WTuhfGKtXr4ZUKsVHH31UAtERERERFT/2bCMiIiIiIiIiIiom7NlGRERERERERERUTJhsIyIiIiIiIiIiKiZMthERERERERERERUTJtuIiIiIiIiIiIiKCZNtRERERERERERExYTJNiIiIiIqVUlJSWjVqhWSkpJ0HQoRERFRsWOyjYiIiIhKVVJSEo4fP85kGxEREVVITLYREREREREREREVEybbiIiIiIiIiIiIigmTbURERERERERERMWEyTYiIiIiKlXm5uZo0qQJzM3NdR0KERERUbETRFEUdR0EEREREb1bLl68CH9/f12HQURERFTs2LONiIiIiIiIiIiomLBnGxERERGVuvT0dBgaGuo6DCIiIqJix55tRERERFTqHj58qOsQiIiIiEoEk21EREREVOqSk5N1HQIRERFRiWCyjYiIiIhKnUwm03UIRERERCWCc7YRERERUanLycmBnp6ersMgIiIiKnbs2UZEREREpe7y5cu6DoGIiIioRDDZRkREREREREREVEyYbCMiIiKiUle5cmVdh0BERERUIphsIyIiIqJSZ2hoqOsQiIiIiEoEk21EREREVOru37+v6xCIiIiISgSTbURERERERERERMVEEEVR1HUQRERERPRuefnyJYyNjXUdBhEREVGxY882IiIiIip1cXFxug6BiIiIqEQw2UZEREREpe7Fixe6DoGIiIioRDDZRkRERESlTiqV6joEIiIiohLBOduIiIiIiIiIiIiKCXu2EREREVGpu3jxoq5DICIi
IioRTLYREREREREREREVEybbiIiIiKjU2djY6DoEIiIiohLBZBsRERERlTpTU1Ndh0BERERUIphsIyIiIqJSd/fuXV2HQER
      "text/plain": [
       "<Figure size 800x1150 with 3 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Show the individual prediction for the lowest predicted instance\n",
    "lowest_pred_index = np.argmin(shap_values.values[:, 0])  \n",
    "\n",
    "# Use waterfall plot for a single instance\n",
    "shap.plots.waterfall(shap_values[lowest_pred_index], max_display=20)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}