{
"cells": [
{
"cell_type": "markdown",
"id": "84dcd475",
"metadata": {},
"source": [
"# DDRA Joaquin\n",
"\n",
"## General Idea\n",
"We start with a very simple model built on basic booking attributes. It should give a first understanding of what can bring value to the data-driven risk assessment (DDRA) of protected New Dash bookings.\n",
"\n",
"## Initial setup\n",
"This first section only checks that the connection to the data warehouse (DWH) works correctly."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "12368ce1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"🔌 Testing connection using credentials at: /home/joaquin/.superhog-dwh/credentials.yml\n",
"✅ Connection successful.\n"
]
}
],
"source": [
"# Connect to the Data Warehouse (DWH) using PostgreSQL.\n",
"# This setup should be common to all notebooks, but you might need to adjust the path to the `dwh_utils` module.\n",
"\n",
"import sys\n",
"import os\n",
"sys.path.append(os.path.abspath(\"../../utils\")) # Adjust path if needed\n",
"\n",
"from dwh_utils import read_credentials, create_postgres_engine, query_to_dataframe, test_connection\n",
"\n",
"# --- Connect to DWH ---\n",
"creds = read_credentials()\n",
"dwh_pg_engine = create_postgres_engine(creds)\n",
"\n",
"# --- Test Query ---\n",
"test_connection()"
]
},
{
"cell_type": "markdown",
"id": "c86f94f1",
"metadata": {},
"source": [
"## Data Extraction\n",
"In this section we extract the data for our first attempt at modelling with basic booking attributes.\n",
"\n",
"The SQL query below retrieves a clean and relevant subset of booking data for our model. It includes:\n",
"- A **unique booking ID**\n",
"- Key **numeric features** such as number of services, time between booking creation and check-in, and number of nights\n",
"- Several **categorical (boolean) features** related to service usage\n",
"- **Host, guest, listing and verification attributes** joined onto each booking\n",
"- A **target variable** (`has_resolution_incident`) indicating whether a resolution incident occurred\n",
"\n",
"The following filters are applied:\n",
"1. Bookings from **\"New Dash\" users** with a valid deal ID\n",
"2. Only **protected bookings**, i.e., those with Protection or Deposit Management services\n",
"3. Bookings flagged for **risk categorisation** (excluding incomplete/rejected ones)\n",
"4. Bookings that are **already completed**\n",
"5. Bookings **created before 2025-06-25**\n",
"\n",
"The result is loaded into a pandas DataFrame for further processing and modelling.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3e3ed391",
"metadata": {},
"outputs": [],
"source": [
"# Initialise all imports needed for the Notebook\n",
"from sklearn.model_selection import (\n",
" train_test_split, \n",
" GridSearchCV\n",
")\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.feature_selection import RFE\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.utils.class_weight import compute_class_weight\n",
"from sklearn.feature_selection import SelectKBest, f_classif\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import date\n",
"from sklearn.metrics import (\n",
" roc_auc_score, \n",
" average_precision_score,\n",
" classification_report,\n",
" roc_curve, \n",
" auc,\n",
" precision_recall_curve,\n",
" precision_score,\n",
" recall_score,\n",
" fbeta_score,\n",
" confusion_matrix\n",
")\n",
"import matplotlib.pyplot as plt\n",
"import shap\n",
"import math"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "db5e3098",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" id_booking days_from_booking_creation_to_check_in number_of_nights \\\n",
"0 919656 26.0 4.0 \n",
"1 926634 17.0 3.0 \n",
"2 931082 20.0 7.0 \n",
"3 931086 15.0 3.0 \n",
"4 931096 8.0 5.0 \n",
"\n",
" host_town host_country host_postcode host_age host_months_with_truvi \\\n",
"0 Madison CT United States 06443 125.0 8.0 \n",
"1 Madison CT United States 06443 125.0 8.0 \n",
"2 London United Kingdom N16 6DD 125.0 8.0 \n",
"3 London United Kingdom N16 6DD 125.0 8.0 \n",
"4 London United Kingdom N16 6DD 125.0 8.0 \n",
"\n",
" host_account_type host_active_pms_list ... \\\n",
"0 Host Hostaway ... \n",
"1 Host Hostaway ... \n",
"2 PMC - Property Management Company Hostify ... \n",
"3 PMC - Property Management Company Hostify ... \n",
"4 PMC - Property Management Company Hostify ... \n",
"\n",
" number_of_applied_upgraded_services number_of_applied_billable_services \\\n",
"0 2 2 \n",
"1 2 2 \n",
"2 1 1 \n",
"3 1 1 \n",
"4 1 1 \n",
"\n",
" booking_days_to_check_in booking_number_of_nights has_verification_request \\\n",
"0 87 4 False \n",
"1 109 3 False \n",
"2 50 7 False \n",
"3 15 3 False \n",
"4 8 5 False \n",
"\n",
" has_billable_services has_upgraded_screening_service_business_type \\\n",
"0 True False \n",
"1 True False \n",
"2 True False \n",
"3 True False \n",
"4 True False \n",
"\n",
" has_deposit_management_service_business_type \\\n",
"0 True \n",
"1 True \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" has_protection_service_business_type has_resolution_incident \n",
"0 True False \n",
"1 True False \n",
"2 True False \n",
"3 True False \n",
"4 True False \n",
"\n",
"[5 rows x 64 columns]\n",
"Total Bookings: 21,307\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_48568/805553034.py:455: DtypeWarning: Columns (50) have mixed types. Specify dtype option on import or set low_memory=False.\n",
" df_extraction = pd.read_csv(\"/home/joaquin/data-jupyter-notebooks/data_driven_risk_assessment/experiments/data.csv\")\n"
]
}
],
"source": [
"# Query to extract data\n",
"data_extraction_query = \"\"\"\n",
"with\n",
" int_core__verification_requests as (\n",
" select *\n",
" from intermediate.int_core__verification_requests\n",
" where created_date_utc >= '2024-10-21'\n",
" ),\n",
" int_core__bookings as (\n",
" select *\n",
" from intermediate.int_core__bookings\n",
" where created_date_utc >= '2024-10-21'\n",
" ),\n",
" stg_core__verification as (\n",
" select *\n",
" from staging.stg_core__verification\n",
" where created_date_utc >= '2024-10-21'\n",
" ),\n",
" int_core__guest_journey_payments as (\n",
" select *\n",
" from intermediate.int_core__guest_journey_payments\n",
" where payment_due_date_utc >= '2024-10-21'\n",
" ),\n",
" filtered_bookings as (\n",
" select *\n",
" from intermediate.int_booking_summary\n",
" where\n",
" is_user_in_new_dash = true\n",
" and is_missing_id_deal = false\n",
" and (\n",
" has_protection_service_business_type\n",
" or has_deposit_management_service_business_type\n",
" )\n",
" and is_booking_flagged_as_risk is not null\n",
" and is_booking_past_completion_date = true\n",
" and booking_created_date_utc < '2025-06-25'\n",
" ),\n",
" previous_booking_counts as (\n",
" select\n",
" id_booking,\n",
" id_accommodation,\n",
" id_user_guest,\n",
" booking_check_in_date_utc,\n",
" booking_check_out_date_utc,\n",
" count(*) over (\n",
" partition by id_accommodation\n",
" order by booking_check_in_date_utc\n",
" rows between unbounded preceding and 1 preceding\n",
" ) as previous_bookings_in_listing_count,\n",
" count(*) over (\n",
" partition by id_user_guest\n",
" order by booking_check_in_date_utc\n",
" rows between unbounded preceding and 1 preceding\n",
" ) as previous_guest_bookings_count\n",
" from filtered_bookings\n",
" ),\n",
" listing_info as (\n",
" select\n",
" id_accommodation,\n",
" address_line_1 as listing_address,\n",
" town as listing_town,\n",
" country_name as listing_country,\n",
" postcode as listing_postcode,\n",
" number_of_bedrooms,\n",
" number_of_bathrooms,\n",
" friendly_name as listing_description,\n",
" id_user_host\n",
" from intermediate.int_core__accommodation\n",
" ),\n",
" host_info as (\n",
" select\n",
" scu.id_user as id_user_host,\n",
" icuh.account_type,\n",
" icuh.active_pms_list,\n",
" scc.country_name,\n",
" scu.billing_town,\n",
" scu.billing_postcode,\n",
" scu.id_billing_country,\n",
" extract(year from age(current_date, scu.date_of_birth)) as host_age,\n",
" extract(\n",
" month from age(current_date, scu.joined_date_utc)\n",
" ) as host_months_with_truvi\n",
" from staging.stg_core__user scu\n",
" left join\n",
" staging.stg_core__country scc on scu.id_billing_country = scc.id_country\n",
" left join\n",
" intermediate.int_core__user_host icuh on icuh.id_user_host = scu.id_user\n",
" ),\n",
" guest_info as (\n",
" select\n",
" scu.id_user as id_user_guest,\n",
" scc.country_name,\n",
" scu.billing_town,\n",
" scu.billing_postcode,\n",
" scu.id_billing_country,\n",
" extract(year from age(current_date, scu.date_of_birth)) as guest_age,\n",
" scu.email,\n",
" scu.phone_number\n",
" from staging.stg_core__user scu\n",
" left join\n",
" staging.stg_core__country scc on scu.id_billing_country = scc.id_country\n",
" ),\n",
" host_listing_counts as (\n",
" select id_user_host, count(*) as number_of_listings_of_host\n",
" from intermediate.int_core__accommodation\n",
" where is_active = true\n",
" group by id_user_host\n",
" ),\n",
" listing_incident_counts as (\n",
" select\n",
" i.created_date_utc::date as date_day,\n",
" i.id_accommodation,\n",
" count(*) over (\n",
" partition by i.id_accommodation\n",
" order by i.created_date_utc::date\n",
" rows between unbounded preceding and current row\n",
" ) as number_of_previous_incidents_in_listing,\n",
" count(i.calculated_payout_amount_in_txn_currency) over (\n",
" partition by i.id_accommodation\n",
" order by i.created_date_utc::date\n",
" rows between unbounded preceding and current row\n",
" ) as number_of_previous_payouts_in_listing\n",
" from intermediate.int_resolutions__incidents i\n",
" where\n",
" i.id_accommodation is not null\n",
" and i.created_date_utc::date between '2024-10-21' and current_date\n",
" order by i.id_accommodation, date_day\n",
" ),\n",
" guest_incident_counts as (\n",
" select\n",
" i.created_date_utc::date as date_day,\n",
" i.id_user_guest,\n",
" count(*) over (\n",
" partition by i.id_user_guest\n",
" order by i.created_date_utc::date\n",
" rows between unbounded preceding and current row\n",
" ) as number_of_previous_incidents_of_guest\n",
" from intermediate.int_resolutions__incidents i\n",
" where\n",
" i.id_user_guest is not null\n",
" and i.created_date_utc::date between '2024-10-21' and current_date\n",
" order by i.id_user_guest, date_day\n",
" ),\n",
" host_incident_counts as (\n",
" select\n",
" i.created_date_utc::date as date_day,\n",
" i.id_user_host,\n",
" count(*) over (\n",
" partition by i.id_user_host\n",
" order by i.created_date_utc::date\n",
" rows between unbounded preceding and current row\n",
" ) as number_of_previous_incidents_of_host,\n",
" count(i.calculated_payout_amount_in_txn_currency) over (\n",
" partition by i.id_user_host\n",
" order by i.created_date_utc::date\n",
" rows between unbounded preceding and current row\n",
" ) as number_of_previous_payouts_of_host\n",
" from intermediate.int_resolutions__incidents i\n",
" where\n",
" i.id_user_host is not null\n",
" and i.created_date_utc::date between '2024-10-21' and current_date\n",
" order by i.id_user_host, date_day\n",
" ),\n",
" verification_requests as (\n",
" select\n",
" icvr.id_verification_request,\n",
" extract(\n",
" day\n",
" from\n",
" age(\n",
" icvr.verification_estimated_started_date_utc,\n",
" icb.created_date_utc\n",
" )\n",
" ) as days_to_start_verification,\n",
" extract(\n",
" day\n",
" from\n",
" age(\n",
" icvr.verification_estimated_completed_date_utc,\n",
" icvr.verification_estimated_started_date_utc\n",
" )\n",
" ) as days_to_complete_verification,\n",
" -- CSAT Results\n",
" gsr.experience_rating as guest_csat_score,\n",
" gsr.guest_comments as guest_csat_comments,\n",
" -- GUEST_PRODUCT fields\n",
" max(\n",
" case\n",
" when guest_journey_product_type = 'GUEST_PRODUCT' then product_name\n",
" end\n",
" ) as guest_product_name,\n",
" max(\n",
" case when guest_journey_product_type = 'GUEST_PRODUCT' then currency end\n",
" ) as guest_currency,\n",
" max(\n",
" case\n",
" when guest_journey_product_type = 'GUEST_PRODUCT'\n",
" then total_amount_in_txn_currency\n",
" end\n",
" ) as guest_total_amount,\n",
" -- VERIFICATION_PRODUCT fields\n",
" max(\n",
" case\n",
" when guest_journey_product_type = 'VERIFICATION_PRODUCT'\n",
" then product_name\n",
" end\n",
" ) as verification_product_name,\n",
" max(\n",
" case\n",
" when guest_journey_product_type = 'VERIFICATION_PRODUCT'\n",
" then currency\n",
" end\n",
" ) as verification_currency,\n",
" max(\n",
" case\n",
" when guest_journey_product_type = 'VERIFICATION_PRODUCT'\n",
" then total_amount_in_txn_currency\n",
" end\n",
" ) as verification_total_amount,\n",
" -- Verification Results\n",
" max(\n",
" case when scv.verification = 'Screening' then id_verification_status end\n",
" ) as screening_status,\n",
" max(\n",
" case\n",
" when scv.verification = 'GovernmentId' then id_verification_status\n",
" end\n",
" ) as government_id_status,\n",
" max(\n",
" case when scv.verification = 'Contract' then id_verification_status end\n",
" ) as contract_status,\n",
" max(\n",
" case\n",
" when scv.verification = 'SelfieConfidenceScore'\n",
" then id_verification_status\n",
" end\n",
" ) as selfie_confidence_score_status,\n",
" max(\n",
" case\n",
" when scv.verification = 'PaymentValidation'\n",
" then id_verification_status\n",
" end\n",
" ) as payment_validation_status,\n",
" max(\n",
" case when scv.verification = 'FirstName' then id_verification_status end\n",
" ) as first_name_status,\n",
" max(\n",
" case\n",
" when scv.verification = 'DateOfBirth' then id_verification_status\n",
" end\n",
" ) as date_of_birth_status,\n",
" max(\n",
" case when scv.verification = 'LastName' then id_verification_status end\n",
" ) as last_name_status,\n",
" max(\n",
" case\n",
" when scv.verification = 'AutohostPartner'\n",
" then id_verification_status\n",
" end\n",
" ) as autohost_partner_status,\n",
" max(\n",
" case\n",
" when scv.verification = 'CriminalRecord' then id_verification_status\n",
" end\n",
" ) as criminal_record_status\n",
" from int_core__verification_requests icvr\n",
" left join\n",
" int_core__bookings icb\n",
" on icb.id_verification_request = icvr.id_verification_request\n",
" left join\n",
" stg_core__verification scv\n",
" on scv.id_verification_request = icvr.id_verification_request\n",
" left join\n",
" int_core__guest_journey_payments gjp\n",
" on gjp.id_verification_request = icb.id_verification_request\n",
" left join\n",
" intermediate.int_core__guest_satisfaction_responses gsr\n",
" on gsr.id_verification_request = icvr.id_verification_request\n",
" and scv.verification in (\n",
" 'Screening',\n",
" 'GovernmentId',\n",
" 'Contract',\n",
" 'SelfieConfidenceScore',\n",
" 'PaymentValidation',\n",
" 'FirstName',\n",
" 'DateOfBirth',\n",
" 'LastName',\n",
" 'AutohostPartner',\n",
" 'CriminalRecord'\n",
" )\n",
" group by 1, 2, 3, 4, 5\n",
" )\n",
"select\n",
" fb.id_booking,\n",
" extract(day from age(fb.booking_check_in_date_utc, fb.booking_created_date_utc)) as days_from_booking_creation_to_check_in,\n",
" extract(day from age(fb.booking_check_out_date_utc, fb.booking_check_in_date_utc)) as number_of_nights,\n",
" -- Host Info\n",
" hi.billing_town as host_town,\n",
" hi.country_name as host_country,\n",
" hi.billing_postcode as host_postcode,\n",
" hi.host_age,\n",
" hi.host_months_with_truvi,\n",
" hi.account_type as host_account_type,\n",
" hi.active_pms_list as host_active_pms_list,\n",
" coalesce(hlc.number_of_listings_of_host, 0) as number_of_listings_of_host,\n",
" coalesce(\n",
" hic.number_of_previous_incidents_of_host, 0\n",
" ) as number_of_previous_incidents_of_host,\n",
" coalesce(\n",
" hic.number_of_previous_payouts_of_host, 0\n",
" ) as number_of_previous_payouts_of_host,\n",
" -- Guest Info\n",
" gi.billing_town as guest_town,\n",
" gi.country_name as guest_country,\n",
" gi.billing_postcode as guest_postcode,\n",
" gi.guest_age,\n",
" coalesce(\n",
" pbc.previous_guest_bookings_count, 0\n",
" ) as number_of_previous_bookings_of_guest,\n",
" coalesce(\n",
" gic.number_of_previous_incidents_of_guest, 0\n",
" ) as number_of_previous_incidents_of_guest,\n",
" case\n",
" when pbc.previous_bookings_in_listing_count > 0 then true else false\n",
" end as has_guest_previously_booked_same_listing,\n",
" -- Listing Info\n",
" li.listing_address,\n",
" li.listing_town,\n",
" li.listing_country,\n",
" li.listing_postcode,\n",
" li.number_of_bedrooms as listing_number_of_bedrooms,\n",
" li.number_of_bathrooms as listing_number_of_bathrooms,\n",
" li.listing_description,\n",
" coalesce(pbc.previous_bookings_in_listing_count, 0) as previous_bookings_in_listing_count,\n",
" coalesce(lic.number_of_previous_incidents_in_listing, 0) as number_of_previous_incidents_in_listing,\n",
" coalesce(lic.number_of_previous_payouts_in_listing, 0) as number_of_previous_payouts_in_listing,\n",
" -- Verification Info\n",
" case\n",
" when fb.id_verification_request is null then 0\n",
" else vr.days_to_start_verification\n",
" end as days_to_start_verification,\n",
" case \n",
" when vr.id_verification_request is null then 0\n",
" else vr.days_to_complete_verification\n",
" end as days_to_complete_verification,\n",
" vr.screening_status,\n",
" vr.government_id_status,\n",
" vr.contract_status,\n",
" vr.selfie_confidence_score_status,\n",
" vr.payment_validation_status,\n",
" vr.first_name_status,\n",
" vr.date_of_birth_status,\n",
" vr.last_name_status,\n",
" vr.autohost_partner_status,\n",
" vr.criminal_record_status,\n",
" vr.guest_csat_score,\n",
" vr.guest_csat_comments,\n",
" -- Boolean features\n",
" gi.email is not null as guest_has_email,\n",
" gi.phone_number is not null as guest_has_phone_number,\n",
" case \n",
" when gi.billing_town is null or li.listing_town is null then null \n",
" when gi.billing_town = li.listing_town \n",
" then true else false \n",
" end as is_guest_from_listing_town,\n",
" case \n",
" when gi.country_name is null or li.listing_country is null then null\n",
" when gi.country_name = li.listing_country \n",
" then true else false \n",
" end as is_guest_from_listing_country,\n",
" case \n",
" when gi.billing_postcode is null or li.listing_postcode is null then null\n",
" when gi.billing_postcode = li.listing_postcode \n",
" then true else false \n",
" end as is_guest_from_listing_postcode,\n",
" case \n",
" when hi.billing_town is null or li.listing_town is null then null\n",
" when hi.billing_town = li.listing_town \n",
" then true else false \n",
" end as is_host_from_listing_town,\n",
" case \n",
" when hi.country_name is null or li.listing_country is null then null\n",
" when hi.country_name = li.listing_country \n",
" then true else false \n",
" end as is_host_from_listing_country,\n",
" case \n",
" when hi.billing_postcode is null or li.listing_postcode is null then null\n",
" when hi.billing_postcode = li.listing_postcode \n",
" then true else false \n",
" end as is_host_from_listing_postcode,\n",
" case\n",
" when vr.days_to_complete_verification is null then false\n",
" else true\n",
" end as has_completed_verification,\n",
" -- Numeric features\n",
" fb.number_of_applied_services,\n",
" fb.number_of_applied_upgraded_services,\n",
" fb.number_of_applied_billable_services,\n",
" fb.booking_check_in_date_utc\n",
" - fb.booking_created_date_utc as booking_days_to_check_in,\n",
" fb.booking_number_of_nights,\n",
" -- Categorical features\n",
" fb.has_verification_request,\n",
" fb.has_billable_services,\n",
" fb.has_upgraded_screening_service_business_type,\n",
" fb.has_deposit_management_service_business_type,\n",
" fb.has_protection_service_business_type,\n",
" -- Target\n",
" fb.has_resolution_incident\n",
"from filtered_bookings fb\n",
"left join previous_booking_counts pbc on fb.id_booking = pbc.id_booking\n",
"left join listing_info li on li.id_accommodation = fb.id_accommodation\n",
"left join host_info hi on hi.id_user_host = fb.id_user_host\n",
"left join guest_info gi on gi.id_user_guest = fb.id_user_guest\n",
"left join host_listing_counts hlc on li.id_user_host = hlc.id_user_host\n",
"left join\n",
" lateral(\n",
" select *\n",
" from listing_incident_counts lic\n",
" where\n",
" lic.id_accommodation = fb.id_accommodation\n",
" and lic.date_day <= fb.booking_check_in_date_utc\n",
" order by lic.date_day desc\n",
" limit 1\n",
" ) lic\n",
" on true\n",
"left join\n",
" lateral(\n",
" select *\n",
" from guest_incident_counts gic\n",
" where\n",
" gic.id_user_guest = fb.id_user_guest\n",
" and gic.date_day <= fb.booking_check_in_date_utc\n",
" order by gic.date_day desc\n",
" limit 1\n",
" ) gic\n",
" on true\n",
"left join\n",
" lateral(\n",
" select *\n",
" from host_incident_counts hic\n",
" where\n",
" hic.id_user_host = fb.id_user_host\n",
" and hic.date_day <= fb.booking_check_in_date_utc\n",
" order by hic.date_day desc\n",
" limit 1\n",
" ) hic\n",
" on true\n",
"left join\n",
" verification_requests vr on vr.id_verification_request = fb.id_verification_request\n",
"\"\"\"\n",
"\n",
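"# Retrieve data\n",
"# The DWH query call is left commented out; a local CSV file is loaded instead.\n",
"# df_extraction = query_to_dataframe(engine=dwh_pg_engine, query=data_extraction_query)\n",
"# low_memory=False makes pandas infer each column's dtype from the whole file, which avoids the mixed-type DtypeWarning.\n",
"df_extraction = pd.read_csv(\"/home/joaquin/data-jupyter-notebooks/data_driven_risk_assessment/experiments/data.csv\", low_memory=False)\n",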
"print(df_extraction.head())\n",
"print(f\"Total Bookings: {len(df_extraction):,}\")\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b56a8530",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id_booking</th>\n",
" <th>days_from_booking_creation_to_check_in</th>\n",
" <th>number_of_nights</th>\n",
" <th>host_town</th>\n",
" <th>host_country</th>\n",
" <th>host_postcode</th>\n",
" <th>host_age</th>\n",
" <th>host_months_with_truvi</th>\n",
" <th>host_account_type</th>\n",
" <th>host_active_pms_list</th>\n",
" <th>...</th>\n",
" <th>number_of_applied_upgraded_services</th>\n",
" <th>number_of_applied_billable_services</th>\n",
" <th>booking_days_to_check_in</th>\n",
" <th>booking_number_of_nights</th>\n",
" <th>has_verification_request</th>\n",
" <th>has_billable_services</th>\n",
" <th>has_upgraded_screening_service_business_type</th>\n",
" <th>has_deposit_management_service_business_type</th>\n",
" <th>has_protection_service_business_type</th>\n",
" <th>has_resolution_incident</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>919656</td>\n",
" <td>26.0</td>\n",
" <td>4.0</td>\n",
" <td>Madison CT</td>\n",
" <td>United States</td>\n",
" <td>06443</td>\n",
" <td>125.0</td>\n",
" <td>8.0</td>\n",
" <td>Host</td>\n",
" <td>Hostaway</td>\n",
" <td>...</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>87</td>\n",
" <td>4</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>926634</td>\n",
" <td>17.0</td>\n",
" <td>3.0</td>\n",
" <td>Madison CT</td>\n",
" <td>United States</td>\n",
" <td>06443</td>\n",
" <td>125.0</td>\n",
" <td>8.0</td>\n",
" <td>Host</td>\n",
" <td>Hostaway</td>\n",
" <td>...</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>109</td>\n",
" <td>3</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>931082</td>\n",
" <td>20.0</td>\n",
" <td>7.0</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" <td>N16 6DD</td>\n",
" <td>125.0</td>\n",
" <td>8.0</td>\n",
" <td>PMC - Property Management Company</td>\n",
" <td>Hostify</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>50</td>\n",
" <td>7</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>931086</td>\n",
" <td>15.0</td>\n",
" <td>3.0</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" <td>N16 6DD</td>\n",
" <td>125.0</td>\n",
" <td>8.0</td>\n",
" <td>PMC - Property Management Company</td>\n",
" <td>Hostify</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>15</td>\n",
" <td>3</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>931096</td>\n",
" <td>8.0</td>\n",
" <td>5.0</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" <td>N16 6DD</td>\n",
" <td>125.0</td>\n",
" <td>8.0</td>\n",
" <td>PMC - Property Management Company</td>\n",
" <td>Hostify</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>8</td>\n",
" <td>5</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 64 columns</p>\n",
"</div>"
],
"text/plain": [
" id_booking days_from_booking_creation_to_check_in number_of_nights \\\n",
"0 919656 26.0 4.0 \n",
"1 926634 17.0 3.0 \n",
"2 931082 20.0 7.0 \n",
"3 931086 15.0 3.0 \n",
"4 931096 8.0 5.0 \n",
"\n",
" host_town host_country host_postcode host_age host_months_with_truvi \\\n",
"0 Madison CT United States 06443 125.0 8.0 \n",
"1 Madison CT United States 06443 125.0 8.0 \n",
"2 London United Kingdom N16 6DD 125.0 8.0 \n",
"3 London United Kingdom N16 6DD 125.0 8.0 \n",
"4 London United Kingdom N16 6DD 125.0 8.0 \n",
"\n",
" host_account_type host_active_pms_list ... \\\n",
"0 Host Hostaway ... \n",
"1 Host Hostaway ... \n",
"2 PMC - Property Management Company Hostify ... \n",
"3 PMC - Property Management Company Hostify ... \n",
"4 PMC - Property Management Company Hostify ... \n",
"\n",
" number_of_applied_upgraded_services number_of_applied_billable_services \\\n",
"0 2 2 \n",
"1 2 2 \n",
"2 1 1 \n",
"3 1 1 \n",
"4 1 1 \n",
"\n",
" booking_days_to_check_in booking_number_of_nights has_verification_request \\\n",
"0 87 4 False \n",
"1 109 3 False \n",
"2 50 7 False \n",
"3 15 3 False \n",
"4 8 5 False \n",
"\n",
" has_billable_services has_upgraded_screening_service_business_type \\\n",
"0 True False \n",
"1 True False \n",
"2 True False \n",
"3 True False \n",
"4 True False \n",
"\n",
" has_deposit_management_service_business_type \\\n",
"0 True \n",
"1 True \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" has_protection_service_business_type has_resolution_incident \n",
"0 True False \n",
"1 True False \n",
"2 True False \n",
"3 True False \n",
"4 True False \n",
"\n",
"[5 rows x 64 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_extraction.head()"
]
},
{
"cell_type": "markdown",
"id": "e9a9da26",
"metadata": {},
"source": [
"## Exploratory Data Analysis"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f4545e95",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset size: 21,307 rows and 63 columns\n"
]
}
],
"source": [
"# Copy dataset to make changes and drop id_booking column\n",
"df = df_extraction.copy().drop(columns=['id_booking'])\n",
"\n",
"# Check size of the dataset\n",
"print(f\"Dataset size: {df.shape[0]:,} rows and {df.shape[1]:,} columns\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "de574969",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>days_from_booking_creation_to_check_in</th>\n",
" <th>number_of_nights</th>\n",
" <th>host_town</th>\n",
" <th>host_country</th>\n",
" <th>host_postcode</th>\n",
" <th>host_age</th>\n",
" <th>host_months_with_truvi</th>\n",
" <th>host_account_type</th>\n",
" <th>host_active_pms_list</th>\n",
" <th>number_of_listings_of_host</th>\n",
" <th>number_of_previous_incidents_of_host</th>\n",
" <th>number_of_previous_payouts_of_host</th>\n",
" <th>guest_town</th>\n",
" <th>guest_country</th>\n",
" <th>guest_postcode</th>\n",
" <th>guest_age</th>\n",
" <th>number_of_previous_bookings_of_guest</th>\n",
" <th>number_of_previous_incidents_of_guest</th>\n",
" <th>has_guest_previously_booked_same_listing</th>\n",
" <th>listing_address</th>\n",
" <th>listing_town</th>\n",
" <th>listing_country</th>\n",
" <th>listing_postcode</th>\n",
" <th>listing_number_of_bedrooms</th>\n",
" <th>listing_number_of_bathrooms</th>\n",
" <th>listing_description</th>\n",
" <th>previous_bookings_in_listing_count</th>\n",
" <th>number_of_previous_incidents_in_listing</th>\n",
" <th>number_of_previous_payouts_in_listing</th>\n",
" <th>days_to_start_verification</th>\n",
" <th>days_to_complete_verification</th>\n",
" <th>screening_status</th>\n",
" <th>government_id_status</th>\n",
" <th>contract_status</th>\n",
" <th>selfie_confidence_score_status</th>\n",
" <th>payment_validation_status</th>\n",
" <th>first_name_status</th>\n",
" <th>date_of_birth_status</th>\n",
" <th>last_name_status</th>\n",
" <th>autohost_partner_status</th>\n",
" <th>criminal_record_status</th>\n",
" <th>guest_csat_score</th>\n",
" <th>guest_csat_comments</th>\n",
" <th>guest_has_email</th>\n",
" <th>guest_has_phone_number</th>\n",
" <th>is_guest_from_listing_town</th>\n",
" <th>is_guest_from_listing_country</th>\n",
" <th>is_guest_from_listing_postcode</th>\n",
" <th>is_host_from_listing_town</th>\n",
" <th>is_host_from_listing_country</th>\n",
" <th>is_host_from_listing_postcode</th>\n",
" <th>has_completed_verification</th>\n",
" <th>number_of_applied_services</th>\n",
" <th>number_of_applied_upgraded_services</th>\n",
" <th>number_of_applied_billable_services</th>\n",
" <th>booking_days_to_check_in</th>\n",
" <th>booking_number_of_nights</th>\n",
" <th>has_verification_request</th>\n",
" <th>has_billable_services</th>\n",
" <th>has_upgraded_screening_service_business_type</th>\n",
" <th>has_deposit_management_service_business_type</th>\n",
" <th>has_protection_service_business_type</th>\n",
" <th>has_resolution_incident</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>26.0</td>\n",
" <td>4.0</td>\n",
" <td>Madison CT</td>\n",
" <td>United States</td>\n",
" <td>06443</td>\n",
" <td>125.0</td>\n",
" <td>8.0</td>\n",
" <td>Host</td>\n",
" <td>Hostaway</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1032</td>\n",
" <td>0</td>\n",
" <td>True</td>\n",
" <td>389 Mountain View Dr, Jeffersonville, VT 05464...</td>\n",
" <td>Cambridge</td>\n",
" <td>United States</td>\n",
" <td>05464</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>Mountain Life Retreat at Smuggler's Notch Resort</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>87</td>\n",
" <td>4</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>17.0</td>\n",
" <td>3.0</td>\n",
" <td>Madison CT</td>\n",
" <td>United States</td>\n",
" <td>06443</td>\n",
" <td>125.0</td>\n",
" <td>8.0</td>\n",
" <td>Host</td>\n",
" <td>Hostaway</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1900</td>\n",
" <td>0</td>\n",
" <td>True</td>\n",
" <td>389 Mountain View Dr, Jeffersonville, VT 05464...</td>\n",
" <td>Cambridge</td>\n",
" <td>United States</td>\n",
" <td>05464</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>Mountain Life Retreat at Smuggler's Notch Resort</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>109</td>\n",
" <td>3</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>20.0</td>\n",
" <td>7.0</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" <td>N16 6DD</td>\n",
" <td>125.0</td>\n",
" <td>8.0</td>\n",
" <td>PMC - Property Management Company</td>\n",
" <td>Hostify</td>\n",
" <td>467</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>610</td>\n",
" <td>0</td>\n",
" <td>True</td>\n",
" <td>Tudor Grange Hotel, 31 Gervis Road</td>\n",
" <td>Dorset</td>\n",
" <td>United Kingdom</td>\n",
" <td>BH1 3EE</td>\n",
" <td>12.0</td>\n",
" <td>12.0</td>\n",
" <td>Mansion by the Sea, 12BR/12BA, Perfect for Events</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>50</td>\n",
" <td>7</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>15.0</td>\n",
" <td>3.0</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" <td>N16 6DD</td>\n",
" <td>125.0</td>\n",
" <td>8.0</td>\n",
" <td>PMC - Property Management Company</td>\n",
" <td>Hostify</td>\n",
" <td>467</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>136</td>\n",
" <td>0</td>\n",
" <td>True</td>\n",
" <td>Tudor Grange Hotel, 31 Gervis Road</td>\n",
" <td>Dorset</td>\n",
" <td>United Kingdom</td>\n",
" <td>BH1 3EE</td>\n",
" <td>12.0</td>\n",
" <td>12.0</td>\n",
" <td>Mansion by the Sea, 12BR/12BA, Perfect for Events</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>15</td>\n",
" <td>3</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>8.0</td>\n",
" <td>5.0</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" <td>N16 6DD</td>\n",
" <td>125.0</td>\n",
" <td>8.0</td>\n",
" <td>PMC - Property Management Company</td>\n",
" <td>Hostify</td>\n",
" <td>467</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>73</td>\n",
" <td>0</td>\n",
" <td>False</td>\n",
" <td>Aird House, 15 Wellesley Ct, Rockingham Street</td>\n",
" <td>Greater London</td>\n",
" <td>United Kingdom</td>\n",
" <td>SE1 6PD</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>Your London Home: 2BR Flat with Modern Amenities</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>8</td>\n",
" <td>5</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" days_from_booking_creation_to_check_in number_of_nights host_town \\\n",
"0 26.0 4.0 Madison CT \n",
"1 17.0 3.0 Madison CT \n",
"2 20.0 7.0 London \n",
"3 15.0 3.0 London \n",
"4 8.0 5.0 London \n",
"\n",
" host_country host_postcode host_age host_months_with_truvi \\\n",
"0 United States 06443 125.0 8.0 \n",
"1 United States 06443 125.0 8.0 \n",
"2 United Kingdom N16 6DD 125.0 8.0 \n",
"3 United Kingdom N16 6DD 125.0 8.0 \n",
"4 United Kingdom N16 6DD 125.0 8.0 \n",
"\n",
" host_account_type host_active_pms_list \\\n",
"0 Host Hostaway \n",
"1 Host Hostaway \n",
"2 PMC - Property Management Company Hostify \n",
"3 PMC - Property Management Company Hostify \n",
"4 PMC - Property Management Company Hostify \n",
"\n",
" number_of_listings_of_host number_of_previous_incidents_of_host \\\n",
"0 2 0 \n",
"1 2 0 \n",
"2 467 0 \n",
"3 467 0 \n",
"4 467 0 \n",
"\n",
" number_of_previous_payouts_of_host guest_town guest_country guest_postcode \\\n",
"0 0 NaN NaN NaN \n",
"1 0 NaN NaN NaN \n",
"2 0 NaN NaN NaN \n",
"3 0 NaN NaN NaN \n",
"4 0 NaN NaN NaN \n",
"\n",
" guest_age number_of_previous_bookings_of_guest \\\n",
"0 NaN 1032 \n",
"1 NaN 1900 \n",
"2 NaN 610 \n",
"3 NaN 136 \n",
"4 NaN 73 \n",
"\n",
" number_of_previous_incidents_of_guest \\\n",
"0 0 \n",
"1 0 \n",
"2 0 \n",
"3 0 \n",
"4 0 \n",
"\n",
" has_guest_previously_booked_same_listing \\\n",
"0 True \n",
"1 True \n",
"2 True \n",
"3 True \n",
"4 False \n",
"\n",
" listing_address listing_town \\\n",
"0 389 Mountain View Dr, Jeffersonville, VT 05464... Cambridge \n",
"1 389 Mountain View Dr, Jeffersonville, VT 05464... Cambridge \n",
"2 Tudor Grange Hotel, 31 Gervis Road Dorset \n",
"3 Tudor Grange Hotel, 31 Gervis Road Dorset \n",
"4 Aird House, 15 Wellesley Ct, Rockingham Street Greater London \n",
"\n",
" listing_country listing_postcode listing_number_of_bedrooms \\\n",
"0 United States 05464 2.0 \n",
"1 United States 05464 2.0 \n",
"2 United Kingdom BH1 3EE 12.0 \n",
"3 United Kingdom BH1 3EE 12.0 \n",
"4 United Kingdom SE1 6PD 2.0 \n",
"\n",
" listing_number_of_bathrooms \\\n",
"0 2.0 \n",
"1 2.0 \n",
"2 12.0 \n",
"3 12.0 \n",
"4 1.0 \n",
"\n",
" listing_description \\\n",
"0 Mountain Life Retreat at Smuggler's Notch Resort \n",
"1 Mountain Life Retreat at Smuggler's Notch Resort \n",
"2 Mansion by the Sea, 12BR/12BA, Perfect for Events \n",
"3 Mansion by the Sea, 12BR/12BA, Perfect for Events \n",
"4 Your London Home: 2BR Flat with Modern Amenities \n",
"\n",
" previous_bookings_in_listing_count \\\n",
"0 3 \n",
"1 5 \n",
"2 5 \n",
"3 2 \n",
"4 0 \n",
"\n",
" number_of_previous_incidents_in_listing \\\n",
"0 0 \n",
"1 0 \n",
"2 0 \n",
"3 0 \n",
"4 0 \n",
"\n",
" number_of_previous_payouts_in_listing days_to_start_verification \\\n",
"0 0 0.0 \n",
"1 0 0.0 \n",
"2 0 0.0 \n",
"3 0 0.0 \n",
"4 0 0.0 \n",
"\n",
" days_to_complete_verification screening_status government_id_status \\\n",
"0 0.0 NaN NaN \n",
"1 0.0 NaN NaN \n",
"2 0.0 NaN NaN \n",
"3 0.0 NaN NaN \n",
"4 0.0 NaN NaN \n",
"\n",
" contract_status selfie_confidence_score_status payment_validation_status \\\n",
"0 NaN NaN NaN \n",
"1 NaN NaN NaN \n",
"2 NaN NaN NaN \n",
"3 NaN NaN NaN \n",
"4 NaN NaN NaN \n",
"\n",
" first_name_status date_of_birth_status last_name_status \\\n",
"0 NaN NaN NaN \n",
"1 NaN NaN NaN \n",
"2 NaN NaN NaN \n",
"3 NaN NaN NaN \n",
"4 NaN NaN NaN \n",
"\n",
" autohost_partner_status criminal_record_status guest_csat_score \\\n",
"0 NaN NaN NaN \n",
"1 NaN NaN NaN \n",
"2 NaN NaN NaN \n",
"3 NaN NaN NaN \n",
"4 NaN NaN NaN \n",
"\n",
" guest_csat_comments guest_has_email guest_has_phone_number \\\n",
"0 NaN False False \n",
"1 NaN False False \n",
"2 NaN False False \n",
"3 NaN False False \n",
"4 NaN False False \n",
"\n",
" is_guest_from_listing_town is_guest_from_listing_country \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" is_guest_from_listing_postcode is_host_from_listing_town \\\n",
"0 NaN False \n",
"1 NaN False \n",
"2 NaN False \n",
"3 NaN False \n",
"4 NaN False \n",
"\n",
" is_host_from_listing_country is_host_from_listing_postcode \\\n",
"0 True False \n",
"1 True False \n",
"2 True False \n",
"3 True False \n",
"4 True False \n",
"\n",
" has_completed_verification number_of_applied_services \\\n",
"0 False 3 \n",
"1 False 3 \n",
"2 False 2 \n",
"3 False 2 \n",
"4 False 2 \n",
"\n",
" number_of_applied_upgraded_services number_of_applied_billable_services \\\n",
"0 2 2 \n",
"1 2 2 \n",
"2 1 1 \n",
"3 1 1 \n",
"4 1 1 \n",
"\n",
" booking_days_to_check_in booking_number_of_nights \\\n",
"0 87 4 \n",
"1 109 3 \n",
"2 50 7 \n",
"3 15 3 \n",
"4 8 5 \n",
"\n",
" has_verification_request has_billable_services \\\n",
"0 False True \n",
"1 False True \n",
"2 False True \n",
"3 False True \n",
"4 False True \n",
"\n",
" has_upgraded_screening_service_business_type \\\n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" has_deposit_management_service_business_type \\\n",
"0 True \n",
"1 True \n",
"2 False \n",
"3 False \n",
"4 False \n",
"\n",
" has_protection_service_business_type has_resolution_incident \n",
"0 True False \n",
"1 True False \n",
"2 True False \n",
"3 True False \n",
"4 True False "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
  ],
  "source": [
   "# Lift the pandas display limits so that all columns and rows are shown\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_rows', None)\n",
"\n",
"# Preview of the dataset\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "de4c6753",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 21307 entries, 0 to 21306\n",
"Data columns (total 63 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 days_from_booking_creation_to_check_in 21307 non-null float64\n",
" 1 number_of_nights 21307 non-null float64\n",
" 2 host_town 21281 non-null object \n",
" 3 host_country 21300 non-null object \n",
" 4 host_postcode 15800 non-null object \n",
" 5 host_age 21307 non-null float64\n",
" 6 host_months_with_truvi 21307 non-null float64\n",
" 7 host_account_type 17831 non-null object \n",
" 8 host_active_pms_list 20363 non-null object \n",
" 9 number_of_listings_of_host 21307 non-null int64 \n",
" 10 number_of_previous_incidents_of_host 21307 non-null int64 \n",
" 11 number_of_previous_payouts_of_host 21307 non-null int64 \n",
" 12 guest_town 11676 non-null object \n",
" 13 guest_country 11677 non-null object \n",
" 14 guest_postcode 11676 non-null object \n",
" 15 guest_age 11677 non-null float64\n",
" 16 number_of_previous_bookings_of_guest 21307 non-null int64 \n",
" 17 number_of_previous_incidents_of_guest 21307 non-null int64 \n",
" 18 has_guest_previously_booked_same_listing 21307 non-null bool \n",
" 19 listing_address 21307 non-null object \n",
" 20 listing_town 21307 non-null object \n",
" 21 listing_country 21307 non-null object \n",
" 22 listing_postcode 21307 non-null object \n",
" 23 listing_number_of_bedrooms 21185 non-null float64\n",
" 24 listing_number_of_bathrooms 21185 non-null float64\n",
" 25 listing_description 21294 non-null object \n",
" 26 previous_bookings_in_listing_count 21307 non-null int64 \n",
" 27 number_of_previous_incidents_in_listing 21307 non-null int64 \n",
" 28 number_of_previous_payouts_in_listing 21307 non-null int64 \n",
" 29 days_to_start_verification 20084 non-null float64\n",
" 30 days_to_complete_verification 18500 non-null float64\n",
" 31 screening_status 9332 non-null float64\n",
" 32 government_id_status 8082 non-null float64\n",
" 33 contract_status 5856 non-null float64\n",
" 34 selfie_confidence_score_status 6622 non-null float64\n",
" 35 payment_validation_status 8047 non-null float64\n",
" 36 first_name_status 4810 non-null float64\n",
" 37 date_of_birth_status 4810 non-null float64\n",
" 38 last_name_status 4810 non-null float64\n",
" 39 autohost_partner_status 0 non-null float64\n",
" 40 criminal_record_status 2075 non-null float64\n",
" 41 guest_csat_score 3221 non-null float64\n",
" 42 guest_csat_comments 454 non-null object \n",
" 43 guest_has_email 21307 non-null bool \n",
" 44 guest_has_phone_number 21307 non-null bool \n",
" 45 is_guest_from_listing_town 11677 non-null object \n",
" 46 is_guest_from_listing_country 11677 non-null object \n",
" 47 is_guest_from_listing_postcode 11677 non-null object \n",
" 48 is_host_from_listing_town 21307 non-null bool \n",
" 49 is_host_from_listing_country 21300 non-null object \n",
" 50 is_host_from_listing_postcode 18102 non-null object \n",
" 51 has_completed_verification 21307 non-null bool \n",
" 52 number_of_applied_services 21307 non-null int64 \n",
" 53 number_of_applied_upgraded_services 21307 non-null int64 \n",
" 54 number_of_applied_billable_services 21307 non-null int64 \n",
" 55 booking_days_to_check_in 21307 non-null int64 \n",
" 56 booking_number_of_nights 21307 non-null int64 \n",
" 57 has_verification_request 21307 non-null bool \n",
" 58 has_billable_services 21307 non-null bool \n",
" 59 has_upgraded_screening_service_business_type 21307 non-null bool \n",
" 60 has_deposit_management_service_business_type 21307 non-null bool \n",
" 61 has_protection_service_business_type 21307 non-null bool \n",
" 62 has_resolution_incident 21307 non-null bool \n",
"dtypes: bool(11), float64(20), int64(13), object(19)\n",
"memory usage: 8.7+ MB\n"
]
}
],
"source": [
"# View summary of dataset\n",
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "9c79c06a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Missing Values (%):\n",
"autohost_partner_status 100.000000\n",
"guest_csat_comments 97.869245\n",
"criminal_record_status 90.261416\n",
"guest_csat_score 84.882902\n",
"date_of_birth_status 77.425259\n",
"last_name_status 77.425259\n",
"first_name_status 77.425259\n",
"contract_status 72.516075\n",
"selfie_confidence_score_status 68.921012\n",
"payment_validation_status 62.233069\n",
"government_id_status 62.068804\n",
"screening_status 56.202187\n",
"guest_postcode 45.201108\n",
"guest_town 45.201108\n",
"guest_country 45.196414\n",
"is_guest_from_listing_country 45.196414\n",
"is_guest_from_listing_postcode 45.196414\n",
"guest_age 45.196414\n",
"is_guest_from_listing_town 45.196414\n",
"host_postcode 25.845966\n",
"host_account_type 16.313887\n",
"is_host_from_listing_postcode 15.042005\n",
"days_to_complete_verification 13.174074\n",
"days_to_start_verification 5.739898\n",
"host_active_pms_list 4.430469\n",
"listing_number_of_bedrooms 0.572582\n",
"listing_number_of_bathrooms 0.572582\n",
"host_town 0.122026\n",
"listing_description 0.061013\n",
"host_country 0.032853\n",
"is_host_from_listing_country 0.032853\n",
"dtype: float64\n"
]
}
],
"source": [
"# View percentage of missing values\n",
"missing_values = df.isnull().mean() * 100\n",
"missing_values = missing_values[missing_values > 0].sort_values(ascending=False)\n",
"print(\"Missing Values (%):\")\n",
"print(missing_values)"
]
},
 {
  "cell_type": "markdown",
  "id": "1837c541",
  "metadata": {},
  "source": [
   "Despite the small amount of CSAT data (about 85% of bookings have no score), I want to check whether the score shows any interesting correlation with incidents."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "6e89712c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"guest_csat_score\n",
"1.0 0.010695\n",
"2.0 0.013761\n",
"3.0 0.018293\n",
"4.0 0.013105\n",
"5.0 0.022619\n",
"Name: has_resolution_incident, dtype: float64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
  ],
  "source": [
   "# Incident rate per CSAT score (mean of the boolean incident flag within each score)\n",
   "df.groupby('guest_csat_score')['has_resolution_incident'].mean()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "ce9ed8a0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Correlation: 0.02\n"
]
}
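  ],
  "source": [
   "# Pearson correlation between CSAT score and the incident flag; rows without a score are ignored\n",
   "correlation = df['guest_csat_score'].corr(df['has_resolution_incident'])\n",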
"print(f\"Correlation: {correlation:.2f}\")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "8ac447bb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dropping columns with more than 50% missing values: ['autohost_partner_status', 'guest_csat_comments', 'criminal_record_status', 'guest_csat_score', 'date_of_birth_status', 'last_name_status', 'first_name_status', 'contract_status', 'selfie_confidence_score_status', 'payment_validation_status', 'government_id_status', 'screening_status']\n"
]
}
],
"source": [
"# Remove columns with more than 50% missing values\n",
"threshold = 50\n",
"columns_to_drop = missing_values[missing_values > threshold].index\n",
"print(f\"Dropping columns with more than {threshold}% missing values: {columns_to_drop.tolist()}\")\n",
"df.drop(columns=columns_to_drop, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "20bd5c86",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 18 categorical variables\n",
"\n",
"The categorical variables are: ['host_town', 'host_country', 'host_postcode', 'host_account_type', 'host_active_pms_list', 'guest_town', 'guest_country', 'guest_postcode', 'listing_address', 'listing_town', 'listing_country', 'listing_postcode', 'listing_description', 'is_guest_from_listing_town', 'is_guest_from_listing_country', 'is_guest_from_listing_postcode', 'is_host_from_listing_country', 'is_host_from_listing_postcode']\n"
]
}
],
"source": [
"# Find categorical variables\n",
"categorical = df.select_dtypes(include=['object']).columns.tolist()\n",
"print(f'There are {len(categorical)} categorical variables\\n')\n",
"print('The categorical variables are:', categorical)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "67ddd437",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>host_town</th>\n",
" <th>host_country</th>\n",
" <th>host_postcode</th>\n",
" <th>host_account_type</th>\n",
" <th>host_active_pms_list</th>\n",
" <th>guest_town</th>\n",
" <th>guest_country</th>\n",
" <th>guest_postcode</th>\n",
" <th>listing_address</th>\n",
" <th>listing_town</th>\n",
" <th>listing_country</th>\n",
" <th>listing_postcode</th>\n",
" <th>listing_description</th>\n",
" <th>is_guest_from_listing_town</th>\n",
" <th>is_guest_from_listing_country</th>\n",
" <th>is_guest_from_listing_postcode</th>\n",
" <th>is_host_from_listing_country</th>\n",
" <th>is_host_from_listing_postcode</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Madison CT</td>\n",
" <td>United States</td>\n",
" <td>06443</td>\n",
" <td>Host</td>\n",
" <td>Hostaway</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>389 Mountain View Dr, Jeffersonville, VT 05464...</td>\n",
" <td>Cambridge</td>\n",
" <td>United States</td>\n",
" <td>05464</td>\n",
" <td>Mountain Life Retreat at Smuggler's Notch Resort</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Madison CT</td>\n",
" <td>United States</td>\n",
" <td>06443</td>\n",
" <td>Host</td>\n",
" <td>Hostaway</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>389 Mountain View Dr, Jeffersonville, VT 05464...</td>\n",
" <td>Cambridge</td>\n",
" <td>United States</td>\n",
" <td>05464</td>\n",
" <td>Mountain Life Retreat at Smuggler's Notch Resort</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" <td>N16 6DD</td>\n",
" <td>PMC - Property Management Company</td>\n",
" <td>Hostify</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Tudor Grange Hotel, 31 Gervis Road</td>\n",
" <td>Dorset</td>\n",
" <td>United Kingdom</td>\n",
" <td>BH1 3EE</td>\n",
" <td>Mansion by the Sea, 12BR/12BA, Perfect for Events</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" <td>N16 6DD</td>\n",
" <td>PMC - Property Management Company</td>\n",
" <td>Hostify</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Tudor Grange Hotel, 31 Gervis Road</td>\n",
" <td>Dorset</td>\n",
" <td>United Kingdom</td>\n",
" <td>BH1 3EE</td>\n",
" <td>Mansion by the Sea, 12BR/12BA, Perfect for Events</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" <td>N16 6DD</td>\n",
" <td>PMC - Property Management Company</td>\n",
" <td>Hostify</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Aird House, 15 Wellesley Ct, Rockingham Street</td>\n",
" <td>Greater London</td>\n",
" <td>United Kingdom</td>\n",
" <td>SE1 6PD</td>\n",
" <td>Your London Home: 2BR Flat with Modern Amenities</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" host_town host_country host_postcode \\\n",
"0 Madison CT United States 06443 \n",
"1 Madison CT United States 06443 \n",
"2 London United Kingdom N16 6DD \n",
"3 London United Kingdom N16 6DD \n",
"4 London United Kingdom N16 6DD \n",
"\n",
" host_account_type host_active_pms_list guest_town \\\n",
"0 Host Hostaway NaN \n",
"1 Host Hostaway NaN \n",
"2 PMC - Property Management Company Hostify NaN \n",
"3 PMC - Property Management Company Hostify NaN \n",
"4 PMC - Property Management Company Hostify NaN \n",
"\n",
" guest_country guest_postcode \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" listing_address listing_town \\\n",
"0 389 Mountain View Dr, Jeffersonville, VT 05464... Cambridge \n",
"1 389 Mountain View Dr, Jeffersonville, VT 05464... Cambridge \n",
"2 Tudor Grange Hotel, 31 Gervis Road Dorset \n",
"3 Tudor Grange Hotel, 31 Gervis Road Dorset \n",
"4 Aird House, 15 Wellesley Ct, Rockingham Street Greater London \n",
"\n",
" listing_country listing_postcode \\\n",
"0 United States 05464 \n",
"1 United States 05464 \n",
"2 United Kingdom BH1 3EE \n",
"3 United Kingdom BH1 3EE \n",
"4 United Kingdom SE1 6PD \n",
"\n",
" listing_description \\\n",
"0 Mountain Life Retreat at Smuggler's Notch Resort \n",
"1 Mountain Life Retreat at Smuggler's Notch Resort \n",
"2 Mansion by the Sea, 12BR/12BA, Perfect for Events \n",
"3 Mansion by the Sea, 12BR/12BA, Perfect for Events \n",
"4 Your London Home: 2BR Flat with Modern Amenities \n",
"\n",
" is_guest_from_listing_town is_guest_from_listing_country \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" is_guest_from_listing_postcode is_host_from_listing_country \\\n",
"0 NaN True \n",
"1 NaN True \n",
"2 NaN True \n",
"3 NaN True \n",
"4 NaN True \n",
"\n",
" is_host_from_listing_postcode \n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
  ],
  "source": [
   "# Preview the categorical variables\n",
"df[categorical].head()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "841347ea",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"host_town 26\n",
"host_country 7\n",
"host_postcode 5507\n",
"host_account_type 3476\n",
"host_active_pms_list 944\n",
"guest_town 9631\n",
"guest_country 9630\n",
"guest_postcode 9631\n",
"listing_address 0\n",
"listing_town 0\n",
"listing_country 0\n",
"listing_postcode 0\n",
"listing_description 13\n",
"is_guest_from_listing_town 9630\n",
"is_guest_from_listing_country 9630\n",
"is_guest_from_listing_postcode 9630\n",
"is_host_from_listing_country 7\n",
"is_host_from_listing_postcode 3205\n",
"dtype: int64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check missing values in categorical variables\n",
"df[categorical].isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "a58cd17e",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_48568/2855830200.py:2: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" df['is_guest_from_listing_town'] = df['is_guest_from_listing_town'].fillna(False)\n",
"/tmp/ipykernel_48568/2855830200.py:3: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" df['is_guest_from_listing_country'] = df['is_guest_from_listing_country'].fillna(False)\n",
"/tmp/ipykernel_48568/2855830200.py:4: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" df['is_guest_from_listing_postcode'] = df['is_guest_from_listing_postcode'].fillna(False)\n",
"/tmp/ipykernel_48568/2855830200.py:6: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" df['is_host_from_listing_country'] = df['is_host_from_listing_country'].fillna(False)\n",
"/tmp/ipykernel_48568/2855830200.py:7: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" df['is_host_from_listing_postcode'] = df['is_host_from_listing_postcode'].fillna(False)\n"
]
}
  ],
  "source": [
   "# Fill missing location-match flags (guest or host vs. listing) with False, i.e. treat an unknown location as not matching\n",
"df['is_guest_from_listing_town'] = df['is_guest_from_listing_town'].fillna(False)\n",
"df['is_guest_from_listing_country'] = df['is_guest_from_listing_country'].fillna(False)\n",
"df['is_guest_from_listing_postcode'] = df['is_guest_from_listing_postcode'].fillna(False)\n",
"df['is_host_from_listing_town'] = df['is_host_from_listing_town'].fillna(False)\n",
"df['is_host_from_listing_country'] = df['is_host_from_listing_country'].fillna(False)\n",
"df['is_host_from_listing_postcode'] = df['is_host_from_listing_postcode'].fillna(False)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "e5aefb50",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"host_town 26\n",
"host_country 7\n",
"host_postcode 5507\n",
"host_account_type 3476\n",
"host_active_pms_list 944\n",
"guest_town 9631\n",
"guest_country 9630\n",
"guest_postcode 9631\n",
"listing_address 0\n",
"listing_town 0\n",
"listing_country 0\n",
"listing_postcode 0\n",
"listing_description 13\n",
"is_guest_from_listing_town 0\n",
"is_guest_from_listing_country 0\n",
"is_guest_from_listing_postcode 0\n",
"is_host_from_listing_country 0\n",
"is_host_from_listing_postcode 0\n",
"dtype: int64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
  ],
  "source": [
   "# Re-check missing values in categorical variables after the fill\n",
"df[categorical].isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "292eaad2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Unique values in 'host_account_type':\n",
"host_account_type\n",
"PMC - Property Management Company 12719\n",
"Host 5112\n",
"Name: count, dtype: int64 \n",
"\n",
"Unique values in 'host_active_pms_list':\n",
"host_active_pms_list\n",
"Hostify 6468\n",
"Hostaway 3675\n",
"Guesty 3108\n",
"Hospitable 2739\n",
"Hostfully 1905\n",
"Lodgify 1341\n",
"OwnerRez 649\n",
"Avantio 248\n",
"TrackHs 142\n",
"Uplisting 61\n",
"Hospitable Connect 15\n",
"Smoobu 12\n",
"Name: count, dtype: int64 \n",
"\n",
"Unique values in 'host_country':\n",
"host_country\n",
"United States 10962\n",
"United Kingdom 6707\n",
"Canada 2007\n",
"Australia 305\n",
"Mexico 273\n",
"New Zealand 154\n",
"Sweden 122\n",
"Norway 117\n",
"Bulgaria 117\n",
"Portugal 87\n",
"South Africa 78\n",
"Costa Rica 75\n",
"Puerto Rico 50\n",
"Belgium 50\n",
"Italy 35\n",
"Barbados 34\n",
"Spain 31\n",
"France 26\n",
"Jamaica 20\n",
"Egypt 19\n",
"Switzerland 10\n",
"Isle of Man 8\n",
"Bahamas 3\n",
"Guernsey 3\n",
"United Arab Emirates 2\n",
"Colombia 2\n",
"Germany 1\n",
"Greece 1\n",
"Hungary 1\n",
"Name: count, dtype: int64 \n",
"\n",
"Unique values in 'guest_country':\n",
"guest_country\n",
"United States 7409\n",
"Canada 1458\n",
"United Kingdom 1175\n",
"Australia 287\n",
"Colombia 151\n",
"Mexico 134\n",
"Germany 100\n",
"Ireland 77\n",
"New Zealand 70\n",
"France 56\n",
"Spain 53\n",
"Costa Rica 43\n",
"Netherlands 37\n",
"Brazil 36\n",
"Switzerland 34\n",
"Puerto Rico 31\n",
"Italy 29\n",
"Argentina 23\n",
"Singapore 23\n",
"China 21\n",
"Belgium 20\n",
"Ecuador 20\n",
"India 20\n",
"United Arab Emirates 20\n",
"Panama 19\n",
"Poland 17\n",
"Dominican Republic 15\n",
"Israel 14\n",
"Saudi Arabia 13\n",
"South Africa 12\n",
"Romania 11\n",
"Malaysia 11\n",
"El Salvador 10\n",
"Chile 9\n",
"Norway 9\n",
"Japan 9\n",
"Portugal 9\n",
"Sweden 8\n",
"Hong Kong 8\n",
"Austria 8\n",
"South Korea 8\n",
"United States Minor Outlying Islands 8\n",
"Finland 8\n",
"Philippines 7\n",
"Czech Republic 7\n",
"Guatemala 7\n",
"Hungary 6\n",
"Venezuela 6\n",
"Denmark 6\n",
"Honduras 6\n",
"Jamaica 5\n",
"Thailand 5\n",
"Peru 5\n",
"Taiwan 5\n",
"Russian Federation 5\n",
"French Polynesia 4\n",
"Turkey 4\n",
"Kazakhstan 4\n",
"Curacao 4\n",
"Martinique 3\n",
"Cayman Islands 3\n",
"Saint Pierre and Miquelon 3\n",
"Slovenia 3\n",
"Estonia 3\n",
"Iceland 3\n",
"Georgia 3\n",
"Indonesia 2\n",
"Qatar 2\n",
"Greece 2\n",
"Egypt 2\n",
"Latvia 2\n",
"Pakistan 2\n",
"Barbados 2\n",
"Bolivia 2\n",
"Aruba 2\n",
"Malta 2\n",
"Suriname 1\n",
"Lebanon 1\n",
"Nauru 1\n",
"Fiji 1\n",
"Cook Islands 1\n",
"Bahamas 1\n",
"Albania 1\n",
"Uruguay 1\n",
"Jersey 1\n",
"Croatia 1\n",
"Bulgaria 1\n",
"Belize 1\n",
"Nicaragua 1\n",
"DR Congo 1\n",
"Kuwait 1\n",
"Niger 1\n",
"Cyprus 1\n",
"Name: count, dtype: int64 \n",
"\n",
"Unique values in 'listing_country':\n",
"listing_country\n",
"United States 10067\n",
"United Kingdom 6574\n",
"Canada 1870\n",
"Colombia 599\n",
"Australia 305\n",
"Mexico 303\n",
"Ireland 168\n",
"New Zealand 153\n",
"Virgin Islands, U.s. 130\n",
"Bahamas 130\n",
"Norway 125\n",
"Sweden 122\n",
"Bulgaria 117\n",
"Costa Rica 108\n",
"Portugal 87\n",
"South Africa 83\n",
"Puerto Rico 50\n",
"Belgium 48\n",
"France 46\n",
"Italy 44\n",
"Spain 36\n",
"Barbados 34\n",
"Morocco 25\n",
"Jamaica 20\n",
"Egypt 19\n",
"Saint Lucia 10\n",
"Germany 10\n",
"Sint Maarten 9\n",
"Isle of Man 8\n",
"United Arab Emirates 2\n",
"Lithuania 2\n",
"Antigua and Barbuda 1\n",
"Greece 1\n",
"Hungary 1\n",
"Name: count, dtype: int64 \n",
"\n"
]
}
],
"source": [
"# Check unique values in host_account_type, host_active_pms_list, host_country and guest_country with their counts\n",
"print(\"Unique values in 'host_account_type':\")\n",
"print(df['host_account_type'].value_counts(), \"\\n\")\n",
"print(\"Unique values in 'host_active_pms_list':\")\n",
"print(df['host_active_pms_list'].value_counts(), \"\\n\")\n",
"print(\"Unique values in 'host_country':\")\n",
"print(df['host_country'].value_counts(), \"\\n\")\n",
"print(\"Unique values in 'guest_country':\")\n",
"print(df['guest_country'].value_counts(), \"\\n\")\n",
"print(\"Unique values in 'listing_country':\")\n",
"print(df['listing_country'].value_counts(), \"\\n\")"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "7289f9fd",
"metadata": {},
"outputs": [],
"source": [
"# Due to the many unique values in host_country, guest_country and listing_country, we will only keep the top 10 most frequent values and set the rest to 'Other'\n",
"top_host_countries = df['host_country'].value_counts().nlargest(10).index\n",
"top_guest_countries = df['guest_country'].value_counts().nlargest(10).index\n",
"top_listing_countries = df['listing_country'].value_counts().nlargest(10).index\n",
"\n",
"df['host_country'] = df['host_country'].where(df['host_country'].isin(top_host_countries), 'Other')\n",
"df['guest_country'] = df['guest_country'].where(df['guest_country'].isin(top_guest_countries), 'Other')\n",
"df['listing_country'] = df['listing_country'].where(df['listing_country'].isin(top_listing_countries), 'Other')"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "7348866c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"New columns created from one-hot encoding: ['host_account_type_Host', 'host_account_type_PMC - Property Management Company', 'host_active_pms_list_Avantio', 'host_active_pms_list_Guesty', 'host_active_pms_list_Hospitable', 'host_active_pms_list_Hospitable Connect', 'host_active_pms_list_Hostaway', 'host_active_pms_list_Hostfully', 'host_active_pms_list_Hostify', 'host_active_pms_list_Lodgify', 'host_active_pms_list_OwnerRez', 'host_active_pms_list_Smoobu', 'host_active_pms_list_TrackHs', 'host_active_pms_list_Uplisting', 'host_country_Australia', 'host_country_Bulgaria', 'host_country_Canada', 'host_country_Mexico', 'host_country_New Zealand', 'host_country_Norway', 'host_country_Other', 'host_country_Portugal', 'host_country_Sweden', 'host_country_United Kingdom', 'host_country_United States', 'guest_country_Australia', 'guest_country_Canada', 'guest_country_Colombia', 'guest_country_France', 'guest_country_Germany', 'guest_country_Ireland', 'guest_country_Mexico', 'guest_country_New Zealand', 'guest_country_Other', 'guest_country_United Kingdom', 'guest_country_United States', 'listing_country_Australia', 'listing_country_Bahamas', 'listing_country_Canada', 'listing_country_Colombia', 'listing_country_Ireland', 'listing_country_Mexico', 'listing_country_New Zealand', 'listing_country_Other', 'listing_country_United Kingdom', 'listing_country_United States', 'listing_country_Virgin Islands, U.s.']\n"
]
}
],
"source": [
"# Lets one hot encode host_account_type, host_active_pms_list, host_country, guest_country and listing_country\n",
"df = pd.get_dummies(df, columns=['host_account_type', 'host_active_pms_list', 'host_country', 'guest_country', 'listing_country'], drop_first=False)\n",
"# Check the new columns created\n",
"new_columns = df.columns[df.columns.str.startswith(('host_account_type_', 'host_active_pms_list_', 'host_country', 'guest_country', 'listing_country'))]\n",
"print(f\"New columns created from one-hot encoding: {new_columns.tolist()}\")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "b443ccf4",
"metadata": {},
"outputs": [],
"source": [
"# Drop the original categorical columns and the ones we are not going to use like postcodes and towns\n",
"df.drop(columns=['host_postcode', 'guest_postcode', 'listing_postcode', 'listing_town', 'host_town', 'guest_town', 'listing_description', 'listing_address'], inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "a31ae1fd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 22 numerical variables\n",
"\n",
"The numerical variables are : ['days_from_booking_creation_to_check_in', 'number_of_nights', 'host_age', 'host_months_with_truvi', 'number_of_listings_of_host', 'number_of_previous_incidents_of_host', 'number_of_previous_payouts_of_host', 'guest_age', 'number_of_previous_bookings_of_guest', 'number_of_previous_incidents_of_guest', 'listing_number_of_bedrooms', 'listing_number_of_bathrooms', 'previous_bookings_in_listing_count', 'number_of_previous_incidents_in_listing', 'number_of_previous_payouts_in_listing', 'days_to_start_verification', 'days_to_complete_verification', 'number_of_applied_services', 'number_of_applied_upgraded_services', 'number_of_applied_billable_services', 'booking_days_to_check_in', 'booking_number_of_nights']\n"
]
}
],
"source": [
"# Find numerical variables\n",
"numerical = df.select_dtypes(include=[np.number]).columns.tolist()\n",
"print('There are {} numerical variables\\n'.format(len(numerical)))\n",
"print('The numerical variables are :', numerical)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "cf795d45",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Summary statistics of numerical variables:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>days_from_booking_creation_to_check_in</th>\n",
" <th>number_of_nights</th>\n",
" <th>host_age</th>\n",
" <th>host_months_with_truvi</th>\n",
" <th>number_of_listings_of_host</th>\n",
" <th>number_of_previous_incidents_of_host</th>\n",
" <th>number_of_previous_payouts_of_host</th>\n",
" <th>guest_age</th>\n",
" <th>number_of_previous_bookings_of_guest</th>\n",
" <th>number_of_previous_incidents_of_guest</th>\n",
" <th>listing_number_of_bedrooms</th>\n",
" <th>listing_number_of_bathrooms</th>\n",
" <th>previous_bookings_in_listing_count</th>\n",
" <th>number_of_previous_incidents_in_listing</th>\n",
" <th>number_of_previous_payouts_in_listing</th>\n",
" <th>days_to_start_verification</th>\n",
" <th>days_to_complete_verification</th>\n",
" <th>number_of_applied_services</th>\n",
" <th>number_of_applied_upgraded_services</th>\n",
" <th>number_of_applied_billable_services</th>\n",
" <th>booking_days_to_check_in</th>\n",
" <th>booking_number_of_nights</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>21307.000000</td>\n",
" <td>21307.000000</td>\n",
" <td>21307.000000</td>\n",
" <td>21307.000000</td>\n",
" <td>21307.000000</td>\n",
" <td>21307.000000</td>\n",
" <td>21307.000000</td>\n",
" <td>11677.000000</td>\n",
" <td>21307.000000</td>\n",
" <td>21307.0</td>\n",
" <td>21185.000000</td>\n",
" <td>21185.000000</td>\n",
" <td>21307.000000</td>\n",
" <td>21307.000000</td>\n",
" <td>21307.000000</td>\n",
" <td>20084.000000</td>\n",
" <td>18500.000000</td>\n",
" <td>21307.000000</td>\n",
" <td>21307.000000</td>\n",
" <td>21307.000000</td>\n",
" <td>21307.000000</td>\n",
" <td>21307.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>8.740038</td>\n",
" <td>3.876801</td>\n",
" <td>96.533017</td>\n",
" <td>5.482142</td>\n",
" <td>152.875815</td>\n",
" <td>2.718496</td>\n",
" <td>0.751302</td>\n",
" <td>42.317890</td>\n",
" <td>2175.999812</td>\n",
" <td>0.0</td>\n",
" <td>2.052962</td>\n",
" <td>1.601841</td>\n",
" <td>6.215094</td>\n",
" <td>0.123387</td>\n",
" <td>0.043507</td>\n",
" <td>0.996764</td>\n",
" <td>0.713135</td>\n",
" <td>3.721594</td>\n",
" <td>2.721688</td>\n",
" <td>1.865209</td>\n",
" <td>17.592247</td>\n",
" <td>4.144507</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>8.389242</td>\n",
" <td>3.335615</td>\n",
" <td>43.616341</td>\n",
" <td>2.714314</td>\n",
" <td>179.028829</td>\n",
" <td>5.582857</td>\n",
" <td>2.957053</td>\n",
" <td>13.212509</td>\n",
" <td>3038.837496</td>\n",
" <td>0.0</td>\n",
" <td>1.745281</td>\n",
" <td>1.297739</td>\n",
" <td>6.727896</td>\n",
" <td>0.537464</td>\n",
" <td>0.270994</td>\n",
" <td>3.423303</td>\n",
" <td>2.768474</td>\n",
" <td>1.553612</td>\n",
" <td>1.553629</td>\n",
" <td>0.949857</td>\n",
" <td>23.572901</td>\n",
" <td>4.799364</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>-20.000000</td>\n",
" <td>0.000000</td>\n",
" <td>19.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>18.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-48.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>1.000000</td>\n",
" <td>2.000000</td>\n",
" <td>39.000000</td>\n",
" <td>4.000000</td>\n",
" <td>9.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>32.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>2.000000</td>\n",
" <td>2.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>6.000000</td>\n",
" <td>3.000000</td>\n",
" <td>125.000000</td>\n",
" <td>5.000000</td>\n",
" <td>72.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>41.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>2.000000</td>\n",
" <td>1.000000</td>\n",
" <td>4.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>4.000000</td>\n",
" <td>3.000000</td>\n",
" <td>2.000000</td>\n",
" <td>8.000000</td>\n",
" <td>3.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>15.000000</td>\n",
" <td>4.000000</td>\n",
" <td>125.000000</td>\n",
" <td>8.000000</td>\n",
" <td>247.000000</td>\n",
" <td>3.000000</td>\n",
" <td>1.000000</td>\n",
" <td>51.000000</td>\n",
" <td>4302.500000</td>\n",
" <td>0.0</td>\n",
" <td>3.000000</td>\n",
" <td>2.000000</td>\n",
" <td>9.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>5.000000</td>\n",
" <td>4.000000</td>\n",
" <td>3.000000</td>\n",
" <td>24.000000</td>\n",
" <td>5.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>30.000000</td>\n",
" <td>30.000000</td>\n",
" <td>125.000000</td>\n",
" <td>11.000000</td>\n",
" <td>467.000000</td>\n",
" <td>85.000000</td>\n",
" <td>62.000000</td>\n",
" <td>89.000000</td>\n",
" <td>9629.000000</td>\n",
" <td>0.0</td>\n",
" <td>15.000000</td>\n",
" <td>17.000000</td>\n",
" <td>41.000000</td>\n",
" <td>9.000000</td>\n",
" <td>6.000000</td>\n",
" <td>30.000000</td>\n",
" <td>30.000000</td>\n",
" <td>8.000000</td>\n",
" <td>7.000000</td>\n",
" <td>5.000000</td>\n",
" <td>218.000000</td>\n",
" <td>116.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" days_from_booking_creation_to_check_in number_of_nights host_age \\\n",
"count 21307.000000 21307.000000 21307.000000 \n",
"mean 8.740038 3.876801 96.533017 \n",
"std 8.389242 3.335615 43.616341 \n",
"min -20.000000 0.000000 19.000000 \n",
"25% 1.000000 2.000000 39.000000 \n",
"50% 6.000000 3.000000 125.000000 \n",
"75% 15.000000 4.000000 125.000000 \n",
"max 30.000000 30.000000 125.000000 \n",
"\n",
" host_months_with_truvi number_of_listings_of_host \\\n",
"count 21307.000000 21307.000000 \n",
"mean 5.482142 152.875815 \n",
"std 2.714314 179.028829 \n",
"min 0.000000 0.000000 \n",
"25% 4.000000 9.000000 \n",
"50% 5.000000 72.000000 \n",
"75% 8.000000 247.000000 \n",
"max 11.000000 467.000000 \n",
"\n",
" number_of_previous_incidents_of_host \\\n",
"count 21307.000000 \n",
"mean 2.718496 \n",
"std 5.582857 \n",
"min 0.000000 \n",
"25% 0.000000 \n",
"50% 1.000000 \n",
"75% 3.000000 \n",
"max 85.000000 \n",
"\n",
" number_of_previous_payouts_of_host guest_age \\\n",
"count 21307.000000 11677.000000 \n",
"mean 0.751302 42.317890 \n",
"std 2.957053 13.212509 \n",
"min 0.000000 18.000000 \n",
"25% 0.000000 32.000000 \n",
"50% 0.000000 41.000000 \n",
"75% 1.000000 51.000000 \n",
"max 62.000000 89.000000 \n",
"\n",
" number_of_previous_bookings_of_guest \\\n",
"count 21307.000000 \n",
"mean 2175.999812 \n",
"std 3038.837496 \n",
"min 0.000000 \n",
"25% 0.000000 \n",
"50% 0.000000 \n",
"75% 4302.500000 \n",
"max 9629.000000 \n",
"\n",
" number_of_previous_incidents_of_guest listing_number_of_bedrooms \\\n",
"count 21307.0 21185.000000 \n",
"mean 0.0 2.052962 \n",
"std 0.0 1.745281 \n",
"min 0.0 0.000000 \n",
"25% 0.0 1.000000 \n",
"50% 0.0 2.000000 \n",
"75% 0.0 3.000000 \n",
"max 0.0 15.000000 \n",
"\n",
" listing_number_of_bathrooms previous_bookings_in_listing_count \\\n",
"count 21185.000000 21307.000000 \n",
"mean 1.601841 6.215094 \n",
"std 1.297739 6.727896 \n",
"min 0.000000 0.000000 \n",
"25% 1.000000 1.000000 \n",
"50% 1.000000 4.000000 \n",
"75% 2.000000 9.000000 \n",
"max 17.000000 41.000000 \n",
"\n",
" number_of_previous_incidents_in_listing \\\n",
"count 21307.000000 \n",
"mean 0.123387 \n",
"std 0.537464 \n",
"min 0.000000 \n",
"25% 0.000000 \n",
"50% 0.000000 \n",
"75% 0.000000 \n",
"max 9.000000 \n",
"\n",
" number_of_previous_payouts_in_listing days_to_start_verification \\\n",
"count 21307.000000 20084.000000 \n",
"mean 0.043507 0.996764 \n",
"std 0.270994 3.423303 \n",
"min 0.000000 0.000000 \n",
"25% 0.000000 0.000000 \n",
"50% 0.000000 0.000000 \n",
"75% 0.000000 0.000000 \n",
"max 6.000000 30.000000 \n",
"\n",
" days_to_complete_verification number_of_applied_services \\\n",
"count 18500.000000 21307.000000 \n",
"mean 0.713135 3.721594 \n",
"std 2.768474 1.553612 \n",
"min 0.000000 2.000000 \n",
"25% 0.000000 2.000000 \n",
"50% 0.000000 4.000000 \n",
"75% 0.000000 5.000000 \n",
"max 30.000000 8.000000 \n",
"\n",
" number_of_applied_upgraded_services \\\n",
"count 21307.000000 \n",
"mean 2.721688 \n",
"std 1.553629 \n",
"min 1.000000 \n",
"25% 1.000000 \n",
"50% 3.000000 \n",
"75% 4.000000 \n",
"max 7.000000 \n",
"\n",
" number_of_applied_billable_services booking_days_to_check_in \\\n",
"count 21307.000000 21307.000000 \n",
"mean 1.865209 17.592247 \n",
"std 0.949857 23.572901 \n",
"min 0.000000 -48.000000 \n",
"25% 1.000000 2.000000 \n",
"50% 2.000000 8.000000 \n",
"75% 3.000000 24.000000 \n",
"max 5.000000 218.000000 \n",
"\n",
" booking_number_of_nights \n",
"count 21307.000000 \n",
"mean 4.144507 \n",
"std 4.799364 \n",
"min 0.000000 \n",
"25% 2.000000 \n",
"50% 3.000000 \n",
"75% 5.000000 \n",
"max 116.000000 "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# View summary statistics of numerical variables\n",
"print(\"\\nSummary statistics of numerical variables:\")\n",
"df[numerical].describe()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "2cf714c9",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABdoAAAx2CAYAAAAYNEt4AAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjEsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvc2/+5QAAAAlwSFlzAAAPYQAAD2EBqD+naQABAABJREFUeJzs3XlcVdX+//E3yigICApIKnLVVBQ1savkkClJipZJg7MpZhpaDqnXMnOovFnOWuStxApvabdRTUVNzURTknLKrEwsBcMBwgFQ9u+PfuyvR5BBOEy+no/HedRZ67PXXuuc41l7f9hnbRvDMAwBAAAAAAAAAICbUqWsOwAAAAAAAAAAQEVGoh0AAAAAAAAAgGIg0Q4AAAAAAAAAQDGQaAcAAAAAAAAAoBhItAMAAAAAAAAAUAwk2gEAAAAAAAAAKAYS7QAAAAAAAAAAFAOJdgAAAAAAAAAAioFEOwAAAAAAAAAAxUCiXdL06dNlY2NTKvvq3LmzOnfubD7funWrbGxs9NFHH5XK/h977DHVr1+/VPZ1s9LT0zV8+HD5+PjIxsZGY8eOLXIbNjY2mj59eon3raQdPXpU3bp1k5ubm2xsbPTpp5+WdZcKpbQ/t9fK+feakpKSb1xF+KyXJ7/99ptsbGwUHR1d1l0pFwr7OStpjz32mFxcXIrdTnR0tGxsbPTbb78Vv1O4pXGMVL6UxDFSWercubOaN29e1t24KevXr1erVq3k6OgoGxsbnT9/vkTbv/7zX9RtK+rrCqBy4bihfCnKcYONjY1Gjx5dep0DKqlKl2jPSS7kPBwdHeXr66vQ0FAtWrRIf/31V4ns5+TJk5o+fboSEhJKpL2SVJ77Vhgvv/yyoqOjNWrUKL333nsaNGhQWXfJaoYMGaL9+/frpZde0nvvvac2bdqUdZdQya1cuVILFiwo627c0Lp16yrEH8mAiohjpPLdt8K4lY6RypMzZ87okUcekZOTk5YuXar33ntPzs7OZd2tIqvon38ApYvjhvLdt8Io78cNnPuhMrIt6w5Yy8yZM+Xv76+srCwlJSVp69atGjt2rObNm6fPP/9cLVq0MGOnTp2qf/3rX0Vq/+TJk5oxY4bq16+vVq1aFXq7jRs3Fmk/NyO/vv3nP/9Rdna21ftQHFu2bFG7du30wgsvlHVXrOrSpUuKi4vTc889x1+OraAifNbLwsqVK3XgwIFcVzP4+fnp0qVLsrOzK5uO/X/r1q3T0qVLOeAqpkGDBqlv375ycHAo666gHOIYiWMkFM2ePXv0119/adasWQoJCbHKPsr68w8AN8JxA8cN1sK5HyqjSpto7969u8XVwVOmTNGWLVvUs2dP3X///Tp8+LCcnJwkSba2trK1te5LcfHiRVWrVk329vZW3U9ByjqJVhinT59WQEBAWXfD6v78809Jkru7e4GxFy5cqJBXTpWlivBZz09pv+c5V6mgcqhataqqVq1a1t1AOcUxUt4qwrxxqxwjFUd2drYyMzNLdE47ffq0pMIds92ssv78A8CNcNyQN44bAOSl0i0dk58uXbro+eef1/Hjx/X++++b5XmtIxYbG6sOHTrI3d1dLi4uaty4sZ599llJf6/9deedd0qShg4dav6UKmdt45x1EuPj49WpUydVq1bN3PZG6y9evXpVzz77rHx8fOTs7Kz7779fJ06csIipX7++HnvssVzbXttmQX3Lax2xCxcuaMKECapbt64cHBzUuHFjvfbaazIMwyIuZ82uTz/9VM2bN5eDg4OaNWum9evX5/2CX+f06dOKiIiQt7e3HB0d1bJlS61YscKsz1lT7dixY1q7dq3Z9/zWGM7IyNC4ceNUq1YtVa9eXffff79+//33XHHHjx/Xk08+qcaNG8vJyUmenp56+OGHLdr+9ddfZWNjo/nz5+fafufOnbKxsdF///tfSdJff/2lsWPHqn79+nJwcJCXl5fuvfdefffdd4V6LaZPny4/Pz
9J0sSJE2VjY2O+Lzmfx0OHDql///6qUaOGOnToIEm6cuWKZs2apQYNGsjBwUH169fXs88+q4yMDIv269evr549e2rr1q1q06aNnJycFBgYqK1bt0qSPv74YwUGBsrR0VFBQUHat29fofp9vcJ8biVp9erVCgoKkpOTk2rWrKmBAwfqjz/+yBW3ZcsWdezYUc7OznJ3d9cDDzygw4cPF9iP48ePq2HDhmrevLmSk5Ml5f6s56xB/tprr2nZsmXma3jnnXdqz549efY5ICBAjo6Oat68uT755JObXodv9+7d6tGjh2rUqCFnZ2e1aNFCCxcuNOtz1uX+5Zdf1KNHD1WvXl0DBgyQ9HfCYMGCBWrWrJkcHR3l7e2tJ554QufOnbPYx2effaawsDD5+vrKwcFBDRo00KxZs3T16lUzpnPnzlq7dq2OHz9u/vvKGc+N1mgvzHuS85n9+eef9dhjj8nd3V1ubm4aOnSoLl68WOjX6bHHHtPSpUslyeJnqjkK+11VGD/++KMeeeQR1apVS05OTmrcuLGee+65XHHnz58v1Jjef/998zPu4eGhvn375vlvoaDPQl4SEhJUq1Ytde7cWenp6YUaX15rtOd8L+zYsUP//Oc/5ejoqH/84x969913C9UmKjeOkSrfMVJh+3SjuS2v9z6nzZw50snJScHBwdq/f78k6c0331TDhg3l6Oiozp0737B/8fHxuuuuu+Tk5CR/f39FRUXlisnIyNALL7yghg0bysHBQXXr1tWkSZNyHfPk9CkmJkbNmjWTg4NDoV93qeBjlM6dO2vIkCGSpDvvvFM2NjZ5ftbykvNd/M0332j8+PGqVauWnJ2d9eCDD5oXXFy7n+s//8ePH9f9998vZ2dneXl5ady4cdqwYYNsbGzMY7prHTp0SPfcc4+qVaum2267TXPmzDHrCvr8Hz16VOHh4fLx8ZGjo6Pq1Kmjvn37KjU1tVBjBXBr4bih8h035ChMn/bt26fu3bvL1dVVLi4u6tq1q3bt2mURk5WVpRkzZqhRo0ZydHSUp6enOnTooNjYWPP1y+/cryBff/21Hn74YdWrV888Thg3bpwuXbqUK7aw5/aFPfcG8lNpr2i/kUGDBunZZ5/Vxo0b9fjjj+cZc/DgQfXs2VMtWrTQzJkz5eDgoJ9//lnffPONJKlp06aaOXOmpk2bphEjRqhjx46SpLvuusts48yZM+revbv69u2rgQMHytvbO99+vfTSS7KxsdHkyZN1+vRpLViwQCEhIUpISDD/OlwYhenbtQzD0P3336+vvvpKERERatWqlTZs2KCJEyfqjz/+yJV03rFjhz7++GM9+eSTql69uhYtWqTw8HAlJibK09Pzhv26dOmSOnfurJ9//lmjR4+Wv7+/Vq9erccee0znz5/X008/raZNm+q9997TuHHjVKdOHU2YMEGSVKtWrRu2O3z4cL3//vvq37+/7rrrLm3ZskVhYWG54vbs2aOdO3eqb9++qlOnjn777Te98cYb6ty5sw4dOqRq1arpH//4h9q3b6+YmBiNGzfOYvuYmBhVr15dDzzwgCRp5MiR+uijjzR69GgFBATozJkz2rFjhw4fPqzWrVvfsL85+vTpI3d3d40bN079+vVTjx49ct0A8eGHH1ajRo308ssvmxPz8OHDtWLFCj300EOaMGGCdu/erdmzZ+vw4cP65JNPLLb/+eef1b9/fz3xxBMaOHCgXnvtNfXq1UtRUVF69tln9eSTT0qSZs+erUceeURHjhxRlSpF+9tbYT630dHRGjp0qO68807Nnj1bycnJWrhwob755hvt27fPvDps06ZN6t69u/7xj39o+vTpunTpkhYvXqz27dvru+++u2GC+5dfflGXLl3k4eGh2NhY1axZM98+r1y5Un/99ZeeeOIJ2djYaM6cOerTp49+/fVX86qEtWvX6tFHH1VgYKBmz56tc+fOKSIiQrfddluRXh/p7wPLnj17qnbt2nr66afl4+
Ojw4cPa82aNXr66afNuCtXrig0NFQdOnTQa6+9pmrVqkmSnnjiCfM1fOqpp3Ts2DEtWbJE+/bt0zfffGP2OTo6Wi4
"text/plain": [
"<Figure size 1500x3200 with 22 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Select numeric columns\n",
"numerical = df.select_dtypes(include='number').columns\n",
"n_cols = 3\n",
"n_rows = math.ceil(len(numerical) / n_cols)\n",
"\n",
"# Create subplots\n",
"fig, axes = plt.subplots(n_rows, n_cols, figsize=(5 * n_cols, 4 * n_rows))\n",
"axes = axes.flatten()\n",
"\n",
"# Plot each numeric column\n",
"for i, col in enumerate(numerical):\n",
" axes[i].hist(df[col].dropna(), bins=30, edgecolor='black')\n",
" axes[i].set_title(f'Distribution of {col}')\n",
" axes[i].set_xlabel(col)\n",
" axes[i].set_ylabel('Frequency')\n",
"\n",
"# Hide any unused subplots\n",
"for j in range(i + 1, len(axes)):\n",
" fig.delaxes(axes[j])\n",
"\n",
"plt.tight_layout()\n",
"plt.show()\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "311da64d",
"metadata": {},
"outputs": [],
"source": [
"# We see that there are some outliers in host_age with ages above 100, we will remove those\n",
"df['host_age'] = df['host_age'].where(df['host_age'] <= 100, np.nan)\n",
"\n",
"# We drop number_of_previous_incidents_of_guest as it has only 0 values\n",
"df.drop(columns=['number_of_previous_incidents_of_guest'], inplace=True)\n",
"numerical = df.select_dtypes(include='number').columns"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "692854bb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Missing Values (%):\n",
"host_age 69.826817\n",
"guest_age 45.196414\n",
"days_to_complete_verification 13.174074\n",
"days_to_start_verification 5.739898\n",
"listing_number_of_bathrooms 0.572582\n",
"listing_number_of_bedrooms 0.572582\n",
"dtype: float64\n"
]
}
],
"source": [
"# Check missing values for the remaining columns\n",
"missing_values = df.isnull().mean() * 100\n",
"missing_values = missing_values[missing_values > 0].sort_values(ascending=False)\n",
"print(\"Missing Values (%):\")\n",
"print(missing_values)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "9f333fd5",
"metadata": {},
"outputs": [],
"source": [
"# We will fill the remaining missing values with the median for numerical columns\n",
"for col in numerical:\n",
" df[col] = df[col].fillna(df[col].median())"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "ccd46ddc",
"metadata": {},
"outputs": [],
"source": [
"# Convert all boolean columns to int\n",
"bool_columns = df.select_dtypes(include='bool').columns\n",
"for col in bool_columns:\n",
" df[col] = df[col].astype(int)"
]
},
{
"cell_type": "markdown",
"id": "2c84ebe5",
"metadata": {},
"source": [
"### Feature Relevance Analysis"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "74a582c8",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABQgAAASPCAYAAABCohK6AAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjEsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvc2/+5QAAAAlwSFlzAAAPYQAAD2EBqD+naQABAABJREFUeJzs3Xtcjvf/B/DXdd/V3d1Zikrno0KKHMNytjluztoSNswwxLDNiiFMMowxP9U2xmaYzZjDhDXDUs5CIYcmp9L5cN/X74++3evW6XqnHOb9fDzux6Ou6319rs91vu/P9TkIoiiKYIwxxhhjjDHGGGOMvZRkzzoDjDHGGGOMMcYYY4yxZ4cLCBljjDHGGGOMMcYYe4lxASFjjDHGGGOMMcYYYy8xLiBkjDHGGGOMMcYYY+wlxgWEjDHGGGOMMcYYY4y9xLiAkDHGGGOMMcYYY4yxlxgXEDLGGGOMMcYYY4wx9hLjAkLGGGOMMcYYY4wxxl5iXEDIGGOMMcYYY4wxxthLjAsIGWOM/Sfs27cPo0aNgru7O0xMTKBQKGBtbY3u3bsjMjISd+/efdZZfGJhYWEQBAFhYWFPbZ2Ojo4QBAHXrl17auukCggIgCAIEAQB/fv3rzL2hx9+0MQKgoCbN28+pVxKU5qvp0WtVsPPzw9WVlbIycnRygPlExAQ8NTyzEoEBwdDEARER0dLXiY6OhqCIMDR0bHO8gU8/XtVTfbF41QqFRo3bgwHBwfk5eXVXuYYY4yxF4TOs84AY4wx9iTu3buH4cOHY//+/QBKCrQ6d+4MQ0ND/PPPP/jzzz+xf/9+fPLJJ9i/fz/atGnzjHP8/AgODkZMTAyioqIQHBz8rLNTK3799VfcuXMHDRs2rHD+//3f/9XJeksL9URRrJP068r//d//IT4+HqtWrYKhoSEAYOTIkeXi/vnnH/z222+Vzm/cuHHdZvQ59qIee6ZNLpdj/vz5GDx4MJYsWYLQ0NBnnSXGGGPsqeICQsYYYy+szMxMdOjQAUlJSWjcuDHWrVuHjh07asUUFBQgJiYGoaGhSEtLe0Y5fXEdOHAARUVFaNSo0bPOSrX8/Pzw999/4+uvv8aMGTPKzb9x4wb27duHVq1a4cSJE88gh9W7cOHCU1tXXl4ePvroI9jY2GDs2LGa6RXVwoqNjdUUED5JLS3GnmeDBg1Cs2bNsHjxYowbNw5WVlbPOkuMMcbYU8NNjBljjL2wJk2ahKSkJDg6OiIuLq5c4SAAKBQKjB07FomJifD09HwGuXyxubi4oHHjxtDV1X3WWanWm2++CT09PURFRVU4Pzo6Gmq1GqNHj37KOZOucePGT6023rfffou7d+8iKCjohTi+jD0No0ePRl5eHtatW/ess8IYY4w9VVxAyBhj7IWUkpKCTZs2AQCWLVsGc3PzKuMbNmwIDw+PctM3b96Mrl27wtzcHAqFAg4ODhg9ejQuXbpUYTpl++T76aef0KVLF5ibm0MQBMTGxgLQ7kcuKioK7dq1g6mpabm+/G7fvo1p06bB09MTBgYGMDY2RqtWrbBq1SoUFxdL3hdFRUX49ttvERgYiMaNG8PExARKpRIeHh6YPHkybt++rRV/7do1CIKAmJgYAMCoUaO0+pMr229YVX0Q5ubmYtGiRWjRogWMjY1hYGCAJk2a4OOPP8bDhw/LxZeu19HREaIoYt26dWjZsiUMDQ1hamqKHj164OjRo5K3+3H169dHv379cOHChXLpiKKI6OhoKJVKDB8+vNI0rl+/jsWLF6NLly6wt7eHQqGAmZkZOnTogLVr10KtVmvFl/a1VurxvvlK91tp32/BwcF48OABpkyZAhcXFygUCq3++yrqgzAiIgKCIMDd3R1ZWVnl8vzVV19BEATY2dnh3r17UncXVq1aBQBP3Ly8bH9zqampGDNmDOzs7KCrq6tJu+z2V6TsuVHZ9JqcM7m5uVi+fDk6dOiAevXqaa7xvn37au
4fperq2Je6dOkSxo0bBxcXF+jr68PU1BSdOnXCt99+W2n+S88VBwcHKBQK2NvbY+LEiXjw4EGly9S2/fv3Y9KkSfDx8YGFhQUUCgVsbW0xdOhQSTVxr1+/jqCgIFhbW0NfXx/u7u4ICwursp+/muyriqjVaqxbtw7+/v4wMzODrq4uGjRogObNm2PSpEkV3tcCAwOho6ODtWvXku7DjDHG2IuOmxgzxhh7If3yyy9QqVQwMzNDv379yMuLoojg4GB8/fXX0NHRQadOndCgQQOcPHkSUVFR2LJlC3788Uf06tWrwuUjIiKwatUq+Pn5oVevXrh9+zbkcrlWzKRJk7B69Wq0b98evXv3RkpKiqZA4fDhwxgwYAAePnwIR0dHdO/eHQUFBTh+/DgmTZqEn3/+Gb/88oukml137tzBW2+9BVNTU3h6esLb2xs5OTlITEzEypUrsXnzZvz5559wdXUFABgZGWHkyJH4448/kJycDH9/f808APDx8al2nQ8ePEDXrl2RmJgIExMTdOnSBbq6ujh06BAWLFiATZs24ffff690MIRRo0Zh06ZN6NixI/r06YPExETs27cPhw8fxqFDh2rcV+To0aOxdetWbNiwAe3atdNMP3jwIFJSUhAYGAhTU9NKl//mm28wZ84cODk5wd3dHf7+/khLS8PRo0cRFxeHvXv3YuvWrZrj6OPjg5EjR2oKWx/vn8/IyEjr/3v37sHPzw8ZGRno2LEjWrZsCT09vSq3KSQkBIcPH8bOnTsxduxYfPfdd5p5p06dwuTJk6Gjo4MtW7bAwsJC0n66evUqTp8+DVtb2woLzmvi8uXL8PX1hZ6eHvz9/SGKouT8SEE9Z27cuIFevXrh/PnzMDAwgL+/P+rXr49bt27hyJEjOHPmDEaMGKGJr8tj/8MPPyAoKAj5+flo3LgxXnvtNWRmZuLYsWN466238Pvvv2PDhg1ay9+5cwcdO3bE5cuXUa9ePfTp0wdqtRobN27Enj170KRJk1rbt1UZP348bty4gSZNmsDf3x86Ojq4ePEivv/+e2zbtg2bN2/GwIEDK1z26tWraNmypeYem5eXh4MHD2Lu3LnYv38/9u/fD319fa1larKvKvP2228jKioK+vr66NChAywtLfHgwQOkpKRg1apV6Nq1a7l7lKWlJXx8fPD333/jxIkTWvcRxhhj7D9NZIwxxl5Ab731lghA7NKlS42WX7NmjQhAtLCwEBMSEjTT1Wq1GBoaKgIQzczMxPT0dK3lHBwcRACiXC4Xf/rppwrTBiACEE1MTMSjR4+Wm5+WlibWr19fFARBXL16tahSqTTz7t27J3bp0kUEIM6dO1drudJ8hYaGak1/9OiR+NNPP4kFBQVa0wsLC8XZs2eLAMTXXnutXD5GjhwpAhCjoqIq3I6y23v16lWt6UOHDhUBiG3atBHv3bunmZ6VlSW++uqrIgCxffv2WstcvXpVs28cHBzEpKQkzbzi4mJx9OjRIgCxR48eleanIq+88ooIQPzmm29ElUol2traisbGxmJOTo4mJjAwUAQg/v7776Io/nuMbty4oZXW8ePHxTNnzpRbx61bt8TmzZuLAMTvv/++3PzS9CoTFRWlienatauYmZlZYVxl6Tx8+FB0dHQUAYhr1qwRRbHkuLu5uYkAxM8++6zSdVdk/fr1IgBx8ODBkuIPHjxYad5Kz0sA4ptvvinm5+eXiynd/pEjR1aYfum54eDgUOF06jmjUqlEPz8/zbzHr+O8vDxx165dWtPq6tifPn1aVCgUor6+vvjjjz9qzbt27ZrYrFkzEYAYExOjNW/QoEEiALFjx45iRkaGZvr9+/fFNm3aaNZb1fX7uNLj8Ph+rsr27dvFBw8eVDhdR0dHrF+/vpibm6s1r+w50b9/f635N27cEN3d3UUA4qxZs7SWq+m+quhedv36dRGAaGtrK6alpZXL//nz58Xr169XuM2TJ08WAYiffvppxTuFMcYY+w/iAk
LGGGMvpF69eokAxGHDhtVoeRcXFxGAuGLFinLz1Gq16O3tLQIQFyxYoDWvtMBs9OjRlaZd+sN43rx5Fc6fOXOmCEC
"text/plain": [
"<Figure size 1400x1200 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Check correlation matrix\n",
"import seaborn as sns\n",
"\n",
"# 1. Move 'has_resolution_incident' to the end\n",
"target_col = 'has_resolution_incident'\n",
"if target_col in df.columns:\n",
" columns = [col for col in df.columns if col != target_col] + [target_col]\n",
" df = df[columns]\n",
"\n",
"# 2. Create short column names (truncate to, say, 15 chars)\n",
"short_columns = [col[:15] for col in df.columns]\n",
"\n",
"# 3. Compute correlation matrix\n",
"correlation_matrix = df.corr()\n",
"\n",
"# 4. Plot with Seaborn\n",
"plt.figure(figsize=(14, 12))\n",
"sns.heatmap(\n",
" correlation_matrix,\n",
" xticklabels=short_columns,\n",
" yticklabels=short_columns,\n",
" cmap='coolwarm',\n",
" annot=False,\n",
" fmt=\".2f\",\n",
" square=True,\n",
" cbar_kws={'shrink': 0.6}\n",
")\n",
"plt.title('Correlation Matrix (Truncated Labels)', fontsize=16)\n",
"plt.tight_layout()\n",
"plt.show()\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "a6f7988d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"number_of_previous_incidents_in_listing 0.101702\n",
"number_of_previous_payouts_in_listing 0.096180\n",
"host_account_type_Host 0.073745\n",
"number_of_listings_of_host 0.070200\n",
"listing_number_of_bedrooms 0.065542\n",
"listing_country_United States 0.062555\n",
"host_active_pms_list_Hostify 0.060898\n",
"host_country_United States 0.055897\n",
"has_deposit_management_service_business_type 0.055543\n",
"host_country_United Kingdom 0.049846\n",
"listing_country_United Kingdom 0.048641\n",
"guest_country_United States 0.047742\n",
"number_of_applied_billable_services 0.045234\n",
"host_account_type_PMC - Property Management Company 0.044632\n",
"has_completed_verification 0.040583\n",
"guest_age 0.039814\n",
"is_guest_from_listing_country 0.038440\n",
"listing_number_of_bathrooms 0.038292\n",
"listing_country_New Zealand 0.036971\n",
"is_guest_from_listing_town 0.036880\n",
"host_country_New Zealand 0.036791\n",
"previous_bookings_in_listing_count 0.035117\n",
"guest_has_email 0.033928\n",
"number_of_applied_services 0.032459\n",
"number_of_applied_upgraded_services 0.032452\n",
"guest_country_New Zealand 0.031652\n",
"guest_country_Other 0.031166\n",
"guest_has_phone_number 0.030379\n",
"booking_number_of_nights 0.026738\n",
"has_guest_previously_booked_same_listing 0.026621\n",
"host_active_pms_list_Hospitable 0.025430\n",
"host_active_pms_list_Hostfully 0.025058\n",
"number_of_previous_bookings_of_guest 0.024027\n",
"number_of_nights 0.023304\n",
"guest_country_Canada 0.022773\n",
"host_active_pms_list_Hostaway 0.021299\n",
"booking_days_to_check_in 0.020963\n",
"host_country_Canada 0.020417\n",
"has_upgraded_screening_service_business_type 0.020254\n",
"has_verification_request 0.019356\n",
"listing_country_Colombia 0.018607\n",
"listing_country_Canada 0.018591\n",
"is_host_from_listing_country 0.018029\n",
"number_of_previous_incidents_of_host 0.017803\n",
"number_of_previous_payouts_of_host 0.017717\n",
"days_from_booking_creation_to_check_in 0.016637\n",
"host_active_pms_list_OwnerRez 0.015977\n",
"is_host_from_listing_town 0.015359\n",
"is_host_from_listing_postcode 0.014238\n",
"host_active_pms_list_Avantio 0.011872\n",
"host_active_pms_list_Lodgify 0.010976\n",
"guest_country_Australia 0.009813\n",
"listing_country_Ireland 0.009753\n",
"listing_country_Mexico 0.009473\n",
"guest_country_Colombia 0.009243\n",
"is_guest_from_listing_postcode 0.009204\n",
"host_active_pms_list_TrackHs 0.008961\n",
"has_protection_service_business_type 0.008933\n",
"guest_country_Mexico 0.008703\n",
"host_country_Mexico 0.008603\n",
"listing_country_Bahamas 0.008572\n",
"host_country_Sweden 0.008302\n",
"host_country_Bulgaria 0.008129\n",
"guest_country_Germany 0.007512\n",
"guest_country_United Kingdom 0.007411\n",
"host_months_with_truvi 0.007277\n",
"host_active_pms_list_Guesty 0.007083\n",
"host_country_Portugal 0.007005\n",
"guest_country_Ireland 0.006589\n",
"host_active_pms_list_Uplisting 0.005862\n",
"guest_country_France 0.005616\n",
"host_country_Other 0.004820\n",
"has_billable_services 0.004251\n",
"listing_country_Other 0.003930\n",
"days_to_start_verification 0.003879\n",
"listing_country_Virgin Islands, U.s. 0.002997\n",
"host_age 0.002981\n",
"host_active_pms_list_Hospitable Connect 0.002904\n",
"host_active_pms_list_Smoobu 0.002597\n",
"host_country_Norway 0.002255\n",
"host_country_Australia 0.001435\n",
"listing_country_Australia 0.001435\n",
"days_to_complete_verification 0.001179\n",
"dtype: float64\n"
]
}
],
"source": [
"# Compute correlation with the target variable\n",
"correlation_with_target = df.corrwith(df['has_resolution_incident'])\n",
"\n",
"# Drop the target itself (its correlation with itself is always 1)\n",
"correlation_with_target = correlation_with_target.drop(labels='has_resolution_incident')\n",
"\n",
"# Sort by absolute correlation, descending\n",
"correlation_sorted = correlation_with_target.abs().sort_values(ascending=False)\n",
"\n",
"# Print the absolute correlations, sorted (signs are dropped by .abs(); inspect correlation_with_target to see them)\n",
"print(correlation_sorted)"
]
},
{
"cell_type": "markdown",
"id": "2caec836",
"metadata": {},
"source": [
"### Weighted classes"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "e6d091fb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{0: np.float64(1.0119419188492333), 1: np.float64(84.73863636363637)}"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# We use class weights because the target variable is imbalanced\n",
"X = df.drop(columns=['has_resolution_incident'])\n",
"y = df['has_resolution_incident']\n",
"\n",
"# 1. Split data into training and testing sets\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X, y, test_size=0.3, random_state=123, stratify=y\n",
")\n",
"\n",
"# Compute label distribution on the training set\n",
"label_distribution = y_train.value_counts(normalize=True)\n",
"\n",
"# Calculate inverse weights\n",
"weights = {\n",
" 0: 1 / label_distribution[0],\n",
" 1: 1 / label_distribution[1]\n",
"}\n",
"weights"
]
},
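{
"cell_type": "markdown",
"id": "c7a1e0b2",
"metadata": {},
"source": [
"The inverse-frequency weights above differ from scikit-learn's `class_weight='balanced'` heuristic only by a constant factor: `balanced` uses `n_samples / (n_classes * count)`, which for two classes is half the inverse frequency. A minimal sketch on a synthetic target (the toy `y_toy` is an assumption for illustration, not the notebook's data):\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.utils.class_weight import compute_class_weight\n",
"\n",
"# Hypothetical toy target: 990 negatives and 10 positives\n",
"y_toy = pd.Series([0] * 990 + [1] * 10)\n",
"\n",
"# Inverse-frequency weights, computed the same way as in the cell above\n",
"freq = y_toy.value_counts(normalize=True)\n",
"inverse_weights = {0: 1 / freq[0], 1: 1 / freq[1]}\n",
"\n",
"# scikit-learn's 'balanced' heuristic: n_samples / (n_classes * count)\n",
"balanced = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_toy)\n",
"\n",
"# Ratio between the two schemes, per class (close to 2 for a binary target)\n",
"print({c: inverse_weights[c] / balanced[c] for c in (0, 1)})\n",
"```\n",
"\n",
"Both schemes keep the same relative penalty between classes. For regularised models, the absolute scale of the weights can still interact with the regularisation strength."
]
},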
{
"cell_type": "markdown",
"id": "ab8f7646",
"metadata": {},
"source": [
"### Feature Selection\n",
"\n",
"Since we have many columns, we'll apply feature selection techniques such as SelectKBest, RFE (Recursive Feature Elimination), and Lasso (L1 regularization) to reduce the number of fields used in our predictive model. This helps:\n",
"- Avoid overfitting\n",
"- Potentially improve model performance (simpler models often generalize better)\n",
"- Reduce training time\n",
"\n",
"We'll also experiment with different numbers of features to determine which combination produces the model best suited to our objectives."
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "0246eb6c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Selected Features:\n",
"Index(['number_of_nights', 'number_of_listings_of_host', 'guest_age',\n",
" 'has_guest_previously_booked_same_listing',\n",
" 'listing_number_of_bedrooms', 'listing_number_of_bathrooms',\n",
" 'previous_bookings_in_listing_count',\n",
" 'number_of_previous_incidents_in_listing',\n",
" 'number_of_previous_payouts_in_listing', 'guest_has_email',\n",
" 'is_guest_from_listing_town', 'is_guest_from_listing_country',\n",
" 'has_completed_verification', 'number_of_applied_services',\n",
" 'number_of_applied_upgraded_services',\n",
" 'number_of_applied_billable_services', 'booking_number_of_nights',\n",
" 'has_deposit_management_service_business_type',\n",
" 'host_account_type_Host',\n",
" 'host_account_type_PMC - Property Management Company',\n",
" 'host_active_pms_list_Hospitable', 'host_active_pms_list_Hostify',\n",
" 'host_country_New Zealand', 'host_country_United Kingdom',\n",
" 'host_country_United States', 'guest_country_Canada',\n",
" 'guest_country_United States', 'listing_country_New Zealand',\n",
" 'listing_country_United Kingdom', 'listing_country_United States'],\n",
" dtype='object')\n"
]
}
],
"source": [
"selector = SelectKBest(score_func=f_classif, k=30)\n",
"X_new = selector.fit_transform(X_train, y_train)\n",
"selected_features_kbest = X_train.columns[selector.get_support()]\n",
"\n",
"print(\"Selected Features:\")\n",
"print(selected_features_kbest)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "736a8d68",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/joaquin/data-jupyter-notebooks/.venv/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py:465: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
"STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.\n",
"\n",
"Increase the number of iterations (max_iter) or scale the data as shown in:\n",
" https://scikit-learn.org/stable/modules/preprocessing.html\n",
"Please also refer to the documentation for alternative solver options:\n",
" https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
" n_iter_i = _check_optimize_result(\n",
"/home/joaquin/data-jupyter-notebooks/.venv/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py:465: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
"STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.\n",
"\n",
"Increase the number of iterations (max_iter) or scale the data as shown in:\n",
" https://scikit-learn.org/stable/modules/preprocessing.html\n",
"Please also refer to the documentation for alternative solver options:\n",
" https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
" n_iter_i = _check_optimize_result(\n",
"/home/joaquin/data-jupyter-notebooks/.venv/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py:465: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
"STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.\n",
"\n",
"Increase the number of iterations (max_iter) or scale the data as shown in:\n",
" https://scikit-learn.org/stable/modules/preprocessing.html\n",
"Please also refer to the documentation for alternative solver options:\n",
" https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
" n_iter_i = _check_optimize_result(\n",
"/home/joaquin/data-jupyter-notebooks/.venv/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py:465: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
"STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.\n",
"\n",
"Increase the number of iterations (max_iter) or scale the data as shown in:\n",
" https://scikit-learn.org/stable/modules/preprocessing.html\n",
"Please also refer to the documentation for alternative solver options:\n",
" https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
" n_iter_i = _check_optimize_result(\n",
"/home/joaquin/data-jupyter-notebooks/.venv/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py:465: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
"STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.\n",
"\n",
"Increase the number of iterations (max_iter) or scale the data as shown in:\n",
" https://scikit-learn.org/stable/modules/preprocessing.html\n",
"Please also refer to the documentation for alternative solver options:\n",
" https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
" n_iter_i = _check_optimize_result(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Selected Features using RFE:\n",
"Index(['has_guest_previously_booked_same_listing',\n",
" 'number_of_previous_payouts_in_listing', 'guest_has_email',\n",
" 'is_guest_from_listing_town', 'is_guest_from_listing_country',\n",
" 'is_host_from_listing_country', 'is_host_from_listing_postcode',\n",
" 'has_completed_verification', 'has_verification_request',\n",
" 'has_upgraded_screening_service_business_type',\n",
" 'has_deposit_management_service_business_type',\n",
" 'host_account_type_Host',\n",
" 'host_account_type_PMC - Property Management Company',\n",
" 'host_active_pms_list_Avantio', 'host_active_pms_list_Hostify',\n",
" 'host_active_pms_list_TrackHs', 'host_country_Bulgaria',\n",
" 'host_country_Canada', 'host_country_New Zealand',\n",
" 'guest_country_Australia', 'guest_country_Canada',\n",
" 'guest_country_Germany', 'guest_country_Mexico', 'guest_country_Other',\n",
" 'listing_country_Bahamas', 'listing_country_Canada',\n",
" 'listing_country_Colombia', 'listing_country_Ireland',\n",
" 'listing_country_New Zealand', 'listing_country_United States'],\n",
" dtype='object')\n"
]
}
],
"source": [
"# Recursive Feature Elimination (RFE) with Logistic Regression\n",
"model = LogisticRegression(max_iter=1000)\n",
"rfe = RFE(model, n_features_to_select=30)\n",
"rfe.fit(X_train, y_train)\n",
"selected_features_rfe = X_train.columns[rfe.support_]\n",
"\n",
"print(\"Selected Features using RFE:\")\n",
"print(selected_features_rfe)"
]
},
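{
"cell_type": "markdown",
"id": "5d9f3a61",
"metadata": {},
"source": [
"The `ConvergenceWarning` above suggests scaling the data. One possible way to do that, sketched here on synthetic data, is to standardise the features before running RFE. The `make_classification` data and all variable names below are illustrative assumptions; this was not run on the notebook's dataset.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.feature_selection import RFE\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Synthetic, imbalanced stand-in for the booking data\n",
"X_toy, y_toy = make_classification(\n",
"    n_samples=2000, n_features=20, n_informative=5,\n",
"    weights=[0.95, 0.05], random_state=123,\n",
")\n",
"\n",
"# Standardise so that the lbfgs solver sees features on a comparable scale\n",
"X_scaled = StandardScaler().fit_transform(X_toy)\n",
"\n",
"rfe_toy = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)\n",
"rfe_toy.fit(X_scaled, y_toy)\n",
"\n",
"print(rfe_toy.support_.sum())  # number of features kept\n",
"```\n",
"\n",
"In the notebook itself the scaler would need to be fit on `X_train` only, so that no test-set statistics leak into the selection step."
]
},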
{
"cell_type": "code",
"execution_count": 43,
"id": "484786aa",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Selected Features using Lasso Regression:\n",
"Index(['days_from_booking_creation_to_check_in', 'number_of_nights',\n",
" 'host_age', 'host_months_with_truvi', 'number_of_listings_of_host',\n",
" 'number_of_previous_incidents_of_host',\n",
" 'number_of_previous_payouts_of_host', 'guest_age',\n",
" 'number_of_previous_bookings_of_guest',\n",
" 'has_guest_previously_booked_same_listing',\n",
" 'listing_number_of_bedrooms', 'listing_number_of_bathrooms',\n",
" 'previous_bookings_in_listing_count',\n",
" 'number_of_previous_incidents_in_listing',\n",
" 'number_of_previous_payouts_in_listing', 'days_to_start_verification',\n",
" 'days_to_complete_verification', 'is_guest_from_listing_town',\n",
" 'is_guest_from_listing_country', 'is_host_from_listing_town',\n",
" 'is_host_from_listing_postcode', 'has_completed_verification',\n",
" 'number_of_applied_services', 'number_of_applied_billable_services',\n",
" 'booking_days_to_check_in', 'booking_number_of_nights',\n",
" 'has_verification_request',\n",
" 'has_upgraded_screening_service_business_type',\n",
" 'has_deposit_management_service_business_type',\n",
" 'has_protection_service_business_type', 'host_account_type_Host',\n",
" 'host_account_type_PMC - Property Management Company',\n",
" 'host_active_pms_list_Guesty', 'host_active_pms_list_Hospitable',\n",
" 'host_active_pms_list_Hostaway', 'host_active_pms_list_Hostfully',\n",
" 'host_active_pms_list_Hostify', 'host_active_pms_list_Lodgify',\n",
" 'host_active_pms_list_OwnerRez', 'host_country_New Zealand',\n",
" 'guest_country_Canada', 'guest_country_Other',\n",
" 'guest_country_United Kingdom', 'guest_country_United States',\n",
" 'listing_country_Colombia', 'listing_country_New Zealand',\n",
" 'listing_country_United States'],\n",
" dtype='object')\n"
]
}
],
"source": [
"# L1-regularized (Lasso-style) logistic regression for feature selection\n",
"model = LogisticRegression(penalty='l1', solver='liblinear')\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Check which features have non-zero coefficients\n",
"selected_features_lasso = X_train.columns[model.coef_[0] != 0]\n",
"print(\"Selected Features using Lasso Regression:\")\n",
"print(selected_features_lasso)"
]
},
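{
"cell_type": "markdown",
"id": "b84e2c17",
"metadata": {},
"source": [
"To see how much the three selection methods agree, the selected column sets can be intersected. A minimal sketch with made-up, shortened feature lists (in the notebook, the inputs would be `selected_features_kbest`, `selected_features_rfe` and `selected_features_lasso`):\n",
"\n",
"```python\n",
"# Hypothetical, shortened feature lists for illustration only\n",
"kbest = {'guest_age', 'number_of_nights', 'has_completed_verification'}\n",
"rfe = {'guest_has_email', 'has_completed_verification'}\n",
"lasso = {'guest_age', 'host_age', 'has_completed_verification'}\n",
"\n",
"# Features chosen by every method\n",
"common = kbest & rfe & lasso\n",
"print(sorted(common))\n",
"\n",
"# Features chosen by at least two of the three methods\n",
"at_least_two = (kbest & rfe) | (kbest & lasso) | (rfe & lasso)\n",
"print(sorted(at_least_two))\n",
"```\n",
"\n",
"Features that survive several methods are reasonable candidates for a smaller model, although agreement between selectors is only a heuristic."
]
},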
{
"cell_type": "markdown",
"id": "04010a1e",
"metadata": {},
"source": [
"## Processing\n",
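"The train/test split was already done in the Weighted classes section, so processing here is straightforward: for each feature selection method we subset `X_train` and `X_test` to the selected features and cast the training features to float.\n",
"We then display the sizes and target distribution of the train and test sets. Scaling of numeric features is applied later, inside the model pipeline.\n",
"\n",
"### Using KBest Features"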
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "f735b111",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 14914 rows\n",
"Test set size: 6393 rows\n",
"\n",
"Training target distribution:\n",
"has_resolution_incident\n",
"0 0.988199\n",
"1 0.011801\n",
"Name: proportion, dtype: float64\n",
"\n",
"Test target distribution:\n",
"has_resolution_incident\n",
"0 0.988112\n",
"1 0.011888\n",
"Name: proportion, dtype: float64\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_48568/2398832410.py:8: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" X_train_kbest[selected_features_kbest] = X_train_kbest[selected_features_kbest].astype(float)\n"
]
}
],
"source": [
"# Separate features and target\n",
"X_train_kbest = X_train[selected_features_kbest] # Use the features selected by SelectKBest\n",
"y_train_kbest = y_train\n",
"X_test_kbest = X_test[selected_features_kbest]\n",
"y_test_kbest = y_test\n",
"\n",
"# Cast the selected training features to float (scaling is applied later, inside the model pipeline)\n",
"X_train_kbest[selected_features_kbest] = X_train_kbest[selected_features_kbest].astype(float)\n",
"\n",
"print(f\"Training set size: {X_train_kbest.shape[0]} rows\")\n",
"print(f\"Test set size: {X_test_kbest.shape[0]} rows\")\n",
"\n",
"print(\"\\nTraining target distribution:\")\n",
"print(y_train_kbest.value_counts(normalize=True))\n",
"\n",
"print(\"\\nTest target distribution:\")\n",
"print(y_test_kbest.value_counts(normalize=True))"
]
},
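{
"cell_type": "markdown",
"id": "e2f60d93",
"metadata": {},
"source": [
"The `SettingWithCopyWarning` above comes from assigning into a column subset of `X_train`. A minimal sketch of one way to avoid it, using hypothetical toy frames: select and cast in a single expression, so the result is a new DataFrame rather than a slice, and apply the same cast to the test set.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frames standing in for X_train / X_test\n",
"X_train_toy = pd.DataFrame({'a': [1, 2], 'b': [True, False], 'c': [0, 1]})\n",
"X_test_toy = pd.DataFrame({'a': [3], 'b': [True], 'c': [1]})\n",
"selected = ['a', 'b']\n",
"\n",
"# Selecting and casting in one step returns a new object, so nothing is assigned into a slice\n",
"X_train_sel = X_train_toy[selected].astype(float)\n",
"X_test_sel = X_test_toy[selected].astype(float)\n",
"\n",
"print(X_train_sel.dtypes)\n",
"```\n",
"\n",
"The same pattern would apply to the RFE and Lasso subsets below."
]
},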
{
"cell_type": "markdown",
"id": "897eb678",
"metadata": {},
"source": [
"### Using RFE Features"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "301a8fb2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 14914 rows\n",
"Test set size: 6393 rows\n",
"\n",
"Training target distribution:\n",
"has_resolution_incident\n",
"0 0.988199\n",
"1 0.011801\n",
"Name: proportion, dtype: float64\n",
"\n",
"Test target distribution:\n",
"has_resolution_incident\n",
"0 0.988112\n",
"1 0.011888\n",
"Name: proportion, dtype: float64\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_48568/2877144001.py:8: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" X_train_rfe[selected_features_rfe] = X_train_rfe[selected_features_rfe].astype(float)\n"
]
}
],
"source": [
"# Separate features and target\n",
"X_train_rfe = X_train[selected_features_rfe] # Use the features selected by RFE\n",
"y_train_rfe = y_train\n",
"X_test_rfe = X_test[selected_features_rfe]\n",
"y_test_rfe = y_test\n",
"\n",
"# Cast the selected training features to float (scaling is applied later, inside the model pipeline)\n",
"X_train_rfe[selected_features_rfe] = X_train_rfe[selected_features_rfe].astype(float)\n",
"\n",
"print(f\"Training set size: {X_train_rfe.shape[0]} rows\")\n",
"print(f\"Test set size: {X_test_rfe.shape[0]} rows\")\n",
"\n",
"print(\"\\nTraining target distribution:\")\n",
"print(y_train_rfe.value_counts(normalize=True))\n",
"\n",
"print(\"\\nTest target distribution:\")\n",
"print(y_test_rfe.value_counts(normalize=True))"
]
},
{
"cell_type": "markdown",
"id": "2bbc1524",
"metadata": {},
"source": [
"### Using Lasso Features"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "f4b9c01a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set size: 14914 rows\n",
"Test set size: 6393 rows\n",
"\n",
"Training target distribution:\n",
"has_resolution_incident\n",
"0 0.988199\n",
"1 0.011801\n",
"Name: proportion, dtype: float64\n",
"\n",
"Test target distribution:\n",
"has_resolution_incident\n",
"0 0.988112\n",
"1 0.011888\n",
"Name: proportion, dtype: float64\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_48568/1333565449.py:8: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" X_train_lasso[selected_features_lasso] = X_train_lasso[selected_features_lasso].astype(float)\n"
]
}
],
"source": [
"# Separate features and target\n",
"X_train_lasso = X_train[selected_features_lasso] # Use the features selected by lasso\n",
"y_train_lasso = y_train\n",
"X_test_lasso = X_test[selected_features_lasso]\n",
"y_test_lasso = y_test\n",
"\n",
"# Cast the selected training features to float (scaling is applied later, inside the model pipeline)\n",
"X_train_lasso[selected_features_lasso] = X_train_lasso[selected_features_lasso].astype(float)\n",
"\n",
"print(f\"Training set size: {X_train_lasso.shape[0]} rows\")\n",
"print(f\"Test set size: {X_test_lasso.shape[0]} rows\")\n",
"\n",
"print(\"\\nTraining target distribution:\")\n",
"print(y_train_lasso.value_counts(normalize=True))\n",
"\n",
"print(\"\\nTest target distribution:\")\n",
"print(y_test_lasso.value_counts(normalize=True))"
]
},
{
"cell_type": "markdown",
"id": "d36c9276",
"metadata": {},
"source": [
"## Classification Model with Random Forest\n",
"\n",
"We define a machine learning pipeline that includes:\n",
"- **Scaling numeric features** with `StandardScaler`\n",
"- **Training a Random Forest classifier** with balanced class weights to handle the imbalanced dataset\n",
"\n",
"We then use `GridSearchCV` to perform a **grid search with cross-validation** over key hyperparameters (number of trees, max depth, max features, and minimum samples per split and per leaf).  \n",
"The model is evaluated using **Average Precision**, which is better suited than accuracy to imbalanced classification tasks such as this one, where only about 1.2% of bookings have an incident.\n",
"\n",
"The best combination of parameters is selected, and the resulting model is used to make predictions on the test set.\n"
]
},
{
"cell_type": "markdown",
"id": "fe3351be",
"metadata": {},
"source": [
"### Model 1 with KBest Features"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "943ef7d6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 5 folds for each of 72 candidates, totalling 360 fits\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.8s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.9s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 7.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 6.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 6.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 6.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.4s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.4s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.6s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 0.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.6s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.6s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.6s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.6s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 7.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.1s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.1s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.5s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.5s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.7s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.1s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.1s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.7s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.7s\n",
"Best hyperparameters: {'model__max_depth': None, 'model__max_features': 'log2', 'model__min_samples_leaf': 2, 'model__min_samples_split': 2, 'model__n_estimators': 300}\n"
]
}
],
"source": [
"# Define pipeline (StandardScaler is applied to every column passed in, not only numeric ones)\n",
"pipeline = Pipeline([\n",
" ('scaler', StandardScaler()),\n",
" ('model', RandomForestClassifier(class_weight=weights, # We have an imbalanced dataset\n",
" random_state=123))\n",
"])\n",
"\n",
"# Define parameter grid\n",
"param_grid = {\n",
" 'model__n_estimators': [100, 200, 300],\n",
" 'model__max_depth': [None, 10, 20],\n",
" 'model__min_samples_split': [2, 5],\n",
" 'model__min_samples_leaf': [1, 2],\n",
" 'model__max_features': ['sqrt', 'log2']\n",
"}\n",
"\n",
"# GridSearchCV\n",
"grid_search = GridSearchCV(\n",
" estimator=pipeline,\n",
" param_grid=param_grid,\n",
" scoring='average_precision', # For imbalanced classification\n",
" cv=5, # 5-fold cross-validation\n",
" n_jobs=-1, # Use all available cores\n",
" verbose=2 # Verbose output for progress tracking\n",
")\n",
"\n",
"# Fit the grid search on training data\n",
"grid_search.fit(X_train_kbest, y_train_kbest)\n",
"\n",
"# Best model\n",
"best_pipeline_kbest = grid_search.best_estimator_\n",
"print(\"Best hyperparameters:\", grid_search.best_params_)\n",
"\n",
"# Predict on test set\n",
"y_pred_proba_kbest = best_pipeline_kbest.predict_proba(X_test_kbest)[:, 1]\n",
"y_pred_kbest = best_pipeline_kbest.predict(X_test_kbest)\n"
]
},
{
"cell_type": "markdown",
"id": "672444f7",
"metadata": {},
"source": [
"### Model 2 with RFE Features"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "49cb625c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 5 folds for each of 72 candidates, totalling 360 fits\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 0.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 0.9s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.8s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.4s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.9s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.4s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.4s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.4s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.4s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.4s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 0.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 0.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 2.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 2.6s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 1.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 1.6s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 1.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 1.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 2.6s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.6s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.4s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.0s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.0s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.0s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.0s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.1s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.5s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.0s[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.9s\n",
"\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 2.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 0.7s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 3.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 1.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 1.7s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 1.8s[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 4.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 3.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.7s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.7s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 4.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 2.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 2.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 2.5s\n",
"Best hyperparameters: {'model__max_depth': 10, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 2, 'model__min_samples_split': 5, 'model__n_estimators': 100}\n"
]
}
],
"source": [
"# Define pipeline (scaling numeric features only)\n",
"pipeline = Pipeline([\n",
" ('scaler', StandardScaler()),\n",
" ('model', RandomForestClassifier(class_weight=weights, # We have an imbalanced dataset\n",
" random_state=123))\n",
"])\n",
"\n",
"# Define parameter grid\n",
"param_grid = {\n",
" 'model__n_estimators': [100, 200, 300],\n",
" 'model__max_depth': [None, 10, 20],\n",
" 'model__min_samples_split': [2, 5],\n",
" 'model__min_samples_leaf': [1, 2],\n",
" 'model__max_features': ['sqrt', 'log2']\n",
"}\n",
"\n",
"# GridSearchCV\n",
"grid_search = GridSearchCV(\n",
" estimator=pipeline,\n",
" param_grid=param_grid,\n",
" scoring='average_precision', # For imbalanced classification\n",
" cv=5, # 5-fold cross-validation\n",
" n_jobs=-1, # Use all available cores\n",
" verbose=2 # Verbose output for progress tracking\n",
")\n",
"\n",
"# Fit the grid search on training data\n",
"grid_search.fit(X_train_rfe, y_train_rfe)\n",
"\n",
"# Best model\n",
"best_pipeline_rfe = grid_search.best_estimator_\n",
"print(\"Best hyperparameters:\", grid_search.best_params_)\n",
"\n",
"# Predict on test set\n",
"y_pred_proba_rfe = best_pipeline_rfe.predict_proba(X_test_rfe)[:, 1]\n",
"y_pred_rfe = best_pipeline_rfe.predict(X_test_rfe)\n"
]
},
{
"cell_type": "markdown",
"id": "b763f4cd",
"metadata": {},
"source": [
"### Model 3 with Lasso Features"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "47c6ab43",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 5 folds for each of 72 candidates, totalling 360 fits\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 7.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.9s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 7.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 7.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 5.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 3.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.8s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.9s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.8s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 7.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 7.2s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 7.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 7.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.7s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 5.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 7.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.6s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.8s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.5s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 8.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.3s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 6.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 6.6s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 6.0s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 6.5s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.4s\n",
"[CV] END model__max_depth=None, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 6.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.4s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.4s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.3s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.4s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.4s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.2s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 6.0s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.7s\n",
"[CV] END model__max_depth=None, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.6s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.5s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.6s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.7s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.2s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.1s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 6.0s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 6.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 6.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.3s\n",
"[CV] END model__max_depth=10, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 0.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.0s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 2.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.0s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.0s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.4s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.2s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.6s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.5s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.5s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.7s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.3s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 6.1s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.5s\n",
"[CV] END model__max_depth=10, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.5s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 4.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 5.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.5s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 6.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.5s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.1s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.5s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 7.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.5s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.4s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.5s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 7.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 7.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.3s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 5.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 4.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=100; total time= 1.6s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.7s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 5.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.1s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.2s\n",
"[CV] END model__max_depth=20, model__max_features=sqrt, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 6.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=200; total time= 3.7s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 4.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=2, model__n_estimators=300; total time= 5.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=200; total time= 4.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.1s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 5.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=1, model__min_samples_split=5, model__n_estimators=300; total time= 6.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.1s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=200; total time= 4.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 1.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=100; total time= 2.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 5.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=2, model__n_estimators=300; total time= 6.0s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.8s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.6s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=200; total time= 3.4s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.5s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.9s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.2s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 4.3s\n",
"[CV] END model__max_depth=20, model__max_features=log2, model__min_samples_leaf=2, model__min_samples_split=5, model__n_estimators=300; total time= 3.9s\n",
"Best hyperparameters: {'model__max_depth': None, 'model__max_features': 'log2', 'model__min_samples_leaf': 2, 'model__min_samples_split': 2, 'model__n_estimators': 200}\n"
]
}
],
"source": [
"# Define pipeline: StandardScaler is applied to every column of X before the model\n",
"pipeline = Pipeline([\n",
" ('scaler', StandardScaler()),\n",
" ('model', RandomForestClassifier(class_weight=weights, # We have an imbalanced dataset\n",
" random_state=123))\n",
"])\n",
"\n",
"# Define parameter grid\n",
"param_grid = {\n",
" 'model__n_estimators': [100, 200, 300],\n",
" 'model__max_depth': [None, 10, 20],\n",
" 'model__min_samples_split': [2, 5],\n",
" 'model__min_samples_leaf': [1, 2],\n",
" 'model__max_features': ['sqrt', 'log2']\n",
"}\n",
"\n",
"# GridSearchCV\n",
"grid_search = GridSearchCV(\n",
" estimator=pipeline,\n",
" param_grid=param_grid,\n",
" scoring='average_precision', # For imbalanced classification\n",
" cv=5, # 5-fold cross-validation\n",
" n_jobs=-1, # Use all available cores\n",
" verbose=2 # Verbose output for progress tracking\n",
")\n",
"\n",
"# Fit the grid search on training data\n",
"grid_search.fit(X_train_lasso, y_train_lasso)\n",
"\n",
"# Best model\n",
"best_pipeline_lasso = grid_search.best_estimator_\n",
"print(\"Best hyperparameters:\", grid_search.best_params_)\n",
"\n",
"# Predict on test set\n",
"y_pred_proba_lasso = best_pipeline_lasso.predict_proba(X_test_lasso)[:, 1]\n",
"y_pred_lasso = best_pipeline_lasso.predict(X_test_lasso)\n"
]
},
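{
"cell_type": "markdown",
"id": "e5a0c3f1",
"metadata": {},
"source": [
"The verbose log above is long because every hyperparameter combination is fitted once per fold. As a minimal sketch (not part of the original run), the snippet below counts the combinations in `param_grid` and multiplies by the 5 folds. It assumes `GridSearchCV` performs an exhaustive search, and the grid is copied here only to keep the example self-contained.\n",
"\n",
"```python\n",
"from itertools import product\n",
"\n",
"# Same grid as in the search cell (copied so this sketch runs on its own)\n",
"param_grid = {\n",
"    'model__n_estimators': [100, 200, 300],\n",
"    'model__max_depth': [None, 10, 20],\n",
"    'model__min_samples_split': [2, 5],\n",
"    'model__min_samples_leaf': [1, 2],\n",
"    'model__max_features': ['sqrt', 'log2'],\n",
"}\n",
"cv_folds = 5  # matches cv=5 in GridSearchCV\n",
"\n",
"# Every combination of values is one candidate model\n",
"n_candidates = len(list(product(*param_grid.values())))\n",
"n_fits = n_candidates * cv_folds\n",
"\n",
"print(f'{n_candidates} candidates x {cv_folds} folds = {n_fits} fits')\n",
"```\n",
"\n",
"With the default `refit=True`, one additional fit on the full training set follows the search; that refitted pipeline is what `best_estimator_` returns."
]
},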
{
"cell_type": "markdown",
"id": "fc2fcc89",
"metadata": {},
"source": [
"## Evaluation\n",
"This section evaluates how well each new model's predictions match the actual Resolution Incidents.\n",
"\n",
"We start by computing and displaying the classification report, ROC Curve, PR Curve and the respective Area Under the Curve (AUC)."
]
},
{
"cell_type": "markdown",
"id": "76099daf",
"metadata": {},
"source": [
"### Model 1 evaluation"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "78887f46",
"metadata": {},
"outputs": [],
"source": [
"# Actual and predicted\n",
"y_true_kbest = y_test_kbest\n",
"\n",
"# Compute confusion matrix: [ [TN, FP], [FN, TP] ]\n",
"tn, fp, fn, tp = confusion_matrix(y_true_kbest, y_pred_kbest).ravel()\n",
"\n",
"# Total predictions\n",
"total = tp + tn + fp + fn\n",
"\n",
"# Compute all requested metrics\n",
"recall_kbest = recall_score(y_true_kbest, y_pred_kbest)\n",
"precision_kbest = precision_score(y_true_kbest, y_pred_kbest)\n",
"f1_kbest = fbeta_score(y_true_kbest, y_pred_kbest, beta=1)\n",
"f2_kbest = fbeta_score(y_true_kbest, y_pred_kbest, beta=2)\n",
"fpr_kbest = fp / (fp + tn) if (fp + tn) != 0 else 0\n",
"\n",
"# Scores relative to total\n",
"tp_score_kbest = tp / total\n",
"tn_score_kbest = tn / total\n",
"fp_score_kbest = fp / total\n",
"fn_score_kbest = fn / total\n",
"\n",
"# Create DataFrame\n",
"summary_df_kbest = pd.DataFrame([{\n",
" \"title\": \"Kbest\",\n",
" \"flagging_analysis_type\": \"RISK_VS_CLAIM using KBest Features from all features\",\n",
" \"count_total\": total,\n",
" \"count_true_positive\": tp,\n",
" \"count_true_negative\": tn,\n",
" \"count_false_positive\": fp,\n",
" \"count_false_negative\": fn,\n",
" \"true_positive_score\": tp_score_kbest,\n",
" \"true_negative_score\": tn_score_kbest,\n",
" \"false_positive_score\": fp_score_kbest,\n",
" \"false_negative_score\": fn_score_kbest,\n",
" \"recall_score\": recall_kbest,\n",
" \"precision_score\": precision_kbest,\n",
" \"false_positive_rate_score\": fpr_kbest,\n",
" \"f1_score\": f1_kbest,\n",
" \"f2_score\": f2_kbest\n",
"}])"
]
},
{
"cell_type": "markdown",
"id": "ea079e83",
"metadata": {},
"source": [
"### Model 2 evaluation"
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "03c83137",
"metadata": {},
"outputs": [],
"source": [
"# Actual and predicted\n",
"y_true_rfe = y_test_rfe\n",
"\n",
"# Compute confusion matrix: [ [TN, FP], [FN, TP] ]\n",
"tn, fp, fn, tp = confusion_matrix(y_true_rfe, y_pred_rfe).ravel()\n",
"\n",
"# Total predictions\n",
"total = tp + tn + fp + fn\n",
"\n",
"# Compute all requested metrics\n",
"recall_rfe = recall_score(y_true_rfe, y_pred_rfe)\n",
"precision_rfe = precision_score(y_true_rfe, y_pred_rfe)\n",
"f1_rfe = fbeta_score(y_true_rfe, y_pred_rfe, beta=1)\n",
"f2_rfe = fbeta_score(y_true_rfe, y_pred_rfe, beta=2)\n",
"fpr_rfe = fp / (fp + tn) if (fp + tn) != 0 else 0\n",
"\n",
"# Scores relative to total\n",
"tp_score_rfe = tp / total\n",
"tn_score_rfe = tn / total\n",
"fp_score_rfe = fp / total\n",
"fn_score_rfe = fn / total\n",
"\n",
"# Create DataFrame\n",
"summary_df_rfe = pd.DataFrame([{\n",
" \"title\": \"RFE\",\n",
" \"flagging_analysis_type\": \"RISK_VS_CLAIM using RFE Features from all features\",\n",
" \"count_total\": total,\n",
" \"count_true_positive\": tp,\n",
" \"count_true_negative\": tn,\n",
" \"count_false_positive\": fp,\n",
" \"count_false_negative\": fn,\n",
" \"true_positive_score\": tp_score_rfe,\n",
" \"true_negative_score\": tn_score_rfe,\n",
" \"false_positive_score\": fp_score_rfe,\n",
" \"false_negative_score\": fn_score_rfe,\n",
" \"recall_score\": recall_rfe,\n",
" \"precision_score\": precision_rfe,\n",
" \"false_positive_rate_score\": fpr_rfe,\n",
" \"f1_score\": f1_rfe,\n",
" \"f2_score\": f2_rfe\n",
"}])"
]
},
{
"cell_type": "markdown",
"id": "8c2f75c9",
"metadata": {},
"source": [
"### Model 3 evaluation"
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "7d34f389",
"metadata": {},
"outputs": [],
"source": [
"# Actual and predicted\n",
"y_true_lasso = y_test_lasso\n",
"\n",
"# Compute confusion matrix: [ [TN, FP], [FN, TP] ]\n",
"tn, fp, fn, tp = confusion_matrix(y_true_lasso, y_pred_lasso).ravel()\n",
"\n",
"# Total predictions\n",
"total = tp + tn + fp + fn\n",
"\n",
"# Compute all requested metrics\n",
"recall_lasso = recall_score(y_true_lasso, y_pred_lasso)\n",
"precision_lasso = precision_score(y_true_lasso, y_pred_lasso)\n",
"f1_lasso = fbeta_score(y_true_lasso, y_pred_lasso, beta=1)\n",
"f2_lasso = fbeta_score(y_true_lasso, y_pred_lasso, beta=2)\n",
"fpr_lasso = fp / (fp + tn) if (fp + tn) != 0 else 0\n",
"\n",
"# Scores relative to total\n",
"tp_score_lasso = tp / total\n",
"tn_score_lasso = tn / total\n",
"fp_score_lasso = fp / total\n",
"fn_score_lasso = fn / total\n",
"\n",
"# Create DataFrame\n",
"summary_df_lasso = pd.DataFrame([{\n",
" \"title\": \"Lasso\",\n",
" \"flagging_analysis_type\": \"RISK_VS_CLAIM using Lasso Features from all features\",\n",
" \"count_total\": total,\n",
" \"count_true_positive\": tp,\n",
" \"count_true_negative\": tn,\n",
" \"count_false_positive\": fp,\n",
" \"count_false_negative\": fn,\n",
" \"true_positive_score\": tp_score_lasso,\n",
" \"true_negative_score\": tn_score_lasso,\n",
" \"false_positive_score\": fp_score_lasso,\n",
" \"false_negative_score\": fn_score_lasso,\n",
" \"recall_score\": recall_lasso,\n",
" \"precision_score\": precision_lasso,\n",
" \"false_positive_rate_score\": fpr_lasso,\n",
" \"f1_score\": f1_lasso,\n",
" \"f2_score\": f2_lasso\n",
"}])"
]
},
{
"cell_type": "code",
"execution_count": 53,
"id": "09609773",
"metadata": {},
"outputs": [],
"source": [
"def plot_confusion_matrix_from_df(df, flagging_analysis_type):\n",
"\n",
"    # Subset - just retrieve one row depending on the flagging_analysis_type\n",
"    row = df[df['flagging_analysis_type'] == flagging_analysis_type].iloc[0]\n",
"\n",
"    # Define custom x-axis labels and wording.\n",
"    # Match on the prefix: callers pass longer values such as\n",
"    # 'RISK_VS_CLAIM using KBest Features from all features'.\n",
"    if flagging_analysis_type.startswith('RISK_VS_CLAIM'):\n",
"        x_labels = ['With Submitted Claim', 'Without Submitted Claim']\n",
"        outcome_label = \"submitted claim\"\n",
"    elif flagging_analysis_type.startswith('RISK_VS_SUBMITTED_PAYOUT'):\n",
"        x_labels = ['With Submitted Payout', 'Without Submitted Payout']\n",
"        outcome_label = \"submitted payout\"\n",
"    else:\n",
"        x_labels = ['Actual Positive', 'Actual Negative']\n",
"        outcome_label = \"outcome\"\n",
"\n",
"    # Confusion matrix structure: rows are predictions, columns are actual outcomes\n",
"    cm = np.array([\n",
"        [row['count_true_positive'], row['count_false_positive']],\n",
"        [row['count_false_negative'], row['count_true_negative']]\n",
"    ])\n",
"\n",
"    # Create annotations for the confusion matrix\n",
"    labels = [['True Positives', 'False Positives'], ['False Negatives', 'True Negatives']]\n",
"    counts = [[f\"{v:,}\" for v in [row['count_true_positive'], row['count_false_positive']]],\n",
"              [f\"{v:,}\" for v in [row['count_false_negative'], row['count_true_negative']]]]\n",
"    percentages = [[f\"{round(100*v,2):,}\" for v in [row['true_positive_score'], row['false_positive_score']]],\n",
"                   [f\"{round(100*v,2):,}\" for v in [row['false_negative_score'], row['true_negative_score']]]]\n",
"    annot = [[f\"{labels[i][j]}\\n{counts[i][j]} ({percentages[i][j]}%)\" for j in range(2)] for i in range(2)]\n",
"\n",
"    # Scores formatted as percentages\n",
"    recall = row['recall_score'] * 100\n",
"    precision = row['precision_score'] * 100\n",
"    f1 = row['f1_score'] * 100\n",
"    f2 = row['f2_score'] * 100\n",
"\n",
"    # Set up figure and axes manually for precise control\n",
"    fig = plt.figure(figsize=(9, 8))\n",
"    grid = fig.add_gridspec(nrows=4, height_ratios=[2, 2, 15, 2])\n",
"\n",
"    # Main title, named after the model stored in the row's 'title' column\n",
"    ax_main_title = fig.add_subplot(grid[0])\n",
"    ax_main_title.axis('off')\n",
"    ax_main_title.set_title(f\"{row['title']} - Flagged as Risk vs. {outcome_label.title()}\", fontsize=14, weight='bold')\n",
"\n",
"    # Business explanation text\n",
"    ax_text = fig.add_subplot(grid[1])\n",
"    ax_text.axis('off')\n",
"    business_text = (\n",
"        f\"Flagging performance analysis:\\n\\n\"\n",
"        f\"- Of all the bookings we flagged as at Risk, {precision:.2f}% actually turned into a {outcome_label}.\\n\"\n",
"        f\"- Of all the bookings that resulted in a {outcome_label}, we correctly flagged {recall:.2f}% of them.\\n\"\n",
"        f\"- The pure balance between these two is summarized by a score of {f1:.2f}%.\\n\"\n",
"        f\"- If we prioritise better probability of detection of a {outcome_label}, the balanced score is {f2:.2f}%.\\n\"\n",
"    )\n",
"    ax_text.text(0.0, 0.0, business_text, fontsize=10.5, ha='left', va='bottom', wrap=False, linespacing=1.5)\n",
"\n",
"    # Heatmap\n",
"    ax_heatmap = fig.add_subplot(grid[2])\n",
"    ax_heatmap.set_title(f\"Confusion Matrix Risk vs. {outcome_label.title()}\", fontsize=12, weight='bold', ha='center', va='center', wrap=False)\n",
"\n",
"    cmap = sns.light_palette(\"#315584\", as_cmap=True)\n",
"\n",
"    sns.heatmap(cm, annot=annot, fmt='', cmap=cmap, cbar=False,\n",
"                xticklabels=x_labels,\n",
"                yticklabels=['Flagged as Risk', 'Flagged as No Risk'],\n",
"                ax=ax_heatmap,\n",
"                linewidths=1.0,\n",
"                annot_kws={'fontsize': 10, 'linespacing': 1.2})\n",
"    ax_heatmap.set_xlabel(\"Resolution Outcome (Actual)\", fontsize=11, labelpad=10)\n",
"    ax_heatmap.set_ylabel(\"Flagging (Prediction)\", fontsize=11, labelpad=10)\n",
"\n",
"    # Make borders visible\n",
"    for _, spine in ax_heatmap.spines.items():\n",
"        spine.set_visible(True)\n",
"\n",
"    # Footer with metrics and date\n",
"    ax_footer = fig.add_subplot(grid[3])\n",
"    ax_footer.axis('off')\n",
"    metrics_text = f\"Total Booking Count: {row['count_total']} | Recall: {recall:.2f}% | Precision: {precision:.2f}% | F1 Score: {f1:.2f}% | F2 Score: {f2:.2f}%\"\n",
"    date_text = f\"Generated on {date.today().strftime('%B %d, %Y')}\"\n",
"    ax_footer.text(0.5, 0.7, metrics_text, ha='center', fontsize=9)\n",
"    ax_footer.text(0.5, 0.1, date_text, ha='center', fontsize=8, color='gray')\n",
"\n",
"    plt.tight_layout()\n",
"    plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "7cc4a1d2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 900x800 with 4 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 900x800 with 4 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<Figure size 900x800 with 4 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Plot confusion matrix for claim scenario\n",
"plot_confusion_matrix_from_df(summary_df_kbest, 'RISK_VS_CLAIM using KBest Features from all features')\n",
"plot_confusion_matrix_from_df(summary_df_rfe, 'RISK_VS_CLAIM using RFE Features from all features')\n",
"plot_confusion_matrix_from_df(summary_df_lasso, 'RISK_VS_CLAIM using Lasso Features from all features')"
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "30786f7c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1600x400 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Print a table to summarize the results\n",
"summary_table = pd.concat([summary_df_kbest, summary_df_rfe, summary_df_lasso], ignore_index=True)\n",
"summary_table = summary_table[['title', 'count_true_positive', 'count_true_negative',\n",
" 'count_false_positive', 'count_false_negative', 'true_positive_score', 'true_negative_score',\n",
" 'false_positive_score', 'false_negative_score', 'recall_score', 'precision_score',\n",
" 'false_positive_rate_score', 'f1_score', 'f2_score']]\n",
"\n",
"# Rename them\n",
"summary_table.columns = ['Model', 'TP', 'TN', 'FP', 'FN',\n",
" 'TP Rate', 'TN Rate', 'FP Rate', 'FN Rate',\n",
" 'Recall', 'Precision', 'FPR', 'F1', 'F2']\n",
" \n",
"# summary_table.to_csv('flagging_analysis_summary.csv', index=False)\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Set up figure and axis\n",
"fig, ax = plt.subplots(figsize=(16, 4)) # Adjust width/height as needed\n",
"ax.axis('off') # Hide axes\n",
"\n",
"# Create table from DataFrame\n",
"table = ax.table(cellText=summary_table.round(3).values,\n",
" colLabels=summary_table.columns,\n",
" loc='center',\n",
" cellLoc='center')\n",
"\n",
"table.auto_set_font_size(False)\n",
"table.set_fontsize(10)\n",
"table.scale(1.2, 1.5) # Adjust cell size\n",
"\n",
"# Display the table\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "d731d0c5",
"metadata": {},
"source": [
"### Interpreting the Classification Report\n",
"\n",
"The **Classification Report** provides key metrics to evaluate how well the model performed on each class.\n",
"\n",
"It includes the following metrics for each class (0 and 1):\n",
"\n",
"| Metric | Meaning |\n",
"| --- | --- |\n",
"| Precision | Out of all predicted positives, how many were actually positive? |\n",
"| Recall | Out of all actual positives, how many did we correctly identify? |\n",
"| F1-score | Harmonic mean of precision and recall (balances both) |\n",
"| Support | Number of true samples of that class in the test data |\n",
"\n",
"Interpretation:\n",
"* Class 0 = No incident\n",
"* Class 1 = Has resolution incident (rare, but important!)\n",
"\n",
"A few explanatory cases:\n",
"* A high recall for class 1 means we're catching most incidents.\n",
"* A high precision for class 1 means when we predict an incident, we're often correct.\n",
"* The F1-score gives a single balanced measure (good for imbalanced data).\n",
"\n",
"Special note for imbalanced data:\n",
"Since class 1 (or simply True) is rare (1% in our case), the metrics for that class matter most.\n",
"We want to maximize recall to catch as many real incidents as possible, without letting precision drop so low that we raise too many false alarms."
]
},
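{
"cell_type": "markdown",
"id": "a4c19e07",
"metadata": {},
"source": [
"To make the definitions above concrete, here is a minimal sketch that derives precision, recall and F1 for the positive class from raw confusion-matrix counts. The counts are made up for illustration (they are not results from this notebook), and `class_metrics` is a hypothetical helper rather than a scikit-learn function.\n",
"\n",
"```python\n",
"def class_metrics(tp, fp, fn):\n",
"    '''Precision, recall and F1 for the positive class from confusion counts.'''\n",
"    precision = tp / (tp + fp) if (tp + fp) else 0.0\n",
"    recall = tp / (tp + fn) if (tp + fn) else 0.0\n",
"    denom = precision + recall\n",
"    f1 = 2 * precision * recall / denom if denom else 0.0\n",
"    return precision, recall, f1\n",
"\n",
"# Illustrative counts only: 8 incidents caught, 40 false alarms, 2 incidents missed\n",
"p, r, f1 = class_metrics(tp=8, fp=40, fn=2)\n",
"print(f'precision={p:.3f} recall={r:.3f} f1={f1:.3f}')\n",
"```\n",
"\n",
"With these toy counts, recall is high (0.8) while precision stays low (about 0.17), which is the typical trade-off on rare-event data."
]
},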
{
"cell_type": "markdown",
"id": "c366cfe7",
"metadata": {},
"source": [
"### Results Summary\n",
"\n",
"- Model 1 (Kbest) best in F1 Score (0.227), but has a moderate recall.\n",
"- Model 2 (RFE) provides the highest recall (0.875) and the best F2 score (0.345), meaning it's most effective at capturing positives while tolerating more false positives.\n",
"- Model 3 (Lasso) offers the highest precision (0.9) and the lowest FPR, though it misses most real incidents (low recall)."
]
},
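{
"cell_type": "markdown",
"id": "b82d5f13",
"metadata": {},
"source": [
"The F2 score referenced above is the F-beta score with beta = 2, which weights recall more heavily than precision. A minimal sketch, using made-up precision/recall pairs rather than this notebook's results (`f_beta` is a hypothetical helper written for illustration):\n",
"\n",
"```python\n",
"def f_beta(precision, recall, beta=1.0):\n",
"    '''F-beta score: weighted harmonic mean of precision and recall.'''\n",
"    b2 = beta ** 2\n",
"    denom = b2 * precision + recall\n",
"    return (1 + b2) * precision * recall / denom if denom else 0.0\n",
"\n",
"# Two illustrative models with swapped precision and recall\n",
"high_recall = (0.2, 0.8)      # (precision, recall)\n",
"high_precision = (0.8, 0.2)   # (precision, recall)\n",
"\n",
"for name, (prec, rec) in [('high recall', high_recall), ('high precision', high_precision)]:\n",
"    print(name, 'F1 =', round(f_beta(prec, rec, 1), 3), 'F2 =', round(f_beta(prec, rec, 2), 3))\n",
"```\n",
"\n",
"Both toy models share the same F1 (0.32), but F2 favours the high-recall one (0.5 versus roughly 0.235), which is why F2 suits a setting where missing an incident costs more than a false alarm."
]
},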
{
"cell_type": "code",
"execution_count": 45,
"id": "4b4da914",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAhgAAAHWCAYAAAA1jvBJAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjEsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvc2/+5QAAAAlwSFlzAAAPYQAAD2EBqD+naQAAfLpJREFUeJzt3XdcU1f/B/BPAoQ9RESGKILg3qvuhaLWvUCtom3t0j59apd2qR3aX22tfVpbW611VAFxKy60al1Vq+IWB+ICVOpAZkJyfn+kBCmgBG+4CXzerxcvTk7u+OYQyJd7zj1HIYQQICIiIpKQUu4AiIiIqOJhgkFERESSY4JBREREkmOCQURERJJjgkFERESSY4JBREREkmOCQURERJJjgkFERESSY4JBREREkmOCQURERJJjgkFUCSxevBgKhcLwZW1tDV9fX4wbNw43b94sdh8hBJYtW4bOnTvDzc0NDg4OaNy4MT755BNkZmaWeK61a9eiT58+8PDwgEqlgo+PD0aMGIHff/+9VLHm5OTgm2++Qdu2beHq6go7OzsEBwdj0qRJuHDhQplePxGVPwXXIiGq+BYvXozx48fjk08+Qe3atZGTk4M///wTixcvhr+/P06fPg07OzvD9lqtFqNGjcLKlSvRqVMnDBkyBA4ODti7dy9WrFiBBg0aYMeOHahevbphHyEEnn/+eSxevBjNmzfHsGHD4OXlhZSUFKxduxZHjx7F/v370b59+xLjTEtLQ+/evXH06FH069cPISEhcHJyQkJCAqKiopCamgq1Wm3StiIiiQgiqvB+/fVXAUAcOXKkUP17770nAIjo6OhC9TNnzhQAxNtvv13kWBs2bBBKpVL07t27UP3s2bMFAPHf//5X6HS6IvstXbpUHDp06LFxPvvss0KpVIpVq1YVeS4nJ0e89dZbj92/tDQajcjNzZXkWERUPCYYRJVASQnGpk2bBAAxc+ZMQ11WVpaoUqWKCA4OFhqNptjjjR8/XgAQBw8eNOzj7u4u6tWrJ/Ly8soU459//ikAiAkTJpRq+y5duoguXboUqY+IiBC1atUyPL5y5YoAIGbPni2++eYbERAQIJRKpfjzzz+FlZWVmD59epFjnD9/XgAQ3333naHu3r174o033hA1atQQKpVKBAYGii+++EJotVqjXytRZcAxGESVWFJSEgCgSpUqhrp9+/bh3r17GDVqFKytrYvdb+zYsQCATZs2Gfa5e/cuRo0aBSsrqzLFsmHDBgDAmDFjyrT/k/z666/47rvv8NJLL+Hrr7+Gt7c3unTpgpUrVxbZNjo6GlZWVhg+fDgAICsrC126dMFvv/2GsWPH4n//+x86dOiAqVOnYvLkySaJl8jSFf/Xg4gqpAcPHiAtLQ05OTk4dOgQZsyYAVtbW/Tr18+wzdmzZwEATZs2LfE4+c+dO3eu0PfGjRuXOTYpjvE4N27cwKVLl1CtWjVDXVhYGF5++WWcPn0ajRo1MtRHR0ejS5cuhjEmc+bMweXLl3H8+HEEBQUBAF5++WX4+Phg9uzZeOutt+Dn52eSuIksFa9gEFUiISEhqFatGvz8/DBs2DA4Ojpiw4YNqFGjhmGbhw8fAgCcnZ1LPE7+c+np6YW+P26fJ5HiGI8zdOjQQskFAAwZMgTW1taIjo421J0+fRpnz55FWFiYoS4mJgadOnVClSpVkJaWZvgKCQmBVqvFH3/8YZKYiSwZr2AQVSLz5s1DcHAwHjx4gEWLFuGPP/6Ara1toW3yP+DzE43i/DsJcXFxeeI+T/LoMdzc3Mp8nJLUrl27SJ2Hhwd69OiBlStX4tNPPwWgv3phbW2NIUOGGLa7ePEiTp48WSRByXf79m3J4yWydEwwiCqRNm3aoFWrVgCAQYMGoWPHjhg1ahQSEhLg5OQEAKhfvz4A4OTJkxg0aFCxxzl58iQAoEGDBgCAevXqAQBOnTpV4j5P8ugxOnXq9MTtFQoFRDF32W
u12mK3t7e3L7Y+PDwc48ePR3x8PJo1a4aVK1eiR48e8PDwMGyj0+nQs2dPvPvuu8UeIzg4+InxElU27CIhqqSsrKwwa9YsJCcn4/vvvzfUd+zYEW5ublixYkWJH9ZLly4FAMPYjY4dO6JKlSqIjIwscZ8n6d+/PwDgt99+K9X2VapUwf3794vUX7161ajzDho0CCqVCtHR0YiPj8eFCxcQHh5eaJvAwEBkZGQgJCSk2K+aNWsadU6iyoAJBlEl1rVrV7Rp0wZz585FTk4OAMDBwQFvv/02EhIS8MEHHxTZJzY2FosXL0ZoaCieeeYZwz7vvfcezp07h/fee6/YKwu//fYbDh8+XGIs7dq1Q+/evbFw4UKsW7euyPNqtRpvv/224XFgYCDOnz+PO3fuGOpOnDiB/fv3l/r1A4CbmxtCQ0OxcuVKREVFQaVSFbkKM2LECBw8eBDbtm0rsv/9+/eRl5dn1DmJKgPO5ElUCeTP5HnkyBFDF0m+VatWYfjw4fjxxx/xyiuvANB3M4SFhWH16tXo3Lkzhg4dCnt7e+zbtw+//fYb6tevj507dxaayVOn02HcuHFYtmwZWrRoYZjJMzU1FevWrcPhw4dx4MABtGvXrsQ479y5g169euHEiRPo378/evToAUdHR1y8eBFRUVFISUlBbm4uAP1dJ40aNULTpk3xwgsv4Pbt25g/fz6qV6+O9PR0wy24SUlJqF27NmbPnl0oQXnU8uXL8dxzz8HZ2Rldu3Y13DKbLysrC506dcLJkycxbtw4tGzZEpmZmTh16hRWrVqFpKSkQl0qRATO5ElUGZQ00ZYQQmi1WhEYGCgCAwMLTZKl1WrFr7/+Kjp06CBcXFyEnZ2daNiwoZgxY4bIyMgo8VyrVq0SvXr1Eu7u7sLa2lp4e3uLsLAwsXv37lLFmpWVJb766ivRunVr4eTkJFQqlQgKChKvv/66uHTpUqFtf/vtNxEQECBUKpVo1qyZ2LZt22Mn2ipJenq6sLe3FwDEb7/9Vuw2Dx8+FFOnThV16tQRKpVKeHh4iPbt24uvvvpKqNXqUr02osqEVzCIiIhIchyDQURERJJjgkFERESSY4JBREREkmOCQURERJJjgkFERESSY4JBREREkqt0a5HodDokJyfD2dkZCoVC7nCIiIgshhACDx8+hI+PD5TKx1+jqHQJRnJyMvz8/OQOg4iIyGJdv34dNWrUeOw2lS7ByF9e+vr164bloZ+WRqPB9u3b0atXL9jY2EhyzMqObSo9tqm02J7SY5tKyxTtmZ6eDj8/P8Nn6eNUugQjv1vExcVF0gTDwcEBLi4u/KWQCNtUemxTabE9pcc2lZYp27M0Qww4yJOIiIgkxwSDiIiIJMcEg4iIiCTHBIOIiIgkxwSDiIiIJMcEg4iIiCTHBIOIiIgkxwSDiIiIJMcEg4iIiCTHBIOIiIgkJ2uC8ccff6B///7w8fGBQqHAunXrnrjP7t270aJFC9ja2qJOnTpYvHixyeMkIiIi48iaYGRmZqJp06aYN29eqba/cuUKnn32WXTr1g3x8fH473//ixdffBHbtm0zcaRERERkDFkXO+vTpw/69OlT6u3nz5+P2rVr4+uvvwYA1K9fH/v27cM333yD0NBQU4VJRERkNpKTgXv3nrydRgNcu+aMW7eAJ6ysbhIWtZrqwYMHERISUqguNDQU//3vf0vcJzc3F7m5uYbH6enpAPSrzGk0Gkniyj+OVMcjtqkpsE2lxfaUHtv08a5dA956ywrr1z+588HKSgt//6u4fLk7kpI0+OILaT/vSsOiEozU1FRUr169UF316tWRnp6O7Oxs2NvbF9ln1qxZmDFjRpH67du3w8HBQdL44uLiJD0esU1NgW0qLban9NimhWk0CqxfXwcrVwZDrX5ycuHgkIURI1aiVq2riIwciaQkBTZvPitJLFlZWaXe1qISjLKYOnUqJk+ebHicnp4OPz8/9OrVCy4uLpKcQ6PRIC4uDj
179oSNjY0kx6zs2KbSY5tKi+0pPbZpUTt3KvDuu1a4cEFhqKteXaBPHwGFouj2Nja34ekZDWvr+9DpbNG06R0MGtQ
"text/plain": [
"<Figure size 600x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# ROC Curve\n",
"fpr, tpr, _ = roc_curve(y_test_rfe, y_pred_proba_rfe)\n",
"roc_auc = auc(fpr, tpr)\n",
"\n",
"plt.figure(figsize=(6, 5))\n",
"plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.4f})')\n",
"plt.plot([0, 1], [0, 1], color='gray', linestyle='--')\n",
"plt.xlabel('False Positive Rate')\n",
"plt.ylabel('True Positive Rate')\n",
"plt.title('ROC Curve')\n",
"plt.legend(loc='lower right')\n",
"plt.grid(True)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "e403edb1",
"metadata": {},
"source": [
"### Interpreting the ROC Curve\n",
"\n",
"The **Receiver Operating Characteristic (ROC) curve** shows how well the model distinguishes between the positive and negative classes across all decision thresholds.\n",
"\n",
"A quick reminder of the definitions:\n",
"* True Positive Rate (TPR) = Recall\n",
"* False Positive Rate (FPR) = Proportion of negatives wrongly classified as positives\n",
"\n",
"What we display in this plot is:\n",
"* The x-axis is False Positive Rate\n",
"* The y-axis is True Positive Rate\n",
"\n",
"The curve shows how TPR and FPR change as the threshold varies\n",
"\n",
"It's important to note that:\n",
"* A model with no skill will produce a diagonal line (AUC = 0.5)\n",
"* A model with perfect discrimination will hug the top-left corner (AUC = 1.0)\n",
"\n",
"The Area Under the Curve (ROC AUC) gives a single performance score:\n",
"* Closer to 1 means better at ranking positive cases higher than negative ones\n",
"\n",
"**Important!**\n",
"\n",
"While useful, the ROC curve can sometimes overestimate performance when the dataset is imbalanced, because it includes negatives (which dominate in our case, around 99%!). Thats why we also MUST check the Precision-Recall curve."
]
},
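{
"cell_type": "markdown",
"id": "c93e6a28",
"metadata": {},
"source": [
"One way to read the AUC statement above: ROC AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes it by brute force over all pairs on toy data. `roc_auc_by_ranking` is a hypothetical helper written for illustration, while the notebook itself relies on scikit-learn's `roc_curve` and `auc`.\n",
"\n",
"```python\n",
"def roc_auc_by_ranking(y_true, y_score):\n",
"    '''ROC AUC as the fraction of (positive, negative) pairs ranked correctly.\n",
"\n",
"    Ties count as half a correct pair.\n",
"    '''\n",
"    pos = [s for y, s in zip(y_true, y_score) if y == 1]\n",
"    neg = [s for y, s in zip(y_true, y_score) if y == 0]\n",
"    if not pos or not neg:\n",
"        raise ValueError('need at least one positive and one negative sample')\n",
"    wins = 0.0\n",
"    for sp in pos:\n",
"        for sn in neg:\n",
"            if sp > sn:\n",
"                wins += 1.0\n",
"            elif sp == sn:\n",
"                wins += 0.5\n",
"    return wins / (len(pos) * len(neg))\n",
"\n",
"# Toy labels and scores (illustrative, not taken from the notebook)\n",
"toy_labels = [0, 0, 0, 0, 1, 0, 1, 0]\n",
"toy_scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.8, 0.05]\n",
"print(roc_auc_by_ranking(toy_labels, toy_scores))\n",
"```\n",
"\n",
"On this toy data 11 of the 12 positive/negative pairs are ordered correctly, so the result is about 0.917. Note that the pairwise loop is quadratic, so it is only meant for small examples."
]
},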
{
"cell_type": "code",
"execution_count": 46,
"id": "6790d41d",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAhgAAAHWCAYAAAA1jvBJAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjEsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvc2/+5QAAAAlwSFlzAAAPYQAAD2EBqD+naQAAU9hJREFUeJzt3XlYlOX+BvB7ZhgGkE1FQBDFNXPFUAnNHUUpy46puWu5S6lkpaaiWaKmpplKedzOLw3TzEwRJdTcKJfAU7lvaSqIGvsyw8zz+8PD5DiDAj4woPfnurhknnne9/3OF5CbdxuFEEKAiIiISCKltQsgIiKiJw8DBhEREUnHgEFERETSMWAQERGRdAwYREREJB0DBhEREUnHgEFERETSMWAQERGRdAwYREREJB0DBlEFNWzYMPj6+hZrmf3790OhUGD//v2lUlNF17FjR3Ts2NH4+MqVK1AoFFi3bp3VaiKqqBgwiIpo3bp1UCgUxg87Ozs0aNAAoaGhSE5OtnZ55V7BL+uCD6VSiSpVqqBHjx6Ij4+3dnlSJCcnY/LkyWjYsCEcHBxQqVIl+Pv746OPPkJqaqq1yyMqUzbWLoCoovnwww9Ru3Zt5Obm4tChQ1i5ciWio6Px+++/w8HBoczqWLVqFQwGQ7GWad++PXJycmBra1tKVT1a//79ERISAr1ej3PnzmHFihXo1KkTjh07hqZNm1qtrsd17NgxhISEIDMzE4MGDYK/vz8A4Pjx45g3bx4OHDiAPXv2WLlKorLDgEFUTD169EDLli0BACNGjEDVqlWxePFifP/99+jfv7/FZbKyslCpUiWpdajV6mIvo1QqYWdnJ7WO4nruuecwaNAg4+N27dqhR48eWLlyJVasWGHFykouNTUVr776KlQqFRISEtCwYUOT5z/++GOsWrVKyrZK43uJqDTwEAnRY+rcuTMA4PLlywDunRvh6OiIixcvIiQkBE5OThg4cCAAwGAwYMmSJWjcuDHs7Ozg4eGB0aNH4++//zZb765du9ChQwc4OTnB2dkZrVq1wsaNG43PWzoHIyoqCv7+/sZlmjZtiqVLlxqfL+wcjM2bN8Pf3x/29vZwc3PDoEGDcP36dZM5Ba/r+vXr6NWrFxwdHVGtWjVMnjwZer2+xP1r164dAODixYsm46mpqZg4cSJ8fHyg0WhQr149zJ8/32yvjcFgwNKlS9G0aVPY2dmhWrVq6N69O44fP26cs3btWnTu3Bnu7u7QaDRo1KgRVq5cWeKaH/TFF1/g+vXrWLx4sVm4AAAPDw9Mnz7d+FihUGDWrFlm83x9fTFs2DDj44LDcj/99BPGjRsHd3d31KhRA1u2bDGOW6pFoVDg999/N46dOXMGr732GqpUqQI7Ozu0bNkS27dvf7wXTfQI3INB9JgKfjFWrVrVOJafn4/g4GC88MILWLhwofHQyejRo7Fu3ToMHz4cb7/9Ni5fvozPP/8cCQkJOHz4sHGvxLp16/DGG2+gcePGmDp1KlxdXZGQkICYmBgMGDDAYh2xsbHo378/unTpgvnz5wMATp8+jcOHD2PChAmF1l9QT6tWrRAREYHk5GQsXboUhw8fRkJCAlxdXY1z9Xo9goODERAQgIULF+LHH3/EokWLULduXYwdO7ZE/bty5QoAoHLlysax7OxsdOjQAdevX8fo0aNRs2ZNHDlyBFOnTsXNmzexZMkS49w333wT69atQ48ePTBixAjk5+fj4MGD+Pnnn417mlauXInGjRvj5Zdfho2NDX744QeMGzcOBoMB48ePL1Hd99u+fTvs7e3x2muvPfa6LBk3bhyqVauGmTNnIisrCy+++CIcHR3xzTffoEOHDiZzN23ahMaNG6NJkyYAgD/++ANt27aFt7c3pkyZgkqVKuGbb75Br1698O233+LVV18tlZqJIIioSNauXSsAiB9//FGkpKSIa9euiaioKFG1alVhb28v/vrrLyGEEEOHDhUAxJ
QpU0yWP3jwoAAgNmzYYDIeExNjMp6amiqcnJxEQECAyMnJMZlrMBiMnw8dOlTUqlXL+HjChAnC2dlZ5OfnF/oa9u3bJwCIffv2CSGE0Gq1wt3dXTRp0sRkWzt27BAAxMyZM022B0B8+OGHJuts0aKF8Pf3L3SbBS5fviwAiNmzZ4uUlBSRlJQkDh48KFq1aiUAiM2bNxvnzpkzR1SqVEmcO3fOZB1TpkwRKpVKXL16VQghxN69ewUA8fbbb5tt7/5eZWdnmz0fHBws6tSpYzLWoUMH0aFDB7Oa165d+9DXVrlyZdG8efOHzrkfABEeHm42XqtWLTF06FDj44LvuRdeeMHs69q/f3/h7u5uMn7z5k2hVCpNvkZdunQRTZs2Fbm5ucYxg8Eg2rRpI+rXr1/kmomKi4dIiIopKCgI1apVg4+PD15//XU4Ojriu+++g7e3t8m8B/+i37x5M1xcXNC1a1fcvn3b+OHv7w9HR0fs27cPwL09ERkZGZgyZYrZ+RIKhaLQulxdXZGVlYXY2Ngiv5bjx4/j1q1bGDdunMm2XnzxRTRs2BA7d+40W2bMmDEmj9u1a4dLly4VeZvh4eGoVq0aPD090a5dO5w+fRqLFi0y+et/8+bNaNeuHSpXrmzSq6CgIOj1ehw4cAAA8O2330KhUCA8PNxsO/f3yt7e3vh5Wloabt++jQ4dOuDSpUtIS0srcu2FSU9Ph5OT02OvpzAjR46ESqUyGevXrx9u3bplcrhry5YtMBgM6NevHwDg7t272Lt3L/r27YuMjAxjH+/cuYPg4GCcP3/e7FAYkSw8REJUTMuXL0eDBg1gY2MDDw8PPPPMM1AqTbO6jY0NatSoYTJ2/vx5pKWlwd3d3eJ6b926BeCfQy4Fu7iLaty4cfjmm2/Qo0cPeHt7o1u3bujbty+6d+9e6DJ//vknAOCZZ54xe65hw4Y4dOiQyVjBOQ73q1y5ssk5JCkpKSbnZDg6OsLR0dH4eNSoUejTpw9yc3Oxd+9efPbZZ2bncJw/fx7//e9/zbZV4P5eeXl5oUqVKoW+RgA4fPgwwsPDER8fj+zsbJPn0tLS4OLi8tDlH8XZ2RkZGRmPtY6HqV27ttlY9+7d4eLigk2bNqFLly4A7h0e8fPzQ4MGDQAAFy5cgBACM2bMwIwZMyyu+9atW2bhmEgGBgyiYmrdurXx2H5hNBqNWegwGAxwd3fHhg0bLC5T2C/TonJ3d0diYiJ2796NXbt2YdeuXVi7di2GDBmC9evXP9a6Czz4V7QlrVq1MgYX4N4ei/tPaKxfvz6CgoIAAC+99BJUKhWmTJmCTp06GftqMBjQtWtXvPfeexa3UfALtCguXryILl26oGHDhli8eDF8fHxga2uL6OhofPrpp8W+1NeShg0bIjExEVqt9rEuAS7sZNn798AU0Gg06NWrF7777jusWLECycnJOHz4MObOnWucU/DaJk+ejODgYIvrrlevXonrJXoYBgyiMlK3bl38+OOPaNu2rcVfGPfPA4Dff/+92P/529raomfPnujZsycMBgPGjRuHL774AjNmzLC4rlq1agEAzp49a7wapsDZs2eNzxfHhg0bkJOTY3xcp06dh87/4IMPsGrVKkyfPh0xMTEA7vUgMzPTGEQKU7duXezevRt3794tdC/GDz/8gLy8PGzfvh01a9Y0jhcckpKhZ8+eiI+Px7ffflvopcr3q1y5stmNt7RaLW7evFms7fbr1w/r169HXFwcTp8+DSGE8fAI8E/v1Wr1I3tJJBvPwSAqI3379oVer8ecOXPMnsvPzzf+wunWrRucnJwQERGB3Nxck3lCiELXf+fOHZPHSqUSzZo1AwDk5eVZXKZly5Zwd3dHZGSkyZxdu3bh9OnTePHFF4v02u7Xtm1bBAUFGT8eFTBcXV0xevRo7N69G4mJiQDu9So+Ph67d+82m5+amor8/HwAQO/evSGEwOzZs83mFfSqYK/L/b1LS0vD2rVri/3aCj
NmzBhUr14d77zzDs6dO2f2/K1bt/DRRx8ZH9etW9d4HkmBL7/8stiX+wYFBaFKlSrYtGkTNm3ahNatW5scTnF3d0f
"text/plain": [
"<Figure size 600x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# PR Curve\n",
"precision, recall, _ = precision_recall_curve(y_test_rfe, y_pred_proba_rfe)\n",
"pr_auc = average_precision_score(y_test_rfe, y_pred_proba_rfe)\n",
"\n",
"plt.figure(figsize=(6, 5))\n",
"plt.plot(recall, precision, color='green', lw=2, label=f'PR curve (AUC = {pr_auc:.4f})')\n",
"plt.xlabel('Recall')\n",
"plt.ylabel('Precision')\n",
"plt.title('Precision-Recall Curve')\n",
"plt.legend(loc='lower left')\n",
"plt.grid(True)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "c111a266",
"metadata": {},
"source": [
"### Interpreting the Precision-Recall (PR) Curve\n",
"\n",
"The **Precision-Recall (PR) curve** helps evaluate model performance, especially on imbalanced datasets like ours (where positive cases are rare).\n",
"\n",
"A quick reminder of the definitions:\n",
"* Precision = How many of the predicted positives are actually positive\n",
"* Recall = How many of the actual positives the model correctly identifies\n",
"\n",
"What we display in this plot is:\n",
"* The x-axis is Recall \n",
"* The y-axis is Precision \n",
"\n",
"The curve shows the trade-off between them at different model thresholds\n",
"\n",
"In imbalanced datasets, accuracy can be misleading — the PR curve focuses only on the positive class, making it much more meaningful:\n",
"* A higher curve means better performance\n",
"* The area under the curve (PR AUC) summarizes this: closer to 1 is better"
]
},
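{
"cell_type": "markdown",
"id": "d5a07b34",
"metadata": {},
"source": [
"For context when reading the PR curve: a model with no skill has a precision close to the share of positives in the data, so the baseline for PR AUC is that share rather than 0.5 as in the ROC case. The sketch below computes average precision by hand on toy data, assuming all scores are distinct (tied scores need the threshold-based treatment that scikit-learn's `average_precision_score` applies). `average_precision` is a hypothetical helper for illustration.\n",
"\n",
"```python\n",
"def average_precision(y_true, y_score):\n",
"    '''Average precision: mean of precision@k over the ranks k where a positive appears.'''\n",
"    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)\n",
"    n_pos = sum(y_true)\n",
"    if n_pos == 0:\n",
"        raise ValueError('need at least one positive sample')\n",
"    hits = 0\n",
"    total = 0.0\n",
"    for k, (_, label) in enumerate(ranked, start=1):\n",
"        if label == 1:\n",
"            hits += 1\n",
"            total += hits / k  # precision among the top-k scored samples\n",
"    return total / n_pos\n",
"\n",
"# Toy labels and distinct scores (illustrative, not taken from the notebook)\n",
"pr_labels = [0, 0, 0, 0, 1, 0, 1, 0]\n",
"pr_scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.8, 0.05]\n",
"prevalence = sum(pr_labels) / len(pr_labels)\n",
"print('average precision:', average_precision(pr_labels, pr_scores))\n",
"print('no-skill baseline:', prevalence)\n",
"```\n",
"\n",
"Here the ranking places the positives at ranks 1 and 3, giving (1/1 + 2/3) / 2, about 0.833, well above the 0.25 share of positives that a no-skill model would hover around."
]
},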
{
"cell_type": "markdown",
"id": "1c83ddcd",
"metadata": {},
"source": [
"## Feature Importance\n",
"Understanding what drives the prediction is useful for future experiments and business knowledge. Here we track both the native feature importances of the trees, as well as a more heavy SHAP values analysis.\n",
"\n",
"Important! Be aware that SHAP analysis might take quite a bit of time."
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "d66ffe2c",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAxgAAAHqCAYAAACHuOhfAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjEsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvc2/+5QAAAAlwSFlzAAAPYQAAD2EBqD+naQABAABJREFUeJzs3XdUFdf++P33AaRIB1FQkaKASMCGJkoEC17sXdRwBUSMxhArtm8izRoj9lgSE0CDJcYaW6LYsWEBURFRQcyV2MVgAYR5/uBhfh4pHpSoJPu11lnLM2XPZ8/Mwdmzm0KSJAlBEARBEARBEIRKoPauAxAEQRAEQRAE4Z9DFDAEQRAEQRAEQag0ooAhCIIgCIIgCEKlEQUMQRAEQRAEQRAqjShgCIIgCIIgCIJQaUQBQxAEQRAEQRCESiMKGIIgCIIgCIIgVBpRwBAEQRAEQRAEodKIAoYgCIIgCIIgCJVGFDAEQRAEQRAEQag0ooAhCIIgCIIsOjoahUJR6mfy5Ml/yzGPHj1KWFgYDx8+/FvSfxPF5+PUqVPvOpTXtnTpUqKjo991GMK/iMa7DkAQBEEQhPdPREQENjY2Sss++OCDv+VYR48eJTw8HH9/f4yMjP6WY/ybLV26lBo1auDv7/+uQxH+JUQBQxAEQRCEEjp37oyrq+u7DuONPH78GF1d3Xcdxjvz5MkTqlev/q7DEP6FRBMpQRAEQRAqbNeuXbRp0wZdXV309fXp2rUrFy5cUNrm3Llz+Pv7Y2tri7a2Nubm5gQEBHDv3j15m7CwMCZMmACAjY2N3BwrIyODjIwMFApFqc17FAoFYWFhSukoFAouXrzIJ598grGxMR9//LG8/qeffqJ58+bo6OhgYmLCwIEDuXHjxmvl3d/fHz09PTIzM+nWrRt6enrUqVOHb7/9FoDk5GTat2+Prq4uVlZWrFmzRmn/4mZXhw4dYvjw4ZiammJgYICvry8PHjwocbylS5fi5OSElpYWtWvX5vPPPy/RnKxt27Z88MEHnD59Gnd3d6pXr87//d//YW1tzYULFzh48KB8btu2bQvA/fv3CQ4OxtnZGT09PQwMDOjcuTNJSUlKaR84cACFQsHPP//MjBkzqFu3Ltra2nTo0IErV66UiPfEiRN06dIFY2NjdHV1cXFxYeHChUrbXLp0iX79+mFiYoK2tjaurq5s27ZNaZv8/HzCw8Oxs7NDW1sbU1NTPv74Y/bs2aPSdRLeHVGDIQiCIAhCCdnZ2dy9e1dpWY0aNQBYvXo1fn5+eHl58fXXX/PkyROWLVvGxx9/zNmzZ7G2tgZgz549XLt2jSFDhmBubs6FCxf47rvvuHDhAsePH0ehUNCnTx8uX77M2rVrmT9/vnwMMzMz7ty5U+G4+/fvj52dHTNnzkSSJABmzJjB1KlT8fb2JjAwkDt37rB48WLc3d05e/bsazXLKigooHPnzri7uzNnzhxiY2MJCgpCV1eXL7/8Eh8fH/r06cPy5cvx9fWlVatWJZqcBQUFYWRkRFhYGKmpqSxbtozr16/LD/RQVHAKDw/H09OTzz77TN4uISGB+Ph4qlWrJqd37949OnfuzMCBA/nvf/9LrVq1aNu2LV988QV6enp8+eWXANSqVQuAa9eusWXLFvr374+NjQ23bt1ixYoVeHh4cPHiRWrXrq0U7+zZs1FTUyM4OJjs7GzmzJmDj48PJ06ckLfZs2cP3bp1w8LCgtGjR2Nubk5KSgrbt29n9OjRAFy4cAE3Nzfq1KnD5MmT0dXV5eeff6ZXr15s3LiR3r17y3mfNWsWgYGBtGzZkkePHnHq1CnOnDlDx44dK3zNhLdIEgRBEARB+P9FRUVJQKkfSZKkv/76SzIyMpKGDRumtN+ff/4pGRoaKi1/8uRJifTXrl0rAdKhQ4fkZd98840ESOnp6UrbpqenS4AUFRVVIh1ACg0Nlb+HhoZKgDRo0CCl7TIyMiR1dXVpxowZSs
uTk5MlDQ2NEsvLOh8JCQnyMj8/PwmQZs6cKS978OCBpKOjIykUCmndunXy8kuXLpWItTjN5s2bS3l5efLyOXPmSIC0detWSZIk6fbt25Kmpqb0n//8RyooKJC3W7JkiQRIP/74o7zMw8NDAqTly5eXyIOTk5Pk4eFRYvmzZ8+U0pWkonOupaUlRUREyMv2798vAZKjo6OUm5srL1+4cKEESMnJyZIkSdLz588lGxsbycrKSnrw4IFSuoWFhfK/O3ToIDk7O0vPnj1TWt+6dWvJzs5OXta4cWOpa9euJeIW3n+iiZQgCIIgCCV8++237NmzR+kDRW+oHz58yKBBg7h79678UVdX58MPP2T//v1yGjo6OvK/nz17xt27d/noo48AOHPmzN8S94gRI5S+b9q0icLCQry9vZXiNTc3x87OTineigoMDJT/bWRkhIODA7q6unh7e8vLHRwcMDIy4tq1ayX2//TTT5VqID777DM0NDTYuXMnAHv37iUvL48xY8agpvb/HtmGDRuGgYEBO3bsUEpPS0uLIUOGqBy/lpaWnG5BQQH37t1DT08PBweHUq/PkCFD0NTUlL+3adMGQM7b2bNnSU9PZ8yYMSVqhYprZO7fv8++ffvw9vbmr7/+kq/HvXv38PLyIi0tjf/9739A0Tm9cOECaWlpKudJeD+IJlKCIAiCIJTQsmXLUjt5Fz/stW/fvtT9DAwM5H/fv3+f8PBw1q1bx+3bt5W2y87OrsRo/5+XmyGlpaUhSRJ2dnalbv/iA35FaGtrY2ZmprTM0NCQunXryg/TLy4vrW/FyzHp6elhYWFBRkYGANevXweKCikv0tTUxNbWVl5frE6dOkoFgFcpLCxk4cKFLF26lPT0dAoKCuR1pqamJbavV6+e0ndjY2MAOW9Xr14Fyh9t7MqVK0iSxNSpU5k6dWqp29y+fZs6deoQERFBz549sbe354MPPqBTp04MHjwYFxcXlfMovBuigCEIgiAIgsoKCwuBon4Y5ubmJdZraPy/Rwtvb2+OHj3KhAkTaNKkCXp6ehQWFtKpUyc5nfK8/KBe7MUH4Ze9WGtSHK9CoWDXrl2oq6uX2F5PT++VcZSmtLTKWy79//1B/k4v5/1VZs6cydSpUwkICGDatGmYmJigpqbGmDFjSr0+lZG34nSDg4Px8vIqdZsGDRoA4O7uztWrV9m6dSu///47K1euZP78+Sxfvlyp9kh4/4gChiAIgiAIKqtfvz4ANWvWxNPTs8ztHjx4QFxcHOHh4YSEhMjLS2vuUlZBovgN+csjJr385v5V8UqShI2NDfb29irv9zakpaXRrl07+XtOTg5ZWVl06dIFACsrKwBSU1OxtbWVt8vLyyM9Pb3c8/+iss7vL7/8Qrt27fjhhx+Ulj98+FDubF8RxffG+fPny4ytOB/VqlVTKX4TExOGDBnCkCFDyMnJwd3dnbCwMFHAeM+JPhiCIAiCIKjMy8sLAwMDZs6cSX5+fon1xSM/Fb/tfvnt9oIFC0rsUzxXxcsFCQMDA2rUqMGhQ4eUli9dulTlePv06YO6ujrh4eElYpEkSWnI3Lftu+++UzqHy5Yt4/nz53Tu3BkAT09PNDU1WbRokVLsP/zwA9nZ2XTt2lWl4+jq6pY6S7q6unqJc7Jhwwa5D0RFNWvWDBsbGxYsWFDieMXHqVmzJm3btmXFihVkZWWVSOPFkcNevjZ6eno0aNCA3Nzc14pPeHtEDYYgCIIgCCozMDBg2bJlDB48mGbNmjFw4EDMzMzIzMxkx44duLm5sWTJEgwMDOQhXPPz86lTpw6///476enpJdJs3rw5AF9++SUDBw6kWrVqdO/eHV1dXQIDA5k9ezaBgYG4urpy6NAhLl++rHK89evXZ/r06UyZMoWMjAx69eqFvr4+6enpbN68mU8//ZTg4OBKOz8VkZeXR4cOHfD29iY1NZWlS5fy8ccf06NHD6BoqN4pU6YQHh5Op06d6NGjh7xdixYt+O
9//6vScZo3b86yZcuYPn06DRo0oGbNmrRv355u3boRERHBkCFDaN26NcnJycTGxirVllSEmpoay5Yto3v37jRp0oQ
"text/plain": [
"<Figure size 800x500 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"## BUILT-IN\n",
"\n",
"# Get feature importances from the model\n",
"importances = best_pipeline_rfe.named_steps['model'].feature_importances_\n",
"\n",
"# Create a Series and sort\n",
"feat_series = pd.Series(importances, index=selected_features_rfe).sort_values(ascending=True) # ascending=True for horizontal plot\n",
"\n",
"# Plot Feature Importances\n",
"plt.figure(figsize=(8, 5))\n",
"feat_series.plot(kind='barh', color='skyblue')\n",
"plt.title('Feature Importances')\n",
"plt.xlabel('Importance')\n",
"plt.grid(axis='x')\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "3897f25c",
"metadata": {},
"source": [
"### Interpreting the Feature Importance Plot\n",
"The **feature importance plot** shows how much each feature contributes to the models overall decision-making.\n",
"\n",
"For tree-based models like Random Forest, importance is based on how often and how effectively a feature is used to split the data across all trees.\n",
"A higher score means the feature plays a bigger role in improving prediction accuracy.\n",
"\n",
"In the graph you will see that:\n",
"* Features are ranked from most to least important.\n",
"* The values are relative and model-specific — not directly interpretable as weights or probabilities.\n",
"\n",
"This helps us identify which features the model relies on most when making predictions.\n",
"\n",
"**Important!**\n",
"Unlike SHAP values, native importance doesn't show how a feature affects predictions — only how useful it is to the model overall. For deeper interpretability (e.g., direction and context), SHAP is better (but it takes more time to run)."
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "e2197cea",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"PermutationExplainer explainer: 6394it [13:25, 7.93it/s] \n",
"/tmp/ipykernel_29610/4064815753.py:21: FutureWarning: The NumPy global RNG was seeded by calling `np.random.seed`. In a future version this function will no longer use the global RNG. Pass `rng` explicitly to opt-in to the new behaviour and silence this warning.\n",
" shap.summary_plot(shap_values.values, X_test_shap)\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAzsAAAOsCAYAAABtTKjUAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjEsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvc2/+5QAAAAlwSFlzAAAPYQAAD2EBqD+naQABAABJREFUeJzs3XdYFGfXwOHfLk0EpNkQrNh7wfraEjsC9pLEElTsMYnRxDQ1ifmieY29IbFhSVREKfZYMBq7xmiMJSqKYkME6bDsfn/wsrouZUGaeO7r2kt25pmZM7Oz65x5yig0Go0GIYQQQgghhChmlIUdgBBCCCGEEELkB0l2hBBCCCGEEMWSJDtCCCGEEEKIYkmSHSGEEEIIIUSxJMmOEEIIIYQQoliSZEcIIYQQQghRLEmyI4QQQgghhCiWJNkRQgghhBBCFEuS7AghhBBCCCGKJUl2hBBCCCGEeAPMnDkTS0vLbOeFhoaiUCjw8/PL0fpzu1x+Mi7sAIQQQgghhBBFh4ODA8ePH6dmzZqFHcork2RHCCGEEEIIoWVmZkarVq0KO4w8Ic3YhBBCCCGEEFoZNUdLTk5m0qRJ2NnZYWNjw5gxY9i0aRMKhYLQ0FCd5RMTE5k4cSK2trY4ODgwZcoUVCpVAe9FGkl2hBBCCCGEeIOoVCq9l1qtznKZadOm4e3tzWeffcbmzZtRq9VMmzYtw7JffvklSqWSLVu2MHbsWH766Sd+/vnn/NiVbEkzNiGEEEIIId4QcXFxmJiYZDjPwsIiw+mRkZEsX76cr776is8++wyAbt260blzZ8LCwvTKt2zZkkWLFgHQpUsXDh06hJ+fH2PHjs2jvTCcJDtCCCGEECJPpaSksGbNGgA8PT0zvbgWBlL0Nbysxj/L2ebm5hw5ckRv+sqVK9m0aVOGy1y8eJHExEQ8PDx0pvfq1YsDBw7ole/atavO+7p163Lw4MHsIs8XkuwIIYQQQgjxhlAqlbi4uOhNDw4OznSZ+/fvA1CmTBmd6WXLls2wvI2Njc57U1NTEhMTcxhp3pA+O0IIIYQQQohMOTg4APD48WOd6Y8ePSqMcHJEkh0hhBBCCCGKNEUOXnmvfv36lChRgoCAAJ3pO3bsyJft5SVpxiaEEEIIIYTIlL29PePGjeP777+nRIkSNG7cmK1bt3Lt2jUgrWlcUVV0IxNCCCGEEEIUCbNnz2b06NH88MMPDBgwgJSUFO3Q09bW1oUcXeYUGo1GU9hBCCGEEEKI4kNGY8tjin6Gl9Vsy784XjJ06FCOHj3KrVu3CmybOSXN2IQQQgghhCjS8qcvTk6EhIRw7NgxmjVrhlqtJjg4mI0bNzJv3rzCDi1LkuwIIYQQQgghsmRpaUlwcDBz5swhISGBqlWrMm/ePD766KPCDi1LkuwIIYQQQgghstSsWTP++OOPwg4jxyTZEUIIIYQQokgr/GZsrysZjU0IIYQQQghRLEmyI4QQQgghhCiWJNkRQgghhBBCFEvSZ0cIIYQQQogiTfrs5JbU7AghhBBCCCGKJUl2hBBCCCGEEMWSJDtCCCGEEEKIYkmSHSGEEEIIIUSxJMmOEEIIIYQQoliS0diEEEIIIYQo0mQ0ttySmh0hhBBCCCFEsSTJjhBCCCGEEKJYkmRHCCGEEEIUSRv+VjFuv4q7z1SFHYp4TUmfHSGEEEIIUaTEJquxW6wmRZP2fsUFMFOqiP/YCKXiTey/8ibuc96Qmh0hhBBCCFGkdNr8PNFJl6QGh6WphROQeG1JsiOEEEIIIYqU0w8znv4osWDjEK8/SXaEEEIIIV4Xl0Kh0wwY8CM8iirsaPKNJvsibxhFDl7iRdJnRwghhBCiqLseDjUn6k7zOwHLvGBcj8KJqZDcjEql9moNKeq096bA5ZFKnG3lHr7QJ2eFEEIIIURRFpugn+ikG+9TsLHkg48OqFDOVaGYq6LVBhW3o7Pul+P88/NEByAZqL5KnWl58WaTZE
cIIYQQoiibvDbr+UkpBRJGfui8WcXC88+brZ18AFV8cteI7cz94jx4gTRjyy1JdoQQQgghirItx7KeX3p4wcSRDw6E5d263P2lp4/QJ312hBBCCCGKsrhshiCLlSHKAB4kgO1CFVEpYKKAwN7Q3Vkudd90UrMjhBBCCFGUqQzoj6LOpExicvbJUjES9b8WfSka6LEd/n6sKtyARKGTZEcIka+CgoJwcXHhzJkzhR1KkSfHKm+5u7szevRonWmjR4/G3d29kCIq2s6cOYOLiwtBQUE606Oiopg+fTrdu3fHxcVF75iKfHbsH8PK/X1b971aDdXHgflgsHwXFH1h9NK8j+8VRMbnfyLSeB1YL0wb/EAxV4X1QhUxSa9j3x7ps5NbkuwIIYqtw4cP4+3tXdhh6Dhz5gze3t7ExMQUdigiB65evYq3tzfh4eG5Wj48PBwXFxfmzJmTaRl3d3cGDhyY2xANkpv9mD9/Pvv376dfv358++23jBgxIh8jFHo8/s+wcs2m6r7vOxtuvPRkTp8DsHx33sSVB6aE5P82VMCzF8ZveJYC9kukb8+bRBoyCiGKrcOHDxMcHMyYMWMKOxSts2fP4uPjg7u7O1ZWVjrzXF1d6dq1KyYmJoUUXfG3dOlSNJqcX+hcu3YNHx8fmjVrRoUKFfIhsoKR1X40bdqUY8eOYWyse2lw8uRJWrVqhZeXV0GGKtJFxhlWLkUNf9+BepXS3gdkUkM83gdKW8GAtnkT3ytY83fhbDdFcp03iiQ7QghRRBgZGWFkZFTYYRRrkkhmTqlUYmZmpjf9yZMnWFtbF0JEgtuPclY+4n81xl9tyLrcwHnAPOjSCKqUhlnvQVmb3ESYa6nqws04hu5UcewuqDSwpDN4VC/ql8TSPC23ivonK4QoJjQaDevXr8fPz49Hjx7h4ODAiBEjcHNz0ym3Y8cOtm7dSmhoKMbGxtSvXx8vLy8aN26sU+7o0aP4+vpy48YNEhMTsbGxoW7dukycOJHKlSszevRozp07B4CLi4t2uRkzZhjcZ+Px48ds2LCB06dPc//+fZKSknB0dKRnz54MHTpULzFJSUlh06ZN7N27l9u3b2NsbEylSpVwc3Nj0KBBzJw5k+DgYAA8PDy0y3l5eTFmzBiCgoL45ptvWLFiBS4uLhw7dowPP/yQKVOmMHjwYL34PD09CQsLY8+ePdq78Xfu3MHHx4dTp04RHR1NmTJl6Ny5M6NHj8bc3Nyg/U7n7e2Nj48Pmzdvxt/fn99++43Y2FiqV6/OhAkTaNGihU55FxcX3Nzc6NmzJ8uWLePatWtYW1szcOBA3n//fZ49e8aCBQv4/fffiY+Pp3nz5nz55ZeUKVNGu47o6Gh+/vlnjhw5wuPHjzE3N8fBwYGuXbsybNiwHMWfkdGjR3P//n2dfik3btxg5cqV/PXXX0RFRVGqVCmqVKnC0KFDadu2rfY4AIwdO1a7nJubGzNnznzlmLJz7tw5fv75Z/7++29UKhVVqlRhwIAB9O7dW6fcq+7HmTNnGDt2rPY78mL54OBg7bn78ccfM3/+fL788kv69OmjF+/AgQNJTk5m+/btKBRv+AVaair4HYfNx+DsTXjyDOKS8m97Hb/OWfn9F9L+9TnwfJq5CTSpCnUqQt2KMKwjlC6VZyGmm3a4cPvNbHihK1SvHZDW4C3NFy3h+3ZyiVxcyCcphCgQS5cuJSkpib59+2Jqaoqfnx8zZ87EyclJm8gsWrQIX19f6tWrx/jx44mPj2f79u2MGTOGn376ibZt05pdnD17lsmTJ+Ps7IynpyeWlpZERERw6tQpwsLCqFy5MiNGjECj0XD+/Hm+/fZbbRwNGzY0OObr169z6NAhOnbsiJOTEyqViuPHj7NkyRLu3bvHl19+qS2bkpLCxIkTOXv2LK1ataJHjx6Ympry77//cujQIQYNGkTfvn2Ji4vj0KFDTJ48GRsbGwBq1KiR4fZbtWqFvb09O3fu1Et27ty5w8
WLFxk8eLA20fnnn38YO3YsVlZW9O3bl7Jly3Lt2jV+/fVXLly4wMqVK/WaKBlixowZKJVKhg0bRnx8PP7+/nzwwQc
"text/plain": [
"<Figure size 800x950 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"## SHAP VALUES\n",
"\n",
"# SHAP requires that all features passed to Explainer be numeric (floats/ints)\n",
"X_test_shap = X_test_rfe.copy()\n",
"X_test_shap = X_test_shap.astype(float)\n",
"\n",
"# Function that returns the probability of the positive class\n",
"def model_predict(data):\n",
" return best_pipeline_rfe.predict_proba(data)[:, 1]\n",
"\n",
"# Ensure input to SHAP is numeric\n",
"X_test_shap = X_test_rfe.astype(float)\n",
"\n",
"# Create SHAP explainer\n",
"explainer = shap.Explainer(model_predict, X_test_shap)\n",
"\n",
"# Compute SHAP values\n",
"shap_values = explainer(X_test_shap)\n",
"\n",
"# Plot summary\n",
"shap.summary_plot(shap_values.values, X_test_shap)"
]
},
{
"cell_type": "markdown",
"id": "e9ae2701",
"metadata": {},
"source": [
"### Interpreting the SHAP Summary Plot\n",
"\n",
"Each point on a row represents a SHAP value for a single prediction (row = feature).\n",
"The x-axis shows how much the feature contributed to increasing or decreasing the prediction.\n",
"* Right (positive SHAP value): pushes prediction toward the positive class (i.e., higher chance of incident).\n",
"* Left (negative SHAP value): pushes prediction toward the negative class (i.e., lower chance of incident).\n",
"\n",
"Color shows the actual feature value for that point:\n",
"* Red = high value\n",
"* Blue = low value\n",
"\n",
"In other words:\n",
"* The position tells you impact.\n",
"* The color tells you feature value.\n",
"* The density (thickness) of dots shows how often a value occurs."
]
},
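{
"cell_type": "markdown",
"id": "e6b18c45",
"metadata": {},
"source": [
"To illustrate what a single SHAP value means, the sketch below computes exact Shapley values for a tiny two-feature model by averaging each feature's marginal contribution over all feature orderings. Everything here is hypothetical: `toy_model`, its coefficients and the feature names are invented for illustration, and the sketch uses a single baseline row with exact enumeration, whereas the SHAP explainer in this notebook estimates the values against background data.\n",
"\n",
"```python\n",
"from itertools import permutations\n",
"\n",
"def exact_shapley(model, x, baseline):\n",
"    '''Exact Shapley values of model(x) relative to model(baseline).\n",
"\n",
"    Features absent from a coalition are set to their baseline value.\n",
"    Cost grows factorially with the number of features, so this is for toy use only.\n",
"    '''\n",
"    n = len(x)\n",
"    phi = [0.0] * n\n",
"    orders = list(permutations(range(n)))\n",
"    for order in orders:\n",
"        current = list(baseline)\n",
"        prev = model(current)\n",
"        for i in order:\n",
"            current[i] = x[i]          # switch feature i from baseline to its real value\n",
"            new = model(current)\n",
"            phi[i] += new - prev       # marginal contribution of feature i in this ordering\n",
"            prev = new\n",
"    return [value / len(orders) for value in phi]\n",
"\n",
"# Hypothetical risk score with an interaction between the two features\n",
"def toy_model(features):\n",
"    nights, has_services = features\n",
"    return 0.01 + 0.02 * nights + 0.05 * has_services + 0.01 * nights * has_services\n",
"\n",
"x = [3, 1]\n",
"baseline = [0, 0]\n",
"phi = exact_shapley(toy_model, x, baseline)\n",
"print('contributions:', phi)\n",
"print('prediction - baseline:', toy_model(x) - toy_model(baseline))\n",
"```\n",
"\n",
"The contributions (about 0.075 and 0.065) add up to the gap between the prediction and the baseline output (0.14). This additivity is what lets the summary plot read each dot as a push to the right or to the left of the average prediction."
]
},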
{
"cell_type": "code",
"execution_count": null,
"id": "345467a8",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}