sh-notion/notion_data_team_no_files/Data quality assessment Billable Bookings 97008b7f1cbb4beb98295a22528acd03.md
Pablo Martin a256b48b01 pages
2025-07-11 16:15:17 +02:00

26 KiB
Raw Blame History

Data quality assessment: Billable Bookings

2024-07-17, by Uri

This page aims to document the differences in Billable Bookings observed from Finance point of view vs. DWH point of view. This originates from the fact that, wanting to expose this KPI through DWH prove to be inconsistent with the billable bookings coming from Finance side.

Finance side

Finance retrieves the Billable Bookings from the monthly accounting reports, that are generated usually on the first working day of each month by the project data-invoicing-exporter. Specifically, they retrieve it from the Account Reports Summary, summing the tab of BillableBookingCount. Jamie provided the file they were using for June 2024 so I (Uri) could dig more into this. There was no amendment made this month.

The number of billable bookings according to Finance in June 2024 is: 25,538

DWH side

The billable bookings should come from the table int_core__booking_charge_events. Mainly, theres 2 ways of charge a booking: either when the verification starts, or when the check-in starts.

The verification start date that is used for the logic has changed recently (mid June 24) because of the new estimated Guest Journey Start Date, that is when the link is used. Here you can see the changes:

Therefore, before this change, the booking charge events was considered a verification start that corresponded to joined_at_utc coming from the int_core__unified_user table.

In a nutshell, running this on 17th July 2024 for the month of June, the values retrieved are:

  • Using verification estimated start date (link used at): 23,440
  • Using verification guest joined date: 21,604

Query available here

with 
booking_charge_events_joined_at as (
with
    stg_core__booking as (select * from staging.stg_core__booking),
    int_core__price_plans as (select * from intermediate.int_core__price_plans),
    int_core__verification_requests as (
        select * from intermediate.int_core__verification_requests
    ),
    booking_with_relevant_price_plans as (
        select
            *,
            case
                when pp.price_plan_charged_by_type = 'CheckInDate'
                then b.check_in_at_utc
                when pp.price_plan_charged_by_type in ('VerificationStartDate', 'All')
                then icuu.joined_at_utc
            end as booking_fee_charge_at_utc
        from stg_core__booking b
        left join
            int_core__verification_requests vr
            on b.id_verification_request = vr.id_verification_request
        left join intermediate.int_core__unified_user icuu 
			on vr.id_user_guest = icuu.id_user 
        left join int_core__price_plans pp on b.id_user_host = pp.id_user_host
        where
            -- The dates that defines which price plan applies to the booking depends
            -- on charged by type. With the below case, we evaluate if a certain price
            -- plan relates to the booking
            case
                when pp.price_plan_charged_by_type = 'CheckInDate'
                then b.check_in_at_utc between pp.start_at_utc and pp.end_at_utc
                when pp.price_plan_charged_by_type in ('VerificationStartDate', 'All')
                then
                    coalesce(
                        (
                        	icuu.joined_at_utc
                            between pp.start_at_utc and pp.end_at_utc
                        ),
                        false
                    )
                else false
            end
            = true
    ),
    untied_bookings as (
        -- If a booking has two valid price plans, take the earliest
        select id_booking, min(id_price_plan) as id_price_plan
        from booking_with_relevant_price_plans brpp
        group by id_booking
    )

select
    ub.id_booking,
    ub.id_price_plan,
    brpp.booking_fee_local,
    brpp.booking_fee_charge_at_utc,
    cast(brpp.booking_fee_charge_at_utc as date) as booking_fee_charge_date_utc
from untied_bookings ub
left join
    booking_with_relevant_price_plans brpp
    on ub.id_booking = brpp.id_booking
    and ub.id_price_plan = brpp.id_price_plan

),
booking_charge_events_estimated_at as (
with
    stg_core__booking as (select * from staging.stg_core__booking),
    int_core__unified_user as (select * from intermediate.int_core__unified_user),
    int_core__price_plans as (select * from intermediate.int_core__price_plans),
    int_core__verification_requests as (
        select * from intermediate.int_core__verification_requests
    ),
    booking_with_relevant_price_plans as (
        select
            *,
            case
                when pp.price_plan_charged_by_type = 'CheckInDate'
                then b.check_in_at_utc
                when pp.price_plan_charged_by_type in ('VerificationStartDate', 'All')
                then vr.verification_estimated_started_at_utc
            end as booking_fee_charge_at_utc
        from stg_core__booking b
        left join
            int_core__verification_requests vr
            on b.id_verification_request = vr.id_verification_request
        left join int_core__price_plans pp on b.id_user_host = pp.id_user_host
        where
            -- The dates that defines which price plan applies to the booking depends
            -- on charged by type. With the below case, we evaluate if a certain price
            -- plan relates to the booking
            case
                when pp.price_plan_charged_by_type = 'CheckInDate'
                then b.check_in_at_utc between pp.start_at_utc and pp.end_at_utc
                when pp.price_plan_charged_by_type in ('VerificationStartDate', 'All')
                then
                    coalesce(
                        (
                            vr.verification_estimated_started_at_utc
                            between pp.start_at_utc and pp.end_at_utc
                        ),
                        false
                    )
                else false
            end
            = true
    ),
    untied_bookings as (
        -- If a booking has two valid price plans, take the earliest
        select id_booking, min(id_price_plan) as id_price_plan
        from booking_with_relevant_price_plans brpp
        group by id_booking
    )

select
    ub.id_booking,
    ub.id_price_plan,
    brpp.booking_fee_local,
    brpp.booking_fee_charge_at_utc,
    cast(brpp.booking_fee_charge_at_utc as date) as booking_fee_charge_date_utc
from untied_bookings ub
left join
    booking_with_relevant_price_plans brpp
    on ub.id_booking = brpp.id_booking
    and ub.id_price_plan = brpp.id_price_plan

)
select 
	'verification_estimated_at' as type, 
	COUNT(distinct id_booking) as billable_bookings
from booking_charge_events_estimated_at
where date_trunc('month', booking_fee_charge_date_utc) = '2024-06-01'
union all
select 
	'verification_guest_joined_at' as type, 
	COUNT(distinct id_booking) as billable_bookings
from booking_charge_events_joined_at
where date_trunc('month', booking_fee_charge_date_utc) = '2024-06-01'

First volume check

Well, given that theres clearly a difference between Finance numbers and what we can obtain from DWH (in either case…), before jumping into some python-based project that runs queries in the Core schema directly… maybe lets do some simple checks.

The Account Reports Summary excel is split per Company name, and it should be easy to retrieve the Deal Id linked to the company name (at least manually) for the most notable cases. So I just ordered the file in descending order of billable bookings and retrieved the first one:

  • Company Name: LavandaBilling Account
  • Billable Bookings: 3,178

This has a couple of id_deals linked to it: '1604445496','20529225110

Untitled

Query available here

select id_user as id_user_host, id_deal, company_name
from intermediate.int_core__unified_user icuu 
where id_deal IN ('1604445496','20529225110')

The good news is that we have the KPIs view by Deal, so we can retrieve the information for these 2 Deals super fast. It will be considering the estimated start date, as its the current behaviour for DWH.

The bad news is that… well… theres no billable bookings at all!

Untitled

Query available here

select 
	*
from intermediate.int_monthly_aggregated_metrics_history_by_deal 
where date = '2024-06-30'
and id_deal in ('20529225110','1604445496')
order by date desc

So clearly theres a big difference hiding somewhere.

Do we have a similar behaviour if using the joined_at_utc for verification start in booking_charge_events?

… well, yes. Exactly 0 billable bookings!

Untitled

Query available here

with 
booking_charge_events_joined_at as (
with
    stg_core__booking as (select * from staging.stg_core__booking),
    int_core__price_plans as (select * from intermediate.int_core__price_plans),
    int_core__verification_requests as (
        select * from intermediate.int_core__verification_requests
    ),
    booking_with_relevant_price_plans as (
        select
            *,
            case
                when pp.price_plan_charged_by_type = 'CheckInDate'
                then b.check_in_at_utc
                when pp.price_plan_charged_by_type in ('VerificationStartDate', 'All')
                then icuu.joined_at_utc
            end as booking_fee_charge_at_utc
        from stg_core__booking b
        left join
            int_core__verification_requests vr
            on b.id_verification_request = vr.id_verification_request
        left join intermediate.int_core__unified_user icuu 
			on vr.id_user_guest = icuu.id_user 
        left join int_core__price_plans pp on b.id_user_host = pp.id_user_host
        where
            -- The dates that defines which price plan applies to the booking depends
            -- on charged by type. With the below case, we evaluate if a certain price
            -- plan relates to the booking
            case
                when pp.price_plan_charged_by_type = 'CheckInDate'
                then b.check_in_at_utc between pp.start_at_utc and pp.end_at_utc
                when pp.price_plan_charged_by_type in ('VerificationStartDate', 'All')
                then
                    coalesce(
                        (
                        	icuu.joined_at_utc
                            between pp.start_at_utc and pp.end_at_utc
                        ),
                        false
                    )
                else false
            end
            = true
    ),
    untied_bookings as (
        -- If a booking has two valid price plans, take the earliest
        select id_booking, min(id_price_plan) as id_price_plan
        from booking_with_relevant_price_plans brpp
        group by id_booking
    )

select
    ub.id_booking,
    ub.id_price_plan,
    brpp.booking_fee_local,
    brpp.booking_fee_charge_at_utc,
    cast(brpp.booking_fee_charge_at_utc as date) as booking_fee_charge_date_utc
from untied_bookings ub
left join
    booking_with_relevant_price_plans brpp
    on ub.id_booking = brpp.id_booking
    and ub.id_price_plan = brpp.id_price_plan

),
booking_charge_events_estimated_at as (
with
    stg_core__booking as (select * from staging.stg_core__booking),
    int_core__unified_user as (select * from intermediate.int_core__unified_user),
    int_core__price_plans as (select * from intermediate.int_core__price_plans),
    int_core__verification_requests as (
        select * from intermediate.int_core__verification_requests
    ),
    booking_with_relevant_price_plans as (
        select
            *,
            case
                when pp.price_plan_charged_by_type = 'CheckInDate'
                then b.check_in_at_utc
                when pp.price_plan_charged_by_type in ('VerificationStartDate', 'All')
                then vr.verification_estimated_started_at_utc
            end as booking_fee_charge_at_utc
        from stg_core__booking b
        left join
            int_core__verification_requests vr
            on b.id_verification_request = vr.id_verification_request
        left join int_core__price_plans pp on b.id_user_host = pp.id_user_host
        where
            -- The dates that defines which price plan applies to the booking depends
            -- on charged by type. With the below case, we evaluate if a certain price
            -- plan relates to the booking
            case
                when pp.price_plan_charged_by_type = 'CheckInDate'
                then b.check_in_at_utc between pp.start_at_utc and pp.end_at_utc
                when pp.price_plan_charged_by_type in ('VerificationStartDate', 'All')
                then
                    coalesce(
                        (
                            vr.verification_estimated_started_at_utc
                            between pp.start_at_utc and pp.end_at_utc
                        ),
                        false
                    )
                else false
            end
            = true
    ),
    untied_bookings as (
        -- If a booking has two valid price plans, take the earliest
        select id_booking, min(id_price_plan) as id_price_plan
        from booking_with_relevant_price_plans brpp
        group by id_booking
    )

select
    ub.id_booking,
    ub.id_price_plan,
    brpp.booking_fee_local,
    brpp.booking_fee_charge_at_utc,
    cast(brpp.booking_fee_charge_at_utc as date) as booking_fee_charge_date_utc
from untied_bookings ub
left join
    booking_with_relevant_price_plans brpp
    on ub.id_booking = brpp.id_booking
    and ub.id_price_plan = brpp.id_price_plan

)
select 
	'verification_estimated_at' as type, 
	COUNT(distinct bce.id_booking) as billable_bookings
from booking_charge_events_estimated_at bce
inner join intermediate.int_core__bookings b 
	on bce.id_booking = b.id_booking
inner join intermediate.int_core__unified_user u 
	on b.id_user_host = u.id_user
	and u.id_deal in ('1604445496','20529225110')
where date_trunc('month', bce.booking_fee_charge_date_utc) = '2024-06-01'
union all
select 
	'verification_guest_joined_at' as type, 
	COUNT(distinct bce.id_booking) as billable_bookings
from booking_charge_events_joined_at bce
inner join intermediate.int_core__bookings b 
	on bce.id_booking = b.id_booking
inner join intermediate.int_core__unified_user u 
	on b.id_user_host = u.id_user
	and u.id_deal in ('1604445496','20529225110')
where date_trunc('month', bce.booking_fee_charge_date_utc) = '2024-06-01'

So clearly we have a situation here. Thankfully, Clay took the time to explain to me that this company is a special case in which these bookings do not have Guest Journeys, so looks suspicious to me in the sense that probably this is affecting somehow the logic in DWH. This is because its an “autohost account”, that might probably come from Partner API.

I also tried with the 2nd Company with more Billable Bookings in June 2024, namely:

  • Company Name: Home Team Vacation Rentals LLC
  • Billable Bookings: 1,350
  • Deal Id: 15463726437

Reusing the previous query and changing the deal id, we get these results:

Untitled

So its looking MUCH better, even though its true that we miss/exceed the 1,350 for a few bookings. Lets be pragmatic and focus first on the big issues, and then on the small ones…

Debugging data-invoicing-exporter

Ive already put some comments on the DevOps ticket, but might be worth to aim to understand where these Lavanda Billable Bookings come from.

Billable Booking Count is computed in dashboard_tools.py here, which depends on user_report_contents here. In turn, this is retrieved from the function get_user_data, defined here.

From here, we can retrieve the query being used on bookings queries.get_query_booking_data_of_user here.

Time to switch to queries.py, where the query were interested in is here. This query has some transactional sql.

Firstly, here, it creates a snapshot of the latest available PricePlanToUser at the moment of the extraction, to be applied to all the bookings of the exporting time period considered. These are the ones that are going to be used for the export. This is, technically, different from what we do in DWH in which we apply the price plan according on what was active at the moment of verification start or check-in. In case of more than 1 active price plan for a booking, we take the first one. This could explain changes on the volumes, but probably not the massive difference. We can check this later on.

Secondly, here, it somehow creates a unique booking table based on Guest, Accommodation and CheckIn date. This is very similar on how we handle it in DWH in int_core__duplicate_booking model. To keep in mind though that were not forcing this duplicate booking in DWH for booking charge events.

Thirdly, here, it computes a variable called ListingStartDateByField (which looks like a mistake in the name) that mainly retrieves the PricePlanChargedByTypeId. This comes from the snapshot made on point 1.

The last part is mainly a split on whether the price plan charge by type id is =3 or not. This for human beings means a different logic between A) if the billing needs to be considered in the CheckInDate and B) if it needs to be considered in the VerificationStartDate or All.

Untitled

This is quite similar as we do in DWH in Booking Charge Events, that we just filter by the name instead of the ID.

These queries are massive monsters with around ~10 tables joined. However, these are all left joins (except for the unique bookings mentioned in the second point), so either the booking does not exist in Bookings or the difference lies in the WHERE clauses. For the sake of me not getting a stroke, I hope its the second, so lets go.

Billable at check in here

  • UserVerificationStatusId IS NOT NULL, from SuperhogUser table, understanding this as the guest user because of how its joined
  • GuestUserId IS NOT NULL, from User table
  • CreatedByUserId equals the user id were retrieving
  • CheckIn date is between Extraction Start and Extraction End, in a nutshell

Else (meaning billable at verification start ) here

  • UserVerificationStatusId IS NOT NULL, from SuperhogUser table, understanding this as the guest user because of how its joined
  • GuestUserId IS NOT NULL, from User table
  • CreatedByUserId equals the user id were retrieving
  • and at least one of these conditions needs to be true
    • If the VerificationRequestId from Booking table IS NOT NULL then the UpdatedDate from the VerificationRequest needs to be between Extraction Start and Extraction End
    • If the VerificationRequestId from Booking table IS NULL then the JoinDate from the User table (Guest) needs to be between Extraction Start and Extraction End

If checking this versus what DWH is computing… well, it seems we do not have (or ever had) the UpdatedDate condition for the verification start. It would make sense that the main problem for the Lavanda subject comes from this point, after all, this condition only applies if VerificationRequestId is not filled, meaning its a booking that does not have a verification process because we trust autohost verifications, and that would be in line with what Clay explained.

Lets run a quick query to replicate the verification request being null/not null in DWH for this case. Ill also retrieve the latest price plan for the users assigned to the Id Deals.

Query here!

with pp as (
select pp.*
from intermediate.int_core__price_plans pp
inner join intermediate.int_core__unified_user uu 
on pp.id_user_host = uu.id_user 
and uu.id_deal in ('20529225110', '1604445496')
where pp.end_date_utc >= '2024-07-01' and pp.start_date_utc < '2024-07-01'
)
select 
	host.id_deal,
	b.id_user_host,
	count(distinct b.id_booking) as cd_booking, 
	count(1) as count
from
	intermediate.int_core__bookings b
inner join intermediate.int_core__unified_user host
	on
	b.id_user_host = host.id_user
inner join intermediate.int_core__unified_user guest 
	on
	b.id_user_guest = guest.id_user
left join intermediate.int_core__verification_requests vr 
	on
	b.id_verification_request = vr.id_verification_request
left join pp on pp.id_user_host = b.id_user_host 
where
	guest.id_user_verification_status is not null 
	and guest.id_user is not null
	and host.id_deal in ('20529225110', '1604445496')
	and pp.id_price_plan <> 3
	and
	(
	b.id_verification_request is null
		and date_trunc('month',
		guest.joined_date_utc) = '2024-06-01'
			or 
	b.id_verification_request is not null
			and date_trunc('month',
			vr.updated_date_utc) = '2024-06-01'
	)
group by 1 ,2
order by cd_booking desc

The result of the query is the following:

Untitled

and as you can see its quite close the Lavanda billable bookings from Finance:

Untitled

At this stage I wonder if theres 1 day difference in the period considered between the DWH query vs. Finance export; or if the data is exactly extracted over the last month.

By removing the filter on the deal id and including the 2 types of billing, meaning at verification start and at check in, we get:

Untitled

Which is quite close to the 25,538 billable bookings from Finance. The fact that cd_booking and count display different numbers might be worth to double check to ensure theres no duplicates.

Untitled

Query here!

with pp as (
select pp.*
from intermediate.int_core__price_plans pp
inner join intermediate.int_core__unified_user uu 
on pp.id_user_host = uu.id_user 
where pp.end_date_utc >= '2024-07-01' and pp.start_date_utc < '2024-07-01'
), 
verification_start as (
select 
	host.id_deal,
	b.id_user_host,
	'verification_start' as type,
	count(distinct b.id_booking) as cd_booking, 
	count(1) as count
from
	intermediate.int_core__bookings b
inner join intermediate.int_core__unified_user host
	on
	b.id_user_host = host.id_user
inner join intermediate.int_core__unified_user guest 
	on
	b.id_user_guest = guest.id_user
left join intermediate.int_core__verification_requests vr 
	on
	b.id_verification_request = vr.id_verification_request
left join pp on pp.id_user_host = b.id_user_host 
where
	guest.id_user_verification_status is not null 
	and guest.id_user is not null
	and pp.id_price_plan <> 3
	and
	(
	b.id_verification_request is null
		and date_trunc('month',
		guest.joined_date_utc) = '2024-06-01'
			or 
	b.id_verification_request is not null
			and date_trunc('month',
			vr.updated_date_utc) = '2024-06-01'
	)
group by 1,2,3
),
check_in as (
select 
	host.id_deal,
	b.id_user_host,
	'check-in' as type,
	count(distinct b.id_booking) as cd_booking, 
	count(1) as count
from
	intermediate.int_core__bookings b
inner join intermediate.int_core__unified_user host
	on
	b.id_user_host = host.id_user
inner join intermediate.int_core__unified_user guest 
	on
	b.id_user_guest = guest.id_user
left join intermediate.int_core__verification_requests vr 
	on
	b.id_verification_request = vr.id_verification_request
left join pp on pp.id_user_host = b.id_user_host 
where
	guest.id_user_verification_status is not null 
	and guest.id_user is not null
	and pp.id_price_plan = 3
	and b.id_verification_request is not null
	and date_trunc('month',
			b.check_in_at_utc) = '2024-06-01'
group by 1,2,3
),
totals as (
select * from check_in
union all
select * from verification_start
)
select 
	type,
	sum(cd_booking) as cd_booking,
	sum(count) as count
from totals
group by 1