sh-notion/notion_data_team_no_files/Data quality assessment Billable Bookings 97008b7f1cbb4beb98295a22528acd03.md

576 lines
26 KiB
Markdown
Raw Permalink Normal View History

2025-07-11 16:15:17 +02:00
# Data quality assessment: Billable Bookings
**2024-07-17, by Uri**
This page aims to document the differences in Billable Bookings observed from Finance point of view vs. DWH point of view. This originates from the fact that, wanting to expose this KPI through DWH prove to be inconsistent with the billable bookings coming from Finance side.
# Finance side
Finance retrieves the Billable Bookings from the monthly accounting reports, that are generated usually on the first working day of each month by the project [data-invoicing-exporter](https://guardhog.visualstudio.com/Data/_git/data-invoicing-exporter). Specifically, they retrieve it from the **Account Reports Summary**, summing the tab of **BillableBookingCount**. Jamie provided the file they were using for June 2024 so I (Uri) could dig more into this. There was no amendment made this month.
The number of billable bookings according to Finance in June 2024 is: **25,538**
# DWH side
The billable bookings should come from the table **int_core__booking_charge_events**. Mainly, theres 2 ways of charge a booking: either when the **verification** starts, or when the **check-in** starts.
The verification start date that is used for the logic has changed recently (mid June 24) because of the new estimated Guest Journey Start Date, that is when the link is used. Here you can see the changes:
- [Commit 4f672800 - Change on int_core__verification_requests](https://guardhog.visualstudio.com/Data/_git/data-dwh-dbt-project/commit/4f6728003a30f7188f10c48cc60eb2191d604907?refName=refs/heads/master&path=/models/intermediate/core/int_core__verification_requests.sql&_a=compare)
- [Commit 4f672800 - Change on int_core__booking_charge_events](https://guardhog.visualstudio.com/Data/_git/data-dwh-dbt-project/commit/4f6728003a30f7188f10c48cc60eb2191d604907?refName=refs%2Fheads%2Fmaster&path=%2Fmodels%2Fintermediate%2Fcore%2Fint_core__booking_charge_events.sql&_a=compare)
Therefore, before this change, the booking charge events was considered a verification start that corresponded to **joined_at_utc** coming from the **int_core__unified_user** table.
In a nutshell, running this on 17th July 2024 for the month of June, the values retrieved are:
- Using verification estimated start date (link used at): **23,440**
- Using verification guest joined date: **21,604**
### Query available here
```sql
with
booking_charge_events_joined_at as (
with
stg_core__booking as (select * from staging.stg_core__booking),
int_core__price_plans as (select * from intermediate.int_core__price_plans),
int_core__verification_requests as (
select * from intermediate.int_core__verification_requests
),
booking_with_relevant_price_plans as (
select
*,
case
when pp.price_plan_charged_by_type = 'CheckInDate'
then b.check_in_at_utc
when pp.price_plan_charged_by_type in ('VerificationStartDate', 'All')
then icuu.joined_at_utc
end as booking_fee_charge_at_utc
from stg_core__booking b
left join
int_core__verification_requests vr
on b.id_verification_request = vr.id_verification_request
left join intermediate.int_core__unified_user icuu
on vr.id_user_guest = icuu.id_user
left join int_core__price_plans pp on b.id_user_host = pp.id_user_host
where
-- The dates that defines which price plan applies to the booking depends
-- on charged by type. With the below case, we evaluate if a certain price
-- plan relates to the booking
case
when pp.price_plan_charged_by_type = 'CheckInDate'
then b.check_in_at_utc between pp.start_at_utc and pp.end_at_utc
when pp.price_plan_charged_by_type in ('VerificationStartDate', 'All')
then
coalesce(
(
icuu.joined_at_utc
between pp.start_at_utc and pp.end_at_utc
),
false
)
else false
end
= true
),
untied_bookings as (
-- If a booking has two valid price plans, take the earliest
select id_booking, min(id_price_plan) as id_price_plan
from booking_with_relevant_price_plans brpp
group by id_booking
)
select
ub.id_booking,
ub.id_price_plan,
brpp.booking_fee_local,
brpp.booking_fee_charge_at_utc,
cast(brpp.booking_fee_charge_at_utc as date) as booking_fee_charge_date_utc
from untied_bookings ub
left join
booking_with_relevant_price_plans brpp
on ub.id_booking = brpp.id_booking
and ub.id_price_plan = brpp.id_price_plan
),
booking_charge_events_estimated_at as (
with
stg_core__booking as (select * from staging.stg_core__booking),
int_core__unified_user as (select * from intermediate.int_core__unified_user),
int_core__price_plans as (select * from intermediate.int_core__price_plans),
int_core__verification_requests as (
select * from intermediate.int_core__verification_requests
),
booking_with_relevant_price_plans as (
select
*,
case
when pp.price_plan_charged_by_type = 'CheckInDate'
then b.check_in_at_utc
when pp.price_plan_charged_by_type in ('VerificationStartDate', 'All')
then vr.verification_estimated_started_at_utc
end as booking_fee_charge_at_utc
from stg_core__booking b
left join
int_core__verification_requests vr
on b.id_verification_request = vr.id_verification_request
left join int_core__price_plans pp on b.id_user_host = pp.id_user_host
where
-- The dates that defines which price plan applies to the booking depends
-- on charged by type. With the below case, we evaluate if a certain price
-- plan relates to the booking
case
when pp.price_plan_charged_by_type = 'CheckInDate'
then b.check_in_at_utc between pp.start_at_utc and pp.end_at_utc
when pp.price_plan_charged_by_type in ('VerificationStartDate', 'All')
then
coalesce(
(
vr.verification_estimated_started_at_utc
between pp.start_at_utc and pp.end_at_utc
),
false
)
else false
end
= true
),
untied_bookings as (
-- If a booking has two valid price plans, take the earliest
select id_booking, min(id_price_plan) as id_price_plan
from booking_with_relevant_price_plans brpp
group by id_booking
)
select
ub.id_booking,
ub.id_price_plan,
brpp.booking_fee_local,
brpp.booking_fee_charge_at_utc,
cast(brpp.booking_fee_charge_at_utc as date) as booking_fee_charge_date_utc
from untied_bookings ub
left join
booking_with_relevant_price_plans brpp
on ub.id_booking = brpp.id_booking
and ub.id_price_plan = brpp.id_price_plan
)
select
'verification_estimated_at' as type,
COUNT(distinct id_booking) as billable_bookings
from booking_charge_events_estimated_at
where date_trunc('month', booking_fee_charge_date_utc) = '2024-06-01'
union all
select
'verification_guest_joined_at' as type,
COUNT(distinct id_booking) as billable_bookings
from booking_charge_events_joined_at
where date_trunc('month', booking_fee_charge_date_utc) = '2024-06-01'
```
# First volume check
Well, given that theres clearly a difference between Finance numbers and what we can obtain from DWH (in either case…), before jumping into some python-based project that runs queries in the Core schema directly… maybe lets do some simple checks.
The Account Reports Summary excel is split per Company name, and it should be easy to retrieve the Deal Id linked to the company name (at least manually) for the most notable cases. So I just ordered the file in descending order of billable bookings and retrieved the first one:
- Company Name: LavandaBilling Account
- Billable Bookings: 3,178
This has a couple of id_deals linked to it: '1604445496','20529225110
![Untitled](Untitled%2022.png)
### Query available here
```sql
select id_user as id_user_host, id_deal, company_name
from intermediate.int_core__unified_user icuu
where id_deal IN ('1604445496','20529225110')
```
The good news is that we have the KPIs view by Deal, so we can retrieve the information for these 2 Deals super fast. It will be considering the estimated start date, as its the current behaviour for DWH.
The bad news is that… well… theres no billable bookings at all!
![Untitled](Untitled%2023.png)
### Query available here
```sql
select
*
from intermediate.int_monthly_aggregated_metrics_history_by_deal
where date = '2024-06-30'
and id_deal in ('20529225110','1604445496')
order by date desc
```
So clearly theres a big difference hiding somewhere.
Do we have a similar behaviour if using the **joined_at_utc** for verification start in **booking_charge_events**?
… well, yes. Exactly 0 billable bookings!
![Untitled](Untitled%2024.png)
### Query available here
```sql
with
booking_charge_events_joined_at as (
with
stg_core__booking as (select * from staging.stg_core__booking),
int_core__price_plans as (select * from intermediate.int_core__price_plans),
int_core__verification_requests as (
select * from intermediate.int_core__verification_requests
),
booking_with_relevant_price_plans as (
select
*,
case
when pp.price_plan_charged_by_type = 'CheckInDate'
then b.check_in_at_utc
when pp.price_plan_charged_by_type in ('VerificationStartDate', 'All')
then icuu.joined_at_utc
end as booking_fee_charge_at_utc
from stg_core__booking b
left join
int_core__verification_requests vr
on b.id_verification_request = vr.id_verification_request
left join intermediate.int_core__unified_user icuu
on vr.id_user_guest = icuu.id_user
left join int_core__price_plans pp on b.id_user_host = pp.id_user_host
where
-- The dates that defines which price plan applies to the booking depends
-- on charged by type. With the below case, we evaluate if a certain price
-- plan relates to the booking
case
when pp.price_plan_charged_by_type = 'CheckInDate'
then b.check_in_at_utc between pp.start_at_utc and pp.end_at_utc
when pp.price_plan_charged_by_type in ('VerificationStartDate', 'All')
then
coalesce(
(
icuu.joined_at_utc
between pp.start_at_utc and pp.end_at_utc
),
false
)
else false
end
= true
),
untied_bookings as (
-- If a booking has two valid price plans, take the earliest
select id_booking, min(id_price_plan) as id_price_plan
from booking_with_relevant_price_plans brpp
group by id_booking
)
select
ub.id_booking,
ub.id_price_plan,
brpp.booking_fee_local,
brpp.booking_fee_charge_at_utc,
cast(brpp.booking_fee_charge_at_utc as date) as booking_fee_charge_date_utc
from untied_bookings ub
left join
booking_with_relevant_price_plans brpp
on ub.id_booking = brpp.id_booking
and ub.id_price_plan = brpp.id_price_plan
),
booking_charge_events_estimated_at as (
with
stg_core__booking as (select * from staging.stg_core__booking),
int_core__unified_user as (select * from intermediate.int_core__unified_user),
int_core__price_plans as (select * from intermediate.int_core__price_plans),
int_core__verification_requests as (
select * from intermediate.int_core__verification_requests
),
booking_with_relevant_price_plans as (
select
*,
case
when pp.price_plan_charged_by_type = 'CheckInDate'
then b.check_in_at_utc
when pp.price_plan_charged_by_type in ('VerificationStartDate', 'All')
then vr.verification_estimated_started_at_utc
end as booking_fee_charge_at_utc
from stg_core__booking b
left join
int_core__verification_requests vr
on b.id_verification_request = vr.id_verification_request
left join int_core__price_plans pp on b.id_user_host = pp.id_user_host
where
-- The dates that defines which price plan applies to the booking depends
-- on charged by type. With the below case, we evaluate if a certain price
-- plan relates to the booking
case
when pp.price_plan_charged_by_type = 'CheckInDate'
then b.check_in_at_utc between pp.start_at_utc and pp.end_at_utc
when pp.price_plan_charged_by_type in ('VerificationStartDate', 'All')
then
coalesce(
(
vr.verification_estimated_started_at_utc
between pp.start_at_utc and pp.end_at_utc
),
false
)
else false
end
= true
),
untied_bookings as (
-- If a booking has two valid price plans, take the earliest
select id_booking, min(id_price_plan) as id_price_plan
from booking_with_relevant_price_plans brpp
group by id_booking
)
select
ub.id_booking,
ub.id_price_plan,
brpp.booking_fee_local,
brpp.booking_fee_charge_at_utc,
cast(brpp.booking_fee_charge_at_utc as date) as booking_fee_charge_date_utc
from untied_bookings ub
left join
booking_with_relevant_price_plans brpp
on ub.id_booking = brpp.id_booking
and ub.id_price_plan = brpp.id_price_plan
)
select
'verification_estimated_at' as type,
COUNT(distinct bce.id_booking) as billable_bookings
from booking_charge_events_estimated_at bce
inner join intermediate.int_core__bookings b
on bce.id_booking = b.id_booking
inner join intermediate.int_core__unified_user u
on b.id_user_host = u.id_user
and u.id_deal in ('1604445496','20529225110')
where date_trunc('month', bce.booking_fee_charge_date_utc) = '2024-06-01'
union all
select
'verification_guest_joined_at' as type,
COUNT(distinct bce.id_booking) as billable_bookings
from booking_charge_events_joined_at bce
inner join intermediate.int_core__bookings b
on bce.id_booking = b.id_booking
inner join intermediate.int_core__unified_user u
on b.id_user_host = u.id_user
and u.id_deal in ('1604445496','20529225110')
where date_trunc('month', bce.booking_fee_charge_date_utc) = '2024-06-01'
```
So clearly we have a situation here. Thankfully, Clay took the time to explain to me that this company is a special case in which these bookings do not have Guest Journeys, so looks suspicious to me in the sense that probably this is affecting somehow the logic in DWH. This is because its an “autohost account”, that might probably come from Partner API.
I also tried with the 2nd Company with more Billable Bookings in June 2024, namely:
- Company Name: Home Team Vacation Rentals LLC
- Billable Bookings: 1,350
- Deal Id: 15463726437
Reusing the previous query and changing the deal id, we get these results:
![Untitled](Untitled%2025.png)
So its looking MUCH better, even though its true that we miss/exceed the 1,350 for a few bookings. Lets be pragmatic and focus first on the big issues, and then on the small ones…
# Debugging data-invoicing-exporter
Ive already put some comments on the DevOps ticket, but might be worth to aim to understand where these Lavanda Billable Bookings come from.
Billable Booking Count is computed in `dashboard_tools.py` [here](https://guardhog.visualstudio.com/Data/_git/data-invoicing-exporter?path=/sh_invoicing/dashboard_tools.py&version=GBmain&line=122&lineEnd=123&lineStartColumn=1&lineEndColumn=1&lineStyle=plain&_a=contents), which depends on `user_report_contents` [here](https://guardhog.visualstudio.com/Data/_git/data-invoicing-exporter?path=/sh_invoicing/dashboard_tools.py&version=GBmain&line=81&lineEnd=82&lineStartColumn=1&lineEndColumn=1&lineStyle=plain&_a=contents). In turn, this is retrieved from the function `get_user_data`, defined [here](https://guardhog.visualstudio.com/Data/_git/data-invoicing-exporter?path=/sh_invoicing/dashboard_tools.py&version=GBmain&line=186&lineEnd=187&lineStartColumn=1&lineEndColumn=1&lineStyle=plain&_a=contents).
From here, we can retrieve the query being used on bookings `queries.get_query_booking_data_of_user` [here](https://guardhog.visualstudio.com/Data/_git/data-invoicing-exporter?path=/sh_invoicing/dashboard_tools.py&version=GBmain&line=200&lineEnd=201&lineStartColumn=1&lineEndColumn=1&lineStyle=plain&_a=contents).
Time to switch to queries.py, where the query were interested in is [here](https://guardhog.visualstudio.com/Data/_git/data-invoicing-exporter?path=/sh_invoicing/queries.py&version=GBmain&line=126&lineEnd=127&lineStartColumn=1&lineEndColumn=1&lineStyle=plain&_a=contents). This query has some transactional sql.
Firstly, [here](https://guardhog.visualstudio.com/Data/_git/data-invoicing-exporter?path=/sh_invoicing/queries.py&version=GBmain&line=140&lineEnd=141&lineStartColumn=1&lineEndColumn=1&lineStyle=plain&_a=contents), it creates a snapshot of the latest available `PricePlanToUser` at the moment of the extraction, **to be applied to all the bookings of the exporting time period considered**. These are the ones that are going to be used for the export. This is, technically, different from what we do in DWH in which **we apply the price plan according on what was active at the moment of verification start or check-in**. In case of more than 1 active price plan for a booking, we take the first one. This could explain changes on the volumes, but probably not the massive difference. We can check this later on.
Secondly, [here](https://guardhog.visualstudio.com/Data/_git/data-invoicing-exporter?path=/sh_invoicing/queries.py&version=GBmain&line=166&lineEnd=167&lineStartColumn=1&lineEndColumn=1&lineStyle=plain&_a=contents), it somehow creates a unique booking table based on Guest, Accommodation and CheckIn date. This is very similar on how we handle it in DWH in `int_core__duplicate_booking` model. To keep in mind though that were not forcing this duplicate booking in DWH for booking charge events.
Thirdly, [here](https://guardhog.visualstudio.com/Data/_git/data-invoicing-exporter?path=/sh_invoicing/queries.py&version=GBmain&line=174&lineEnd=175&lineStartColumn=1&lineEndColumn=1&lineStyle=plain&_a=contents), it computes a variable called `ListingStartDateByField` (which looks like a mistake in the name) that mainly retrieves the `PricePlanChargedByTypeId`. This comes from the snapshot made on point 1.
The last part is mainly a split on whether the price plan charge by type id is =3 or not. This for human beings means a different logic between A) if the billing needs to be considered in the CheckInDate and B) if it needs to be considered in the VerificationStartDate or All.
![Untitled](Untitled%2026.png)
This is quite similar as we do in DWH in Booking Charge Events, that we just filter by the name instead of the ID.
These queries are massive monsters with around ~10 tables joined. However, these are all left joins (except for the unique bookings mentioned in the second point), so either the booking does not exist in Bookings or the difference lies in the WHERE clauses. For the sake of me not getting a stroke, I hope its the second, so lets go.
Billable at check in [here](https://guardhog.visualstudio.com/Data/_git/data-invoicing-exporter?path=/sh_invoicing/queries.py&version=GBmain&line=180&lineEnd=181&lineStartColumn=1&lineEndColumn=1&lineStyle=plain&_a=contents)
- UserVerificationStatusId IS NOT NULL, from SuperhogUser table, understanding this as the guest user because of how its joined
- GuestUserId IS NOT NULL, from User table
- CreatedByUserId equals the user id were retrieving
- CheckIn date is between Extraction Start and Extraction End, in a nutshell
Else (meaning billable at verification start ) [here](https://guardhog.visualstudio.com/Data/_git/data-invoicing-exporter?path=/sh_invoicing/queries.py&version=GBmain&line=237&lineEnd=238&lineStartColumn=1&lineEndColumn=1&lineStyle=plain&_a=contents)
- UserVerificationStatusId IS NOT NULL, from SuperhogUser table, understanding this as the guest user because of how its joined
- GuestUserId IS NOT NULL, from User table
- CreatedByUserId equals the user id were retrieving
- and at least one of these conditions needs to be true
- If the VerificationRequestId from Booking table IS NOT NULL then the UpdatedDate from the VerificationRequest needs to be between Extraction Start and Extraction End
- If the VerificationRequestId from Booking table IS NULL then the JoinDate from the User table (Guest) needs to be between Extraction Start and Extraction End
If checking this versus what DWH is computing… well, it seems we do not have (or ever had) the UpdatedDate condition for the verification start. It would make sense that the main problem for the Lavanda subject comes from this point, after all, this condition only applies if VerificationRequestId is not filled, meaning its a booking that does not have a verification process because we trust autohost verifications, and that would be in line with what Clay explained.
Lets run a quick query to replicate the verification request being null/not null in DWH for this case. Ill also retrieve the latest price plan for the users assigned to the Id Deals.
### Query here!
```sql
with pp as (
select pp.*
from intermediate.int_core__price_plans pp
inner join intermediate.int_core__unified_user uu
on pp.id_user_host = uu.id_user
and uu.id_deal in ('20529225110', '1604445496')
where pp.end_date_utc >= '2024-07-01' and pp.start_date_utc < '2024-07-01'
)
select
host.id_deal,
b.id_user_host,
count(distinct b.id_booking) as cd_booking,
count(1) as count
from
intermediate.int_core__bookings b
inner join intermediate.int_core__unified_user host
on
b.id_user_host = host.id_user
inner join intermediate.int_core__unified_user guest
on
b.id_user_guest = guest.id_user
left join intermediate.int_core__verification_requests vr
on
b.id_verification_request = vr.id_verification_request
left join pp on pp.id_user_host = b.id_user_host
where
guest.id_user_verification_status is not null
and guest.id_user is not null
and host.id_deal in ('20529225110', '1604445496')
and pp.id_price_plan <> 3
and
(
b.id_verification_request is null
and date_trunc('month',
guest.joined_date_utc) = '2024-06-01'
or
b.id_verification_request is not null
and date_trunc('month',
vr.updated_date_utc) = '2024-06-01'
)
group by 1 ,2
order by cd_booking desc
```
The result of the query is the following:
![Untitled](Untitled%2027.png)
and as you can see its quite close the Lavanda billable bookings from Finance:
![Untitled](Untitled%2028.png)
At this stage I wonder if theres 1 day difference in the period considered between the DWH query vs. Finance export; or if the data is exactly extracted over the last month.
By removing the filter on the deal id and including the 2 types of billing, meaning at verification start and at check in, we get:
![Untitled](Untitled%2029.png)
Which is quite close to the 25,538 billable bookings from Finance. The fact that cd_booking and count display different numbers might be worth to double check to ensure theres no duplicates.
![Untitled](Untitled%2030.png)
### Query here!
```sql
with pp as (
select pp.*
from intermediate.int_core__price_plans pp
inner join intermediate.int_core__unified_user uu
on pp.id_user_host = uu.id_user
where pp.end_date_utc >= '2024-07-01' and pp.start_date_utc < '2024-07-01'
),
verification_start as (
select
host.id_deal,
b.id_user_host,
'verification_start' as type,
count(distinct b.id_booking) as cd_booking,
count(1) as count
from
intermediate.int_core__bookings b
inner join intermediate.int_core__unified_user host
on
b.id_user_host = host.id_user
inner join intermediate.int_core__unified_user guest
on
b.id_user_guest = guest.id_user
left join intermediate.int_core__verification_requests vr
on
b.id_verification_request = vr.id_verification_request
left join pp on pp.id_user_host = b.id_user_host
where
guest.id_user_verification_status is not null
and guest.id_user is not null
and pp.id_price_plan <> 3
and
(
b.id_verification_request is null
and date_trunc('month',
guest.joined_date_utc) = '2024-06-01'
or
b.id_verification_request is not null
and date_trunc('month',
vr.updated_date_utc) = '2024-06-01'
)
group by 1,2,3
),
check_in as (
select
host.id_deal,
b.id_user_host,
'check-in' as type,
count(distinct b.id_booking) as cd_booking,
count(1) as count
from
intermediate.int_core__bookings b
inner join intermediate.int_core__unified_user host
on
b.id_user_host = host.id_user
inner join intermediate.int_core__unified_user guest
on
b.id_user_guest = guest.id_user
left join intermediate.int_core__verification_requests vr
on
b.id_verification_request = vr.id_verification_request
left join pp on pp.id_user_host = b.id_user_host
where
guest.id_user_verification_status is not null
and guest.id_user is not null
and pp.id_price_plan = 3
and b.id_verification_request is not null
and date_trunc('month',
b.check_in_at_utc) = '2024-06-01'
group by 1,2,3
),
totals as (
select * from check_in
union all
select * from verification_start
)
select
type,
sum(cd_booking) as cd_booking,
sum(count) as count
from totals
group by 1
```