# KPIs Refactor - Let’s go daily - 2024-10-23
# Context & initial thoughts

Uri here. After the discussions with Product teams regarding their needs for Product KPIs, and anticipating future needs, I’m starting this Notion page to gather thoughts and feedback on a potential refactor of KPIs.


Currently, KPIs follow two different patterns: either they are computed MTD for the global level and a couple of dimensions, or they are computed on a monthly basis per Deal. Life was good and easy - despite having 2 flows - until we started computing Churn. Churn Rates (Revenue, Bookings, Listings) are Deal-dependent and thus need to go into the monthly compute. However, we needed to aggregate them in the MTD computation, and this creates dependencies between the 2 flows that - to me - show we can do better. Especially when considering that the MTD computation is not historicised - we keep each day of the current month with the aggregated data of that month, but next month we lose everything except the last day. Add to this many more specific requirements from Product KPIs and boom! Refactor time <3

SO! We have mostly 2 options:

- We keep Main KPIs as is, and implement each product flow independently.
    - Advantages: looks easier at first!
    - Disadvantages: we’ll have some duplicated logic (e.g. if I want to compute Created Bookings) and some of the Product KPI needs will likely have to be reflected in the Main KPIs anyway (e.g. Revenue per Service of New Pricing).
- We put every KPI in a master flow, and from there we build any reporting that needs a specific KPI.
    - Advantages: everything is centralised, so creating/updating a metric/dimension would likely be easier than changing it N times. Also, much more scalable.
    - Disadvantages: complexity will increase and, likely, it requires quite a bit of work. Especially in ensuring that whatever is currently deployed and used by our users keeps working well.
        - I would challenge the complexity increase. That happens in both cases. I’ll admit this path probably requires more thorough planning and will consume more brain calories, but I don’t think the end result is more complex than the alternative.
        - Regarding ensuring that things keep working well, I have some ideas on how this can be monitored *very* easily. Won’t explain here, let’s discuss in the right places, but I would encourage you to assume you will have a traffic light that will let you know if you’ve broken something while refactoring.


> *Being realistic though, there will likely be some standalone computations coexisting with KPI computations in different reports, so it won’t be fully one-sided. Example: Top Losers (Account Managers report) mostly uses data from existing metrics, but is enriched with Hubspot data specifically needed for RevOps.*

# Proposal: let’s create a semantic model

> *Maybe Semantic Model isn’t the right name for this 🙂. Happy to be challenged.*


I agree that Semantic Model will lead to confusion due to how it’s used by industry/market tools. Perhaps just coming up with some silly name is good enough. `URI` models? 😏

KPI models I think might be better 🤤

I’d like to go for the second option: put every KPI in a master flow. And I’d do it by creating a semantic model. I played a bit with MetricFlow in the past and, despite the tool not being super mature, I got some nice ideas that we could implement without going for a dedicated tool, just by using some standards.


Ideally, I’d build it in the intermediate layer, in a standalone folder called semantic_model or sem or something like that. The main reason is that we’ll likely integrate information from many sources anyway (core, hubspot, etc.), and I’d like to explicitly differentiate the model `int_core__bookings` from `int_sem__bookings` - if it ever exists - since the second will likely depend on the first. Another reason could be to give users outside the Data Team specific access to this eventually more “mature” data model by labelling its models.

## Pre-aggregating data within deepest granularity
Let’s talk about a simple metric, such as Created Bookings. It just computes the count of bookings that have been created in a certain time period - easy-peasy.

The current model of `int_core__mtd_created_bookings_metric` does some crazy computation because it 1) directly applies the MTD logic within the model and 2) directly computes the aggregation per dimension, which usually requires additional joins with other tables. This has already shown some limitations (when adding more dimensions we needed to split the `booking_metrics` model into 4; deal KPIs are mostly monthly because we couldn’t sustain the complexity of combining MTD and Deal granularity).


- A small challenge from my side here. The split of `booking_metrics` into four different models looked positive at the start, but weeks later we found out the performance problems actually originated in the `VACUUM ANALYZE` issues. In any case, I agree with the overall point that some MTD models are overly complex due to having both the metric and the MTD logic in them.

In order to overcome this, I’d create a pre-aggregated model that has the deepest granularity needed which, by definition, is not at `id_booking` level. Continuing with this example of Created Bookings, at the moment, we would just need the following granularity:

- dimensions
    - `date` (created_date_utc)
    - `id_deal`
- metrics
    - `created_bookings` (count of id_booking) → effectively this will be the daily booking count for each combination of dimension values


Why? Because the current set of KPIs just needs:


- Created Bookings per Deal and Month → we already have Deal here, so perfect, and Month can be obtained by doing `date_trunc('month', date)::date` and `sum(created_bookings)` (sketched below).
- MTD Global Created Bookings → we don’t care that id_deal exists here, so no need to take it into account for the aggregation. We could just sum created bookings per day and, later on, apply an MTD computation.
- MTD Created Bookings by Billing Country → Billing Country is Deal-dependent under the assumption we made. Thus, we either already provide the Billing Country in each model OR we join it later with a much smaller Deal table.
- MTD Created Bookings by # of Listings → as before, the Listings segmentation depends on Deal, but also on Date.

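
To make the first two bullets concrete, here is a minimal sketch - assuming the daily pre-aggregated model exists under the `int_sem__created_bookings` name used below, with exactly the `date`, `id_deal` and `created_bookings` columns - of how both aggregations could be derived from the same table:

```sql
-- Sketch only. 1) Created Bookings per Deal and Month.
select
    id_deal,
    date_trunc('month', date)::date as month,
    sum(created_bookings) as created_bookings
from int_sem__created_bookings
group by 1, 2;

-- 2) MTD Global Created Bookings: a running sum within each month, ignoring id_deal.
select
    date,
    sum(sum(created_bookings)) over (
        partition by date_trunc('month', date)
        order by date
        rows between unbounded preceding and current row
    ) as mtd_created_bookings
from int_sem__created_bookings
group by date;
```
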

Ideally, I’d opt for having all necessary joins already handled at the deepest granularity, meaning the model `int_sem__created_bookings` would already contain the needed dimensions within it to ease further logic. It would look like:


- dimensions:
    - `date` (created_date_utc)
    - `id_deal`
    - `billing_country`
    - `listing_segmentation`
- metrics
    - `created_bookings` (count id_booking)


Why? Mainly because, so far, the dimensions we have are somehow Deal-dependent. But if we go for a Listing Country dimension, for instance, it would create an additional join that could be handled here. Need to flag whether a booking is coming from New Dash or Old Dash? Cool, let’s flag it here. Is this booking coming from a PMS, and if so, which one? Cool, flag it here.

## Scalability is key

Having this proposed deepest-granularity setup - assuming it exists - would make things much easier for upper layers, cool. But it’s possible we end up with TONS of dimensions that will make full refreshes very costly. To continue with the example:


- The proposed setup for a table containing `date`, `id_deal`, `created_bookings` would have, at this moment, 150k rows.
- The current state of `int_core__mtd_created_bookings_metric` shows 3.8k rows.
- The current state of `int_core__monthly_booking_history_by_deal` has 29k rows.
- I know I’m annoying, but let me just say it once again: let’s just build it and optimize if and when it’s needed. The volumes described here are perfectly manageable. Incrementality can help a lot. [Microbatching](https://docs.getdbt.com/docs/build/incremental-microbatch) even more.


So yeah, let’s not do full refreshes, at least not every day for what it matters. We could materialise these models as incremental so that each day we just update the new data based on `updated_at_utc`. This way we ensure we don’t have massive executions every day - but we do need to schedule full refreshes from time to time to avoid invisible data drift. Here’s where an orchestration engine will be super useful.

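
A rough sketch of what that could look like for the daily bookings model, assuming `int_core__bookings` exposes an `updated_at_utc` column (that column, the `delete+insert` strategy choice, and the simplified column list are all assumptions here, not a final design):

```sql
-- Sketch only: incremental daily model that fully recomputes any day whose
-- source bookings changed since the last run, so the merged counts stay correct.
{{
    config(
        materialized="incremental",
        incremental_strategy="delete+insert",
        unique_key=["date", "id_deal"],
    )
}}

select
    icb.created_date_utc as date,
    coalesce(icuh.id_deal, 'UNSET') as id_deal,
    count(distinct icb.id_booking) as created_bookings,
    max(icb.updated_at_utc) as updated_at_utc
from {{ ref("int_core__bookings") }} as icb
left join {{ ref("int_core__user_host") }} as icuh
    on icb.id_user_host = icuh.id_user_host
{% if is_incremental() %}
-- only rebuild the days touched since the last successful run
where icb.created_date_utc in (
    select b.created_date_utc
    from {{ ref("int_core__bookings") }} as b
    where b.updated_at_utc > (select max(updated_at_utc) from {{ this }})
)
{% endif %}
group by 1, 2
```
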

We should also consider that adding or modifying an existing dimension or metric would necessarily require a full refresh, though - but just of that semantic model and its upper dependencies.

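
In practice, that targeted rebuild can lean on dbt’s graph selectors rather than a project-wide refresh - something like `dbt build --full-refresh --select int_sem__created_bookings+` to rebuild just that model and everything downstream of it.
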

Afterwards, the rest of the upper logic could be full-refreshed every day as we do now. Since data would be at the deepest granularity, at least in the current state, it wouldn’t be necessary to do crazy joins (maybe just those needed within semantic models to create weighted and converted metrics).


I think it could be useful to have metadata on when a specific dimension/metric pair was 1) created and 2) last updated. This has nothing to do with the creation or update of the booking itself; it’s more about knowing, internally, whether data is updating correctly, is sufficiently fresh, etc. We could even add some tests directly here for the semantic models.

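
One cheap way to get part of that metadata - just an idea, and the `computed_at_utc` audit column is an assumption - would be to stamp each row of the semantic models at build time and then derive per-dimension freshness from it:

```sql
-- Sketch: per-dimension freshness metadata derived from the daily KPI model,
-- assuming each row carries a computed_at_utc timestamp set when it was built.
select
    'by_billing_country' as dimension,
    billing_country as dimension_value,
    min(date) as first_seen_date,
    max(date) as latest_data_date,
    max(computed_at_utc) as last_computed_at_utc
from {{ ref("int_sem__created_bookings") }}
group by 1, 2
```
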
## Some open thoughts

- **Not all dimensions will make sense for all metrics**. For example, we likely don’t care about knowing the Invoiced Athena Revenue per Listing Segmentation since (at the moment) Listing Segmentation is based on Platform users. This effectively means that on upper layers there might be a need to handle different dimension definitions that apply within certain ranges. Not a big deal I think, but just to keep in mind.
- **We might not need all dimensions**. For example, we might want to exclude cancelled bookings from created bookings. Maybe adding a Booking State dimension is overkill when we could just create a dedicated metric such as Created Bookings w/o Cancellations (see the sketch after this list).
- **On deepest granularity**. Whatever we choose as the deepest granularity is something we need to be very cautious about. I’m specifically thinking of a default maximum granularity of Date and Deal. This means no Hour granularity and also no Platform User (User Host) granularity. I’d like to be challenged here if needed.
- **Nulls might be a problem when aggregating**. Not all users have a deal, for instance, and we know that global created bookings will effectively have more bookings than the sum of created bookings per deal. I wonder whether we should explicitly cast nulls as UNSET or similar, especially when joining, to avoid errors and ensure data completeness independently of the granularity chosen. However, consider that UNSET might affect row counts and share computations, such as churn rates.
- Some of these details fly over my head because I’m not deep enough into the mental model of the current setup to see this intuitively. I’m happy for you to do whatever feels best if you feel you can manage. I’m also happy to set time aside and jump into the dirt if you don’t feel in control and truly need another pair of eyes going low level on this.

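
On the dedicated-metric idea from the second bullet, a minimal sketch of what a Created Bookings w/o Cancellations metric could look like next to the base metric - assuming `int_core__bookings` exposes some cancellation flag (the `is_cancelled` column here is hypothetical):

```sql
-- Sketch: a second metric alongside created_bookings, instead of adding a Booking State dimension.
select
    icb.created_date_utc as date,
    coalesce(icuh.id_deal, 'UNSET') as id_deal,
    count(distinct icb.id_booking) as created_bookings,
    count(distinct case when not icb.is_cancelled then icb.id_booking end)
        as created_bookings_wo_cancellations
from {{ ref("int_core__bookings") }} as icb
left join {{ ref("int_core__user_host") }} as icuh
    on icb.id_user_host = icuh.id_user_host
group by 1, 2
```
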
# Refined proposal

1. We create a new folder in the intermediate layer to handle the different extraction and aggregation logic. We name this folder `kpis`.

2. We adopt a standard nomenclature in this folder: all models start with `int_kpis`.

3. We have first-layer models that handle all the necessary joins and pre-aggregate the information at the deepest granularity. Each extraction model has the minimal granularity needed - at this stage, temporality-wise, we go for daily. The naming convention would be `int_kpis__daily_name_of_the_kpi`.


- For instance, we would have an `int_kpis__daily_created_bookings`:


```sql
{{ config(materialized="table", unique_key=["date", "id_deal", "dash_source"]) }}

select
    -- Unique Key --
    icb.created_date_utc as date,
    coalesce(icuh.id_deal, 'UNSET') as id_deal,
    case
        when icbtpb.id_booking is not null then 'New Dash' else 'Old Dash'
    end as dash_source,
    -- Dimensions --
    coalesce(
        icd.main_billing_country_iso_3_per_deal, 'UNSET'
    ) as main_billing_country_iso_3_per_deal,
    coalesce(
        icmas.active_accommodations_per_deal_segmentation, 'UNSET'
    ) as active_accommodations_per_deal_segmentation,
    -- Metrics --
    count(distinct icb.id_booking) as created_bookings
from {{ ref("int_core__bookings") }} as icb
left join
    {{ ref("int_core__user_host") }} as icuh on icb.id_user_host = icuh.id_user_host
left join {{ ref("int_core__deal") }} as icd on icuh.id_deal = icd.id_deal
left join
    {{ ref("int_kpis__daily_accommodation_segmentation") }} as icmas
    on icuh.id_deal = icmas.id_deal
    and icb.created_date_utc = icmas.date
left join
    {{ ref("int_core__booking_to_product_bundle") }} as icbtpb
    on icb.id_booking = icbtpb.id_booking
group by 1, 2, 3, 4, 5
```


- Currently materialised as a table, but easy to adapt to an incremental merge.
- Regarding UNSET: we ensure that there are no nulls in the dimensions. This is beneficial because:
    - 1) future aggregations (monthly, per dimension, etc.) will allow us to ensure data is complete for additive metrics - meaning the sum of a metric across all possible values of any dimension will always match the global figure. This can be - and should be - a data test within dbt (see the sketch after this list).
    - 2) it allows easier joins with other metrics to compute weighted metrics, such as Total Revenue per Created Bookings. It means we could quantify the Total Revenue per Created Bookings of those users that do not have a Deal Id set, for instance.
    - 3) it allows us to quantify the % of data incompleteness we have in a given dimension: for instance, if the Billing Country dimension has 10 UNSET created bookings out of a total of 100, we know that 10% of the bookings cannot be attributed to a specific Billing Country.

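
A sketch of that additivity check as a dbt singular test (a SQL file under `tests/` that must return zero rows), using the dimension-aggregated model introduced in step 5 below; the model and dimension names follow the examples in this proposal:

```sql
-- tests/assert_mtd_created_bookings_additive.sql (sketch)
-- Fails if, on any date, summing created_bookings across all Deal values
-- (UNSET included) does not reproduce the global figure.
with by_deal as (
    select date, sum(created_bookings) as created_bookings
    from {{ ref("int_kpis__dim_agg_mtd_created_bookings") }}
    where dimension = 'by_deal'
    group by 1
),

global_figure as (
    select date, sum(created_bookings) as created_bookings
    from {{ ref("int_kpis__dim_agg_mtd_created_bookings") }}
    where dimension = 'global'
    group by 1
)

select b.date, b.created_bookings as by_deal_total, g.created_bookings as global_total
from by_deal as b
join global_figure as g on b.date = g.date
where b.created_bookings != g.created_bookings
```
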

4. Time aggregations need to depend on these first-layer models. For instance, we can have MTD, YTD, Monthly, etc. If a model just handles the time aggregation and nothing more, it would be called `int_kpis__time_aggregation_name_of_the_kpi`.

- For instance, we would have `int_kpis__mtd_created_bookings`:


```sql
{{
    config(
        materialized="view",
        unique_key=[
            "date",
            "id_deal",
            "dash_source",
            "active_accommodations_per_deal_segmentation",
        ],
    )
}}

select
    -- Unique Key --
    d.date,
    b.id_deal,
    b.dash_source,
    b.active_accommodations_per_deal_segmentation,
    -- Dimensions --
    b.main_billing_country_iso_3_per_deal,
    -- Metrics --
    sum(b.created_bookings) as created_bookings
from {{ ref("int_dates_mtd") }} d
left join
    {{ ref("int_kpis__daily_created_bookings") }} b
    on date_trunc('month', b.date)::date = d.first_day_month
    and extract(day from b.date) <= d.day
where id_deal is not null
group by 1, 2, 3, 4, 5
```


- Bear in mind that the unique key can change in this kind of aggregation. In this case, we keep the previous 3 dimensions (date, id_deal, dash_source) BUT we are forced to include active_accommodations_per_deal_segmentation. The reason is that one Deal will only have one Listing Segment value on a given Date, but can have more than one over a month. In essence, any dimension that can change over the month needs to appear in the unique key.
- **OPEN QUESTION: do we want to store all MTD dates, instead of just the last day of each month + every day of the current month?**


5. We can also aggregate per dimension. These aggregations will be configured in a macro so each model can have shared dimensions (Global, By number of listings, By billing country) that will likely go to Main KPIs, or specific dimensions (By Dash Type - new dash, old dash) that will likely be used in specific domains. These models should be called `int_kpis__dim_agg_time_aggregation_name_of_the_kpi`.

- For instance, we would have `int_kpis__dim_agg_mtd_created_bookings`:


```sql
{% set dimensions = get_kpi_dimensions_per_model("BOOKINGS") %}

{{ config(materialized="table", unique_key=["date", "dimension", "dimension_value"]) }}

{% for dimension in dimensions %}
select
    -- Unique Key --
    date,
    {{ dimension.dimension }} as dimension,
    {{ dimension.dimension_value }} as dimension_value,
    -- Metrics --
    sum(created_bookings) as created_bookings
from {{ ref("int_kpis__mtd_created_bookings") }}
group by 1, 2, 3
{% if not loop.last %}
union all
{% endif %}
{% endfor %}
```


- Note that we retrieve a specific set of dimensions which, in this case, are dedicated to Bookings. This is configured in the macro as follows:


```sql
/*
The following lines specify, for each dimension, the field to be used, in a
standalone macro.
Please note that strings should be encoded with " ' your_value_here ' ",
while fields from tables should be specified like " your_field_here ".
*/
{% macro dim_global() %}
    {{ return({"dimension": "'global'", "dimension_value": "'global'"}) }}
{% endmacro %}

{% macro dim_billing_country() %}
    {{
        return(
            {
                "dimension": "'by_billing_country'",
                "dimension_value": "main_billing_country_iso_3_per_deal",
            }
        )
    }}
{% endmacro %}

{% macro dim_number_of_listings() %}
    {{
        return(
            {
                "dimension": "'by_number_of_listings'",
                "dimension_value": "active_accommodations_per_deal_segmentation",
            }
        )
    }}
{% endmacro %}

{% macro dim_deal() %}
    {{ return({"dimension": "'by_deal'", "dimension_value": "id_deal"}) }}
{% endmacro %}

{% macro dim_dash() %}
    {{ return({"dimension": "'by_dash_source'", "dimension_value": "dash_source"}) }}
{% endmacro %}


/*
Macro: get_kpi_dimensions_per_model

Provides a general assignment of the Dimensions available for each KPI
model. Keep in mind that these assignments need to be previously declared.
*/
{% macro get_kpi_dimensions_per_model(entity_name) %}

    {# Base dimensions shared by all models #}
    {% set base_dimensions = [
        dim_global(),
        dim_number_of_listings(),
        dim_billing_country(),
        dim_deal(),
    ] %}

    {# Initialize a list to hold any model-specific dimensions #}
    {% set additional_dimensions = [] %}

    {# Add entity-specific dimensions #}
    {% if entity_name == "BOOKINGS" %}
        {% set additional_dimensions = [dim_dash()] %}
    {% endif %}

    {# Combine base dimensions with additional dimensions for the specific model #}
    {% set dimensions = base_dimensions + additional_dimensions %}
    {{ return(dimensions) }}

{% endmacro %}
```


6. Be aware that this setup needs daily segmentations to work properly, which might depend on some lifecycle models. For instance, Deal Metrics and Listings Metrics depend on the respective lifecycle models. Thus, lifecycle models need to be computed on a daily basis.

That’s it. For the rest, we will adapt based on the needs as these arise.

**Note**: MTD should have a from and to date to clarify.

**Note**: differentiate namings daily metrics vs. daily

**Note**: migration - all together or transitional → transitional, but taking into account that I don’t want to block Joaquin on Guest KPIs. Add deprecation flags.

**Note**: MTD and Monthly should be different models. We can aggregate later! Include date_from and date_to.