# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: "dwh_dbt"
version: "1.0.0"
config-version: 2
# This setting configures which "profile" dbt uses for this project.
profile: "dwh_dbt"
# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
clean-targets: # directories to be removed by `dbt clean`
- "target"
- "dbt_packages"
# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models
# In this config we set project-wide defaults plus per-directory overrides for
# how dbt materializes models. Any of these settings can be overridden in the
# individual model files using the `{{ config(...) }}` macro.
models:
+unlogged: true
# ^ This makes all tables created by dbt unlogged. This is a Postgres-specific
# setting that we enable for performance: unlogged tables skip the write-ahead
# log, so writes are much faster, but their contents are truncated after a
# crash and they are not replicated. That trade-off is normally considered
# risky, but it fits our needs because every table here can be rebuilt from
# source with a `dbt run`. You can read more here:
# https://www.crunchydata.com/blog/postgresl-unlogged-tables
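# For a model that must survive a crash, this can be overridden in the model
# file itself. Hypothetical model file (`unlogged` is a config supported by
# the dbt-postgres adapter):
#
#   {{ config(unlogged=false) }}
#   select ...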
dwh_dbt:
+post-hook:
sql: "VACUUM ANALYZE {{ this }}"
transaction: false
# ^ This makes dbt run a VACUUM ANALYZE on each model after building it. The
# `transaction: false` part is required because VACUUM cannot run inside a
# transaction block. It's pointless for views, but harmless: Postgres emits a
# warning and skips them rather than raising an error.
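# In practice, after building a table model dbt issues something like the
# following (illustrative relation name; the actual schema depends on the
# target profile):
#
#   VACUUM ANALYZE "reporting"."fct_bookings";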
staging:
+materialized: table
+schema: staging
intermediate:
+materialized: view
+schema: intermediate
reporting:
+materialized: table
+schema: reporting
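# Note: with dbt's default generate_schema_name macro, these `+schema` values
# are appended to the target schema (e.g. `<target_schema>_reporting`), not
# used verbatim. To get the bare schema names, a project can override the
# macro, e.g. (sketch, assuming this project has no other override):
#
#   {% macro generate_schema_name(custom_schema_name, node) -%}
#     {{ custom_schema_name | trim if custom_schema_name else target.schema }}
#   {%- endmacro %}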
seeds:
dwh_dbt:
+schema: staging
vars:
"dbt_date:time_zone": "Europe/London"
# A general cutoff date for relevance: many models assume they only need to
# handle data from this point in time onwards.
"start_date": "'2020-01-01'"
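# The doubled quotes are deliberate (the same applies to the other string vars
# below): the outer quotes are YAML, the inner single quotes survive into the
# compiled SQL, so the var renders as a ready-made SQL string literal.
# Hypothetical usage in a model:
#
#   select * from {{ ref('stg_bookings') }}
#   where created_at >= {{ var('start_date') }}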
# KPIs Start Date. This is the date from which we start calculating KPIs.
"kpis_start_date": "'2022-04-01'"
# New Dash First Invoicing Date. This is the first date considered for New Dash invoicing.
"new_dash_first_invoicing_date": "'2024-12-31'"
# A distant future date to use as a default when cutoff values are missing.
"end_of_time": "'2050-12-31'"
# Booking state variables
# State values are uppercase string literals; models must apply upper() to the
# source column before comparing against them.
"cancelled_booking_state": "'CANCELLED'"
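# Hypothetical usage in a model, applying the upper() convention above:
#
#   where upper(bookings.state) = {{ var('cancelled_booking_state') }}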
"approved_booking_state": "'APPROVED'"
"flagged_booking_state": "'FLAGGED'"
# Payment state variables
# State values are uppercase string literals; models must apply upper() to the
# source column before comparing against them.
"paid_payment_state": "'PAID'"
# Protection service state variables
# State values are uppercase string literals; models must apply upper() to the
# source column before comparing against them.
"default_service": "'BASIC SCREENING'"