# Exploration - MetricFlow - 2024-08-06
[MetricFlow](https://docs.getdbt.com/docs/build/about-metricflow), which powers the [dbt Semantic Layer](https://docs.getdbt.com/docs/use-dbt-semantic-layer/dbt-sl), helps you define and manage the logic for your company's metrics. It's an opinionated set of abstractions and helps data consumers retrieve metric datasets from a data platform quickly and efficiently.
This Notion page summarises the explorations we carried out to determine whether MetricFlow could be a good package to integrate into our DWH to create a semantic layer and centralise KPI definitions, in the context of Business KPIs.
# Table of contents
You can jump directly to Conclusions to check the pros and cons and the recommendation. For examples without setting things up, go to Exploring MetricFlow.
# How to set up your environment
This MetricFlow exploration only requires setting things up in the `data-dwh-dbt-project`.
You will need `dbt-core` >= 1.8 for `dbt-metricflow` to work properly. You can install both from this `requirements.txt` file:
[requirements.txt](requirements.txt)
```
dbt-core~=1.8.4
dbt-postgres~=1.8.2
dbt-metricflow~=0.7.1
```
Once you have the requirements updated and installed, you will need to set up a few new files within DWH. Follow these steps:
1. Create a new folder named `semantic_models`. It should be within the `models` folder, at the same level as the `staging`, `intermediate` and `reporting` folders.
2. In the `semantic_models` folder, copy-paste the following files:
[README.MD](README.md)
[metricflow_time_spine.sql](metricflow_time_spine.sql)
[sem_accommodations.sql](sem_accommodations.sql)
[sem_bookings.sql](sem_bookings.sql)
[sem_accommodations.yaml](sem_accommodations.yaml)
[sem_bookings.yaml](sem_bookings.yaml)
[metrics.yaml](metrics.yaml)
3. Once the files are copied, run `dbt parse` to ensure there's no error.
- Your DWH project should look something like this (a rough text sketch follows the screenshot):
![Untitled](Untitled%204.png)
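In case the screenshot doesn't render, here is a rough sketch of the expected layout. The folder names outside `semantic_models` are those mentioned in step 1; the rest mirrors the file list in step 2.
```
models/
├── staging/
├── intermediate/
├── reporting/
└── semantic_models/
    ├── README.md
    ├── metricflow_time_spine.sql
    ├── sem_accommodations.sql
    ├── sem_accommodations.yaml
    ├── sem_bookings.sql
    ├── sem_bookings.yaml
    └── metrics.yaml
```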
# Exploring MetricFlow
The README file contains similar content to this Notion page, so feel free to skip it.
### Documentation
Make sure to have these links near you when playing with MetricFlow:
- [Introduction to the dbt Semantic Layer](https://docs.getdbt.com/best-practices/how-we-build-our-metrics/semantic-layer-1-intro). The official documentation; you'll find tons of information here. Keep in mind that we're using dbt-core, so not all functionality is available, but it's still a must-read.
- [Query examples](https://docs.getdbt.com/docs/build/metricflow-commands#query-examples). This should provide the main commands to be used when playing around with MetricFlow.
- [Github Example 1](https://github.com/dbt-labs/jaffle-sl-template/tree/main/models/marts/customer360). It's a nice way to see an implementation use case and can help when creating new configurations based on what other people do.
### DWH test case
At this stage we only have a dummy implementation of the semantic models. Some arbitrary decisions have been taken, so this can be challenged.
Firstly, you'll notice that these semantic models live in a dedicated `semantic_models` folder. This separates them from the rest of the modelling within the DWH.
Secondly, you'll notice that we have a `metricflow_time_spine.sql`. It's a bit odd, but it's mandatory for MetricFlow to run. It serves as a master date table, similar to what we have with `int_dates.sql`. The name of this model needs to be exactly `metricflow_time_spine.sql`, otherwise it won't run.
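For reference, a minimal sketch of what such a time spine model could look like. This is not the attached file; it assumes our `int_dates` model exposes a `date_day` column, and that the MetricFlow version pinned above expects a `date_day` column at day granularity.
```sql
-- Minimal sketch, not the attached metricflow_time_spine.sql.
-- Assumes int_dates exposes a date_day column; MetricFlow expects
-- this model to provide a date_day column at day granularity.
select
    date_day
from {{ ref('int_dates') }}
```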
Thirdly, we have the actual implementation of the semantic models, which act as data marts. These can be identified by the prefix `sem_`. So far we have `sem_bookings.sql` and its respective `sem_bookings.yaml`, as well as `sem_accommodations.sql` and `sem_accommodations.yaml`. The `.sql` models are quite straightforward: a copy of the useful fields from the corresponding intermediate tables (see the sketch below). What is interesting is the `.yaml`.
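Before diving into the yaml, here is a hedged sketch of what `sem_bookings.sql` could look like. The column names come from the schema documented below; the `int_bookings` ref is an assumption standing in for the actual intermediate bookings table.
```sql
-- Sketch only: int_bookings is a placeholder for the intermediate bookings table.
-- Column names are taken from the sem_bookings.yaml schema below.
select
    id_booking,
    id_user_guest,
    id_user_host,
    id_accommodation,
    id_verification_request,
    booking_state,
    check_in_date_utc,
    check_out_date_utc,
    created_date_utc,
    updated_date_utc
from {{ ref('int_bookings') }}
```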
Inside the yaml files you'll find the usual schema documentation, as we have for other DWH models. What's new is the following:
- `semantic_models`, for both `sem_accommodations.yaml` and `sem_bookings.yaml`
- `metrics`, only for `metrics.yaml`. Technically, metrics can be defined inside the `sem_xxx.yaml`, but I found it easier to have them separated since you can combine measures from different sources.
Semantic models allow for the configuration of MetricFlow based on entities, dimensions and measures. This is the main layer of configuration. You can check the configurations of the 2 models here:
- sem_accommodations.yaml
```yaml
version: 2
models:
- name: sem_accommodations
description: Accommodations overview data mart, offering key details for each accommodation. One row per accommodation.
columns:
- name: id_accommodation
description: The unique key of the accommodation mart.
tests:
- not_null
- unique
- name: id_user_host
description: The foreign key relating to the user who owns the accommodation.
- name: country_iso_2
description: The country iso code consisting of two characters where the accommodation is located.
- name: country_name
description: The name of the country where the accommodation is located.
- name: country_preferred_currency_code
description: The currency code that is preferred in the country of the accommodation.
- name: is_active
description: Boolean stating if an accommodation can receive bookings (true) or not (false).
- name: created_date_utc
description: The date of when the accommodation is created in our systems.
- name: updated_date_utc
description: The date of when the accommodation has been last updated.
semantic_models:
- name: accommodations # The name of the semantic model
description: |
Accommodation fact table. This table is at the accommodation granularity with one row per accommodation.
defaults:
agg_time_dimension: created_date_utc
model: ref('sem_accommodations')
entities: # Entities, which usually correspond to keys in the table.
- name: id_accommodation
type: primary
- name: id_user_host
type: foreign
expr: id_user_host
dimensions: # Dimensions are either categorical or time. They add additional context to metrics and the typical querying pattern is Metric by Dimension.
- name: created_date_utc
expr: created_date_utc
type: time
type_params:
time_granularity: day
- name: updated_date_utc
expr: updated_date_utc
type: time
type_params:
time_granularity: day
- name: country_preferred_currency_code
type: categorical
expr: country_preferred_currency_code
- name: is_active
type: categorical
expr: is_active
- name: country_iso_2
type: categorical
expr: country_iso_2
measures: # Measures, which are the aggregations on the columns in the table.
- name: accommodation_count
description: The total amount of accommodations.
expr: 1
agg: sum
- name: hosts_with_accommodations_count_distinct
description: Distinct count of host users with a booking.
agg: count_distinct
expr: id_user_host
- name: countries_with_accommodations_count_distinct
agg: count_distinct
expr: country_iso_2
```
- sem_bookings.yaml
```yaml
version: 2
models:
- name: sem_bookings
description: Bookings overview data mart, offering key details for each booking. One row per booking.
columns:
- name: id_booking
description: The unique key of the booking mart.
tests:
- not_null
- unique
- name: id_user_guest
description: The foreign key relating to the user who has booked an accommodation.
- name: id_user_host
description: The foreign key relating to the user who owns the accommodation.
- name: id_accommodation
description: The foreign key relating to the accommodation of this booking.
- name: id_verification_request
description: The foreign key relating to the guest verification request, if any, of this booking. Can be null.
- name: booking_state
description: The state of the booking.
- name: check_in_date_utc
description: The date of when the check in of the booking takes place.
- name: check_out_date_utc
description: The date of when the check out of the booking takes place.
- name: created_date_utc
description: The date of when the booking is created in our systems.
- name: updated_date_utc
description: The date of when the booking has been last updated.
semantic_models:
- name: bookings # The name of the semantic model
description: |
Booking fact table. This table is at the booking granularity with one row per booking.
defaults:
agg_time_dimension: created_date_utc
model: ref('sem_bookings')
entities: # Entities, which usually correspond to keys in the table.
- name: id_booking
type: primary
- name: id_user_guest
type: foreign
expr: id_user_guest
- name: id_user_host
type: foreign
expr: id_user_host
- name: id_accommodation
type: foreign
expr: id_accommodation
- name: id_verification_request
type: foreign
expr: id_verification_request
dimensions: # Dimensions are either categorical or time. They add additional context to metrics and the typical querying pattern is Metric by Dimension.
- name: created_date_utc
expr: created_date_utc
type: time
type_params:
time_granularity: day
- name: check_in_date_utc
expr: check_in_date_utc
type: time
type_params:
time_granularity: day
- name: check_out_date_utc
expr: check_out_date_utc
type: time
type_params:
time_granularity: day
- name: updated_date_utc
expr: updated_date_utc
type: time
type_params:
time_granularity: day
- name: booking_state
type: categorical
expr: booking_state
measures: # Measures, which are the aggregations on the columns in the table.
- name: booking_count
description: The total amount of bookings.
expr: 1
agg: sum
- name: hosts_with_bookings_count_distinct
description: Distinct count of host users with a booking.
agg: count_distinct
expr: id_user_host
- name: guests_with_bookings_count_distinct
description: Distinct count of guest users with a booking.
agg: count_distinct
expr: id_user_guest
- name: accommodations_with_bookings_count_distinct
description: Distinct count of accommodations with a booking.
agg: count_distinct
expr: id_accommodation
- name: verification_requests_with_bookings_count_distinct
agg: count_distinct
expr: id_verification_request
```
Metrics provide a way to read from the previously configured semantic models and can be used to query the data in many ways, including by different dimensions, orderings, time periods, filters, etc. Here's the configuration of the metrics file:
- metrics.yaml
```yaml
version: 2
metrics:
- name: total_accommodations
description: Count of unique accommodations.
label: Total Accommodations
type: simple
type_params:
measure: accommodation_count
- name: total_bookings
description: Count of unique bookings.
label: Total Bookings
type: simple
type_params:
measure: booking_count
- name: bookings_growth_mom
description: Percentage growth of bookings compared to 1 month ago.
type: derived
label: Bookings Growth % M/M
type_params:
expr: (current_bookings - previous_month_bookings) * 100 / previous_month_bookings
metrics:
- name: total_bookings
alias: current_bookings
- name: total_bookings
offset_window: 1 month
alias: previous_month_bookings
```
Note: when changing the configuration of the yaml files, I recommend running:
`dbt parse`
Otherwise the changes don't seem to be applied, even after saving the files.
### Running queries
To run queries we need to use the MetricFlow CLI. The available options can be listed with:
`mf --help`
![Untitled](Untitled%205.png)
To run queries you'll need to use the `mf query` command. To see its options, run:
`mf query --help`
![Untitled](Untitled%206.png)
Let's try to run some queries!
Retrieve the total amount of bookings
`mf query --metrics total_bookings`
![Untitled](Untitled%207.png)
Retrieve the total amount of bookings by booking state
`mf query --metrics total_bookings --group-by id_booking__booking_state`
![Untitled](Untitled%208.png)
Not impressed? Well, imagine you want to...:
Retrieve the total amount of bookings created, ordered by date, for the period 15 May 2024 to 30 May 2024, and compare the % growth in total bookings vs. what was observed in the previous month.
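A command along these lines should cover it. This is a hedged sketch: it assumes the standard `mf query` flags listed in `--help` above (`--start-time`, `--end-time`, `--order`) and the built-in `metric_time` dimension.
`mf query --metrics total_bookings,bookings_growth_mom --group-by metric_time --order metric_time --start-time 2024-05-15 --end-time 2024-05-30`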
When setting up different semantic models, make sure to properly configure the foreign keys. For example, `sem_bookings` has a foreign key on `id_accommodation` that links to the primary key of the `sem_accommodations` table. This ensures that you can query metrics defined in the `sem_bookings` model split by dimensions that live in the `sem_accommodations` model. For instance, to retrieve the whole history of total bookings per country, in descending total bookings order:
`mf query --metrics total_bookings --group-by id_accommodation__country_iso_2 --order -total_bookings`
![Untitled](Untitled%209.png)
Note the use of '-' in the order clause to sort in descending order.
If we want to see the query that MetricFlow generates, we can add the `--explain` argument when running a query. For instance, for the previous query:
`mf query --metrics total_bookings --group-by id_accommodation__country_iso_2 --order -total_bookings --explain`
This results in the following output:
```
✔ Success 🦄 - query completed after 0.15 seconds
🔎 SQL (remove --explain to see data or add --show-dataflow-plan to see the generated dataflow plan):
SELECT
accommodations_src_10000.is_active AS id_accommodation__is_active
, SUM(subq_2.booking_count) AS total_bookings
FROM (
SELECT
id_accommodation
, 1 AS booking_count
FROM "dwh"."working"."sem_bookings" bookings_src_10000
) subq_2
LEFT OUTER JOIN
"dwh"."working"."sem_accommodations" accommodations_src_10000
ON
subq_2.id_accommodation = accommodations_src_10000.id_accommodation
GROUP BY
accommodations_src_10000.is_active
ORDER BY total_bookings DESC
```
# Conclusions
**Pros**:
- Very flexible KPI combination. As long as configurations are well built and models are properly normalised, you get a great deal of flexibility, since you can define dimensions wherever you want.
- Scalability. Code-wise, each time we need a new dimension or metric, all previous dimensions and metrics already exist, so it's quite easy to add new ones without breaking existing models.
- Clean structure. Forcing the yaml configuration ensures the structures are clear and documented (otherwise metrics won't work), which is very nice.
**Cons**:
- Learning curve; relatively new package. I struggled a lot to find information on setting it up, running commands, etc. There's very little information available, and part of it is outdated because it refers to other versions of MetricFlow, which generates a bit of frustration.
- Unable to materialise as tables. From what I've read, dbt Cloud has integrations and different ways to expose this to data viz tools. On dbt Core, things are much more limited at the moment. I've spent some time looking into materialising tables: there's no native command, and maybe it could be possible with a Python integration, but that's quite overkill at this stage.
**Recommendation: do not use MetricFlow for the time being. I'd gladly re-explore it once it has new features, proper documentation and more users developing with it.**