# DWH dbt
Welcome to Superhog's DWH dbt project. Here we model the entire DWH.
## How to set up your environment
### Basics
- Prerequisites
  - You need a Unix-like environment: Linux, macOS, or WSL.
  - You need to have Python `>=3.10` installed.
  - All docs will assume you are using VSCode.
  - Also install the following VSCode Python extension: `ms-python.python`.
- Prepare networking
  - You must be able to reach the DWH server through the network. There are several ways to do this.
  - The current recommended route is to use the data VPN. You can ask Pablo to help you set it up.
- Set up
  - Create a virtual environment for the project with `python3 -m venv venv`.
  - It's recommended that you set the new `venv` as your default interpreter for VSCode. To do this, press Ctrl+Shift+P, look for the `Python: Select Interpreter` option, and choose the new `venv`.
  - Activate the virtual environment and run `pip install -r requirements.txt`.
  - Create an entry for this project in your `profiles.yml` file at `~/.dbt/profiles.yml`. There is a suggested template at `profiles.yml.example`, and a sketch after this list.
  - Make sure that the `profiles.yml` host and port settings are consistent with whatever networking approach you've taken.
  - Run `chmod 600 ~/.dbt/profiles.yml` to secure your profiles file.
  - Run `dbt deps` to install the dbt package dependencies.
- Check
  - Ensure you are running in the project venv.
  - Run `dbt debug`. If it runs cleanly, you are all set. If it fails, something is wrong with your setup: grab the terminal output and pull the thread from there.
- Complements
  - If you are in VSCode, you will most likely want to have this extension installed: [dbt Power User](https://marketplace.visualstudio.com/items?itemName=innoverio.vscode-dbt-power-user)
  - It is advised to use [this autoformatter](https://sqlfmt.com/) and to automatically [run it on save](https://docs.sqlfmt.com/integrations/vs-code). **Important**: if you have already installed dbt Power User, [follow the instructions of this link directly](https://docs.sqlfmt.com/integrations/dbt-power-user).
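
For reference, the whole setup boils down to a handful of commands, run from the repository root:

```bash
# Create and activate the virtual environment
python3 -m venv venv
source venv/bin/activate

# Install the Python dependencies and the dbt packages
pip install -r requirements.txt
dbt deps

# Sanity check: this should succeed end to end
dbt debug
```

And here is a sketch of what the `profiles.yml` entry could look like. Treat it as illustrative, not authoritative: copy `profiles.yml.example` and adapt it. The profile name must match the `profile` set in `dbt_project.yml`, and the `type` below assumes a PostgreSQL-compatible DWH, so swap it for whatever engine the DWH actually runs on.

```yaml
# ~/.dbt/profiles.yml -- hypothetical sketch; all values are placeholders
dwh:
  target: dev
  outputs:
    dev:
      type: postgres  # assumption; adjust to the DWH's actual engine
      host: <host-reachable-through-your-networking-setup>
      port: 5432
      user: <your-user>
      password: <your-password>
      dbname: <database>
      schema: <your-dev-schema>
      threads: 4
```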
### Local DWH
Running a local version of the DWH allows you to test things as you develop: a must if you want to push changes to master without breaking everything.
You can read how to set this up in `dev-env/local_dwh.md`.
## Branching strategy
This repo follows a trunk-based development philosophy (<https://trunkbaseddevelopment.com/>).
If your branch is related to a work item from DevOps, we encourage adding the ticket number in the branch name. For example: `models/123-some-fancy-name`. If you don't have a ticket number, you can simply do a `NOTICKET` one: `models/NOTICKET-some-fancy-name`.
When working on data modeling stuff (models, sources, seeds, docs, etc.), use a `models` branch (e.g. `models/782-churned-users`). It's fine and encouraged to build incrementally towards a `reporting`-level table with multiple PRs, as long as you keep the model buildable along the way.
For other matters, use a `chores` branch (e.g. `chores/656-add-dbt-package`).
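
In practice, starting a branch looks like this:

```bash
# Data modeling work tied to DevOps ticket 782
git checkout -b models/782-churned-users

# A chore without a ticket
git checkout -b chores/NOTICKET-add-dbt-package
```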
## Project organization

We organize models in three folders:

- `staging`
  - Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/2-staging>
  - One `.yml` per `sync` schema, with naming `_<sourcename>_sources.yml`. For example, for Core, `_core_sources.yml` (see the sketch below this list).
  - All models go prefixed with `stg_`.
  - Avoid `SELECT *`. We don't know what dirty stuff can come from the `sync` schemas.
- `intermediate`
  - Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/3-intermediate>
  - It's strictly forbidden to use tables here to serve end users.
  - Make an effort to practice DRY.
- `reporting`
  - Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/4-marts>
  - For now, we follow a monolithic approach and just have one `reporting` schema. When this becomes insufficient, we will consider splitting it into several schemas.
  - Make an effort to keep this layer stable, as you would a library's API, so that downstream dependencies don't break uncontrollably.
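
To make the staging layout concrete, here is a minimal sketch of what `_core_sources.yml` could contain. The schema and table names are hypothetical; use the actual objects of the `sync` schema:

```yaml
version: 2

sources:
  - name: core
    schema: sync_core  # hypothetical name for the sync schema
    tables:
      - name: booking
      - name: user
```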
## Conventions
- dbt practices:
  - Always use CTEs in your models to `source` and `ref` other models.
- Columns and naming
  - We follow [snake case](https://en.wikipedia.org/wiki/Snake_case) for column names and table names.
  - Identifier columns should begin with `id_`, not end with `_id`.
  - Use binary question-like names for binary, bool, and flag columns (e.g. not `active` but `is_active`, not `verified` but `has_been_verified`, not `imported` but `was_imported`).
  - Datetime columns should end in either `_utc` or `_local`. If they end in `_local`, the table should contain a `local_timezone` column holding the [timezone identifier](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones).
  - We work with many currencies and lack a single main one. Hence, any money field is ambiguous on its own. To address this, any table that has money-related columns should also have a column named `currency`. We currently have no policy for tables where a single record has columns in different currencies. If you face this, assemble the data team and decide on something.
- Folder structures and naming
  - All models live in `models`, under either `staging`, `intermediate` or `reporting`.
  - Staging models should be prefixed with `stg_` and intermediate models with `int_`.
  - Split schema and domain with a double underscore (e.g. `stg_core__booking`).
  - Always use sources to read into staging models.
  - SQL formatting should be done with `sqlfmt`.
- YAML files:
  - Should use the `.yml` extension, not `.yaml`.
  - Should be autoformatted on save. If you install [this vscode extension](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml), autoformatting should happen out of the box thanks to the settings included in the `.vscode/settings.json` file.
- Other conventions
  - In staging, apply `lower()` to user UUID fields to avoid nasty propagations through the DWH.
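
Pulling these conventions together, here is a hypothetical staging model `stg_core__booking`. The source columns are illustrative, not the real Core schema:

```sql
with
    -- read the raw table through `source()`; no `select *` on sync schemas
    booking as (
        select booking_uuid, user_uuid, cancelled, created_at, total_amount, currency
        from {{ source("core", "booking") }}
    ),

    renamed as (
        select
            -- identifiers begin with `id_`; UUIDs are lowercased in staging
            lower(booking_uuid) as id_booking,
            lower(user_uuid) as id_user,

            -- binary columns read as questions
            cancelled as is_cancelled,

            -- datetimes end in `_utc` (or `_local`, paired with a `local_timezone`)
            created_at as created_at_utc,

            -- money columns always travel with a `currency` column
            total_amount,
            currency
        from booking
    )

select *
from renamed
```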

When in doubt, do what the dbt folks would do (<https://docs.getdbt.com/best-practices>) or what GitLab does (<https://handbook.gitlab.com/handbook/business-technology/data-team/platform/dbt-guide/>).

## Testing Standards

- All tables in staging need primary key (unique) and not-null tests.
- Tables in reporting should have more thorough testing. What to look for is up to you, but it should provide strong confidence in the quality of the data.
- Tests are run after every `dbt run`.
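
As a minimal sketch, the staging requirement maps to dbt's built-in `unique` and `not_null` tests on each table's primary key, declared in a schema `.yml` (model and column names are illustrative):

```yaml
version: 2

models:
  - name: stg_core__booking
    columns:
      - name: id_booking
        tests:
          - unique
          - not_null
```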
## How to schedule

We currently use a minimal setup: we run the project from a VM within our infra with a simple cron job. These instructions are written for Azure VMs running Ubuntu 22.04; you might need to change details if you are running somewhere else.

To deploy:

- Prepare a VM with Ubuntu 22.04.
  - You need to have Python `>=3.10` installed.
  - You must be able to reach the DWH server through the network.
- On the VM, set up git credentials for the project (for example, with an SSH key), clone the git project into the `azureuser` home dir, and check out `main`.
- Create a virtual environment for the project with `python3 -m venv venv`.
- Activate the virtual environment and run `pip install -r requirements.txt`.
- Also run `dbt deps` to install the dbt packages required by the project.
- Create an entry for this project in your `profiles.yml` file at `~/.dbt/profiles.yml`. There is a suggested template at `profiles.yml.example`. Make sure that the `profiles.yml` host and port settings are consistent with whatever networking approach you've taken.
- There are two scripts in the root of this project, `run_dbt.sh` and `run_tests.sh`. Place them in the running user's home folder. Adjust the paths in the scripts if you want or need to.
- The scripts are designed to send both success and failure messages to Slack channels upon completion. To set this up properly, place a file called `slack_webhook_urls.txt` on the same path where you put the script files. The file should have two lines, `SLACK_ALERT_WEBHOOK_URL=<url-of-webhook-for-failures>` and `SLACK_RECEIPT_WEBHOOK_URL=<url-of-webhook-for-successful-runs>` (see the sketch after this list). Setting up the Slack channels and webhooks is outside the scope of this README.
- Create a cron entry with `crontab -e` that runs the scripts. For example: `0 2 * * * /bin/bash /home/azureuser/run_dbt.sh` to run the dbt models every day at 2AM, and `15 2 * * * /bin/bash /home/azureuser/run_tests.sh` to run the tests fifteen minutes later.
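
For clarity, `slack_webhook_urls.txt` is just the two `KEY=value` lines described above, one per webhook (actual URLs elided):

```text
SLACK_ALERT_WEBHOOK_URL=<url-of-webhook-for-failures>
SLACK_RECEIPT_WEBHOOK_URL=<url-of-webhook-for-successful-runs>
```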
To monitor:
- The model-building script writes its output to a `dbt_run.log` file. You can check its contents to see what happened in past runs. The exact location of the log file depends on how you set up the `run_dbt.sh` script; if you are unsure where your logs are being written, check the script to find out.
- The same applies to the test script, except it writes into a separate `dbt_test.log`.
To maintain:
- Remember to update dbt package dependencies when including new packages.
## Stuff we haven't done yet but would like to

- Automate formatting with a git pre-commit hook.
- Prepare a quick way to replicate parts of the `prd` DWH on our local machines.