# DWH dbt

Welcome to Superhog's DWH dbt project. Here we model the entire DWH.

## How to set up your environment

### Basics

- Pre-requisites
  - You need a Linux-like environment. That can be Linux, macOS or WSL.
  - You need to have Python `>=3.10` installed.
  - All docs will assume you are using VSCode.
  - Also install the following VSCode Python extension: ms-python.python
- Prepare networking
  - You must be able to reach the DWH server through the network. There are several ways to do this.
  - The current recommended route is to use the data VPN. You can ask Pablo to help you set it up.
- Set up
  - Create a virtual environment for the project with `python3 -m venv venv`.
  - It's recommended that you set the new `venv` as your default interpreter for VSCode. To do this, press Ctrl+Shift+P, look for the `Python: Select Interpreter` option, and choose the new `venv`.
  - Activate the virtual environment and run `pip install -r requirements.txt`.
  - Create an entry for this project in the `profiles.yml` file at `~/.dbt/profiles.yml`. You have a suggested template at `profiles.yml.example`.
  - Make sure that the `profiles.yml` host and port settings are consistent with whatever networking approach you've taken.
  - Run `chmod 600 ~/.dbt/profiles.yml` to secure your profiles file.
  - Run `dbt deps` to install dbt dependencies.
- Check
  - Ensure you are running in the project venv.
  - Run `dbt debug`. If it runs well, you are all set. If it fails, there's something wrong with your setup. Grab the terminal output and pull the thread.
- Complements
  - If you are in VSCode, you most probably want to have this extension installed: [dbt Power User](https://marketplace.visualstudio.com/items?itemName=innoverio.vscode-dbt-power-user)
  - It is advised to use [this autoformatter](https://sqlfmt.com/) and to automatically [run it on save](https://docs.sqlfmt.com/integrations/vs-code). **Important**: if you have already installed dbt Power User, [follow the instructions of this link directly](https://docs.sqlfmt.com/integrations/dbt-power-user).

### Local DWH

Running a local version of the DWH allows you to test things as you develop: a must if you want to push changes to master without breaking everything. You can read on how to set this up in `dev-env/local_dwh.md`.

## Branching strategy

This repo works with a trunk-based development philosophy.

If your branch is related to a work item from DevOps, we encourage adding the ticket number in the branch name. For example: `models/123-some-fancy-name`. If you don't have a ticket number, you can simply do a `NOTICKET` one: `models/NOTICKET-some-fancy-name`.

When working on data modeling stuff (models, sources, seeds, docs, etc.), use a `models` branch (e.g. `models/782-churned-users`). It's fine and encouraged to build incrementally towards a `reporting` level table with multiple PRs, as long as you keep the model buildable along the way.

For other matters, use a `chores` branch (e.g. `chores/656-add-dbt-package`).

## Project organization

We organize models in three folders:

- `staging`
  - One `.yml` per `sync` schema, named `_<schema>_sources.yml`. For example, for Core, `_core_sources.yml`.
  - All models go prefixed with `stg_`.
  - Avoid `SELECT *`. We don't know what dirty stuff can come from the `sync` schemas.
- `intermediate`
  - It's strictly forbidden to use tables here to serve end users.
  - Make an effort to practice DRY.
- `reporting`
  - For now, we follow a monolithic approach and just have one `reporting` schema. When this becomes insufficient, we will judge splitting into several schemas.
  - Make an effort to keep this layer stable, like you would do with a library's API, so that downstream dependencies don't break without control.

## Conventions

- dbt practices
  - Always use CTEs in your models to `source` and `ref` other models.
- Columns and naming
  - We follow [snake case](https://en.wikipedia.org/wiki/Snake_case) for column names and table names.
  - Identifier columns should begin with `id_`, not finish with `_id`.
  - Use binary question-like column names for binary, bool, and flag columns (e.g. not `active` but `is_active`, not `verified` but `has_been_verified`, not `imported` but `was_imported`).
  - Datetime columns should either finish in `_utc` or `_local`. If they finish in `_local`, the table should contain a `local_timezone` column that contains the [timezone identifier](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones).
  - We work with many currencies and lack a single main one. Hence, any money field is ambiguous on its own. To address this, any table that has money-related columns should also have a column named `currency`. We currently have no policy for tables where a single record has columns in different currencies. If you face this, assemble the data team and decide on something.
- Folder structures and naming
  - All models live in `models`, and either in `staging`, `intermediate` or `reporting`.
  - Staging models should be prefixed with `stg_` and intermediate ones with `int_`.
  - Split schema and domain with a double underscore (e.g. `stg_core__booking`).
  - Always use sources to read into staging models.
  - SQL formatting should be done with `sqlfmt`.
- YAML files
  - Should use the `.yml` extension, not `.yaml`.
  - Should be autoformatted on save. If you install [this VSCode extension](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml), autoformatting should happen out of the box thanks to the settings included in the `.vscode/settings.json` file.
- Other conventions
  - In staging, apply `lower()` to user UUID fields to avoid nasty propagations in the DWH.
  - When in doubt, do what the dbt folks would do. Or the GitLab data team.
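To make the staging layout concrete, here is a minimal sketch of what one of the per-schema sources files could look like. The `core` source name and the `booking` and `guest` tables are hypothetical placeholders, not real project objects; mirror your actual `sync` schemas.

```yml
version: 2

sources:
  - name: core
    schema: core # hypothetical: point this at the actual sync schema
    tables:
      - name: booking
      - name: guest
```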
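And a matching sketch of a staging model reading from that hypothetical source. The column names (`guest_uuid`, `created_at`, and so on) are made up for illustration; the point is the shape: a `source` CTE, an explicit column list instead of `SELECT *`, and the naming rules above.

```sql
-- models/staging/stg_core__booking.sql (hypothetical sketch)
with

source as (
    -- always read staging models through a source, via a CTE
    select id, guest_uuid, status, created_at, amount, currency
    from {{ source('core', 'booking') }}
),

renamed as (
    select
        id as id_booking,                      -- identifiers begin with id_
        lower(guest_uuid) as id_guest,         -- lower() user UUIDs in staging
        status = 'confirmed' as is_confirmed,  -- question-like flag names
        created_at as created_at_utc,          -- datetimes end in _utc or _local
        amount,
        currency                               -- money columns travel with a currency
    from source
)

select * from renamed
```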
## Testing Standards

- All tables in staging need primary key and null tests.
- Tables in reporting should have more thorough testing. What to look for is up to you, but it should provide strong confidence in the quality of the data.
- Tests will be run after every `dbt run`.
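For the staging baseline above, the standard dbt way to express a primary key check is a `unique` plus `not_null` test on the key column. A minimal sketch, reusing the hypothetical model from the Conventions section:

```yml
version: 2

models:
  - name: stg_core__booking
    columns:
      - name: id_booking
        tests:
          - unique
          - not_null
```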
## How to schedule

We currently use a minimal setup where we run the project from a VM within our infra with a simple cron job. These instructions are fit for Azure VMs running Ubuntu 22.04; you might need to change details if you are running somewhere else.

To deploy:

- Prepare a VM with Ubuntu 22.04.
- You need to have Python `>=3.10` installed.
- You must be able to reach the DWH server through the network.
- On the VM, set up git creds for the project (for example, with an SSH key), clone the git project in the `azureuser` home dir, and check out `main`.
- Create a virtual environment for the project with `python3 -m venv venv`.
- Activate the virtual environment and run `pip install -r requirements.txt`.
- Also run `dbt deps` to install the dbt packages required by the project.
- Create an entry for this project in the `profiles.yml` file at `~/.dbt/profiles.yml`. You have a suggested template at `profiles.yml.example`. Make sure that the `profiles.yml` host and port settings are consistent with whatever networking approach you've taken.
- There are two scripts in the root of this project called `run_dbt.sh` and `run_tests.sh`. Place them in the running user's home folder. Adjust the paths in the scripts if you want/need to.
- The scripts are designed to send both success and failure messages to Slack channels upon completion. To properly set this up, you will need to place a file called `slack_webhook_urls.txt` on the same path where you put the script files. The Slack webhooks file should have two lines: `SLACK_ALERT_WEBHOOK_URL=` and `SLACK_RECEIPT_WEBHOOK_URL=`. Setting up the Slack channels and webhooks is outside the scope of this readme.
- Create a cron entry with `crontab -e` that runs the scripts. For example: `0 2 * * * /bin/bash /home/azureuser/run_dbt.sh` to run the dbt models every day at 2AM, and `15 2 * * * /bin/bash /home/azureuser/run_tests.sh` to run the tests fifteen minutes later.

To monitor:

- The model building script writes output to a `dbt_run.log` file. You can check its contents to see what happened in past runs. The exact location of the log file depends on how you set up the `run_dbt.sh` script. If you are unsure of where your logs are being written, check the script to find out.
- The same applies to the test script, except it writes into a separate `dbt_test.log`.

To maintain:

- Remember to update dbt package dependencies when including new packages.

## Stuff that we haven't done but we would like to

- Automate formatting with git pre-commit (see the sketch below).
- Prepare a quick way to replicate parts of the `prd` dwh in our local machines.
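For the pre-commit item, a possible starting point is sketched below. This is untested: it assumes sqlfmt publishes a pre-commit hook with id `sqlfmt`, and the `rev` is a placeholder to swap for a real pinned release; check sqlfmt's integration docs before adopting it.

```yml
# .pre-commit-config.yaml (untested sketch)
repos:
  - repo: https://github.com/tconbeer/sqlfmt
    rev: v0.21.0 # placeholder: pin a real sqlfmt release
    hooks:
      - id: sqlfmt # assumed hook id; verify against sqlfmt's docs
```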