# DWH dbt
Welcome to Superhog's DWH dbt project. Here we model the entire DWH.
## How to set up your environment
### Basics
- Pre-requisites
  - You need a Unix-like environment: Linux, macOS, or WSL.
  - You need to have Python `>=3.10` installed.
  - All docs will assume you are using VSCode.
    - Also install the VSCode Python extension: `ms-python.python`.
- Prepare networking
  - You must be able to reach the DWH server through the network. There are several ways to do this.
  - The current recommended route is to use the data VPN. You can ask Pablo to help you set it up.
- Set up (the full command sequence is sketched after this list)
  - Create a virtual environment for the project with `python3 -m venv venv`.
    - It's recommended that you set the new `venv` as your default interpreter for VSCode. To do this, press Ctrl+Shift+P, look for the `Python: Select Interpreter` option, and choose the new `venv`.
  - Activate the virtual environment and run `pip install -r requirements.txt`.
  - Create an entry for this project in your `profiles.yml` file at `~/.dbt/profiles.yml`. You have a suggested template at `profiles.yml.example`.
    - Make sure that the `profiles.yml` host and port settings are consistent with whatever networking approach you've taken.
    - Run `chmod 600 ~/.dbt/profiles.yml` to secure your profiles file.
  - Run `dbt deps` to install the dbt dependencies.
- Check
  - Ensure you are running in the project venv.
  - Run `dbt debug`. If it runs well, you are all set. If it fails, there's something wrong with your setup. Grab the terminal output and pull the thread.
- Complements
  - If you are in VSCode, you most probably want to have this extension installed: [dbt Power User](https://marketplace.visualstudio.com/items?itemName=innoverio.vscode-dbt-power-user)
  - It is advised to use [this autoformatter](https://sqlfmt.com/) and to automatically [run it on save](https://docs.sqlfmt.com/integrations/vs-code).
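
For reference, the whole first-time setup boils down to something like the sketch below. It assumes a fresh environment with no existing `~/.dbt/profiles.yml`; if you already have one, merge the new entry into it instead of copying the template over it.

```bash
# From the repository root: create and activate the virtual environment
python3 -m venv venv
source venv/bin/activate

# Install the Python dependencies and the dbt packages
pip install -r requirements.txt
dbt deps

# Create and secure your dbt profile
# (edit the host and port so they match your networking setup)
mkdir -p ~/.dbt
cp profiles.yml.example ~/.dbt/profiles.yml
chmod 600 ~/.dbt/profiles.yml

# Verify that everything is wired up correctly
dbt debug
```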
### Local DWH
Having a database where you can run your WIP models is very useful to ease development. But obviously, we can't do that in production. We could do it in a shared dev instance, but then we would step on each other's toes when developing.
To overcome these issues, we rely on local clones of the DWH. The idea is to have a PostgreSQL instance running on your laptop. You run your `dbt run` statements for testing and validate the outcome of your work there. When you are confident and have tested properly, you can open a PR to `main`.
You will find a Docker Compose file named `dev-dwh.docker-compose.yml`. It simply starts a PostgreSQL 16 database on your machine. You can bring it up, adjust it to your needs, and adapt your `profiles.yml` file to point to it when you are developing locally. Bear in mind the file comes with Postgres server settings that were tuned for the laptops used by the team in August 2024; they might be more or less relevant to you. In case of doubt, you might want to use <https://pgtune.leopard.in.ua/>.
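
For instance, a local development loop could look like the sketch below. It assumes the Compose file exposes Postgres on `localhost` and that your `profiles.yml` has a target pointing at it; the `local` target name and the model name are placeholders, so use whatever you actually configured.

```bash
# Start the local PostgreSQL 16 instance defined in the compose file
docker compose -f dev-dwh.docker-compose.yml up -d

# Build and test your WIP models against the local database
# (`local` and `my_wip_model` are example names; adjust to your setup)
dbt run --target local --select my_wip_model
dbt test --target local --select my_wip_model

# Stop the local instance when you are done
docker compose -f dev-dwh.docker-compose.yml down
```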
The only missing bit to make your local deployment behave like the production DWH is the source data from the source systems. The current policy is to generate a dump from the production database with what you need and restore it in your local Postgres. That way, you are using accurate and representative data to do your work.
For example, if you are working on models that use Xero data, you can dump and restore the `sync_xero_superhog_limited` schema from your terminal with something roughly like this:
```bash
# Dump the source schema from the production DWH (you will be prompted for the password)
pg_dump -h superhog-dwh-prd.postgres.database.azure.com -U airbyte_user -W -F t -n sync_xero_superhog_limited dwh > xero.dump

# Restore the dump into your local PostgreSQL instance
pg_restore -h localhost -U postgres -W -d dwh xero.dump
```
## Branching strategy
This repo follows a trunk-based development philosophy (<https://trunkbaseddevelopment.com/>).
When working on data modeling stuff (models, sources, seeds, docs, etc.), use a `models` branch (e.g. `models/churned-users`). It's fine and encouraged to build incrementally towards a `reporting` level table with multiple PRs, as long as you keep the model buildable along the way.
For other matters, use a `chores` branch (e.g. `chores/add-dbt-package`).
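
As a quick illustration (the branch names are just examples):

```bash
# Model work: one short-lived branch per increment, merged back via PR
git checkout -b models/churned-users

# Non-modeling housekeeping
git checkout -b chores/add-dbt-package
```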
## Project organization
We organize models in three folders:
- `staging`
  - Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/2-staging>
  - One `.yml` per `sync` schema, with the naming `_<sourcename>_sources.yml`. For example, for Core, `_core_sources.yml`.
  - All models go prefixed with `stg_`.
  - Avoid `SELECT *`. We don't know what dirty stuff can come from the `sync` schemas.
- `intermediate`
  - Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/3-intermediate>
  - It's strictly forbidden to use tables here to serve end users.
  - Make an effort to practice DRY.
- `reporting`
  - Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/4-marts>
  - For now, we follow a monolithic approach and just have one `reporting` schema. When this becomes insufficient, we will consider splitting it into several schemas.
  - Make an effort to keep this layer as stable as you would a library's API, so that downstream dependencies don't break unexpectedly.
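
The layer folders double as convenient dbt selectors when you only want to build part of the project. A couple of illustrative commands (the reporting model name is a placeholder):

```bash
# Build (run + test) everything in a single layer, selected by path
dbt build --select models/staging
dbt build --select models/intermediate

# Build one reporting model together with all of its upstream dependencies
dbt build --select +reporting_some_model
```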
## Conventions
- dbt practices:
  - Always use CTEs in your models to `source` and `ref` other models.
- Columns and naming
  - We follow [snake case](https://en.wikipedia.org/wiki/Snake_case) for column names and table names.
  - Identifier columns should begin with `id_`, not end with `_id`.
  - Use question-like column names for binary, bool, and flag columns (e.g. not `active` but `is_active`, not `verified` but `has_been_verified`, not `imported` but `was_imported`).
  - Datetime columns should end in either `_utc` or `_local`. If they end in `_local`, the table should contain a `local_timezone` column that holds the [timezone identifier](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones).
  - We work with many currencies and lack a single main one, so any money field is ambiguous on its own. To address this, any table that has money-related columns should also have a column named `currency`. We currently have no policy for tables where a single record has columns in different currencies. If you face this, assemble the data team and decide on something.
- Folder structures and naming
  - All models live in `models`, under either `staging`, `intermediate` or `reporting`.
  - Staging models should be prefixed with `stg_` and intermediate models with `int_`.
  - Split schema and domain with a double underscore (e.g. `stg_core__booking`).
  - Always use sources to read into staging models.
- SQL formatting should be done with `sqlfmt`.
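
From the command line, formatting the project could look roughly like this, assuming `sqlfmt` is available in your venv (it is distributed on PyPI as `shandy-sqlfmt`):

```bash
# Install sqlfmt if it is not already in your venv
pip install shandy-sqlfmt

# Format all models in place
sqlfmt models/

# Or just check what would change, e.g. before opening a PR
sqlfmt --diff models/
```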
When in doubt, do what the dbt folks would do: <https://docs.getdbt.com/best-practices>
Or GitLab: <https://handbook.gitlab.com/handbook/business-technology/data-team/platform/dbt-guide/>
## Testing Standards
- All tables in staging need Primary Key and Null tests.
- Tables in reporting should have more thorough testing. What to look for is up to you, but it should provide strong confidence in the quality of data.
- Tests will be run after every `dbt run`.
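
In practice, a local check before opening a PR could look like this (the selector is illustrative; pick whatever subset you touched):

```bash
# Run the models you changed, then their tests
dbt run --select models/staging
dbt test --select models/staging

# Or do both in a single pass
dbt build --select models/staging
```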
## How to schedule
We currently use a minimal setup where we run the project from a VM within our infra with a simple cron job. These instructions are written for Azure VMs running Ubuntu 22.04; you might need to change details if you are running somewhere else.
To deploy:
- Prepare a VM with Ubuntu 22.04
  - You need to have Python `>=3.10` installed.
  - You must be able to reach the DWH server through the network.
- On the VM, set up Git credentials for the project (for example, with an SSH key), clone the repository into the `azureuser` home directory, and check out `main`.
- Create a virtual environment for the project with `python3 -m venv venv`.
- Activate the virtual environment and run `pip install -r requirements.txt`.
- Also run `dbt deps` to install the dbt packages required by the project.
- Create an entry for this project in your `profiles.yml` file at `~/.dbt/profiles.yml`. You have a suggested template at `profiles.yml.example`. Make sure that the `profiles.yml` host and port settings are consistent with whatever networking approach you've taken.
- There are two scripts in the root of this project called `run_dbt.sh` and `run_tests.sh`. Place them in the running user's home folder. Adjust the paths inside the scripts if you want or need to.
- The scripts are designed to send both success and failure messages to Slack channels upon completion. To set this up properly, you will need to place a file called `slack_webhook_urls.txt` on the same path where you put the script files. The file should have two lines: `SLACK_ALERT_WEBHOOK_URL=<url-of-webhook-for-failures>` and `SLACK_RECEIPT_WEBHOOK_URL=<url-of-webhook-for-successful-runs>`. Setting up the Slack channels and webhooks is outside the scope of this README.
- Create a cron entry with `crontab -e` that runs the scripts. For example: `0 2 * * * /bin/bash /home/azureuser/run_dbt.sh` to run the dbt models every day at 2AM, and `15 2 * * * /bin/bash /home/azureuser/run_tests.sh` to run the tests fifteen minutes later.
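
Put together, the VM-side setup could look roughly like the sketch below. The repository URL is a placeholder, `~/.dbt/profiles.yml` is assumed to be in place already, and the webhook URLs are whatever you created in Slack.

```bash
# As azureuser: clone the repo (URL is a placeholder) and prepare the environment
cd ~
git clone <repo-url> data-dwh-dbt-project
cd data-dwh-dbt-project
git checkout main
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
dbt deps

# Copy the scheduling scripts and create the Slack webhook file next to them
cp run_dbt.sh run_tests.sh ~
cat > ~/slack_webhook_urls.txt <<'EOF'
SLACK_ALERT_WEBHOOK_URL=<url-of-webhook-for-failures>
SLACK_RECEIPT_WEBHOOK_URL=<url-of-webhook-for-successful-runs>
EOF

# Cron entries: models at 02:00, tests at 02:15
( crontab -l 2>/dev/null; echo '0 2 * * * /bin/bash /home/azureuser/run_dbt.sh'; echo '15 2 * * * /bin/bash /home/azureuser/run_tests.sh' ) | crontab -
```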
To monitor:
- The model building script writes output to a `dbt_run.log` file. You can check the contents to see what happened in the past runs. The exact location of the log file depends on how you set up the `run_dbt.sh` script. If you are unsure of where your logs are being written, check the script to find out.
- Same applies to the test script, except it will write into a separate `dbt_test.log`.
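
Assuming the scripts write their logs to the running user's home folder (check `run_dbt.sh` and `run_tests.sh` for the actual paths), a quick look could be:

```bash
# Inspect the tail end of the latest runs (paths are assumptions; adjust to your setup)
tail -n 100 ~/dbt_run.log
tail -n 100 ~/dbt_test.log
```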
To maintain:
- Remember to update the dbt package dependencies (run `dbt deps` on the VM) when new packages are added to the project.
## Stuff that we haven't done but we would like to
- Automate formatting with git pre-commit.
- Define conventions on testing (and enforce them).
- Define conventions on documentation (and enforce them).
- Prepare a quick way to replicate parts of the `prd` DWH on our local machines.