# DWH dbt
Welcome to Superhog's DWH dbt project. Here we model the entire DWH.
## How to set up your environment
### Basics
- Prerequisites
  - You need a Unix-like environment: Linux, macOS, or WSL.
  - You need to have Python `>=3.10` installed.
  - All docs assume you are using VSCode.
  - Also install the VSCode Python extension: `ms-python.python`.
- Prepare networking
  - You must be able to reach the DWH server through the network. There are several ways to do this.
  - The current recommended route is to use the data VPN. You can ask Pablo to help you set it up.
- Set up (an end-to-end sketch of these commands follows this list)
  - Create a virtual environment for the project with `python3 -m venv venv`.
  - It's recommended to set the new `venv` as your default interpreter for VSCode. To do this, press Ctrl+Shift+P, look for the `Python: Select interpreter` option, and choose the new `venv`.
  - Activate the virtual environment and run `pip install -r requirements.txt`.
  - Create an entry for this project in the `profiles.yml` file at `~/.dbt/profiles.yml`. There is a suggested template at `profiles.yml.example`.
  - Make sure that the `profiles.yml` host and port settings are consistent with whatever networking approach you've taken.
  - Run `chmod 600 ~/.dbt/profiles.yml` to secure your profiles file.
  - Run `dbt deps` to install dbt dependencies.
- Check
  - Ensure you are running in the project venv.
  - Run `dbt debug`. If it runs well, you are all set. If it fails, there's something wrong with your setup. Grab the terminal output and pull the thread.
- Complements
  - If you are in VSCode, you most probably want to have this extension installed: [dbt Power User](https://marketplace.visualstudio.com/items?itemName=innoverio.vscode-dbt-power-user).
  - It is advised to use [this autoformatter](https://sqlfmt.com/) and to automatically [run it on save](https://docs.sqlfmt.com/integrations/vs-code).

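
For reference, the whole setup condenses into a handful of shell commands. This is a minimal sketch, not a substitute for the steps above: the profile name, target, and connection values below are placeholders, and the authoritative template for `profiles.yml` is `profiles.yml.example`.

```bash
# Create and activate the project virtual environment
python3 -m venv venv
source venv/bin/activate

# Install Python dependencies and dbt packages
pip install -r requirements.txt
dbt deps

# Create a profiles.yml entry. Values below are placeholders; copy from
# profiles.yml.example and adjust host/port to match your networking setup.
mkdir -p ~/.dbt
cat >> ~/.dbt/profiles.yml <<'EOF'
dwh:                       # assumed profile name; use the one from profiles.yml.example
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost      # or the DWH host reachable through the VPN
      port: 5432
      user: your_user
      password: your_password
      dbname: dwh
      schema: dbt_yourname
      threads: 4
EOF
chmod 600 ~/.dbt/profiles.yml

# Sanity check; a healthy setup ends with "All checks passed!"
dbt debug
```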
### Local DWH

Having a database where you can run your WIP models is very useful to ease development. But obviously, we can't do that in production. We could do it in a shared dev instance, but then we would step on each other's toes when developing.

To overcome these issues, we rely on local clones of the DWH. The idea is to have a PostgreSQL instance running on your laptop. You perform your `dbt run` statements for testing and validate the outcome of your work there. When you are confident and have tested properly, you can open a PR.

You will find a docker compose file named `dev-dwh.docker-compose.yml`. It simply starts a PostgreSQL 16 database on your machine. You can bring it up, adjust it to your needs, and point your `profiles.yml` at it when developing locally.

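
Bringing it up takes a couple of commands. This assumes a recent Docker with the compose plugin; whatever port and credentials the compose file defines are the ones to mirror in your local `profiles.yml`.

```bash
# Start the local PostgreSQL 16 instance in the background
docker compose -f dev-dwh.docker-compose.yml up -d

# Check that the container is up and see which port it exposes
docker compose -f dev-dwh.docker-compose.yml ps

# Tear it down when you are done (add -v to also drop the data volume)
docker compose -f dev-dwh.docker-compose.yml down
```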
The only missing piece to make your local deployment behave like the production DWH is the source data from the source systems. The current policy is to generate a dump from the production database with what you need and restore it in your local Postgres. That way, you are using accurate and representative data to do your work.

For example, if you are working on models that use data from one of the sources (Xero in this case), you can dump and restore the relevant `sync` schema from your terminal with something roughly like this:

```bash
pg_dump -h superhog-dwh-prd.postgres.database.azure.com -U airbyte_user -W -F t dwh -n sync_xero_superhog_limited > xero.dump

pg_restore -h localhost -U postgres -W -d dwh xero.dump
```
## Branching strategy

This repo follows a trunk-based development philosophy (<https://trunkbaseddevelopment.com/>).

When working on data modeling stuff (models, sources, seeds, docs, etc.) use a `models` branch (e.g. `models/churned-users`). It's fine and encouraged to build incrementally towards a `reporting`-level table with multiple PRs, as long as you keep the model buildable along the way.

For other matters, use a `chores` branch (e.g. `chores/add-dbt-package`).

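
As a quick illustration (the branch names here are just examples):

```bash
# Model work: a short-lived branch off main, merged back via PR
git checkout main && git pull
git checkout -b models/churned-users
# ...commit small, buildable increments...
git push -u origin models/churned-users

# Everything else follows the same flow on a chores/ branch
git checkout -b chores/add-dbt-package
```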
## Project organization

We organize models in the following folders (an illustrative layout follows this list):

- `staging`
  - Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/2-staging>
  - One `.yml` per `sync` schema, with the naming `_<sourcename>_sources.yml`. For example, for Core, `_core_sources.yml`.
  - All models are prefixed with `stg_`.
  - Avoid `SELECT *`. We don't know what dirty stuff can come from the `sync` schemas.
- `intermediate`
  - Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/3-intermediate>
  - It's strictly forbidden to use tables here to serve end users.
  - Make an effort to practice DRY.
- `reporting`
  - Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/4-marts>
  - For now, we follow a monolithic approach and just have one `reporting` schema. When this becomes insufficient, we will consider splitting it into several schemas.
  - Make an effort to keep this layer as stable as you would a library's API, so that downstream dependencies don't break unexpectedly.

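
An illustrative layout, just to anchor the conventions above. The file names are hypothetical examples, not an inventory of the actual project:

```
models/
├── staging/
│   ├── _core_sources.yml
│   ├── _xero_sources.yml
│   ├── stg_core__users.sql
│   └── stg_xero__invoices.sql
├── intermediate/
│   └── int_users_joined_to_invoices.sql
└── reporting/
    └── churned_users.sql
```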
## Conventions

- Always use CTEs in your models to `source` and `ref` other models.
- We follow [snake case](https://en.wikipedia.org/wiki/Snake_case).
- Identifier columns should start with `id_`, not end with `_id`.
- Use binary question-like names for binary, bool, and flag columns (e.g. `is_active` instead of `active`, `has_been_verified` instead of `verified`, `was_imported` instead of `imported`).
- Datetime columns should end in either `_utc` or `_local`. If they end in `_local`, the table should contain a `local_timezone` column that holds the [timezone identifier](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones).
- We work with many currencies and lack a single main one. Hence, any money field is ambiguous on its own. To address this, any table that has money-related columns should also have a column named `currency`. We currently have no policy for tables where a single record has columns in different currencies. If you face this, assemble the data team and decide on something.

## How to schedule

We currently use a minimal setup where we run the project from a VM within our infra with a simple cron job. These instructions are written for Azure VMs running Ubuntu 22.04; you might need to adjust details if you are running somewhere else.

To deploy:

- Prepare a VM with Ubuntu 22.04.
  - You need to have Python `>=3.10` installed.
  - You must be able to reach the DWH server through the network.
- On the VM, set up git credentials for the project (for example, with an SSH key), clone the repo into the `azureuser` home directory, and check out `main`.
- Create a virtual environment for the project with `python3 -m venv venv`.
- Activate the virtual environment and run `pip install -r requirements.txt`.
- Also run `dbt deps` to install the dbt packages required by the project.
- Create an entry for this project in the `profiles.yml` file at `~/.dbt/profiles.yml`. There is a suggested template at `profiles.yml.example`. Make sure that the host and port settings are consistent with whatever networking approach you've taken.
- There's a script in the root of this project called `run_dbt.sh`. Place it at `~/run_dbt.sh` and adjust the paths inside it if you need to (a sketch of what such a script typically looks like follows this list).
- Create a cron entry with `crontab -e` that runs the script. For example, `0 2 * * * /bin/bash /home/azureuser/run_dbt.sh` runs the dbt models every day at 2 AM.

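
For orientation, `run_dbt.sh` usually boils down to something like the sketch below. The script shipped in the repo root is the authoritative version; the project path, the exact dbt commands, and the log location here are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of run_dbt.sh -- defer to the real script in the repo root.
set -euo pipefail

PROJECT_DIR="/home/azureuser/dwh-dbt"   # assumed clone location
LOG_FILE="/home/azureuser/dbt_run.log"  # assumed log location

cd "$PROJECT_DIR"
source venv/bin/activate

{
  echo "=== dbt run started at $(date -u) ==="
  dbt run
  echo "=== dbt run finished at $(date -u) ==="
} >> "$LOG_FILE" 2>&1
```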
To monitor:

- The script writes its output to a `dbt_run.log` file. Check its contents to see what happened in past runs. The exact location of the log file depends on how you set up the `run_dbt.sh` script; if you are unsure where your logs are being written, check the script to find out. A quick way to peek at recent runs is shown below.

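
For example, assuming the log sits next to the script in the `azureuser` home directory:

```bash
# See how the last run ended
tail -n 50 /home/azureuser/dbt_run.log

# Or follow a run that is currently in progress
tail -f /home/azureuser/dbt_run.log
```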
To maintain:

- Remember to update dbt package dependencies when including new packages.

## Stuff that we haven't done but we would like to

- Automate formatting with a git pre-commit hook.
- Define conventions on testing (and enforce them).
- Define conventions on documentation (and enforce them).
- Prepare a quick way to replicate parts of the `prd` DWH on our local machines.