# DWH dbt
Welcome to Superhog's DWH dbt project. Here we model the entire DWH.
## How to set up your environment
### Basics

- Pre-requisites
  - You need a Linux environment. That can be Linux, macOS or WSL.
  - You need to have Python `>=3.10` installed.
  - All docs will assume you are using VSCode.
  - Also install the following VSCode Python extension: ms-python.python
- Prepare networking
  - You must be able to reach the DWH server through the network. There are several ways to do this.
  - The current recommended route is to use the data VPN. You can ask Pablo to help you set it up.
- Set up
  - Create a virtual environment for the project with `python3 -m venv venv`.
  - It's recommended that you set the new `venv` as your default interpreter for VSCode. To do this, press Ctrl+Shift+P and look for the `Python: Select Interpreter` option. Choose the new `venv`.
  - Activate the virtual environment and run `pip install -r requirements.txt`.
  - Create an entry for this project in your `profiles.yml` file at `~/.dbt/profiles.yml`. You have a suggested template at `profiles.yml.example` (see the sketch after this list).
  - Make sure that the `profiles.yml` host and port settings are consistent with whatever networking approach you've taken.
  - Run `chmod 600 ~/.dbt/profiles.yml` to secure your profiles file.
  - Run `dbt deps` to install dbt dependencies.
- Check
  - Ensure you are running in the project venv.
  - Run `dbt debug`. If it runs well, you are all set. If it fails, there's something wrong with your setup. Grab the terminal output and pull the thread.
- Complements
  - If you are in VSCode, you most probably want to have the dbt Power User extension installed.
  - It is advised to use this autoformatter and to run it automatically on save. Important: if you have already installed dbt Power User, follow the instructions of this link directly.
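For reference, a `~/.dbt/profiles.yml` entry typically looks like the sketch below. This is a minimal sketch only: the profile name, adapter type and all connection values are assumptions, since the authoritative structure lives in `profiles.yml.example`.

```yml
# ~/.dbt/profiles.yml -- illustrative sketch; the profile name, adapter
# type and field values are assumptions. Use profiles.yml.example as the
# real template.
dwh:
  target: dev
  outputs:
    dev:
      type: postgres          # assumed adapter; use whatever the template specifies
      host: localhost         # must be consistent with your networking approach
      port: 5432
      user: <your-user>
      password: <your-password>
      dbname: <dwh-database>
      schema: <your-dev-schema>
      threads: 4
```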
### Local DWH
Running a local version of the DWH allows you to test things as you develop: a must if you want to push changes to master without breaking everything.
You can read how to set this up in `dev-env/local_dwh.md`.
## Branching strategy
This repo follows a trunk-based development philosophy (https://trunkbaseddevelopment.com/).
If your branch is related to a work item from DevOps, we encourage adding the ticket number in the branch name. For example: `models/123-some-fancy-name`. If you don't have a ticket number, you can simply do a NOTICKET one: `models/NOTICKET-some-fancy-name`.
When working on data modeling stuff (models, sources, seeds, docs, etc.), use a `models` branch (e.g. `models/782-churned-users`). It's fine and encouraged to build incrementally towards a reporting-level table with multiple PRs, as long as you keep the model buildable along the way.
For other matters, use a `chores` branch (e.g. `chores/656-add-dbt-package`).
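For example, starting a ticketed modeling branch from an up-to-date trunk could look like this (the ticket number and branch name are placeholders):

```bash
# Update the trunk and branch off it (123-some-fancy-name is a placeholder)
git checkout main && git pull
git checkout -b models/123-some-fancy-name

# ...commit your work, then push and open a PR
git push -u origin models/123-some-fancy-name
```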
## Project organization
We organize models in three folders (a layout sketch follows the list):

- `staging`
  - Pretty much this: https://docs.getdbt.com/best-practices/how-we-structure/2-staging
  - One `.yml` per `sync` schema, with naming `_<sourcename>_sources.yml`. For example, for Core, `_core_sources.yml`.
  - All models go prefixed with `stg_`.
  - Avoid `SELECT *`. We don't know what dirty stuff can come from the `sync` schemas.
- `intermediate`
  - Pretty much this: https://docs.getdbt.com/best-practices/how-we-structure/3-intermediate
  - It's strictly forbidden to use tables here to serve end users.
  - Make an effort to practice DRY.
- `reporting`
  - Pretty much this: https://docs.getdbt.com/best-practices/how-we-structure/4-marts
  - For now, we follow a monolithic approach and just have one `reporting` schema. When this becomes insufficient, we will judge splitting into several schemas.
  - Make an effort to keep this layer stable like you would do with a library's API, so that downstream dependencies don't break without control.
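Put together, the tree looks roughly like this. The sketch is illustrative only: apart from `_core_sources.yml` and `stg_core__booking` (used as examples elsewhere in this readme), the model names are made up:

```text
models/
├── staging/
│   └── core/
│       ├── _core_sources.yml
│       └── stg_core__booking.sql
├── intermediate/
│   └── int_core__bookings_enriched.sql   # made-up name
└── reporting/
    └── guest_kpis.sql                    # made-up name
```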
## Conventions

- dbt practices:
  - Always use CTEs in your models to `source` and `ref` other models (a sketch follows this list).
- Columns and naming
  - We follow snake case for column names and table names.
  - Identifier columns should begin with `id_`, not finish with `_id`.
  - Use binary question-like column names for binary, bool and flag columns (i.e. not `active` but `is_active`, not `verified` but `has_been_verified`, not `imported` but `was_imported`).
  - Datetime columns should either finish in `_utc` or `_local`. If they finish in `_local`, the table should contain a `local_timezone` column that holds the timezone identifier.
  - We work with many currencies and lack a single main one. Hence, any money fields will be ambiguous on their own. To address this, any table that has money-related columns should also have a column named `currency`. We currently have no policy for tables where a single record has columns in different currencies. If you face this, assemble the data team and decide on something.
- Folder structures and naming
  - All models live in `models`, and either in `staging`, `intermediate` or `reporting`.
  - Staging models should be prefixed with `stg_` and intermediate ones with `int_`.
  - Split schema and domain with a double underscore (e.g. `stg_core__booking`).
  - Always use sources to read into staging models.
  - SQL formatting should be done with `sqlfmt`.
  - YAML files:
    - Should use the `.yml` extension, not `.yaml`.
    - Should be autoformatted on save. If you install this vscode extension, autoformatting should happen out of the box thanks to the settings included in the `.vscode/settings.json` file.
- Other conventions
  - In staging, apply `lower()` to user UUID fields to avoid nasty propagations in the DWH.
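The sketch below shows how several of these conventions combine in a hypothetical staging model; the `core` source and every table and column name here are made up for illustration:

```sql
-- models/staging/core/stg_core__guest.sql (hypothetical example)
with

-- import CTE: always read the raw table through source(), and select
-- columns explicitly rather than SELECT *
guest_source as (

    select
        guestid,
        active,
        createdat
    from {{ source("core", "guest") }}

),

renamed as (

    select
        -- identifiers begin with id_; lower() UUIDs in staging
        lower(guestid) as id_guest,

        -- binary columns read like yes/no questions
        active as is_active,

        -- datetime columns end in _utc or _local
        createdat as created_at_utc

    from guest_source

)

select *
from renamed
```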
When in doubt, do what the dbt folks would do (https://docs.getdbt.com/best-practices) or what GitLab does (https://handbook.gitlab.com/handbook/business-technology/data-team/platform/dbt-guide/).
## Testing Standards

- All tables need Primary Key and Null tests (see the example below).
- Tables in staging and reporting should have more thorough testing. What to look for is up to you, but it should provide strong confidence in the quality of the data.
- Tests will be run after every `dbt run`.
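A minimal schema file with the mandatory primary key and null tests could look like this; the file, model and column names are illustrative:

```yml
# models/staging/core/_core__models.yml -- illustrative names throughout
version: 2

models:
  - name: stg_core__guest
    description: Staged guests from Core.
    columns:
      - name: id_guest
        description: Primary key. One record per guest.
        tests:
          - unique
          - not_null
```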
## How to schedule
We currently use a minimal setup where we run the project from a VM within our infra with a simple cron job. These instructions are fit for Azure VMs running Ubuntu 22.04; you might need to change details if you are running somewhere else.
To deploy:
- Prepare a VM with Ubuntu 22.04.
  - You need to have Python `>=3.10` installed.
  - You must be able to reach the DWH server through the network.
- On the VM, set up git creds for the project (for example, with an ssh key), clone the git project in the `azureuser` home dir, and checkout main.
- Create a virtual environment for the project with `python3 -m venv venv`.
- Activate the virtual environment and run `pip install -r requirements.txt`.
- Also run `dbt deps` to install the dbt packages required by the project.
- Create an entry for this project in your `profiles.yml` file at `~/.dbt/profiles.yml`. You have a suggested template at `profiles.yml.example`. Make sure that the `profiles.yml` host and port settings are consistent with whatever networking approach you've taken.
- There are two scripts in the root of this project called `run_dbt.sh` and `run_tests.sh`. Place them in the running user's home folder. Adjust the paths in the scripts if you want/need to.
- The scripts are designed to send both success and failure messages to Slack channels upon completion. To set this up properly, place a file called `slack_webhook_urls.txt` on the same path where you put the script files. The file should have two lines: `SLACK_ALERT_WEBHOOK_URL=<url-of-webhook-for-failures>` and `SLACK_RECEIPT_WEBHOOK_URL=<url-of-webhook-for-successful-runs>` (see the example after this list). Setting up the Slack channels and webhooks is outside the scope of this readme.
- Create a cron entry with `crontab -e` that runs the scripts. For example: `0 2 * * * /bin/bash /home/azureuser/run_dbt.sh` to run the dbt models every day at 2AM, and `15 2 * * * /bin/bash /home/azureuser/run_tests.sh` to run the tests fifteen minutes later.
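For example, a `slack_webhook_urls.txt` with placeholder URLs looks like this:

```text
SLACK_ALERT_WEBHOOK_URL=https://hooks.slack.com/services/<webhook-for-failures>
SLACK_RECEIPT_WEBHOOK_URL=https://hooks.slack.com/services/<webhook-for-successful-runs>
```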
To monitor:
- The model building script writes output to a `dbt_run.log` file. You can check its contents to see what happened in past runs. The exact location of the log file depends on how you set up the `run_dbt.sh` script. If you are unsure of where your logs are being written, check the script to find out.
- The same applies to the test script, except it writes into a separate `dbt_test.log` (the commands below show a quick way to inspect both).
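Assuming the scripts and logs live in the `azureuser` home dir (an assumption; check the scripts for the real paths), a quick way to inspect past runs:

```bash
# Log paths are assumptions -- check run_dbt.sh / run_tests.sh for the real ones
tail -n 100 /home/azureuser/dbt_run.log
tail -n 100 /home/azureuser/dbt_test.log

# Scan both logs for errors
grep -i "error" /home/azureuser/dbt_run.log /home/azureuser/dbt_test.log
```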
To maintain:
- Remember to update dbt package dependencies (`dbt deps`) on the VM when new packages are included in the project.
## Stuff that we haven't done but we would like to
- Automate formatting with git pre-commit.
- Prepare a quick way to replicate parts of the `prd` DWH on our local machines.