DWH dbt

Welcome to Superhog's DWH dbt project. Here we model the entire DWH.

How to set up your environment

Basics

  • Prerequisites
    • You need a Linux environment. That can be Linux, macOS or WSL.
    • You need to have Python >=3.10 installed.
    • All docs will assume you are using VSCode.
    • Also install the following VSCode Python extension: ms-python.python
  • Prepare networking
    • You must be able to reach the DWH server through the network. There are several ways to do this.
    • The current recommended route is to use the data VPN. You can ask Pablo to help you set it up.
  • Set up
    • Create a virtual environment for the project with python3 -m venv venv.
    • It's recommended that you set the new venv as your default interpreter for VSCode. To do this, press Ctrl+Shift+P, look for the Python: Select interpreter option, and choose the new venv.
    • Activate the virtual environment and run pip install -r requirements.txt.
    • Create an entry for this project in the profiles.yml file at ~/.dbt/profiles.yml. A suggested template is available at profiles.yml.example.
    • Make sure that the profiles.yml host and port settings are consistent with whatever networking approach you've taken.
    • Run chmod 600 ~/.dbt/profiles.yml to secure your profiles file.
    • Run dbt deps to install dbt dependencies. (A condensed command sketch follows this list.)
  • Check
    • Ensure you are running in the project venv.
    • Run dbt debug. If it runs well, you are all set. If it fails, there's something wrong with your setup. Grab the terminal output and pull the thread.
  • Complements: optional extras, like the local DWH described below.
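
For reference, the basic setup above condenses to roughly the following commands (a sketch: paths, and the profiles.yml step in particular, depend on your environment):

    # Sketch of the basic setup; run from the repo root
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt

    # Start from the template if you have no ~/.dbt/profiles.yml yet;
    # otherwise, merge the new entry into your existing file by hand.
    mkdir -p ~/.dbt
    cp profiles.yml.example ~/.dbt/profiles.yml
    chmod 600 ~/.dbt/profiles.yml

    dbt deps
    dbt debug   # should pass once everything is wired up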

Local DWH

Running a local version of the DWH allows you to test things as you develop: a must if you want to push changes to main without breaking everything.

You can read on how to set this up in dev-env/local_dwh.md.

Branching strategy

This repo works in a trunk-based-development philosophy (https://trunkbaseddevelopment.com/).

If your branch is related to a work item from DevOps, we encourage adding the ticket number in the branch name. For example: models/123-some-fancy-name. If you don't have a ticket number, you can simply do a NOTICKET one: models/NOTICKET-some-fancy-name.

When working on Data modeling stuff (models, sources, seeds, docs, etc.) use a models branch (i.e. models/782-churned-users). It's fine and encouraged to build incrementally towards a reporting level table with multiple PRs as long as you keep the model buildable along the way.

For other matters, use a chores branch (i.e. chores/656-add-dbt-package).

Project organization

We organize models in three folders: staging, intermediate and reporting.
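
For orientation, the layout looks roughly like this (subfolders and file names are illustrative):

    models/
      staging/        # stg_ models; always read from sources
      intermediate/   # int_ models
      reporting/      # reporting-level tables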

Conventions

  • dbt practices:
    • Always use CTEs in your models to source and ref other models.
  • Columns and naming
    • We follow snake case for column names and table names.
    • Identifier columns should begin with id_, not finish with _id.
    • Use binary question-like column names for binary, bool, and flag columns (i.e. not active but is_active, not verified but has_been_verified, not imported but was_imported)
    • Datetime columns should either finish in _utc or _local. If they finish in local, the table should contain a local_timezone column that contains the timezone identifier.
    • We work with many currencies and lack a single main one. Hence, any money field is ambiguous on its own. To address this, any table with money-related columns should also have a column named currency. We currently have no policy for tables where a single record has columns in different currencies. If you face this, assemble the data team and decide on something.
  • Folder structures and naming
    • All models live in models, and either in staging, intermediate or reporting.
    • Staging models should be prefixed with stg_ and intermediate models with int_.
    • Split schema and domain with a double underscore (i.e. stg_core__booking).
    • Always use sources to read into staging models.
  • SQL formatting should be done with sqlfmt.
  • YAML files:
    • Should use the .yml extension, not .yaml.
    • Should be autoformatted on save. If you install this VSCode extension, autoformatting should happen out of the box thanks to the settings included in the .vscode/settings.json file.
  • Other conventions
    • In staging, apply lower() to user UUID fields to avoid nasty propagations in the DWH.
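
To make several of these conventions concrete, here is a hypothetical staging model; the source and column names are invented for illustration:

    -- models/staging/stg_core__booking.sql (illustrative; columns are made up)
    with
    source as (
        select * from {{ source('core', 'booking') }}
    ),
    renamed as (
        select
            lower(booking_uuid) as id_booking,  -- lower() UUIDs in staging
            is_active,                          -- binary columns read like questions
            created_at_utc,                     -- datetimes end in _utc or _local
            total_amount,
            currency                            -- money columns travel with a currency
        from source
    )
    select * from renamed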

When in doubt, do what the dbt folks would do (https://docs.getdbt.com/best-practices) or GitLab (https://handbook.gitlab.com/handbook/business-technology/data-team/platform/dbt-guide/).

Testing Standards

  • All tables need Primary Key and Null tests.
  • Tables in staging and reporting should have more thorough testing. What to look for is up to you, but it should provide strong confidence in the quality of data.
  • Tests will be run after every dbt run.
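
As a sketch, the Primary Key and Null tests for a model live in its .yml file and look like this (model and column names are illustrative):

    # Illustrative schema file; model and column names are made up
    version: 2

    models:
      - name: stg_core__booking
        columns:
          - name: id_booking
            tests:
              - unique
              - not_null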

How to schedule

We currently use a minimal setup where we run the project from a VM within our infra with a simple cron job. These instructions are written for Azure VMs running Ubuntu 22.04; you might need to change details if you are running somewhere else.

To deploy:

  • Prepare a VM with Ubuntu 22.04

  • You need to have Python >=3.10 installed.

  • You must be able to reach the DWH server through the network.

  • On the VM, set up git credentials for the project (for example, with an SSH key), clone the repo into the azureuser home directory, and check out main.

  • Create a virtual environment for the project with python3 -m venv venv.

  • Activate the virtual environment and run pip install -r requirements.txt.

  • Also run dbt deps to install the dbt packages required by the project.

  • Create an entry for this project in the profiles.yml file at ~/.dbt/profiles.yml. A suggested template is available at profiles.yml.example. Make sure that the host and port settings are consistent with whatever networking approach you've taken.

  • There are three scripts in the root of this project called run_dbt.sh, run_tests.sh and run_docs.sh. Place them in the running user's home folder. Adjust the paths in the scripts if you need to.

  • run_dbt.sh and run_tests.sh don't take any CLI arguments. run_docs.sh takes one: the folder where you would like the docs to be placed. So, if you want docs at /some/path/for/docs/, you would call the script like this: /bin/bash run_docs.sh /some/path/for/docs/.

  • The scripts are designed to send both success and failure messages to Slack channels upon completion. To set this up, place a file called slack_webhook_urls.txt in the same path as the script files. It should contain two lines: SLACK_ALERT_WEBHOOK_URL=<url-of-webhook-for-failures> and SLACK_RECEIPT_WEBHOOK_URL=<url-of-webhook-for-successful-runs> (there's an example after the cron entry below). Setting up the Slack channels and webhooks is outside the scope of this README.

  • Create a cron entry with crontab -e that runs the scripts. For example, you can use the following line to sequentially build the documentation, run the models and then test the DWH, making each step happen only if the previous one succeeds:

    # This goes in your crontab file
    15 6 * * * /bin/bash /home/azureuser/run_docs.sh /home/azureuser/dbtdocs && /bin/bash /home/azureuser/run_dbt.sh && /bin/bash /home/azureuser/run_tests.sh 
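
For reference, the slack_webhook_urls.txt file mentioned above is just two KEY=value lines; the URLs here are placeholders:

    # slack_webhook_urls.txt -- lives next to the scripts; placeholder URLs
    SLACK_ALERT_WEBHOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ
    SLACK_RECEIPT_WEBHOOK_URL=https://hooks.slack.com/services/AAA/BBB/CCC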
    

To monitor:

  • The model building script writes output to a dbt_run.log file. You can check the contents to see what happened in the past runs. The exact location of the log file depends on how you set up the run_dbt.sh script. If you are unsure of where your logs are being written, check the script to find out.
  • The same applies to the test script, except it writes into a separate dbt_test.log.
  • And the docs script writes into dbt_docs.log.
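
For a quick look at the latest runs, something like the following works; the paths assume the home-folder layout from the deploy steps, so adjust them if you moved the scripts:

    # Assumes logs land in the running user's home folder
    tail -n 50 /home/azureuser/dbt_run.log
    tail -n 50 /home/azureuser/dbt_test.log
    tail -n 50 /home/azureuser/dbt_docs.log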

To maintain:

  • Remember to update dbt package dependencies when including new packages.

Serving the docs with a web server

Once you build the docs with run_docs.sh, you will have a bunch of files. To open them up, you will need to serve them with a webserver like Caddy or Nginx.

This goes beyond the scope of this project: to understand how you can serve these, refer to our infra script repo, specifically the bits around the web gateway setup.
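
That said, for a quick sanity check a minimal config is enough. A hypothetical Caddyfile serving the docs folder from the cron example above could look like this (our real setup lives in the infra repo):

    # Hypothetical Caddyfile: serve the generated docs over HTTP on port 8080
    :8080 {
        root * /home/azureuser/dbtdocs
        file_server
    }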

Stuff that we haven't done but we would like to

  • Automate formatting with git pre-commit.
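
For the pre-commit item, a starting point might be sqlfmt's own pre-commit hook. An unverified sketch (check the sqlfmt docs for the exact repo URL and a current rev):

    # .pre-commit-config.yaml -- sketch only; verify repo and rev against sqlfmt's docs
    repos:
      - repo: https://github.com/tconbeer/sqlfmt
        rev: v0.21.2   # replace with the latest tagged release
        hooks:
          - id: sqlfmt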