DWH dbt

Welcome to Superhog's DWH dbt project. Here we model the entire DWH.

How to set up your environment

Basics

  • Pre-requisites
    • You need a Unix-like environment: Linux, macOS, or WSL.
    • You need to have Python >=3.10 installed.
    • All docs will assume you are using VSCode.
    • Also install the following VSCode Python extension: ms-python.python
  • Prepare networking
    • You must be able to reach the DWH server through the network. There are several ways to do this.
    • The current recommended route is to use the data VPN. You can ask Pablo to help you set it up.
  • Set up
    • Create a virtual environment for the project with python3 -m venv venv.
    • It's recommended to set the new venv as your default interpreter for VSCode. To do this, press Ctrl+Shift+P, look for the Python: Select Interpreter option, and choose the new venv.
    • Activate the virtual environment and run pip install -r requirements.txt.
    • Create an entry for this project in your profiles.yml file at ~/.dbt/profiles.yml. A suggested template is available at profiles.yml.example.
    • Make sure that the profiles.yml host and port settings are consistent with whatever networking approach you've taken.
    • Run chmod 600 ~/.dbt/profiles.yml to secure your profiles file.
    • Run dbt deps to install the dbt package dependencies.
  • Check
    • Ensure you are running in the project venv.
    • Run dbt debug. If it runs cleanly, you are all set. If it fails, something is wrong with your setup: grab the terminal output and pull the thread. The whole sequence is condensed into a sketch after this list.
  • Complements: the sections below cover optional but useful extras.
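For reference, here is the whole set-up condensed into shell. This is a sketch: it assumes a fresh machine with no existing ~/.dbt/profiles.yml, so adapt it if you already have one.

```sh
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Profiles: start from the template, then edit host/port to match your networking.
mkdir -p ~/.dbt
cp profiles.yml.example ~/.dbt/profiles.yml
chmod 600 ~/.dbt/profiles.yml

dbt deps    # install dbt package dependencies
dbt debug   # verify the connection; all checks should pass
```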

Local DWH

Running a local version of the DWH allows you to test things as you develop: a must if you want to push changes to main without breaking everything.

You can read on how to set this up in dev-env/local_dwh.md.

Branching strategy

This repo follows a trunk-based development philosophy (https://trunkbaseddevelopment.com/).

If your branch is related to an Azure DevOps work item, we encourage adding the ticket number to the branch name. For example: models/123-some-fancy-name. If you don't have a ticket number, you can simply use a NOTICKET one: models/NOTICKET-some-fancy-name.

When working on data modeling stuff (models, sources, seeds, docs, etc.) use a models branch (e.g. models/782-churned-users). It's fine and encouraged to build incrementally towards a reporting-level table with multiple PRs, as long as you keep the model buildable along the way.

For other matters, use a chores branch (e.g. chores/656-add-dbt-package).
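For example, cutting a branch for a modeling ticket and another for a chore:

```sh
git checkout -b models/123-some-fancy-name        # tied to work item 123
git checkout -b chores/NOTICKET-add-dbt-package   # no ticket attached
```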

Project organization

We organize models in three folders:

  • staging: models that read from sources and standardize them.
  • intermediate: reusable transformations that sit between staging and reporting.
  • reporting: the final, consumer-facing layer.

Conventions

  • dbt practices:
    • Always use CTEs in your models to source and ref other models (see the sketch after this list).
  • Columns and naming
    • We follow snake case for column names and table names.
    • Identifier columns should begin with id_, not finish with _id.
    • Use binary question-like column names for binary, bool, and flag columns (e.g. not active but is_active, not verified but has_been_verified, not imported but was_imported).
    • Datetime columns should finish in either _utc or _local. If they finish in _local, the table should contain a local_timezone column holding the timezone identifier.
    • We work with many currencies and lack a single main one, so any money field is ambiguous on its own. To address this, any table with money-related columns should also have a column named currency. We currently have no policy for tables where a single record has columns in different currencies; if you face this, assemble the data team and decide on something.
  • Folder structures and naming
    • All models live in models, and either in staging, intermediate or reporting.
    • Staging models should be prefixed with stg_ and intermediate models with int_.
    • Split schema and domain with a double underscore (e.g. stg_core__booking).
    • Always use sources to read into staging models.
  • SQL formatting should be done with sqlfmt.
  • YAML files:
    • Should use the .yml extension, not .yaml.
    • Should be autoformatted on save. The settings included in the .vscode/settings.json file enable this out of the box, as long as you have the corresponding VSCode extension installed.
  • Other conventions
    • In staging, apply lower() to UUID fields to avoid nasty propagations in the DWH.
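To make these conventions concrete, here is what a staging model that follows them could look like. This is only a sketch: stg_core__booking is the naming example from above, and the source and column names are invented.

```sql
-- models/staging/core/stg_core__booking.sql (illustrative)
with

booking_source as (
    select * from {{ source('core', 'booking') }}
),

renamed as (
    select
        lower(booking_uuid) as id_booking,    -- id_ prefix; UUIDs lowered in staging
        lower(user_uuid) as id_user,
        (status = 'active') as is_active,     -- question-like boolean name
        created_at as created_at_utc,         -- datetime suffixed with _utc
        price_amount,
        currency                              -- money columns travel with a currency
    from booking_source
)

select * from renamed
```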

When in doubt, do what the dbt folks would do (https://docs.getdbt.com/best-practices) or GitLab (https://handbook.gitlab.com/handbook/business-technology/data-team/platform/dbt-guide/).

Testing Standards

  • All tables need primary key and not-null tests (see the sketch after this list).
  • Tables in staging and reporting should have more thorough testing. What to look for is up to you, but it should provide strong confidence in the quality of data.
  • Tests will be run after every dbt run.
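A primary-key check is just a unique plus a not_null test on the identifier column, declared in the model's properties file. A sketch with illustrative model and column names:

```yml
version: 2

models:
  - name: stg_core__booking
    columns:
      - name: id_booking
        tests:
          - unique      # the primary key must be unique...
          - not_null    # ...and never null
```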

How to schedule

We currently use a minimal setup where we run the project from a VM within our infra with a simple cron job. These instructions are written for Azure VMs running Ubuntu 22.04; you might need to change details if you are running somewhere else.

To deploy:

  • Prepare a VM with Ubuntu 22.04.
  • You need to have Python >=3.10 installed.
  • You must be able to reach the DWH server through the network.
  • On the VM, set up git credentials for the project (for example, with an SSH key), clone the repo into the azureuser home directory, and check out main.
  • Create a virtual environment for the project with python3 -m venv venv.
  • Activate the virtual environment and run pip install -r requirements.txt.
  • Also run dbt deps to install the dbt packages required by the project.
  • Create an entry for this project in your profiles.yml file at ~/.dbt/profiles.yml (a suggested template is at profiles.yml.example). Make sure that the host and port settings are consistent with whatever networking approach you've taken.
  • There are two scripts in the root of this project, run_dbt.sh and run_tests.sh. Place them in the running user's home folder, and adjust the paths inside the scripts if you need to.
  • The scripts send both success and failure messages to Slack channels upon completion. For this to work, place a file called slack_webhook_urls.txt in the same folder as the scripts. It should have two lines: SLACK_ALERT_WEBHOOK_URL=<url-of-webhook-for-failures> and SLACK_RECEIPT_WEBHOOK_URL=<url-of-webhook-for-successful-runs> (see the sketch after this list). Setting up the Slack channels and webhooks is outside the scope of this readme.
  • Create a cron entry with crontab -e that runs the scripts. For example: 0 2 * * * /bin/bash /home/azureuser/run_dbt.sh to run the dbt models every day at 2AM, and 15 2 * * * /bin/bash /home/azureuser/run_tests.sh to run the tests fifteen minutes later.
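Concretely, the webhook file and the cron entries from the last two steps could look like this. The URLs are placeholders and the schedule is just the example above.

```sh
# ~/slack_webhook_urls.txt (same folder as the scripts)
SLACK_ALERT_WEBHOOK_URL=<url-of-webhook-for-failures>
SLACK_RECEIPT_WEBHOOK_URL=<url-of-webhook-for-successful-runs>

# crontab entries: build models at 2AM, run tests fifteen minutes later
0 2 * * * /bin/bash /home/azureuser/run_dbt.sh
15 2 * * * /bin/bash /home/azureuser/run_tests.sh
```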

To monitor:

  • The model-building script writes output to a dbt_run.log file; check it to see what happened in past runs. The exact location of the log file depends on how you set up run_dbt.sh, so if you are unsure where your logs are being written, check the script to find out.
  • The same applies to the test script, except it writes to a separate dbt_test.log (see the sketch after this list).
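For a quick look at recent runs, something like this works, assuming the logs land in the home directory; substitute wherever your scripts actually write.

```sh
tail -n 50 ~/dbt_run.log           # how the last model build went
grep -iE 'fail|error' ~/dbt_test.log   # any test failures worth pulling on
```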

To maintain:

  • Remember to run dbt deps on the VM whenever new packages are added to packages.yml.

Stuff that we haven't done but we would like to

  • Automate formatting with git pre-commit.
  • Prepare a quick way to replicate parts of the production DWH on our local machines.