DWH dbt

Welcome to Superhog's DWH dbt project. Here we model the entire DWH.

How to set up your environment

Basics

  • Prerequisites
    • You need a Unix-like environment: Linux, macOS, or WSL.
    • You need to have Python >=3.10 installed.
    • All docs will assume you are using VSCode.
    • Also install the VSCode Python extension: ms-python.python.
  • Prepare networking
    • You must be able to reach the DWH server through the network. There are several ways to do this.
    • The current recommended route is to use the data VPN. You can ask Pablo to help you set it up.
  • Set up
    • Create a virtual environment for the project with python3 -m venv venv.
    • It's recommended to set the new venv as your default interpreter in VSCode. To do this, press Ctrl+Shift+P and look for the Python: Select Interpreter option. Choose the new venv.
    • Activate the virtual environment and run pip install -r requirements.txt
    • Create an entry for this project in your profiles.yml file at ~/.dbt/profiles.yml. You have a suggested template at profiles.yml.example.
    • Make sure that the profiles.yml host and port settings are consistent with whatever networking approach you've taken.
    • Run chmod 600 ~/.dbt/profiles.yml to secure your profiles file.
    • Run dbt deps to install dbt dependencies
  • Check
    • Ensure you are running in the project venv.
    • Run dbt debug. If it runs well, you are all set. If it fails, there's something wrong with your setup. Grab the terminal output and pull the thread. (A condensed command sketch of these steps follows this list.)
  • Complements: see the Local DWH section below for an optional but very useful addition to your environment.

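If you prefer to run the whole thing from a terminal, here is a condensed, hedged sketch of the Set up and Check steps above. It assumes you start from the project root, have no existing ~/.dbt/profiles.yml, and that the template can be copied as-is and then edited:

mkdir -p ~/.dbt
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp profiles.yml.example ~/.dbt/profiles.yml   # edit host, port and credentials afterwards
chmod 600 ~/.dbt/profiles.yml
dbt deps
dbt debug
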
Local DWH

Having a database where you can run your WIP models is very useful to ease development. But obviously, we can't do that in production. We could do it on a shared dev instance, but then we would step on each other's toes when developing.

To overcome these issues, we rely on local clones of the DWH. The idea is to have a PostgreSQL instance running on your laptop: you run your dbt run statements against it for testing and validate the outcome of your work there. When you are confident and have tested properly, you can open a PR to main.

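For example, assuming your active profiles.yml target points at your local clone, a typical iteration loop looks roughly like this (churned_users is a hypothetical model name):

dbt run --select churned_users    # churned_users is a placeholder, use your model's name
dbt test --select churned_users
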
You will find a docker compose file named dev-dwh.docker-compose.yml. It will simply start a PostgreSQL 16 database on your machine. You can bring it up, adjust it to your needs, and point your profiles.yml file at it when you are developing locally.

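Bringing the local database up and down is a plain docker compose call (the service name, port and credentials are whatever the compose file defines):

docker compose -f dev-dwh.docker-compose.yml up -d
docker compose -f dev-dwh.docker-compose.yml down    # stop it when you are done
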
The only missing piece to make your local deployment behave like the production DWH is the source data from the source systems. The current policy is to generate a dump of what you need from the production database and restore it into your local Postgres. That way, you are using accurate and representative data to do your work.

For example, if you are working on models that use data from a given source system (in this case, Xero), you can dump and restore the relevant schema from your terminal with something roughly like this:

pg_dump -h superhog-dwh-prd.postgres.database.azure.com -U airbyte_user -W -F t -n sync_xero_superhog_limited dwh > xero.dump

pg_restore -h localhost -U postgres -W -d dwh xero.dump

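If the pg_restore fails because the dwh database does not exist yet in your local instance, create it first. A minimal sketch, assuming the default postgres superuser (adjust user and credentials to whatever your local setup uses):

createdb -h localhost -U postgres -W dwh
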
Branching strategy

This repo follows a trunk-based development philosophy (https://trunkbaseddevelopment.com/).

When working on data modeling matters (models, sources, seeds, docs, etc.), use a models branch (e.g. models/churned-users). It's fine and encouraged to build incrementally towards a reporting-level table with multiple PRs, as long as you keep the model buildable along the way.

For other matters, use a chores branch (e.g. chores/add-dbt-package).

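For example, using the branch names above:

git checkout -b models/churned-users
git checkout -b chores/add-dbt-package
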
Project organization

We organize models in four folders:

Conventions

  • Always use CTEs in your models to source and ref other models.
  • We follow snake_case naming.
  • Identifier columns should begin with id_, not end with _id (e.g. id_user rather than user_id).
  • Use binary, question-like column names for binary, bool, and flag columns (e.g. not active but is_active, not verified but has_been_verified, not imported but was_imported).
  • Datetime columns should end in either _utc or _local. If they end in _local, the table should also contain a local_timezone column holding the timezone identifier.
  • We work with many currencies and lack a single main one, so any money field is ambiguous on its own. To address this, any table with money-related columns should also have a column named currency. We currently have no policy for tables where a single record has columns in different currencies; if you face this, assemble the data team and decide on something.

How to schedule

We currently use a minimal setup: we run the project from a VM within our infra with a simple cron job. These instructions are fit for Azure VMs running Ubuntu 22.04; you might need to change details if you are running somewhere else.

To deploy:

  • Prepare a VM with Ubuntu 22.04
  • You need to have Python >=3.10 installed.
  • You must be able to reach the DWH server through the network.
  • On the VM, set up git credentials for the project (for example, with an SSH key), clone the git project into the azureuser home dir, and check out main.
  • Create a virtual environment for the project with python3 -m venv venv.
  • Activate the virtual environment and run pip install -r requirements.txt
  • Also run dbt deps to install the dbt packages required by the project.
  • Create an entry for this project in your profiles.yml file at ~/.dbt/profiles.yml. You have a suggested template at profiles.yml.example. Make sure that the profiles.yml host and port settings are consistent with whatever networking approach you've taken.
  • There's a script in the root of this project called run_dbt.sh. Place it at ~/run_dbt.sh and adjust the paths in the script if you want/need to (a rough sketch of what a script like this does follows this list).
  • Create a cron entry with crontab -e that runs the script. For example: 0 2 * * * /bin/bash /home/azureuser/run_dbt.sh to run the dbt models every day at 2AM.

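For reference, here is a hedged sketch of what a script along these lines typically does. The project path, log location and exact dbt commands are assumptions; the actual run_dbt.sh in the repo is the source of truth:

#!/usr/bin/env bash
# Hedged sketch, not the actual run_dbt.sh: project path and log location are assumptions.
set -euo pipefail

cd /home/azureuser/dwh-dbt            # assumed clone location; adjust to yours
source venv/bin/activate

{
  echo "=== dbt run started at $(date) ==="
  dbt deps
  dbt run
} >> /home/azureuser/dbt_run.log 2>&1
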
To monitor:

  • The script writes output to a dbt_run.log file. You can check its contents to see what happened in past runs. The exact location of the log file depends on how you set up the run_dbt.sh script; if you are unsure where your logs are being written, check the script to find out (see the example below).

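For example, assuming the script logs to ~/dbt_run.log as in the sketch above:

tail -n 100 ~/dbt_run.log
grep -i error ~/dbt_run.log    # quick scan for failed runs
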
To maintain:

  • Remember to update dbt package dependencies when including new packages.

Stuff that we haven't done but we would like to

  • Automate formatting with git pre-commit.
  • Define conventions on testing (and enforce them).
  • Define conventions on documentation (and enforce them).
  • Prepare a quick way to replicate parts of the prd DWH on our local machines.