DWH dbt

Welcome to Superhog's DWH dbt project. Here we model the entire DWH.

How to set up your environment

  • Prerequisites
    • You need a Unix-like environment: Linux, macOS, or WSL.
    • You need to have Python >=3.10 installed.
    • All docs will assume you are using VSCode.
  • Prepare networking
    • You must be able to reach the DWH server through the network. There are several ways to do this.
    • The current recommended route is to use the data VPN. You can ask Pablo to help you set it up.
  • Set up
    • Create a virtual environment for the project with python3 -m venv venv.
    • Activate the virtual environment and run pip install -r requirements.txt.
    • Create an entry for this project in your profiles.yml file at ~/.dbt/profiles.yml. A suggested template is available at profiles.yml.example (see the sketch after this list).
    • Make sure that the profiles.yml host and port settings are consistent with whatever networking approach you've taken.
  • Check
    • Ensure you are running in the project venv, either by setting the VSCode Python interpreter to the one created above, or by activating the venv (source venv/bin/activate) in the console when in the root dir.
    • Turn on your tunnel to dev and run dbt debug. If it succeeds, you are all set. If it fails, there's something wrong with your setup. Grab the terminal output and pull the thread.
  • Complements
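
A minimal sketch of what the ~/.dbt/profiles.yml entry could look like, assuming a Postgres-style adapter. The profile name, adapter type, host, port, and credentials below are placeholders: take the real values from profiles.yml.example and keep host and port consistent with your VPN or tunnel setup.

```yaml
# Hypothetical sketch only; copy the real template from profiles.yml.example.
# The profile name must match the "profile" key in dbt_project.yml.
dwh_dbt:                     # placeholder profile name
  target: dev
  outputs:
    dev:
      type: postgres         # placeholder adapter; use the one from the template
      host: localhost        # must match your VPN/tunnel networking setup
      port: 5432             # must match your VPN/tunnel networking setup
      user: your_user
      password: your_password
      dbname: dwh            # placeholder database name
      schema: dbt_your_name  # placeholder development schema
      threads: 4
```

Once this file is in place, dbt debug should be able to connect through whatever networking route you have active.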

Branching strategy

This repo follows a trunk-based development workflow (https://trunkbaseddevelopment.com/).

Open a feature branch (feature/your-branch-name) for any changes and keep it short-lived. It's fine and encouraged to build incrementally towards a mart-level table with multiple PRs, as long as you keep the model buildable along the way.

Project organization

We organize models in four folders:

Conventions

  • Always use CTEs in your models to source and ref other models.
  • We follow snake_case.
  • Identifier columns should begin with id_, not end with _id.
  • Use question-like names for boolean and flag columns (e.g. not active but is_active, not verified but has_been_verified, not imported but was_imported).
  • Datetime columns should end in either _utc or _local. If they end in _local, the table should contain a local_timezone column holding the timezone identifier.
  • We work with many currencies and lack a single main one, so any money field is ambiguous on its own. To address this, any table with money-related columns should also have a column named currency. We currently have no policy for tables where a single record has columns in different currencies; if you face this, assemble the data team and decide on something. The sketch after this list illustrates these conventions.
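
A hedged sketch of a model that follows these conventions. The relation and column names (stg_bookings, id_booking, and so on) are placeholders for illustration, not real project models:

```sql
-- Hypothetical example for illustration; relation and column names are placeholders.
with bookings as (
    -- import CTE: always pull in other relations via ref() (or source()) inside CTEs
    select * from {{ ref('stg_bookings') }}
),

final as (
    select
        id_booking,            -- identifiers begin with id_
        id_guest,
        is_active,             -- boolean flags read like questions
        was_imported,
        created_at_utc,        -- datetimes end in _utc or _local
        check_in_at_local,
        local_timezone,        -- required because a _local column is present
        total_amount,
        currency               -- required alongside money-related columns
    from bookings
)

select * from final
```

The same pattern applies when reading raw tables with source() instead of ref().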

Stuff we haven't done yet but would like to

  • Automate formatting with git pre-commit.
  • Define conventions on testing (and enforce them).
  • Define conventions on documentation (and enforce them).
  • Prepare a quick way to replicate parts of the prd DWH on our local machines.