# DWH dbt

Welcome to Superhog's DWH dbt project. Here we model the entire DWH.

## How to set up your environment

- Pre-requisites
  - You need a Unix-like environment: Linux, macOS or WSL.
  - You need to have Python `>=3.10` installed.
  - All docs will assume you are using VSCode.
- Prepare networking
  - You must be able to reach the DWH server through the network. There are several ways to do this.
  - The current recommended route is to use the data VPN. You can ask Pablo to help you set it up.
- Set up
  - Create a virtual environment for the project with `python3 -m venv venv`.
  - Activate the virtual environment and run `pip install -r requirements.txt`.
  - Create an entry for this project in the `profiles.yml` file at `~/.dbt/profiles.yml`. There is a suggested template at `profiles.yml.example`.
  - Make sure that the `profiles.yml` host and port settings are consistent with whatever networking approach you've taken.
- Check
  - Ensure you are running in the project venv, either by setting the VSCode Python interpreter to the one you created above, or by activating it in the console (`source venv/bin/activate`) when in the root dir.
  - Turn on your tunnel to `dev` and run `dbt debug`. If it runs well, you are all set. If it fails, there's something wrong with your setup. Grab the terminal output and pull the thread.
- Complements
  - If you are in VSCode, you most probably want to have this extension installed: [dbt Power User](https://marketplace.visualstudio.com/items?itemName=innoverio.vscode-dbt-power-user)
  - It is advised to use [this autoformatter](https://sqlfmt.com/) and to automatically [run it on save](https://docs.sqlfmt.com/integrations/vs-code).

## Branching strategy

This repo follows a trunk-based development philosophy. Open a feature branch (`feature/your-branch-name`) for any changes and keep it short-lived. It's fine and encouraged to build incrementally towards a `mart`-level table with multiple PRs, as long as you keep the model buildable along the way.

## Project organization

We organize models in four folders:

- `sync`
  - Dedicated to sources.
  - One `.yml` per `sync` schema.
  - No SQL models go here.
- `staging`
  - Pretty much this:
    - All models go prefixed with `stg_`.
    - Avoid `SELECT *`. We don't know what dirty stuff can come from the `sync` schemas.
- `intermediate`
  - Pretty much this:
    - It's strictly forbidden to expose tables from this layer to end users.
    - Make an effort to practice DRY.
- `reporting`
  - Pretty much this:
    - For now, we follow a monolithic approach and just have one `reporting` schema. When this becomes insufficient, we will consider splitting it into several schemas.
    - Make an effort to keep this layer stable, like you would with a library's API, so that downstream dependencies don't break unexpectedly.

## Conventions

- Always use CTEs in your models to `source()` and `ref()` other models.
- We follow [snake case](https://en.wikipedia.org/wiki/Snake_case).
- Identifier columns should begin with `id_`, not end with `_id`.
- Name binary, bool, and flag columns as yes/no questions (e.g. not `active` but `is_active`, not `verified` but `has_been_verified`, not `imported` but `was_imported`).
- Datetime columns should end in either `_utc` or `_local`. If they end in `_local`, the table should also contain a `local_timezone` column holding the [timezone identifier](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones).
- We work with many currencies and lack a single main one, so any money field is ambiguous on its own. To address this, any table that has money-related columns should also have a column named `currency`. We currently have no policy for tables where a single record has columns in different currencies. If you face this, assemble the data team and decide on something.
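To make the conventions above concrete, here is a minimal sketch of what a model could look like. All model and column names (`stg_bookings`, `stg_payments`, `id_booking`, and so on) are hypothetical and only illustrate the CTE and naming patterns; they are not real models in this project.

```sql
-- Hypothetical example: models/intermediate/int_booking_payments.sql
with
    -- import CTEs: upstream models always come in through ref() (or source() in staging)
    bookings as (select * from {{ ref("stg_bookings") }}),
    payments as (select * from {{ ref("stg_payments") }}),

    final as (
        select
            bookings.id_booking,        -- identifiers begin with id_
            bookings.is_active,         -- flags read as yes/no questions
            bookings.created_at_utc,    -- datetimes end in _utc or _local
            bookings.check_in_at_local,
            bookings.local_timezone,    -- required because a _local column is present
            payments.amount_paid,
            payments.currency           -- required next to any money column
        from bookings
        inner join payments on bookings.id_booking = payments.id_booking
    )

select *
from final
```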
## Stuff that we haven't done but we would like to

- Automate formatting with git pre-commit (a possible starting point is sketched below).
- Define conventions on testing (and enforce them).
- Define conventions on documentation (and enforce them).
- Prepare a quick way to replicate parts of the `prd` DWH on our local machines.
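For the pre-commit item, one possible starting point is a local hook that runs the `sqlfmt` CLI over staged SQL files. This is purely a sketch and assumes `sqlfmt` stays the formatter of choice and is installed in the project virtualenv; it is not part of the repo today.

```yaml
# .pre-commit-config.yaml -- illustrative sketch only, not adopted yet
repos:
  - repo: local
    hooks:
      - id: sqlfmt
        name: sqlfmt
        entry: sqlfmt       # relies on sqlfmt being available on PATH (e.g. via the venv)
        language: system
        types: [sql]
```

Running `pre-commit install` once per clone would then format any staged `.sql` files before each commit.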