# DWH dbt

Welcome to Superhog's DWH dbt project. Here we model the entire DWH.

## How to set up your environment

- Pre-requisites
  - You need a Unix-like environment: Linux, macOS or WSL.
  - You need to have Python `>=3.10` installed.
  - All docs will assume you are using VSCode.
- Prepare networking
  - You must be able to reach the DWH server through the network. There are several ways to do this.
  - The current recommended route is to use the data VPN. You can ask Pablo to help you set it up.
- Set up
  - Create a virtual environment for the project with `python3 -m venv venv`.
  - Activate the virtual environment and run `pip install -r requirements.txt`.
  - Create an entry for this project in your `~/.dbt/profiles.yml` file. There is a suggested template at `profiles.yml.example`.
  - Make sure that the host and port settings in `profiles.yml` are consistent with whatever networking approach you've taken.
- Check
  - Ensure you are running in the project venv, either by setting the VSCode Python interpreter to the one inside `venv`, or by running `source venv/bin/activate` in the console from the root dir.
  - Make sure your network route to `dev` is up (for example, the data VPN) and run `dbt debug`. If it runs well, you are all set. If it fails, there's something wrong with your setup. Grab the terminal output and pull the thread.
- Complements
  - If you are in VSCode, you most probably want to have this extension installed: [dbt Power User](https://marketplace.visualstudio.com/items?itemName=innoverio.vscode-dbt-power-user)
  - It is advised to use [this autoformatter](https://sqlfmt.com/) and to automatically [run it on save](https://docs.sqlfmt.com/integrations/vs-code).
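
Before running `dbt debug`, the two local pre-conditions from the list above (Python `>=3.10` and an active virtual environment) can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration and are not part of this repo.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the pre-requisites


def python_version_ok(version_info=sys.version_info) -> bool:
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def in_virtualenv() -> bool:
    """Return True when running inside a venv.

    Inside a venv, sys.prefix points at the venv directory while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print(f"Python >= 3.10: {python_version_ok()}")
    print(f"Inside a virtual environment: {in_virtualenv()}")
```

If the second line prints `False`, the venv is most likely not activated in the current shell.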

## Branching strategy

This repo follows a trunk-based development philosophy (<https://trunkbaseddevelopment.com/>).

Open a feature branch (`feature/your-branch-name`) for any changes and make it short-lived. It's fine and encouraged to build incrementally towards a `mart` level table with multiple PRs as long as you keep the model buildable along the way.

## Project organization

We organize models in four folders:

- `sync`
  - Dedicated to sources.
  - One `.yml` per `sync` schema.
  - No SQL models go here.
- `staging`
  - Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/2-staging>
  - All models go prefixed with `stg_`.
  - Avoid `SELECT *`. We don't know what dirty stuff can come from the `sync` schemas.
- `intermediate`
  - Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/3-intermediate>
  - It's strictly forbidden to expose tables from this layer to end users.
  - Make an effort to practice DRY.
- `reporting`
  - Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/4-marts>
  - For now, we follow a monolithic approach and just have one `reporting` schema. When this becomes insufficient, we will consider splitting it into several schemas.
  - Make an effort to keep this layer stable like you would do with a library's API so that downstream dependencies don't break without control.
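
The two mechanical rules above (no SQL models in `sync`, and the `stg_` prefix in `staging`) lend themselves to an automated check. The following is a minimal sketch, assuming paths are given relative to the folder that contains the four layer folders. The function name and the example file names are hypothetical.

```python
from pathlib import PurePosixPath

# The four layer folders described above.
LAYERS = ("sync", "staging", "intermediate", "reporting")


def check_model_path(path: str) -> list[str]:
    """Return the rule violations for one file path.

    Assumption: `path` is relative to the folder holding the layer folders.
    """
    p = PurePosixPath(path)
    layer = p.parts[0] if p.parts else ""
    if layer not in LAYERS:
        return [f"{path}: not inside one of {LAYERS}"]

    problems = []
    if layer == "sync" and p.suffix == ".sql":
        problems.append(f"{path}: no SQL models go in sync")
    if layer == "staging" and p.suffix == ".sql" and not p.name.startswith("stg_"):
        problems.append(f"{path}: staging models must be prefixed with stg_")
    return problems


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the rules.
    for example in ("staging/stg_bookings.sql", "staging/bookings.sql", "sync/users.sql"):
        print(example, check_model_path(example))
```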

## Conventions

- Always use CTEs in your models to `source` and `ref` other models.
- We follow [snake case](https://en.wikipedia.org/wiki/Snake_case).
- Identifier columns should begin with `id_`, not finish with `_id`.
- Use binary question-like column names for binary, bool, and flag columns (e.g. not `active` but `is_active`, not `verified` but `has_been_verified`, not `imported` but `was_imported`).
- Datetime columns should finish in either `_utc` or `_local`. If they finish in `_local`, the table should contain a `local_timezone` column that contains the [timezone identifier](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones).
- We work with many currencies and lack a single main one. Hence, any money field is ambiguous on its own. To address this, any table that has money-related columns should also have a column named `currency`. We currently have no policy for tables where a single record has columns in different currencies. If you face this, assemble the data team and decide on something.
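
To make the naming rules above concrete, here is a small Python sketch that checks a table's column names against them. It is an illustration, not tooling that exists in this repo. The type labels (`boolean`, `datetime`, `money`), the accepted boolean prefixes (`is_`, `has_`, `was_`) and all column names are assumptions made for the example.

```python
import re

# Assumed prefixes, derived from the examples is_active, has_been_verified, was_imported.
BOOLEAN_PREFIXES = ("is_", "has_", "was_")
DATETIME_SUFFIXES = ("_utc", "_local")
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def check_columns(columns: dict[str, str]) -> list[str]:
    """Check a mapping of column name -> coarse type label against the conventions."""
    problems = []
    for name, kind in columns.items():
        if not SNAKE_CASE.fullmatch(name):
            problems.append(f"{name}: not snake_case")
        if name.endswith("_id"):
            problems.append(f"{name}: identifiers begin with id_, not finish with _id")
        if kind == "boolean" and not name.startswith(BOOLEAN_PREFIXES):
            problems.append(f"{name}: boolean columns need a question-like name")
        if kind == "datetime" and not name.endswith(DATETIME_SUFFIXES):
            problems.append(f"{name}: datetime columns finish in _utc or _local")

    # Table-level rules: companion columns that must exist.
    has_local = any(
        kind == "datetime" and name.endswith("_local") for name, kind in columns.items()
    )
    if has_local and "local_timezone" not in columns:
        problems.append("table has _local datetimes but no local_timezone column")
    if "money" in columns.values() and "currency" not in columns:
        problems.append("table has money columns but no currency column")
    return problems


if __name__ == "__main__":
    # Hypothetical table that breaks several conventions at once.
    print(check_columns({"booking_id": "integer", "active": "boolean", "total_amount": "money"}))
```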

## Stuff that we haven't done but we would like to

- Automate formatting with git pre-commit.
- Define conventions on testing (and enforce them).
- Define conventions on documentation (and enforce them).
- Prepare a quick way to replicate parts of the `prd` DWH on our local machines.