data-dwh-dbt-project/README.md

# DWH dbt

Welcome to Superhog's DWH dbt project. Here we model the entire DWH.

## How to set up your environment

### Basics

- Pre-requisites
  - You need a Linux environment. That can be Linux, macOS or WSL.
  - You need to have Python `>=3.10` installed.
  - All docs will assume you are using VSCode.
  - Also install the following VSCode Python extension: ms-python.python
- Prepare networking
  - You must be able to reach the DWH server through the network. There are several ways to do this.
  - The current recommended route is to use the data VPN. You can ask Pablo to help you set it up.
- Set up
  - Create a virtual environment for the project with `python3 -m venv venv`.
  - It's recommended that you set up the new `venv` as your default interpreter for VSCode. To do this, click Ctrl+Shift+P, and look for the `Python: Select interpreter` option. Choose the new `venv`.
  - Activate the virtual environment and run `pip install -r requirements.txt`
  - Create an entry for this project `profiles.yml` file at `~/.dbt/profiles.yml`. You have a suggested template at `profiles.yml.example`
  - Make sure that the `profiles.yml` host and port settings are consistent with whatever networking approach you've taken.
  - Run `chmod 600 ~/.dbt/profiles.yml` to secure your profiles file.
  - Run `dbt deps` to install dbt dependencies
- Check
  - Ensure you are running in the project venv.
  - Run `dbt debug`. If it runs well, you are all set. If it fails, there's something wrong with your set up. Grab the terminal output and pull the thread.
- Complements
  - If you are in VSCode, you most probably want to have this extension installed: [dbt Power User](https://marketplace.visualstudio.com/items?itemName=innoverio.vscode-dbt-power-user)
  - It is advised to use [this autoformatter](https://sqlfmt.com/) and to automatically [run it on save](https://docs.sqlfmt.com/integrations/vs-code). **Important**: if you have already installed dbt Power User, [follow the instructions of this link directly](https://docs.sqlfmt.com/integrations/dbt-power-user).

### Local DWH

Running a local version of the DWH allows you to test things as you develop: a must if you want to push changes to master without breaking everything.

You can read on how to set this up in `dev-env/local_dwh.md`.

## Branching strategy

This repo works in a trunk-based-development philosophy (<https://trunkbaseddevelopment.com/>).

If your branch is related to a work item from DevOps, we encourage adding the ticket number in the branch name. For example: `models/123-some-fancy-name`. If you don't have a ticket number, you can simply do a `NOTICKET` one: `models/NOTICKET-some-fancy-name`.

When working on Data modeling stuff (models, sources, seeds, docs, etc.) use a `models` branch (i.e. `models/782-churned-users`). It's fine and encouraged to build incrementally towards a `reporting` level table with multiple PRs as long as you keep the model buildable along the way.

For other matters, use a `chores` branch (i.e. `chores/656-add-dbt-package`).

## Project organization

We organize models in three folders:

- `staging`
  - Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/2-staging>
  - One `.yml` per `sync` schema, with naming `_<sourcename>_sources.yml`. For example, for Core, `_core_sources.yml`.
  - All models go prefixed with `stg_`.
  - Avoid `SELECT *`. We don't know what dirty stuff can come from the `sync` schemas.
- `intermediate`
  - Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/3-intermediate>
  - It's strictly forbidden to use tables here to serve end users.
  - Make an effort to practice DRY.
- `reporting`
  - Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/4-marts>
  - For now, we follow a monolithic approach and just have one `reporting` schema. When this becomes insufficient, we will judge splitting into several schemas.
  - Make an effort to keep this layer stable like you would do with a library's API so that downstream dependencies don't break without control.

## Conventions

- dbt practices:
  - Always use CTEs in your models to `source` and `ref` other models.
- Columns and naming
  - We follow [snake case](https://en.wikipedia.org/wiki/Snake_case) for column names and table names.
  - Identifier columns should begin with `id_`, not finish with `_id`.
  - Use binary question-like column names for binary, bool, and flag columns (i.e. not `active` but `is_active`, not `verified` but `has_been_verified`, not `imported` but `was_imported`)
  - Datetime columns should either finish in `_utc` or `_local`. If they finish in local, the table should contain a `local_timezone` column that contains the [timezone identifier](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones).
  - We work with many currencies and lack a single main once. Hence, any money fields will be ambiguous on their own. To address this, any table that has money related columns should also have a column named `currency`. We currently have no policy for tables where a single record has columns in different currencies. If you face this, assemble the data team and decide on something.
- Folder structures and naming
  - All models live in models, and either in staging, intermediate or reporting.
  - Staging models should be prepended with `stg_` and intermediate with `int_`.
  - Split schema and domain with double underscode (ie `stg_core__booking`).
  - Always use sources to read into staging models.
- SQL formatting should be done with `sqlfmt`.
- YAML files:
  - Should use the `.yml` extension, not `.yaml`.
  - Should be autoformatted on save. If you install [this vscode extension](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml), autoformatting should happen out of the box thanks to the settings included in the `.vscode/settings.json` file.
- Other conventions
  - In staging, enforce a `lower()` to user UUID fields to avoid nasty propagations in the DWH.

When in doubt, do what dbt guys would do: <https://docs.getdbt.com/best-practices>
Or Gitlab: <https://handbook.gitlab.com/handbook/business-technology/data-team/platform/dbt-guide/>

## Testing Standards

- All tables need Primary Key and Null tests.
- Tables in staging and reporting should have more thorough testing. What to look for is up to you, but it should provide strong confidence in the quality of data.
- Tests will be ran after every `dbt run`.

## How to schedule

We currently use a minimal setup where we run the project from a VM within our infra with a simple cron job. These instructions are fit for Azure VMs running Ubuntu 22.04, you might need to change details if you are running somewhere else.

To deploy:

- Prepare a VM with Ubuntu 22.04
- You need to have Python `>=3.10` installed.
- You must be able to reach the DWH server through the network.
- On the VM, set up git creds for the project (for example, with an ssh key) and clone the git project in the `azureuser` home dir. And checkout main.
- Create a virtual environment for the project with `python3 -m venv venv`.
- Activate the virtual environment and run `pip install -r requirements.txt`
- Also run `dbt deps` to install the dbt packages required by the project.
- Create an entry for this project `profiles.yml` file at `~/.dbt/profiles.yml`. You have a suggested template at `profiles.yml.example`. Make sure that the `profiles.yml` host and port settings are consistent with whatever networking approach you've taken.
- There are three scripts in the root of this project called `run_dbt.sh`,  `run_tests.sh` and `run_docs.sh`. Place them in the running user's home folder. Adjust the paths of the script if you want/need to.
- `run_dbt.sh` and `run_tests.sh` don't take any CLI arguments. `run_docs.sh` takes one: the folder where you would like the docs to be placed. So, if you want docs at `/some/path/for/docs/`, you would call the script like this: `/bin/bash run_docs.sh /some/path/for/docs/`.
- The scripts are designed to send both success and failure messages to slack channels upon completion. To properly set this up, you will need to place a file called `slack_webhook_urls.txt` on the same path you put the script files. The slack webhooks file should have two lines: `SLACK_ALERT_WEBHOOK_URL=<url-of-webhook-for-failures>` and `SLACK_RECEIPT_WEBHOOK_URL=<url-of-webhook-for-successful-runs>`. Setting up the slack channels and webhooks is outside of the scope of this readme.
- Create a cron entry with `crontab -e` that runs the scripts. For example, you can use the following line to sequentially build the documentation, run the models and then test the DWH, making each step only happen if the previous one succeds:
  
  ```bash
  # This goes in your crontab file
  15 6 * * * /bin/bash /home/azureuser/run_docs.sh /home/azureuser/dbtdocs && /bin/bash /home/azureuser/run_dbt.sh && /bin/bash /home/azureuser/run_tests.sh 
  ```

To monitor:

- The model building script writes output to a `dbt_run.log` file. You can check the contents to see what happened in the past runs. The exact location of the log file depends on how you set up the `run_dbt.sh` script. If you are unsure of where your logs are being written, check the script to find out.
- Same applies to the test script, except it will write into a separate `dbt_test.log`.
- And the docs script will write in `dbt_docs.log`.

To maintain:

- Remember to update dbt package dependencies when including new packages.

## Serving the docs with a web server

Once you build the docs with `run_docs.sh`, you will have a bunch of files. To open them up, you will need to serve them with a webserver like Caddy or Nginx.

This goes beyond the scope of this project: to understand how you can serve these, refer to our [infra script repo](https://guardhog.visualstudio.com/Data/_git/data-infra-script). Specifically, the bits around the web gateway set up.

## Detecting (and dropping) orphan models in the DWH

If you remove a model from the dbt project, but that model had already been materialized as a table or view in the DWH, the DWH object won't go on its own. You'll have to explictly drop it.

In order to make your life easier, we have a utility script in this repo for this purpose: `find_orphan_models_in_db.sh`.

You can use this script to detect and identify any orphan models. The script can be used one off or be scheduled with slack messaging, so you get automated alerts any time an orphan model appears.

The script is designed to be called from the same machine where you are executing the regular `dbt run` calls. You can try to use it in your local machine, but there are multiple gotchas which might lead to confusion.

To use it:
- *Note that this assumes you've set up the project in the VM as described in previous sections. If you deviate in naming, paths, etc, you'll probably have to adjust some references here.*
- In the VM, copy it from the project repo into the home folder: `cp find_orphan_models_in_db.sh ~/find_orphan_models_in_db.sh` and make it executable: `chmod 700 ~/find_orphan_models_in_db.sh`.
- The script takes two positional arguments: a comma separated list of schemas to review, and a path to dbt's `manifest.json`.
- Typically, if you call from the VM, you would do: `./find_orphan_models_in_db.sh staging,intermediate,reporting data-dwh-dbt-project/target/manifest.json`.
- There is an optional `--slack` flag that will send success/failure messages to slack channels. The necessary configuration is the same described in the "How to schedule" section, so if you've already set up the dbt run, test and docs commands, you don't need to take any other steps to start sending slack messages.
  - Example usage: `./find_orphan_models_in_db.sh --slack staging,intermediate,reporting data-dwh-dbt-project/target/manifest.json `.


How to schedule:
- Simply add a cronjob in the VM with the command:

  ```bash
  COMMAND="0 9 * * * /bin/bash /home/azureuser/find_orphan_models_in_db.sh --slack staging,intermediate,reporting /home/azureuser/data-dwh-dbt-project/target/manifest.json"
  (crontab -u $USER -l; echo "$COMMAND" ) | crontab -u $USER -
  ```

Note some caveats:
- `sync` models are not checked.
- If for any reason, you add tables or views that are unrelated to the dbt project in the monitored schemas, these will be identified as orphan by this script. Be careful, you might drop them accidentally if you don't pay attention. The simple solution to this is... don't use dbt schemas for non-dbt purposes.

## CI

CI can be setup to review PRs and make the developer experience more solid and less error prone.

You can find more details on the topic in the `ci` folder.

Note that this is an optional part of the project: you can happily work without CI if needed.

## Stuff that we haven't done but we would like to

- Automate formatting with git pre-commit.
many things 2024-01-18 17:25:41 +01:00			`# DWH dbt`
Readme 2024-01-18 10:59:12 +01:00
many things 2024-01-18 17:25:41 +01:00			`Welcome to Superhog's DWH dbt project. Here we model the entire DWH.`
Readme 2024-01-18 10:59:12 +01:00
many things 2024-01-18 17:25:41 +01:00			`## How to set up your environment`
superhog user raw 2024-01-18 12:24:29 +01:00
instructions for local dwh 2024-02-15 15:36:04 +01:00			`### Basics`

many things 2024-01-18 17:25:41 +01:00			`- Pre-requisites`
			`- You need a Linux environment. That can be Linux, macOS or WSL.`
switch to requirements 2024-02-02 16:21:58 +01:00			- You need to have Python `>=3.10` installed.
many things 2024-01-18 17:25:41 +01:00			`- All docs will assume you are using VSCode.`
couple of small readme details 2024-05-29 11:27:21 +02:00			`- Also install the following VSCode Python extension: ms-python.python`
some changes in networking instructions 2024-02-01 15:32:50 +01:00			`- Prepare networking`
			`- You must be able to reach the DWH server through the network. There are several ways to do this.`
			`- The current recommended route is to use the data VPN. You can ask Pablo to help you set it up.`
many things 2024-01-18 17:25:41 +01:00			`- Set up`
switch to requirements 2024-02-02 16:21:58 +01:00			- Create a virtual environment for the project with `python3 -m venv venv`.
couple of small readme details 2024-05-29 11:27:21 +02:00			- It's recommended that you set up the new `venv` as your default interpreter for VSCode. To do this, click Ctrl+Shift+P, and look for the `Python: Select interpreter` option. Choose the new `venv`.
small docs details 2024-02-02 16:31:45 +01:00			- Activate the virtual environment and run `pip install -r requirements.txt`
many things 2024-01-18 17:25:41 +01:00			- Create an entry for this project `profiles.yml` file at `~/.dbt/profiles.yml`. You have a suggested template at `profiles.yml.example`
some changes in networking instructions 2024-02-01 15:32:50 +01:00			- Make sure that the `profiles.yml` host and port settings are consistent with whatever networking approach you've taken.
a few tiny changes in readme and profiles template 2024-05-28 14:29:31 +02:00			- Run `chmod 600 ~/.dbt/profiles.yml` to secure your profiles file.
			- Run `dbt deps` to install dbt dependencies
many things 2024-01-18 17:25:41 +01:00			`- Check`
couple of small readme details 2024-05-29 11:27:21 +02:00			`- Ensure you are running in the project venv.`
a few tiny changes in readme and profiles template 2024-05-28 14:29:31 +02:00			- Run `dbt debug`. If it runs well, you are all set. If it fails, there's something wrong with your set up. Grab the terminal output and pull the thread.
many things 2024-01-18 17:25:41 +01:00			`- Complements`
			`- If you are in VSCode, you most probably want to have this extension installed: [dbt Power User](https://marketplace.visualstudio.com/items?itemName=innoverio.vscode-dbt-power-user)`
Specifying sqlfmt instructions to follow for dbt Power User 2024-09-06 15:46:19 +02:00			`- It is advised to use [this autoformatter](https://sqlfmt.com/) and to automatically [run it on save](https://docs.sqlfmt.com/integrations/vs-code). Important: if you have already installed dbt Power User, [follow the instructions of this link directly](https://docs.sqlfmt.com/integrations/dbt-power-user).`
Readme 2024-01-18 10:59:12 +01:00
instructions for local dwh 2024-02-15 15:36:04 +01:00			`### Local DWH`

stuff 2024-09-30 09:52:43 +02:00			`Running a local version of the DWH allows you to test things as you develop: a must if you want to push changes to master without breaking everything.`
instructions for local dwh 2024-02-15 15:36:04 +01:00
stuff 2024-09-30 09:52:43 +02:00			You can read on how to set this up in `dev-env/local_dwh.md`.
instructions for local dwh 2024-02-15 15:36:04 +01:00
many things 2024-01-18 17:25:41 +01:00			`## Branching strategy`
Readme 2024-01-18 10:59:12 +01:00
many things 2024-01-18 17:25:41 +01:00			`This repo works in a trunk-based-development philosophy (<https://trunkbaseddevelopment.com/>).`

update readme 2024-09-12 15:35:13 +02:00			If your branch is related to a work item from DevOps, we encourage adding the ticket number in the branch name. For example: `models/123-some-fancy-name`. If you don't have a ticket number, you can simply do a `NOTICKET` one: `models/NOTICKET-some-fancy-name`.
some changes in branching strategy 2024-03-06 11:21:11 +01:00
update readme 2024-09-12 15:35:13 +02:00			When working on Data modeling stuff (models, sources, seeds, docs, etc.) use a `models` branch (i.e. `models/782-churned-users`). It's fine and encouraged to build incrementally towards a `reporting` level table with multiple PRs as long as you keep the model buildable along the way.

			For other matters, use a `chores` branch (i.e. `chores/656-add-dbt-package`).
many things 2024-01-18 17:25:41 +01:00
			`## Project organization`

Merged PR 2725: Force id user field to lower in staging # Description Forces lower case to all id_users in staging. Removes hardcoded lower case in intermediate. Adapts readme to contemplate the lowering of id users. I propose to merge, run in prod and run tests in prod as a proper evaluation method. BTW, I only find one id_user_host that was in capital letters, so that's why probably we didn't care that much about this. Still, I prefer have things clean from the start! ``` select * from staging.stg_core__booking scb left join intermediate.int_core__unified_user icuu on lower(scb.id_user_host) = lower(icuu.id_user) where scb.id_user_host <> icuu.id_user ``` # Checklist - [ ] The edited models and dependants run properly with production data. All models run in stg, did not check all the dependants - [ ] The edited models are sufficiently documented. Have not checked - [ ] The edited models contain PK tests, and I've ran and passed them. - [X] I have checked for DRY opportunities with other models and docs. - [ ] I've picked the right materialization for the affected models. # Other - [ ] Check if a full-refresh is required after this PR is merged. Related work items: #20776 2024-09-03 14:36:21 +00:00			`We organize models in three folders:`
many things 2024-01-18 17:25:41 +01:00
			- `staging`
			`- Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/2-staging>`
some details in readme 2024-02-06 11:39:52 +01:00			- One `.yml` per `sync` schema, with naming `_<sourcename>_sources.yml`. For example, for Core, `_core_sources.yml`.
many things 2024-01-18 17:25:41 +01:00			- All models go prefixed with `stg_`.
			- Avoid `SELECT *`. We don't know what dirty stuff can come from the `sync` schemas.
			- `intermediate`
			`- Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/3-intermediate>`
some details in readme 2024-02-06 11:39:52 +01:00			`- It's strictly forbidden to use tables here to serve end users.`
many things 2024-01-18 17:25:41 +01:00			`- Make an effort to practice DRY.`
			- `reporting`
			`- Pretty much this: <https://docs.getdbt.com/best-practices/how-we-structure/4-marts>`
			- For now, we follow a monolithic approach and just have one `reporting` schema. When this becomes insufficient, we will judge splitting into several schemas.
			`- Make an effort to keep this layer stable like you would do with a library's API so that downstream dependencies don't break without control.`

			`## Conventions`

add a few more conventions to readme 2024-05-31 10:35:57 +02:00			`- dbt practices:`
			- Always use CTEs in your models to `source` and `ref` other models.
			`- Columns and naming`
			`- We follow [snake case](https://en.wikipedia.org/wiki/Snake_case) for column names and table names.`
			- Identifier columns should begin with `id_`, not finish with `_id`.
			- Use binary question-like column names for binary, bool, and flag columns (i.e. not `active` but `is_active`, not `verified` but `has_been_verified`, not `imported` but `was_imported`)
			- Datetime columns should either finish in `_utc` or `_local`. If they finish in local, the table should contain a `local_timezone` column that contains the [timezone identifier](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones).
			- We work with many currencies and lack a single main once. Hence, any money fields will be ambiguous on their own. To address this, any table that has money related columns should also have a column named `currency`. We currently have no policy for tables where a single record has columns in different currencies. If you face this, assemble the data team and decide on something.
			`- Folder structures and naming`
			`- All models live in models, and either in staging, intermediate or reporting.`
			- Staging models should be prepended with `stg_` and intermediate with `int_`.
			- Split schema and domain with double underscode (ie `stg_core__booking`).
			`- Always use sources to read into staging models.`
			- SQL formatting should be done with `sqlfmt`.
update readme 2024-09-12 15:35:13 +02:00			`- YAML files:`
			- Should use the `.yml` extension, not `.yaml`.
			- Should be autoformatted on save. If you install [this vscode extension](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml), autoformatting should happen out of the box thanks to the settings included in the `.vscode/settings.json` file.
Merged PR 2725: Force id user field to lower in staging # Description Forces lower case to all id_users in staging. Removes hardcoded lower case in intermediate. Adapts readme to contemplate the lowering of id users. I propose to merge, run in prod and run tests in prod as a proper evaluation method. BTW, I only find one id_user_host that was in capital letters, so that's why probably we didn't care that much about this. Still, I prefer have things clean from the start! ``` select * from staging.stg_core__booking scb left join intermediate.int_core__unified_user icuu on lower(scb.id_user_host) = lower(icuu.id_user) where scb.id_user_host <> icuu.id_user ``` # Checklist - [ ] The edited models and dependants run properly with production data. All models run in stg, did not check all the dependants - [ ] The edited models are sufficiently documented. Have not checked - [ ] The edited models contain PK tests, and I've ran and passed them. - [X] I have checked for DRY opportunities with other models and docs. - [ ] I've picked the right materialization for the affected models. # Other - [ ] Check if a full-refresh is required after this PR is merged. Related work items: #20776 2024-09-03 14:36:21 +00:00			`- Other conventions`
			- In staging, enforce a `lower()` to user UUID fields to avoid nasty propagations in the DWH.
add a few more conventions to readme 2024-05-31 10:35:57 +02:00
			`When in doubt, do what dbt guys would do: <https://docs.getdbt.com/best-practices>`
			`Or Gitlab: <https://handbook.gitlab.com/handbook/business-technology/data-team/platform/dbt-guide/>`

			`## Testing Standards`

tiny comment change in project readme 2024-10-02 15:28:29 +02:00			`- All tables need Primary Key and Null tests.`
			`- Tables in staging and reporting should have more thorough testing. What to look for is up to you, but it should provide strong confidence in the quality of data.`
add a few more conventions to readme 2024-05-31 10:35:57 +02:00			- Tests will be ran after every `dbt run`.
many things 2024-01-18 17:25:41 +01:00
deployment instructions 2024-02-06 12:09:47 +01:00			`## How to schedule`

			`We currently use a minimal setup where we run the project from a VM within our infra with a simple cron job. These instructions are fit for Azure VMs running Ubuntu 22.04, you might need to change details if you are running somewhere else.`

			`To deploy:`

			`- Prepare a VM with Ubuntu 22.04`
			- You need to have Python `>=3.10` installed.
			`- You must be able to reach the DWH server through the network.`
			- On the VM, set up git creds for the project (for example, with an ssh key) and clone the git project in the `azureuser` home dir. And checkout main.
			- Create a virtual environment for the project with `python3 -m venv venv`.
			- Activate the virtual environment and run `pip install -r requirements.txt`
update instructions 2024-03-06 11:30:44 +01:00			- Also run `dbt deps` to install the dbt packages required by the project.
deployment instructions 2024-02-06 12:09:47 +01:00			- Create an entry for this project `profiles.yml` file at `~/.dbt/profiles.yml`. You have a suggested template at `profiles.yml.example`. Make sure that the `profiles.yml` host and port settings are consistent with whatever networking approach you've taken.
instructions 2024-11-21 17:36:29 +01:00			- There are three scripts in the root of this project called `run_dbt.sh`, `run_tests.sh` and `run_docs.sh`. Place them in the running user's home folder. Adjust the paths of the script if you want/need to.
			- `run_dbt.sh` and `run_tests.sh` don't take any CLI arguments. `run_docs.sh` takes one: the folder where you would like the docs to be placed. So, if you want docs at `/some/path/for/docs/`, you would call the script like this: `/bin/bash run_docs.sh /some/path/for/docs/`.
update readme 2024-08-28 16:20:51 +02:00			- The scripts are designed to send both success and failure messages to slack channels upon completion. To properly set this up, you will need to place a file called `slack_webhook_urls.txt` on the same path you put the script files. The slack webhooks file should have two lines: `SLACK_ALERT_WEBHOOK_URL=<url-of-webhook-for-failures>` and `SLACK_RECEIPT_WEBHOOK_URL=<url-of-webhook-for-successful-runs>`. Setting up the slack channels and webhooks is outside of the scope of this readme.
instructions 2024-11-21 17:36:29 +01:00			- Create a cron entry with `crontab -e` that runs the scripts. For example, you can use the following line to sequentially build the documentation, run the models and then test the DWH, making each step only happen if the previous one succeds:

			```bash
			`# This goes in your crontab file`
			`15 6 * * * /bin/bash /home/azureuser/run_docs.sh /home/azureuser/dbtdocs && /bin/bash /home/azureuser/run_dbt.sh && /bin/bash /home/azureuser/run_tests.sh`
			```
pending todo so i don't forget 2024-08-26 16:30:18 +02:00
update instructions 2024-03-06 11:30:44 +01:00			`To monitor:`
deployment instructions 2024-02-06 12:09:47 +01:00
update readme 2024-08-28 16:20:51 +02:00			- The model building script writes output to a `dbt_run.log` file. You can check the contents to see what happened in the past runs. The exact location of the log file depends on how you set up the `run_dbt.sh` script. If you are unsure of where your logs are being written, check the script to find out.
			- Same applies to the test script, except it will write into a separate `dbt_test.log`.
instructions 2024-11-21 17:36:29 +01:00			- And the docs script will write in `dbt_docs.log`.
deployment instructions 2024-02-06 12:09:47 +01:00
update instructions 2024-03-06 11:30:44 +01:00			`To maintain:`

			`- Remember to update dbt package dependencies when including new packages.`

instructions 2024-11-21 17:36:29 +01:00			`## Serving the docs with a web server`

			Once you build the docs with `run_docs.sh`, you will have a bunch of files. To open them up, you will need to serve them with a webserver like Caddy or Nginx.

			`This goes beyond the scope of this project: to understand how you can serve these, refer to our [infra script repo](https://guardhog.visualstudio.com/Data/_git/data-infra-script). Specifically, the bits around the web gateway set up.`

script and docs 2025-07-04 12:25:21 +02:00			`## Detecting (and dropping) orphan models in the DWH`

			`If you remove a model from the dbt project, but that model had already been materialized as a table or view in the DWH, the DWH object won't go on its own. You'll have to explictly drop it.`

			In order to make your life easier, we have a utility script in this repo for this purpose: `find_orphan_models_in_db.sh`.

			`You can use this script to detect and identify any orphan models. The script can be used one off or be scheduled with slack messaging, so you get automated alerts any time an orphan model appears.`

			The script is designed to be called from the same machine where you are executing the regular `dbt run` calls. You can try to use it in your local machine, but there are multiple gotchas which might lead to confusion.

			`To use it:`
			`- Note that this assumes you've set up the project in the VM as described in previous sections. If you deviate in naming, paths, etc, you'll probably have to adjust some references here.`
			- In the VM, copy it from the project repo into the home folder: `cp find_orphan_models_in_db.sh ~/find_orphan_models_in_db.sh` and make it executable: `chmod 700 ~/find_orphan_models_in_db.sh`.
			- The script takes two positional arguments: a comma separated list of schemas to review, and a path to dbt's `manifest.json`.
			- Typically, if you call from the VM, you would do: `./find_orphan_models_in_db.sh staging,intermediate,reporting data-dwh-dbt-project/target/manifest.json`.
			- There is an optional `--slack` flag that will send success/failure messages to slack channels. The necessary configuration is the same described in the "How to schedule" section, so if you've already set up the dbt run, test and docs commands, you don't need to take any other steps to start sending slack messages.
			- Example usage: `./find_orphan_models_in_db.sh --slack staging,intermediate,reporting data-dwh-dbt-project/target/manifest.json `.


			`How to schedule:`
			`- Simply add a cronjob in the VM with the command:`

			```bash
			`COMMAND="0 9 * * * /bin/bash /home/azureuser/find_orphan_models_in_db.sh --slack staging,intermediate,reporting /home/azureuser/data-dwh-dbt-project/target/manifest.json"`
			`(crontab -u $USER -l; echo "$COMMAND" ) \| crontab -u $USER -`
			```

			`Note some caveats:`
			- `sync` models are not checked.
			`- If for any reason, you add tables or views that are unrelated to the dbt project in the monitored schemas, these will be identified as orphan by this script. Be careful, you might drop them accidentally if you don't pay attention. The simple solution to this is... don't use dbt schemas for non-dbt purposes.`

silly change just to trigger PR 2025-03-19 09:14:26 +01:00			`## CI`

wip 2025-04-02 15:00:22 +02:00			`CI can be setup to review PRs and make the developer experience more solid and less error prone.`

			You can find more details on the topic in the `ci` folder.

			`Note that this is an optional part of the project: you can happily work without CI if needed.`
silly change just to trigger PR 2025-03-19 09:14:26 +01:00
many things 2024-01-18 17:25:41 +01:00			`## Stuff that we haven't done but we would like to`

			`- Automate formatting with git pre-commit.`