Compare commits


10 commits

- `6b5a5037d8` Pablo Martín, 2025-04-08 10:47:55 +00:00
  Merged PR 4890: Add dbt CI steps
  Adds new steps explaining how to set up infrastructure for the dbt CI VM.
- `6a3cfb7f75` Pablo Martin, 2025-04-08 12:08:53 +02:00
  tiny arg
- `4035d96369` Pablo Martin, 2025-04-04 15:36:19 +02:00
  final touches from real life testing
- `1996bab595` Pablo Martin, 2025-04-04 11:54:22 +02:00
  fix psql install
- `797e9506bf` Pablo Martin, 2025-04-02 15:58:55 +02:00
  add new steps
- `c550492c61` Pablo Martín, 2024-12-18 13:38:11 +00:00
  Merged PR 3870: Add BillingDB reader and permissions
  This PR adds the creation of a new user to the DWH section of the script. This includes both creating the user and describing how to give the proper grants. The explanation is simply a pattern and needs to be adjusted at run time, since it depends on the scope of the integration at deployment time.
  You can read more on the topic here: https://www.notion.so/knowyourguest-superhog/Currency-Rates-for-apps-integration-1600446ff9c9804faa66f982f294e6e8?pvs=4
  Related work items: #25608
- `70d296594f` Pablo Martin, 2024-12-18 14:38:01 +01:00
  typo
- `6a5f6ad0ff` Pablo Martin, 2024-12-18 12:56:55 +01:00
  add user creation and permissions pattern
- `73cd9b2dcc` Pablo Martín, 2024-11-27 08:28:30 +00:00
  Merged PR 3656: Web Gateway
  This PR adapts our infra script to a new pattern. We now have a dedicated, small VM that acts as a gateway for web-server delivered contents. The PR:
  - Adds the instructions to set up the web gateway VM.
  - Adapts some networking and firewall configs (locks down web access from outside the VNET to only the web gateway).
  - Updates the Airbyte deployment instructions so that the UI gets served through the web gateway instead of directly from the Airbyte VM.
  Related work items: #23999
- `5cd91f8f67` Pablo Martin, 2024-11-26 11:27:49 +01:00
  add missing dns reference


@@ -7,11 +7,10 @@ Follow this to deploy the entire data infra.
- You need an Azure subscription and a user with administrator rights in it.
- Whenever you see `<your-env>`, you should replace that with `dev`, `uat`, `prd` or whatever fits your environment.
- We traditionally deploy resources in the `UK South` region. Unless stated otherwise, you should deploy resources there.
- You have an SSH key pair ready to use for access to the different machines. You can always add more pairs later.
## 010. Resource group and SSH Keypair
### 010.1 Create Resource Group
- Create a resource group. This resource group will hold all the resources. For the rest of this guide, assume this is the resource group where you must create resources.
- Name it: `superhog-data-rg-<your-env>`
@@ -19,7 +18,7 @@ Follow this to deploy the entire data infra.
- `team: data`
- `environment: <your-env>`
### 010.2 SSH Keypair
- We will create an SSH Keypair for this deployment. It will be used to access VMs, Git repos and other services.
- Create the SSH Key pair
@@ -437,6 +436,8 @@ Follow this to deploy the entire data infra.
- A user `dbt_user`, with `dwh_builder` role.
- A user `powerbi_user`, with `consumer` role.
- A user `airbyte_user`, with permission to create new schemas.
- A user `billingdb_reader`, with permission to read some tables from the reporting schema.
- A user `ci_reader`, with `modeler` role.
- *Note: replace the password fields with serious passwords and note them down.*
- *Note: replace the name of the admin user.*
@@ -465,7 +466,13 @@ Follow this to deploy the entire data infra.
CREATE ROLE powerbi_user LOGIN PASSWORD 'password' VALID UNTIL 'infinity';
GRANT consumer TO powerbi_user;
CREATE ROLE billingdb_reader LOGIN PASSWORD 'password' VALID UNTIL 'infinity';
CREATE ROLE modeler INHERIT;
CREATE ROLE ci_reader LOGIN PASSWORD 'password' VALID UNTIL 'infinity';
GRANT modeler TO ci_reader;
-- You might want to create a first personal user with modeler role here
-- Login as airbyte_user
@@ -514,6 +521,17 @@ Follow this to deploy the entire data infra.
GRANT SELECT ON ALL TABLES IN SCHEMA sync_<some-new-source> TO modeler;
ALTER DEFAULT PRIVILEGES IN SCHEMA sync_<some-new-source> GRANT SELECT ON TABLES TO modeler;
```
- This script also doesn't specify exactly which tables the `billingdb_reader` should read from, since granting access to the entire reporting schema would be excessive. You can specify which tables the user can read like this:
```sql
-- Login as dbt_user
GRANT USAGE ON SCHEMA reporting TO billingdb_reader;
GRANT SELECT ON TABLE reporting.<some_table> TO billingdb_reader;
GRANT SELECT ON TABLE reporting.<some_other_table> TO billingdb_reader;
...
```
## 050. Web Gateway
@@ -567,6 +585,7 @@ We will deploy a dedicated VM to act as a web server for internal services.
- Caddy will need to be configured to act as the web server or reverse proxy for the different services within the services subnet. The details of these configurations are defined in the sections below.
- As a general note, the pattern will generally be (see the example after this list):
  - Create the right A record in the Private DNS records so that the relevant subdomain points users towards the web gateway.
  - You will need to include the right entry in the `Caddyfile` at `/etc/caddy/Caddyfile`.
  - You will need to reload Caddy with `sudo systemctl reload caddy.service`.
  - If the web server needs to reach a specific port on some other VM, you will need to sort out networking security. If the VM you need to reach from the web server is within the internal services subnet, you'll have to add the necessary Inbound rules in the NSG `superhog-data-nsg-services-<your-env>`.
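- For illustration, a minimal `Caddyfile` entry following this pattern could look like the snippet below. The hostname and port are hypothetical placeholders; each service's section defines the real values:
```
airbyte.<your-private-domain> {
    reverse_proxy <airbyte-vm-ip>:8000
}
```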
@@ -750,6 +769,151 @@ We will deploy a dedicated VM to act as a web server for internal services.
- Our dbt project (<https://guardhog.visualstudio.com/Data/_git/data-dwh-dbt-project>) can be deployed on any linux VM within the virtual network. The instructions on how to deploy and schedule it are in the project repository.
- You can opt to deploy it on the same machine where Airbyte runs, since that machine is probably fairly underutilized.
### 080.1 dbt CI server
Having CI pipelines in the dbt git project is a great way to automate certain quality checks around the DWH code. The way our CI strategy is designed, you need to prepare a VM within our Data private network for the CI jobs to run on. This section explains how to set up that VM. Note that we will only cover infrastructure topics here: you'll have to check the dbt repository for the full story on how to set up the CI. We recommend covering the steps described here before jumping into the dbt-specific part of things.
#### 080.1.1 Deploying the CI VM
- We will have a dedicated VM for the CI pipelines. The pipelines can be resource-hungry at times, so I recommend a dedicated VM that is not shared with other workloads, so you can assign resources adequately and avoid resource competition with other services.
- Create a new VM following these steps (a rough CLI equivalent is sketched after this list):
  - Basic settings
    - Name it: `pipeline-host-<your-env>`
    - Use Ubuntu Server 22.04.
    - Size should be adjusted to the needs of the dbt project. I suggest starting with a `B2s` instance and driving upgrade decisions based on what you observe during normal usage.
    - Use username: `azureuser`
    - Use the SSH Key: `superhog-data-general-ssh-<your-env>`
    - Select the option `None` for Public inbound ports.
  - Disk settings
    - Disk requirements will vary depending on the nature of the dbt project state and the PRs. I suggest starting with the default 30 GB and monitoring usage. If you see spikes that get close to 100%, increase the size to prevent a particularly heavy PR from consuming all the space.
  - Networking
    - Attach to the virtual network `superhog-data-vnet-<your-env>`.
    - Attach to the subnet `services-subnet`.
    - Assign no public IP.
    - For the setting `NIC network security group`, select the option `None`.
  - Management settings
    - Defaults are fine.
  - Monitoring
    - Defaults are fine.
  - Advanced
    - Defaults are fine.
  - Add tags:
    - `team: data`
    - `environment: <your-env>`
    - `project: dbt`
- Once the VM is running, you should be able to SSH into the machine when your VPN is active.
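- For reference, a rough CLI equivalent of the portal steps is sketched below. It assumes the resource group, vnet and stored SSH key already exist; the flag names follow the `az` docs, but double-check them against your CLI version before relying on this:
```bash
# Sketch: create the CI VM from the CLI, mirroring the portal settings above
az vm create \
  --resource-group superhog-data-rg-<your-env> \
  --name pipeline-host-<your-env> \
  --image Ubuntu2204 \
  --size Standard_B2s \
  --admin-username azureuser \
  --ssh-key-name superhog-data-general-ssh-<your-env> \
  --vnet-name superhog-data-vnet-<your-env> \
  --subnet services-subnet \
  --public-ip-address "" \
  --nsg "" \
  --tags team=data environment=<your-env> project=dbt

# With the VPN active, SSH in via the VM's private IP
ssh -i ~/.ssh/<your-private-key> azureuser@<private-ip-of-pipeline-host>
```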
#### 080.1.2 Install docker and docker compose
- We will use docker and docker compose to run a dockerized Postgres server in the VM.
- You can install docker and docker compose by placing the following code in a script and running it:
```bash
#!/bin/bash
set -e # Exit on error
echo "🔄 Updating system packages..."
sudo apt update && sudo apt upgrade -y
echo "📦 Installing dependencies..."
sudo apt install -y \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common \
    lsb-release \
    gnupg2 \
    jq
echo "🔑 Adding Docker GPG key..."
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "🖋️ Adding Docker repository..."
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
echo "📦 Installing Docker..."
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io
echo "✅ Docker installed successfully!"
echo "🔧 Enabling Docker to start on boot..."
sudo systemctl enable docker
echo "🔄 Installing Docker Compose..."
DOCKER_COMPOSE_VERSION=$(curl -s https://api.github.com/repos/docker/compose/releases/latest | jq -r .tag_name)
sudo curl -L "https://github.com/docker/compose/releases/download/${DOCKER_COMPOSE_VERSION}/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
echo "📂 Setting permissions for Docker Compose..."
sudo chmod +x /usr/local/bin/docker-compose
echo "✅ Docker Compose installed successfully!"
# Verifying installation
echo "🔍 Verifying Docker and Docker Compose versions..."
docker --version
docker-compose --version
# Allow the current user to run docker without sudo
sudo usermod -a -G docker $USER
echo "ℹ️ Log out and back in (or run 'newgrp docker' interactively) for the group change to take effect."
echo "✅ Docker and Docker Compose installation completed!"
```
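- Once your user is effectively in the `docker` group (log out and back in first), you can optionally confirm the install end to end with the `hello-world` image. This assumes the VM has outbound internet access:
```bash
docker run --rm hello-world
```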
#### 080.1.3 Install psql
- CI pipelines require `psql`, the Postgres CLI client, to be available.
- You can install it with the following script:
```bash
sudo apt-get update
sudo apt-get install -y gnupg2 wget nano
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
curl -fsSL https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/postgresql.gpg
sudo apt-get update
sudo apt-get install -y postgresql-client-16
```
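- You can sanity-check connectivity from the VM to the DWH with a one-off query. The host and database below are placeholders; use the real DWH values and the `ci_reader` password you noted down:
```bash
psql "host=<dwh-host> port=5432 dbname=<dwh-db> user=ci_reader" -c "SELECT 1;"
```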
#### 080.1.4 Install Python
- Python is needed to create virtual environments and run dbt and other commands.
- You can use the following script to install python and some required packages:
```bash
sudo apt-get install -y python3.10 python3.10-venv
```
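- As a quick smoke test, you can create and activate a virtual environment as sketched below. The exact dbt packages and version pins live in the dbt repository; `dbt-postgres` here is just an assumption:
```bash
python3.10 -m venv ~/ci-venv           # create the environment
source ~/ci-venv/bin/activate          # activate it
pip install --upgrade pip
pip install dbt-postgres               # assumed adapter; check the repo's pinned requirements
dbt --version                          # confirm the executable resolves
```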
#### 080.1.5 Create user in DWH
- The CI Postgres will use some Foreign Data Wrappers (FDW) pointing at the DWH. We will need a dedicated user on the DWH instance to control the permissions granted to the CI server (see the sketch after this list for how the FDW side fits together).
- The section of this guide dedicated to setting up the DWH explains how to create this user. If you have followed it, you might have already created the user. Otherwise, head there to complete this part.
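- For orientation, the FDW wiring on the CI Postgres roughly follows the pattern sketched below. All hosts, passwords and schema names are illustrative; the dbt repository defines the real setup:
```sql
-- Run on the CI Postgres, not on the DWH
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Point a foreign server at the DWH instance
CREATE SERVER dwh_server FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host '<dwh-host>', port '5432', dbname '<dwh-db>');

-- Authenticate through the dedicated ci_reader user created in the DWH section
CREATE USER MAPPING FOR CURRENT_USER SERVER dwh_server
    OPTIONS (user 'ci_reader', password '<ci-reader-password>');

-- Mirror a DWH schema locally (the local target schema must exist first)
CREATE SCHEMA IF NOT EXISTS dwh_mirror;
IMPORT FOREIGN SCHEMA <some_dwh_schema> FROM SERVER dwh_server INTO dwh_mirror;
```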
#### 080.1.6 Install the Azure Devops agent and sync with Devops
- The VM needs a Microsoft-provided Azure Pipelines agent to be reachable by Devops. This agent listens for requests from Devops, basically allowing Devops to execute things on the VM.
- Some configuration needs to be done in the Azure Devops project to allow Azure Devops to reach the VM. This might include creating a pool if one doesn't exist.
- You can find how to set this up on Ubuntu in these links:
- Official MSFT docs: https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/linux-agent?view=azure-devops
- Helpful walkthrough video: https://www.youtube.com/watch?v=Hy6fne9oQJM
- Make sure to install the agent as a systemd service so it always runs on boot. The details are explained in Microsoft's documentation, and a condensed sketch follows this list.
- Once the agent is installed and correctly linked to one of our Pools in Devops, you should see the agent listed in the Devops UI for that pool with status online. Don't move on until you have succeeded on this point.
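- For orientation, the flow condenses to something like the sketch below. The tarball name and version will differ; take the real download URL and PAT setup from the links above:
```bash
# Download the agent tarball from the Devops UI (Agent pools > New agent), then:
mkdir ~/azagent && cd ~/azagent
tar zxvf ~/vsts-agent-linux-x64-<version>.tar.gz
./config.sh                        # interactive: org URL, PAT, target pool, agent name
sudo ./svc.sh install azureuser    # register as a systemd service running as azureuser
sudo ./svc.sh start
sudo ./svc.sh status               # should report active (running)
```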
#### 080.1.7 Clone the project and further steps
- We are going to need a local clone of the git repository to perform some setup steps, as well as for business-as-usual execution.
- To do this:
- Add an SSH key to the VM so it can clone repos from Azure Devops. This could be the key `superhog-data-general-ssh-<your-env>` or some other key; this guide leaves that detail up to you. You can read more on how to use SSH keys with Azure Devops here: https://learn.microsoft.com/en-us/azure/devops/repos/git/use-ssh-keys-to-authenticate?view=azure-devops.
- Also add the following to `~/.ssh/config` to make SSH cloning work. Note that the details might have changed since this guide was written, so your mileage may vary.
```
Host ssh.dev.azure.com
Hostname ssh.dev.azure.com
IdentityFile ~/.ssh/<whatever-key-file-you-are-using>
```
- Once the CI VM is SSH capable, clone the dbt project into the `azureuser` home dir (a clone sketch follows this list).
- There are several steps after this, for which you should find instructions in the dbt repository itself.
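- The clone itself would look roughly like this. The SSH-style URL is derived from the project path above; verify it in the Devops UI under Clone > SSH:
```bash
cd ~
git clone git@ssh.dev.azure.com:v3/guardhog/Data/data-dwh-dbt-project
```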
## 090. Monitoring
### 090.1 Infra monitoring