# Human Script
Follow this to deploy the entire data infra.
## 000. Pre-requisites and conventions
- You need an Azure subscription and a user with administrator rights in it.
- Whenever you see `<your-env>`, replace it with `dev`, `uat`, `prd`, or whatever fits your environment.
- We traditionally deploy resources on the `UK South` region. Unless stated otherwise, you should deploy resources there.
## 010. Resource group and SSH Keypair
### 010.1 Create Resource Group
- Create a resource group. This resource group will hold all the resources. For the rest of this guide, assume this is the resource group where you must create resources.
- Name it: `superhog-data-rg-<your-env>`
- Add tags:
- `team: data`
- `environment: <your-env>`
### 010.2 SSH Keypair
- We will create an SSH Keypair for this deployment. It will be used to access VMs, Git repos and other services.
- Create the SSH Key pair
- Name the key: `superhog-data-general-ssh-<your-env>`
- Add tags:
- `team: data`
- `environment: <your-env>`
- Pay attention when storing the private key. You probably want to store it in a safe password manager, like Keeper.
- Optionally, you can also be extra paranoid: generate the SSH key locally and only upload the public key to Azure. Up to you.
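If you go the local-generation route, a minimal sketch looks like this (the key name follows the convention above; swap `dev` for your environment):

```bash
# Generate the keypair locally with an ed25519 key and no passphrase prompt.
# Only the .pub file ever needs to leave your machine.
KEY_NAME=superhog-data-general-ssh-dev
KEY_DIR=$(mktemp -d)
ssh-keygen -t ed25519 -N "" -C "$KEY_NAME" -f "$KEY_DIR/$KEY_NAME" -q
# This is the public key you paste into Azure when creating the SSH key resource:
cat "$KEY_DIR/$KEY_NAME.pub"
```

Store the private key file in the password manager straight away and remove it from the temporary directory.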
## 020. Networking
### 020.1 VNET
- Create a virtual network. This virtual network is where all our infra will live. For the rest of this guide, assume this is the network where you must connect services.
- Name it: `superhog-data-vnet-<your-env>`
- You need to decide what the network range should be. For reference, a `/24` space (256 addresses) should be plenty, since we will only have a handful of network interfaces connecting.
- As an example, we will use `10.69.0.0/24`. This link might be helpful: <https://www.davidc.net/sites/default/subnets/subnets.html?network=10.69.0.0&mask=24&division=11.f10>
- You need to add three subnets:
- Add no network security groups to any of the subnets yet. We will create those later.
- Jumphost subnet
- This subnet is where jumphost boxes will live.
- It will be the only subnet where we allow inbound connections from WAN.
- Name it `jumphost-subnet`.
- For our example, we will make it `10.69.0.0/29` (8 addresses).
- Database subnet
- This subnet is where the DWH database will live.
- Inbound traffic will be allowed from both the jumphost subnet as well as the services subnet.
- Name it `database-subnet`
- For our example, we will make it `10.69.0.8/29` (8 addresses).
- Services subnet
- This subnet is where most VMs dedicated to data services live (Airbyte, dbt, PBI Data Gateway, etc).
- Inbound traffic will only be allowed from the jumphost subnet.
- Name it `services-subnet`
- For our example, we will make it `10.69.0.64/26` (64 addresses)
- Add tags:
- `team: data`
- `environment: <your-env>`
- `project: network`
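The example address plan above can be sanity-checked locally with plain bash arithmetic: a `/N` prefix leaves `32-N` host bits, i.e. `2^(32-N)` addresses per subnet.

```bash
# Print the number of addresses in each example subnet (no Azure needed).
for entry in "jumphost-subnet:29" "database-subnet:29" "services-subnet:26"; do
  name=${entry%%:*}
  prefix=${entry##*:}
  echo "$name /$prefix -> $(( 2 ** (32 - prefix) )) addresses"
done
```

Remember that Azure reserves 5 addresses per subnet, so the usable count is slightly lower than the raw size.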
### 020.2 Network security groups
- You will create three network security groups (NSG)
- Jumphost NSG
- Name it: `superhog-data-nsg-jumphost-<your-env>`
- Purpose: only allow connecting to the VPN server. We deny absolutely any other inbound traffic.
- Add tags:
- `team: data`
- `environment: <your-env>`
- `project: network`
- Add the following inbound rules
- VPN Rule
- Name: AllowWireguardInbound
- Source: Any
- Source port ranges: *
- Destination: the address range for the `jumphost-subnet`. In this example, `10.69.0.0/29`.
- Destination port ranges: 52420
- Protocol: UDP
- Action: Allow
- Priority: 100
- Deny Rule
- Name: DenyAllInbound
- Source: Any
- Source port ranges: *
- Destination: Any
- Destination port ranges: *
- Protocol: Any
- Action: Deny
- Priority: 1000
- Services NSG
- Name it: `superhog-data-nsg-services-<your-env>`
- Purpose: only allow the service VMs to be reached from our jumphost subnet. We deny absolutely any other inbound traffic.
- Add tags:
- `team: data`
- `environment: <your-env>`
- `project: network`
- Add the following inbound rules
- SSH Rule
- Name: AllowSSHFromJumphostInbound
- Source: the address range for the `jumphost-subnet`. In this example, `10.69.0.0/29`.
- Source port ranges: *
- Destination: the address range for the `services-subnet`. In this example, `10.69.0.64/26`.
- Destination port ranges: 22
- Protocol: TCP
- Action: Allow
- Priority: 100
- RDP Rule
- Name: AllowRDPFromJumphostInbound
- Source: the address range for the `jumphost-subnet`. In this example, `10.69.0.0/29`.
- Source port ranges: *
- Destination: the address range for the `services-subnet`. In this example, `10.69.0.64/26`.
- Destination port ranges: 3389
- Protocol: TCP
- Action: Allow
- Priority: 110
- Web server Rule
- Name: AllowWebFromJumphostInbound
- Source: the address range for the `jumphost-subnet`. In this example, `10.69.0.0/29`.
- Source port ranges: *
- Destination: the address range for the `services-subnet`. In this example, `10.69.0.64/26`.
- Destination port ranges: 80
- Protocol: TCP
- Action: Allow
- Priority: 120
- Deny Rule
- Name: DenyAllInbound
- Source: Any
- Source port ranges: *
- Destination: Any
- Destination port ranges: *
- Protocol: Any
- Action: Deny
- Priority: 1000
- Database NSG
- Name it: `superhog-data-nsg-database-<your-env>`
- Purpose: make the databases subnet reachable only from our services subnet and from our jumphost subnet.
- Add tags:
- `team: data`
- `environment: <your-env>`
- `project: network`
- Add the following inbound rules
- Postgres Jumphost Rule
- Name: AllowPostgresFromJumphostInbound
- Source: the address range for the `jumphost-subnet`. In this example, `10.69.0.0/29`.
- Source port ranges: *
- Destination: the address range for the `database-subnet`. In this example, `10.69.0.8/29`.
- Destination port ranges: 5432
- Protocol: TCP
- Action: Allow
- Priority: 100
- Postgres Services Rule
- Name: AllowPostgresFromServicesInbound
- Source: the address range for the `services-subnet`. In this example, `10.69.0.64/26`.
- Source port ranges: *
- Destination: the address range for the `database-subnet`. In this example, `10.69.0.8/29`.
- Destination port ranges: 5432
- Protocol: TCP
- Action: Allow
- Priority: 110
- Deny Rule
- Name: DenyAllInbound
- Source: Any
- Source port ranges: *
- Destination: Any
- Destination port ranges: *
- Protocol: Any
- Action: Deny
- Priority: 1000
- Finally, you need to attach each NSG to the related subnet
- Visit the virtual network page and look for the subnets list
- For each subnet, select its NSG and attach it
### 020.3 Private DNS Zone
- We will set up a private DNS Zone to avoid using hardcoded IPs to refer to services within the virtual network. This makes integrations more resilient because a service can change its IP and still be reached by other services (as long as other network configs like firewalls are still fine).
- Create the Private DNS Zone
- Name it: `<your-env>.data.superhog.com`
- Add tags:
- `team: data`
- `environment: <your-env>`
- `project: network`
- Add a new virtual network link to the zone
- Name it: `privatelink-<your-env>.data.superhog.com`
- Associate it to the virtual network.
- Enable autoregistration
### 020.4 Public IP
- We will need a public IP for the jumphost.
- Create the public IP
- Name it: `superhog-data-jumphost-ip-<your-env>`
- For setting `Routing preference` select option: `Microsoft Network`
- Add tags:
- `team: data`
- `environment: <your-env>`
- `project: network`
## 030. Jumphost
### 030.1 Deploy Jumphost VM
- The first VM we must deploy is a jumphost, since that will be our door to all other services inside the virtual network.
- Create the VM
- Basic settings
- Name it: `jumphost-<your-env>`
- Use Ubuntu Server 22.04
- Use Size: `Standard_B1s`
- Use username: `azureuser`
- Use the SSH Key: `superhog-data-general-ssh-<your-env>`
- Select the option `None` for Public inbound ports.
- Disk settings
- Defaults are fine. This barely needs any disk.
- Networking
- Attach to the virtual network `superhog-data-vnet-<your-env>`
- Attach to the subnet `jumphost-subnet`
- Attach the public ip `superhog-data-jumphost-ip-<your-env>`
- For setting `NIC network security group` select option `None`
- Management settings
- Defaults are fine.
- Monitoring
- Defaults are fine.
- Advanced
- Defaults are fine.
- Add tags:
- `team: data`
- `environment: <your-env>`
- `project: network`
### 030.2 Configure a VPN Server
- The jumphost we just created is not accessible via SSH from WAN due to the NSG set in the jumphost subnet.
- To get in for the first time, temporarily create a rule like this in the NSG `superhog-data-nsg-jumphost-<your-env>`.
- Name: AllowSSHInboundTemporarily
- Source: your IP.
- Source port ranges: *
- Destination: the address range for the `jumphost-subnet`. In this example, `10.69.0.0/29`.
- Destination port ranges: 22
- Protocol: TCP
- Action: Allow
- Priority: 110
- Connect through SSH
- We will now set up a VPN server and client with Wireguard
- Run the following script (requires `sudo`) to install wireguard and configure it
- Pay attention: you need to fill in the public IP manually, as well as the network mask of the virtual network
- *Note: the IPs chosen for the VPN can absolutely be changed. Just make sure they are consistent across the server and client configurations of the VPN.*
```bash
JUMPHOST_PUBLIC_IP=<write-the-public-ip-here>
NETWORK_MASK_FOR_VNET=<write-the-network-mask-here>
echo "Installing Wireguard."
apt update
apt install wireguard -y
echo "Wireguard installed."
echo "Creating keys."
SERVER_PRIVATE_KEY=$(wg genkey)
SERVER_PUBLIC_KEY=$(echo "$SERVER_PRIVATE_KEY" | wg pubkey)
CLIENT_PRIVATE_KEY=$(wg genkey)
CLIENT_PUBLIC_KEY=$(echo "$CLIENT_PRIVATE_KEY" | wg pubkey)
echo "Keys created."
echo "Writing server config file."
touch /etc/wireguard/wg0.conf
cat > /etc/wireguard/wg0.conf << EOL
[Interface]
PrivateKey = ${SERVER_PRIVATE_KEY}
Address = 192.168.69.1/32
ListenPort = 52420
# IP forwarding
PreUp = sysctl -w net.ipv4.ip_forward=1
# IP masquerading
PreUp = iptables -t mangle -A PREROUTING -i wg0 -j MARK --set-mark 0x30
PreUp = iptables -t nat -A POSTROUTING ! -o wg0 -m mark --mark 0x30 -j MASQUERADE
PostDown = iptables -t mangle -D PREROUTING -i wg0 -j MARK --set-mark 0x30
PostDown = iptables -t nat -D POSTROUTING ! -o wg0 -m mark --mark 0x30 -j MASQUERADE
[Peer]
PublicKey = ${CLIENT_PUBLIC_KEY}
AllowedIPs = 192.168.70.1/32
EOL
echo "Server config file written."
echo "Configuration for client, copy paste in your machine."
cat << EOF
##############################
[Interface]
# Jumphost VPN
PrivateKey = ${CLIENT_PRIVATE_KEY}
Address = 192.168.70.1/32
# Uncomment when DNS Server is ready
# DNS = 192.168.69.1
[Peer]
PublicKey = ${SERVER_PUBLIC_KEY}
AllowedIPs = 192.168.69.1/32,${NETWORK_MASK_FOR_VNET}
Endpoint = ${JUMPHOST_PUBLIC_IP}:52420
##############################
EOF
echo "Setting the Wireguard server as a system service."
systemctl enable wg-quick@wg0.service
echo "Starting Wireguard server."
systemctl start wg-quick@wg0.service
echo "Finished."
```
- You should copy the client config that the script will produce and set up the Wireguard config on your local machine.
- Once you've done so, start Wireguard on the client and try to ping the server from the client with the Wireguard VPN IP. If it reaches, the VPN is working fine.
- Now, validate your setup by SSHing from your local device into the jumphost by referencing the VPN IP of the jumphost instead of the public IP.
- Once you verify everything works, you should go to the NSG of the jumphost and remove rule AllowSSHInboundTemporarily. From this point on, the only entrypoint from WAN to the virtual network is the VPN port in the jumphost machine.
- Next, we must allow IP forwarding on Azure.
- Look for the jumphost VM Network Interface.
- In the `IP configurations` section, activate the flag `Enable IP forwarding`.
### 030.3 Configure a DNS Server
- The jumphost is now ready. When the VPN is active on our local device, we can access the services within the virtual network.
- There is one issue, though: we would like to access services through names, not IPs.
- Our Private DNS Zone takes care of providing names to services within the virtual network. But this resolution only happens within the virtual network itself, so our external devices can't rely on it.
- To solve this, we need to force DNS resolution of our laptops to happen from within the virtual network itself.
- To do so, we will set up a DNS server in the jumphost, and set up our VPN configuration to use it when the VPN connection in our device is active.
- Connect to the jumphost through SSH
- Run the following script as `sudo` from the home folder of `azureuser`
```bash
echo "Installing dependencies."
apt install dpkg-dev debhelper jq -y
echo "Cloning coredns."
git clone https://github.com/coredns/deployment.git coredns/deployment
cd coredns/deployment
echo "Building package."
dpkg-buildpackage -us -uc -b
cd ..
echo "Installing package."
dpkg -i coredns*.deb
echo "Disabling Stub resolver."
sed -i -e 's/#DNSStubListener=yes/DNSStubListener=no/g' /etc/systemd/resolved.conf
systemctl restart systemd-resolved
echo "Writing config file."
rm /etc/coredns/Corefile
cat > /etc/coredns/Corefile << EOL
. {
hosts {
log
# If you want to make custom mappings, place them here
# Format is
# xxx.xxx.xxx.xxx your.domain.name
# By default, we delegate on Azure
fallthrough
}
forward . 168.63.129.16 # This IP is Azure's DNS service
errors
}
EOL
echo "Restarting coredns to pick up new config."
systemctl restart coredns.service
```
- In your client Wireguard configuration, uncomment the DNS server line we left before
- Check that the service is running fine by running `dig google.com`. You should see in the output that your laptop has relied on our new DNS to do the name resolution.
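If you ever need the DNS server to resolve a name Azure doesn't know about, the `hosts` block in the Corefile accepts static mappings. A sketch with a made-up name and IP (adjust to your case, then restart with `sudo systemctl restart coredns.service`):

```
. {
    hosts {
        log
        # Static mapping (hypothetical example):
        10.69.0.70 legacy-service.dev.data.superhog.com
        # Anything not matched above falls through to Azure's DNS
        fallthrough
    }
    forward . 168.63.129.16 # This IP is Azure's DNS service
    errors
}
```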
### 030.4 Harden the Jumphost VM
- In the Jumphost, run the following command to fully disable password-based SSH authentication. This way, access can only be granted with SSH key pairs, which is far more secure: `sudo sed -i -e 's/#PasswordAuthentication yes/PasswordAuthentication no/g' /etc/ssh/sshd_config; sudo systemctl restart ssh`.
- Remove the AllowSSHInboundTemporarily rule that you added to the NSG `superhog-data-nsg-jumphost-<your-env>`. We don't need that anymore since we can SSH through the VPN tunnel.
## 040. DWH
### 040.1 Deploy PostgreSQL Server
- Next, we will deploy a Postgres server to act as the DWH.
- Create a new Azure Database for PostgreSQL flexible servers.
- Basics
- Name it: `superhog-dwh-<your-env>`.
- On field `PostgreSQL version` pick version 16.
- Adapt the sizing to your needs. Only you know how much this server is going to take.
- For field `Authentication method` pick `PostgreSQL authentication only`.
- Name the user admin: `dwh_admin_<your-env>`.
- Give it a password and make sure to note it down.
- Networking
- On field `Connectivity method` select `Private access (VNet Integration)`
- Pick the virtual network `superhog-data-vnet-<your-env>` and the subnet `database-subnet`.
- Create a new private DNS zone. Unfortunately, we can't use `<your-env>.data.superhog.com` for this service.
- Security
- Defaults are fine
- Add tags:
- `team: data`
- `environment: <your-env>`
- `project: dwh`
- Validate the deployment by trying to log into the database with the `dwh_admin_<your-env>` user from your favourite SQL client (you can use DBeaver, for example). Be aware that your VPN connection should be active so that the DWH is reachable from your device.
### 040.2 Create database
- Run the following command to create a new database:
```sql
CREATE DATABASE dwh;
```
- From now on, use this database for everything
### 040.3 Create schemas, roles and users
- Run the following script to create:
- A `dwh_builder` role, which:
- Owns the `staging`, `intermediate` and `reporting` schemas, so it can do as it pleases in them.
- Can read `sync_` schemas.
- Designed to support `dbt run`.
- A `modeler` role, which:
- Can read the `staging`, `intermediate` and `reporting` schemas.
- Can read `sync_` schemas.
- Designed for data team members and power users to be able to read everything within the DWH.
- A `consumer` role, which:
- Can read the `reporting` schema.
- A user `dbt_user`, with `dwh_builder` role.
- A user `powerbi_user`, with `consumer` role.
- A user `airbyte_user`, with permission to create new schemas.
- A user `billingdb_reader`, with permission to read some tables from the reporting schema.
- A user `ci_reader`, with `modeler` role.
- *Note: replace the password fields with serious passwords and note them down.*
- *Note: replace the name of the admin user*
```sql
-- Start here, logged in as the dwh admin
CREATE ROLE airbyte_user LOGIN PASSWORD 'password' VALID UNTIL 'infinity';
GRANT CREATE ON DATABASE dwh TO airbyte_user;
-- Login as the airbyte_user
CREATE SCHEMA sync_default;
-- Login as the dwh admin again
CREATE SCHEMA staging;
CREATE SCHEMA intermediate;
CREATE SCHEMA reporting;
CREATE ROLE dwh_builder INHERIT;
ALTER SCHEMA staging OWNER TO dwh_builder;
ALTER SCHEMA intermediate OWNER TO dwh_builder;
ALTER SCHEMA reporting OWNER TO dwh_builder;
CREATE ROLE dbt_user LOGIN PASSWORD 'password' VALID UNTIL 'infinity';
GRANT dwh_builder to dbt_user;
CREATE ROLE consumer INHERIT;
CREATE ROLE powerbi_user LOGIN PASSWORD 'password' VALID UNTIL 'infinity';
GRANT consumer to powerbi_user;
CREATE ROLE billingdb_reader LOGIN PASSWORD 'password' VALID UNTIL 'infinity';
CREATE ROLE modeler INHERIT;
CREATE ROLE ci_reader LOGIN PASSWORD 'password' VALID UNTIL 'infinity';
GRANT modeler to ci_reader;
-- You might want to create a first personal user with modeler role here
-- Login as airbyte_user
GRANT USAGE ON SCHEMA sync_default TO dwh_builder;
GRANT SELECT ON ALL TABLES IN SCHEMA sync_default TO dwh_builder;
ALTER DEFAULT PRIVILEGES IN SCHEMA sync_default GRANT SELECT ON TABLES TO dwh_builder;
GRANT USAGE ON SCHEMA sync_default TO modeler;
GRANT SELECT ON ALL TABLES IN SCHEMA sync_default TO modeler;
ALTER DEFAULT PRIVILEGES IN SCHEMA sync_default GRANT SELECT ON TABLES TO modeler;
-- Login as dbt_user
GRANT USAGE ON SCHEMA staging TO modeler;
GRANT SELECT ON ALL TABLES IN SCHEMA staging TO modeler;
ALTER DEFAULT PRIVILEGES IN SCHEMA staging GRANT SELECT ON TABLES TO modeler;
GRANT USAGE ON SCHEMA intermediate TO modeler;
GRANT SELECT ON ALL TABLES IN SCHEMA intermediate TO modeler;
ALTER DEFAULT PRIVILEGES IN SCHEMA intermediate GRANT SELECT ON TABLES TO modeler;
GRANT USAGE ON SCHEMA reporting TO modeler;
GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO modeler;
ALTER DEFAULT PRIVILEGES IN SCHEMA reporting GRANT SELECT ON TABLES TO modeler;
GRANT USAGE ON SCHEMA reporting TO consumer;
GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO consumer;
ALTER DEFAULT PRIVILEGES IN SCHEMA reporting GRANT SELECT ON TABLES TO consumer;
```
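After running the script, you can sanity-check the grants from the admin session. A sketch using the standard `information_schema` views (note it only shows privileges on tables that already exist, so empty schemas return no rows):

```sql
-- List table-level privileges per role in the modeled schemas.
SELECT DISTINCT grantee, table_schema, privilege_type
FROM information_schema.role_table_grants
WHERE table_schema IN ('staging', 'intermediate', 'reporting', 'sync_default')
ORDER BY grantee, table_schema;
```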
- On adding new users:
- Typically, you will want to create personal accounts for data team members with `modeler` role so that they can query everywhere in the dwh.
- Any other services or users that need to access the reporting layer can be given the `consumer` role.
- Furthermore, `sync_` schema permissions need to be dynamically managed from this point on. This means that:
- Generally, all `sync_` schemas should be created by the `airbyte_user`.
- Whenever a new `sync_` schema comes to life, both the `modeler` and `dwh_builder` roles should receive access. You can use the following command template:
```sql
-- Login as airbyte_user
GRANT USAGE ON SCHEMA sync_<some-new-source> TO dwh_builder;
GRANT SELECT ON ALL TABLES IN SCHEMA sync_<some-new-source> TO dwh_builder;
ALTER DEFAULT PRIVILEGES IN SCHEMA sync_<some-new-source> GRANT SELECT ON TABLES TO dwh_builder;
GRANT USAGE ON SCHEMA sync_<some-new-source> TO modeler;
GRANT SELECT ON ALL TABLES IN SCHEMA sync_<some-new-source> TO modeler;
ALTER DEFAULT PRIVILEGES IN SCHEMA sync_<some-new-source> GRANT SELECT ON TABLES TO modeler;
```
- This script also doesn't specify exactly which tables the `billingdb_reader` should read from, since providing full access to the entire reporting schema would be excessive. You can specify which tables should be readable by the user like this:
```sql
-- Login as dbt_user
GRANT USAGE ON SCHEMA reporting TO billingdb_reader;
GRANT SELECT ON TABLE reporting.<some_table> TO billingdb_reader;
GRANT SELECT ON TABLE reporting.<some_other_table> TO billingdb_reader;
...
```
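As an example, creating a personal account for a data team member with the `modeler` role follows the same pattern as the script above (the name and password below are placeholders):

```sql
-- Logged in as the dwh admin. Replace the name and password.
CREATE ROLE jane_doe LOGIN PASSWORD 'a-serious-password' VALID UNTIL 'infinity';
GRANT modeler TO jane_doe;
```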
## 050. Web Gateway
We will deploy a dedicated VM to act as a web server for internal services.
### 050.1 Deploy Web Gateway VM
- Create a new VM following these steps.
- Basic settings
- Name it: `web-gateway-<your-env>`
- Use Ubuntu Server 22.04
- Use size: `Standard_B1s`
- Use username: `azureuser`
- Use the SSH Key: `superhog-data-general-ssh-<your-env>`
- Select the option `None` for Public inbound ports.
- Disk settings
- Defaults are fine. This barely needs any disk.
- Networking
- Attach to the virtual network `superhog-data-vnet-<your-env>`
- Attach to the subnet `services-subnet`
- Assign no public IP.
- For setting `NIC network security group` select option `None`
- Management settings
- Defaults are fine.
- Monitoring
- Defaults are fine.
- Advanced
- Defaults are fine.
- Add tags:
- `team: data`
- `environment: <your-env>`
- `project: network`
- Once the VM is running, you should be able to ssh into the machine when your VPN is active.
### 050.2 Deploying Caddy
- We need to install Caddy on the VM. You can do so with the following commands:
```bash
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https curl
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update
sudo apt install caddy
```
- After the previous commands, you can verify that caddy is running properly as a systemd service with: `systemctl status caddy`
- You can also verify that Caddy is reachable (should be) by running the following command from your laptop while connected to the VPN: `curl web-gateway-<your-env>.<your-env>.data.superhog.com`. If you see a wall of HTML that looks like Caddy's demo page, it means Caddy is working as expected.
### 050.3 Pointing Caddy to internal services
- Caddy will need to be configured to act as the web server or reverse proxy of the different services within the services subnet. The details of these configurations are defined in sections below.
- As a general note, the pattern will generally be:
- Create the right A record in the Private DNS Zone so that a subdomain points users towards the web gateway.
- You will need to include the right entry in the `Caddyfile` at `/etc/caddy/Caddyfile`.
- You will need to reload caddy with `sudo systemctl reload caddy.service`.
- If the web server needs to reach a specific port in some other VM, you will need to sort networking security out. If the VM you need to reach from the web server is within the internal services subnet, you'll have to add the necessary Inbound rules in the NSG `superhog-data-nsg-services-<your-env>`.
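Putting the pattern together, a typical `Caddyfile` entry looks like this (the service name, VM name and port are hypothetical; the real entries are defined in the sections below):

```
# Reverse proxy for a hypothetical internal service
http://someservice.<your-env>.data.superhog.com {
    reverse_proxy http://some-vm-<your-env>.<your-env>.data.superhog.com:8080
}
```

After editing the file, remember to run `sudo systemctl reload caddy.service`.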
## 060. Airbyte
### 060.1 Deploying Airbyte VM
- Airbyte lives on its own VM. Create a new VM following these steps.
- Basic settings
- Name it: `airbyte-<your-env>`
- Use Ubuntu Server 22.04
- For testing, I suggest size `Standard_DS1_v2`. For production, get something beefier.
- Use username: `azureuser`
- Use the SSH Key: `superhog-data-general-ssh-<your-env>`
- Select the option `None` for Public inbound ports.
- Disk settings
- Increasing the data disk to at least 64 GB as a starting point is recommended. Airbyte can be a bit of a disk hog, and running low on space might lead to obscure errors. Start with 64 GB and monitor as usage grows.
- Networking
- Attach to the virtual network `superhog-data-vnet-<your-env>`
- Attach to the subnet `services-subnet`
- Assign no public IP.
- For setting `NIC network security group` select option `None`
- Management settings
- Defaults are fine.
- Monitoring
- Defaults are fine.
- Advanced
- Defaults are fine.
- Add tags:
- `team: data`
- `environment: <your-env>`
- `project: airbyte`
- Once the VM is running, you should be able to ssh into the machine when your VPN is active.
### 060.2 Deploying Airbyte
- SSH into the VM.
- Run the following script to install docker and deploy Airbyte
- *Note: replace the variables at the beginning with proper values*
```bash
AIRBYTE_ADMIN_USER=your-user-here
AIRBYTE_ADMIN_PASSWORD=your-password-here
echo "Installing docker."
apt-get update -y
apt-get install apt-transport-https ca-certificates curl gnupg lsb-release -y
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
apt-get update
apt-get install docker-ce docker-ce-cli -y
usermod -a -G docker "$USER"
# Note: the group change only takes effect on the next login. The script is
# expected to run as root (sudo), so the docker commands below still work.
apt-get install docker-compose-plugin -y
echo "Deploying Airbyte"
mkdir airbyte
cd airbyte
wget https://raw.githubusercontent.com/airbytehq/airbyte/master/run-ab-platform.sh
chmod +x run-ab-platform.sh
./run-ab-platform.sh -b
echo "Setting credentials."
sed -i -e "s/BASIC_AUTH_USERNAME=airbyte/BASIC_AUTH_USERNAME=${AIRBYTE_ADMIN_USER}/g" .env
sed -i -e "s/BASIC_AUTH_PASSWORD=password/BASIC_AUTH_PASSWORD=${AIRBYTE_ADMIN_PASSWORD}/g" .env
echo "Restarting Airbyte."
docker compose down; docker compose up -d
echo "You can now access at http://localhost:8000"
echo "Finished."
```
- To check that Airbyte is running fine, run this command from a terminal within the Airbyte VM: `curl localhost:8000`. You should see some HTML for Airbyte's access denied page.
### 060.3 Making Airbyte Web UI reachable
- To provide access to the Airbyte UI, we will have to integrate it with the web gateway and our networking configurations.
- First, we need to allow the web gateway to reach Airbyte's locally served web server.
- Use the Azure portal to navigate to the NSG `superhog-data-nsg-services-<your-env>` page.
- Add a new Inbound rule with the following details:
- Name: `Allow8000TCPWithinSubnet`
- Source: the address range for the `services-subnet`. In this example, `10.69.0.64/26`.
- Source port ranges: *
- Destination: the address range for the `services-subnet`. In this example, `10.69.0.64/26`.
- Destination port ranges: 8000
- Protocol: TCP
- Action: Allow
- Priority: pick a number higher than the existing Allow rules but lower than the `DenyAllInbound` rule (lower numbers are evaluated first), e.g. 130.
- Next, we need to set a DNS entry to generate the URL that will be used to navigate to the Airbyte UI.
- Use the Azure portal to navigate to the Private DNS Zone `<your-env>.data.superhog.com` page.
- Create a new record with the following details:
- Name: `airbyte`
- Type: `A`
- IP Address: Look for the private IP address that was assigned to the VM `web-gateway-<your-env>` and place it here.
- Finally, we must create an entry in caddy's config file.
- SSH into the web gateway VM.
- Make a script with these commands and run it:
```bash
YOUR_ENV=<your-env>
PRIVATE_DNS_ZONE_NAME=${YOUR_ENV}.data.superhog.com
AIRBYTE_SUBDOMAIN=airbyte # If you followed this guide for the DNS bit, leave this value. If you chose a different subdomain, adjust accordingly.
# Hostname only; the http:// scheme is added where needed below.
FULL_AIRBYTE_URL=${AIRBYTE_SUBDOMAIN}.${PRIVATE_DNS_ZONE_NAME}
echo "Writing Caddyfile."
cat > /etc/caddy/Caddyfile << EOL

# Airbyte web UI
http://${FULL_AIRBYTE_URL} {
    reverse_proxy http://airbyte-${YOUR_ENV}.${PRIVATE_DNS_ZONE_NAME}:8000
}
EOL

echo "Restarting caddy."
systemctl restart caddy

echo "You can now access at http://${FULL_AIRBYTE_URL}"
```
- If everything is working properly, you should now be able to reach airbyte at the printed URL.
- If something doesn't work, I would advise troubleshooting along the chain of machines (your device to the VPN box, then to the web gateway, then to the Airbyte machine) to find where the connection is breaking down.
2024-02-09 15:59:34 +01:00
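A minimal helper to walk that chain, assuming plain TCP reachability checks are enough (the hostnames in the example follow this guide's DNS conventions):

```shell
# Check whether a TCP port is reachable, using bash's /dev/tcp.
check_tcp() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "OK   ${host}:${port}"
  else
    echo "FAIL ${host}:${port}"
  fi
}

# Example usage from your device (replace <your-env>):
# check_tcp web-gateway-<your-env>.<your-env>.data.superhog.com 80
# check_tcp airbyte-<your-env>.<your-env>.data.superhog.com 8000
```

Run it from each hop in turn; the first `FAIL` tells you which link in the chain is broken.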
## 070. Power BI
### 070.1 Deploying Power BI VM
- We need to deploy a Windows VM.
- Create the VM
- Basic settings
- Name it: `pbi-gateway-<your-env>`
- Use Windows Server 2022 Datacenter: Azure Edition
- Use Size: `Standard_DS1_v2`
- Use username: `datagatewayadmin`
- Set a good password and note it down
- Select the option `None` for Public inbound ports.
- Disk settings
- Defaults are fine. This barely needs any disk.
- Networking
- Attach to the virtual network `superhog-data-vnet-<your-env>`
- Attach to the subnet `services-subnet`
- For field `Public IP` set value `None`
- For setting `NIC network security group` select option `None`
- Management settings
- Defaults are fine.
- Monitoring
- Defaults are fine.
- Advanced
- Defaults are fine.
- Add tags:
- `team: data`
- `environment: <your-env>`
- `project: pbi`
- Try to connect with RDP at `pbi-gateway-<your-env>.<your-env>.data.superhog.com`.
### 070.2 Installing Power BI Data Gateway
- Log in to the VM.
- Follow the instructions here to download the installer in the VM and set it up: <https://learn.microsoft.com/en-us/data-integration/gateway/service-gateway-install>
- You will need to provide an account and credentials. It would be ideal to use a service account, and not personal accounts, to make the gateway independent of any single user.
- Once you login:
- Name the gateway `data-gateway-<your-env>`
- Set up a recovery key and store it safely
- Next, download this file on the VM and install it: <https://github.com/npgsql/npgsql/releases/download/v4.0.10/Npgsql-4.0.10.msi>
- ATTENTION! During the installation process, you can choose whether to activate the `Npgsql GAC Installation`. This option is deactivated by default, but you must turn it on: click it and select the `Will be installed on local hard drive` option.
- Finally, a note: if you want to use this gateway to connect to our PostgreSQL DWH (which you most probably want), you will need to disable forced TLS/SSL in the config of the PostgreSQL instance. This is because PBI is unable to use an SSL connection.
- To do this, go to the PostgreSQL instance page on the Azure Portal.
- Click on the `Server parameters` section.
- Turn the `require_secure_transport` parameter to `Off`.
- Once you are done, you should be able to visit the PBI Service (the online UI), visit the gateways page in settings and see the gateway listed in the `On-premises data gateways` section.
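The `require_secure_transport` change can also be scripted. A sketch, assuming the DWH is an Azure Database for PostgreSQL Flexible Server (the server name is a placeholder):

```shell
YOUR_ENV=dev                            # assumption: adjust to your environment
RG=superhog-data-rg-${YOUR_ENV}
SERVER="REPLACE-WITH-YOUR-DWH-SERVER"   # hypothetical: your PostgreSQL server name

# Requires the Azure CLI and an authenticated session; skipped otherwise.
if command -v az >/dev/null; then
  az postgres flexible-server parameter set \
    --resource-group "$RG" \
    --server-name "$SERVER" \
    --name require_secure_transport \
    --value off
fi
```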
## 080. dbt
- Our dbt project (<https://guardhog.visualstudio.com/Data/_git/data-dwh-dbt-project>) can be deployed on any linux VM within the virtual network. The instructions on how to deploy and schedule it are in the project repository.
- You can opt to deploy it on the same machine where Airbyte runs, since that machine is probably fairly underutilized.
### 080.1 dbt CI server
Having CI pipelines in the dbt git project is a great way to automate quality checks around the DWH code. The way our CI strategy is designed, you need to prepare a VM within our Data private network for CI jobs to run on. This section explains how to set up that VM. Note that we will only cover infrastructure topics here: check the dbt repository for the full story on how to set up the CI. We recommend covering the steps described here before jumping into the dbt-specific part of things.
#### 080.1.1 Deploying the CI VM
- We will have a dedicated VM for the CI pipelines. The pipelines can be resource hungry at times, so I recommend a VM that is not shared with other workloads: that way you can size it adequately and avoid resource competition with other services.
- Create a new VM following these steps.
- Basic settings
- Name it: `pipeline-host-<your-env>`
- Use Ubuntu Server 22.04
- Size should be adjusted to the needs of the dbt project. I suggest starting with a `B2s` instance and driving upgrade decisions based on what you observe during normal usage.
- Use username: `azureuser`
- Use the SSH Key: `superhog-data-general-ssh-<your-env>`
- Select the option `None` for Public inbound ports.
- Disk settings
- Disk requirements will vary depending on the nature of the dbt project state and the PRs. I suggest starting with the default 30 GB and monitoring usage. If you see spikes that get close to 100%, increase the size to prevent a particularly heavy PR from consuming all space.
- Networking
- Attach to the virtual network `superhog-data-vnet-<your-env>`
- Attach to the subnet `services-subnet`
- Assign no public IP.
- For setting `NIC network security group` select option `None`
- Management settings
- Defaults are fine.
- Monitoring
- Defaults are fine.
- Advanced
- Defaults are fine.
- Add tags:
- `team: data`
- `environment: <your-env>`
- `project: dbt`
- Once the VM is running, you should be able to SSH into the machine when your VPN is active.
#### 080.1.2 Install docker and docker compose
- We will use docker and docker compose to run a dockerized Postgres server in the VM.
- You can install docker and docker compose by placing the following code in a script and running it:
```bash
#!/bin/bash
set -e # Exit on error
echo "🔄 Updating system packages..."
sudo apt update && sudo apt upgrade -y
echo "📦 Installing dependencies..."
sudo apt install -y \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common \
    lsb-release \
    gnupg2 \
    jq
echo "🔑 Adding Docker GPG key..."
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "🖋️ Adding Docker repository..."
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
echo "📦 Installing Docker..."
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io
echo "✅ Docker installed successfully!"
echo "🔧 Enabling Docker to start on boot..."
sudo systemctl enable docker
echo "🔄 Installing Docker Compose..."
DOCKER_COMPOSE_VERSION=$(curl -s https://api.github.com/repos/docker/compose/releases/latest | jq -r .tag_name)
sudo curl -L "https://github.com/docker/compose/releases/download/${DOCKER_COMPOSE_VERSION}/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
echo "📂 Setting permissions for Docker Compose..."
sudo chmod +x /usr/local/bin/docker-compose
echo "✅ Docker Compose installed successfully!"
# Verify installation
echo "🔍 Verifying Docker and Docker Compose versions..."
docker --version
docker-compose --version

echo "👥 Adding ${USER} to the docker group..."
sudo usermod -a -G docker $USER

echo "✅ Docker and Docker Compose installation completed!"
# newgrp spawns a subshell and would stall a non-interactive script, so just say what to do:
echo "ℹ️  Log out and back in (or run 'newgrp docker') for the group membership to take effect."
```
#### 080.1.3 Install psql
- CI pipelines require `psql`, the Postgres CLI client, to be available.
- You can install it with the following script:
```bash
sudo apt-get update
sudo apt-get install -y gnupg2 wget nano
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
curl -fsSL https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/postgresql.gpg
sudo apt-get update
sudo apt-get install -y postgresql-client-16
```
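Once installed, a quick way to confirm the VM can actually reach the DWH. The host, user, and database below are placeholders — substitute your real values before running:

```shell
DWH_HOST="REPLACE-WITH-YOUR-DWH-HOST"   # hypothetical: your DWH server's hostname
DWH_USER="REPLACE-WITH-YOUR-CI-USER"    # hypothetical: the CI user created in the DWH

# Only attempt the connection once real values have been filled in.
if [ "$DWH_HOST" != "REPLACE-WITH-YOUR-DWH-HOST" ]; then
  psql "host=${DWH_HOST} port=5432 user=${DWH_USER} dbname=postgres" -c 'select 1;'
fi
```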
#### 080.1.4 Install Python
- Python is needed to create virtual environments and run dbt and other commands.
- You can use the following script to install python and some required packages:
```bash
sudo apt-get update
sudo apt-get install -y python3.10 python3.10-venv
```
#### 080.1.5 Create user in DWH
- The CI Postgres will use some Foreign Data Wrappers (FDW) pointing at the DWH. We will need a dedicated user in the DWH instance to control the permissions received by the CI server.
- The section of this guide dedicated to setting up the DWH explains how to create this user. If you have followed it, you might have already created the user. Otherwise, head there to complete this part.
#### 080.1.6 Install the Azure Devops agent and sync with Devops
- The VM needs the Microsoft-provided Azure Pipelines agent installed to be reachable by Devops. This agent listens for requests from Devops, basically allowing Devops to execute jobs on the VM.
- Some configuration needs to be done in the Azure Devops project to allow Azure Devops to reach the VM. This might include creating a pool if it doesn't exist.
- You can find how to set this up in Ubuntu in these links:
- Official MSFT docs: https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/linux-agent?view=azure-devops
- Helpful walkthrough video: https://www.youtube.com/watch?v=Hy6fne9oQJM
- Make sure to install the agent as a systemd service to have it always run on boot. The details are explained in Microsoft's documentation.
- Once the agent is installed and correctly linked to one of our Pools in Devops, you should see the agent listed in the Devops UI for that pool, with status online. Don't move on if you haven't succeeded on this point.
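As a rough sketch of that flow — the agent version and URL below are illustrative, so copy the exact download command from the pool's `New agent` dialog in Devops:

```shell
# Download and unpack the agent, then register it and install it as a systemd service.
mkdir -p "$HOME/azagent" && cd "$HOME/azagent"
# curl -LO https://vstsagentpackage.azureedge.net/agent/3.236.1/vsts-agent-linux-x64-3.236.1.tar.gz
# tar zxf vsts-agent-linux-x64-3.236.1.tar.gz
# ./config.sh            # interactive: organization URL, PAT, pool name, agent name
# sudo ./svc.sh install  # register as a systemd service
# sudo ./svc.sh start
```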
#### 080.1.7 Clone the project and further steps
- We are going to need a local clone of the git repository to perform some setup steps, as well as for business as usual execution.
- To do this:
- Add an SSH key to the VM so it can clone repos from Azure Devops. This could be the key `superhog-data-general-ssh-<your-env>` or some other key. This guide leaves this detail up to you. You can read more on how to use SSH keys with Azure Devops here: https://learn.microsoft.com/en-us/azure/devops/repos/git/use-ssh-keys-to-authenticate?view=azure-devops.
- Also add this config to `~/.ssh/config` to make SSH cloning work. Note that the details might have changed since this guide was written, so your mileage may vary.
```
Host ssh.dev.azure.com
Hostname ssh.dev.azure.com
IdentityFile ~/.ssh/<whatever-key-file-you-are-using>
```
- Once the CI VM is SSH capable, clone the dbt project into the `azureuser` home dir.
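With the key and SSH config in place, the clone itself looks like this. The SSH URL is derived from the repo's HTTPS URL — verify the exact URL in the Devops `Clone` dialog:

```shell
cd "$HOME"
# Only attempt the clone once the SSH config from the previous step exists.
if [ -f "$HOME/.ssh/config" ] && grep -q "ssh.dev.azure.com" "$HOME/.ssh/config"; then
  git clone git@ssh.dev.azure.com:v3/guardhog/Data/data-dwh-dbt-project
fi
```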
- There are several steps after this, for which you should find instructions in the dbt repository itself.
## 090. Monitoring
### 090.1 Infra monitoring
WIP: we are planning on using Azure Dashboards with metrics.
### 090.2 Service status
WIP: we need support to learn how to use statuspage.io
## 100. Backups
- If you are working on a dev or staging environment, you might want to skip this section.
### 100.1 DWH
- Backups are managed with Azure. In the Azure Portal page for the PostgreSQL service, visit section `Backup and restore`. Production servers should have 14 days as a retention period.
### 100.2 Jumphost
- Jumphosts barely hold any data at all. Although it's quite tempting to forget about this and simply raise another VM if something goes wrong, it would be annoying to have to regenerate the keys of both the VPN server and other clients.
- To solve this, make a habit of keeping regular copies of the Wireguard config file on another machine. In theory, making a copy every time it gets modified should be enough.
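A sketch of such a copy, assuming the jumphost's hostname and the standard Wireguard config path — both are assumptions, so adjust to your setup:

```shell
YOUR_ENV=dev                                                    # assumption: adjust
JUMPHOST="jumphost-${YOUR_ENV}.${YOUR_ENV}.data.superhog.com"   # hypothetical hostname
BACKUP_DIR="$HOME/backups/wireguard"
mkdir -p "$BACKUP_DIR"

# Skip when the host isn't resolvable (e.g. the VPN is down).
# Note: you may need sudo on the remote side to read /etc/wireguard.
if getent hosts "$JUMPHOST" >/dev/null; then
  scp "azureuser@${JUMPHOST}:/etc/wireguard/wg0.conf" "${BACKUP_DIR}/wg0.conf.$(date +%Y%m%d)"
fi
```

Running this from a cron job on another machine would make the habit automatic.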
### 100.3 Airbyte
- Our strategy for backing up Airbyte is to backup the entire VM.
- WIP
### 100.4 PBI Gateway
- The PBI Gateway is pretty much stateless. Given this, if there are any issues or disasters on the current VM, simply create another one and set up the gateway again.