diff --git a/human-script.md b/human-script.md index dee7d18..7539da5 100644 --- a/human-script.md +++ b/human-script.md @@ -2,16 +2,15 @@ Follow this to deploy the entire data infra. -## 0. Pre-requisites and conventions +## 000. Pre-requisites and conventions - You need an Azure subscription and a user with administrator rights in it. - Whenever you see ``, you should replace that with `dev`,`uat`, `prd` or whatever fits your environment. - We traditionally deploy resources on the `UK South` region. Unless stated otherwise, you should deploy resources there. -- You have an SSH key pair ready to use for access to the different machines. You can always add more pairs later. -## 1. Resource group and SSH Keypair +## 010. Resource group and SSH Keypair -### 1.1 Create Resource Group +### 010.1 Create Resource Group - Create a resource group. This resource group will hold all the resources. For the rest of this guide, assume this is the resource group where you must create resources. - Name it: `superhog-data-rg-` @@ -19,7 +18,7 @@ Follow this to deploy the entire data infra. - `team: data` - `environment: ` -### 1.2 SSH Keypair +### 010.2 SSH Keypair - We will create an SSH Keypair for this deployment. It will be used to access VMs, Git repos and other services. - Create the SSH Key pair @@ -30,9 +29,9 @@ Follow this to deploy the entire data infra. - Pay attention when storing the private key. You probably want to store it in a safe password manager, like Keeper. - Optionally, you can also be extra paranoid, generate the SSH key locally and only upload the public key to Azure. Up to you. -## 2. Networking +## 020. Networking -### 2.1 VNET +### 020.1 VNET - Create a virtual network. This virtual network is where all our infra will live. For the rest of this guide, assume this is the network where you must connect services. - Name it: `superhog-data-vnet-` @@ -60,7 +59,7 @@ Follow this to deploy the entire data infra. 
- `environment: ` - `project: network` -### 2.2 Network security groups +### 020.2 Network security groups - You will create three network security groups (NSG) - Jumphost NSG @@ -115,7 +114,7 @@ Follow this to deploy the entire data infra. - Protocol: TCP - Action: Allow - Priority: 110 - - Airbyte web rule + - Web server Rule - Name: AllowWebFromJumphostInbound - Source: the addresss range for the `jumphost-subnet`. In this example, `10.69.0.0/29`. - Source port ranges: * @@ -172,7 +171,7 @@ Follow this to deploy the entire data infra. - Visit the virtual network page and look for the subnets list - For each subnet, select its NSG and attach it -### 2.3 Private DNS Zone +### 020.3 Private DNS Zone - We will set up a private DNS Zone to avoid using hardcoded IPs to refer to services within the virtual network. This makes integrations more resilient because a service can change its IP and still be reached by other services (as long as other network configs like firewalls are still fine). - Create the Private DNS Zone @@ -186,7 +185,7 @@ Follow this to deploy the entire data infra. - Associate it to the virtual network. - Enable autoregistration -### 2.4 Public IP +### 020.4 Public IP - We will need a public IP for the jumphost. - Create the public IP @@ -197,9 +196,9 @@ Follow this to deploy the entire data infra. - `environment: ` - `project: network` -## 3. Jumphost +## 030. Jumphost -### 3.1 Deploy Jumphost VM +### 030.1 Deploy Jumphost VM - The first VM we must deploy is a jumphost, since that will be our door to all other services inside the virtual network. - Create the VM @@ -228,7 +227,7 @@ Follow this to deploy the entire data infra. - `environment: ` - `project: network` -### 3.2 Configure a VPN Server +### 030.2 Configure a VPN Server - The jumphost we just created is not accessible via SSH from WAN due to the NSG set in the jumphost subnet. - To make it so, you should temporarily create a new rule like this in the NSG `superhog-data-nsg-jumphost-`. 
@@ -322,7 +321,7 @@ Follow this to deploy the entire data infra. - Look for the jumphost VM Network Interface. - In the `IP configurations` session, activate the flag `Enable IP forwarding`. -### 3.3 Configure a DNS Server +### 030.3 Configure a DNS Server - The jumphost is now ready. When the VPN is active on our local device, we can access the services within the virtual network. - There is one issue, though: we would like to access services through names, not IPs. @@ -379,14 +378,14 @@ Follow this to deploy the entire data infra. - In your client Wireguard configuration, uncomment the DNS server line we left before - Check that the service is running fine by running `dig google.com`. You should see in the output that your laptop has relied on our new DNS to do the name resolution. -### 3.4 Harden the Jumphost VM +### 030.4 Harden the Jumphost VM - In the Jumphost, run the following command to disable password based SSH authentication fully. This way, access can only be granted with SSH key pairs, which is way more secure: `sudo sed -i -e 's/#PasswordAuthentication yes/PasswordAuthentication no/g' /etc/ssh/sshd_config; sudo systemctl restart ssh`. - Remove the AllowSSHInboundTemporarily rule that you added to the NSG `superhog-data-nsg-jumphost-`. We don't need that anymore since we can SSH through the VPN tunnel. -## 4. DWH +## 040. DWH -### 4.1 Deploy PostgreSQL Server +### 040.1 Deploy PostgreSQL Server - Next, we will deploy a Postgres server to act as the DWH. - Create a new Azure Database for PostgreSQL flexible servers. @@ -410,7 +409,7 @@ Follow this to deploy the entire data infra. - Validate the deployment by trying to log into the database with the `dwh_admin_` user from your favourite SQL client (you can use DBeaver, for example). Be aware that your VPN connection should be active so that the DWH is reachable from your device. 
-### 4.2 Create database +### 040.2 Create database - Run the following commands to create a new database @@ -420,7 +419,7 @@ Follow this to deploy the entire data infra. - From now on, use this database for everything -### 4.3 Create schemas, roles and users +### 040.3 Create schemas, roles and users - Run the following script to create: - A `dwh_builder` role, which: @@ -515,9 +514,66 @@ Follow this to deploy the entire data infra. ALTER DEFAULT PRIVILEGES IN SCHEMA sync_ GRANT SELECT ON TABLES TO modeler; ``` -## 5. Airbyte +## 050. Web Gateway -### 5.1 Deploying Airbyte VM +We will deploy a dedicated VM to act as a web server for internal services. + +### 050.1 Deploy Web Gateway VM + +- Create a new VM following these steps. + - Basic settings + - Name it: `web-gateway-` + - Use Ubuntu Server 22.04 + - Use size: `Standard_B1s` + - Use username: `azureuser` + - Use the SSH Key: `superhog-data-general-ssh-` + - Select the option `None` for Public inbound ports. + - Disk settings + - Defaults are fine. This barely needs any disk. + - Networking + - Attach to the virtual network `superhog-data-vnet-` + - Attach to the subnet `services-subnet` + - Assign no public IP. + - For setting `NIC network security group` select option `None` + - Management settings + - Defaults are fine. + - Monitoring + - Defaults are fine. + - Advanced + - Defaults are fine. + - Add tags: + - `team: data` + - `environment: ` + - `project: network` +- Once the VM is running, you should be able to ssh into the machine when your VPN is active. + +### 050.2 Deploying Caddy + +- We need to install caddy in the VM. 
You can do so with the following commands: + + ```bash + sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https curl + curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg + curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list + sudo apt update + sudo apt install caddy + ``` + +- After the previous commands, you can verify that caddy is running properly as a systemd service with: `systemctl status caddy` +- You can also verify that Caddy is reachable (should be) by running the following command from your laptop while connected to the VPN: `curl web-gateway-..data.superhog.com`. If you see a wall of HTML that looks like Caddy's demo page, it means Caddy is working as expected. + +### 050.3 Pointing Caddy to internal services + +- Caddy will need to be configured to act as the web server or reverse proxy of the different services within the services subnet. The details of these configurations are defined in sections below. +- As a general note, the pattern will generally be: + - Create the right A record in the Private DNS records so that you point users with some subdomain towards the web gateway. + - You will need to include the right entry in the `Caddyfile` at `/etc/caddy/Caddyfile`. + - You will need to reload caddy with `sudo systemctl reload caddy.service`. + - If the web server needs to reach a specific port in some other VM, you will need to sort networking security out. If the VM you need to reach from the web server is within the internal services subnet, you'll have to add the necessary Inbound rules in the NSG `superhog-data-nsg-services-`. + +## 060. Airbyte + +### 060.1 Deploying Airbyte VM - Airbyte lives on its own VM. To do so, create a new VM following these steps. - Basic settings @@ -546,7 +602,7 @@ Follow this to deploy the entire data infra. 
- `project: airbyte` - Once the VM is running, you should be able to ssh into the machine when your VPN is active. -### 5.2 Deploying Airbyte +### 060.2 Deploying Airbyte - SSH into the VM. - Run the following script to install docker and deploy Airbyte @@ -556,8 +612,6 @@ Follow this to deploy the entire data infra. AIRBYTE_ADMIN_USER=your-user-here AIRBYTE_ADMIN_PASSWORD=your-password-here - YOUR_ENV= - PRIVATE_DNS_ZONE_NAME=${YOUR_ENV}.data.superhog.com echo "Installing docker." apt-get update -y @@ -585,38 +639,67 @@ Follow this to deploy the entire data infra. echo "Restarting Airbyte." docker compose down; docker compose up -d - echo "Deploying Caddy Webserver" - apt install -y debian-keyring debian-archive-keyring apt-transport-https curl - curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg - curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list - apt update - apt install caddy - - echo "Write caddyfile" - - touch /etc/caddy/Caddyfile - cat > /etc/caddy/Caddyfile << EOL - - # Airbyte web UI - http://airbyte-${YOUR_ENV}.${PRIVATE_DNS_ZONE_NAME} { - reverse_proxy localhost:8000 - } - - EOL - - echo "Restart caddy" - systemctl restart caddy - - echo "You can now access at http://airbyte-${YOUR_ENV}.${PRIVATE_DNS_ZONE_NAME}" + echo "You can now access at http://localhost:8000" echo "Finished." ``` -- Visit ..data.superhog.com. If you are prompted for user and password, it means Airbyte is running properly and is reachable. +- To check that Airbyte is running fine, run this command from a terminal within the Airbyte VM: `curl localhost:8000`. You should see some HTML for Airbyte's access denied page. -## 6. 
Power BI +### 060.3 Making Airbyte Web UI reachable -### 6.1 Deploying Power BI VM +- To provide access to the Airbyte UI, we will have to integrate it with the web gateway and our networking configurations. +- First, we need to allow the web gateway to reach Airbyte's locally served web server. + - Use the Azure portal to navigate to the NSG `superhog-data-nsg-services-` page. + - Add a new Inbound rule with the following details: + - Name: `Allow8000TCPWithinSubnet` + - Source: the address range for the `services-subnet`. In this example, `10.69.0.64/26`. + - Source port ranges: * + - Destination: the address range for the `services-subnet`. In this example, `10.69.0.64/26`. + - Destination port ranges: 8000 + - Protocol: TCP + - Action: Allow + - Priority: Pick a number higher than the existing rules, but lower than the `DenyAllInbound` rules. Azure evaluates lower numbers first, so this rule must be checked before the deny rules. +- Next, we need to set a DNS entry to generate the URL that will be used to navigate to the Airbyte UI. + - Use the Azure portal to navigate to the Private DNS Zone `.data.superhog.com` page. + - Create a new record with the following details: + - Name: `airbyte` + - Type: `A` + - IP Address: Look for the private IP address that was assigned to the VM `web-gateway-` and place it here. +- Finally, we must create an entry in caddy's config file. + - SSH into the web gateway VM. + - Make a script with these commands and run it: + + ```bash + + YOUR_ENV= + PRIVATE_DNS_ZONE_NAME=${YOUR_ENV}.data.superhog.com + AIRBYTE_SUBDOMAIN=airbyte # If you followed this guide for the DNS bit, leave this value. 
If you chose a different subdomain, adjust accordingly + FULL_AIRBYTE_URL=http://${AIRBYTE_SUBDOMAIN}.${PRIVATE_DNS_ZONE_NAME} + echo "Write caddyfile" + + touch /etc/caddy/Caddyfile + cat > /etc/caddy/Caddyfile << EOL + + # Airbyte web UI + ${FULL_AIRBYTE_URL} { + reverse_proxy http://airbyte-${YOUR_ENV}.${PRIVATE_DNS_ZONE_NAME}:8000 + } + + EOL + + echo "Restart caddy" + systemctl restart caddy + + echo "You can now access at ${FULL_AIRBYTE_URL}" + ``` + +- If everything is working properly, you should now be able to reach airbyte at the printed URL. +- If something doesn't work, I would advise troubleshooting through the chain of machines (your device to the VPN box, then to the web gateway, then to the airbyte machine) to find where the connection is breaking down. + +## 070. Power BI + +### 070.1 Deploying Power BI VM - We need to deploy a Windows VM. - Create the VM @@ -646,7 +729,7 @@ Follow this to deploy the entire data infra. - `project: pbi` - Try to connect with RDP at `pbi-gateway-..data.superhog.com`. -### 6.2 Installing Power BI Data Gateway +### 070.2 Installing Power BI Data Gateway - Login the VM. - Follow the instructions here to download the installer in the VM and set it up: @@ -662,39 +745,40 @@ Follow this to deploy the entire data infra. - Turn the `require_secure_transport` parameter to `Off`. - Once you are done, you should be able to visit the PBI Service (the online UI), visit the gateways page in settings and see the gateway listed in the `On-premises data gateways` section. -## 7. dbt +## 080. dbt - Our dbt project () can be deployed on any linux VM within the virtual network. The instructions on how to deploy and schedule it are in the project repository. - You can opt to deploy it in the same machine where airbyte is stored, since that machine is probably fairly underutilized. -## 8. Monitoring +## 090. 
Monitoring -### 8.1 Infra monitoring +### 090.1 Infra monitoring WIP: we are planning on using Azure Dashboards with metrics. -### 8.2 Service status +### 090.2 Service status WIP: we need support to learn how to use statuspage.io -## 9. Backups + +## 100. Backups - If you are working on a dev or staging environment, you might want to skip this section. -### 9.1 DWH +### 100.1 DWH - Backups are managed with Azure. In the Azure Portal page for the PostgreSQL service, visit section `Backup and restore`. Production servers should have 14 days as a retention period. -### 9.2 Jumphost +### 100.2 Jumphost - Jumphosts barely hold any data at all. Although it's quite tempting to forget about this and simply raise another VM if something goes wrong, it would be annoying to have to regenerate the keys of both the VPN server and other clients. - To solve this, make a habit of making regular copies of the Wireguard config file in another machine. Theoretically, only making a copy everytime it gets modified should be enough. -### 9.3 Airbyte +### 100.3 Airbyte - Our strategy for backing up Airbyte is to backup the entire VM. - WIP -### 9.4 PBI Gateway +### 100.4 PBI Gateway - The PBI Gateway is pretty much stateless. Given this, if there are any issues or disasters on the current VM, simply create another one and set up the gateway again.