- Create a resource group. This resource group will hold all the resources for this setup. For the rest of this guide, assume this is the resource group where you must create resources.
- Create a virtual network. This virtual network is where all our infrastructure will live. For the rest of this guide, assume this is the network where you must connect services.
- Name it: `superhog-data-vnet-<your-env>`
- You need to decide what the network range should look like, i.e. the address space that the entire vnet will be contained within. For reference, a `/24` space (256 addresses) should be plenty, since only a handful of network interfaces will be connecting.
- As an example, we will use `10.69.0.0/24`. This link might be helpful: <https://www.davidc.net/sites/default/subnets/subnets.html?network=10.69.0.0&mask=24&division=11.f10>
- You need to add three subnets:
- Don't add network security groups to any of the subnets yet. We will create those later.
- Jumphost subnet
- This subnet is where jumphost boxes will live.
- It will be the only subnet where we allow inbound connections from WAN.
- Name it `jumphost-subnet`.
- For our example, we will make it `10.69.0.0/29` (8 addresses).
- Database subnet
- This subnet is where the DWH database will live.
- Inbound traffic will be allowed from both the jumphost subnet as well as the services subnet.
- Name it `database-subnet`.
- For our example, we will make it `10.69.0.8/29` (8 addresses).
- Services subnet
- This subnet is where most VMs dedicated to data services live (Airbyte, dbt, PBI Data Gateway, etc).
- Inbound traffic will only be allowed from the jumphost subnet.
- Name it `services-subnet`.
- For our example, we will make it `10.69.0.64/26` (64 addresses).
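As a sanity check on those sizes, the number of addresses in a subnet is 2^(32 - prefix length); a quick shell check:

```shell
# Addresses available per prefix length: 2^(32 - prefix)
for p in 29 26 24; do
  echo "/$p -> $((2 ** (32 - p))) addresses"
done
# Prints:
# /29 -> 8 addresses
# /26 -> 64 addresses
# /24 -> 256 addresses
```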
- We will set up a private DNS Zone to avoid using hardcoded IPs to refer to services within the virtual network. This makes integrations more resilient because a service can change its IP and still be reached by other services (as long as other network configs like firewalls are still fine).
- Create the Private DNS Zone
- Name it: `<your-env>.data.superhog.com`
- Add tags:
- `team: data`
- `environment: <your-env>`
- `project: network`
- Add a new virtual network link to the zone
- Name it: `privatelink-<your-env>.data.superhog.com`
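If you'd rather script this step, here is a sketch using the Azure CLI (the resource group placeholder and the zone/link/vnet names are the ones assumed in this guide):

```shell
# Hypothetical sketch: create the zone, then link it to the vnet
az network private-dns zone create \
  --resource-group <your-resource-group> \
  --name <your-env>.data.superhog.com
az network private-dns link vnet create \
  --resource-group <your-resource-group> \
  --zone-name <your-env>.data.superhog.com \
  --name privatelink-<your-env>.data.superhog.com \
  --virtual-network superhog-data-vnet-<your-env> \
  --registration-enabled false  # set to true if you want VM records auto-registered
```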
- *Note: the IPs chosen for the VPN can absolutely be changed. Just make sure they are consistent across the server and client configurations of the VPN.*
- You should copy the client config that the script will produce and set up the Wireguard config on your local machine.
- Once you've done so, start Wireguard on the client and try to ping the server from the client with the Wireguard VPN IP. If it reaches, the VPN is working fine.
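For reference, a Wireguard client config typically looks like the sketch below. All keys, IPs and the port here are placeholders; the config produced by the script is the source of truth. Note the commented `DNS` line, which will be uncommented once the DNS server is set up later in this guide:

```ini
[Interface]
PrivateKey = <client-private-key>
Address = 10.100.0.2/32
# DNS = <jumphost-vpn-ip>   # uncomment once the DNS server is set up

[Peer]
PublicKey = <server-public-key>
Endpoint = <jumphost-public-ip>:51820
AllowedIPs = 10.69.0.0/24, 10.100.0.0/24
```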
- Now, validate your setup by SSHing from your local device into the jumphost by referencing the VPN IP of the jumphost instead of the public IP.
- Once you verify everything works, you should go to the NSG of the jumphost and remove rule AllowSSHInboundTemporarily. From this point on, the only entrypoint from WAN to the virtual network is the VPN port in the jumphost machine.
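If you prefer the CLI over the Portal for this, a hypothetical sketch using the Azure CLI (the resource group placeholder is an assumption):

```shell
# Hypothetical: remove the temporary SSH rule via the Azure CLI
az network nsg rule delete \
  --resource-group <your-resource-group> \
  --nsg-name superhog-data-nsg-jumphost-<your-env> \
  --name AllowSSHInboundTemporarily
```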
- Next, we must allow IP forwarding on Azure.
- Look for the jumphost VM Network Interface.
- In the `IP configurations` section, activate the `Enable IP forwarding` flag.
- The jumphost is now ready. When the VPN is active on our local device, we can access the services within the virtual network.
- There is one issue, though: we would like to access services through names, not IPs.
- Our Private DNS Zone takes care of providing names to services within the virtual network. But this resolution only happens within the virtual network itself, so our external device can't rely on it.
- To solve this, we need to force DNS resolution of our laptops to happen from within the virtual network itself.
- To do so, we will set up a DNS server in the jumphost, and set up our VPN configuration to use it when the VPN connection in our device is active.
- Connect to the jumphost through SSH
- Run the following script as `sudo` from the home folder of `azureuser`:
```shell
# Free up port 53 by disabling systemd-resolved's stub listener
sed -i -e 's/#DNSStubListener=yes/DNSStubListener=no/g' /etc/systemd/resolved.conf
systemctl restart systemd-resolved
echo "Writing config file."
rm /etc/coredns/Corefile
cat > /etc/coredns/Corefile <<EOL
. {
    log
    hosts {
        # If you want to make custom mappings, place them here.
        # Format is:
        # xxx.xxx.xxx.xxx your.domain.name
        # By default, we delegate to Azure
        fallthrough
    }
    forward . 168.63.129.16 # This IP is Azure's DNS service
    errors
}
EOL
echo "Restarting coredns to pick up the new config."
systemctl restart coredns.service
```
- In your client Wireguard configuration, uncomment the DNS server line we left before
- Check that the service is working by running `dig google.com`. The output should show that your laptop relied on our new DNS server for the name resolution.
- In the jumphost, run the following command to fully disable password-based SSH authentication. This way, access can only be granted with SSH key pairs, which is far more secure: `sudo sed -i -e 's/#PasswordAuthentication yes/PasswordAuthentication no/g' /etc/ssh/sshd_config; sudo systemctl restart ssh`.
- Remove the AllowSSHInboundTemporarily rule that you added to the NSG `superhog-data-nsg-jumphost-<your-env>`. We don't need that anymore since we can SSH through the VPN tunnel.
- Next, we will deploy a Postgres server to act as the DWH.
- Create a new Azure Database for PostgreSQL flexible servers.
- Basics
- Name it: `superhog-dwh-<your-env>`.
- In the `PostgreSQL version` field, pick version 16.
- Adapt the sizing to your needs. Only you know how much load this server is going to take.
- In the `Authentication method` field, pick `PostgreSQL authentication only`.
- Name the user admin: `dwh_admin_<your-env>`.
- Give it a password and make sure to note it down.
- Networking
- In the `Connectivity method` field, select `Private access (VNet Integration)`.
- Pick the virtual network `superhog-data-vnet-<your-env>` and the subnet `database-subnet`.
- Create a new private DNS zone. Unfortunately, we can't use `<your-env>.data.superhog.com` for this service.
- Security
- Defaults are fine
- Add tags:
- `team: data`
- `environment: <your-env>`
- `project: dwh`
- Validate the deployment by trying to log into the database with the `dwh_admin_<your-env>` user from your favourite SQL client (you can use DBeaver, for example). Be aware that your VPN connection should be active so that the DWH is reachable from your device.
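As a sketch, the same check can be done from the command line. The hostname below is an assumption based on Azure's default naming for flexible servers; use whatever hostname your deployment actually exposes:

```shell
# Hypothetical connectivity check; prompts for the dwh_admin password
psql "host=superhog-dwh-<your-env>.postgres.database.azure.com \
      port=5432 dbname=postgres user=dwh_admin_<your-env> sslmode=require" \
     -c 'SELECT version();'
```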
- Typically, you will want to create personal accounts for data team members with the `modeler` role so that they can query anywhere in the DWH.
- Any other services or users that need to access the reporting layer can be given the `consumer` role.
- Furthermore, `sync_` schema permissions need to be dynamically managed from this point on. This means that:
- Generally, all `sync_` schemas should be created by the `airbyte_user`.
- Whenever a new `sync_` schema comes to life, both the `modeler` and `dwh_builder` roles should receive access. You can use the following command template:
```sql
-- Login as airbyte_user
GRANT USAGE ON SCHEMA sync_<some-new-source> TO dwh_builder;
GRANT SELECT ON ALL TABLES IN SCHEMA sync_<some-new-source> TO dwh_builder;
ALTER DEFAULT PRIVILEGES IN SCHEMA sync_<some-new-source> GRANT SELECT ON TABLES TO dwh_builder;
GRANT USAGE ON SCHEMA sync_<some-new-source> TO modeler;
GRANT SELECT ON ALL TABLES IN SCHEMA sync_<some-new-source> TO modeler;
ALTER DEFAULT PRIVILEGES IN SCHEMA sync_<some-new-source> GRANT SELECT ON TABLES TO modeler;
```
- This script also doesn't specify exactly which tables the `billingdb_reader` should read from, since providing full access to the entire reporting schema would be excessive. You can specify which tables should be readable by the user like this:
```sql
-- Login as dbt_user
GRANT USAGE ON SCHEMA reporting TO billingdb_reader;
GRANT SELECT ON TABLE reporting.<some_table> TO billingdb_reader;
```
- Install Caddy on the VM by running the following commands:
```shell
# Add Caddy's signing key first so apt can verify the repository
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update
sudo apt install caddy
```
- After the previous commands, you can verify that Caddy is running properly as a systemd service with: `systemctl status caddy`
- You can also verify that Caddy is reachable by running the following command from your laptop while connected to the VPN: `curl web-gateway-<your-env>.<your-env>.data.superhog.com`. If you see a wall of HTML that looks like Caddy's default welcome page, Caddy is working as expected.
- Caddy will need to be configured to act as the web server or reverse proxy of the different services within the services subnet. The details of these configurations are defined in sections below.
- As a general note, the pattern will generally be:
- You will need to include the right entry in the `Caddyfile` at `/etc/caddy/Caddyfile`.
- You will need to reload caddy with `sudo systemctl reload caddy.service`.
- If the web server needs to reach a specific port on some other VM, you will need to sort out network security. If the VM you need to reach from the web server is within the internal services subnet, you'll have to add the necessary inbound rules to the NSG `superhog-data-nsg-services-<your-env>`.
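As an illustration of that pattern, a hypothetical entry exposing some internal service (the service name, target IP and port are examples, not values defined elsewhere in this guide):

```shell
# Hypothetical: append a reverse-proxy entry to the Caddyfile, then reload
sudo tee -a /etc/caddy/Caddyfile <<'EOF'
myservice.<your-env>.data.superhog.com {
    reverse_proxy <service-vm-private-ip>:8080
}
EOF
sudo systemctl reload caddy.service
```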
- Select the option `None` for Public inbound ports.
- Disk settings
- Increasing the data disk to at least 64 GB as a starting point is recommended. Airbyte can be a bit of a disk hog, and running low on space can lead to obscure errors. Start with 64 GB and monitor as usage grows.
- Networking
- Attach to the virtual network `superhog-data-vnet-<your-env>`
- Attach to the subnet `services-subnet`
- Assign no public IP.
- For setting `NIC network security group` select option `None`
- Management settings
- Defaults are fine.
- Monitoring
- Defaults are fine.
- Advanced
- Defaults are fine.
- Add tags:
- `team: data`
- `environment: <your-env>`
- `project: airbyte`
- Once the VM is running, you should be able to SSH into the machine when your VPN is active.
- To check that Airbyte is running fine, run this command from a terminal within the Airbyte VM: `curl localhost:8000`. You should see some HTML for Airbyte's access denied page.
- If something doesn't work, I would advise troubleshooting through the chain of machines (your device to the VPN box, then to the web gateway, then to the Airbyte machine) to find where the connection breaks down.
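For example, a hop-by-hop check might look like this (the IPs and names are placeholders based on this guide's examples):

```shell
# 1. Laptop -> jumphost over the VPN
ping <jumphost-vpn-ip>
# 2. Laptop -> web gateway by name (exercises both DNS and Caddy)
curl web-gateway-<your-env>.<your-env>.data.superhog.com
# 3. From the web gateway, check it can reach the Airbyte VM directly
curl <airbyte-vm-private-ip>:8000
# 4. On the Airbyte VM itself
curl localhost:8000
```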
- Follow the instructions here to download the installer in the VM and set it up: <https://learn.microsoft.com/en-us/data-integration/gateway/service-gateway-install>
- You will need to provide an account and credentials. It would be ideal to use a service account, and not personal accounts, to make the gateway independent of any single user.
- Once you log in:
- Name the gateway `data-gateway-<your-env>`
- Set up a recovery key and store it safely
- Next, download this file on the VM and install it: <https://github.com/npgsql/npgsql/releases/download/v4.0.10/Npgsql-4.0.10.msi>
- ATTENTION! During the installation process, you will be asked whether to activate the `Npgsql GAC Installation`. This option is deactivated by default. You must turn it on: click on it and select the `Will be installed on local hard drive` option.
- Finally, a note: if you want to use this gateway to connect to our PostgreSQL DWH (which you most probably do), you will need to disable forced TLS/SSL in the configuration of the PostgreSQL instance. This is because PBI is unable to use an SSL connection.
- To do this, go to the PostgreSQL instance page on the Azure Portal.
- Click on the `Server parameters` section.
- Turn the `require_secure_transport` parameter to `Off`.
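This can also be done from the Azure CLI; a hypothetical sketch (the resource group placeholder is an assumption):

```shell
# Hypothetical: disable forced TLS on the flexible server
az postgres flexible-server parameter set \
  --resource-group <your-resource-group> \
  --server-name superhog-dwh-<your-env> \
  --name require_secure_transport \
  --value off
```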
- Once you are done, you should be able to go to the PBI Service (the online UI), open the gateways page in settings, and see the gateway listed in the `On-premises data gateways` section.
- Our dbt project (<https://guardhog.visualstudio.com/Data/_git/data-dwh-dbt-project>) can be deployed on any linux VM within the virtual network. The instructions on how to deploy and schedule it are in the project repository.
- Backups are managed with Azure. In the Azure Portal page for the PostgreSQL service, visit section `Backup and restore`. Production servers should have 14 days as a retention period.
- Jumphosts barely hold any data at all. Although it's tempting to forget about this and simply spin up another VM if something goes wrong, it would be annoying to have to regenerate the keys of both the VPN server and all of its clients.
- To solve this, make a habit of regularly copying the Wireguard config file to another machine. In theory, making a copy every time it gets modified should be enough.
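A hypothetical one-liner for that habit, assuming the server config lives at Wireguard's default path `/etc/wireguard/wg0.conf` (adjust if your setup script places it elsewhere):

```shell
# Pull a dated copy of the server config onto your local machine (VPN active)
ssh azureuser@<jumphost-vpn-ip> 'sudo cat /etc/wireguard/wg0.conf' \
  > wg0.conf.backup.$(date +%F)
```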
- The PBI Gateway is pretty much stateless. Given this, if there are any issues or disasters on the current VM, simply create another one and set up the gateway again.