diff --git a/human-script.md b/human-script.md
index 11268c6..0935305 100644
--- a/human-script.md
+++ b/human-script.md
@@ -23,7 +23,7 @@ Follow this to deploy the entire data infra.
 
 - We will create an SSH Keypair for this deployment. It will be used to access VMs, Git repos and other services.
   - Create the SSH Key pair
-    - Name the key: `superhog-data--general-ssh`
+    - Name the key: `superhog-data-general-ssh-`
     - Add tags:
       - `team: data`
       - `environment: `
@@ -106,7 +106,7 @@ Follow this to deploy the entire data infra.
       - Protocol: TCP
      - Action: Allow
       - Priority: 100
-    - SSH Rule
+    - RDP Rule
       - Name: AllowRDPFromJumphostInbound
       - Source: the address range for the `jumphost-subnet`. In this example, `10.69.0.0/29`.
       - Source port ranges: *
@@ -116,7 +116,7 @@ Follow this to deploy the entire data infra.
       - Action: Allow
       - Priority: 110
     - Airbyte web rule
-      - Name: AllowAirbyteWebFromJumphostInbound
+      - Name: AllowWebFromJumphostInbound
       - Source: the address range for the `jumphost-subnet`. In this example, `10.69.0.0/29`.
       - Source port ranges: *
       - Destination: the address range for the `services-subnet`. In this example, `10.69.0.64/26`.
@@ -208,7 +208,7 @@ Follow this to deploy the entire data infra.
   - Use Ubuntu Server 22.04
   - Use Size: `Standard_B1s`
   - Use username: `azureuser`
-  - Use the SSH Key: `superhog-data--general-ssh`
+  - Use the SSH Key: `superhog-data-general-ssh-`
   - Select the option `None` for Public inbound ports.
 - Disk settings
   - Defaults are fine. This barely needs any disk.
@@ -245,8 +245,12 @@ Follow this to deploy the entire data infra.
 - Run the following script (requires `sudo`) to install wireguard and configure it
   - Pay attention: you need to fill in the public IP manually, as well as the network mask of the virtual network
   - *Note: the IPs chosen for the VPN can absolutely be changed. Just make sure they are consistent across the server and client configurations of the VPN.*
+  - *Note: you need to fill in the public IP and the VNET network mask at the top of the script.*
 
   ```bash
+  JUMPHOST_PUBLIC_IP=
+  NETWORK_MASK_FOR_VNET=
+
   echo "Installing Wireguard."
   apt update
   apt install wireguard -y
@@ -295,8 +299,8 @@ Follow this to deploy the entire data infra.
 
   [Peer]
   PublicKey = ${SERVER_PUBLIC_KEY}
-  AllowedIPs = 192.168.69.1/32,
-  Endpoint = :52420
+  AllowedIPs = 192.168.69.1/32,${NETWORK_MASK_FOR_VNET}
+  Endpoint = ${JUMPHOST_PUBLIC_IP}:52420
 
   ##############################
   EOF
@@ -406,58 +410,110 @@ Follow this to deploy the entire data infra.
 
 - Validate the deployment by trying to log into the database with the `dwh_admin_` user from your favourite SQL client (you can use DBeaver, for example). Be aware that your VPN connection should be active so that the DWH is reachable from your device.
 
-### 4.2 Create database and schemas
+### 4.2 Create database
 
-- Run the following commands to create a new database and the needed schemas
+- Run the following commands to create a new database
 
   ```sql
   CREATE DATABASE dwh;
-  -- Change active DB to dwh
-  CREATE SCHEMA staging;
-  CREATE SCHEMA intermediate;
-  CREATE SCHEMA reporting;
   ```
 
-### 4.3 Create users and roles
+- From now on, use this database for everything
+
+### 4.3 Create schemas, roles and users
 
 - Run the following script to create:
-  - A `modeler` role, owner of the `staging`, `intermediate` and `reporting` schemas.
-  - A `consumer` role, capable of reading the `reporting` schema.
-  - A dbt user, with `modeler` role.
-  - An airbyte user, with permission to create new schemas.
-  - A Power BI user, with `consumer` role.
+  - A `dwh_builder` role, which:
+    - Owns the `staging`, `intermediate` and `reporting` schemas.
+    - Can read `sync_` schemas.
+    - Thus, has full privileges on the schemas it owns.
+    - Designed to support `dbt run`.
+  - A `modeler` role, which:
+    - Can read the `staging`, `intermediate` and `reporting` schemas.
+    - Can read `sync_` schemas.
+    - Designed for data team members and power users to be able to read everything within the DWH.
+  - A `consumer` role, which:
+    - Can read the `reporting` schema.
+  - A user `dbt_user`, with `dwh_builder` role.
+  - A user `powerbi_user`, with `consumer` role.
+  - A user `airbyte_user`, with permission to create new schemas.
   - *Note: replace the password fields with serious passwords and note them down.*
+  - *Note: replace the name of the admin user.*
 
   ```sql
-  GRANT pg_read_all_data TO dwh_admin_infratest;
-
+  -- Start here, logged in as the dwh admin
   CREATE ROLE airbyte_user LOGIN PASSWORD 'password' VALID UNTIL 'infinity';
   GRANT CREATE ON DATABASE dwh TO airbyte_user;
 
-  CREATE ROLE modeler INHERIT;
-  GRANT USAGE ON SCHEMA staging TO modeler;
-  GRANT USAGE ON SCHEMA intermediate TO modeler;
-  GRANT USAGE ON SCHEMA reporting TO modeler;
-  GRANT ALL ON ALL TABLES IN SCHEMA staging TO modeler;
-  GRANT ALL ON ALL TABLES IN SCHEMA intermediate TO modeler;
-  GRANT ALL ON ALL TABLES IN SCHEMA reporting TO modeler;
-  ALTER SCHEMA staging OWNER TO modeler;
-  ALTER SCHEMA intermediate OWNER TO modeler;
-  ALTER SCHEMA reporting OWNER TO modeler;
+  -- Login as the airbyte_user
+  CREATE SCHEMA sync_default;
+
+  -- Login as the dwh admin again
+  CREATE SCHEMA staging;
+  CREATE SCHEMA intermediate;
+  CREATE SCHEMA reporting;
+
+  CREATE ROLE dwh_builder INHERIT;
+  ALTER SCHEMA staging OWNER TO dwh_builder;
+  ALTER SCHEMA intermediate OWNER TO dwh_builder;
+  ALTER SCHEMA reporting OWNER TO dwh_builder;
 
   CREATE ROLE dbt_user LOGIN PASSWORD 'password' VALID UNTIL 'infinity';
-  GRANT modeler to dbt_user;
+  GRANT dwh_builder TO dbt_user;
 
   CREATE ROLE consumer INHERIT;
+  CREATE ROLE powerbi_user LOGIN PASSWORD 'password' VALID UNTIL 'infinity';
+  GRANT consumer TO powerbi_user;
+
+  CREATE ROLE modeler INHERIT;
+  -- You might want to create a first personal user with modeler role here
+
+  -- Login as airbyte_user
+
+  GRANT USAGE ON SCHEMA sync_default TO dwh_builder;
+  GRANT SELECT ON ALL TABLES IN SCHEMA sync_default TO dwh_builder;
+  ALTER DEFAULT PRIVILEGES IN SCHEMA sync_default GRANT SELECT ON TABLES TO dwh_builder;
+
+  GRANT USAGE ON SCHEMA sync_default TO modeler;
+  GRANT SELECT ON ALL TABLES IN SCHEMA sync_default TO modeler;
+  ALTER DEFAULT PRIVILEGES IN SCHEMA sync_default GRANT SELECT ON TABLES TO modeler;
+
+  -- Login as dbt_user
+
+  GRANT USAGE ON SCHEMA staging TO modeler;
+  GRANT SELECT ON ALL TABLES IN SCHEMA staging TO modeler;
+  ALTER DEFAULT PRIVILEGES IN SCHEMA staging GRANT SELECT ON TABLES TO modeler;
+  GRANT USAGE ON SCHEMA intermediate TO modeler;
+  GRANT SELECT ON ALL TABLES IN SCHEMA intermediate TO modeler;
+  ALTER DEFAULT PRIVILEGES IN SCHEMA intermediate GRANT SELECT ON TABLES TO modeler;
+  GRANT USAGE ON SCHEMA reporting TO modeler;
+  GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO modeler;
+  ALTER DEFAULT PRIVILEGES IN SCHEMA reporting GRANT SELECT ON TABLES TO modeler;
+
+  GRANT USAGE ON SCHEMA reporting TO consumer;
   GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO consumer;
   ALTER DEFAULT PRIVILEGES IN SCHEMA reporting GRANT SELECT ON TABLES TO consumer;
-
-  CREATE ROLE powerbi_user LOGIN PASSWORD 'password' VALID UNTIL 'infinity';
-  GRANT consumer to powerbi_user;
   ```
 
-- If you want, you might also want to create more users depending on your needs. Typically, date team members should also have the `modeler` role.
+- On adding new users:
+  - Typically, you will want to create personal accounts for data team members with the `modeler` role so that they can query everything in the DWH.
+  - Any other services or users that need to access the reporting layer can be given the `consumer` role.
+- Furthermore, `sync_` schema permissions need to be managed dynamically from this point on. This means that:
+  - Generally, all `sync_` schemas should be created by the `airbyte_user`.
+  - Whenever a new `sync_` schema is created, both the `modeler` and `dwh_builder` roles should be granted access to it. You can use the following command template:
+
+  ```sql
+  -- Login as airbyte_user
+
+  GRANT USAGE ON SCHEMA sync_ TO dwh_builder;
+  GRANT SELECT ON ALL TABLES IN SCHEMA sync_ TO dwh_builder;
+  ALTER DEFAULT PRIVILEGES IN SCHEMA sync_ GRANT SELECT ON TABLES TO dwh_builder;
+
+  GRANT USAGE ON SCHEMA sync_ TO modeler;
+  GRANT SELECT ON ALL TABLES IN SCHEMA sync_ TO modeler;
+  ALTER DEFAULT PRIVILEGES IN SCHEMA sync_ GRANT SELECT ON TABLES TO modeler;
+  ```
 
 ## 5. Airbyte
 
@@ -467,9 +523,9 @@ Follow this to deploy the entire data infra.
 - Basic settings
   - Name it: `airbyte-`
   - Use Ubuntu Server 22.04
-  - Use Size: `Standard_DS1_v2`
+  - I suggest `Standard_DS1_v2` as a size for testing. For production, get something beefier.
   - Use username: `azureuser`
-  - Use the SSH Key: `superhog-data--general-ssh`
+  - Use the SSH Key: `superhog-data-general-ssh-`
   - Select the option `None` for Public inbound ports.
 - Disk settings
   - Increasing the data disk to at least 64gb as a starting point is recommended. Airbyte can be a bit of a disk hog, and running low on space might lead to obscure errors happening. Start with 64gb and monitor as you increase usage.
@@ -608,7 +664,7 @@ Follow this to deploy the entire data infra.
 
 ## 7. dbt
 
-- Our dbt project (https://guardhog.visualstudio.com/Data/_git/data-dwh-dbt-project) can be deployed on any linux VM within the virtual network. The instructions on how to deploy and schedule it are in the project repository.
+- Our dbt project () can be deployed on any linux VM within the virtual network. The instructions on how to deploy and schedule it are in the project repository.
 - You can opt to deploy it in the same machine where airbyte is stored, since that machine is probably fairly underutilized.
 
 ## 8. Monitoring
diff --git a/architecture-overview.md b/platform-overview.md
similarity index 79%
rename from architecture-overview.md
rename to platform-overview.md
index 044ec78..aeac16f 100644
--- a/architecture-overview.md
+++ b/platform-overview.md
@@ -1,8 +1,8 @@
-# Architecture Overview
+# Platform Overview
 
 Our infrastructure is designed to run on Azure.
 
-The data infra architecture provides the following services:
+The data platform provides the following services:
 
 - A PostgreSQL server which acts as a DWH.
 - A self-hosted Airbyte service that acts as a data integration tool (E and L out of ELT).
@@ -11,19 +11,18 @@
 - A simple scheduled dbt run for a dbt project that runs on top of the DWH.
 - A VPN Server + DNS Resolution to allow developers and power users to access the different services.
 
-The infra serves Superhog in the following way:
+The platform serves Superhog in the following way:
 
-- Data gets ingested from several sources into our DWH.
+- Data gets ingested from several sources into our DWH. Typically with Airbyte, but other options might be needed for specific cases.
 - We perform data cleaning and modeling inside the DWH with dbt. This results in tables in a reporting schema that support our data needs.
 - Data team members and power users build PBI reports and other data products on top of the reporting schema.
 - Data team members and other analysts can also rely on direct access to the DWH to perform ad-hoc analysis and basically cover any data needs that go beyond PBI reports.
 
-The data infra relies on the following main components:
+The data infra relies on the following main Azure components:
 
 - A subscription to hold everything.
 - A resource group to hold all resources.
-- A private network.
-- Three subnets.
+- A private network and three subnets.
 - A private DNS zone.
 - A managed PostgreSQL server.
 - A handful of VMs to host services.
@@ -34,5 +33,5 @@ More detailed components also get created for some of those (network security gr
 The following elements are external to the data infrastructure but important:
 
 - Superhog's application SQL Server database + Networking settings for it to be reachable from Airbyte.
-- Superhog's service status.
+- Superhog's service status page.
 - VPN configurations in our laptops to access the data private network.
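
Section 4.3 of `human-script.md` above recommends personal accounts with the `modeler` role for data team members but does not spell out the commands. A minimal sketch (not part of the patch), assuming a hypothetical team member `jane_doe`, run while logged in as the dwh admin:

```sql
-- Hypothetical personal account for a data team member.
-- Replace the name and use a serious password, as with the other users in section 4.3.
CREATE ROLE jane_doe LOGIN PASSWORD 'password' VALID UNTIL 'infinity';
GRANT modeler TO jane_doe;
```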