Merged PR 1935: SQL Users Update
With the arrival of new members to the data team, I've redesigned the roles and users in the DWH a bit. This PR modifies the infra script to include the commands that should be run to end up in today's state of the DWH, should we ever have to redeploy from scratch. It isn't fully reproducible, since some state lives outside the script (personal users of data team members, existing `sync_` schemas made by Airbyte, etc.), but it does create all necessary roles and the most important users, and it explains the general philosophy.

Related work items: #16911
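As a quick sanity check after running the updated script, something like this sketch (run as the admin from any SQL client) should list the roles and login users it creates:

```sql
-- Roles the script creates, and whether each one can log in
SELECT rolname, rolcanlogin
FROM pg_roles
WHERE rolname IN ('dwh_builder', 'modeler', 'consumer',
                  'dbt_user', 'powerbi_user', 'airbyte_user');
```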
This commit is contained in: commit 1d981fc728

2 changed files with 101 additions and 46 deletions

human-script.md

@@ -23,7 +23,7 @@ Follow this to deploy the entire data infra.
 - We will create an SSH Keypair for this deployment. It will be used to access VMs, Git repos and other services.
 - Create the SSH Key pair
-  - Name the key: `superhog-data-<your-env>-general-ssh`
+  - Name the key: `superhog-data-general-ssh-<your-env>`
   - Add tags:
     - `team: data`
     - `environment: <your-env>`
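If you prefer scripting this step over the portal, an Azure CLI sketch along these lines should be equivalent (the resource group name is a placeholder, not from this doc):

```bash
# Creates the SSH key pair resource with the tags listed above
az sshkey create \
  --name "superhog-data-general-ssh-<your-env>" \
  --resource-group "<your-resource-group>" \
  --tags team=data environment=<your-env>
```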
@@ -106,7 +106,7 @@ Follow this to deploy the entire data infra.
     - Protocol: TCP
     - Action: Allow
     - Priority: 100
-  - SSH Rule
+  - RDP Rule
     - Name: AllowRDPFromJumphostInbound
     - Source: the address range for the `jumphost-subnet`. In this example, `10.69.0.0/29`.
     - Source port ranges: *
@@ -116,7 +116,7 @@ Follow this to deploy the entire data infra.
     - Action: Allow
     - Priority: 110
   - Airbyte web rule
-    - Name: AllowAirbyteWebFromJumphostInbound
+    - Name: AllowWebFromJumphostInbound
     - Source: the address range for the `jumphost-subnet`. In this example, `10.69.0.0/29`.
     - Source port ranges: *
     - Destination: the address range for the `services-subnet`. In this example, `10.69.0.64/26`.
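For reference, the renamed rule could also be created from the CLI; a sketch in which the NSG name and destination port are placeholders (only the ranges and priority come from the doc):

```bash
# Inbound rule allowing web traffic from the jumphost subnet to the services subnet
az network nsg rule create \
  --resource-group "<your-resource-group>" \
  --nsg-name "<your-nsg>" \
  --name AllowWebFromJumphostInbound \
  --direction Inbound --access Allow --protocol Tcp --priority 110 \
  --source-address-prefixes 10.69.0.0/29 \
  --source-port-ranges '*' \
  --destination-address-prefixes 10.69.0.64/26 \
  --destination-port-ranges '<web-port>'
```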
@@ -208,7 +208,7 @@ Follow this to deploy the entire data infra.
   - Use Ubuntu Server 22.04
   - Use Size: `Standard_B1s`
   - Use username: `azureuser`
-  - Use the SSH Key: `superhog-data-<your-env>-general-ssh`
+  - Use the SSH Key: `superhog-data-general-ssh-<your-env>`
   - Select the option `None` for Public inbound ports.
 - Disk settings
   - Defaults are fine. This barely needs any disk.
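The same jumphost could be stood up from the CLI; a sketch assuming the key and resource group already exist (angle-bracket names are placeholders):

```bash
# B1s jumphost; --nsg-rule NONE matches "None" for public inbound ports
az vm create \
  --resource-group "<your-resource-group>" \
  --name "jumphost-<your-env>" \
  --image Ubuntu2204 \
  --size Standard_B1s \
  --admin-username azureuser \
  --ssh-key-name "superhog-data-general-ssh-<your-env>" \
  --nsg-rule NONE
```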
@@ -245,8 +245,12 @@ Follow this to deploy the entire data infra.
 
 - Run the following script (requires `sudo`) to install wireguard and configure it
-  - Pay attention: you need to fill in the public IP manually, as well as the network mask of the virtual network
+  - *Note: the IPs chosen for the VPN can absolutely be changed. Just make sure they are consistent across the server and client configurations of the VPN.*
+  - *Note: you need to input the public IP and the VNET network mask manually at the top of the script.*
 
 ```bash
+JUMPHOST_PUBLIC_IP=<write-the-public-ip-here>
+NETWORK_MASK_FOR_VNET=<write-the-network-mask-here>
+
 echo "Installing Wireguard."
 apt update
 apt install wireguard -y
@@ -295,8 +299,8 @@ Follow this to deploy the entire data infra.
 
 [Peer]
 PublicKey = ${SERVER_PUBLIC_KEY}
-AllowedIPs = 192.168.69.1/32,<network-mask-for-vnet>
-Endpoint = <fill-public-ip-here>:52420
+AllowedIPs = 192.168.69.1/32,${NETWORK_MASK_FOR_VNET}
+Endpoint = ${JUMPHOST_PUBLIC_IP}:52420
 ##############################
 
 EOF
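On the client side, once the `[Peer]` config above is saved (assuming the usual `/etc/wireguard/wg0.conf` location), the tunnel comes up with the standard WireGuard tooling:

```bash
# Bring the tunnel up from /etc/wireguard/wg0.conf and check the handshake with the jumphost
sudo wg-quick up wg0
sudo wg show wg0
```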
@@ -406,58 +410,110 @@ Follow this to deploy the entire data infra.
 
 - Validate the deployment by trying to log into the database with the `dwh_admin_<your-env>` user from your favourite SQL client (you can use DBeaver, for example). Be aware that your VPN connection should be active so that the DWH is reachable from your device.
 
-### 4.2 Create database and schemas
+### 4.2 Create database
 
-- Run the following commands to create a new database and the needed schemas
+- Run the following commands to create a new database
 
 ```sql
 CREATE DATABASE dwh;
--- Change active DB to dwh
-CREATE SCHEMA staging;
-CREATE SCHEMA intermediate;
-CREATE SCHEMA reporting;
 ```
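Note that `CREATE DATABASE` does not switch your session to the new database (which is what the removed comment was hinting at); in psql the switch is a client-side meta-command:

```sql
-- Reconnect to the new database before creating anything inside it
\c dwh
```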
-### 4.3 Create users and roles
+- From now on, use this database for everything
+
+### 4.3 Create schemas, roles and users
 
 - Run the following script to create:
-  - A `modeler` role, owner of the `staging`, `intermediate` and `reporting` schemas.
-  - A `consumer` role, capable of reading the `reporting` schema.
-  - A dbt user, with `modeler` role.
-  - An airbyte user, with permission to create new schemas.
-  - A Power BI user, with `consumer` role.
+  - A `dwh_builder` role, which:
+    - Owns the `staging`, `intermediate` and `reporting` schemas.
+    - Can read `sync_` schemas.
+    - Thus, it can do as it pleases in them.
+    - Designed to support `dbt run`.
+  - A `modeler` role, which:
+    - Can read the `staging`, `intermediate` and `reporting` schemas.
+    - Can read `sync_` schemas.
+    - Designed for data team members and power users to be able to read everything within the DWH.
+  - A `consumer` role, which:
+    - Can read the `reporting` schema.
+  - A user `dbt_user`, with the `dwh_builder` role.
+  - A user `powerbi_user`, with the `consumer` role.
+  - A user `airbyte_user`, with permission to create new schemas.
 - *Note: replace the password fields with serious passwords and note them down.*
+- *Note: replace the name of the admin user.*
 
 ```sql
+GRANT pg_read_all_data TO dwh_admin_infratest;
+
+-- Start here, logged in as the dwh admin
 CREATE ROLE airbyte_user LOGIN PASSWORD 'password' VALID UNTIL 'infinity';
 GRANT CREATE ON DATABASE dwh TO airbyte_user;
-CREATE ROLE modeler INHERIT;
-GRANT USAGE ON SCHEMA staging TO modeler;
-GRANT USAGE ON SCHEMA intermediate TO modeler;
-GRANT USAGE ON SCHEMA reporting TO modeler;
-GRANT ALL ON ALL TABLES IN SCHEMA staging TO modeler;
-GRANT ALL ON ALL TABLES IN SCHEMA intermediate TO modeler;
-GRANT ALL ON ALL TABLES IN SCHEMA reporting TO modeler;
-ALTER SCHEMA staging OWNER TO modeler;
-ALTER SCHEMA intermediate OWNER TO modeler;
-ALTER SCHEMA reporting OWNER TO modeler;
+
+-- Login as the airbyte_user
+CREATE SCHEMA sync_default;
+
+-- Login as the dwh admin again
+CREATE SCHEMA staging;
+CREATE SCHEMA intermediate;
+CREATE SCHEMA reporting;
+
+CREATE ROLE dwh_builder INHERIT;
+ALTER SCHEMA staging OWNER TO dwh_builder;
+ALTER SCHEMA intermediate OWNER TO dwh_builder;
+ALTER SCHEMA reporting OWNER TO dwh_builder;
+
 CREATE ROLE dbt_user LOGIN PASSWORD 'password' VALID UNTIL 'infinity';
-GRANT modeler to dbt_user;
+GRANT dwh_builder to dbt_user;
 
 CREATE ROLE consumer INHERIT;
-CREATE ROLE powerbi_user LOGIN PASSWORD 'password' VALID UNTIL 'infinity';
-GRANT consumer to powerbi_user;
+
+CREATE ROLE modeler INHERIT;
+-- You might want to create a first personal user with modeler role here
+
+-- Login as airbyte_user
+
+GRANT USAGE ON SCHEMA sync_default TO dwh_builder;
+GRANT SELECT ON ALL TABLES IN SCHEMA sync_default TO dwh_builder;
+ALTER DEFAULT PRIVILEGES IN SCHEMA sync_default GRANT SELECT ON TABLES TO dwh_builder;
+
+GRANT USAGE ON SCHEMA sync_default TO modeler;
+GRANT SELECT ON ALL TABLES IN SCHEMA sync_default TO modeler;
+ALTER DEFAULT PRIVILEGES IN SCHEMA sync_default GRANT SELECT ON TABLES TO modeler;
+
+-- Login as dbt_user
+
+GRANT USAGE ON SCHEMA staging TO modeler;
+GRANT SELECT ON ALL TABLES IN SCHEMA staging TO modeler;
+ALTER DEFAULT PRIVILEGES IN SCHEMA staging GRANT SELECT ON TABLES TO modeler;
+GRANT USAGE ON SCHEMA intermediate TO modeler;
+GRANT SELECT ON ALL TABLES IN SCHEMA intermediate TO modeler;
+ALTER DEFAULT PRIVILEGES IN SCHEMA intermediate GRANT SELECT ON TABLES TO modeler;
+GRANT USAGE ON SCHEMA reporting TO modeler;
+GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO modeler;
+ALTER DEFAULT PRIVILEGES IN SCHEMA reporting GRANT SELECT ON TABLES TO modeler;
+
+GRANT USAGE ON SCHEMA reporting TO consumer;
+GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO consumer;
+ALTER DEFAULT PRIVILEGES IN SCHEMA reporting GRANT SELECT ON TABLES TO consumer;
+
+CREATE ROLE powerbi_user LOGIN PASSWORD 'password' VALID UNTIL 'infinity';
+GRANT consumer to powerbi_user;
 ```
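The script's comment about a first personal user is left open on purpose; a minimal sketch of what that could look like (the username is hypothetical):

```sql
-- Hypothetical personal account for a data team member, per the comment above
CREATE ROLE jane_doe LOGIN PASSWORD 'a-serious-password' VALID UNTIL 'infinity';
GRANT modeler TO jane_doe;
```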
 
-- If you want, you might also want to create more users depending on your needs. Typically, date team members should also have the `modeler` role.
+- On adding new users:
+  - Typically, you will want to create personal accounts for data team members with `modeler` role so that they can query everywhere in the dwh.
+  - Any other services or users that need to access the reporting layer can be given the `consumer` role.
+  - Furthermore, `sync_` schema permissions need to be dynamically managed from this point on. This means that:
+    - Generally, all `sync_` schemas should be created by the `airbyte_user`.
+    - Whenever a new `sync_` schema comes to life, both the `modeler` and `dwh_builder` roles should receive access. You can use the following command template:
+
+```sql
+-- Login as airbyte_user
+
+GRANT USAGE ON SCHEMA sync_<some-new-source> TO dwh_builder;
+GRANT SELECT ON ALL TABLES IN SCHEMA sync_<some-new-source> TO dwh_builder;
+ALTER DEFAULT PRIVILEGES IN SCHEMA sync_<some-new-source> GRANT SELECT ON TABLES TO dwh_builder;
+
+GRANT USAGE ON SCHEMA sync_<some-new-source> TO modeler;
+GRANT SELECT ON ALL TABLES IN SCHEMA sync_<some-new-source> TO modeler;
+ALTER DEFAULT PRIVILEGES IN SCHEMA sync_<some-new-source> GRANT SELECT ON TABLES TO modeler;
+```
 
 ## 5. Airbyte
 
@@ -467,9 +523,9 @@ Follow this to deploy the entire data infra.
 - Basic settings
   - Name it: `airbyte-<your-env>`
   - Use Ubuntu Server 22.04
-  - Use Size: `Standard_DS1_v2`
+  - I suggest size `Standard_DS1_v2` for testing. For production, get something beefier.
   - Use username: `azureuser`
-  - Use the SSH Key: `superhog-data-<your-env>-general-ssh`
+  - Use the SSH Key: `superhog-data-general-ssh-<your-env>`
   - Select the option `None` for Public inbound ports.
 - Disk settings
   - Increasing the data disk to at least 64gb as a starting point is recommended. Airbyte can be a bit of a disk hog, and running low on space might lead to obscure errors. Start with 64gb and monitor as you increase usage.
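A cheap way to keep an eye on the disk pressure mentioned above (assuming a docker-compose Airbyte install with the default Docker data root):

```bash
# Overall usage of the root disk, and how much of it Docker (Airbyte's images and volumes) takes
df -h /
sudo du -sh /var/lib/docker
```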
@@ -608,7 +664,7 @@ Follow this to deploy the entire data infra.
 
 ## 7. dbt
 
-- Our dbt project (https://guardhog.visualstudio.com/Data/_git/data-dwh-dbt-project) can be deployed on any linux VM within the virtual network. The instructions on how to deploy and schedule it are in the project repository.
+- Our dbt project (<https://guardhog.visualstudio.com/Data/_git/data-dwh-dbt-project>) can be deployed on any linux VM within the virtual network. The instructions on how to deploy and schedule it are in the project repository.
 - You can opt to deploy it in the same machine where airbyte is stored, since that machine is probably fairly underutilized.
 
 ## 8. Monitoring
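The repository holds the authoritative deploy and schedule instructions; as an illustration only, a scheduled run on such a VM is often just a crontab entry (the project path is hypothetical):

```bash
# Hypothetical crontab line: nightly dbt run at 03:00, with output kept for debugging
0 3 * * * cd /opt/data-dwh-dbt-project && dbt run >> /var/log/dbt-run.log 2>&1
```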
@@ -1,8 +1,8 @@
-# Architecture Overview
+# Platform Overview
 
 Our infrastructure is designed to run on Azure.
 
-The data infra architecture provides the following services:
+The data platform provides the following services:
 
 - A PostgreSQL server which acts as a DWH.
 - A self-hosted Airbyte service that acts as a data integration tool (E and L out of ELT).
@@ -11,19 +11,18 @@ The data infra architecture provides the following services:
 - A simple scheduled dbt run for a dbt project that runs on top of the DWH.
 - A VPN Server + DNS Resolution to allow developers and power users to access the different services.
 
-The infra serves Superhog in the following way:
+The platform serves Superhog in the following way:
 
-- Data gets ingested from several sources into our DWH.
+- Data gets ingested from several sources into our DWH. Typically with Airbyte, but other options might be needed for specific cases.
 - We perform data cleaning and modeling inside the DWH with dbt. This results in tables in a reporting schema that support our data needs.
 - Data team members and power users build PBI reports and other data products on top of the reporting schema.
 - Data team members and other analysts can also rely on direct access to the DWH to perform ad-hoc analysis and basically cover any data needs that go beyond PBI reports.
 
-The data infra relies on the following main components:
+The data infra relies on the following main Azure components:
 
 - A subscription to hold everything.
 - A resource group to hold all resources.
-- A private network.
-- Three subnets.
+- A private network and three subnets.
 - A private DNS zone.
 - A managed PostgreSQL server.
 - A handful of VMs to host services.
@@ -34,5 +33,5 @@ More detailed components also get created for some of those (network security gr
 The following elements are external to the data infrastructure but important:
 
 - Superhog's application SQL Server database + Networking settings for it to be reachable from Airbyte.
-- Superhog's service status.
+- Superhog's service status page.
 - VPN configurations in our laptops to access the data private network.