# Local DWH
## Rationale
We currently use a Postgres server as our production DWH.
The good news is that running a pretty close copy of it on your laptop is easy. This is extremely useful because you can develop and test your dbt changes locally, which will improve your speed and quality by orders of magnitude. This way, we can push changes to production fast and with confidence.
## What we will set up
The setup consists of preparing:
- A local Postgres server, that runs in a docker container.
- The server will have two databases that you can use depending on your needs:
  - `dwh`: a database that only holds local data, where you are 100% in control of the data you are using. You can feed it raw data from production backups or any other manual data-gathering approach.
  - `dwh_hybrid`: a database that uses Postgres' Foreign Data Wrappers to let you read from the `sync_` schemas in our production DWH. This means that, if you want to use production data for your development, you don't need to copy it manually: it's always reachable directly from your machine.
- The server will have a single user, named `postgres`, that owns all objects and has privileges to run any operation.
- We will prepare `dbt` to use `dwh_hybrid` by default when you are developing locally.
## Instructions
### Pre-requisites
- Before you begin, you will need to have a personal user in the production DWH with the `modeler` role granted. If you don't have one, ask Pablo to get you one.
- You will also need WSL and Docker Desktop enabled, with the integration between the two turned on. You can read more on this page: <https://www.notion.so/knowyourguest-superhog/How-to-set-up-WSL-and-Docker-Desktop-4771651ae49a455dac98d7071abcd66d?pvs=4>
### Creating your server
- To start your server for the first time:
- Make a copy of the `.env-example` file in this same folder and name it `.env`. Fill it in with a password for your local database user.
- Start WSL
- Start Docker Desktop
- Visit this directory from a WSL terminal and run the command: `docker compose -f local-dwh.docker-compose.yml up -d`
- You should see the local DWH container now running in Docker Desktop. If not, something went wrong.
- Connect to the server with DBeaver or whatever database manager you are using.
- Create two databases named `dwh` and `dwh_hybrid`.
```sql
CREATE DATABASE dwh;
CREATE DATABASE dwh_hybrid;
```
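- You can confirm both databases were created with a quick catalog query:
```sql
-- Should return one row per database if both exist
SELECT datname
FROM pg_database
WHERE datname IN ('dwh', 'dwh_hybrid');
```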
- Switch to the `dwh_hybrid` database and use the following code to create two stored procedures, `setup_fdw_for_schemas` and `refresh_fdw_schemas`:
```sql
CREATE OR REPLACE PROCEDURE setup_fdw_for_schemas(
    IN server_name TEXT,      -- Foreign server name
    IN host TEXT,             -- Foreign server host address
    IN port TEXT,             -- Foreign server port
    IN dbname TEXT,           -- Foreign database name
    IN remote_user TEXT,      -- Remote database user
    IN remote_password TEXT,  -- Remote database password
    IN schema_list TEXT[]     -- List of schemas to import
)
LANGUAGE plpgsql
AS $$
DECLARE
    schema_to_sync TEXT;  -- Variable to loop through schema list
BEGIN
    -- Step 1: Create the extension if it doesn't exist
    IF NOT EXISTS (SELECT 1 FROM pg_extension WHERE extname = 'postgres_fdw') THEN
        CREATE EXTENSION IF NOT EXISTS postgres_fdw;
    END IF;

    -- Step 2: Create the foreign server if it doesn't already exist
    IF NOT EXISTS (SELECT 1 FROM pg_foreign_server WHERE srvname = server_name) THEN
        EXECUTE format('CREATE SERVER %I FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host %L, port %L, dbname %L)',
                       server_name, host, port, dbname);
    END IF;

    -- Step 3: Create a user mapping for the specified remote user
    IF NOT EXISTS (SELECT 1 FROM pg_user_mapping um JOIN pg_foreign_server fsev ON fsev.oid = um.umserver WHERE fsev.srvname = server_name) THEN
        EXECUTE format('CREATE USER MAPPING FOR CURRENT_USER SERVER %I OPTIONS (user %L, password %L)',
                       server_name, remote_user, remote_password);
    END IF;

    -- Step 4: Grant USAGE on the foreign server to the current user
    EXECUTE format('GRANT USAGE ON FOREIGN SERVER %I TO %I', server_name, current_user);

    -- Step 5: Loop through each schema in the schema list
    FOREACH schema_to_sync IN ARRAY schema_list
    LOOP
        -- Step 5.1: Create the schema if it doesn't exist
        IF NOT EXISTS (SELECT 1 FROM information_schema.schemata WHERE schema_name = schema_to_sync) THEN
            EXECUTE format('CREATE SCHEMA IF NOT EXISTS %I', schema_to_sync);
        END IF;
        -- Step 5.2: Import the foreign schema into the local schema
        EXECUTE format('IMPORT FOREIGN SCHEMA %I FROM SERVER %I INTO %I',
                       schema_to_sync, server_name, schema_to_sync);
    END LOOP;
END;
$$;

CREATE OR REPLACE PROCEDURE refresh_fdw_schemas(
    IN server_name TEXT,   -- Foreign server name
    IN schema_list TEXT[]  -- List of schemas to refresh
)
LANGUAGE plpgsql
AS $$
DECLARE
    schema_to_sync TEXT;
BEGIN
    -- Step 1: Loop through each schema in the schema list
    FOREACH schema_to_sync IN ARRAY schema_list
    LOOP
        -- Step 1.1: Drop the schema if it exists, then recreate it empty
        IF EXISTS (SELECT 1 FROM information_schema.schemata WHERE schema_name = schema_to_sync) THEN
            EXECUTE format('DROP SCHEMA %I CASCADE', schema_to_sync);
        END IF;
        EXECUTE format('CREATE SCHEMA IF NOT EXISTS %I', schema_to_sync);
        -- Step 1.2: Re-import the foreign schema from the server
        EXECUTE format('IMPORT FOREIGN SCHEMA %I FROM SERVER %I INTO %I',
                       schema_to_sync, server_name, schema_to_sync);
    END LOOP;
END;
$$;
```
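- Optionally, you can confirm both procedures were created by checking the catalog:
```sql
-- Should return both procedure names
SELECT proname
FROM pg_proc
WHERE proname IN ('setup_fdw_for_schemas', 'refresh_fdw_schemas');
```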
- You will now use the stored procedure named `setup_fdw_for_schemas`. You can call it as follows:
```sql
CALL setup_fdw_for_schemas(
    'dwh_prd',                           -- Keep like this
    '<dwh_prd-host>',                    -- Ask the team for this value
    '<dwh_prd-port>',                    -- Ask the team for this value
    '<dwh_prd-db>',                      -- Ask the team for this value
    '<your-dwh-prd-user>',               -- Your username in dwh_prd
    '<your-dwh-prd-pwd>',                -- Your password in dwh_prd
    ARRAY['sync_core', 'another_schema'] -- Ask the team for an up-to-date list of all existing sync schemas
);
```
- Once you've done this, you can check that the foreign tables are successfully set up by navigating to one of the schemas you just imported, looking at its foreign tables, and throwing a (small) query at one of them. If the query returns data, your foreign connection is properly set up.
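- For example, a minimal check, assuming you imported `sync_core` (the table name is just an illustration; any foreign table in the schema will do):
```sql
-- List the foreign tables that were imported into the schema
SELECT foreign_table_schema, foreign_table_name
FROM information_schema.foreign_tables
WHERE foreign_table_schema = 'sync_core';

-- Throw a small query at one of them
SELECT * FROM sync_core."Accommodation" LIMIT 10;
```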
- An optional (but strongly recommended) step here is to raise the FDW's fetch size parameter. It controls how many records are brought over from the remote in one go, so a higher fetch size means fewer network round trips between the two Postgres servers and typically more speed for our services. You can try 100,000 records. Note that the plain `OPTIONS (fetch_size ...)` syntax only works the first time you set the value; changing it afterwards requires the `SET` keyword, as shown below.
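```sql
-- First time you set the option:
ALTER SERVER dwh_prd OPTIONS (fetch_size '100000');

-- Any later change to an already-set option needs the SET keyword:
ALTER SERVER dwh_prd OPTIONS (SET fetch_size '100000');
```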
- Next, you probably want to add these databases to your `~/.dbt/profiles.yml` file.
- You can check the example file `profiles.yml.example` at the root of this repo.
- You should make two entries: one for the `dwh` database and another one for `dwh_hybrid`.
- We recommend that you make `dwh_hybrid` your default value for `target`.
- Finally, try running a `dbt` command against the databases to check that everything works (`dbt debug` is a good first choice, since it validates your profile and connection). Just make sure you've followed the relevant steps from the repository's main `README.md` file before doing so.
## Usage
### Refreshing foreign schemas in `dwh_hybrid`
If new `sync_` schemas get created in the production DWH, or the existing ones get new tables, they won't appear automatically in `dwh_hybrid`. To get them working on your laptop, call the second stored procedure we created, like this:
```sql
CALL refresh_fdw_schemas(
    'dwh_prd',
    ARRAY['sync_some_new_schema', 'sync_some_old_schema_with_new_tables']
);
```
Bear in mind that this will run `DROP SCHEMA ... CASCADE` on the schemas being refreshed! This means dependent views and objects will be nuked. But since we use `dbt` to rebuild everything, that shouldn't cause you much of a problem.
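If you're not sure which sync schemas you currently have locally, a quick way to list them (the backslash escapes the `_` wildcard in the `LIKE` pattern):
```sql
SELECT schema_name
FROM information_schema.schemata
WHERE schema_name LIKE 'sync\_%'
ORDER BY schema_name;
```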
### When to use `dwh` and `dwh_hybrid`
If you have reached this point successfully, you now have a working database you can use to run this `dbt` project on your machine. Congrats!
You might be wondering why we have set up two different databases in your dockerized Postgres: `dwh` and `dwh_hybrid`. You can use either one as you need, but each option comes with tradeoffs. Let me explain, and then you will always be free to decide which to use:
- `dwh_hybrid`:
  - How it works: it connects to our production instance to read the `sync_` schemas. Even though they look like simple local tables, whenever you query the contents of the `sync_` schemas, your queries are executed on the production instance and the results are fetched over the network.
  - Pros:
    - You always have up-to-date production data in your local env, without having to move a finger. Very convenient.
    - As new tables appear in the `sync_` schemas, adding them to your env is trivial.
    - If you want to make local copies of production tables, you can do so with a simple `INSERT INTO ... SELECT ...`
  - Cons:
    - If production is down, this doesn't work at all.
    - You are hitting the production server when developing. This is annoying both for you (if someone is using the production DWH heavily, your queries will suffer) and for other users (if you are using the production DWH heavily, their queries will suffer). Try to be mindful and don't spend all day running `SELECT *` on large tables just because you can.
    - You can only work with production data exactly as it appears in `sync_`. You can't manipulate the data in the `sync_` layer to do things like test values that don't exist in production or cut the data down to a manageable size.
- `dwh`:
  - How it works: this is a silly empty database. It expects you to add any data manually. This means you will have to build the `sync_` layer yourself (by restoring dumps from production, by mocking data, by using the secondary FDW we explain in the next section of this page, etc.)
  - Pros:
    - You have full flexibility. You can prepare the `sync_` layer in any way you want.
    - You don't depend on production being alive to be able to run queries against the `sync_` schemas.
    - Your workload is independent of the production machine's workload.
  - Cons:
    - It's a pain in the ass to prepare production-like datasets, and developing on production-like data is something you will do a lot.
    - You run the risk of developing on stale, old backup data for convenience, which can lead to trouble when you push stuff to production.
### Filling `dwh` with copied data
Given the previous discussion, sometimes you may want to work fully locally in the `dwh` database, with copies of data instead of references à la `dwh_hybrid`.
There are many ways to populate the data. Historically, we would dump what we needed from the production database and restore it into the local Postgres (e.g. with `pg_dump` and `pg_restore`). That's still an option and you can keep doing it.
You can also leverage the same Foreign Data Wrappers extension we saw in earlier sections to copy data into your local `dwh` instead of just referencing it like we do in `dwh_hybrid`. This requires a bit of setup, but once that's done, it is extremely convenient and fast. The rest of this section details how to set it up.
To set things up, we will create procedures similar, but not identical, to the ones in `dwh_hybrid`. Begin by connecting to `dwh` and running the following code to create two stored procedures:
```sql
CREATE OR REPLACE PROCEDURE setup_fdw_for_el(
    IN server_name TEXT,      -- Foreign server name
    IN host TEXT,             -- Foreign server host address
    IN port TEXT,             -- Foreign server port
    IN dbname TEXT,           -- Foreign database name
    IN remote_user TEXT,      -- Remote database user
    IN remote_password TEXT,  -- Remote database password
    IN schema_list TEXT[]     -- List of schemas to import
)
LANGUAGE plpgsql
AS $$
DECLARE
    el_schema TEXT;  -- Variable to loop through schema list
BEGIN
    -- Step 1: Create the extension if it doesn't exist
    IF NOT EXISTS (SELECT 1 FROM pg_extension WHERE extname = 'postgres_fdw') THEN
        CREATE EXTENSION IF NOT EXISTS postgres_fdw;
    END IF;

    -- Step 2: Create the foreign server if it doesn't already exist
    IF NOT EXISTS (SELECT 1 FROM pg_foreign_server WHERE srvname = server_name) THEN
        EXECUTE format('CREATE SERVER %I FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host %L, port %L, dbname %L)',
                       server_name, host, port, dbname);
    END IF;

    -- Step 3: Create a user mapping for the specified remote user
    IF NOT EXISTS (SELECT 1 FROM pg_user_mapping um JOIN pg_foreign_server fsev ON fsev.oid = um.umserver WHERE fsev.srvname = server_name) THEN
        EXECUTE format('CREATE USER MAPPING FOR CURRENT_USER SERVER %I OPTIONS (user %L, password %L)',
                       server_name, remote_user, remote_password);
    END IF;

    -- Step 4: Grant USAGE on the foreign server to the current user
    EXECUTE format('GRANT USAGE ON FOREIGN SERVER %I TO %I', server_name, current_user);

    -- Step 5: Loop through each schema in the schema list
    FOREACH el_schema IN ARRAY schema_list
    LOOP
        -- Step 5.1: Create the local el_ schema if it doesn't exist
        IF NOT EXISTS (SELECT 1 FROM information_schema.schemata WHERE schema_name = 'el_' || el_schema) THEN
            EXECUTE format('CREATE SCHEMA IF NOT EXISTS %I', 'el_' || el_schema);
        END IF;
        -- Step 5.2: Import the foreign schema into the local el_ schema
        EXECUTE format('IMPORT FOREIGN SCHEMA %I FROM SERVER %I INTO %I',
                       el_schema, server_name, 'el_' || el_schema);
    END LOOP;
END;
$$;

CREATE OR REPLACE PROCEDURE refresh_fdw_el_schemas(
    IN server_name TEXT,   -- Foreign server name
    IN schema_list TEXT[]  -- List of schemas to refresh
)
LANGUAGE plpgsql
AS $$
DECLARE
    el_schema TEXT;
BEGIN
    -- Step 1: Loop through each schema in the schema list
    FOREACH el_schema IN ARRAY schema_list
    LOOP
        -- Step 1.1: Drop the local el_ schema if it exists, then recreate it empty
        IF EXISTS (SELECT 1 FROM information_schema.schemata WHERE schema_name = 'el_' || el_schema) THEN
            EXECUTE format('DROP SCHEMA %I CASCADE', 'el_' || el_schema);
        END IF;
        EXECUTE format('CREATE SCHEMA IF NOT EXISTS %I', 'el_' || el_schema);
        -- Step 1.2: Re-import the foreign schema from the server
        EXECUTE format('IMPORT FOREIGN SCHEMA %I FROM SERVER %I INTO %I',
                       el_schema, server_name, 'el_' || el_schema);
    END LOOP;
END;
$$;
```
Once you've done that, you will have to execute a `CALL` to `setup_fdw_for_el`. It's the exact same story as with `dwh_hybrid`, except that the created foreign schemas will be prefixed with `el_` (for Extract, Load) so that their names don't clash with the regular sync schemas; see the example call below. And again, just like with `dwh_hybrid`, any time you need to refresh the schemas you can call `refresh_fdw_el_schemas`.
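A sketch of the call, mirroring the earlier `setup_fdw_for_schemas` example (the placeholder values are the same ones you used there):
```sql
CALL setup_fdw_for_el(
    'dwh_prd',              -- Keep like this
    '<dwh_prd-host>',       -- Ask the team for this value
    '<dwh_prd-port>',       -- Ask the team for this value
    '<dwh_prd-db>',         -- Ask the team for this value
    '<your-dwh-prd-user>',  -- Your username in dwh_prd
    '<your-dwh-prd-pwd>',   -- Your password in dwh_prd
    ARRAY['sync_core']      -- These get imported as el_sync_core, etc.
);
```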
Once this FDW is set up, you can use it to copy data into the `dwh` sync schemas. For example, imagine I want to get the `Accommodation` table from `sync_core`. I can create the `sync_core` schema manually (`CREATE SCHEMA sync_core;`) and then use our `el_sync_core` FDW schema to copy the data over with a query like this:
```sql
CREATE TABLE sync_core."Accommodation" AS (
    SELECT *
    FROM el_sync_core."Accommodation" a
);
```
This creates the table and copies all the data over. If instead the table already exists and you only want to load rows into it, simply switch to an `INSERT INTO ... SELECT` statement:
```sql
INSERT INTO sync_core."Accommodation"
SELECT *
FROM el_sync_core."Accommodation" a;
```
These examples copied the entire table through the FDW, but here you have more flexibility than in `dwh_hybrid`: you can bring over only a subset of the data, defined by whatever query you like.
```sql
INSERT INTO sync_core."Accommodation"
SELECT *
FROM el_sync_core."Accommodation" a
WHERE a."CreatedByUserId" = 'only-the-host-i-care-about-in-dev';
```
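A quick way to sanity-check a full copy, reusing the same example table (just compare the local and remote row counts):
```sql
SELECT
    (SELECT count(*) FROM sync_core."Accommodation")    AS local_rows,
    (SELECT count(*) FROM el_sync_core."Accommodation") AS remote_rows;
```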
## Other stuff
- Bear in mind that the docker compose file comes with Postgres server settings that were tuned for the laptops the team was using in August 2024. They might be more or less relevant to you. In case of doubt, you might want to use <https://pgtune.leopard.in.ua/>.
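- If you want to see what your container is actually running with before tweaking anything, you can inspect the live settings (a generic catalog query; the parameter list below is just a sample of the values pgtune usually touches):
```sql
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size', 'max_connections');
```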