# Local DWH

## Rationale

We currently use a Postgres server as our production DWH. The good news is that running a pretty close copy of it on your local laptop is easy. This is extremely useful because you can develop and test your dbt changes locally, which will improve your speed and quality by orders of magnitude. This way, we can push changes to production fast and with confidence.

## What we will set up

The setup will consist of preparing:

- A local Postgres server that runs in a Docker container.
- The server will have two databases that you can use depending on your needs:
  - `dwh`: a database that only holds local data, where you are 100% in control of what data you are using. You can feed it raw data from production backups or any other manual data-gathering approach.
  - `dwh_hybrid`: a database that uses Postgres' Foreign Data Wrappers to let you read from the `sync_` schemas in our production DWH. This means that, if you want to use production data for your development, you don't need to copy it manually: it's always reachable directly from your machine.
- The server will have a single user, named `postgres`, that owns all objects and has privileges to run any operation.
- We will prepare `dbt` to use `dwh_hybrid` by default when you are developing locally.

## Instructions

### Pre-requisites

- Before you begin, you will need a personal user in the production DWH with the `modeler` role granted. If you don't have one, ask Pablo to get you one.
- You will also need to have WSL and Docker Desktop enabled, and to enable the integration between both. You can read more on this page:

### Creating your server

- To start your server for the first time:
  - Make a copy of the `.env-example` file in this same folder and name it `.env`. Fill it in with a password for your local database user.
  - Start WSL.
  - Start Docker Desktop.
  - Visit this directory from a WSL terminal and run the command: `docker compose -f local-dwh.docker-compose.yml up -d`
  - You should see the local DWH container now running in Docker Desktop. If not, something went wrong.
- Connect to the server with DBeaver or whatever database manager you are using.
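Once you are connected, a quick throwaway query is enough to confirm the server is up and that you are on the single `postgres` user (purely an optional sanity check, any query would do):

```sql
-- Sanity check: which user and database are we connected to, and which server version?
SELECT current_user, current_database(), version();
```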
- Create two databases named `dwh` and `dwh_hybrid`:

```sql
CREATE DATABASE dwh;
CREATE DATABASE dwh_hybrid;
```

- Switch to the `dwh_hybrid` database and use the following code to create two stored procedures: `setup_fdw_for_schemas` and `refresh_fdw_schemas`.

```sql
CREATE OR REPLACE PROCEDURE setup_fdw_for_schemas(
    IN server_name TEXT,      -- Foreign server name
    IN host TEXT,             -- Foreign server host address
    IN port TEXT,             -- Foreign server port
    IN dbname TEXT,           -- Foreign database name
    IN remote_user TEXT,      -- Remote database user
    IN remote_password TEXT,  -- Remote database password
    IN schema_list TEXT[]     -- List of schemas to import
)
LANGUAGE plpgsql
AS $$
DECLARE
    schema_to_sync TEXT;  -- Variable to loop through schema list
BEGIN
    -- Step 1: Create the extension if it doesn't exist
    IF NOT EXISTS (SELECT 1 FROM pg_extension WHERE extname = 'postgres_fdw') THEN
        CREATE EXTENSION IF NOT EXISTS postgres_fdw;
    END IF;

    -- Step 2: Create the foreign server if it doesn't already exist
    IF NOT EXISTS (SELECT 1 FROM pg_foreign_server WHERE srvname = server_name) THEN
        EXECUTE format(
            'CREATE SERVER %I FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host %L, port %L, dbname %L)',
            server_name, host, port, dbname
        );
    END IF;

    -- Step 3: Create user mapping for the specified remote user
    IF NOT EXISTS (
        SELECT *
        FROM pg_user_mapping um
        LEFT JOIN pg_foreign_server fsev ON fsev.oid = um.umserver
        WHERE fsev.srvname = server_name
    ) THEN
        EXECUTE format(
            'CREATE USER MAPPING FOR CURRENT_USER SERVER %I OPTIONS (user %L, password %L)',
            server_name, remote_user, remote_password
        );
    END IF;

    -- Step 4: Grant USAGE on the foreign server to the current user
    EXECUTE format('GRANT USAGE ON FOREIGN SERVER %I TO %I', server_name, current_user);

    -- Step 5: Loop through each schema in the schema list
    FOREACH schema_to_sync IN ARRAY schema_list LOOP
        -- Step 5.1: Create the schema if it doesn't exist
        IF NOT EXISTS (SELECT 1 FROM information_schema.schemata WHERE schema_name = schema_to_sync) THEN
            EXECUTE format('CREATE SCHEMA IF NOT EXISTS %I', schema_to_sync);
        END IF;

        -- Step 5.2: Import the foreign schema into the local schema
        EXECUTE format(
            'IMPORT FOREIGN SCHEMA %I FROM SERVER %I INTO %I',
            schema_to_sync, server_name, schema_to_sync
        );
    END LOOP;
END;
$$;

CREATE OR REPLACE PROCEDURE refresh_fdw_schemas(
    IN server_name TEXT,   -- Foreign server name
    IN schema_list TEXT[]  -- List of schemas to refresh
)
LANGUAGE plpgsql
AS $$
DECLARE
    schema_to_sync TEXT;
BEGIN
    -- Step 1: Loop through each schema in the schema list
    FOREACH schema_to_sync IN ARRAY schema_list LOOP
        -- Step 1.1: Drop the schema (and its foreign tables) if it exists
        IF EXISTS (SELECT 1 FROM information_schema.schemata WHERE schema_name = schema_to_sync) THEN
            EXECUTE format('DROP SCHEMA %I CASCADE', schema_to_sync);
        END IF;

        -- Step 1.2: Re-create the now-empty local schema so the import has a target
        EXECUTE format('CREATE SCHEMA IF NOT EXISTS %I', schema_to_sync);

        -- Step 1.3: Re-import the foreign schema from the server
        EXECUTE format(
            'IMPORT FOREIGN SCHEMA %I FROM SERVER %I INTO %I',
            schema_to_sync, server_name, schema_to_sync
        );
    END LOOP;
END;
$$;
```
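Before calling anything, you can optionally confirm that both procedures landed in `dwh_hybrid` by peeking at the system catalog (just a sanity check, not a required step):

```sql
-- Both names should come back if the block above ran cleanly
SELECT proname
FROM pg_proc
WHERE proname IN ('setup_fdw_for_schemas', 'refresh_fdw_schemas');
```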
- You will now use the stored procedure named `setup_fdw_for_schemas`. You can call it as follows:

```sql
CALL setup_fdw_for_schemas(
    'dwh_prd',                            -- Keep like this
    '',                                   -- Ask the team for this value
    '',                                   -- Ask the team for this value
    '',                                   -- Ask the team for this value
    '',                                   -- Your username in dwh_prd
    '',                                   -- Your password in dwh_prd
    ARRAY['sync_core', 'another_schema']  -- Ask the team for an up-to-date list of all existing sync schemas
);
```

- Once you've done this, you can check that the foreign tables are successfully set up by navigating to one of the schemas you've just imported, looking at its foreign tables, and throwing a (small) query at one of them. If the query returns data, your foreign connection is properly set up.
- Next, you probably want to add these databases to your `~/.dbt/profiles.yml` file.
  - You can check the example file `profiles.yml.example` at the root of this repo.
  - You should make two entries: one for the `dwh` database and another one for `dwh_hybrid`.
  - We recommend that you make `dwh_hybrid` your default value for `target`.
- Finally, you can try running some `dbt` command against the databases to check that everything is working. Just make sure you've followed the right steps from the repository's main `README.md` file before doing so.

## Usage

### Refreshing foreign schemas in `dwh_hybrid`

If new `sync_` schemas get created in the production DWH, or the existing ones get new tables, they won't appear automatically in `dwh_hybrid`. To get them working on your laptop, you can call the second stored procedure we created, like this:

```sql
CALL refresh_fdw_schemas(
    'dwh_prd',
    ARRAY['sync_some_new_schema', 'sync_some_old_schema_with_new_tables']
);
```

Bear in mind that this will run `DROP ... CASCADE` on the schemas being refreshed! This means dependent views and objects will be nuked. But since we use `dbt`, you shouldn't have much of a problem with that.
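If you are ever unsure which `sync_` schemas you have already imported locally (and therefore which ones are candidates for a refresh), you can ask the local catalog. This is just an illustrative query, not part of the setup:

```sql
-- Count the foreign tables currently imported into each local schema
SELECT foreign_table_schema, count(*) AS foreign_table_count
FROM information_schema.foreign_tables
GROUP BY foreign_table_schema
ORDER BY foreign_table_schema;
```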
### When to use `dwh` and `dwh_hybrid`

If you have reached this point successfully, you now have a working database you can use to run this `dbt` project on your machine. Congrats!

You might be wondering why we have set up two different databases in your dockerized Postgres: `dwh` and `dwh_hybrid`. You can use either one as you need, but each option comes with tradeoffs. Let me explain, and then you will always be free to decide what to use:

- `dwh_hybrid`:
  - How it works: it connects to our production instance to read the `sync_` schemas. Even though they look like simple local tables, whenever you query the contents of the `sync_` schemas, your queries are executed on the production instance and the results are fetched over the network.
  - pros:
    - You always have up-to-date production data in your local env, without having to lift a finger. Very convenient.
    - As new tables appear in the `sync_` schemas, adding them to your env is trivial.
    - If you want to make local copies of production tables, you can do so with a simple `INSERT INTO ... SELECT ...`
  - cons:
    - If production is down, this doesn't work at all.
    - You are hitting the production server while developing. This is annoying both for you (if someone is using the production DWH heavily, your queries will suffer) and for other users (if you are using the production DWH heavily, their queries will suffer). Try to be mindful and don't spend all day running `SELECT *` on large tables just because you can.
    - You can only work with production data exactly as it appears in `sync_`. You can't manually manipulate the data in the `sync_` layer to do things like testing values that aren't in production or cutting the data down to a manageable size.
- `dwh`:
  - How it works: this is a silly empty database. It expects you to add any data manually, which means you will have to build the `sync_` layer yourself (by getting dumps from production, by mocking data, etc.).
  - pros:
    - You have full flexibility. You can prepare the `sync_` layer in any way you want.
    - You don't depend on production being alive to be able to run queries against the `sync_` schemas.
    - Your workload is independent of the production machine's workload.
  - cons:
    - It's a pain in the ass to prepare production-like datasets so you can develop on production-like data, which is something you will do a lot.
    - You run the risk of developing on stale, old backup data for convenience, which can lead to trouble when you push stuff to production.

### Moving data across `dwh` and `dwh_hybrid`

#TODO

## Other stuff

- Bear in mind the docker compose file comes with Postgres server settings that were based on the laptops being used in the team in August 2024. They may be more or less relevant to you. In case of doubt, you might want to use: https://pgtune.leopard.in.ua/.
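If you want to double-check which values your local server actually ended up with before reaching for pgtune, you can read them straight from the running instance. The parameter names below are just common tuning knobs; whether the compose file sets these exact ones is something to verify in the file itself:

```sql
-- Inspect a handful of common tuning parameters on the running local server
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem',
               'effective_cache_size', 'max_parallel_workers');
```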