diff --git a/dev-env/local_dwh.md b/dev-env/local_dwh.md index 13e9237..137b5c6 100644 --- a/dev-env/local_dwh.md +++ b/dev-env/local_dwh.md @@ -138,6 +138,8 @@ The setup will consist on preparing: - Once you've done this, you can check that the foreign tables are successfully set up by traveling to one of the schemas you've added in the last stored procedure call, looking for its foreign tables and throwing a (small) query at one of them. If the query returns data, it means your foreign connection is properly set up. +- An optional (but strongly recommended) step here is to change the fetch size parameter of the FDW. This reduces the number of network round trips between the two Postgres servers, because it controls how many rows are fetched from the remote server in one go. A higher fetch size means fewer round trips and typically faster queries; 100,000 rows is a good starting point. You can set the value by running this command: `ALTER SERVER dwh_prd OPTIONS (fetch_size '100000');`. Note that this syntax only works the first time you set the value. If you ever need to change it again after using that command, you should instead use: `ALTER SERVER dwh_prd OPTIONS (SET fetch_size '100000');`. You can check which options are currently set with `SELECT srvoptions FROM pg_foreign_server WHERE srvname = 'dwh_prd';`. + - Next, you probably want to add these databases to your `~/.dbt/profiles.yml` file. - You can check the example file `profiles.yml.example` at the root of this repo. - You should make two entries: one for the `dwh` database and another one for `dwh_hybrid`. @@ -177,7 +179,7 @@ You might be wondering why we have set up two different databases in your docker - You are hitting the production server when developing. This is both annoying for you (because if someone is using the production DWH heavily, your queries will suffer) and for other users (because if you are using the production DWH heavily, the queries from the other users will suffer). Try to be mindful and don't spend all day running `SELECT *` from large tables just because you can.
- You can only work with perfect production data in `sync_`. You can't manually manipulate the data that appears in the `sync_` layer, to do stuff like test values that are not in production or reduce the size of data to something manageable. - `dwh` - - How it works: this is a silly empty database. It expects you to add any data manually. This means that you will have to build the `sync_` layer manually (by getting dumps from production, by mocking data, etc.) + - How it works: this is a silly empty database. It expects you to add any data manually. This means that you will have to build the `sync_` layer manually (by getting dumps from production, by mocking data, by using the secondary FDW we explain in the next section of this page, etc.) - pros: - You have full flexibility. You can prepare the `sync_` layer in any way you want. - You don't depend on production being alive to be able to run queries towards `sync_` schemas. @@ -186,9 +188,122 @@ You might be wondering why we have set up two different databases in your docker - It's a pain in the ass to prepare production-like datasets to develop on production-like data, which is something you will do a lot. - You run the risk of developing on stale, old backup data for convenience, which can lead to trouble when you push stuff to production. -### Moving data across `dwh` and `dwh_hybrid` +### Filling `dwh` with copied data -#TODO +Given the previous discussion, you may sometimes want to work fully locally in the `dwh` database with copies of data instead of references a la `dwh_hybrid`. + +There are many ways to populate the data. Historically, we would take a dump of what we needed from the production database and restore it in the local Postgres. That's still a valid option. + +You can also leverage the same Foreign Data Wrapper extension we saw in earlier sections to copy data into your local `dwh` instead of just referencing it like we do in `dwh_hybrid`.
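+For the dump-and-restore route, a minimal sketch could look like the following (the production host, users, and file name are assumptions; adapt them to your environment):

```shell
# Dump only the schema you need from the production DWH
# (hypothetical host, user and database names)
pg_dump --host prod-dwh.example.com --username readonly_user --dbname dwh \
        --schema sync_core --format custom --file sync_core.dump

# Restore it into the local dwh database
pg_restore --host localhost --port 5432 --username postgres --dbname dwh \
        --no-owner sync_core.dump
```

Using `--format custom` lets `pg_restore` restore selectively later (for example, a single table via `--table`), and `--no-owner` avoids errors when production roles don't exist locally.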
The FDW route requires a bit of setup, but once you have done that, it is extremely convenient and fast. The rest of this section details how to set it up. + +To set things up, we will create procedures similar, but not identical, to the ones for `dwh_hybrid`. Begin by connecting to `dwh` and running the following code to create two stored procedures: + +```sql +CREATE OR REPLACE PROCEDURE setup_fdw_for_el( + IN server_name TEXT, -- Foreign server name + IN host TEXT, -- Foreign server host address + IN port TEXT, -- Foreign server port + IN dbname TEXT, -- Foreign database name + IN remote_user TEXT, -- Remote database user + IN remote_password TEXT, -- Remote database password + IN schema_list TEXT[] -- List of schemas to import +) +LANGUAGE plpgsql +AS $$ +DECLARE + el_schema TEXT; -- Variable to loop through schema list +BEGIN + -- Step 1: Create the extension if it doesn't exist + CREATE EXTENSION IF NOT EXISTS postgres_fdw; + + -- Step 2: Create the foreign server if it doesn't already exist + IF NOT EXISTS (SELECT 1 FROM pg_foreign_server WHERE srvname = server_name) THEN + EXECUTE format('CREATE SERVER %I FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host %L, port %L, dbname %L)', + server_name, host, port, dbname); + END IF; + + -- Step 3: Create user mapping for the specified remote user + IF NOT EXISTS (SELECT * FROM pg_user_mapping um LEFT JOIN pg_foreign_server fsev ON fsev.oid = um.umserver WHERE fsev.srvname = server_name) THEN + EXECUTE format('CREATE USER MAPPING FOR CURRENT_USER SERVER %I OPTIONS (user %L, password %L)', + server_name, remote_user, remote_password); + END IF; + + -- Step 4: Grant USAGE on the foreign server to the current user + EXECUTE format('GRANT USAGE ON FOREIGN SERVER %I TO %I', server_name, current_user); + + -- Step 5: Loop through each schema in the schema list + FOREACH el_schema IN ARRAY schema_list + LOOP + -- Step 5.1: Create the schema
if it doesn't exist + IF NOT EXISTS (SELECT 1 FROM information_schema.schemata WHERE schema_name = 'el_' || el_schema) THEN + EXECUTE format('CREATE SCHEMA IF NOT EXISTS el_%I', el_schema); + END IF; + + -- Step 5.2: Import the foreign schema into the local schema + EXECUTE format('IMPORT FOREIGN SCHEMA %I FROM SERVER %I INTO el_%I', + el_schema, server_name, el_schema); + END LOOP; + +END; +$$; + +CREATE OR REPLACE PROCEDURE refresh_fdw_el_schemas( + IN server_name TEXT, -- Foreign server name + IN schema_list TEXT[] -- List of schemas to refresh +) +LANGUAGE plpgsql +AS $$ +DECLARE + el_schema TEXT; +BEGIN + -- Step 1: Loop through each schema in the schema list + FOREACH el_schema IN ARRAY schema_list + LOOP + -- Step 1.1: Drop the schema if it exists + IF EXISTS (SELECT 1 FROM information_schema.schemata WHERE schema_name = 'el_' || el_schema) THEN + EXECUTE format('DROP SCHEMA el_%I CASCADE', el_schema); + END IF; + + EXECUTE format('CREATE SCHEMA IF NOT EXISTS el_%I', el_schema); + + -- Step 1.2: Re-import the foreign schema from the server + EXECUTE format('IMPORT FOREIGN SCHEMA %I FROM SERVER %I INTO el_%I', + el_schema, server_name, el_schema); + END LOOP; + +END; +$$; +``` + +Once you've done that, execute a `CALL` to `setup_fdw_for_el`. It's the exact same story as with `dwh_hybrid`, except that the created foreign schemas will be prefixed with `el_` (Extract, Load) so that their names don't clash with the regular sync schemas. Again, just like with `dwh_hybrid`, any time you need to refresh the schemas, you can call `refresh_fdw_el_schemas`. + +Once this FDW is set up, you can use it to copy data into the `dwh` sync schemas. For example, imagine I want to get the `Accommodation` table from `sync_core`.
I can create the `sync_core` schema manually (`CREATE SCHEMA sync_core;`) and then use our `el_sync_core` FDW to copy the data over with a query like this: + +```sql +CREATE TABLE sync_core."Accommodation" AS ( + SELECT * + FROM el_sync_core."Accommodation" a +); +``` + +This creates the table and copies all the data over. If the table already exists and you only want to insert the rows, you can simply switch to an `INSERT INTO ... SELECT` statement: + +```sql +INSERT INTO sync_core."Accommodation" +SELECT * +FROM el_sync_core."Accommodation" a; +``` + +These examples copy the entire table over the FDW, but here you have more flexibility than in `dwh_hybrid`: you can bring over just a subset of the data by filtering in the query. + +```sql +INSERT INTO sync_core."Accommodation" +SELECT * +FROM el_sync_core."Accommodation" a +WHERE a."CreatedByUserId" = 'only-the-host-i-care-about-in-dev'; +``` ## Other stuff