<!DOCTYPE HTML>
<html>

<head>
<title>Pablo here</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="../styles.css">
</head>


<body>
<main>
<h1>
Hi, Pablo here
</h1>
<p><a href="../index.html">back to home</a></p>
<hr>
<section>
<h2>My tips and tricks when using Postgres as a DWH</h2>
<p>In November 2023, I joined Superhog (now called Truvi) to start the Data team. As part of that, I
also drafted and deployed the first version of its data platform.
</p>
<p>The context led me to choose Postgres for our DWH. In a time of Snowflakes, BigQueries and Redshifts,
this might surprise some. But I can confidently say Postgres has done a great job for us, and I even
dare to say it has provided a better experience than other, more trendy alternatives could have. I'll
jot down my rationale for picking Postgres one of these days.</p>
<p>
Back to the topic: Postgres is not intended to act as a DWH, so using it as such might feel a bit hacky
at times. There are multiple ways to make your life better with it, as well as related tools and
practices that you might enjoy, which I'll try to list here.
</p>
<h3>Use <code>unlogged</code> tables</h3>
<p>The <a href="https://www.postgresql.org/docs/current/wal-intro.html" target="_blank"
rel="noopener noreferrer">Write Ahead Log</a> comes active by default for the tables you create, and
for good reasons. But in the context of an ELT DWH, it is probably a good idea to deactivate it by
making your tables <code>unlogged</code>. <a
href="https://www.crunchydata.com/blog/postgresl-unlogged-tables" target="_blank">Unlogged
tables</a> will provide you with much faster writes (roughly, twice as fast), which will make data
loading and transformation jobs inside your DWH much faster.
</p>
<p>You pay for this with a few trade-offs, the most notable being that if your Postgres server
crashes, <a href="https://www.postgresql.org/docs/current/sql-createtable.html#SQL-CREATETABLE-UNLOGGED"
target="_blank" rel="noopener noreferrer">the contents of the unlogged tables will be lost</a>. But,
again, if you have an ELT DWH, you can survive that by running a backfill. At Truvi, we made the decision to
have the landing area for our DWH be logged, and everything else unlogged. This means that if we experienced
a crash (which still hasn't happened, btw), we would recover by running a full-refresh dbt run.</p>
<p>If you are using dbt, you can easily apply this by adding this bit to your <code>dbt_project.yml</code>:</p>
<pre><code>
models:
  +unlogged: true
</code></pre>
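<p>If you are not using dbt (or just want to flip a single table by hand), the same can be done directly
in SQL. A minimal sketch, with placeholder table and column names:</p>
<pre><code>
-- Create a table without WAL logging from the start:
CREATE UNLOGGED TABLE stg_orders (
    order_id   bigint,
    amount     numeric,
    created_at timestamptz
);

-- Or switch an existing table (note that this rewrites the table):
ALTER TABLE stg_orders SET UNLOGGED;
</code></pre>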
<h3>Tuning your server's parameters</h3>
<p><a href="https://www.postgresql.org/docs/current/runtime-config.html" target="_blank"
rel="noopener noreferrer">Postgres has many parameters you can fiddle with</a>, with plenty of
chances to either improve or destroy your server's performance.</p>
<p>Postgres ships with default values for them, which are almost surely not the optimal ones for
your needs, <em>especially</em> if you are going to use it as a DWH. Simple changes like adjusting
<code>work_mem</code> will do wonders to speed up some of your heavier queries.
</p>
<p>There are many parameters to get familiar with, and proper adjustment must be done taking your specific
context and needs into account. If you have no clue at all, <a href="https://pgtune.leopard.in.ua"
target="_blank" rel="noopener noreferrer">this little web app</a> can give you some suggestions you
can start from.
</p>
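<p>As an illustration, this is how you could bump <code>work_mem</code>. The values here are placeholders,
not recommendations; pick them based on your RAM and expected concurrency:</p>
<pre><code>
-- For the current session only (handy for a one-off heavy query):
SET work_mem = '256MB';

-- Or persist a value server-wide (requires superuser, then reload the config):
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();
</code></pre>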
<h3>Running <code>VACUUM ANALYZE</code> right after building your tables</h3>
<p>Out of the box, Postgres will
<a href="https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM" target="_blank"
rel="noopener noreferrer">automatically</a> run
<code><a href="https://www.postgresql.org/docs/current/sql-vacuum.html" target="_blank" rel="noopener noreferrer">VACUUM</a></code>
and
<code><a href="https://www.postgresql.org/docs/current/sql-analyze.html" target="_blank" rel="noopener noreferrer">ANALYZE</a></code>
jobs. The thresholds that determine when each of those gets triggered can be adjusted with a few server
parameters. If you follow an ELT pattern, re-building your non-staging tables will most surely cause
Postgres to run them.
</p>
<p>But there's a detail that is easy to overlook: Postgres will start those jobs quite fast,
but not right after you build each table. This poses a performance issue: if the intermediate sections
of your DWH have tables that build upon other tables, rebuilding a table and then rebuilding a dependent
one without an <code>ANALYZE</code> on the first can hurt you.</p>
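<p>If you want autovacuum to kick in sooner, the relevant thresholds can be tuned globally or per table.
A sketch with placeholder values and a placeholder table name, just to show where the knobs live:</p>
<pre><code>
-- Analyze a table once roughly 2% of its rows have changed (the default is 10%):
ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.02;
SELECT pg_reload_conf();

-- Or tune a single table instead of the whole server:
ALTER TABLE my_table SET (autovacuum_analyze_scale_factor = 0.02);
</code></pre>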
<p>Let me describe this with an example, because this one is a bit of a tongue twister: let's assume we have
tables <code>int_orders</code> and <code>int_order_kpis</code>. <code>int_orders</code> holds all of our
orders, and <code>int_order_kpis</code> derives some KPIs from them. Naturally, first you will
materialize <code>int_orders</code> from some upstream staging tables, and once that is complete, you
will use its contents to build <code>int_order_kpis</code>.
</p>
<p>
Having <code>int_orders</code> <code>ANALYZE</code>-d before you start building
<code>int_order_kpis</code> is highly beneficial for the performance of building
<code>int_order_kpis</code>. Why? Because having perfectly updated statistics and metadata on
<code>int_orders</code> will help Postgres' query optimizer better plan the query needed to
materialize <code>int_order_kpis</code>. This can improve performance by orders of magnitude in some
queries, for example by allowing Postgres to pick the right kind of join strategy for the specific data
you have.
</p>
<p>Now, will Postgres auto <code>VACUUM ANALYZE</code> the freshly built <code>int_orders</code> before you
start building <code>int_order_kpis</code>? Hard to tell. It depends on how you build your DWH and how
you've tuned your server's parameters. And the most dangerous bit is that you're not in full control: it can
be that <em>sometimes</em> it happens, and other times it doesn't. Flaky and annoying. Some day I'll
write a post on how this behaviour drove me mad for two months because it made a model sometimes build
in a few seconds, and other times in >20min.
</p>
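<p>If you want to check whether autoanalyze has already visited a freshly built table, the
<code>pg_stat_user_tables</code> view records the relevant timestamps. A quick sketch, using the table
from the example above:</p>
<pre><code>
-- When was this table last analyzed, manually or by autovacuum?
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'int_orders';
</code></pre>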
<p>
My advice is to make sure you always <code>VACUUM ANALYZE</code> right after building your tables. If
you're using dbt, you can easily achieve this by adding this to your project's
<code>dbt_project.yml</code>:
</p>
<pre><code>
models:
  +post-hook:
    sql: "VACUUM ANALYZE {{ this }}"
    transaction: false
    # ^ This makes dbt run a VACUUM ANALYZE on the models after building each.
    # It's pointless for views, but it doesn't matter because Postgres fails
    # silently without raising an unhandled exception.
</code></pre>
<h3>Monitor queries with <code>pg_stat_statements</code></h3>
<p><a href="https://www.postgresql.org/docs/current/pgstatstatements.html" target="_blank"
rel="noopener noreferrer">pg_stat_statements</a> is an extension that nowadays ships with Postgres
by default. If activated, it will log info on the queries executed in the server, which you can check
afterwards. This includes many details; how frequently each query gets called and its min, max and mean
execution times are the ones you probably care about the most. Looking at those allows you
to find queries that take long each time they run, and queries that get run a lot.
</p>
<p>Another important piece of info that gets recorded is <em>who</em> ran the query. This is helpful
because, if you use users in a smart way, it can help you isolate expensive queries in different use
cases or areas. For example, if you use different users to build the DWH and to give your BI tool read
access (you do that... right?), you can easily tell apart dashboard-related queries from internal, DWH
transformation ones. Another example could be internal reporting vs embedded analytics in your product:
you might have stricter performance SLAs for product-embedded, customer-facing queries than for internal
dashboards. Using different users and <code>pg_stat_statements</code> makes it possible for you to
dissect performance issues in those separate areas independently.</p>
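<p>To give an idea of what this looks like in practice, here is a sketch of activating the extension and
pulling the slowest queries by mean execution time. The column names assume Postgres 13 or newer:</p>
<pre><code>
-- Requires shared_preload_libraries = 'pg_stat_statements' in postgresql.conf
-- and a server restart; then, in the database you want to monitor:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Slowest statements on average, with who ran them and how often:
SELECT userid::regrole AS "user",
       calls,
       round(mean_exec_time::numeric, 1) AS mean_ms,
       round(max_exec_time::numeric, 1)  AS max_ms,
       left(query, 80)                   AS query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
</code></pre>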
<h3>Dalibo's wonderful execution plan visualizer</h3>
<p>Sometimes you'll have some nasty query you just need to sit down with and optimize. In my experience, in
a DWH this ends up happening with queries that involve many large tables in sequential joining and
aggregation steps (as in, you join a few tables, group to some granularity, join some more, group again,
etc).
</p>
<p>You can get the query's real execution details with <code>EXPLAIN ANALYZE</code>, but the output's
readability is on par with Morse-encoded regex patterns. I always had headaches dealing with them until
I came across <a href="https://dalibo.com/" target="_blank" rel="noopener noreferrer">Dalibo</a>'s <a
href="https://explain.dalibo.com/" target="_blank" rel="noopener noreferrer">execution plan
visualizer</a>. You can paste the output of <code>EXPLAIN ANALYZE</code> there and see the query
execution presented as a diagram. No amount of words will accurately portray how awesome the UX is, so
I encourage you to try the tool with some nasty query and see for yourself.</p>
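<p>In case it helps, this is the kind of statement whose output you would paste into the visualizer. The
query itself is just a made-up example, and the extra options are optional:</p>
<pre><code>
-- ANALYZE executes the query for real; BUFFERS adds I/O details.
-- Both the plain-text and the JSON output can be pasted into explain.dalibo.com.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.customer_id, count(*) AS n_orders
FROM int_orders AS o
GROUP BY o.customer_id;
</code></pre>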
<h3>Local dev env + Foreign Data Wrapper</h3>
<p>One of the awesome things about using Postgres is how trivial it is to spin up an instance. This makes
goofing around much simpler than when setting up a new instance means paperwork, $$$, etc.</p>
<p>Data team members at Truvi have a dockerized Postgres running on their laptops that they can use when
they are developing on our DWH dbt project. In the early days, you could grab a production dump with
some subset of tables from our staging layer and run significant chunks of our dbt DAG on your laptop if
you were patient.</p>
<p>A few hundred models later, this became increasingly difficult and finally impossible.
</p>
<p>Luckily, we came across Postgres' <a
href="https://www.postgresql.org/docs/current/postgres-fdw.html">Foreign Data Wrapper</a>. There's
quite a bit to it, but to keep it short here, just be aware that FDW allows you to make a Postgres
server give access to a table in a different Postgres server while pretending it is local. So, you
query table X in Postgres server A, even though table X is actually stored in Postgres server B. But
your query works just the same as if it were a genuine local table.</p>
<p>Setting this up is fairly trivial, and it has allowed our dbt project contributors to execute
hybrid dbt runs where some data and tables are local to their laptop, while some upstream data is
read from the production server. The approach has been great so far, enabling them to actually test models
before committing them to master in a convenient way.</p>
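<p>For reference, the basic setup looks something like this. The host, credentials and schema names are
placeholders for whatever your remote server uses:</p>
<pre><code>
-- On the local (laptop) server:
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Point at the remote server holding the real data:
CREATE SERVER prod_dwh
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'dwh.example.com', port '5432', dbname 'dwh');

-- Map the local user to a (read-only) remote user:
CREATE USER MAPPING FOR CURRENT_USER
    SERVER prod_dwh
    OPTIONS (user 'readonly_dev', password 'secret');

-- Expose a remote schema as local foreign tables:
CREATE SCHEMA IF NOT EXISTS staging;
IMPORT FOREIGN SCHEMA staging
    FROM SERVER prod_dwh
    INTO staging;
</code></pre>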
<hr>
<p><a href="../index.html">back to home</a></p>
</section>
</main>

</body>

</html>