<!DOCTYPE HTML>
<html>
<head>
<title>Pablo here</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="../styles.css">
</head>
<body>
<main>
<h1>
Hi, Pablo here
</h1>
<p><a href="../index.html">back to home</a></p>
<hr>
<section>
<h2>My tips and tricks when using Postgres as a DWH</h2>
<p>In November 2023, I joined Superhog (now called Truvi) to start its Data team. As part of that, I
also drafted and deployed the first version of its data platform.
</p>
<p>The context led me to choose Postgres for our DWH. In a time of Snowflakes, Bigqueries and Redshifts,
this might surprise some. But I can confidently say Postgres has done a great job for us, and I'd even
dare to say it has provided a better experience than other, more trendy alternatives could have. I'll
jot down my rationale for picking Postgres one of these days.</p>
<p>
Back to the topic: Postgres is not intended to act as a DWH, so using it as such might feel a bit hacky
at times. There are multiple ways to make your life better with it, as well as related tools and
practices that you might enjoy, which I'll try to list here.
</p>
<h3>Use <code>unlogged</code> tables</h3>
<p>The <a href="https://www.postgresql.org/docs/current/wal-intro.html" target="_blank"
rel="noopener noreferrer">Write Ahead Log</a> comes active by default for the tables you create, and
for good reasons. But in the context of an ELT DWH, it is probably a good idea to deactivate it by
making your tables <code>unlogged</code>. <a
href="https://www.crunchydata.com/blog/postgresl-unlogged-tables" target="_blank">Unlogged
tables</a> will provide you with much faster writes (roughly, twice as fast) which will make data
loading and transformation jobs inside your DWH much faster.
</p>
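<p>For reference, here's what that looks like in plain SQL. This is just a sketch; the table and column
    names are made up for illustration:</p>
<pre><code>
-- Create a table that skips the Write-Ahead Log entirely
CREATE UNLOGGED TABLE landing_orders (
    order_id  bigint,
    payload   jsonb,
    loaded_at timestamptz DEFAULT now()
);

-- An existing table can be switched in either direction
ALTER TABLE landing_orders SET LOGGED;
ALTER TABLE landing_orders SET UNLOGGED;
</code></pre>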
<p>You pay a price for this with a few trade-offs, the most notable being that if your Postgres server
crashes, <a href="https://www.postgresql.org/docs/current/sql-createtable.html#SQL-CREATETABLE-UNLOGGED"
target="_blank" rel="noopener noreferrer">the contents of the unlogged tables will be lost</a>. But,
again, if you have an ELT DWH, you can survive by running a backfill. At Truvi, we made the decision to
have the landing area for our DWH be logged, and everything else unlogged. This means if we experienced
a crash (which still hasn't happened, btw), we would recover by running a full-refresh dbt run.</p>
<p>If you are using dbt, you can easily apply this by adding this bit in your <code>dbt_project.yml</code>
:</p>
<pre><code>
models:
  +unlogged: true
</code></pre>
<h3>Tuning your server's parameters</h3>
<p><a href="https://www.postgresql.org/docs/current/runtime-config.html" target="_blank"
rel="noopener noreferrer">Postgres has many parameters you can fiddle with</a>, with plenty of
chances to either improve or destroy your server's performance.</p>
<p>Postgres ships with default values for them, which are almost surely not the optimal ones for
your needs, <em>especially</em> if you are going to use it as a DWH. Simple changes like adjusting the
<code>work_mem</code> will do wonders to speed up some of your heavier queries.
</p>
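<p>As an illustration, this is how you could raise <code>work_mem</code> for a single session or persist
    it server-wide; the value here is an arbitrary example, size it to your own RAM and concurrency:</p>
<pre><code>
-- For the current session only
SET work_mem = '256MB';

-- Or persist it (written to postgresql.auto.conf) and reload the config
ALTER SYSTEM SET work_mem = '256MB';
SELECT pg_reload_conf();
</code></pre>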
<p>There are many parameters to get familiar with and proper adjustment must be done taking your specific
context and needs into account. If you have no clue at all, <a href="https://pgtune.leopard.in.ua"
target="_blank" rel="noopener noreferrer">this little web app</a> can give you some suggestions you
can start from.
</p>
<h3>Running <code>VACUUM ANALYZE</code> right after building your tables</h3>
<p>Out of the box, Postgres will run
<code><a href="https://www.postgresql.org/docs/current/sql-vacuum.html" target="_blank" rel="noopener noreferrer">VACUUM</a></code>
and
<code><a href="https://www.postgresql.org/docs/current/sql-analyze.html" target="_blank" rel="noopener noreferrer">ANALYZE</a></code>
jobs <a href="https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM" target="_blank"
rel="noopener noreferrer">automatically</a>. The triggers that determine when each of those gets
triggered can be adjusted with a few server parameters. If you follow an ELT pattern, most surely
re-building your non-staging tables will cause Postgres to run them.
</p>
<p>But there's a detail that is easy to overlook: those automatic triggers fire quite fast, but not
right after you build each table. This poses a performance issue: if the intermediate sections of your
DWH have tables that build upon other tables, rebuilding a table and then rebuilding a dependent table
before the first one has been <code>ANALYZE</code>-d might hurt you.</p>
<p>Let me describe this with an example, because this one is a bit of a tongue twister: let's assume we have
tables <code>int_orders</code> and <code>int_order_kpis</code>. <code>int_orders</code> holds all of our
orders, and <code>int_order_kpis</code> derives some KPIs from them. Naturally, first you will
materialize <code>int_orders</code> from some upstream staging tables, and once that is complete, you
will use its contents to build <code>int_order_kpis</code>.
</p>
<p>
Having <code>int_orders</code> <code>ANALYZE</code>-d before you start building
<code>int_order_kpis</code> is highly beneficial for your performance in building
<code>int_order_kpis</code>. Why? Because having perfectly updated statistics and metadata on
<code>int_orders</code> will help Postgres' query optimizer better plan the necessary query to
materialize <code>int_order_kpis</code>. This can improve performance by orders of magnitude in some
queries by allowing Postgres to pick the right kind of join strategy for the specific data you have, for
example.
</p>
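<p>As a minimal sketch of that dependency chain (the columns and the upstream <code>stg_orders</code>
    table are made up for the example):</p>
<pre><code>
-- Step 1: materialize int_orders from staging
CREATE UNLOGGED TABLE int_orders AS
SELECT order_id, customer_id, amount, created_at
FROM stg_orders;

-- Step 2: build int_order_kpis on top of it. This step benefits hugely
-- from int_orders having been ANALYZE-d in between.
CREATE UNLOGGED TABLE int_order_kpis AS
SELECT customer_id,
       count(*)    AS order_count,
       sum(amount) AS total_amount
FROM int_orders
GROUP BY customer_id;
</code></pre>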
<p>Now, will Postgres auto <code>VACUUM ANALYZE</code> the freshly built <code>int_orders</code> before you
start building <code>int_order_kpis</code>? Hard to tell. It depends on how you build your DWH, and how
you've tuned your server's parameters. And the most dangerous bit is you're not in full control: it can
be that <em>sometimes</em> it happens, and other times it doesn't. Flaky and annoying. Some day I'll
write a post on how this behaviour drove me mad for two months because it made a model sometimes build
in a few seconds, and other times take more than 20 minutes.
</p>
<p>
My advice is to make sure you always <code>VACUUM ANALYZE</code> right after building your tables. If
you're using dbt, you can easily achieve this by adding this to your project's
<code>dbt_project.yml</code>:
</p>
<pre><code>
models:
  +post-hook:
    sql: "VACUUM ANALYZE {{ this }}"
    transaction: false
    # ^ This makes dbt run a VACUUM ANALYZE on each model right after building it.
    # It's pointless for views, but that doesn't matter because Postgres fails
    # silently without raising an unhandled exception.
</code></pre>
<h3>Monitor queries with <code>pg_stat_statements</code></h3>
<p><a href="https://www.postgresql.org/docs/current/pgstatstatements.html" target="_blank"
rel="noopener noreferrer">pg_stats_statements</a> is an extension that nowadays ships with Postgres
by default. If activated, it will log info on the queries executed in the server which you can check
afterward. This includes many details, with how frequently does the query get called and what's the min,
max and mean execution time being the ones you probably care about the most. Looking at those allows you
to find queries that take long each time they run, and queries that get run a lot.
</p>
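<p>Enabling it requires adding <code>pg_stat_statements</code> to <code>shared_preload_libraries</code>
    and restarting the server; after that, a query along these lines (column names as of Postgres 13 and
    later) surfaces the heavy hitters:</p>
<pre><code>
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Slowest statements by mean execution time
SELECT calls,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       round(min_exec_time::numeric, 2)  AS min_ms,
       round(max_exec_time::numeric, 2)  AS max_ms,
       query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
</code></pre>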
<p>Another important piece of info that gets recorded is <em>who</em> ran the query. This is helpful
because, if you use users in a smart way, it can help you isolate expensive queries across different use
cases or areas. For example, if you use different users to build the DWH and to give your BI tool read
access (you do that... right?), you can easily tell apart dashboard related queries from internal, DWH
transformation ones. Another example could be internal reporting vs embedded analytics in your product:
you might have stricter performance SLAs for product-embedded, customer-facing queries than for internal
dashboards. Using different users and <code>pg_stat_statements</code> makes it possible for you to
dissect performance issues on those separate areas independently.</p>
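<p>A rough per-role breakdown, for example, could look like this (just a sketch on top of
    <code>pg_stat_statements</code>):</p>
<pre><code>
-- Total execution time and call count per role, to tell BI-tool traffic
-- apart from DWH transformation jobs
SELECT r.rolname,
       sum(s.calls)                              AS calls,
       round(sum(s.total_exec_time)::numeric, 0) AS total_ms
FROM pg_stat_statements s
JOIN pg_roles r ON r.oid = s.userid
GROUP BY r.rolname
ORDER BY total_ms DESC;
</code></pre>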
<h3>Dalibo's wonderful execution plan visualizer</h3>
<p>Sometimes you'll have some nasty query you just need to sit down with and optimize. In my experience, in
a DWH this ends up happening with queries that involve many large tables in sequential joining and
aggregation steps (as in, you join a few tables, group to some granularity, join some more, group again,
etc).
</p>
<p>You can get the query's real execution details with <code>EXPLAIN ANALYZE</code>, but the output's
readability is on par with Morse-encoded regex patterns. I always had headaches dealing with them until
I came across <a href="https://dalibo.com/" target="_blank" rel="noopener noreferrer">Dalibo</a>'s <a
href="https://explain.dalibo.com/" target="_blank" rel="noopener noreferrer">execution plan
visualizer</a>. You can paste the output of <code>EXPLAIN ANALYZE</code> there and see the query
execution presented as a diagram. No amount of words will portray accurately how awesome the UX is, so
I encourage you to try the tool with some nasty query and see for yourself.</p>
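<p>The tool only needs the raw plan output, so something like this (reusing the made-up
    <code>int_orders</code> columns from the earlier sketch) is all you have to copy over:</p>
<pre><code>
-- BUFFERS adds I/O details, which the visualizer renders nicely too
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, sum(amount) AS total_amount
FROM int_orders
GROUP BY customer_id;
</code></pre>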
<h3>Local dev env + Foreign Data Wrapper</h3>
<p>One of the awesome things about using Postgres is how trivial it is to spin up an instance. This makes
goofing around much simpler than when setting up a new instance means paperwork, $$$, etc.</p>
<p>Data team members at Truvi have a dockerized Postgres running on their laptops that they can use when
they are developing on our DWH dbt project. In the early days, you could grab a production dump with some
subset of tables from our staging layer and, if you were patient, run significant chunks of our dbt DAG
on your laptop.</p>
<p>A few hundred models later, this became increasingly difficult and finally impossible.
</p>
<p>Luckily, we came across Postgres' <a
href="https://www.postgresql.org/docs/current/postgres-fdw.html">Foreign Data Wrapper</a>. There's
quite a bit to it, but to keep it short here, just be aware that FDW allows you to make a Postgres
server give access to some table in a different Postgres server while pretending they are local. So, you
query table X in Postgres server A, even though table X is actually stored in Postgres server B. But
your query works just the same as if it was a local genuine table.</p>
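<p>The basic setup with <code>postgres_fdw</code> looks roughly like this; host, schema and role names
    are made up for the example:</p>
<pre><code>
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Declare the remote server and how to authenticate against it
CREATE SERVER prod_dwh
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'prod-dwh.internal', port '5432', dbname 'dwh');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER prod_dwh
    OPTIONS (user 'readonly_dev', password 'secret');

-- Expose the remote staging schema locally as foreign tables
CREATE SCHEMA IF NOT EXISTS staging_remote;
IMPORT FOREIGN SCHEMA staging
    FROM SERVER prod_dwh
    INTO staging_remote;
</code></pre>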
<p>Setting this up is fairly trivial, and it has allowed our dbt project contributors to execute hybrid
dbt runs where some data and tables are local to their laptop, while some upstream data is read from the
production server. The approach has been great so far, enabling them to conveniently test models before
committing them to master.</p>
<hr>
<p><a href="../index.html">back to home</a></p>
</section>
</main>
</body>
</html>