<!DOCTYPE HTML>
<html>

<head>
<title>Pablo here</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="../styles.css">
</head>


<body>
<main>
<h1>
Hi, Pablo here
</h1>
<p><a href="../index.html">back to home</a></p>
<hr>
<section>
<h2>My tips and tricks when using Postgres as a DWH</h2>
<p>In November 2023, I joined Superhog (now called Truvi) to start the Data team. As part of that, I
also drafted and deployed the first version of its data platform.
</p>
<p>The context led me to choose Postgres for our DWH. In a time of Snowflakes, BigQueries and Redshifts,
this might surprise some. But I can confidently say Postgres has done a great job for us, and I even
dare to say it has provided a better experience than other, more trendy alternatives could have. I'll
jot down my rationale for picking Postgres one of these days.</p>
<p>
Back to the topic: Postgres is not intended to act as a DWH, so using it as such might feel a bit hacky
at times. There are multiple ways to make your life better with it, as well as related tools and
practices that you might enjoy, which I'll try to list here.
</p>
<h3>Use <code>unlogged</code> tables</h3>
<p>The <a href="https://www.postgresql.org/docs/current/wal-intro.html" target="_blank"
rel="noopener noreferrer">Write Ahead Log</a> comes active by default for the tables you create, and
for good reasons. But in the context of an ELT DWH, it is probably a good idea to deactivate it by
making your tables <code>unlogged</code>. <a
href="https://www.crunchydata.com/blog/postgresl-unlogged-tables" target="_blank">Unlogged
tables</a> will provide you with much faster writes (roughly, twice as fast), which will make data
loading and transformation jobs inside your DWH much faster.
</p>
<p>You pay for this with a few trade-offs, the most notable being that if your Postgres server
crashes, <a href="https://www.postgresql.org/docs/current/sql-createtable.html#SQL-CREATETABLE-UNLOGGED"
target="_blank" rel="noopener noreferrer">the contents of the unlogged tables will be lost</a>. But,
again, if you have an ELT DWH, you can survive that by running a backfill. At Truvi, we made the decision to
have the landing area for our DWH be logged, and everything else unlogged. This means that if we experienced
a crash (which still hasn't happened, btw), we would recover by running a full-refresh dbt run.</p>
<p>If you are using dbt, you can easily apply this by adding this bit to your <code>dbt_project.yml</code>:</p>
<pre><code>
models:
  +unlogged: true
</code></pre>
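<p>If you are not using dbt (or just want to flip a single table by hand), the same can be done directly
in SQL. A minimal sketch, with placeholder table and column names:</p>
<pre><code>
-- Create a table without WAL logging from the start:
CREATE UNLOGGED TABLE stg_orders (
    order_id   bigint,
    amount     numeric,
    created_at timestamptz
);

-- Or switch an existing table (note that this rewrites the table):
ALTER TABLE stg_orders SET UNLOGGED;
</code></pre>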
<h3>Tuning your server's parameters</h3>
<p><a href="https://www.postgresql.org/docs/current/runtime-config.html" target="_blank"
rel="noopener noreferrer">Postgres has many parameters you can fiddle with</a>, with plenty of
chances to either improve or destroy your server's performance.</p>
<p>Postgres ships with default values for them, which are almost surely not the optimal ones for
your needs, <em>especially</em> if you are going to use it as a DWH. Simple changes like adjusting
<code>work_mem</code> will do wonders to speed up some of your heavier queries.
</p>
<p>There are many parameters to get familiar with, and proper adjustment must be done taking your specific
context and needs into account. If you have no clue at all, <a href="https://pgtune.leopard.in.ua"
target="_blank" rel="noopener noreferrer">this little web app</a> can give you some suggestions you
can start from.
</p>
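<p>As an illustration, this is how you could bump <code>work_mem</code>. The values here are placeholders,
not recommendations; pick them based on your RAM and expected concurrency:</p>
<pre><code>
-- For the current session only (handy for a one-off heavy query):
SET work_mem = '256MB';

-- Or persist a value server-wide (requires superuser, then reload the config):
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();
</code></pre>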
<h3>Running <code>VACUUM ANALYZE</code> right after building your tables</h3>
<p>Out of the box, Postgres will
<a href="https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM" target="_blank"
rel="noopener noreferrer">automatically</a> run
<code><a href="https://www.postgresql.org/docs/current/sql-vacuum.html" target="_blank" rel="noopener noreferrer">VACUUM</a></code>
and
<code><a href="https://www.postgresql.org/docs/current/sql-analyze.html" target="_blank" rel="noopener noreferrer">ANALYZE</a></code>
jobs. The thresholds that determine when each of those gets triggered can be adjusted with a few server
parameters. If you follow an ELT pattern, re-building your non-staging tables will most surely cause
Postgres to run them.
</p>
<p>But there's a detail that is easy to overlook: Postgres will start those jobs quite fast,
but not right after you build each table. This poses a performance issue: if the intermediate sections
of your DWH have tables that build upon other tables, rebuilding a table and then rebuilding a dependent
one without an <code>ANALYZE</code> on the first can hurt you.</p>
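<p>If you want autovacuum to kick in sooner, the relevant thresholds can be tuned globally or per table.
A sketch with placeholder values and a placeholder table name, just to show where the knobs live:</p>
<pre><code>
-- Analyze a table once roughly 2% of its rows have changed (the default is 10%):
ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.02;
SELECT pg_reload_conf();

-- Or tune a single table instead of the whole server:
ALTER TABLE my_table SET (autovacuum_analyze_scale_factor = 0.02);
</code></pre>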
<p>Let me describe this with an example, because this one is a bit of a tongue twister: let's assume we have
tables <code>int_orders</code> and <code>int_order_kpis</code>. <code>int_orders</code> holds all of our
orders, and <code>int_order_kpis</code> derives some KPIs from them. Naturally, first you will
materialize <code>int_orders</code> from some upstream staging tables, and once that is complete, you
will use its contents to build <code>int_order_kpis</code>.
</p>
<p>
Having <code>int_orders</code> <code>ANALYZE</code>-d before you start building
<code>int_order_kpis</code> is highly beneficial for the performance of building
<code>int_order_kpis</code>. Why? Because having perfectly updated statistics and metadata on
<code>int_orders</code> will help Postgres' query optimizer better plan the query needed to
materialize <code>int_order_kpis</code>. This can improve performance by orders of magnitude in some
queries, for example by allowing Postgres to pick the right kind of join strategy for the specific data
you have.
</p>
<p>Now, will Postgres auto <code>VACUUM ANALYZE</code> the freshly built <code>int_orders</code> before you
start building <code>int_order_kpis</code>? Hard to tell. It depends on how you build your DWH and how
you've tuned your server's parameters. And the most dangerous bit is that you're not in full control: it can
be that <em>sometimes</em> it happens, and other times it doesn't. Flaky and annoying. Some day I'll
write a post on how this behaviour drove me mad for two months because it made a model sometimes build
in a few seconds, and other times in >20min.
</p>
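<p>If you want to check whether autoanalyze has already visited a freshly built table, the
<code>pg_stat_user_tables</code> view records the relevant timestamps. A quick sketch, using the table
from the example above:</p>
<pre><code>
-- When was this table last analyzed, manually or by autovacuum?
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'int_orders';
</code></pre>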
<p>
My advice is to make sure you always <code>VACUUM ANALYZE</code> right after building your tables. If
you're using dbt, you can easily achieve this by adding this to your project's
<code>dbt_project.yml</code>:
</p>
<pre><code>
models:
  +post-hook:
    sql: "VACUUM ANALYZE {{ this }}"
    transaction: false
    # ^ This makes dbt run a VACUUM ANALYZE on the models after building each.
    # It's pointless for views, but it doesn't matter because Postgres fails
    # silently without raising an unhandled exception.
</code></pre>
<h3>Monitor queries with <code>pg_stat_statements</code></h3>
<p><a href="https://www.postgresql.org/docs/current/pgstatstatements.html" target="_blank"
rel="noopener noreferrer">pg_stat_statements</a> is an extension that nowadays ships with Postgres
by default. If activated, it will log info on the queries executed in the server, which you can check
afterwards. This includes many details; how frequently each query gets called and its min, max and mean
execution times are the ones you probably care about the most. Looking at those allows you
to find queries that take long each time they run, and queries that get run a lot.
</p>
<p>Another important piece of info that gets recorded is <em>who</em> ran the query. This is helpful
because, if you use users in a smart way, it can help you isolate expensive queries in different use
cases or areas. For example, if you use different users to build the DWH and to give your BI tool read
access (you do that... right?), you can easily tell apart dashboard-related queries from internal, DWH
transformation ones. Another example could be internal reporting vs embedded analytics in your product:
you might have stricter performance SLAs for product-embedded, customer-facing queries than for internal
dashboards. Using different users and <code>pg_stat_statements</code> makes it possible for you to
dissect performance issues in those separate areas independently.</p>
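<p>To give an idea of what this looks like in practice, here is a sketch of activating the extension and
pulling the slowest queries by mean execution time. The column names assume Postgres 13 or newer:</p>
<pre><code>
-- Requires shared_preload_libraries = 'pg_stat_statements' in postgresql.conf
-- and a server restart; then, in the database you want to monitor:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Slowest statements on average, with who ran them and how often:
SELECT userid::regrole AS "user",
       calls,
       round(mean_exec_time::numeric, 1) AS mean_ms,
       round(max_exec_time::numeric, 1)  AS max_ms,
       left(query, 80)                   AS query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
</code></pre>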
<h3>Dalibo's wonderful execution plan visualizer</h3>
<p>Sometimes you'll have some nasty query you just need to sit down with and optimize. In my experience, in
a DWH this ends up happening with queries that involve many large tables in sequential joining and
aggregation steps (as in, you join a few tables, group to some granularity, join some more, group again,
etc).
</p>
<p>You can get the query's real execution details with <code>EXPLAIN ANALYZE</code>, but the output's
readability is on par with Morse-encoded regex patterns. I always had headaches dealing with them until
I came across <a href="https://dalibo.com/" target="_blank" rel="noopener noreferrer">Dalibo</a>'s <a
href="https://explain.dalibo.com/" target="_blank" rel="noopener noreferrer">execution plan
visualizer</a>. You can paste the output of <code>EXPLAIN ANALYZE</code> there and see the query
execution presented as a diagram. No amount of words will accurately portray how awesome the UX is, so
I encourage you to try the tool with some nasty query and see for yourself.</p>
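<p>In case it helps, this is the kind of statement whose output you would paste into the visualizer. The
query itself is just a made-up example, and the extra options are optional:</p>
<pre><code>
-- ANALYZE executes the query for real; BUFFERS adds I/O details.
-- Both the plain-text and the JSON output can be pasted into explain.dalibo.com.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.customer_id, count(*) AS n_orders
FROM int_orders AS o
GROUP BY o.customer_id;
</code></pre>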
<h3>Local dev env + Foreign Data Wrapper</h3>
<p>One of the awesome things about using Postgres is how trivial it is to spin up an instance. This makes
goofing around much simpler than when setting up a new instance means paperwork, $$$, etc.</p>
<p>Data team members at Truvi have a dockerized Postgres running on their laptops that they can use when
they are developing on our DWH dbt project. In the early days, you could grab a production dump with
some subset of tables from our staging layer and run significant chunks of our dbt DAG on your laptop if
you were patient.</p>
<p>A few hundred models later, this became increasingly difficult and finally impossible.
</p>
<p>Luckily, we came across Postgres' <a
href="https://www.postgresql.org/docs/current/postgres-fdw.html">Foreign Data Wrapper</a>. There's
quite a bit to it, but to keep it short here, just be aware that FDW allows you to make a Postgres
server give access to a table in a different Postgres server while pretending it is local. So, you
query table X in Postgres server A, even though table X is actually stored in Postgres server B. But
your query works just the same as if it were a genuine local table.</p>
<p>Setting this up is fairly trivial, and it has allowed our dbt project contributors to execute
hybrid dbt runs where some data and tables are local to their laptop, while some upstream data is
read from the production server. The approach has been great so far, enabling them to actually test models
before committing them to master in a convenient way.</p>
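<p>For reference, the basic setup looks something like this. The host, credentials and schema names are
placeholders for whatever your remote server uses:</p>
<pre><code>
-- On the local (laptop) server:
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Point at the remote server holding the real data:
CREATE SERVER prod_dwh
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'dwh.example.com', port '5432', dbname 'dwh');

-- Map the local user to a (read-only) remote user:
CREATE USER MAPPING FOR CURRENT_USER
    SERVER prod_dwh
    OPTIONS (user 'readonly_dev', password 'secret');

-- Expose a remote schema as local foreign tables:
CREATE SCHEMA IF NOT EXISTS staging;
IMPORT FOREIGN SCHEMA staging
    FROM SERVER prod_dwh
    INTO staging;
</code></pre>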
<hr>
<p><a href="../index.html">back to home</a></p>
</section>
</main>

</body>

</html>