finish
This commit is contained in:
parent 01a2ae4215
commit 43ca19117c
2 changed files with 180 additions and 1 deletion
index.html
@@ -106,7 +106,13 @@
     <p>Sometimes I like to jot down ideas and drop them here.</p>
     <ul>
         <li>
-            <a href="writings/dont-hide-it-make-it-beautiful.html" target="_blank" rel="noopener noreferrer">Don't hide it,
+            <a href="writings/my-tips-and-tricks-when-using-postgres-as-a-dwh.html" target="_blank"
+                rel="noopener noreferrer">My tips and tricks when
+                using Postgres as a DWH</a>
+        </li>
+        <li>
+            <a href="writings/dont-hide-it-make-it-beautiful.html" target="_blank"
+                rel="noopener noreferrer">Don't hide it,
                 make it beautiful</a>
         </li>
         <li>

writings/my-tips-and-tricks-when-using-postgres-as-a-dwh.html
@@ -0,0 +1,173 @@
<!DOCTYPE HTML>
<html>

<head>
    <title>Pablo here</title>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link rel="stylesheet" href="../styles.css">
</head>

<body>
    <main>
        <h1>
            Hi, Pablo here
        </h1>
        <p><a href="../index.html">back to home</a></p>
        <hr>
        <section>
            <h2>My tips and tricks when using Postgres as a DWH</h2>
            <p>In November 2023, I joined Superhog (now called Truvi) to start the Data team. As part of that, I
                also drafted and deployed the first version of its data platform.
            </p>
            <p>The context led me to choose Postgres for our DWH. In a time of Snowflakes, BigQueries and Redshifts,
                this might surprise some. But I can confidently say Postgres has done a great job for us, and I would
                even dare to say it has provided a better experience than other, more trendy alternatives could have.
                I'll jot down my rationale for picking Postgres one of these days.</p>
            <p>
                Back to the topic: Postgres is not intended to act as a DWH, so using it as such might feel a bit hacky
                at times. There are multiple ways to make your life better with it, as well as related tools and
                practices that you might enjoy, which I'll try to list here.
            </p>
            <h3>Use <code>unlogged</code> tables</h3>
            <p>The <a href="https://www.postgresql.org/docs/current/wal-intro.html" target="_blank"
                    rel="noopener noreferrer">Write-Ahead Log</a> is active by default for the tables you create, and
                for good reasons. But in the context of an ELT DWH, it is probably a good idea to deactivate it by
                making your tables <code>unlogged</code>. <a
                    href="https://www.crunchydata.com/blog/postgresl-unlogged-tables" target="_blank"
                    rel="noopener noreferrer">Unlogged tables</a> will give you much faster writes (roughly twice as
                fast), which will make data loading and transformation jobs inside your DWH much faster.
            </p>
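            <p>If you manage tables by hand rather than through a tool, this is plain DDL. A minimal sketch (the table
                name is just an illustration):
            </p>
            <pre><code>
-- Create a table as unlogged from the start
CREATE UNLOGGED TABLE stg_orders (
    order_id   bigint,
    loaded_at  timestamptz
);

-- Or flip an existing table (note: this rewrites the whole table)
ALTER TABLE stg_orders SET UNLOGGED;
ALTER TABLE stg_orders SET LOGGED;
</code></pre>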
            <p>You pay for this with a few trade-offs, the most notable being that if your Postgres server
                crashes, <a href="https://www.postgresql.org/docs/current/sql-createtable.html#SQL-CREATETABLE-UNLOGGED"
                    target="_blank" rel="noopener noreferrer">the contents of the unlogged tables will be lost</a>. But,
                again, if you have an ELT DWH, you can survive that by running a backfill. At Truvi, we made the
                decision to keep the landing area of our DWH logged, and everything else unlogged. This means that if we
                experienced a crash (which still hasn't happened, btw), we would recover by running a full-refresh dbt
                run.</p>
            <p>If you are using dbt, you can easily apply this by adding this bit to your
                <code>dbt_project.yml</code>:</p>
            <pre><code>
models:
  +unlogged: true
</code></pre>
            <h3>Tuning your server's parameters</h3>
            <p><a href="https://www.postgresql.org/docs/current/runtime-config.html" target="_blank"
                    rel="noopener noreferrer">Postgres has many parameters you can fiddle with</a>, with plenty of
                chances to either improve or destroy your server's performance.</p>
            <p>Postgres ships with default values for them, which are almost surely not the optimal ones for
                your needs, <em>especially</em> if you are going to use it as a DWH. Simple changes like adjusting
                <code>work_mem</code> will do wonders to speed up some of your heavier queries.
            </p>
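            <p>For example, here is a minimal sketch of adjusting <code>work_mem</code> (the values are placeholders,
                not recommendations):
            </p>
            <pre><code>
-- Per session, handy for a one-off heavy transformation
SET work_mem = '256MB';

-- Server-wide default, persisted to postgresql.auto.conf
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();
</code></pre>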
            <p>There are many parameters to get familiar with, and proper adjustment must be done taking your specific
                context and needs into account. If you have no clue at all, <a href="https://pgtune.leopard.in.ua"
                    target="_blank" rel="noopener noreferrer">this little web app</a> can give you some suggestions you
                can start from.
            </p>
            <h3>Running <code>VACUUM ANALYZE</code> right after building your tables</h3>
            <p>Out of the box, Postgres will run
                <code><a href="https://www.postgresql.org/docs/current/sql-vacuum.html" target="_blank" rel="noopener noreferrer">VACUUM</a></code>
                and
                <code><a href="https://www.postgresql.org/docs/current/sql-analyze.html" target="_blank" rel="noopener noreferrer">ANALYZE</a></code>
                jobs <a href="https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM" target="_blank"
                    rel="noopener noreferrer">automatically</a>. The thresholds that determine when each of those gets
                triggered can be adjusted with a few server parameters. If you follow an ELT pattern, re-building your
                non-staging tables will most surely cause Postgres to run them.
            </p>
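            <p>As an illustration, these are two of the parameters involved (the values below are made up; tune them to
                your own tables):
            </p>
            <pre><code>
-- Autoanalyze once roughly 2% of a table's rows have changed,
-- autovacuum once roughly 5% have been updated or deleted
-- (both on top of a fixed row threshold)
ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.02;
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;
SELECT pg_reload_conf();
</code></pre>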
            <p>But there's a detail that is easy to overlook. Postgres' automatic triggers will start those jobs quite
                fast, but not right after you build each table. This poses a performance issue: if the intermediate
                sections of your DWH have tables that build upon tables, rebuilding a table and then trying to rebuild
                a dependent one without an <code>ANALYZE</code> having run on the first one might hurt you.</p>
            <p>Let me describe this with an example, because this one is a bit of a tongue twister: let's assume we have
                tables <code>int_orders</code> and <code>int_order_kpis</code>. <code>int_orders</code> holds all of our
                orders, and <code>int_order_kpis</code> derives some KPIs from them. Naturally, first you will
                materialize <code>int_orders</code> from some upstream staging tables, and once that is complete, you
                will use its contents to build <code>int_order_kpis</code>.
            </p>
            <p>
                Having <code>int_orders</code> <code>ANALYZE</code>-d before you start building
                <code>int_order_kpis</code> is highly beneficial for the performance of building
                <code>int_order_kpis</code>. Why? Because having perfectly updated statistics and metadata on
                <code>int_orders</code> will help Postgres' query optimizer better plan the query needed to
                materialize <code>int_order_kpis</code>. This can improve performance by orders of magnitude in some
                queries, for example by allowing Postgres to pick the right join strategy for the specific data you
                have.
            </p>
            <p>Now, will Postgres auto <code>VACUUM ANALYZE</code> the freshly built <code>int_orders</code> before you
                start building <code>int_order_kpis</code>? Hard to tell. It depends on how you build your DWH, and how
                you've tuned your server's parameters. And the most dangerous bit is you're not in full control: it can
                be that <em>sometimes</em> it happens, and other times it doesn't. Flaky and annoying. Some day I'll
                write a post on how this behaviour drove me mad for two months because it made a model sometimes build
                in a few seconds, and other times in >20 min.
            </p>
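            <p>If you want to check whether autovacuum/autoanalyze actually got to a freshly built table, the
                statistics views will tell you. A quick, illustrative check (the table name is just an example):
            </p>
            <pre><code>
SELECT relname, last_analyze, last_autoanalyze, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'int_orders';
</code></pre>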
            <p>
                My advice is to make sure you always <code>VACUUM ANALYZE</code> right after building your tables. If
                you're using dbt, you can easily achieve this by adding this to your project's
                <code>dbt_project.yml</code>:
            </p>
            <pre><code>
models:
  +post-hook:
    sql: "VACUUM ANALYZE {{ this }}"
    transaction: false
    # ^ This makes dbt run a VACUUM ANALYZE on each model right after it is built.
    # It's pointless for views, but that doesn't matter because Postgres fails
    # silently without raising an unhandled exception.
</code></pre>
            <h3>Monitor queries with <code>pg_stat_statements</code></h3>
            <p><a href="https://www.postgresql.org/docs/current/pgstatstatements.html" target="_blank"
                    rel="noopener noreferrer">pg_stat_statements</a> is an extension that nowadays ships with Postgres
                by default. Once activated, it logs info about the queries executed on the server, which you can check
                afterwards. It records many details; how frequently each query gets called and its min, max and mean
                execution times are probably the ones you care about the most. Looking at those lets you find queries
                that take a long time each time they run, and queries that get run a lot.
            </p>
            <p>Another important piece of info that gets recorded is <em>who</em> ran the query. This is helpful
                because, if you use users in a smart way, it can help you isolate expensive queries in different use
                cases or areas. For example, if you use different users to build the DWH and to give your BI tool read
                access (you do that... right?), you can easily tell apart dashboard-related queries from internal, DWH
                transformation ones. Another example could be internal reporting vs embedded analytics in your product:
                you might have stricter performance SLAs for product-embedded, customer-facing queries than for internal
                dashboards. Using different users and <code>pg_stat_statements</code> makes it possible for you to
                dissect performance issues in those separate areas independently.</p>
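            <p>To give a flavour of how this looks, here is a sketch (it assumes Postgres 13+ column names and that the
                extension has been added to <code>shared_preload_libraries</code>):
            </p>
            <pre><code>
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Slowest statements on average, with the user that ran them
SELECT r.rolname, s.calls, s.mean_exec_time, s.max_exec_time, s.query
FROM pg_stat_statements AS s
JOIN pg_roles AS r ON r.oid = s.userid
ORDER BY s.mean_exec_time DESC
LIMIT 20;
</code></pre>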
            <h3>Dalibo's wonderful execution plan visualizer</h3>
            <p>Sometimes you'll have some nasty query you just need to sit down with and optimize. In my experience, in
                a DWH this ends up happening with queries that involve many large tables in sequential joining and
                aggregation steps (as in, you join a few tables, group to some granularity, join some more, group again,
                etc).
            </p>
            <p>You can get a query's real execution details with <code>EXPLAIN ANALYZE</code>, but the output's
                readability is on par with Morse-encoded regex patterns. I always had headaches dealing with it until
                I came across <a href="https://dalibo.com/" target="_blank" rel="noopener noreferrer">Dalibo</a>'s <a
                    href="https://explain.dalibo.com/" target="_blank" rel="noopener noreferrer">execution plan
                    visualizer</a>. You can paste the output of <code>EXPLAIN ANALYZE</code> there and see the query
                execution presented as a diagram. No amount of words will accurately portray how awesome the UX is, so
                I encourage you to try the tool with some nasty query and see for yourself.</p>
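            <p>For reference, this is the kind of command whose output you would paste there (the query and table names
                are made up for the sake of the example):
            </p>
            <pre><code>
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, count(*) AS n_lines
FROM int_orders AS o
JOIN int_order_lines AS l ON l.order_id = o.order_id
GROUP BY o.order_id;
</code></pre>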
            <h3>Local dev env + Foreign Data Wrapper</h3>
            <p>One of the awesome things about using Postgres is how trivial it is to spin up an instance. This makes
                goofing around much simpler than when setting up a new instance means paperwork, $$$, etc.</p>
            <p>Data team members at Truvi have a dockerized Postgres running on their laptops that they can use when
                they are developing on our DWH dbt project. In the early days, you could grab a production dump with
                some subset of tables from our staging layer and run significant chunks of our dbt DAG on your laptop if
                you were patient.</p>
            <p>A few hundred models later, this became increasingly difficult and finally impossible.
            </p>
            <p>Luckily, we came across Postgres' <a
                    href="https://www.postgresql.org/docs/current/postgres-fdw.html" target="_blank"
                    rel="noopener noreferrer">Foreign Data Wrapper</a>. There's quite a bit to it, but to keep it short
                here, just be aware that FDW allows you to make a Postgres server give access to a table that lives in a
                different Postgres server while pretending it is local. So you query table X on Postgres server A, even
                though table X is actually stored on Postgres server B, and your query works just the same as if it were
                a genuine local table.</p>
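            <p>A rough sketch of what the setup looks like (the server name, host, schema and credentials below are all
                placeholders):
            </p>
            <pre><code>
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER prod_dwh
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'dwh.example.com', port '5432', dbname 'dwh');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER prod_dwh
    OPTIONS (user 'readonly_dev', password 'changeme');

-- Make the production staging schema queryable from the local instance
CREATE SCHEMA IF NOT EXISTS staging;
IMPORT FOREIGN SCHEMA staging
    FROM SERVER prod_dwh
    INTO staging;
</code></pre>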
            <p>Setting these up is fairly trivial, and it has allowed our dbt project contributors to execute hybrid dbt
                runs where some data and tables are local to their laptop, while some upstream data is read from the
                production server. The approach has been great so far, enabling them to conveniently test models before
                committing them to master.</p>
            <hr>
            <p><a href="../index.html">back to home</a></p>
        </section>
    </main>

</body>

</html>