<!DOCTYPE HTML>

<html>

<head>
    <title>Pablo here</title>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link rel="stylesheet" href="../styles.css">
</head>
<body>
<main>

<h1>
    Hi, Pablo here
</h1>

<p><a href="../index.html">back to home</a></p>

<hr>

<section>

<h2>Busy man's guide to optimizing dbt model performance</h2>

<p>The guide below is a copy-paste of an internal doc I created while working at Superhog. The analysts on my
    team were very smart, but had little knowledge of Postgres internals (we used Postgres for our DWH) or of
    low-level query optimization. That was understandable: they were analysts, busy answering business
    questions. Their value was not in fixing pipeline performance.</p>

<p>Nevertheless, giving them some freedom to fix performance themselves was great both to keep me from
    becoming a bottleneck and to expand their knowledge, which they were eager to do. This guide was
    targeted at them, and the goal was to give them as many tools as possible without having to go down the
    rabbit hole of <code>EXPLAIN ANALYZE</code>.</p>

<hr>

<p>You have a <code>dbt</code> model that takes ages to run in production. For some very valid reason, this
    is a problem.</p>

<p>This is a small reference guide on things you can try. I suggest you try them from start to end, since
    they are sorted in descending order of value-to-complexity ratio.</p>

<p>Before you start working on a model, you might want to check <a href="#bonus">the bonus guide at the
    bottom</a> to learn how to make sure you don't change the outputs of a model while refactoring it.</p>

<p>If you've tried everything here and things still don't work, don't hesitate to call Pablo.</p>
<h3>1. Is your model <em>really</em> taking too long?</h3>

<blockquote>Before you optimize a model that is taking too long, make sure it actually takes too long.
</blockquote>

<p>The very first step is to assess whether you really have a problem.</p>

<p>We run our DWH on a Postgres server, and Postgres is a complex system. Postgres is doing many things at
    all times and is very stateful, which means you will pretty much never see <em>exactly</em> the same
    performance twice for a given query.</p>

<p>Before going crazy optimizing, I would advise running the model or the entire project a few times and
    observing the behaviour. It might be that <em>some day</em> it took very long for some reason, but
    usually, it runs just fine.</p>

<p>You also might want to do this at a moment when there's little activity in the DWH, like very early or
    late in the day, so that other users' activity in the DWH doesn't pollute your observations.</p>

<p>If this is a model that is already being run regularly, we can also leverage the statistics
    collected by the <code>pg_stat_statements</code> Postgres extension to check the min, avg, and
    max run times for it. Ask Pablo to get this.</p>
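<p>If you want to peek at those statistics yourself, the query is roughly this (a sketch; the
    <code>*_exec_time</code> columns exist under these names in recent Postgres versions, and the timings
    are in milliseconds):</p>
<pre><code>-- top statements by average execution time
select
    calls,
    min_exec_time,
    mean_exec_time,
    max_exec_time,
    query
from pg_stat_statements
order by mean_exec_time desc
limit 20;</code></pre>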
<h3>2. Reducing the amount of data</h3>

<blockquote>Make your query bring in only the data it needs, and not more. Reduce the amount of data as
    early as possible.</blockquote>

<p>This is a simple optimization trick that can be used in many areas, and it's easy to pull off.</p>

<p>The two holy devils of slow queries are large amounts of data and monster lookups/sorts. Both can be
    drastically reduced by simply reducing the amount of data that goes into the query, typically by applying
    some smart <code>WHERE</code> or creative conditions on a <code>JOIN</code> clause. This can be done
    either in your basic CTEs where you read from other models, or in the main <code>SELECT</code> of your
    model.</p>

<p>Try to do this as <em>early</em> as possible in the model. Early here refers to the steps
    of your query. In your queries, you will typically:</p>

<ul>
    <li>read a few tables,</li>
    <li>do some <code>SELECT</code>s,</li>
    <li>then do more crazy logic downstream with more <code>SELECT</code>s,</li>
    <li>and the party goes on for as long and as complex as your case is.</li>
</ul>

<p>Reducing the amount of data at the end is pointless. You will still need to read a lot of stuff early and
    have monster <code>JOIN</code>s, window functions, <code>DISTINCT</code>s, etc. Ideally, you want to do
    it when you first access an upstream table. If not there, then as early as possible within the logic.</p>

<p>The specifics of how to apply this are absolutely query dependent, so I can't give you magic instructions
    for the query you have at hand. But let me illustrate the concept with an example:</p>
<h4>Only hosts? Then only hosts</h4>

<p>You have a table <code>stg_my_table</code> with a lot of data, let's say 100 million records, and each
    record has the id of a host. In your model, you need to join these records with the host user data to
    get some columns from there. So right now your query looks something like this (the tables are
    fictional; this is not how things look in the DWH):</p>

<pre><code>with
stg_my_table as (select * from {{ ref("stg_my_table") }}),
stg_users as (select * from {{ ref("stg_users") }})

select
    ...
from stg_my_table t
left join
    stg_users u
    on t.id_host_user = u.id_user</code></pre>

<p>At the time I'm writing this, the real user table in our DWH has some 600,000 records. This means
    that:</p>
<ul>
    <li>The CTE <code>stg_users</code> will need to fetch 600,000 records, with all their data, and store
        them.</li>
    <li>Then the left join will have to join 100 million records from <code>my_table</code> against the
        600,000 user records.</li>
</ul>

<p>Now, this is not working for you because it takes ages. We can easily improve the situation by applying
    the principle of this section: reducing the amount of data.</p>

<p>Our user table in the DWH has both hosts and guests. Actually, it has ~1,000 hosts and everything else
    is just guests. This means that:</p>

<ul>
    <li>We're fetching around 599,000 guest details that we don't care about at all.</li>
    <li>Every time we join a record from <code>my_table</code>, we do so against 600,000 user records when
        we only truly care about 1,000 of them.</li>
</ul>

<p>Stupid, isn't it?</p>

<p>Well, imagining that our fictional <code>stg_users</code> table had a field called
    <code>is_host</code>, we can rewrite the query this way to get exactly the same result in only a
    fraction of the time:</p>
<pre><code>with
stg_my_table as (select * from {{ ref("stg_my_table") }}),
stg_users as (
    select *
    from {{ ref("stg_users") }}
    where is_host = true
)

select
    ...
from stg_my_table t
left join
    stg_users u
    on t.id_host_user = u.id_user</code></pre>

<p>It's simple to understand: the CTE will now only get the 1,000 records related to hosts, which means we
    save performance both in fetching that data and in having a much smaller join operation downstream
    against <code>stg_my_table</code>.</p>
<h3>3. Controlling CTE materialization</h3>

<blockquote>Tell Postgres when to cache intermediate results and when to optimize through them.</blockquote>

<p>This one requires a tiny bit of understanding of what happens under the hood, but the payoff is big and
    the fix is easy to apply.</p>

<h4>What Postgres does with your CTEs</h4>

<p>When Postgres runs a CTE, it has two strategies:</p>

<ul>
    <li><strong>Materialized</strong>: Postgres runs the CTE query, stores the full result in a temporary
        buffer, and every downstream reference reads from that buffer. Think of it as Postgres creating a
        temporary, index-less table with the CTE's output.</li>
    <li><strong>Not materialized</strong>: Postgres treats the CTE as if it were a view. It doesn't store
        anything; instead, it folds the CTE's logic into the rest of the query and optimizes everything
        together. This means it can push filters down, use indexes from the original tables, and skip
        reading rows it doesn't need.</li>
</ul>

<p>By default, Postgres decides for you: if a CTE is referenced once, it inlines it. If it's referenced
    more than once, it materializes it.</p>

<p>The problem is that this default isn't always ideal, especially with how we write dbt models.</p>
<h4>Why this matters for our dbt models</h4>

<p>Following our conventions, we always import upstream refs as CTEs at the top of the file:</p>

<pre><code>with
stg_users as (select * from {{ ref("stg_users") }}),
stg_bookings as (select * from {{ ref("stg_bookings") }}),

some_intermediate_logic as (
    select ...
    from stg_users
    join stg_bookings on ...
    where ...
),

some_other_logic as (
    select ...
    from stg_users
    where ...
)

select ...
from some_intermediate_logic
join some_other_logic on ...</code></pre>

<p>Notice that <code>stg_users</code> is referenced twice: once in <code>some_intermediate_logic</code>
    and once in <code>some_other_logic</code>. This means Postgres will materialize it by default. What
    happens then is:</p>

<ol>
    <li>Postgres scans the entire <code>stg_users</code> table and copies all 600,000 rows into a temporary
        buffer.</li>
    <li>If the buffer exceeds available memory, it spills to disk.</li>
    <li>Every downstream CTE that reads from <code>stg_users</code> does a sequential scan of that buffer.
        Note this means indexes can't be used, even if the original table had them.</li>
    <li>Any filters that downstream CTEs apply to <code>stg_users</code> (like
        <code>where is_host = true</code>) can't be pushed down to the original table scan. Postgres reads
        all 600,000 rows first, stores them, and only then filters.</li>
</ol>

<p>All of that for a <code>select *</code> that does absolutely no computation worth caching.</p>
<h4>The fix</h4>

<p>You can explicitly control this behaviour by adding <code>MATERIALIZED</code> or
    <code>NOT MATERIALIZED</code> to any CTE:</p>

<pre><code>with
stg_users as not materialized (select * from {{ ref("stg_users") }}),
stg_bookings as not materialized (select * from {{ ref("stg_bookings") }}),

some_intermediate_logic as (
    ...
),

some_other_logic as (
    ...
)

select ...</code></pre>

<p>With <code>NOT MATERIALIZED</code>, Postgres treats those import CTEs as transparent aliases. It can see
    straight through to the original table, use its indexes, and push filters down.</p>
<h4>When to use which</h4>

<p>The rule of thumb is simple:</p>

<ul>
    <li><strong>Cheap CTE, referenced multiple times</strong> → <code>NOT MATERIALIZED</code>. This is the
        typical case for our import CTEs at the top of the file. There's no computation to cache, so
        materializing just wastes resources.</li>
    <li><strong>Expensive CTE, referenced multiple times</strong> → leave it alone (or make it explicitly
        <code>MATERIALIZED</code>). If a CTE does heavy aggregations, complex joins, or window functions,
        materializing means that work happens once. Without it, Postgres would repeat the expensive query
        every time the CTE is referenced.</li>
    <li><strong>Any CTE referenced only once</strong> → doesn't matter. Postgres inlines it automatically.
    </li>
</ul>

<p>If you're unsure whether a CTE is "expensive enough" to warrant materialization, just try both and
    measure. There's no shame in that.</p>
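<p>For the expensive case, the keyword goes in the same spot as before (a sketch; the aggregation here is
    made up):</p>
<pre><code>with
monthly_stats as materialized (
    -- heavy aggregation, referenced twice below: worth computing once and caching
    select id_user, date_trunc('month', created_at) as month, count(*) as n
    from {{ ref("stg_bookings") }}
    group by 1, 2
)

select ...
from monthly_stats this_month
join monthly_stats previous_month on ...</code></pre>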
<h3>4. Change upstream materializations</h3>

<blockquote>Materialize upstream models as tables instead of views to reduce the computation needed by the
    model at hand.</blockquote>

<p>Going back to basics, dbt offers <a href="https://docs.getdbt.com/docs/build/materializations"
    target="_blank" rel="noopener noreferrer">multiple materialization strategies for our models</a>.</p>

<p>Typically, for reasons that we won't cover here, the preferred starting point is to use views. We only go
    for tables or incremental materializations if there are good reasons to.</p>

<p>If you have a model with terrible performance, it's possible that the fault doesn't sit with the
    model itself, but rather with an upstream model. Let me give an example.</p>

<p>Imagine we have a situation with three models:</p>

<ul>
    <li><code>stg_my_simple_model</code>: a model with super simple logic and small data</li>
    <li><code>stg_my_crazy_model</code>: a model with a crazy complex query and lots of data</li>
    <li><code>int_my_dependant_model</code>: an int model that reads from both previous models</li>
</ul>

<p>The staging models are set to materialize as views, and the int model is set to materialize as a
    table.</p>

<p>Because the two staging models materialize as views, every time you run
    <code>int_my_dependant_model</code> you will also have to execute the queries of
    <code>stg_my_simple_model</code> and <code>stg_my_crazy_model</code>. If the upstream view models are
    fast, this is not an issue of any kind. But if one of them runs a heavy query, it could be.</p>

<p>The point is, you might notice that <code>int_my_dependant_model</code> takes 600 seconds to run and
    think there's something wrong with it, when actually the fault sits with <code>stg_my_crazy_model</code>,
    which perhaps is taking 590 seconds out of the 600.</p>

<p>How can materializations solve this? Well, if <code>stg_my_crazy_model</code> were materialized as a table
    instead of as a view, whenever you ran <code>int_my_dependant_model</code> you would simply read from a
    table with pre-populated results, instead of having to run the <code>stg_my_crazy_model</code> query
    each time. Typically, reading the results will be much faster than running the whole query. So, in
    summary, by making <code>stg_my_crazy_model</code> materialize as a table, you can fix your performance
    issue in <code>int_my_dependant_model</code>.</p>
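<p>In dbt, changing an upstream model's materialization is a one-line config. For example, at the top of the
    hypothetical <code>stg_my_crazy_model.sql</code> (this uses dbt's standard <code>config</code> macro; the
    same setting can also live in <code>dbt_project.yml</code>):</p>
<pre><code>{{ config(materialized="table") }}

select ...</code></pre>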
<h3>5. Switch the model's materialization to <code>incremental</code></h3>

<blockquote>Make the processing of the table happen in small batches instead of on all the data, to make it
    more manageable.</blockquote>

<p>Imagine we want to count how many bookings were created each month.</p>

<p>As time passes, more and more months and more and more bookings appear in our history, making the size of
    this problem ever increasing. But then again, once a month has finished, we shouldn't need to go back
    and revisit history: what's done is done, and only the ongoing month is relevant, right?</p>

<p><a href="https://docs.getdbt.com/docs/build/incremental-models" target="_blank"
    rel="noopener noreferrer">dbt offers a materialization strategy named <code>incremental</code></a>,
    which allows you to work on only a subset of data. This means that every time you run
    <code>dbt run</code>, your model only works on a certain part of the data, and not all of it. If the
    nature of your data and your needs allow isolating each run to a small part of all upstream data, this
    strategy can wildly improve performance.</p>
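<p>To give a flavour of what this looks like, here is a minimal sketch of an incremental monthly-count model
    (the table and column names are made up; <code>is_incremental()</code> and <code>this</code> are standard
    dbt constructs):</p>
<pre><code>{{ config(materialized="incremental", unique_key="booking_month") }}

select
    date_trunc('month', created_at) as booking_month,
    count(*) as n_bookings
from {{ ref("stg_bookings") }}
{% if is_incremental() %}
-- on regular runs, only reprocess from the latest month already in the table
where created_at >= (select max(booking_month) from {{ this }})
{% endif %}
group by 1</code></pre>
<p>On a full refresh the <code>where</code> clause is skipped and all history is processed; on regular runs
    only the recent rows are read, and the existing row for the ongoing month is replaced via the
    <code>unique_key</code>.</p>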
<p>Explaining the inner details of <code>incremental</code> goes beyond the scope of this page. You can
    check the official docs from <code>dbt</code> (<a
    href="https://docs.getdbt.com/docs/build/incremental-models" target="_blank"
    rel="noopener noreferrer">here</a>), ask the team for support, or check some of the incremental
    models that we already have in our project and use them as references.</p>

<p>Note that <code>incremental</code> strategies make life way harder than the simple <code>view</code>
    or <code>table</code> ones, so only pick this up if it's truly necessary. Don't make models incremental
    without trying other optimizations first, or simply because you realise that you <em>could</em> use it
    in a specific model.</p>

<h3>6. End of the line: general optimization</h3>

<p>The final tip is not really a tip. The above five things are the easy-peasy, low-hanging-fruit stuff that
    you can try. This doesn't mean that there isn't more you can do, just that I don't know of more
    simple stuff you can try without deep knowledge of how Postgres works underneath and a willingness to
    get your hands <em>real</em> dirty.</p>

<p>If you've reached this point and your model is still performing poorly, you either need to put your Data
    Engineer hat on and really deepen your knowledge… or call Pablo.</p>
<h3 id="bonus">Bonus: how to make sure you didn't screw up and change the output of the model</h3>

<p>What we're discussing in this guide is making refactors purely for the sake of performance, without
    changing the output of the given model. We simply want to make the model faster, not change what data it
    generates.</p>

<p>That being the case, and considering the complexity of the strategies presented here, being afraid
    that you messed up and accidentally changed the output of the model is a very reasonable fear to have.
    That's the kind of mistake we definitely want to avoid.</p>

<p>Checking this manually can be a PITA and very time consuming, which doesn't help at all.</p>

<p>To make your life easier, I'm going to show you a little trick.</p>
<h4>Hashing tables and comparing them</h4>

<p>Here's a snippet you can run to check whether any pair of tables has <em>exactly</em>
    the same contents. Emphasis on exactly: changing the slightest bit of content will be detected.</p>

<pre><code>SELECT md5(array_agg(md5((t1.*)::varchar))::varchar)
FROM (
    SELECT *
    FROM my_first_table
    ORDER BY &lt;whatever field is unique&gt;
) AS t1;

SELECT md5(array_agg(md5((t2.*)::varchar))::varchar)
FROM (
    SELECT *
    FROM my_second_table
    ORDER BY &lt;whatever field is unique&gt;
) AS t2;</code></pre>
<p>How this works: you execute the two queries, and each returns a single value. Some hexadecimal
    gibberish.</p>

<p>If the output of the two queries is identical, their contents are identical. If they differ,
    something differs between the two tables.</p>

<p>If you don't understand how this works, and you don't care, that's fine. Just use it.</p>

<p>If not knowing does bother you, go down the rabbit holes of hash functions and deterministic
    serialization.</p>
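<p>A related trick: if the hashes differ and you want to see <em>which</em> rows changed, a symmetric
    difference between the two tables lists the offending rows. This is plain standard SQL (same fictional
    table names as above):</p>
<pre><code>(SELECT * FROM my_first_table
 EXCEPT
 SELECT * FROM my_second_table)
UNION ALL
(SELECT * FROM my_second_table
 EXCEPT
 SELECT * FROM my_first_table);</code></pre>
<p>An empty result means the two tables contain the same set of rows. Note that, unlike the hash trick,
    <code>EXCEPT</code> removes duplicates, so it won't catch differences in duplicate counts.</p>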
<h4>Including this in your refactoring workflow</h4>

<p>Right, now you know how to make sure that two tables are identical.</p>

<p>This is dramatically useful for your optimization workflow. You can now simply:</p>

<ul>
    <li>Keep the original model.</li>
    <li>Create a copy of it, which is the one you will be working on (the working copy).</li>
    <li>Prepare the magic query to check that their contents are identical.</li>
    <li>From this point on, you can stay in this loop for as long as you want/need:
        <ul>
            <li>Run the magic query to ensure you start from a same-output state.</li>
            <li>Modify the working copy model to attempt whatever optimization thingie you wanna try.</li>
            <li>Once you are done, run the magic query again.</li>
            <li>If the output is not the same anymore, you screwed up. Start again and avoid whatever
                mistake you made.</li>
            <li>If the output is still the same, you didn't change the model output. Either keep
                on optimizing or call it a day.</li>
        </ul>
    </li>
    <li>Finally, just copy the working copy's code over into the old model and remove the working copy.
    </li>
</ul>

<p>I hope that helps. I also recommend running the loop as frequently as possible. The fewer things you
    change between executions of the magic query, the easier it is to tell what caused errors if they
    appear.</p>
<hr>

<p><a href="../index.html">back to home</a></p>

</section>

</main>

</body>

</html>