<!DOCTYPE HTML>

<html>

<head>
    <title>Pablo here</title>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link rel="stylesheet" href="../styles.css">
</head>
<body>
<main>

<h1>
    Hi, Pablo here
</h1>

<p><a href="../index.html">back to home</a></p>

<hr>

<section>

<h2>Busy man's guide to optimizing dbt model performance</h2>

<p>The guide below is a copy-paste of an internal doc I created while working at Superhog. The analysts on my
    team were very smart, but had little knowledge of Postgres internals (we used Postgres for our DWH) or of
    low-level query optimization. That was understandable: they were analysts, busy answering business
    questions. Their value was not in fixing pipeline performance.</p>

<p>Nevertheless, giving them some freedom to fix performance themselves was great both to keep me from
    becoming a bottleneck and to expand their knowledge, which they were eager to do. This guide was
    targeted at them, and the goal was to give them as many tools as possible without having to go down the
    rabbit hole of <code>EXPLAIN ANALYZE</code>.</p>

<hr>

<p>You have a <code>dbt</code> model that takes ages to run in production. For some very valid reason, this
    is a problem.</p>

<p>This is a small reference guide on things you can try. I suggest you try them from start to end, since
    they are sorted in descending order of value-to-complexity ratio.</p>

<p>Before you start working on a model, you might want to check <a href="#bonus">the bonus guide at the
    bottom</a> to learn how to make sure you don't change the outputs of a model while refactoring it.</p>

<p>If you've tried everything here and things still don't work, don't hesitate to call Pablo.</p>
<h3>1. Is your model <em>really</em> taking too long?</h3>

<blockquote>Before you optimize a model that is taking too long, make sure it actually takes too long.
</blockquote>

<p>The very first step is to assess whether you really have a problem.</p>

<p>We run our DWH on a Postgres server, and Postgres is a complex system. Postgres is doing many things at
    all times and is very stateful, which means you will pretty much never see <em>exactly</em> the same
    performance twice for a given query.</p>

<p>Before going crazy optimizing, I would advise running the model or the entire project a few times and
    observing the behaviour. It might be that <em>some day</em> it took very long for some reason, but
    usually, it runs just fine.</p>

<p>You also might want to do this at a moment when there's little activity in the DWH, like very early or
    late in the day, so that other users' activity in the DWH doesn't pollute your observations.</p>

<p>If this is a model that is already being run regularly, we can also leverage the statistics
    collected by the <code>pg_stat_statements</code> Postgres extension to check the min, avg, and
    max run times for it. Ask Pablo to get this.</p>
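<p>If you want to peek at those statistics yourself, the query is roughly this (a sketch; the
    <code>*_exec_time</code> columns exist under these names in recent Postgres versions, and the timings
    are in milliseconds):</p>
<pre><code>-- top statements by average execution time
select
    calls,
    min_exec_time,
    mean_exec_time,
    max_exec_time,
    query
from pg_stat_statements
order by mean_exec_time desc
limit 20;</code></pre>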
<h3>2. Reducing the amount of data</h3>

<blockquote>Make your query bring in only the data it needs, and not more. Reduce the amount of data as
    early as possible.</blockquote>

<p>This is a simple optimization trick that can be used in many areas, and it's easy to pull off.</p>

<p>The two holy devils of slow queries are large amounts of data and monster lookups/sorts. Both can be
    drastically reduced by simply reducing the amount of data that goes into the query, typically by applying
    some smart <code>WHERE</code> or creative conditions on a <code>JOIN</code> clause. This can be done
    either in your basic CTEs where you read from other models, or in the main <code>SELECT</code> of your
    model.</p>

<p>Try to do this as <em>early</em> as possible in the model. Early here refers to the steps
    of your query. In your queries, you will typically:</p>

<ul>
    <li>read a few tables,</li>
    <li>do some <code>SELECT</code>s,</li>
    <li>then do more crazy logic downstream with more <code>SELECT</code>s,</li>
    <li>and the party goes on for as long and as complex as your case is.</li>
</ul>

<p>Reducing the amount of data at the end is pointless. You will still need to read a lot of stuff early and
    have monster <code>JOIN</code>s, window functions, <code>DISTINCT</code>s, etc. Ideally, you want to do
    it when you first access an upstream table. If not there, then as early as possible within the logic.</p>

<p>The specifics of how to apply this are absolutely query dependent, so I can't give you magic instructions
    for the query you have at hand. But let me illustrate the concept with an example:</p>
<h4>Only hosts? Then only hosts</h4>

<p>You have a table <code>stg_my_table</code> with a lot of data, let's say 100 million records, and each
    record has the id of a host. In your model, you need to join these records with the host user data to
    get some columns from there. So right now your query looks something like this (the tables are
    fictional; this is not how things look in the DWH):</p>

<pre><code>with
stg_my_table as (select * from {{ ref("stg_my_table") }}),
stg_users as (select * from {{ ref("stg_users") }})

select
    ...
from stg_my_table t
left join
    stg_users u
    on t.id_host_user = u.id_user</code></pre>

<p>At the time I'm writing this, the real user table in our DWH has some 600,000 records. This means
    that:</p>
<ul>
    <li>The CTE <code>stg_users</code> will need to fetch 600,000 records, with all their data, and store
        them.</li>
    <li>Then the left join will have to join 100 million records from <code>my_table</code> against the
        600,000 user records.</li>
</ul>

<p>Now, this is not working for you because it takes ages. We can easily improve the situation by applying
    the principle of this section: reducing the amount of data.</p>

<p>Our user table in the DWH has both hosts and guests. Actually, it has ~1,000 hosts and everything else
    is just guests. This means that:</p>

<ul>
    <li>We're fetching around 599,000 guest details that we don't care about at all.</li>
    <li>Every time we join a record from <code>my_table</code>, we do so against 600,000 user records when
        we only truly care about 1,000 of them.</li>
</ul>

<p>Stupid, isn't it?</p>

<p>Well, imagining that our fictional <code>stg_users</code> table had a field called
    <code>is_host</code>, we can rewrite the query this way to get exactly the same result in only a
    fraction of the time:</p>
<pre><code>with
stg_my_table as (select * from {{ ref("stg_my_table") }}),
stg_users as (
    select *
    from {{ ref("stg_users") }}
    where is_host = true
)

select
    ...
from stg_my_table t
left join
    stg_users u
    on t.id_host_user = u.id_user</code></pre>

<p>It's simple to understand: the CTE will now only get the 1,000 records related to hosts, which means we
    save performance both in fetching that data and in having a much smaller join operation downstream
    against <code>stg_my_table</code>.</p>
<h3>3. Controlling CTE materialization</h3>

<blockquote>Tell Postgres when to cache intermediate results and when to optimize through them.</blockquote>

<p>This one requires a tiny bit of understanding of what happens under the hood, but the payoff is big and
    the fix is easy to apply.</p>

<h4>What Postgres does with your CTEs</h4>

<p>When Postgres runs a CTE, it has two strategies:</p>

<ul>
    <li><strong>Materialized</strong>: Postgres runs the CTE query, stores the full result in a temporary
        buffer, and every downstream reference reads from that buffer. Think of it as Postgres creating a
        temporary, index-less table with the CTE's output.</li>
    <li><strong>Not materialized</strong>: Postgres treats the CTE as if it were a view. It doesn't store
        anything; instead, it folds the CTE's logic into the rest of the query and optimizes everything
        together. This means it can push filters down, use indexes from the original tables, and skip
        reading rows it doesn't need.</li>
</ul>

<p>By default, Postgres decides for you: if a CTE is referenced once, it inlines it. If it's referenced
    more than once, it materializes it.</p>

<p>The problem is that this default isn't always ideal, especially with how we write dbt models.</p>
<h4>Why this matters for our dbt models</h4>

<p>Following our conventions, we always import upstream refs as CTEs at the top of the file:</p>

<pre><code>with
stg_users as (select * from {{ ref("stg_users") }}),
stg_bookings as (select * from {{ ref("stg_bookings") }}),

some_intermediate_logic as (
    select ...
    from stg_users
    join stg_bookings on ...
    where ...
),

some_other_logic as (
    select ...
    from stg_users
    where ...
)

select ...
from some_intermediate_logic
join some_other_logic on ...</code></pre>

<p>Notice that <code>stg_users</code> is referenced twice: once in <code>some_intermediate_logic</code>
    and once in <code>some_other_logic</code>. This means Postgres will materialize it by default. What
    happens then is:</p>

<ol>
    <li>Postgres scans the entire <code>stg_users</code> table and copies all 600,000 rows into a temporary
        buffer.</li>
    <li>If the buffer exceeds available memory, it spills to disk.</li>
    <li>Every downstream CTE that reads from <code>stg_users</code> does a sequential scan of that buffer.
        Note this means indexes can't be used, even if the original table had them.</li>
    <li>Any filters that downstream CTEs apply to <code>stg_users</code> (like
        <code>where is_host = true</code>) can't be pushed down to the original table scan. Postgres reads
        all 600,000 rows first, stores them, and only then filters.</li>
</ol>

<p>All of that for a <code>select *</code> that does absolutely no computation worth caching.</p>
<h4>The fix</h4>

<p>You can explicitly control this behaviour by adding <code>MATERIALIZED</code> or
    <code>NOT MATERIALIZED</code> to any CTE:</p>

<pre><code>with
stg_users as not materialized (select * from {{ ref("stg_users") }}),
stg_bookings as not materialized (select * from {{ ref("stg_bookings") }}),

some_intermediate_logic as (
    ...
),

some_other_logic as (
    ...
)

select ...</code></pre>

<p>With <code>NOT MATERIALIZED</code>, Postgres treats those import CTEs as transparent aliases. It can see
    straight through to the original table, use its indexes, and push filters down.</p>
<h4>When to use which</h4>

<p>The rule of thumb is simple:</p>

<ul>
    <li><strong>Cheap CTE, referenced multiple times</strong> → <code>NOT MATERIALIZED</code>. This is the
        typical case for our import CTEs at the top of the file. There's no computation to cache, so
        materializing just wastes resources.</li>
    <li><strong>Expensive CTE, referenced multiple times</strong> → leave it alone (or make it explicitly
        <code>MATERIALIZED</code>). If a CTE does heavy aggregations, complex joins, or window functions,
        materializing means that work happens once. Without it, Postgres would repeat the expensive query
        every time the CTE is referenced.</li>
    <li><strong>Any CTE referenced only once</strong> → doesn't matter. Postgres inlines it automatically.
    </li>
</ul>

<p>If you're unsure whether a CTE is "expensive enough" to warrant materialization, just try both and
    measure. There's no shame in that.</p>
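<p>For the expensive case, the keyword goes in the same spot as before (a sketch; the aggregation here is
    made up):</p>
<pre><code>with
monthly_stats as materialized (
    -- heavy aggregation, referenced twice below: worth computing once and caching
    select id_user, date_trunc('month', created_at) as month, count(*) as n
    from {{ ref("stg_bookings") }}
    group by 1, 2
)

select ...
from monthly_stats this_month
join monthly_stats previous_month on ...</code></pre>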
<h3>4. Change upstream materializations</h3>

<blockquote>Materialize upstream models as tables instead of views to reduce the computation needed by the
    model at hand.</blockquote>

<p>Going back to basics, dbt offers <a href="https://docs.getdbt.com/docs/build/materializations"
    target="_blank" rel="noopener noreferrer">multiple materialization strategies for our models</a>.</p>

<p>Typically, for reasons that we won't cover here, the preferred starting point is to use views. We only go
    for tables or incremental materializations if there are good reasons to.</p>

<p>If you have a model with terrible performance, it's possible that the fault doesn't sit with the
    model itself, but rather with an upstream model. Let me give an example.</p>

<p>Imagine we have a situation with three models:</p>

<ul>
    <li><code>stg_my_simple_model</code>: a model with super simple logic and small data</li>
    <li><code>stg_my_crazy_model</code>: a model with a crazy complex query and lots of data</li>
    <li><code>int_my_dependant_model</code>: an int model that reads from both previous models</li>
</ul>

<p>The staging models are set to materialize as views, and the int model is set to materialize as a
    table.</p>

<p>Because the two staging models materialize as views, every time you run
    <code>int_my_dependant_model</code> you will also have to execute the queries of
    <code>stg_my_simple_model</code> and <code>stg_my_crazy_model</code>. If the upstream view models are
    fast, this is not an issue of any kind. But if one of them runs a heavy query, it could be.</p>

<p>The point is, you might notice that <code>int_my_dependant_model</code> takes 600 seconds to run and
    think there's something wrong with it, when actually the fault sits with <code>stg_my_crazy_model</code>,
    which perhaps is taking 590 seconds out of the 600.</p>

<p>How can materializations solve this? Well, if <code>stg_my_crazy_model</code> were materialized as a table
    instead of as a view, whenever you ran <code>int_my_dependant_model</code> you would simply read from a
    table with pre-populated results, instead of having to run the <code>stg_my_crazy_model</code> query
    each time. Typically, reading the results will be much faster than running the whole query. So, in
    summary, by making <code>stg_my_crazy_model</code> materialize as a table, you can fix your performance
    issue in <code>int_my_dependant_model</code>.</p>
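<p>In dbt, changing an upstream model's materialization is a one-line config. For example, at the top of the
    hypothetical <code>stg_my_crazy_model.sql</code> (this uses dbt's standard <code>config</code> macro; the
    same setting can also live in <code>dbt_project.yml</code>):</p>
<pre><code>{{ config(materialized="table") }}

select ...</code></pre>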
<h3>5. Switch the model's materialization to <code>incremental</code></h3>

<blockquote>Make the processing of the table happen in small batches instead of on all the data, to make it
    more manageable.</blockquote>

<p>Imagine we want to count how many bookings were created each month.</p>

<p>As time passes, more and more months and more and more bookings appear in our history, making the size of
    this problem ever increasing. But then again, once a month has finished, we shouldn't need to go back
    and revisit history: what's done is done, and only the ongoing month is relevant, right?</p>

<p><a href="https://docs.getdbt.com/docs/build/incremental-models" target="_blank"
    rel="noopener noreferrer">dbt offers a materialization strategy named <code>incremental</code></a>,
    which allows you to work on only a subset of data. This means that every time you run
    <code>dbt run</code>, your model only works on a certain part of the data, and not all of it. If the
    nature of your data and your needs allow isolating each run to a small part of all upstream data, this
    strategy can wildly improve performance.</p>
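<p>To give a flavour of what this looks like, here is a minimal sketch of an incremental monthly-count model
    (the table and column names are made up; <code>is_incremental()</code> and <code>this</code> are standard
    dbt constructs):</p>
<pre><code>{{ config(materialized="incremental", unique_key="booking_month") }}

select
    date_trunc('month', created_at) as booking_month,
    count(*) as n_bookings
from {{ ref("stg_bookings") }}
{% if is_incremental() %}
-- on regular runs, only reprocess from the latest month already in the table
where created_at >= (select max(booking_month) from {{ this }})
{% endif %}
group by 1</code></pre>
<p>On a full refresh the <code>where</code> clause is skipped and all history is processed; on regular runs
    only the recent rows are read, and the existing row for the ongoing month is replaced via the
    <code>unique_key</code>.</p>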
<p>Explaining the inner details of <code>incremental</code> goes beyond the scope of this page. You can
    check the official docs from <code>dbt</code> (<a
    href="https://docs.getdbt.com/docs/build/incremental-models" target="_blank"
    rel="noopener noreferrer">here</a>), ask the team for support, or check some of the incremental
    models that we already have in our project and use them as references.</p>

<p>Note that <code>incremental</code> strategies make life way harder than the simple <code>view</code>
    or <code>table</code> ones, so only pick this up if it's truly necessary. Don't make models incremental
    without trying other optimizations first, or simply because you realise that you <em>could</em> use it
    in a specific model.</p>

<h3>6. End of the line: general optimization</h3>

<p>The final tip is not really a tip. The above five things are the easy-peasy, low-hanging-fruit stuff that
    you can try. This doesn't mean that there isn't more you can do, just that I don't know of more
    simple stuff you can try without deep knowledge of how Postgres works underneath and a willingness to
    get your hands <em>real</em> dirty.</p>

<p>If you've reached this point and your model is still performing poorly, you either need to put your Data
    Engineer hat on and really deepen your knowledge… or call Pablo.</p>
<h3 id="bonus">Bonus: how to make sure you didn't screw up and change the output of the model</h3>

<p>What we're discussing in this guide is making refactors purely for the sake of performance, without
    changing the output of the given model. We simply want to make the model faster, not change what data it
    generates.</p>

<p>That being the case, and considering the complexity of the strategies presented here, being afraid
    that you messed up and accidentally changed the output of the model is a very reasonable fear to have.
    That's the kind of mistake we definitely want to avoid.</p>

<p>Checking this manually can be a PITA and very time consuming, which doesn't help at all.</p>

<p>To make your life easier, I'm going to show you a little trick.</p>
<h4>Hashing tables and comparing them</h4>

<p>Here's a snippet you can run to check whether any pair of tables has <em>exactly</em>
    the same contents. Emphasis on exactly: changing the slightest bit of content will be detected.</p>

<pre><code>SELECT md5(array_agg(md5((t1.*)::varchar))::varchar)
FROM (
    SELECT *
    FROM my_first_table
    ORDER BY &lt;whatever field is unique&gt;
) AS t1;

SELECT md5(array_agg(md5((t2.*)::varchar))::varchar)
FROM (
    SELECT *
    FROM my_second_table
    ORDER BY &lt;whatever field is unique&gt;
) AS t2;</code></pre>
<p>How this works: you execute the two queries, and each returns a single value. Some hexadecimal
    gibberish.</p>

<p>If the output of the two queries is identical, their contents are identical. If they differ,
    something differs between the two tables.</p>

<p>If you don't understand how this works, and you don't care, that's fine. Just use it.</p>

<p>If not knowing does bother you, go down the rabbit holes of hash functions and deterministic
    serialization.</p>
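<p>A related trick: if the hashes differ and you want to see <em>which</em> rows changed, a symmetric
    difference between the two tables lists the offending rows. This is plain standard SQL (same fictional
    table names as above):</p>
<pre><code>(SELECT * FROM my_first_table
 EXCEPT
 SELECT * FROM my_second_table)
UNION ALL
(SELECT * FROM my_second_table
 EXCEPT
 SELECT * FROM my_first_table);</code></pre>
<p>An empty result means the two tables contain the same set of rows. Note that, unlike the hash trick,
    <code>EXCEPT</code> removes duplicates, so it won't catch differences in duplicate counts.</p>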
<h4>Including this in your refactoring workflow</h4>

<p>Right, now you know how to make sure that two tables are identical.</p>

<p>This is dramatically useful for your optimization workflow. You can now simply:</p>

<ul>
    <li>Keep the original model.</li>
    <li>Create a copy of it, which is the one you will be working on (the working copy).</li>
    <li>Prepare the magic query to check that their contents are identical.</li>
    <li>From this point on, you can stay in this loop for as long as you want/need:
        <ul>
            <li>Run the magic query to ensure you start from a same-output state.</li>
            <li>Modify the working copy model to attempt whatever optimization thingie you wanna try.</li>
            <li>Once you are done, run the magic query again.</li>
            <li>If the output is not the same anymore, you screwed up. Start again and avoid whatever
                mistake you made.</li>
            <li>If the output is still the same, you didn't change the model output. Either keep
                on optimizing or call it a day.</li>
        </ul>
    </li>
    <li>Finally, just copy the working copy's code over into the old model and remove the working copy.
    </li>
</ul>

<p>I hope that helps. I also recommend running the loop as frequently as possible. The fewer things you
    change between executions of the magic query, the easier it is to tell what caused errors if they
    appear.</p>
<hr>

<p><a href="../index.html">back to home</a></p>

</section>

</main>

</body>

</html>