# Busy man's guide to optimizing dbt model performance
You have a `dbt` model that takes ages to run in production. For some very valid reason, this is a problem.
This is a small reference guide on things you can try. I suggest you try them from start to end, since they are sorted in descending order of value/complexity ratio.
Before you start working on a model, you might want to check [the bonus guide at the bottom](Busy%20man%E2%80%99s%20guide%20to%20optimizing%20dbt%20models%20performa%20b0540bf8fa0a4ca5a6220b9d8132800d.md) to learn how to make sure you don't change the outputs of a model while refactoring it.
If you've tried everything here and things still don't work, don't hesitate to call Pablo.
## 1. Is your model *really* taking too long?
> Before you optimize a model that is taking too long, make sure it actually takes too long.
>
The very first step is to really assess if you do have a problem.
We run our DWH on a Postgres server, and Postgres is a complex system. Postgres is doing many things at all times and it's very stateful, which means you will pretty much never see *exactly* the same performance twice for a given query.
Before going crazy optimizing, I would advise running the model or the entire project a few times and observing the behaviour. It might be that *some day* it took very long for some reason, but usually, it runs just fine.
You also might want to do this at a moment when there's little activity in the DWH, like very early or late in the day, so that other users' activity doesn't pollute your observations.
If this is a model that is already being run regularly, we can also leverage the statistics collected by the `pg_stat_statements` Postgres extension to check the min, avg, and max run times for it. Ask Pablo to get this.
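As a rough sketch of what that lookup can look like (assuming the extension is installed and a Postgres 13+ column layout; older versions name these columns `min_time`, `mean_time`, and `max_time`, and the `'%my_model%'` filter is a hypothetical placeholder for your model's table name):

```sql
select
    calls,
    round(min_exec_time::numeric, 1)  as min_ms,
    round(mean_exec_time::numeric, 1) as avg_ms,
    round(max_exec_time::numeric, 1)  as max_ms,
    query
from pg_stat_statements
where query ilike '%my_model%'  -- placeholder: match on your model's name
order by mean_exec_time desc
limit 10;
```

If min and avg look fine and only max is scary, you probably hit a one-off bad day rather than a structurally slow model.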
## 2. Reducing the amount of data
> Make your query only bring in the data it needs, and not more. Reduce the amount of data as early as possible.
>
This is a simple optimization trick that can be applied in many places and is easy to pull off.
The two holy devils of slow queries are large amounts of data and monster lookups/sorts. Both can be drastically reduced by simply reducing the amount of data that goes into the query, typically by applying some smart `WHERE` or creative conditions on a `JOIN` clause. This can be either done in your basic CTEs where you read from other models, or in the main `SELECT` of your model.
Typically, try to make this as *early* as possible in the model. Early here refers to the steps of your query. In your queries, you will typically:
- read a few tables,
- do some `SELECTs`
- then do more crazy logic downstream with more `SELECTs`
- and the party goes on for as long and complex as your case is
Reducing the amount of data at the end is pointless. You will still need to read a lot of stuff early and have monster `JOIN`s, window functions, `DISTINCT`s, etc. Ideally, you want to do it when you first access an upstream table. If not there, then as early as possible within the logic.
The specifics of how to apply this are absolutely query dependent, so I can't give you magic instructions for the query you have at hand. But let me illustrate the concept with an example:
### Only hosts? Then only hosts
You have a table `stg_my_table` with a lot of data, let's say 100 million records, and each record has the id of a host. In your model, you need to join these records with the host user data to get some columns from there. So right now your query looks something like this (tables fictional, this is not how things look in the DWH):
```sql
with
stg_my_table as (select * from {{ ref("stg_my_table") }}),
stg_users as (select * from {{ ref("stg_users") }})
select
...
from stg_my_table t
left join stg_users u
on t.id_host_user = u.id_user
```
At the time I'm writing this, the real user table in our DWH has around 600,000 records. This means that:
- The CTE `stg_users` will need to fetch 600,000 records, with all their data, and store them.
- Then the left join will have to join 100 million records from `my_table` with the 600,000 user records.
Now, this is not working for you because it takes ages. We can easily improve the situation by applying the principle of this section: reducing the amount of data.
Our user table in the DWH has both hosts and guests. Actually, it has ~1,000 hosts and everything else is just guests. This means that:
- We're fetching around 599,000 guest records whose details we don't care about at all.
- Every time we join a record from `my_table`, we do so against 600,000 user records when we only truly care about 1,000 of them.
Stupid, isn't it?
Well, imagining that our fictional `stg_users` table had a field called `is_host`, we can rewrite the query this way to get exactly the same result in a fraction of the time:
```sql
with
stg_my_table as (select * from {{ ref("stg_my_table") }}),
stg_users as (
    select *
    from {{ ref("stg_users") }}
    where is_host = true
)
select
...
from stg_my_table t
left join stg_users u
on t.id_host_user = u.id_user
```
It's simple to understand: the CTE will now only fetch the 1,000 records related to hosts, which means we save on both fetching that data and on having a much smaller join operation against `stg_my_table` downstream.
## 3. Inlining CTEs
> Replace CTEs with inline references to avoid optimization fences.
>
This one is a bit more brainy. I'll split it into three parts: what it is, why it works, and when it's NOT a good option.
### What is it
As per our agreed good practices when building models, we always include references to upstream models we depend on as CTEs at the top of the file. So our models tend to look like this:
```sql
with
stg_some_table as (select * from {{ ref("stg_some_table") }}),
stg_some_other_table as (select * from {{ ref("stg_some_other_table") }}),
some_intermediate_thingie as (
select
...
from stg_some_table
where
...
group by
...
)
select
...
from stg_some_table st
left join stg_some_other_table sot
on st.an_id = sot.an_id
left join some_intermediate_thingie sit
on sot.another_id = sit.you_guessed_it_its_another_id
```
To inline the CTEs means to replace some or all of the CTEs in the model with direct references. For example, a first level of this could be to simply remove the first two CTEs:
```sql
with
some_intermediate_thingie as (
select
...
from {{ ref("stg_some_table") }}
where
...
group by
...
)
select
...
from {{ ref("stg_some_table") }} st
left join {{ ref("stg_some_other_table") }} sot
on st.an_id = sot.an_id
left join some_intermediate_thingie sit
on sot.another_id = sit.you_guessed_it_its_another_id
```
Or I could go all the way and remove all CTEs by using a subquery:
```sql
select
...
from {{ ref("stg_some_table") }} st
left join {{ ref("stg_some_other_table") }} sot
on st.an_id = sot.an_id
left join (
select
...
from {{ ref("stg_some_table") }}
where
...
group by
...
) sit
on sot.another_id = sit.you_guessed_it_its_another_id
```
So, inlining is as simple as that: go around destroying CTEs and placing `ref`s, subqueries, and whatever else is needed to keep the query result the same.
Inlining breaks our convention around how we write our models, and makes them harder to read. Never resort to inlining from scratch: always build models with CTEs, and only apply inlining if it's critical for performance. And by critical, I truly mean critical. Don't apply inlining to a 20-second model to make it 15 seconds if nobody cares about those 5 seconds.
### Why does inlining work
CTEs can become optimization fences for Postgres. But what the hell is that?
If you write a main select with many subqueries, Postgres will try to play smart games with it to make it as fast as possible. For instance, if it finds that a `where` condition you placed in the outermost `select` statement could be pushed down into a subquery to make things faster, it will go ahead and do so (if this rings a bell, yes, this is the same as principle #2 of this guide; Postgres will at times try to do it automatically for you).
The issue is that sometimes, with CTEs, Postgres refuses to play these tricks. Instead, it will commit to executing your CTE exactly as written, without any consideration for how its output is used later. This means missed opportunities to make the query faster. This is what we call an optimization fence.
So, why this works should now be obvious: by throwing away the CTE and doing the same thing without it, you allow Postgres to leverage more optimization strategies.
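Depending on your Postgres version, you may not even need to restructure the query: since Postgres 12, the planner inlines simple single-reference CTEs by default, and you can explicitly ask it to tear down the fence with the `NOT MATERIALIZED` keyword. A minimal sketch (table name fictional, plain SQL rather than dbt):

```sql
-- NOT MATERIALIZED asks the planner to inline the CTE into the outer query,
-- removing the optimization fence while keeping the CTE structure readable.
with stg_users as not materialized (
    select *
    from raw_users
)
select *
from stg_users
where is_host = true;
```

This gives you the readability of CTEs with the planner behaviour of inlining, so it is worth trying before manually rewriting a model into subqueries.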
### When it's not a good idea
Most times, CTEs won't be the cause of your issues. So, my advice when attempting this strategy is to simply try out and measure the performance of your query with and without inlining. If your results make it obvious that inlining is not helping (or is hurting), then simply revert back to CTEs.
There is also one special situation where removing CTEs is probably a terrible idea. If you have:
- A CTE that does some very costly query.
- And that CTE is referenced in many other parts of the model multiple times.
then that CTE is probably helping, not hurting performance. This is because Postgres will compute the CTE only once and allow all downstream operations to read from the temp result, whereas if you inline it, it might end up repeating the costly query multiple times.
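In that situation, on Postgres 12+, you can make the fence explicit with the `MATERIALIZED` keyword, so the costly query is guaranteed to run once and its result be reused. A sketch with fictional tables:

```sql
-- MATERIALIZED forces Postgres to compute the CTE once, store the result,
-- and let every downstream reference read from that temporary result.
with expensive_agg as materialized (
    select id_user, count(*) as n_bookings
    from raw_bookings
    group by id_user
)
select a.id_user, a.n_bookings, b.n_bookings as n_bookings_again
from expensive_agg a
join expensive_agg b on a.id_user = b.id_user;
```

So the rule of thumb: one reference, consider inlining; many references to a costly CTE, keep the fence.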
## 4. Change upstream materializations
> Materialize upstream models as tables instead of views to reduce computation on the model at hand.
>
Going back to basics, dbt offers [multiple materialization strategies for our models](https://docs.getdbt.com/docs/build/materializations).
Typically, for reasons that we won't cover here, the preferred starting point is to use views. We only go for table or incremental materializations if there are good reasons to.
If you have a model with terrible performance, it's possible that the fault doesn't sit with the model itself, but rather with an upstream model. Let me give an example.
Imagine we have a situation with three models:
- `stg_my_simple_model`: a model with super simple logic and small data
- `stg_my_crazy_model`: a model with a crazy complex query and lots of data
- `int_my_dependant_model`: an int model that reads from both previous models.
- Where the staging models are set to materialize as views and the int model is set to materialize as a table.
Because the two staging models are set to materialize as views, every time you run `int_my_dependant_model` you will also execute the queries of `stg_my_simple_model` and `stg_my_crazy_model`. If the upstream view models are fast, this is not an issue of any kind. But if a model is a heavy query, this could be an issue.
The point is, you might notice that `int_my_dependant_model` takes 600 seconds to run and think there's something wrong with it, when actually the fault sits with `stg_my_crazy_model`, which perhaps is taking 590 of those 600 seconds.
How can materializations solve this? Well, if `stg_my_crazy_model` were materialized as a table instead of a view, whenever you ran `int_my_dependant_model` you would simply read from a table with pre-populated results, instead of running the `stg_my_crazy_model` query each time. Typically, reading the results will be much faster than running the whole query. So, in summary, by making `stg_my_crazy_model` materialize as a table, you can fix your performance issue in `int_my_dependant_model`.
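In dbt, that change is a one-liner in the model file (model name fictional, following dbt's standard `config` macro):

```sql
-- models/staging/stg_my_crazy_model.sql
{{ config(materialized="table") }}

select
...
```

Keep in mind the trade-off: the table now only refreshes when `stg_my_crazy_model` itself is run, so downstream models read data as of its last run instead of live data.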
## 5. Switch the model's materialization to `incremental`
> Make the processing of the table happen in small batches instead of on all data to make it more manageable.
>
Imagine we want to count how many bookings were created each month.
As time passes, more and more months and bookings appear in our history, making the size of this problem ever-increasing. But then again, once a month has finished, we shouldn't need to go back and revisit history: what's done is done, and only the ongoing month is relevant, right?
[dbt offers a materialization strategy named](https://docs.getdbt.com/docs/build/incremental-models) `incremental`, which allows you to only work on a subset of data. This means that every time you run `dbt run`, your model only processes a certain part of the data, and not all of it. If the nature of your data and your needs allow isolating each run to a small part of all upstream data, this strategy can wildly improve performance.
Explaining the inner details of `incremental` goes beyond the scope of this page. You can check the official docs from `dbt` ([here](https://docs.getdbt.com/docs/build/incremental-models)), ask the team for support or check some of the incremental models that we already have in our project and use them as references.
Note that using the `incremental` strategy makes life way harder than the simple `view` or `table` ones, so only pick it up if it's truly necessary. Don't make models incremental without trying other optimizations first, or simply because you realise that you *could* use it in a specific model.
![dbt's official docs (wisely) warning you of the dangers of incremental.](image%2039.png)
dbt's official docs (wisely) warning you of the dangers of incremental.
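To give a flavour of the shape these models take, here is a minimal sketch loosely following the pattern in dbt's docs (model and column names fictional; `is_incremental()` and `{{ this }}` are standard dbt macros):

```sql
{{ config(materialized="incremental") }}

select
    booking_id,
    created_at,
    ...
from {{ ref("stg_bookings") }}

{% if is_incremental() %}
-- on incremental runs, only process rows newer than what the
-- existing table already contains
where created_at > (select max(created_at) from {{ this }})
{% endif %}
```

On the first run (or with `--full-refresh`), dbt builds the whole table; on subsequent runs, only the new slice is computed and appended.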
## 6. End of the line: general optimization
The final tip is not really a tip. The above five things are the easy-peasy, low-hanging-fruit stuff that you can try. This doesn't mean there isn't more you can do, just that I don't know of more simple stuff to try without deep knowledge of how Postgres works underneath and a willingness to get your hands *real* dirty.
If you've reached this point and your model is still performing poorly, you either need to put your Data Engineer hat on and really deepen your knowledge… or call Pablo.
## Bonus: how to make sure you didn't screw up and change the output of the model
The topic we are discussing in this guide is making refactors purely for the sake of performance, without changing the output of the given model. We simply want to make the model faster, not change what data it generates.
That being the case, and considering the complexity of the strategies we've presented here, being afraid that you messed up and accidentally changed the output of the model is a very reasonable fear to have. That's a kind of mistake we definitely want to avoid.
Doing this manually can be a PITA and very time-consuming, which doesn't help at all.
To make your life easier, I'm going to show you a neat little trick.
### Hashing tables and comparing them
I'll post a snippet of code here that you can run to check whether any pair of tables has *exactly* the same contents. Emphasis on exactly. Changing the slightest bit of content will be detected.
```sql
-- hash of the first table
SELECT md5(array_agg(md5((t1.*)::varchar))::varchar)
FROM (
    SELECT *
    FROM my_first_table
    ORDER BY <whatever field is unique>
) AS t1;

-- hash of the second table
SELECT md5(array_agg(md5((t2.*)::varchar))::varchar)
FROM (
    SELECT *
    FROM my_second_table
    ORDER BY <whatever field is unique>
) AS t2;
```
How this works: you execute the two queries, and each will return a single value. Some hexadecimal gibberish.
If the outputs of the two queries are identical, the tables' contents are identical. If they differ, something is different between the two tables.
If you don't understand how this works, and you don't care, that's fine. Just use it.
If not knowing does bother you, you should go down the rabbit holes of hash functions and deterministic serialization.
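If the hashes differ and you want to see *which* rows changed, a standard SQL set-difference query can list the offending rows (table names fictional; both tables must have the same columns). Note that `EXCEPT` collapses duplicate rows, so the hash check above stays the authoritative test:

```sql
-- rows present in one table but not the other, in either direction
(select * from my_first_table except select * from my_second_table)
union all
(select * from my_second_table except select * from my_first_table);
```

An empty result means no row-level differences were found; any rows returned point you straight at what your refactor changed.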
### Including this in your refactoring workflow
Right, so now you know how to make sure that two tables are identical.
This is dramatically useful for your optimization workflow. You can now simply:
- Keep the original model
- Create a copy of it, which is the one you will be working on (the working copy)
- Prepare the magic query to check their contents are identical
- From this point on, you can enter this loop for as long as you want/need:
- Run the magic query to ensure you start from same-output-state
- Modify the working copy model to attempt whatever optimization thingie you wanna try
- Once you are done, run the magic query again.
- If the output is not the same anymore, you screwed up. Start again and avoid whatever mistake you made.
- If the output is still the same, you didn't cause a change in the model output. Either keep on optimizing or call it a day.
- Finally, just copy over the working copy model code into the old one and remove the working copy.
I hope that helps. I also recommend running the loop as frequently as possible. The fewer things you change between executions of the magic query, the easier it is to figure out what caused errors if they appear.
![Donald Knuth, "[Structured Programming with go to Statements](http://web.archive.org/web/20130731202547/http://pplab.snu.ac.kr/courses/adv_pl05/papers/p261-knuth.pdf)"](image%2040.png)
Donald Knuth, "[Structured Programming with go to Statements](http://web.archive.org/web/20130731202547/http://pplab.snu.ac.kr/courses/adv_pl05/papers/p261-knuth.pdf)"