My tips and tricks when using Postgres as a DWH

In November 2023, I joined Superhog (now called Truvi) to start the Data team. As part of that, I
also drafted and deployed the first version of its data platform.

The context led me to choose Postgres for our DWH. In a time of Snowflakes, BigQueries and Redshifts,
this might surprise some. But I can confidently say Postgres has done a great job for us, and I'd even
dare to say it has provided a better experience than other, more trendy alternatives could have. I'll
jot down my rationale for picking Postgres one of these days.

Back to the topic: Postgres is not intended to act as a DWH, so using it as such might feel a bit hacky
at times. There are multiple ways to make your life better with it, as well as related tools and
practices that you might enjoy, which I'll try to list here.

Use unlogged tables
The Write-Ahead Log (WAL) is active by default for the tables you create, and for good reason. But in
the context of an ELT DWH, it is probably a good idea to deactivate it by making your tables unlogged.
Unlogged tables will give you much faster writes (roughly, twice as fast), which makes data loading and
transformation jobs inside your DWH considerably quicker.

You pay a price for this with a few trade-offs, the most notable being that if your Postgres server
crashes, the contents of the unlogged tables will be lost. But, again, if you have an ELT DWH, you can
survive that by running a backfill. At Truvi, we made the decision to keep the landing area of our DWH
logged, and everything else unlogged. This means that if we experienced a crash (which still hasn't
happened, btw), we would recover by running a full-refresh dbt run.
If you are using dbt, you can easily apply this by adding this bit to your dbt_project.yml:

models:
  +unlogged: true

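If you're not on dbt, or you want to toggle a specific table, plain SQL does the trick too. A minimal
sketch (the table and columns are made up for illustration):

-- Create a table as unlogged from the start:
CREATE UNLOGGED TABLE int_orders_scratch (
    order_id   bigint,
    amount     numeric,
    created_at timestamptz
);

-- Or toggle an existing table (this rewrites the table, so it can take a while):
ALTER TABLE int_orders_scratch SET UNLOGGED;
ALTER TABLE int_orders_scratch SET LOGGED;
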
Tuning your server's parameters
Postgres has many parameters you can fiddle with, with plenty of chances to either improve or destroy
your server's performance. Postgres ships with default values for them, which are almost surely not the
optimal ones for your needs, especially if you are going to use it as a DWH. Simple changes like
adjusting work_mem will do wonders to speed up some of your heavier queries.

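To give you an idea of what adjusting it looks like, here's a small sketch; the values are illustrative,
not a recommendation for your server:

-- Per-session override, handy for a heavy transformation job:
SET work_mem = '256MB';

-- Or change the server-wide default and reload the configuration
-- (applies without a restart; new sessions pick it up):
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();
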
There are many parameters to get familiar with, and proper adjustment must be done taking your specific
context and needs into account. If you have no clue at all, this little web app can give you some
suggestions you can start from.

Running VACUUM ANALYZE right after building your tables
Out of the box, Postgres will automatically run VACUUM and ANALYZE jobs. The thresholds that determine
when each of those gets triggered can be adjusted with a few server parameters. If you follow an ELT
pattern, re-building your non-staging tables will most surely cause Postgres to run them.

But there's a detail that is easy to overlook. Postgres' automatic triggers will kick those jobs off
quite fast, but not right after you build each table. This poses a performance issue: if the
intermediate sections of your DWH have tables that build upon tables, rebuilding a table and then
rebuilding a dependent table without an ANALYZE on the first one in between might hurt you.
Let me describe this with an example, because this one is a bit of a tongue twister: let's assume we
have tables int_orders and int_order_kpis. int_orders holds all of our orders, and int_order_kpis
derives some KPIs from them. Naturally, you will first materialize int_orders from some upstream
staging tables, and once that is complete, you will use its contents to build int_order_kpis.

Having int_orders ANALYZE-d before you start building int_order_kpis is highly beneficial for the
performance of that build. Why? Because having perfectly updated statistics and metadata on int_orders
will help Postgres' query optimizer better plan the query needed to materialize int_order_kpis. This
can improve performance by orders of magnitude in some queries, for example by allowing Postgres to
pick the right kind of join strategy for the specific data you have.

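In raw SQL terms, the pattern I'm describing looks roughly like this (the SELECTs and the stg_orders
table are made up, they're just there to show where the ANALYZE goes):

-- Build the upstream table first (illustrative columns):
CREATE TABLE int_orders AS
SELECT order_id, customer_id, amount, created_at
FROM stg_orders;

-- Refresh statistics before anything builds on top of it.
-- Note: VACUUM cannot run inside a transaction block.
VACUUM ANALYZE int_orders;

-- Now the planner has fresh stats on int_orders for this query:
CREATE TABLE int_order_kpis AS
SELECT customer_id,
       count(*)    AS n_orders,
       sum(amount) AS total_amount
FROM int_orders
GROUP BY customer_id;
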
Now, will Postgres auto VACUUM ANALYZE the freshly built int_orders before you start building
int_order_kpis? Hard to tell. It depends on how you build your DWH, and how you've tuned your server's
parameters. And the most dangerous bit is that you're not in full control: it can be that sometimes it
happens, and other times it doesn't. Flaky and annoying. Some day I'll write a post on how this
behaviour drove me mad for two months because it made a model sometimes build in a few seconds, and
other times in >20min.

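By the way, if you want to check whether autovacuum or autoanalyze has actually visited a table
recently, Postgres records it in pg_stat_user_tables. A quick sketch:

SELECT schemaname,
       relname,
       last_autovacuum,
       last_autoanalyze,
       n_dead_tup
FROM pg_stat_user_tables
WHERE relname IN ('int_orders', 'int_order_kpis');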

My advice is to make sure you always VACUUM ANALYZE right after building your tables. If you're using
dbt, you can easily achieve this by adding this to your project's dbt_project.yml:

models:
  +post-hook:
    - sql: "VACUUM ANALYZE {{ this }}"
      transaction: false
      # ^ transaction: false is needed because VACUUM cannot run inside a transaction block.
      # This makes dbt run a VACUUM ANALYZE on each model right after building it.
      # It's pointless for views, but it doesn't matter because Postgres fails
      # silently without raising an unhandled exception.

Monitor queries with pg_stat_statements
pg_stat_statements is an extension that nowadays ships with Postgres by default. If activated, it will
log info on the queries executed in the server, which you can check afterward. This includes many
details; how frequently each query gets called and its min, max and mean execution times are probably
the ones you care about the most. Looking at those allows you to find queries that take a long time
each time they run, and queries that get run a lot.

Another important piece of info that gets recorded is who ran the query. This is helpful because, if
you use database users in a smart way, it can help you isolate expensive queries across different use
cases or areas. For example, if you use different users to build the DWH and to give your BI tool read
access (you do that... right?), you can easily tell apart dashboard-related queries from internal, DWH
transformation ones. Another example could be internal reporting vs embedded analytics in your product:
you might have stricter performance SLAs for product-embedded, customer-facing queries than for
internal dashboards. Using different users and pg_stat_statements makes it possible for you to dissect
performance issues in those separate areas independently.

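Getting it going and pulling out the top offenders looks roughly like this. Note that the extension
must be listed in shared_preload_libraries (which requires a restart), and the exact column names vary
a bit between Postgres versions; this sketch assumes a recent one:

-- In postgresql.conf, then restart the server:
--   shared_preload_libraries = 'pg_stat_statements'

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Slowest queries on average, with who ran them (times are in milliseconds):
SELECT r.rolname,
       s.calls,
       s.mean_exec_time,
       s.max_exec_time,
       s.query
FROM pg_stat_statements AS s
JOIN pg_roles AS r ON r.oid = s.userid
ORDER BY s.mean_exec_time DESC
LIMIT 20;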

Dalibo's wonderful execution plan visualizer
Sometimes you'll have some nasty query you just need to sit down with and optimize. In my experience,
in a DWH this ends up happening with queries that involve many large tables in sequential joining and
aggregation steps (as in, you join a few tables, group to some granularity, join some more, group
again, etc.).

You can get the query's real execution details with EXPLAIN ANALYZE, but the output's readability is on
par with Morse-encoded regex patterns. I always got headaches dealing with it until I came across
Dalibo's execution plan visualizer. You can paste the output of EXPLAIN ANALYZE there and see the query
execution presented as a diagram. No amount of words will portray accurately how awesome the UX is, so
I encourage you to try the tool with some nasty query and see for yourself.

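For reference, getting the plan is as simple as prefixing your query; the one below is a trivial
stand-in (your nasty query goes there instead), and BUFFERS is optional but usually worth including:

EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, count(*) AS n_orders
FROM int_orders
GROUP BY customer_id;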

Local dev env + Foreign Data Wrapper
One of the awesome things about using Postgres is how trivial it is to spin up an instance. This makes
goofing around much simpler than in setups where a new instance means paperwork, $$$, etc. Data team
members at Truvi have a dockerized Postgres running on their laptops that they can use when developing
on our DWH dbt project. In the early days, you could grab some production dump with some subset of
tables from our staging layer and run significant chunks of our dbt DAG on your laptop if you were
patient. A few hundred models later, this became increasingly difficult and finally impossible.

Luckily, we came across Postgres' Foreign Data Wrapper. There's quite a bit to it, but to keep it short
here, just be aware that FDW allows you to make a Postgres server give access to tables that live in a
different Postgres server while pretending they are local. So, you query table X in Postgres server A,
even though table X is actually stored in Postgres server B. But your query works just the same as if
it were a genuine local table.

Setting this up is fairly trivial, and it has allowed our dbt project contributors to execute hybrid
dbt runs where some data and tables are local to their laptop, whereas some upstream data is read from
the production server. The approach has been great so far, enabling them to actually test models in a
convenient way before committing them to master.

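For the curious, the setup boils down to a handful of statements on the local server. A sketch with
made-up server names, schemas and credentials:

CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Point the local server at the production DWH (illustrative connection details):
CREATE SERVER prod_dwh
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'prod-dwh.internal', port '5432', dbname 'dwh');

-- Map the local user to a read-only user on the remote side:
CREATE USER MAPPING FOR CURRENT_USER
  SERVER prod_dwh
  OPTIONS (user 'readonly_dev', password 'secret');

-- Expose the remote staging tables locally, as if they were ours:
CREATE SCHEMA IF NOT EXISTS staging_remote;
IMPORT FOREIGN SCHEMA staging
  FROM SERVER prod_dwh
  INTO staging_remote;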