My tips and tricks when using Postgres as a DWH
In November 2023, I joined Superhog (now called Truvi) to start up the Data team. As part of that, I also drafted and deployed the first version of its data platform.
The context led me to choose Postgres for our DWH. In a time of Snowflakes, BigQueries and Redshifts, this might surprise some. But I can confidently say Postgres has done a great job for us, and I even dare to say it has provided a better experience than other, more trendy alternatives could have. I'll jot down my rationale for picking Postgres one of these days.
Back to the topic: Postgres is not intended to act as a DWH, so using it as such might feel a bit hacky at times. There are multiple ways to make your life with it better, as well as related tools and practices that you might enjoy, which I'll try to list here.
Use unlogged tables
The Write-Ahead Log (WAL) is active by default for the tables you create, and for good reasons. But in the context of an ELT DWH, it is probably a good idea to deactivate it by making your tables unlogged. Unlogged tables will give you much faster writes (roughly twice as fast), which will make data loading and transformation jobs inside your DWH much faster.

You pay a price for this with a few trade-offs, the most notable being that if your Postgres server crashes, the contents of the unlogged tables will be lost. But, again, if you have an ELT DWH, you can survive that by running a backfill. At Truvi, we made the decision to have the landing area of our DWH be logged, and everything else unlogged. This means that if we experienced a crash (which still hasn't happened, btw), we would recover by running a full-refresh dbt run.
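If you manage tables with plain SQL rather than through dbt, the same idea is a one-liner. A minimal sketch, with a made-up table name:

-- Create a table that skips the WAL from the start.
CREATE UNLOGGED TABLE my_orders (
    order_id  bigint PRIMARY KEY,
    payload   jsonb,
    loaded_at timestamptz DEFAULT now()
);

-- Or flip an existing table back and forth. Note that SET LOGGED has to
-- write the whole table into the WAL, so it can take a while on big tables.
ALTER TABLE my_orders SET UNLOGGED;
ALTER TABLE my_orders SET LOGGED;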
If you are using dbt, you can easily apply this by adding this bit to your dbt_project.yml:
models:
  # Apply to all models in the project.
  +unlogged: true
Tuning your server's parameters
Postgres has many parameters you can fiddle with, and plenty of chances to either improve or destroy your server's performance.
Postgres ships with default values for them, which are almost surely not the optimal ones for your needs, especially if you are going to use it as a DWH. Simple changes like adjusting work_mem will do wonders to speed up some of your heavier queries.
There are many parameters to get familiar with, and proper adjustment must be done taking your specific context and needs into account. If you have no clue at all, this little web app can give you some suggestions you can start from.
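To make the work_mem example concrete, here is a small sketch. The 256MB value is purely illustrative, not a recommendation; what is sensible depends on your RAM and on how many sessions run heavy queries at the same time.

-- Check the current value.
SHOW work_mem;

-- Raise it for the current session only, e.g. right before a heavy transformation.
SET work_mem = '256MB';

-- Or persist it server-wide (written to postgresql.auto.conf)
-- and reload the configuration so it takes effect.
ALTER SYSTEM SET work_mem = '256MB';
SELECT pg_reload_conf();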
Running VACUUM ANALYZE right after building your tables
Out of the box, Postgres will run VACUUM and ANALYZE jobs automatically. The thresholds that determine when each of those gets triggered can be adjusted with a few server parameters. If you follow an ELT pattern, re-building your non-staging tables will most surely cause Postgres to run them.
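For reference, these are the knobs I mean. Roughly, autovacuum analyzes a table once the number of changed rows exceeds autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * the table's row count (and the VACUUM side works the same way with its own pair of parameters). A quick sketch, with a hypothetical table name:

-- Inspect the current thresholds.
SHOW autovacuum_analyze_threshold;
SHOW autovacuum_analyze_scale_factor;
SHOW autovacuum_vacuum_threshold;
SHOW autovacuum_vacuum_scale_factor;

-- They can also be overridden per table if one of them needs special care.
ALTER TABLE some_big_table SET (autovacuum_analyze_scale_factor = 0.02);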
But there's a detail that is easy to overlook: these automatic triggers will kick in quite fast, but not right after you build each table. This poses a performance issue. If the intermediate sections of your DWH have tables that build upon other tables, rebuilding a table and then rebuilding a dependent table without an ANALYZE on the first one in between might hurt you.
Let me describe this with an example, because this one is a bit of a tongue twister: let's assume we have tables int_orders and int_order_kpis. int_orders holds all of our orders, and int_order_kpis derives some KPIs from them. Naturally, first you will materialize int_orders from some upstream staging tables, and once that is complete, you will use its contents to build int_order_kpis.
Having int_orders ANALYZE-d before you start building int_order_kpis is highly beneficial for the performance of building int_order_kpis. Why? Because having perfectly updated statistics and metadata on int_orders will help Postgres' query optimizer better plan the query needed to materialize int_order_kpis. This can improve performance by orders of magnitude in some queries, for example by allowing Postgres to pick the right kind of join strategy for the specific data you have.
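You can see this for yourself with EXPLAIN. A quick sketch (the query and the int_order_events table are made up for illustration):

-- Look at the plan Postgres would use for the downstream build.
EXPLAIN
SELECT o.order_id, count(*) AS n_events
FROM int_orders o
JOIN int_order_events e USING (order_id)
GROUP BY o.order_id;

-- Refresh the statistics, then run the EXPLAIN again: with fresh numbers the
-- row estimates, and often the join strategy, will be different.
ANALYZE int_orders;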
Now, will Postgres auto VACUUM ANALYZE the freshly built int_orders before you start building int_order_kpis? Hard to tell. It depends on how you build your DWH and on how you've tuned your server's parameters. And the most dangerous bit is that you're not in full control: sometimes it happens, and other times it doesn't. Flaky and annoying. Some day I'll write a post on how this behaviour drove me mad for two months because it made a model sometimes build in a few seconds, and other times in >20 min.
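If you ever need to check whether a table has been analyzed recently, the statistics views can tell you. For instance:

-- When was each table last (auto)vacuumed and (auto)analyzed?
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('int_orders', 'int_order_kpis');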
My advice is to make sure you always VACUUM ANALYZE right after building your tables. If you're using dbt, you can easily achieve this by adding this to your project's dbt_project.yml:
models:
  +post-hook:
    sql: "VACUUM ANALYZE {{ this }}"
    transaction: false
    # ^ This makes dbt run a VACUUM ANALYZE on each model right after building it.
    # It's pointless for views, but that doesn't matter because Postgres fails
    # silently without raising an unhandled exception.