+ My tips and tricks when using Postgres as a DWH
+ In November 2023, I joined Superhog (now called Truvi) to start up the Data team. As part of that, I
+ also drafted and deployed the first version of its data platform.
+
+ The context led me to choose Postgres for our DWH. In a time of Snowflakes, BigQueries and Redshifts,
+ this might surprise some. But I can confidently say Postgres has done a great job for us, and I would
+ even dare to say it has provided a better experience than other, more trendy alternatives could have.
+ I'll jot down my rationale for picking Postgres one of these days.
+
+ Back to the topic: Postgres is not intended to act as a DWH, so using it as such might feel a bit hacky
+ at times. There are multiple ways to make your life better with it, as well as related tools and
+ practices that you might enjoy, which I'll try to list here.
+
+ Use unlogged tables
+ The Write-Ahead Log is active by default for the tables you create, and for good reason. But in the
+ context of an ELT DWH, it is probably a good idea to deactivate it by making your tables unlogged.
+ Unlogged tables will give you much faster writes (roughly, twice as fast), which will make data
+ loading and transformation jobs inside your DWH much quicker.
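+
+ If you are writing plain SQL rather than going through dbt, this is roughly what it looks like; the
+ table and columns here are made up for illustration:
+
+-- Create the table as unlogged from the start
+CREATE UNLOGGED TABLE stg_orders (
+    order_id    bigint,
+    customer_id bigint,
+    amount      numeric,
+    ordered_at  timestamptz
+);
+
+-- Or flip an existing table (SET LOGGED turns the WAL back on if you ever need it)
+ALTER TABLE stg_orders SET UNLOGGED;
+ALTER TABLE stg_orders SET LOGGED;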
+
+ You pay a price for this with a few trade-offs, the most notable being that if your Postgres server
+ crashes, the contents of the unlogged tables will be lost. But, again, if you have an ELT DWH, you can
+ survive that by running a backfill. At Truvi, we made the decision to have the landing area for our DWH
+ be logged, and everything else unlogged. This means that if we experienced a crash (which still hasn't
+ happened, btw), we would recover by running a full-refresh dbt run.
+ If you are using dbt, you can easily apply this by adding this bit to your dbt_project.yml:
+
+models:
+  +unlogged: true
+
+
+ Tuning your server's parameters
+ Postgres has many parameters you can fiddle with, with plenty of chances to either improve or destroy
+ your server's performance. Postgres ships with default values for them, which are almost surely not the
+ optimal ones for your needs, especially if you are going to use it as a DWH. Simple changes like
+ adjusting work_mem will do wonders to speed up some of your heavier queries.
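+
+ As a rough illustration of what adjusting it looks like; the values here are made up, and the right
+ ones depend on your RAM and on how many queries run concurrently:
+
+-- Per session, handy for a heavy transformation job
+SET work_mem = '256MB';
+
+-- Or server-wide, persisted in postgresql.auto.conf
+ALTER SYSTEM SET work_mem = '64MB';
+SELECT pg_reload_conf();  -- work_mem changes apply on reload, no restart needed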
+
+ There are many parameters to get familiar with, and proper adjustment must be done taking your specific
+ context and needs into account. If you have no clue at all, this little web app can give you some
+ suggestions you can start from.
+
+ Running VACUUM ANALYZE right after building your tables
+ Out of the box, Postgres will automatically run VACUUM and ANALYZE jobs for you. The thresholds that
+ determine when each of those gets triggered can be adjusted with a few server parameters. If you follow
+ an ELT pattern, re-building your non-staging tables will most likely cause Postgres to run them.
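+
+ For reference, these are the main knobs behind those triggers; the values in the comments are just the
+ stock defaults, not a recommendation:
+
+-- autovacuum launches a VACUUM on a table when its dead rows exceed
+--   autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples
+-- and an ANALYZE when its changed rows exceed
+--   autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples
+SHOW autovacuum_vacuum_threshold;      -- 50
+SHOW autovacuum_vacuum_scale_factor;   -- 0.2
+SHOW autovacuum_analyze_threshold;     -- 50
+SHOW autovacuum_analyze_scale_factor;  -- 0.1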
+
+ But there's a detail that is easy to overlook: Postgres' automatic triggers will kick in quite fast,
+ but not right after you build each table. This poses a performance issue. If the intermediate sections
+ of your DWH have tables that build upon other tables, rebuilding a table and then rebuilding a dependent
+ table without an ANALYZE having run on the first one can hurt you.
+ Let me describe this with an example, because this one is a bit of a tongue twister: let's assume we
+ have tables int_orders and int_order_kpis. int_orders holds all of our orders, and int_order_kpis
+ derives some KPIs from them. Naturally, first you will materialize int_orders from some upstream
+ staging tables, and once that is complete, you will use its contents to build int_order_kpis.
+
+
+ Having int_orders ANALYZE-d before you start building int_order_kpis is highly beneficial for the
+ build of int_order_kpis. Why? Because having up-to-date statistics and metadata on int_orders helps
+ Postgres' query planner better plan the query that materializes int_order_kpis. This can improve
+ performance by orders of magnitude in some queries, for example by allowing Postgres to pick the right
+ join strategy for the specific data you have.
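+
+ In plain SQL, the sequence I'm arguing for looks roughly like this, reusing the hypothetical stg_orders
+ table from the earlier sketch:
+
+CREATE UNLOGGED TABLE int_orders AS
+SELECT order_id, customer_id, amount
+FROM stg_orders;
+
+-- refresh planner statistics before anything builds on top of int_orders
+VACUUM ANALYZE int_orders;
+
+CREATE UNLOGGED TABLE int_order_kpis AS
+SELECT customer_id, count(*) AS n_orders, sum(amount) AS total_amount
+FROM int_orders
+GROUP BY customer_id;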
+
+ Now, will Postgres auto VACUUM ANALYZE the freshly built int_orders before you start building
+ int_order_kpis? Hard to tell. It depends on how you build your DWH, and on how you've tuned your
+ server's parameters. And the most dangerous bit is that you're not in full control: sometimes it
+ happens, and other times it doesn't. Flaky and annoying. Some day I'll write a post on how this
+ behaviour drove me mad for two months because it made a model sometimes build in a few seconds, and
+ other times take >20min.
+
+
+ My advice is to make sure you always VACUUM ANALYZE right after building your tables. If
+ you're using dbt, you can easily achieve this by adding this to your project's
+ dbt_project.yml:
+
+models:
+  +post-hook:
+    sql: "VACUUM ANALYZE {{ this }}"
+    transaction: false
+    # ^ This makes dbt run a VACUUM ANALYZE on each model right after building it.
+    # transaction: false is needed because VACUUM cannot run inside a transaction block.
+    # It's pointless for views, but that doesn't matter because Postgres fails
+    # silently without raising an unhandled exception.
+
+
+ Monitor queries with pg_stat_statements
+ pg_stat_statements is an extension that nowadays ships with Postgres by default. If activated, it will
+ record info on the queries executed in the server, which you can check afterwards. This includes many
+ details, with how frequently each query gets called and its min, max and mean execution times being the
+ ones you probably care about the most. Looking at those allows you to find queries that take long every
+ time they run, and queries that get run a lot.
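+
+ As a minimal sketch of how to look at that, assuming the extension is already in
+ shared_preload_libraries and you're on Postgres 13 or newer (older versions name the columns
+ total_time, mean_time, etc.):
+
+-- once per database
+CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
+
+-- queries that are slow on average
+SELECT calls,
+       round(mean_exec_time) AS mean_ms,
+       round(max_exec_time)  AS max_ms,
+       left(query, 80)       AS query_start
+FROM pg_stat_statements
+ORDER BY mean_exec_time DESC
+LIMIT 20;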
+
+ Another important piece of info that gets recorded is who ran the query. This is helpful because, if
+ you use users in a smart way, it can help you isolate expensive queries per use case or area. For
+ example, if you use different users to build the DWH and to give your BI tool read access (you do
+ that... right?), you can easily tell apart dashboard-related queries from internal DWH transformation
+ ones. Another example could be internal reporting vs embedded analytics in your product: you might have
+ stricter performance SLAs for product-embedded, customer-facing queries than for internal dashboards.
+ Using different users and pg_stat_statements makes it possible to dissect performance issues in those
+ separate areas independently.
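+
+ For instance, something along these lines breaks the workload down per role; the role names are
+ invented, use whichever ones you created:
+
+SELECT r.rolname,
+       sum(s.calls)                         AS calls,
+       round(sum(s.total_exec_time) / 1000) AS total_exec_s
+FROM pg_stat_statements s
+JOIN pg_roles r ON r.oid = s.userid
+WHERE r.rolname IN ('dbt_builder', 'bi_reader')  -- hypothetical users for dbt and the BI tool
+GROUP BY r.rolname
+ORDER BY total_exec_s DESC;
+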
+ Dalibo's wonderful execution plan visualizer
+ Sometimes you'll have some nasty query you just need to sit down with and optimize. In my experience, in
+ a DWH this ends up happening with queries that involve many large tables in sequential joining and
+ aggregation steps (as in, you join a few tables, group to some granularity, join some more, group again,
+ etc).
+
+ You can get the query's real execution details with EXPLAIN ANALYZE, but the output's readability is on
+ par with Morse-encoded regex patterns. I always had headaches dealing with them until I came across
+ Dalibo's execution plan visualizer. You can paste the output of EXPLAIN ANALYZE there and see the query
+ execution presented as a diagram. No amount of words will accurately portray how awesome the UX is, so
+ I encourage you to try the tool with some nasty query and see for yourself.
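+
+ In case it helps, this is roughly how I capture a plan to paste there; the extra options are a habit
+ rather than a requirement, and as far as I recall the tool accepts both the JSON and the default text
+ formats:
+
+EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT JSON)
+SELECT customer_id, count(*) AS n_orders
+FROM int_orders  -- the hypothetical table from the earlier example
+GROUP BY customer_id;
+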
+ Local dev env + Foreign Data Wrapper
+ One of the awesome things about using Postgres is how trivial it is to spin up an instance. This makes
+ goofing around much simpler than when setting up a new instance means paperwork, $$$, etc.
+ Data team members at Truvi have a dockerized Postgres running on their laptops that they can use when
+ they are developing on our DWH dbt project. In the early days, you could grab a production dump with
+ some subset of tables from our staging layer and run significant chunks of our dbt DAG on your laptop
+ if you were patient.
+ A few hundred models later, this became increasingly difficult and finally impossible.
+
+ Luckily, we came across Postgres' Foreign Data Wrapper. There's quite a bit to it, but to keep it short
+ here, just be aware that FDW allows a Postgres server to give access to tables that live in a different
+ Postgres server while pretending they are local. So, you query table X in Postgres server A, even
+ though table X is actually stored in Postgres server B, and your query works just the same as if it was
+ a genuine local table.
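+
+ For the record, the setup boils down to a handful of statements on the local server; the host, schema
+ and credentials below are invented:
+
+CREATE EXTENSION IF NOT EXISTS postgres_fdw;
+
+CREATE SERVER prod_dwh
+  FOREIGN DATA WRAPPER postgres_fdw
+  OPTIONS (host 'prod-dwh.internal', port '5432', dbname 'dwh');
+
+CREATE USER MAPPING FOR CURRENT_USER
+  SERVER prod_dwh
+  OPTIONS (user 'readonly_dev', password 'secret');
+
+-- expose production's staging schema as if it were local
+CREATE SCHEMA IF NOT EXISTS staging;
+IMPORT FOREIGN SCHEMA staging
+  FROM SERVER prod_dwh
+  INTO staging;
+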
+ Setting these up is fairly trivial, and it has allowed our dbt project contributors to execute hybrid
+ dbt runs where some data and tables are local to their laptop, while some upstream data is read from
+ the production server. The approach has been great so far, enabling them to conveniently test models
+ before committing them to master.
+
+
+