<!DOCTYPE HTML>
<html>
<head>
<title>Pablo here</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="../styles.css">
</head>
<body>
<main>
<h1>
Hi, Pablo here
</h1>
<p><a href="../index.html">back to home</a></p>
<hr>
<section>
<h2>My tips and tricks when using Postgres as a DWH</h2>
<p>In November 2023, I joined Superhog (now called Truvi) to start its Data team. As part of that, I
also drafted and deployed the first version of its data platform.
</p>
<p>The context led me to choose Postgres for our DWH. In a time of Snowflakes, Bigqueries and Redshifts,
this might surprise some. But I can confidently say Postgres has done a great job for us, and I'd even
dare to say it has provided a better experience than other, more trendy alternatives could have. I'll
jot down my rationale for picking Postgres one of these days.</p>
<p>
Back to the topic: Postgres is not intended to act as a DWH, so using it as such might feel a bit hacky
at times. There are multiple ways to make your life better with it, as well as related tools and
practices that you might enjoy, which I'll try to list here.
</p>
<h3>Use <code>unlogged</code> tables</h3>
<p>The <a href="https://www.postgresql.org/docs/current/wal-intro.html" target="_blank"
rel="noopener noreferrer">Write Ahead Log</a> comes active by default for the tables you create, and
for good reasons. But in the context of an ELT DWH, it is probably a good idea to deactivate it by
making your tables <code>unlogged</code>. <a
href="https://www.crunchydata.com/blog/postgresl-unlogged-tables" target="_blank">Unlogged
tables</a> will provide you with much faster writes (roughly, twice as fast) which will make data
loading and transformation jobs inside your DWH much faster.
</p>
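<p>For reference, here's what that looks like in plain SQL. This is just a sketch; the table and column
    names are made up for illustration:</p>
<pre><code>
-- Create a table that skips the Write-Ahead Log entirely
CREATE UNLOGGED TABLE landing_orders (
    order_id  bigint,
    payload   jsonb,
    loaded_at timestamptz DEFAULT now()
);

-- An existing table can be switched in either direction
ALTER TABLE landing_orders SET LOGGED;
ALTER TABLE landing_orders SET UNLOGGED;
</code></pre>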
<p>You pay a price for this with a few trade-offs, the most notable being that if your Postgres server
crashes, <a href="https://www.postgresql.org/docs/current/sql-createtable.html#SQL-CREATETABLE-UNLOGGED"
target="_blank" rel="noopener noreferrer">the contents of the unlogged tables will be lost</a>. But,
again, if you have an ELT DWH, you can survive by running a backfill. At Truvi, we made the decision to
have the landing area for our DWH be logged, and everything else unlogged. This means if we experienced
a crash (which still hasn't happened, btw), we would recover by running a full-refresh dbt run.</p>
<p>If you are using dbt, you can easily apply this by adding this bit in your <code>dbt_project.yml</code>
:</p>
<pre><code>
models:
  +unlogged: true
</code></pre>
<h3>Tuning your server's parameters</h3>
<p><a href="https://www.postgresql.org/docs/current/runtime-config.html" target="_blank"
rel="noopener noreferrer">Postgres has many parameters you can fiddle with</a>, with plenty of
chances to either improve or destroy your server's performance.</p>
<p>Postgres ships with default values for them, which are almost surely not the optimal ones for
your needs, <em>especially</em> if you are going to use it as a DWH. Simple changes like adjusting the
<code>work_mem</code> will do wonders to speed up some of your heavier queries.
</p>
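<p>As an illustration, this is how you could raise <code>work_mem</code> for a single session or persist
    it server-wide; the value here is an arbitrary example, size it to your own RAM and concurrency:</p>
<pre><code>
-- For the current session only
SET work_mem = '256MB';

-- Or persist it (written to postgresql.auto.conf) and reload the config
ALTER SYSTEM SET work_mem = '256MB';
SELECT pg_reload_conf();
</code></pre>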
<p>There are many parameters to get familiar with and proper adjustment must be done taking your specific
context and needs into account. If you have no clue at all, <a href="https://pgtune.leopard.in.ua"
target="_blank" rel="noopener noreferrer">this little web app</a> can give you some suggestions you
can start from.
</p>
<h3>Running <code>VACUUM ANALYZE</code> right after building your tables</h3>
<p>Out of the box, Postgres will run
<code><a href="https://www.postgresql.org/docs/current/sql-vacuum.html" target="_blank" rel="noopener noreferrer">VACUUM</a></code>
and
<code><a href="https://www.postgresql.org/docs/current/sql-analyze.html" target="_blank" rel="noopener noreferrer">ANALYZE</a></code>
jobs <a href="https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM" target="_blank"
rel="noopener noreferrer">automatically</a>. The triggers that determine when each of those gets
triggered can be adjusted with a few server parameters. If you follow an ELT pattern, most surely
re-building your non-staging tables will cause Postgres to run them.
</p>
<p>But there's a detail that is easy to overlook: those automatic triggers fire quite fast, but not
right after you build each table. This poses a performance issue: if the intermediate sections of your
DWH have tables that build upon other tables, rebuilding a table and then rebuilding a dependent table
before the first one has been <code>ANALYZE</code>-d might hurt you.</p>
<p>Let me describe this with an example, because this one is a bit of a tongue twister: let's assume we have
tables <code>int_orders</code> and <code>int_order_kpis</code>. <code>int_orders</code> holds all of our
orders, and <code>int_order_kpis</code> derives some KPIs from them. Naturally, first you will
materialize <code>int_orders</code> from some upstream staging tables, and once that is complete, you
will use its contents to build <code>int_order_kpis</code>.
</p>
<p>
Having <code>int_orders</code> <code>ANALYZE</code>-d before you start building
<code>int_order_kpis</code> is highly beneficial for your performance in building
<code>int_order_kpis</code>. Why? Because having perfectly updated statistics and metadata on
<code>int_orders</code> will help Postgres' query optimizer better plan the necessary query to
materialize <code>int_order_kpis</code>. This can improve performance by orders of magnitude in some
queries by allowing Postgres to pick the right kind of join strategy for the specific data you have, for
example.
</p>
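<p>As a minimal sketch of that dependency chain (the columns and the upstream <code>stg_orders</code>
    table are made up for the example):</p>
<pre><code>
-- Step 1: materialize int_orders from staging
CREATE UNLOGGED TABLE int_orders AS
SELECT order_id, customer_id, amount, created_at
FROM stg_orders;

-- Step 2: build int_order_kpis on top of it. This step benefits hugely
-- from int_orders having been ANALYZE-d in between.
CREATE UNLOGGED TABLE int_order_kpis AS
SELECT customer_id,
       count(*)    AS order_count,
       sum(amount) AS total_amount
FROM int_orders
GROUP BY customer_id;
</code></pre>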
<p>Now, will Postgres auto <code>VACUUM ANALYZE</code> the freshly built <code>int_orders</code> before you
start building <code>int_order_kpis</code>? Hard to tell. It depends on how you build your DWH, and how
you've tuned your server's parameters. And the most dangerous bit is you're not in full control: it can
be that <em>sometimes</em> it happens, and other times it doesn't. Flaky and annoying. Some day I'll
write a post on how this behaviour drove me mad for two months because it made a model sometimes build
in a few seconds, and other times take more than 20 minutes.
</p>
<p>
My advice is to make sure you always <code>VACUUM ANALYZE</code> right after building your tables. If
you're using dbt, you can easily achieve this by adding this to your project's
<code>dbt_project.yml</code>:
</p>
<pre><code>
models:
  +post-hook:
    sql: "VACUUM ANALYZE {{ this }}"
    transaction: false
    # ^ This makes dbt run a VACUUM ANALYZE on each model right after building it.
    # It's pointless for views, but that doesn't matter because Postgres fails
    # silently without raising an unhandled exception.
</code></pre>
<h3>Monitor queries with <code>pg_stat_statements</code></h3>
<p><a href="https://www.postgresql.org/docs/current/pgstatstatements.html" target="_blank"
rel="noopener noreferrer">pg_stats_statements</a> is an extension that nowadays ships with Postgres
by default. If activated, it will log info on the queries executed in the server which you can check
afterward. This includes many details, with how frequently does the query get called and what's the min,
max and mean execution time being the ones you probably care about the most. Looking at those allows you
to find queries that take long each time they run, and queries that get run a lot.
</p>
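<p>Enabling it requires adding <code>pg_stat_statements</code> to <code>shared_preload_libraries</code>
    and restarting the server; after that, a query along these lines (column names as of Postgres 13 and
    later) surfaces the heavy hitters:</p>
<pre><code>
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Slowest statements by mean execution time
SELECT calls,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       round(min_exec_time::numeric, 2)  AS min_ms,
       round(max_exec_time::numeric, 2)  AS max_ms,
       query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
</code></pre>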
<p>Another important piece of info that gets recorded is <em>who</em> ran the query. This is helpful
because, if you use users in a smart way, it can help you isolate expensive queries across different use
cases or areas. For example, if you use different users to build the DWH and to give your BI tool read
access (you do that... right?), you can easily tell apart dashboard related queries from internal, DWH
transformation ones. Another example could be internal reporting vs embedded analytics in your product:
you might have stricter performance SLAs for product-embedded, customer-facing queries than for internal
dashboards. Using different users and <code>pg_stat_statements</code> makes it possible for you to
dissect performance issues on those separate areas independently.</p>
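<p>A rough per-role breakdown, for example, could look like this (just a sketch on top of
    <code>pg_stat_statements</code>):</p>
<pre><code>
-- Total execution time and call count per role, to tell BI-tool traffic
-- apart from DWH transformation jobs
SELECT r.rolname,
       sum(s.calls)                              AS calls,
       round(sum(s.total_exec_time)::numeric, 0) AS total_ms
FROM pg_stat_statements s
JOIN pg_roles r ON r.oid = s.userid
GROUP BY r.rolname
ORDER BY total_ms DESC;
</code></pre>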
<h3>Dalibo's wonderful execution plan visualizer</h3>
<p>Sometimes you'll have some nasty query you just need to sit down with and optimize. In my experience, in
a DWH this ends up happening with queries that involve many large tables in sequential joining and
aggregation steps (as in, you join a few tables, group to some granularity, join some more, group again,
etc).
</p>
<p>You can get the query's real execution details with <code>EXPLAIN ANALYZE</code>, but the output's
readability is on par with Morse-encoded regex patterns. I always had headaches dealing with them until
I came across <a href="https://dalibo.com/" target="_blank" rel="noopener noreferrer">Dalibo</a>'s <a
href="https://explain.dalibo.com/" target="_blank" rel="noopener noreferrer">execution plan
visualizer</a>. You can paste the output of <code>EXPLAIN ANALYZE</code> there and see the query
execution presented as a diagram. No amount of words will portray accurately how awesome the UX is, so
I encourage you to try the tool with some nasty query and see for yourself.</p>
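<p>The tool only needs the raw plan output, so something like this (reusing the made-up
    <code>int_orders</code> columns from the earlier sketch) is all you have to copy over:</p>
<pre><code>
-- BUFFERS adds I/O details, which the visualizer renders nicely too
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, sum(amount) AS total_amount
FROM int_orders
GROUP BY customer_id;
</code></pre>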
<h3>Local dev env + Foreign Data Wrapper</h3>
<p>One of the awesome things about using Postgres is how trivial it is to spin up an instance. This makes
goofing around much simpler than when setting up a new instance means paperwork, $$$, etc.</p>
<p>Data team members at Truvi have a dockerized Postgres running on their laptops that they can use when
they are developing on our DWH dbt project. In the early days, you could grab a production dump with some
subset of tables from our staging layer and, if you were patient, run significant chunks of our dbt DAG
on your laptop.</p>
<p>A few hundred models later, this became increasingly difficult and finally impossible.
</p>
<p>Luckily, we came across Postgres' <a
href="https://www.postgresql.org/docs/current/postgres-fdw.html">Foreign Data Wrapper</a>. There's
quite a bit to it, but to keep it short here, just be aware that FDW allows you to make a Postgres
server give access to some table in a different Postgres server while pretending they are local. So, you
query table X in Postgres server A, even though table X is actually stored in Postgres server B. But
your query works just the same as if it was a local genuine table.</p>
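<p>The basic setup with <code>postgres_fdw</code> looks roughly like this; host, schema and role names
    are made up for the example:</p>
<pre><code>
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Declare the remote server and how to authenticate against it
CREATE SERVER prod_dwh
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'prod-dwh.internal', port '5432', dbname 'dwh');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER prod_dwh
    OPTIONS (user 'readonly_dev', password 'secret');

-- Expose the remote staging schema locally as foreign tables
CREATE SCHEMA IF NOT EXISTS staging_remote;
IMPORT FOREIGN SCHEMA staging
    FROM SERVER prod_dwh
    INTO staging_remote;
</code></pre>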
<p>Setting this up is fairly trivial, and it has allowed our dbt project contributors to execute hybrid
dbt runs where some data and tables are local to their laptop, while some upstream data is read from the
production server. The approach has been great so far, enabling them to conveniently test models before
committing them to master.</p>
<hr>
<p><a href="../index.html">back to home</a></p>
</section>
</main>
</body>
</html>