Choosing

Intro

The first step is to choose which orchestrator we want to go for.

The go-to names in the industry are Airflow, Prefect, and Dagster. After a lot of unstructured research over the past year, I've decided to narrow it down to Prefect and Dagster. Both options seem more feature-rich, have better integrations with our stack, and have fewer people complaining about them than Airflow. Airflow is what everyone uses and everyone complains about, so we might as well dodge that bullet directly.

Between Prefect and Dagster, I can't pick one yet. I worked a lot with Prefect and I know it's good, but that was on Prefect 1, and they are already on version 3, so things might have changed a lot.

On the other hand, I haven't tried Dagster, but I've heard lovely things about it. Apparently, its data asset abstraction makes pipelines and governance much better. Plus, it has very nice integrations with Airbyte and dbt, way better than what I've seen in other tools.
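
For reference, this is roughly what that asset model looks like: a minimal, hypothetical sketch of two Dagster assets in a tiny lineage graph. The asset names are made up for illustration, not our real tables, and I'd verify the API against the current Dagster docs.

```python
# Minimal, hypothetical Dagster sketch: two assets in a tiny graph.
# The names (xero_invoices_raw, stg_xero_invoices) are illustrative only.
from dagster import Definitions, asset


@asset
def xero_invoices_raw() -> None:
    # In a real setup this table would be materialized by the Airbyte
    # integration; here it is just a stand-in upstream asset.
    ...


@asset(deps=[xero_invoices_raw])
def stg_xero_invoices() -> None:
    # In a real setup this would come from the dagster-dbt integration,
    # which maps each dbt model to an asset; here it is a placeholder.
    ...


# Definitions is Dagster's entry point: it registers assets so the UI
# can show the lineage graph and trigger materializations.
defs = Definitions(assets=[xero_invoices_raw, stg_xero_invoices])
```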

To choose between the two, I've decided to run a little hello-world exercise with both of them. The plan is to do the same things in both, document it, discuss it with Uri, and then make a decision. Once that's done, we start planning the production deployment and how we migrate executions over to it.

Orchestration hello-world

These are the steps I would like to run with both.

  • Try to deploy it locally
  • Deploy a local Airbyte and a local DWH alongside it
  • Try to set up a full Xero pipeline (see the sketch after this list)
    • This means setting up an Airbyte connection and running the dbt pipeline locally for the Xero tables (dbt run -s models/staging/xero+)
    • The pipeline should run everything: Airbyte, plus all the layers of dbt models
    • Also run the related dbt tests
  • Try to set up a full xexe pipeline
    • This means triggering runs made with the CLI interface of xexe, or by importing it as a library, and then running the dbt pipeline locally for the downstream currency-related tables (not including the gazillion DWH tables that depend on them; just currency stuff down to int_simple_exchange_rates)
    • Also run the related dbt tests
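
To make the Xero exercise concrete, here is a rough Prefect sketch of the shape I have in mind. It assumes a local Airbyte exposing its API on port 8000 and dbt available on the PATH; the connection ID and the Airbyte endpoint are placeholders I would double-check against the current Airbyte and Prefect docs before relying on them.

```python
# Rough Prefect sketch of the Xero hello-world pipeline. Assumes a local
# Airbyte on :8000 and dbt on the PATH; the connection ID and the
# Airbyte endpoint are placeholders to verify against the docs.
import subprocess

import requests
from prefect import flow, task

AIRBYTE_API = "http://localhost:8000/api/v1"  # assumed local deployment
XERO_CONNECTION_ID = "<airbyte-connection-uuid>"  # placeholder


@task(retries=2, retry_delay_seconds=60)
def trigger_airbyte_sync(connection_id: str) -> None:
    # Kick off a sync for the Xero connection; polling the job until it
    # finishes is omitted from this sketch.
    resp = requests.post(
        f"{AIRBYTE_API}/connections/sync",
        json={"connectionId": connection_id},
        timeout=30,
    )
    resp.raise_for_status()


@task
def dbt(*args: str) -> None:
    # Shell out to dbt; check=True turns dbt failures into task failures.
    subprocess.run(["dbt", *args], check=True)


@flow
def xero_pipeline() -> None:
    trigger_airbyte_sync(XERO_CONNECTION_ID)
    # Run everything downstream of the Xero staging models, then test it.
    dbt("run", "-s", "models/staging/xero+")
    dbt("test", "-s", "models/staging/xero+")


if __name__ == "__main__":
    xero_pipeline()
```

The xexe pipeline would look much the same, with the Airbyte task swapped for a subprocess call to xexe's CLI (or a direct library import).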

Besides that, I might also try to:

  • Send messages through Slack for alerts (see the sketch after this list)
  • Deploy on Azure (not the final, production deployment by any means)
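
For the Slack alerts, the simplest thing I can think of is a state-change hook posting to an incoming webhook. A rough sketch, assuming Prefect's on_failure flow hooks behave as documented; the webhook URL is a placeholder:

```python
# Rough sketch: alert on flow failure via a Slack incoming webhook.
# The webhook URL is a placeholder; on_failure hooks receive the flow,
# the flow run, and the final state.
import requests
from prefect import flow

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/<placeholder>"


def alert_on_failure(flow, flow_run, state):
    # Post a short message with the run name and the failure message.
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"Flow run {flow_run.name} failed: {state.message}"},
        timeout=10,
    )


@flow(on_failure=[alert_on_failure])
def xero_pipeline() -> None:
    ...
```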

Some feature areas where I would like to take thorough notes:

  • Retry logic
  • Pipeline logs
  • dbt logs
  • Warnings and alerts, perhaps even incident management
  • Scalability features with parametrization
  • Secret management
  • Pipeline version control
  • Triggering and scheduling capabilities (see the sketch after this list)
  • API for external services to interact with
  • Ownership and governance of pipelines
  • How the hell we can play it smart with backfills
  • Development and deployment flow
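
On triggering, scheduling, and parametrization, this is roughly what I expect the Prefect side to look like: serving a flow on a cron schedule with a parameter. Again a sketch, assuming Flow.serve still accepts cron and parameters arguments; the deployment name, schedule, and full_refresh parameter are invented for illustration.

```python
# Rough sketch of scheduling with parametrization in Prefect: serve the
# flow on a cron schedule and pass a parameter. The deployment name,
# schedule, and full_refresh parameter are invented for illustration.
from prefect import flow


@flow
def xero_pipeline(full_refresh: bool = False) -> None:
    # Placeholder body; the real flow would run Airbyte + dbt as above.
    ...


if __name__ == "__main__":
    # Starts a long-running "served" deployment that triggers the flow
    # every day at 06:00 and also exposes it for manual/API triggering.
    xero_pipeline.serve(
        name="xero-nightly",
        cron="0 6 * * *",
        parameters={"full_refresh": False},
    )
```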

Dagster hello-world