Choosing

Intro

The first step is to choose which orchestrator we want to go for.

The go-to names in the industry are Airflow, Prefect, and Dagster. After a lot of unstructured research over the past year, I've decided to narrow it down to Prefect and Dagster. Both options seem more feature-rich, have better integrations with our stack, and have fewer people complaining about them than Airflow. Airflow is what everyone uses and everyone complains about, so we might as well dodge that bullet directly.

Between Prefect and Dagster, I can't pick one yet. I worked a lot with Prefect and I know it's good, but that was on Prefect 1, and they are already on version 3, so things might have changed a lot.

On the other hand, I haven't tried Dagster, but I've heard lovely things about it. Apparently, its data asset abstraction makes pipelines and governance much better. Plus, it has very nice integrations with Airbyte and dbt, way better than what I've seen in other tools.
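
For reference, this is roughly what that asset model looks like: a minimal, hypothetical sketch of two Dagster assets in a tiny lineage graph. The asset names are made up for illustration, not our real tables, and I'd verify the API against the current Dagster docs.

```python
# Minimal, hypothetical Dagster sketch: two assets in a tiny graph.
# The names (xero_invoices_raw, stg_xero_invoices) are illustrative only.
from dagster import Definitions, asset


@asset
def xero_invoices_raw() -> None:
    # In a real setup this table would be materialized by the Airbyte
    # integration; here it is just a stand-in upstream asset.
    ...


@asset(deps=[xero_invoices_raw])
def stg_xero_invoices() -> None:
    # In a real setup this would come from the dagster-dbt integration,
    # which maps each dbt model to an asset; here it is a placeholder.
    ...


# Definitions is Dagster's entry point: it registers assets so the UI
# can show the lineage graph and trigger materializations.
defs = Definitions(assets=[xero_invoices_raw, stg_xero_invoices])
```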

To choose between the two, I've decided to run a little hello-world exercise with both of them. The plan is to do the same things in both, document it, discuss it with Uri, and then make a decision. Once that's done, we start planning the production deployment and how we migrate executions over to it.

Orchestration hello-world

These are the steps I would like to run with both.

  • Try to deploy it locally
  • Deploy a local Airbyte and a local DWH alongside it
  • Try to set up a full Xero pipeline (see the sketch after this list)
    • This means setting up an Airbyte connection and running the dbt pipeline locally for the Xero tables (dbt run -s models/staging/xero+)
    • The pipeline should run everything: Airbyte, plus all the layers of dbt models
    • Also run the related dbt tests
  • Try to set up a full xexe pipeline
    • This means triggering runs made with the CLI interface of xexe, or by importing it as a library, and then running the dbt pipeline locally for the downstream currency-related tables (not including the gazillion DWH tables that depend on them; just currency stuff down to int_simple_exchange_rates)
    • Also run the related dbt tests
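
To make the Xero exercise concrete, here is a rough Prefect sketch of the shape I have in mind. It assumes a local Airbyte exposing its API on port 8000 and dbt available on the PATH; the connection ID and the Airbyte endpoint are placeholders I would double-check against the current Airbyte and Prefect docs before relying on them.

```python
# Rough Prefect sketch of the Xero hello-world pipeline. Assumes a local
# Airbyte on :8000 and dbt on the PATH; the connection ID and the
# Airbyte endpoint are placeholders to verify against the docs.
import subprocess

import requests
from prefect import flow, task

AIRBYTE_API = "http://localhost:8000/api/v1"  # assumed local deployment
XERO_CONNECTION_ID = "<airbyte-connection-uuid>"  # placeholder


@task(retries=2, retry_delay_seconds=60)
def trigger_airbyte_sync(connection_id: str) -> None:
    # Kick off a sync for the Xero connection; polling the job until it
    # finishes is omitted from this sketch.
    resp = requests.post(
        f"{AIRBYTE_API}/connections/sync",
        json={"connectionId": connection_id},
        timeout=30,
    )
    resp.raise_for_status()


@task
def dbt(*args: str) -> None:
    # Shell out to dbt; check=True turns dbt failures into task failures.
    subprocess.run(["dbt", *args], check=True)


@flow
def xero_pipeline() -> None:
    trigger_airbyte_sync(XERO_CONNECTION_ID)
    # Run everything downstream of the Xero staging models, then test it.
    dbt("run", "-s", "models/staging/xero+")
    dbt("test", "-s", "models/staging/xero+")


if __name__ == "__main__":
    xero_pipeline()
```

The xexe pipeline would look much the same, with the Airbyte task swapped for a subprocess call to xexe's CLI (or a direct library import).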

Besides that, I might also try to:

  • Send messages through Slack for alerts (see the sketch after this list)
  • Deploy on Azure (not the final, production deployment by any means)
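
For the Slack alerts, the simplest thing I can think of is a state-change hook posting to an incoming webhook. A rough sketch, assuming Prefect's on_failure flow hooks behave as documented; the webhook URL is a placeholder:

```python
# Rough sketch: alert on flow failure via a Slack incoming webhook.
# The webhook URL is a placeholder; on_failure hooks receive the flow,
# the flow run, and the final state.
import requests
from prefect import flow

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/<placeholder>"


def alert_on_failure(flow, flow_run, state):
    # Post a short message with the run name and the failure message.
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"Flow run {flow_run.name} failed: {state.message}"},
        timeout=10,
    )


@flow(on_failure=[alert_on_failure])
def xero_pipeline() -> None:
    ...
```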

Some feature areas where I would like to take thorough notes:

  • Retry logic
  • Pipeline logs
  • dbt logs
  • Warnings and alerts, perhaps even incident management
  • Scalability features with parametrization
  • Secret management
  • Pipeline version control
  • Triggering and scheduling capabilities (see the sketch after this list)
  • API for external services to interact with
  • Ownership and governance of pipelines
  • How the hell we can play it smart with backfills
  • Development and deployment flow
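
On triggering, scheduling, and parametrization, this is roughly what I expect the Prefect side to look like: serving a flow on a cron schedule with a parameter. Again a sketch, assuming Flow.serve still accepts cron and parameters arguments; the deployment name, schedule, and full_refresh parameter are invented for illustration.

```python
# Rough sketch of scheduling with parametrization in Prefect: serve the
# flow on a cron schedule and pass a parameter. The deployment name,
# schedule, and full_refresh parameter are invented for illustration.
from prefect import flow


@flow
def xero_pipeline(full_refresh: bool = False) -> None:
    # Placeholder body; the real flow would run Airbyte + dbt as above.
    ...


if __name__ == "__main__":
    # Starts a long-running "served" deployment that triggers the flow
    # every day at 06:00 and also exposes it for manual/API triggering.
    xero_pipeline.serve(
        name="xero-nightly",
        cron="0 6 * * *",
        parameters={"full_refresh": False},
    )
```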

Dagster hello-world