
Pains

Local app/infra

  • Starting up the local app, running E2E tests, and running EL takes ~20 minutes.
  • Having the Airflow UI to run things manually is nice; having Airflow schedules trigger automatically is painful locally, because it makes controlling the state of the environment hard.
  • Simply checking the E2E flow takes a lot of steps and tools; it feels like something is always breaking.

Meltano

  • Meltano logs are terrible to read, which makes debugging painful.
  • Meltano EL takes a long time.
  • Meltano configuration is painful:
    • Docs are terrible.
    • Wrong configurations often don't raise errors; they just get silently ignored.
  • Meltano's handling of Python environments is sometimes more of an annoyance than a help.

Data Pipeline needs data

  • An anemic dataset makes development hard (many entities have few or no records).

Improvable practices

  • Loading all backend tables into the DW, regardless of whether they are used
    • More data, worse performance, no gain
    • More cognitive load when working on DW ("What is this table? Where is it used? Can I modify it?")
    • "But then I have all backend tables handy" -> Well, let's make adding a backend table trivial
  • dbt
    • Not documenting models
      • The next person has no clue what they are, which makes shared ownership hard
    • Not using exposures (see the sketch after this list)
      • Hard to know what models impact what reports
      • Hard to know what parts of the DW are truly used
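
To make the exposures point concrete, here is a minimal sketch of a dbt exposure; the report, owner, and ref()'d model names are hypothetical placeholders:

```yaml
# models/exposures.yml — hypothetical file; all names are placeholders
version: 2

exposures:
  - name: regulator_weekly_report
    type: dashboard
    maturity: medium
    description: "Weekly report delivered to the regulator; validated with Luis/Vicky."
    owner:
      name: Data team
      email: data-team@example.com
    depends_on:
      - ref('fct_transactions')
      - ref('dim_accounts')
```

With this in place, `dbt ls --select +exposure:regulator_weekly_report` lists every model the report depends on, which answers both "what models impact what reports" and "what parts of the DW are truly used".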

What are our bottlenecks/issues?

  • Dealing with a convoluted output definition (understanding laws + unclear validation procedure with Vicky)
  • Translating needed report into SQL transformations of the backend data
  • Breaking changes in backend events/entities that break downstream data dependencies

What is NOT a bottleneck

  • Data latency
  • Scalability of data volume
  • Having a pretty interface

My proposal

  • Fall back to a dramatically simpler stack that lets team members working on reports move fast
    • Hardcoded Bitfinex CSV
    • Hardcoded Sumsub CSV
    • Use another Postgres instance as the DW; move data with a simple, stateless Python script (sketched below)
  • Ignore orchestration, UI delivery, and monitoring for now
  • Work with the backend team on a convenient way to get good test data
  • Make an exhaustive list of reports and align with Luis/Vicky on a plan to systematically meet and validate. Track it.
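
A minimal sketch of what that stateless script could look like, assuming both the backend and the DW are Postgres, the raw.* target tables already exist with matching columns, and the CSV drops live on disk; DSNs, table names, and file paths are placeholders:

```python
import io

import psycopg2

BACKEND_DSN = "postgresql://user:pass@backend:5432/app"  # placeholder
DW_DSN = "postgresql://user:pass@dw:5432/warehouse"      # placeholder
TABLES = ["accounts", "transactions"]  # explicit allowlist of backend tables


def copy_table(src_conn, dw_conn, table: str) -> None:
    """Full refresh of one backend table: stream out as CSV, truncate, stream in."""
    buf = io.StringIO()
    with src_conn.cursor() as src:
        src.copy_expert(f"COPY (SELECT * FROM {table}) TO STDOUT WITH CSV HEADER", buf)
    buf.seek(0)
    with dw_conn.cursor() as dw:
        dw.execute(f"TRUNCATE raw.{table}")
        dw.copy_expert(f"COPY raw.{table} FROM STDIN WITH CSV HEADER", buf)
    dw_conn.commit()


def load_csv(dw_conn, path: str, table: str) -> None:
    """Load a hardcoded CSV drop (Bitfinex, Sumsub) into the DW."""
    with open(path) as f, dw_conn.cursor() as dw:
        dw.execute(f"TRUNCATE raw.{table}")
        dw.copy_expert(f"COPY raw.{table} FROM STDIN WITH CSV HEADER", f)
    dw_conn.commit()


if __name__ == "__main__":
    with psycopg2.connect(BACKEND_DSN) as src, psycopg2.connect(DW_DSN) as dw:
        for table in TABLES:
            copy_table(src, dw, table)
        load_csv(dw, "data/bitfinex_trades.csv", "bitfinex_trades")
        load_csv(dw, "data/sumsub_applicants.csv", "sumsub_applicants")
```

No state, no scheduler: run it whenever fresh data is needed. Adding a backend table is one line in TABLES, which also addresses the "but then I have all backend tables handy" objection above.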

Then, once...

  • We have ack from Luis/Vicky that all reports are domain-valid
  • We have more clarity on integrations we must run with regulators/government systems after audit
  • Backend data model is more stable

... we take our domain-rich, deployment-poor setup and discuss the optimal tooling and strategies to deliver it with production-grade practices.

North Star Ideas

  • Use an asset-based orchestrator, like Dagster
  • Step away from Meltano; use an EL framework such as dlt and combine it with the orchestrator (see the sketch below)
  • Add a visualization tool to the stack, such as Evidence, Metabase, or Lightdash
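
A minimal sketch of how the first two North Star pieces could fit together: a Dagster asset that runs a dlt pipeline into the Postgres DW. `fetch_accounts` and all names are hypothetical placeholders; Postgres credentials are assumed to be configured via dlt's usual secrets/env mechanism.

```python
import dlt
from dagster import Definitions, asset


def fetch_accounts():
    # Hypothetical extraction step: pull rows from the backend (DB cursor, API, ...).
    yield {"id": 1, "status": "active"}


@asset
def raw_accounts() -> None:
    """Load backend accounts into the DW; Dagster tracks it as a data asset."""
    pipeline = dlt.pipeline(
        pipeline_name="backend_to_dw",
        destination="postgres",  # credentials come from dlt secrets/env, not shown here
        dataset_name="raw",
    )
    pipeline.run(fetch_accounts(), table_name="accounts", write_disposition="replace")


defs = Definitions(assets=[raw_accounts])
```

Because the orchestrator is asset-based, the UI shows what data exists and how it was produced, rather than a list of opaque scheduled tasks.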