
Pains

Local app/infra

  • Starting up the local app, running E2E tests, and running EL takes ~20 minutes.
  • Having the Airflow UI to run things manually is nice; having Airflow schedules trigger automatically is painful locally, because it makes controlling the state of the environment hard.
  • Simply checking the E2E flow takes a lot of steps and tools; it feels like something is always breaking.

Meltano

  • Meltano logs are terrible to read, which makes debugging painful.
  • Meltano EL takes a long time.
  • Meltano configuration is painful:
    • Docs are terrible.
    • Wrong configurations often don't raise errors; they just get silently ignored.
  • Meltano's handling of Python environments is sometimes more of an annoyance than a help.

Data Pipeline needs data

  • An anemic dataset makes development hard (many entities have few or no records).

Improvable practices

  • Loading all backend tables into the DW, regardless of whether they are used
    • More data, worse performance, no gain
    • More cognitive load when working on DW ("What is this table? Where is it used? Can I modify it?")
    • "But then I have all backend tables handy" -> Well, let's make adding a backend table trivial
  • dbt
    • Not documenting models
      • The next person has no clue what they are, which makes shared ownership hard
    • Not using exposures (see the sketch after this list)
      • Hard to know what models impact what reports
      • Hard to know what parts of the DW are truly used
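
To make the exposures point concrete, here is a minimal sketch of a dbt exposure; the report, owner, and ref()'d model names are hypothetical placeholders:

```yaml
# models/exposures.yml — hypothetical file; all names are placeholders
version: 2

exposures:
  - name: regulator_weekly_report
    type: dashboard
    maturity: medium
    description: "Weekly report delivered to the regulator; validated with Luis/Vicky."
    owner:
      name: Data team
      email: data-team@example.com
    depends_on:
      - ref('fct_transactions')
      - ref('dim_accounts')
```

With this in place, `dbt ls --select +exposure:regulator_weekly_report` lists every model the report depends on, which answers both "what models impact what reports" and "what parts of the DW are truly used".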

What are our bottlenecks/issues?

  • Dealing with a convoluted output definition (understanding laws + unclear validation procedure with Vicky)
  • Translating needed report into SQL transformations of the backend data
  • Breaking changes in backend events/entities that break downstream data dependencies

What is NOT a bottleneck

  • Data latency
  • Scalability of data volume
  • Having a pretty interface

My proposal

  • Fall back to a dramatically simpler stack that lets team members working on reports move fast
    • Hardcoded Bitfinex CSV
    • Hardcoded Sumsub CSV
    • Use another Postgres instance as the DW; move data with a simple, stateless Python script (sketched below)
  • Ignore orchestration, UI delivery, and monitoring for now
  • Work with the backend team on a convenient way to get good test data
  • Make an exhaustive list of reports and align with Luis/Vicky on a plan to systematically meet and validate. Track it.
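
A minimal sketch of what that stateless script could look like, assuming both the backend and the DW are Postgres, the raw.* target tables already exist with matching columns, and the CSV drops live on disk; DSNs, table names, and file paths are placeholders:

```python
import io

import psycopg2

BACKEND_DSN = "postgresql://user:pass@backend:5432/app"  # placeholder
DW_DSN = "postgresql://user:pass@dw:5432/warehouse"      # placeholder
TABLES = ["accounts", "transactions"]  # explicit allowlist of backend tables


def copy_table(src_conn, dw_conn, table: str) -> None:
    """Full refresh of one backend table: stream out as CSV, truncate, stream in."""
    buf = io.StringIO()
    with src_conn.cursor() as src:
        src.copy_expert(f"COPY (SELECT * FROM {table}) TO STDOUT WITH CSV HEADER", buf)
    buf.seek(0)
    with dw_conn.cursor() as dw:
        dw.execute(f"TRUNCATE raw.{table}")
        dw.copy_expert(f"COPY raw.{table} FROM STDIN WITH CSV HEADER", buf)
    dw_conn.commit()


def load_csv(dw_conn, path: str, table: str) -> None:
    """Load a hardcoded CSV drop (Bitfinex, Sumsub) into the DW."""
    with open(path) as f, dw_conn.cursor() as dw:
        dw.execute(f"TRUNCATE raw.{table}")
        dw.copy_expert(f"COPY raw.{table} FROM STDIN WITH CSV HEADER", f)
    dw_conn.commit()


if __name__ == "__main__":
    with psycopg2.connect(BACKEND_DSN) as src, psycopg2.connect(DW_DSN) as dw:
        for table in TABLES:
            copy_table(src, dw, table)
        load_csv(dw, "data/bitfinex_trades.csv", "bitfinex_trades")
        load_csv(dw, "data/sumsub_applicants.csv", "sumsub_applicants")
```

No state, no scheduler: run it whenever fresh data is needed. Adding a backend table is one line in TABLES, which also addresses the "but then I have all backend tables handy" objection above.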

Then, once...

  • We have ack from Luis/Vicky that all reports are domain-valid
  • We have more clarity on integrations we must run with regulators/government systems after audit
  • Backend data model is more stable

... we take our domain-rich, deployment-poor setup and discuss the optimal tooling and strategies to deliver it with production-grade practices.

North Star Ideas

  • Use an asset-based orchestrator, like Dagster
  • Step away from Meltano; use an EL framework such as dlt and combine it with the orchestrator (see the sketch below)
  • Add a visualization tool to the stack, such as Evidence, Metabase, or Lightdash
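
A minimal sketch of how the first two North Star pieces could fit together: a Dagster asset that runs a dlt pipeline into the Postgres DW. `fetch_accounts` and all names are hypothetical placeholders; Postgres credentials are assumed to be configured via dlt's usual secrets/env mechanism.

```python
import dlt
from dagster import Definitions, asset


def fetch_accounts():
    # Hypothetical extraction step: pull rows from the backend (DB cursor, API, ...).
    yield {"id": 1, "status": "active"}


@asset
def raw_accounts() -> None:
    """Load backend accounts into the DW; Dagster tracks it as a data asset."""
    pipeline = dlt.pipeline(
        pipeline_name="backend_to_dw",
        destination="postgres",  # credentials come from dlt secrets/env, not shown here
        dataset_name="raw",
    )
    pipeline.run(fetch_accounts(), table_name="accounts", write_disposition="replace")


defs = Definitions(assets=[raw_accounts])
```

Because the orchestrator is asset-based, the UI shows what data exists and how it was produced, rather than a list of opaque scheduled tasks.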