# Pains
## Local app/infra

- Starting up the local app and running the E2E tests plus EL takes ~20 minutes.
- Having the Airflow UI to run things is nice. Having Airflow schedules trigger automatically is painful locally because it makes controlling the state of the environment hard.
- Simply checking the E2E flow takes a lot of steps and tools; it feels like there's always something breaking.
## Meltano
- Meltano logs are terrible to read, which makes debugging painful.
- Meltano EL runs take a long time.
- Meltano configuration is painful:
  - Docs are terrible.
  - Wrong configurations often don't raise errors; they just get ignored.
- Meltano's handling of Python environments is sometimes more of an annoyance than a help.
## Data Pipeline needs data
- An anemic dataset makes development hard (many entities with few or no records).
## Improvable practices
- Loading all backend tables into the DW, regardless of whether they are used:
  - More data, worse performance, no gain.
  - More cognitive load when working on the DW ("What is this table? Where is it used? Can I modify it?").
  - "But then I have all backend tables handy" -> Well, let's make adding a backend table trivial (see the sketch after this list).
- dbt:
  - Not documenting models:
    - The next person has no clue what they are, which makes shared ownership hard.
  - Not using exposures:
    - Hard to know which models impact which reports.
    - Hard to know which parts of the DW are truly used.
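To make "adding a backend table should be trivial" concrete: a minimal sketch of allow-list-driven replication, assuming psycopg2 and a DW that already has matching table schemas. The table names, DSNs, and the `copy_table` helper are all hypothetical.

```python
import psycopg2  # assumed driver; any PG driver works

# One line per backend table we actually use downstream -- nothing else
# gets loaded. Adding a table to the DW == appending one reviewed line here.
REPLICATED_TABLES = [
    "users",
    "transactions",
    "kyc_checks",
]

def copy_table(src_dsn: str, dst_dsn: str, table: str) -> None:
    """Full, stateless copy of one table into the DW (schemas assumed equal)."""
    with psycopg2.connect(src_dsn) as src, psycopg2.connect(dst_dsn) as dst:
        with src.cursor() as read, dst.cursor() as write:
            read.execute(f"SELECT * FROM {table}")
            cols = [c.name for c in read.description]
            write.execute(f"TRUNCATE {table}")  # full refresh, no state kept
            write.executemany(
                f"INSERT INTO {table} ({', '.join(cols)}) "
                f"VALUES ({', '.join(['%s'] * len(cols))})",
                read.fetchall(),
            )
```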
## What are our bottlenecks/issues?
- Dealing with a convoluted output definition (understanding the laws + an unclear validation procedure with Vicky).
- Translating the needed reports into SQL transformations of the backend data.
- Breaking changes in backend events/entities that break downstream data dependencies.

What is NOT a bottleneck:

- Data latency
- Scalability of data volume
- Having a pretty interface
## My proposal
- Fall back to a dramatically simpler stack that allows team members working on reports to move fast (see the sketch after this list):
  - Hardcoded Bitfinex CSV
  - Hardcoded Sumsub CSV
  - Use another PG as the DW; move data with a simple, stateless Python script
  - Ignore orchestration, UI delivery, and monitoring for now
- Work together with the backend team to find a convenient way to get good testing data.
- Make an exhaustive list of reports and align with Luis/Vicky on a plan to systematically meet and validate. Track it.
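A minimal sketch of what the stateless script could look like, assuming pandas + SQLAlchemy are acceptable; the DSNs, file paths, and table names are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

BACKEND = create_engine("postgresql://user:pass@backend-host/backend")  # placeholder
DW = create_engine("postgresql://user:pass@dw-host/dw")                 # placeholder

def run() -> None:
    # Hardcoded CSV drops: fully re-read and reloaded on every run.
    for name, path in [("bitfinex", "data/bitfinex.csv"), ("sumsub", "data/sumsub.csv")]:
        pd.read_csv(path).to_sql(name, DW, if_exists="replace", index=False)

    # The backend tables we actually need, fully re-copied each run.
    # Stateless by design: no bookmarks or incremental state to corrupt.
    for table in ["users", "transactions"]:
        pd.read_sql(f"SELECT * FROM {table}", BACKEND).to_sql(
            table, DW, if_exists="replace", index=False
        )

if __name__ == "__main__":
    run()
```

Since it's a plain script with no state, anyone can run it from a terminal and get a fully rebuilt DW, which is the whole point of the fallback.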
Then, once...

- We have an ack from Luis/Vicky that all reports are domain-valid
- We have more clarity on the integrations we must run with regulators/government systems after the audit
- The backend data model is more stable

... we take our domain-rich, deployment-poor setup and discuss the optimal tooling and strategies to deliver it with production-grade practices.
## North Star Ideas
- Use an asset-based orchestrator, like Dagster.
- Step away from Meltano; use an EL framework such as dlt and combine it with the orchestrator (see the sketch below).
- Add a visualization tool to the stack, such as Evidence, Metabase, or Lightdash.
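A hedged sketch of what the Dagster + dlt combination could look like; the resource, pipeline, and asset names are made up, and Postgres credentials are assumed to come from dlt's config/env:

```python
import dlt
from dagster import asset, materialize

@dlt.resource(table_name="bitfinex_trades", write_disposition="replace")
def bitfinex_trades():
    # Placeholder extractor; a real source would read the Bitfinex export.
    yield from [{"trade_id": 1, "amount": 100.0}, {"trade_id": 2, "amount": 250.5}]

@asset
def raw_bitfinex() -> str:
    """One logical DW dataset, materialized by a dlt pipeline."""
    pipeline = dlt.pipeline(
        pipeline_name="raw_loads",
        destination="postgres",  # credentials via dlt config/env
        dataset_name="raw",
    )
    info = pipeline.run(bitfinex_trades())
    return str(info)  # keep the return simple so Dagster can store it

if __name__ == "__main__":
    materialize([raw_bitfinex])
```

Each report input becomes a named asset with visible lineage, which directly addresses the "What is this table? Where is it used?" pain above.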