# Pains

## Local app/infra

- Starting up the local app and running E2E tests plus EL takes around 20 minutes.
- Having the Airflow UI to run things is nice. Having Airflow schedules trigger automatically is painful locally, because it makes controlling the state of the environment hard.
- Simply checking the E2E flow takes a lot of steps and tools. It feels like there's always something breaking.

## Meltano

- Meltano logs are terrible to read, which makes debugging painful.
- Meltano EL takes a long time.
- Meltano configuration is painful:
  - The docs are terrible.
  - Wrong configurations often don't raise errors; they just get ignored.
  - Meltano's handling of Python environments is sometimes more of an annoyance than a help.

## Data Pipeline needs data

- An anemic dataset makes development hard (many entities with few or no records).

## Improvable practices

- Loading all backend tables into the DW, regardless of whether they are used:
  - More data, worse performance, no gain.
  - More cognitive load when working on the DW ("What is this table? Where is it used? Can I modify it?").
  - "But then I have all backend tables handy" -> Well, let's make adding a backend table trivial instead.
- dbt:
  - Not documenting models:
    - The next person has no clue what they are, which makes shared ownership hard.
  - Not using exposures:
    - Hard to know which models impact which reports.
    - Hard to know which parts of the DW are truly used.

## What are our bottlenecks/issues?

- Dealing with a convoluted output definition (understanding the laws + an unclear validation procedure with Vicky).
- Translating a needed report into SQL transformations of the backend data.
- Breaking changes in backend events/entities breaking downstream data dependencies.

What is NOT a bottleneck:

- Data latency
- Scalability of data volume
- Having a pretty interface

## My proposal

- Fall back to a dramatically simpler stack that lets team members working on reports move fast:
  - Hardcoded Bitfinex CSV
  - Hardcoded Sumsub CSV
  - Use another PG as the DW; move data with a simple, stateless Python script (see the sketch at the end of these notes).
  - Ignore orchestration, UI delivery, and monitoring for now.
- Work together with backend to find a convenient way to get good testing data.
- Make an exhaustive list of reports and align with Luis/Vicky on a plan to systematically meet and validate. Track it.

Then, once...

- we have an ack from Luis/Vicky that all reports are domain-valid,
- we have more clarity on the integrations we must run with regulators/government systems after the audit,
- the backend data model is more stable,

... we take our domain-rich, deployment-poor setup and discuss the optimal tooling and strategies to deliver with production-grade practices.

## North Star Ideas

- Use an asset-based orchestrator, like Dagster.
- Step away from Meltano; use an EL framework such as dlt and combine it with the orchestrator (see the dlt sketch below).
- Add a visualization tool to the stack, such as Evidence, Metabase, or Lightdash.
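The proposal's stateless mover could be as small as the following. A minimal sketch, assuming both the backend DB and the DW are Postgres and `psycopg2` is installed; the connection strings, table names, and CSV paths are placeholders, and the DW tables are assumed to already exist with matching schemas.

```python
"""Stateless full-refresh copy: backend Postgres -> DW Postgres, plus hardcoded CSVs."""
import io
import psycopg2

BACKEND_DSN = "postgresql://user:pass@localhost:5432/backend"  # placeholder
DW_DSN = "postgresql://user:pass@localhost:5433/dw"            # placeholder
TABLES = ["users", "transactions"]  # placeholder: only the tables we actually use
CSV_FILES = {"bitfinex_trades": "data/bitfinex.csv",  # placeholder paths
             "sumsub_checks": "data/sumsub.csv"}

def copy_table(backend, dw, table):
    """Full refresh: truncate, then stream rows with COPY. No state kept anywhere."""
    buf = io.StringIO()
    with backend.cursor() as src:
        src.copy_expert(f"COPY {table} TO STDOUT WITH CSV", buf)
    buf.seek(0)
    with dw.cursor() as dst:
        dst.execute(f"TRUNCATE {table}")
        dst.copy_expert(f"COPY {table} FROM STDIN WITH CSV", buf)

def load_csv(dw, table, path):
    """Load a hardcoded CSV (Bitfinex/Sumsub) the same way."""
    with dw.cursor() as dst, open(path) as f:
        dst.execute(f"TRUNCATE {table}")
        dst.copy_expert(f"COPY {table} FROM STDIN WITH CSV HEADER", f)

def main():
    with psycopg2.connect(BACKEND_DSN) as backend, psycopg2.connect(DW_DSN) as dw:
        for table in TABLES:
            copy_table(backend, dw, table)
        for table, path in CSV_FILES.items():
            load_csv(dw, table, path)

if __name__ == "__main__":
    main()
```

Adding a backend table is then one entry in `TABLES`, which is the "make adding a backend table trivial" point from above.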
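For the North Star direction, the same EL could later be expressed as a dlt pipeline and handed to an orchestrator. A minimal sketch, assuming `dlt` is installed with Postgres support (`pip install "dlt[postgres]"`); the resource below is a stand-in for a real backend extractor, and all names are placeholders.

```python
import dlt

@dlt.resource(table_name="users", write_disposition="replace")
def users():
    # Placeholder: in practice this would read from the backend DB or an API.
    yield {"id": 1, "email": "alice@example.com"}
    yield {"id": 2, "email": "bob@example.com"}

pipeline = dlt.pipeline(
    pipeline_name="backend_to_dw",  # placeholder name
    destination="postgres",         # the DW from the proposal
    dataset_name="raw",             # target schema in the DW
)

if __name__ == "__main__":
    # dlt infers the schema, creates the tables, and tracks load state;
    # an orchestrator like Dagster would simply invoke this run.
    print(pipeline.run(users()))
```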