# Pains
## Local app/infra
- Starting up the local app and running the E2E tests plus EL takes around 20 minutes.
- Having the Airflow UI to run things is nice, but having Airflow schedules trigger automatically is painful locally because it makes controlling the state of the environment hard.
- Simply checking the E2E flow takes a lot of steps and tools; it feels like there's always something breaking.
## Meltano
- Meltano logs are terrible to read, which makes debugging painful.
- Meltano EL takes a long time.
- Meltano configuration is painful:
  - Docs are terrible.
  - Wrong configurations often don't raise errors; they just get silently ignored.
- Meltano's handling of Python environments is sometimes more of an annoyance than a help.
## Data Pipeline needs data
- An anemic dataset makes development hard (many entities with few or no records).
## Improvable practices
- Loading all backend tables into the DW, regardless of whether they are used
  - More data, worse performance, no gain
  - More cognitive load when working on the DW ("What is this table? Where is it used? Can I modify it?")
  - "But then I have all backend tables handy" -> Well, let's make adding a backend table trivial (see the sketch after this list)
- dbt
  - Not documenting models
    - The next person has no clue what they are, which makes shared ownership hard
  - Not using exposures (the YAML entries that declare which downstream reports depend on which models)
    - Hard to know which models impact which reports
    - Hard to know which parts of the DW are truly used
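On the "make adding a backend table trivial" point: a minimal sketch, purely illustrative (table names are made up), of a declarative allowlist that a load script could read, so adding a backend table to the DW is a one-line change.

```python
# Hypothetical sketch: a declarative allowlist of backend tables to replicate.
# Adding a backend table to the DW becomes a one-line change here; whatever
# loads the data just iterates this list.
TABLES_TO_LOAD = [
    "accounts",
    "transactions",
    "wallets",
    # add new backend tables here, one line each
]
```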
## What are our bottlenecks/issues?
- Dealing with a convoluted output definition (understanding the laws + an unclear validation procedure with Vicky)
- Translating the needed reports into SQL transformations of the backend data
- Breaking changes in backend events/entities breaking downstream data dependencies
What is NOT a bottleneck:
- Data latency
- Scalability of data volume
- Having a pretty interface
## My proposal
- Fall back to a dramatically simpler stack that allows team members working on reports to move fast:
  - Hardcoded Bitfinex CSV
  - Hardcoded Sumsub CSV
  - Use another Postgres instance as the DW; move data with a simple, stateless Python script (a minimal sketch follows this list)
  - Ignore orchestration, UI delivery, and monitoring for now
- Work together with backend to find a convenient solution to have good testing data
- Make an exhaustive list of reports and align with Luis/Vicky on a plan to systematically meet and validate. Track it.
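A minimal sketch of what the stateless load script could look like, assuming pandas and SQLAlchemy are available and the two Postgres URLs come from environment variables; the `BACKEND_PG_URL`/`DW_PG_URL` names, the table allowlist, and the `raw` schema are illustrative assumptions, not our actual setup.

```python
# Minimal, stateless EL sketch: full-refresh copy of selected backend tables
# from the backend Postgres into the DW Postgres. Connection URLs, schema and
# the table list are hypothetical placeholders.
import os

import pandas as pd
from sqlalchemy import create_engine

BACKEND_URL = os.environ["BACKEND_PG_URL"]  # e.g. postgresql://user:pass@host/backend
DW_URL = os.environ["DW_PG_URL"]            # e.g. postgresql://user:pass@host/dw

TABLES_TO_LOAD = ["accounts", "transactions", "wallets"]  # illustrative allowlist


def run() -> None:
    backend = create_engine(BACKEND_URL)
    dw = create_engine(DW_URL)
    for table in TABLES_TO_LOAD:
        # Full refresh keeps the script stateless: no bookmarks, no state files.
        df = pd.read_sql_table(table, backend)
        df.to_sql(table, dw, schema="raw", if_exists="replace", index=False)
        print(f"loaded {len(df)} rows into raw.{table}")


if __name__ == "__main__":
    run()
```

The hardcoded Bitfinex/Sumsub CSVs could be loaded the same way by swapping `pd.read_sql_table` for `pd.read_csv` on those files.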
Then, once...
- We have ack from Luis/Vicky that all reports are domain-valid
- We have more clarity on integrations we must run with regulators/government systems after audit
- Backend data model is more stable
... we grab our domain-rich, deployment-poor setup and discuss the optimal tooling and strategies to deliver it with production-grade practices.
## North Star Ideas
- Use an asset-based orchestrator, like Dagster (see the toy sketch below)
- Step away from Meltano; use an EL framework such as dlt and combine it with the orchestrator
- Add a visualization tool to the stack, such as Evidence, Metabase, or Lightdash
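To make "asset-based orchestrator" concrete, a toy Dagster sketch (asset names are invented and the bodies are elided; this is not our actual pipeline) of how the raw load and a downstream report could be modeled as assets with explicit dependencies:

```python
# Toy Dagster sketch: model the raw load and a downstream report as assets.
from dagster import Definitions, asset


@asset
def raw_backend_tables():
    """Load the raw backend tables into the DW (e.g. run the stateless script)."""
    ...


@asset
def regulator_report(raw_backend_tables):
    """Build a report that depends on the raw tables."""
    ...


defs = Definitions(assets=[raw_backend_tables, regulator_report])
```

An asset graph like this gives lineage and per-asset scheduling/backfills out of the box, which is what "asset-based" buys over task-based DAGs.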