While onboarding, I've inevitably looked at the data pipeline with fresh eyes, and some parts of the stack feel strange for the requirements we must satisfy with Lana.
This doc briefly lays out what parts of the stack I suggest we rethink.
## Brief review of current situation
### Needs we are covering
Currently, we serve one very clear requirement with the data pipeline: building the generated report files and delivering them through the UI.
Potentially, we could also think about having a "batteries included" BI attitude with Lana, where Lana gets bundled with an ingest+transform+present stack that lets a deployment see some useful business reports on how the bank is doing. This is not a strict requirement for Volcano but rather a nice-to-have feature addition that is completely up to us. It's worth discussing whether this makes sense, depending on how likely it is that most of our customers already have a data warehouse or wider data platform running.
### How we are doing it
We:
- Ingest from the app db into BigQuery with Meltano.
- Run SQL-defined transformations on the data in BigQuery with dbt (a tiny illustrative example below).
- Generate report files via Python scripts, store them in GCS buckets, and make them accessible through the UI.
- Orchestrate the whole thing with a mix of scheduled runs and UI-triggered actions.
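To make the dbt step concrete, here is a minimal sketch of the kind of SQL-defined transformation dbt runs for us; the model and column names (`stg_credit_facilities`, `facility_amount`) are hypothetical, not taken from our repo.

```sql
-- Hypothetical dbt model: aggregate facilities per customer.
-- dbt resolves {{ ref(...) }} to the upstream table/view in the target DW,
-- so a model like this compiles against BigQuery, Postgres or Snowflake alike.
select
    customer_id,
    count(*)             as facility_count,
    sum(facility_amount) as total_facility_amount
from {{ ref('stg_credit_facilities') }}
group by customer_id
```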
## Parts that feel great, parts that feel odd
Great:
- Meltano as a choice to do the Extract&Load between the app db and any DW.
- dbt as a choice for managing the tangle of SQL-defined transformations.
- Not trying to do these transformations in any way within the Rust codebase.
Odd:
- BigQuery choice
  - We are coupled to BigQuery, a GCP-specific tool that is not viable for Volcano, and likely wouldn't be for many other potential clients out there who are not GCP shops.
  - BigQuery feels overkill for the data volume one would expect in greenfield banks, unless they have extremely ambitious growth targets.
Going for Postgres:
- Raise another Postgres instance within the deployments and use it to replace what BigQuery is covering right now.
  - Tiny side note: perhaps we could leverage the same Postgres instance we already have for Meltano and Airflow data, just adding a new database there to act as the DW. There are pros and cons to that.
- Pros
  - We can easily include it in CI jobs, tests, etc. ("More control".)
  - Probably the most popular db out there, so pretty much all tooling has connectors and integrations for it.
  - Opens up the possibility to do Extract and Load (EL) between app and DW with many of the available Postgres replication options and tools (see the sketch after this list). Could be interesting if we ever face extremely low latency (under 1 min) needs.
- Cons
  - We need to move from BQ to Postgres. Change all SQL, configs, deployment, etc.
  - It's not strictly built for DW needs and will become challenging if some deployment scales a lot in data volume.
  - May not look flashy to corp IT mgmt who expect all the fancy tools.
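As a minimal sketch of the replication-based EL idea, using plain Postgres logical replication (table name and connection details are hypothetical; this is nothing we have set up today):

```sql
-- On the app (source) database: publish the tables we want in the DW.
-- Requires wal_level = logical on the source instance.
CREATE PUBLICATION lana_dw_pub FOR TABLE customers;

-- On the DW (target) Postgres instance: subscribe to that publication.
-- The target table must already exist with a matching schema.
CREATE SUBSCRIPTION lana_dw_sub
    CONNECTION 'host=app-db dbname=lana user=replicator password=secret'
    PUBLICATION lana_dw_pub;
```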
Going for Snowflake:
- Pros
  - Allows us to deliver a highly scalable DW (the big question here: will we ever need to?).
  - Comes with a lot of additional tooling and goodies around the DW.
  - Makes us look serious (old-fashioned IT mgmt at large corps probably feels more comfortable hearing that we use Snowflake rather than Postgres, even if that might be pointless or detrimental).
- Cons
  - We need to move from BQ to Snowflake. Change all SQL, configs, deployment, etc. (see the dialect sketch after this list).
  - We need to plug ourselves into it for all CI and local env work. Might be more convoluted than simply raising a Postgres container.
  - We are coupling our data stack to one vendor, even if it's the top dog of its niche.
  - Although it's very popular in the data/BI niche, it is far from being as popular as Postgres. Connectors and integrations with other tools might not always exist.
  - EL jobs from the app will always need to be driven by external tools like Meltano. Snowflake states that it can do CDC replication (https://docs.snowflake.com/en/connectors/postgres6/configure-replication), but we would need to verify whether that is as performant as Postgres-to-Postgres replication.
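To illustrate the "change all SQL" cost that shows up in both the Postgres and Snowflake options, here is the same trivial calculation in each dialect (table and column names are made up):

```sql
-- BigQuery (what we have today):
SELECT TIMESTAMP_DIFF(closed_at, opened_at, DAY) AS days_open FROM facilities;

-- Snowflake:
SELECT DATEDIFF('day', opened_at, closed_at) AS days_open FROM facilities;

-- Postgres:
-- (note: the three are not even semantically identical for partial days)
SELECT EXTRACT(DAY FROM closed_at - opened_at) AS days_open FROM facilities;
```

Most of our dbt models would need this kind of per-dialect review whichever way we go.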
- Potential customers who are existing large orgs will already have a DW/data platform. I don't think they expect us to run that for them; rather, they will have a big interest in Lana being easy to extract data from.
- Small, greenfield banking projects can probably survive, data-volume wise, with Postgres for quite some time. Unless they expect millions of events per day super early, we would probably not need a highly scalable DW like BQ/Snowflake from the get-go (back-of-envelope, with an assumed ~1 KB per event: even 1 million events/day is on the order of 1 GB/day, well within what a single Postgres instance handles comfortably).
- I think I'm up to date with company goals, but maybe you want to give me your view?
- Manage audit vs data work
"One year ago we decide to throw all our eggs in one basket"
"Maybe in three months we have nothing to do"
Currently El Salvador only has a commercial banking law, not an investment banking law. It keeps getting pushed back because of compliance with the IMF. The law might be approved any time.
Teenage Sex
Opportunities to do something in the US with the team that started Silvergate
Before it was Fulgur and Tether, but Tether dropped out earlier this year.
- IBEX fun story
- I'm surprised by the length of the tenures
- About the 1 or 2 engineers per quarter comment you made, how come?
Vitaly, major shareholder of Bitfinex, 10K, 50K BTC
John Carlo, guy owning Tether
Get in the calls with Vicky ASAP
I like the Wild West, Godfather 2 in Cuba feeling of this project
- Working out where to change in Terraform to get BigQuery env
- Just managed to set up local env
- Had some issues with nix
- Filling in onboarding details
Onboarding stuff
- I've started and merged this PR to onboarding (on-call): https://github.com/GaloyMoney/onboarding/pull/18
- Started this one, still pending review (code-review): https://github.com/GaloyMoney/onboarding/pull/17
### Chat with Sebastien
From Sebastien
If you haven't figured this out yet, the data flows like this:
staging (stg_* files) -> intermediate (int_* files) -> output (misc. but often report_* files)
Under staging, the 'rollups' folder hosts the "rolled up" source data from the backend, raw.
Under intermediate, the 'rollups' folder hosts the expanded & type-cast version of the above raw data and should be the source of all (most?) of our transformations.
The backend is architected to stream events for most "objects"/"entities"
(as concisely explained here https://www.youtube.com/watch?v=lg6aF5PP4Tc)
and so the rollups are snapshots of the state of those "objects"/"entities" as one-row-per-entity tables (chronological reduce / event summarization), as mentioned in the video around 4:20.
The reduce process is backend-side, done for now as triggers on the "objects"/"entities" events tables in PG, and visible as SQL migrations under lana/app/migrations/<date>_*_events_rollup.sql. I think you should be familiar with that process given the interview take-home, but that's what we adopted for now and it's all automated...
so if a field name changes in the backend, for example, and the pipeline falls out of sync, it breaks and we can address it, rather than failing silently.
So anyways, all the concepts of the bank and data we might be interested in analyzing, reporting on, etc. are implicitly documented by the above 3 sets of rollup SQL.
The credit facility object is probably the most interesting and easy to start with as it is a vanilla bank loan or dumbed down line of credit.
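To make the trigger-based rollup pattern Sebastien describes more concrete, here is a minimal sketch under assumed table and column names; the real migrations under lana/app/migrations/<date>_*_events_rollup.sql will look different, this is only the shape of the idea.

```sql
-- Hypothetical event table and rollup; real names/columns will differ.
-- The trigger reduces the append-only event stream into a
-- one-row-per-entity snapshot on every insert.
CREATE TABLE credit_facility_events (
    id          UUID  NOT NULL,
    sequence    INT   NOT NULL,
    event_type  TEXT  NOT NULL,
    payload     JSONB NOT NULL,
    recorded_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (id, sequence)
);

CREATE TABLE credit_facility_events_rollup (
    id            UUID PRIMARY KEY,
    status        TEXT,
    amount        NUMERIC,
    last_sequence INT,
    updated_at    TIMESTAMPTZ
);

CREATE OR REPLACE FUNCTION credit_facility_events_rollup_fn()
RETURNS TRIGGER AS $$
BEGIN
    INSERT INTO credit_facility_events_rollup (id, status, amount, last_sequence, updated_at)
    VALUES (
        NEW.id,
        NEW.payload ->> 'status',
        (NEW.payload ->> 'amount')::NUMERIC,
        NEW.sequence,
        NEW.recorded_at
    )
    ON CONFLICT (id) DO UPDATE SET
        -- keep the previous value when the incoming event doesn't carry the field
        status        = COALESCE(EXCLUDED.status, credit_facility_events_rollup.status),
        amount        = COALESCE(EXCLUDED.amount, credit_facility_events_rollup.amount),
        last_sequence = EXCLUDED.last_sequence,
        updated_at    = EXCLUDED.updated_at;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER credit_facility_events_rollup_trg
AFTER INSERT ON credit_facility_events
FOR EACH ROW EXECUTE FUNCTION credit_facility_events_rollup_fn();
```

Meltano then only has to sync the rollup tables into the DW, which is what makes a field rename break loudly instead of silently.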
### Coffee with Kartik
- Where are you based?
- Bangalore, been there for two years
- How long have you been around?
- 5 years, started out when it was just Nicolas and him for engineering?