From 170e50c30cada036b6c66fc9ae21d91ecc52bcbf Mon Sep 17 00:00:00 2001
From: pablo
Date: Fri, 25 Jul 2025 14:22:10 +0200
Subject: [PATCH] data arc ramblings

---
 data-arch.md | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)
 create mode 100644 data-arch.md

diff --git a/data-arch.md b/data-arch.md
new file mode 100644
index 0000000..7a37958
--- /dev/null
+++ b/data-arch.md
@@ -0,0 +1,42 @@
+# Data architecture ramblings
+
+Some notes and thoughts after my first week.
+
+## Why a DW?
+
+- Why do we need a DW?
+- You could make a case that generating the file-based reports through the app could be solved simply by a queue + job system, which would probably fit easily within the current capabilities of `lana`. Maybe.
+- Are there more requirements that I'm not considering here?
+- I'm guessing that reporting in general will always be needed, but I'm concerned about whether we should support deployment for that. Our ICP will typically already have a DW of its own. Although a batteries-included approach may be appreciated in some situations, I would guess that the ICP will almost always simply expect to be able to ingest the `lana` data into its own DW.
+- Perhaps volcano is a special case because they're starting out, but should we then make sure we tell apart what is `lana` (volcano-agnostic) from the additional stuff we build specifically for volcano?
+
+## SQL Engine choice
+
+- Why BQ? Why Snowflake?
+- Does it need to be multiple engines? Is it worth it? Can we do it? What are the risks of sticking to only one?
+- Why not another postgres instance in the `lana` deployment?
+
+## Visualization layer
+
+- Do we want to bring a batteries-included approach to visualizing data about the bank? Should that be a responsibility of the app UI, or of an additional reporting solution? Or should we embed reporting within the app?
+
+## Other stuff
+
+- data contracts
+- dbt unit testing
+- data integration testing
+- more solid development data?
+
+
+## If you asked meTM
+
+- Add another postgres instance to the deployment service
+- Use that as the DW
+- Do pg2pg EL and transform there
+- This makes deployment, testing, etc. much easier
+- If we run into a multi-engine requirement in a rush, approach model building with "write in Postgres SQL, transpile to XYZ". Potentially, add testing
+- About what's volcano-only vs. what's `lana` in general
+  - Either have one dbt project and use folders/tagging
+  - Or do multiple projects
+  - Or just build a monster now and slice it in the future as needed
+- Optionally, add visualization to the stack
\ No newline at end of file