pablo 2025-10-24 14:33:36 +02:00
parent 9607c41b54
commit 18222dd2bf
No known key found for this signature in database
GPG key ID: E074F7643C9F9CC7
8 changed files with 10838 additions and 0 deletions

concourse-notes.md Normal file

@@ -0,0 +1,38 @@
# Concourse Notes
On 2025-10-07, I wanted to modify CI to build a new docker image out of `lana-bank`.
I teamed up with Justin, who gave me a nice overview of what's happening in `lana-bank`'s concourse CI. It was good both for learning about concourse and about `lana-bank`'s specifics.
There's a lot to unpack, so I'm writing some notes before all the knowledge flies away:
- Hierarchy
- Pipelines (such as `lana-bank`) contain...
- Groups (such as `lana-bank` and `nix-cache`) are dumb namespaces to group...
- Jobs (the colored boxes in the UI, which get run)
- Each job gets defined through a single YAML file. This YAML file gets generated dynamically by the scripts under `ci` named `repipe`. These `repipe` scripts use `ytt`, a templating tool. `repipe`:
- Composes the final job output
- Applies it to the pipeline to define what runs on concourse
- Note: you can actually "test in production" by running the `repipe` script while your local `fly` CLI is pointing to our concourse production instance. It will modify the actual production job definition there.
- Jobs use resources, which are external states
- State can either be an input (`get`) or an output (`put`). `get`s just get fetched, `put`s get mutated.
- Resources have a `type`, which specifies what it really means to `get` or `put` them. There are many included `type`s in concourse, but you can also build your own custom ones if needed.
- Some stuff on `ytt`
- The special marker to reference values is `#@`.
- You can create a file to drop values to keep it all tidy and then reference it in the target file (we set values in `values.yml`, then reference them in `pipeline.yml`)
- `ytt` not only provides simple value templating but also more sophisticated, Python-like (Starlark) functions that you can define and pass around.
- On resources in a job:
- On `get` resources, specifying `trigger: true` defines that any update to that resource should trigger a job run.
- Also on `get` resources, specifying `passed` while pointing to other jobs signals that a resource version should only be fetched if it has already gone through those jobs successfully. This chains jobs and prevents running downstream if upstream is failing.
- Even if you define resources with `get` and `put` in a job, you still need to declare them again as `inputs` and `outputs` within the `task` entry so they are available (technically mounted).
- This is because the `get`, `task` and `put` parts are all runnable steps. Calling `get` runs it, but doesn't make the state available to `task` by default. That's why you need `inputs` (see the sketch at the end of these notes).
- On secrets
- Many of the values that we interpolate with `ytt` are actually references to our HashiCorp `Vault`. They can be spotted because they use double parentheses `(( some_secret ))`. These get replaced at runtime in concourse. It's fine to hardcode stuff in `values.yml` in the repo, but secrets must go into the `Vault`.
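To anchor all of the above, here's a minimal sketch of the `values.yml`/`pipeline.yml` split. The resource, job and secret names are made up for illustration, and the real pipeline is obviously richer:

```yaml
#! values.yml -- plain data values, kept in their own file to stay tidy
#@data/values
---
docker_registry: registry.example.com  #! hypothetical value
```

```yaml
#! pipeline.yml -- everything prefixed with #@ is a ytt expression
#@ load("@ytt:data", "data")
---
resources:
- name: repo
  type: git  #! the type defines what it means to `get` or `put` this resource
  source:
    uri: https://github.com/example/some-repo.git
    private_key: ((github_private_key))  #! resolved from Vault at runtime

jobs:
- name: build-image
  plan:
  - get: repo
    trigger: true    #! any new version of `repo` triggers this job
    passed: [test]   #! only versions that already went through a (hypothetical) `test` job
  - task: build
    config:
      platform: linux
      image_resource:
        type: registry-image
        source:
          repository: #@ data.values.docker_registry + "/builder"
      inputs:
      - name: repo   #! declared again so the fetched state gets mounted into the task
      run:
        path: repo/ci/build.sh
  #! a `put:` step for an output resource would go here
```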

dagster-demo.md Normal file

@@ -0,0 +1,43 @@
- Intro and purpose
- We're here to show you Dagster
- With the idea of replacing Meltano and Airflow
- I want to give an overview of features and how it works
- The path forward is still not fully clear but this should give us an idea of whether we want to bother continuing this line of work
- Quick demo
- Assets and dependencies
- Materializing
- Asset checks
- Automations
- Extra
- Partitions
- Opentelemetry
- Run arbitrary pieces of code in task style, not asset style
- Architecture
- 4 components
- Responsibilities of each
- How I set it up locally
- Our repo and how this gets built
- Definitions
- Assets
- dlt replacing meltano
- Deployment
- Locally with docker compose, it's fast
- Single node or kubernetes
- Environment config and secrets
- Networking
- Pros and cons against current setup
- Pros
- Asset-centric is nicer, integration with dbt
- Sophisticated automation is possible
- Easy to model external dependencies
- More lightweight
- Logging and making sense of things is much friendlier
- No meltano for EL
- Cons
- Less popular tool
- No auth layer, must remain an internal system
- We need to do lift and shift while adjusting plumbing
- We need to rebuild (or extract) the file serving

dagster-migration-plan.md Normal file

@@ -0,0 +1,84 @@
## End goals
- Dagster is part of deployment, gets created in staging with the full E2E data pipeline running
- With schedules/automations
- And is reachable from Lana UI for reports
- Locally we can raise containers with `make dev-up`
## Milestones
- Starting point: no dagster
- Dagster is deployable in staging and locally
- Dagster gets included in the set of containers that get added to the kubernetes namespace in deployment
- The dagster webserver UI in staging is reachable locally through tunneling
- We can load a dummy code location, but still no code from our own project
- We also include dagster in local `make dev-up`
- We can build a lana dagster image with our project code and deploy it
- We build a hello-world grade docker image for dagster code automatically in CI in `lana-bank`, which gets added to our container registry
- CI bumps this image in our helm charts and the change gets deployed to staging automatically
- Dagster takes care of EL
- We move the responsibility of doing EL out of lana core-pg over from Meltano to Dagster
- This will require:
- Setting up EL in dagster
- Adjusting the dbt project's `staging` layer to stop relying on Meltano fields
- While this is in the works, we will need a code freeze in staging
- Dagster takes care of dbt execution
- We swap the responsibility of materializing the dbt DAG from Meltano to Dagster
- While this is in the works, we will need a code freeze in the dbt project
- Dagster can generate file reports
- We integrate `generate-es-reports` in Dagster so that it can generate report files
- But we don't plug it into Lana's UI just yet
- At this point we begin a code freeze in `generate-es-reports`
- Extract report files API out of Airflow
- We set up an independent microservice to handle the requesting and delivering of report files.
- Same behaviour as what the Airflow Flask plugin is doing today
- Internally, interactions with the bucket contents remain the same. The features regarding requesting and monitoring file creation must be repointed from Airflow to Dagster
- At this point we finish the code freeze in `generate-es-reports`
- Add E2E testing
- At this stage the whole pipeline is running on dagster. Right time to include tests to automate checking that everything runs
- Cleanup
- Remove any remaining old Meltano/Airflow references, code, env vars, etc. throughout our repositories
## Other
### How to add dagster to deployment?
#### Understanding how we add Airflow now
- I'm going to check how Airflow is currently set up.
- From what I understand, the right terraform bits should be spread around `galoy-private-charts` and `galoy-deployments`. Let's see what I can find.
- Okay, some notes on `galoy-private-charts` and its relationship to `galoy-deployments`:
- This repo is a Helm charts factory. We build a chart to deploy a lana-bank instance here.
- The repo defines the chart, then CI tries to check that the chart is deployable with the testflight CI job (`lana-bank-testflight`). If testflight succeeds, another CI job (`bump-lana-bank-in-deployments`) updates the chart automatically in `galoy-deployments` with a bot commit.
- Also note that some of the images used in this chart come from upstream deps. Basically, code repos like `lana-bank` build their own images, which get added into our container registry and then referenced from `galoy-private-charts`. The bumps from `lana-bank` to `galoy-private-charts` happen through CI automated commits.
#### How to add dagster
- We surely should rely on the provided helm charts to stick to their recommendations
- https://docs.dagster.io/deployment/oss/deployment-options/kubernetes/deploying-to-kubernetes
- We will need to build our own code location container and upload it to our container registry, kind of like what we're currently doing with the meltano image
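My rough guess at the values override we'd feed into the upstream chart, following the user-code-deployments pattern from the docs linked above (image repository, tag and the definitions file path are placeholders, not what we'll actually ship):

```yaml
# sketch of a values override for the upstream dagster Helm chart
# (image repository/tag and the definitions file path are placeholders)
dagster-user-deployments:
  enabled: true
  deployments:
  - name: lana-dagster-code
    image:
      repository: our-registry.example.com/lana-dagster-code
      tag: latest
      pullPolicy: Always
    dagsterApiGrpcArgs:
    - "--python-file"
    - "/opt/dagster/app/definitions.py"
    port: 3030
```

Installing would then be roughly `helm upgrade --install dagster dagster/dagster -f values-override.yaml` against a local cluster like the minikube one I set up below.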
I'm going to install minikube locally (kubectl and helm are already provided by nix in `galoy-private-charts`) to try running the helm charts without doing full tours through CI and possibly breaking testflight just to add dagster.
```
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube_latest_amd64.deb
sudo dpkg -i minikube_latest_amd64.deb
minikube start --driver=docker
kubectl get nodes
```
```
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
```
I discuss with Kartik on the daily call and he points me to this old Blink guide on how to set up a local kubernetes testing env: https://github.com/blinkbitcoin/charts/blob/main/dev/README.md
### How to add dagster to make dev-up?
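First instinct: run the standard OSS components (webserver, daemon, a dedicated Dagster postgres and our code location) as extra compose services next to the existing ones. A very rough sketch — service names, images and ports are all assumptions, and the `dagster.yaml`/`workspace.yaml` mounts it would still need are left out:

```yaml
# very rough sketch; images, names and ports are assumptions, and a
# dagster.yaml / workspace.yaml would still need to be written and mounted
services:
  dagster-postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: dagster
      POSTGRES_PASSWORD: dagster
      POSTGRES_DB: dagster

  dagster-code:  # our code location, served over gRPC
    image: lana-dagster-code:dev  # hypothetical local image
    command: ["dagster", "api", "grpc", "-h", "0.0.0.0", "-p", "4000", "-f", "definitions.py"]

  dagster-webserver:
    image: lana-dagster-code:dev  # reusing the same image keeps the sketch simple
    command: ["dagster-webserver", "-h", "0.0.0.0", "-p", "3000", "-w", "workspace.yaml"]
    ports:
    - "3000:3000"
    depends_on: [dagster-postgres, dagster-code]

  dagster-daemon:
    image: lana-dagster-code:dev
    command: ["dagster-daemon", "run"]
    depends_on: [dagster-postgres, dagster-code]
```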

data-pipeline.excalidraw Normal file

File diff suppressed because it is too large

kubernetes-cheatsheet.md Normal file

@@ -0,0 +1,12 @@
Login:
1. `gcloud compute ssh galoy-staging-bastion`
2. `gcloud auth login`
3. `kauth`
Check namespaces: `k get namespace`
Check pods in a namespace: `k get pods -n <namespace name>`
Check containers and env vars in a pod: `k describe pod <pod name> -n <namespace name>`
Delete a namespace: `k delete namespace <namespace name>`

log.md

@@ -1,5 +1,176 @@
# Log
## 2025-10-03
### Testflight and staging stuck AGAIN
- testflight
- builds 1161 and 1162 fail again after the previous ones had been working
- In build 1162, it shows this error:
```
Plan: 1 to add, 0 to change, 0 to destroy.
helm_release.lana_bank: Creating...
│ Error: installation failed
│ with helm_release.lana_bank,
│ on main.tf line 69, in resource "helm_release" "lana_bank":
│ 69: resource "helm_release" "lana_bank" {
│ cannot re-use a name that is still in use
▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ Terraform Apply ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲
Failed To Run Terraform Apply!
2025/10/03 05:05:20 Apply Error: Failed to run Terraform command: exit status 1
```
- In build 1161, it shows this error:
```
│ Warning: Helm release created with warnings
│ with helm_release.lana_bank,
│ on main.tf line 69, in resource "helm_release" "lana_bank":
│ 69: resource "helm_release" "lana_bank" {
│ Helm release "lana-bank" was created but has a failed status. Use the
`helm` command to investigate the error, correct it, then run Terraform
│ again.
│ Error: Helm release error
│ with helm_release.lana_bank,
│ on main.tf line 69, in resource "helm_release" "lana_bank":
│ 69: resource "helm_release" "lana_bank" {
│ context deadline exceeded
▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ Terraform Apply ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲
Failed To Run Terraform Apply!
2025/10/03 04:07:58 Apply Error: Failed to run Terraform command: exit status 1
```
- I go check the logs for build 1161 because it smells bad:
- the lana-bank-server container crashed with this error:
```
Error: Couldn't parse config file
Caused by:
app: unknown field `root_folder`, expected `bucket_name` at line 2 column 3
```
- Jiri jumps in and suggests that PR #2807 https://github.com/GaloyMoney/lana-bank/pull/2807 introduced some changes in the parsing of the config that are causing this problem (the config parsing has become stricter, and unknown fields now make it fail)
- He also highlights how the config in these private charts files is conflicting: storage must be either local or gcp, not both:
- https://github.com/GaloyMoney/galoy-private-charts/blob/8b28e7f213b438a2664b71c44aeb3244c303fac6/charts/lana-bank/templates/config-map.yaml#L13-L16
- https://github.com/GaloyMoney/galoy-private-charts/blob/8b28e7f213b438a2664b71c44aeb3244c303fac6/charts/lana-bank/values.yaml#L24C1-L27C26
- I propose this PR to solve this conflict: https://github.com/GaloyMoney/galoy-private-charts/pull/1168
- Jiri proposes instead to modify the template file with this snippet:
```
provider: {{ .Values.lanaBank.app.storage.provider }}
{{ if eq .Values.lanaBank.app.storage.provider "gcp" }}
bucket_name: {{ .Values.lanaBank.app.storage.bucketName }}
{{ end }}
{{ if eq .Values.lanaBank.app.storage.provider "local" }}
root_folder: {{ .Values.lanaBank.app.storage.rootFolder }}
{{ end }}
```
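With that template, the values side only needs to carry the field matching the chosen provider; something like this (bucket and folder values are illustrative):
```yaml
# when provider is gcp; bucket name is illustrative
lanaBank:
  app:
    storage:
      provider: gcp
      bucketName: some-lana-bank-bucket
---
# when provider is local; folder is illustrative
lanaBank:
  app:
    storage:
      provider: local
      rootFolder: /lana/storage
```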
- I merge his changes with this PR: https://github.com/GaloyMoney/galoy-private-charts/pull/1169
- That triggers testflight build 1163, which completes green
- Staging triggers build 1160 after that automatically
- Which fails
- because:
```
│ Error: installation failed
│ with module.lana-bank.helm_release.lana_bank,
│ on ../../../modules/lana-bank/main.tf line 87, in resource "helm_release" "lana_bank":
│ 87: resource "helm_release" "lana_bank" {
│ cannot re-use a name that is still in use
▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ Terraform Apply ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲
Failed To Run Terraform Apply!
2025/10/03 09:37:42 Apply Error: Failed to run Terraform command: exit status 1
```
- I'm tired of these helm release errors, I really need to understand what's borking this state in the cluster.
- Fuck it, I'll delete staging and trigger the CI pipeline manually again
- done, started build 1161 manually
- that does the trick, staging is up again
## 2025-09-30
Ideas for what to improve during oncall
- We need alerts for important CI stuff failing
- How
- Why don't we run bats tests in Cala Github CI?
- Lol we do
New stuff for the knowledge base
- Link to the video on how to debug on concourse
- Add to the onboarding checklist: set up Honeycomb and Zenduty, and add yourself to the oncall rotation
## 2025-09-18
### Day summary
- Today I reviewed José's PR (#2759) that changes some UUIDs for public ids in some parts of the regulatory reporting. I suggested that he improve some alias names to avoid confusion.
- I opened up PR #2765 to implement transaction details in the UIF report.
Besides that, some thoughts:
- José let me know that Meltano EL jobs from Postgres to BigQuery work as NON-incremental append. So, if the source table has records A, B, C, and you run the EL twice, the target table ends up with A, B, C, A, B, C. That's the reason the staging models have such convoluted logic: to only load the "freshest" batch from Meltano. I let José know that I find it Rube Goldberg-ian and that we definitely need to change it since it's just not acceptable, and that I have an experiment PR with Dagster on the way. He agreed we need to get rid of this.
- Nicolas discussed the offsite, making it clear that it's completely dependent on the outcome of the budget discussion. El Salvador would be ideal, but if volcano doesn't move forward, Dubai or South East Asia might be discussed as alternatives, since apparently getting to El Salvador is a bureaucratic nightmare for our Indian colleagues.
## 2025-08-13
### Meeting with Luis

pains.md Normal file

@@ -0,0 +1,70 @@
# Pains
## Local app/infra
- Starting up the local app and running E2E tests + running EL takes like 20min.
- Having the Airflow UI to run things is nice. Having Airflow schedules trigger automatically is painful locally because it makes controlling the state of the environment hard.
- Simply checking the E2E flow takes a lot of steps and tools. It feels like there's always something breaking.
## Meltano
- Meltano logs are terrible to read, make debugging painful.
- Meltano EL takes a long time
- Meltano configuration is painful:
- Docs are terrible
- Wrong configurations often don't raise errors, they just get ignored
- Meltano handling Python environments is sometimes more of an annoyance than a help
## Data Pipeline needs data
- Anemic dataset makes development hard (many entities with few or no records).
## Improvable practices
- Loading all backend tables into DW, regardless of whether they are used
- More data, worse performance, no gain
- More cognitive load when working on DW ("What is this table? Where is it used? Can I modify it?")
- "But then I have all backend tables handy" -> Well, let's make adding a backend table trivial
- dbt
- Not documenting models
- Next person has no clue what they are, makes shared ownership hard
- Not using exposures (see the sketch below)
- Hard to know what models impact what reports
- Hard to know what parts of the DW are truly used
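Model documentation and exposures are both cheap to start with in plain dbt YAML; a minimal sketch (model, report and owner names are made up):

```yaml
version: 2

models:
- name: stg_core_deposits  # hypothetical model name
  description: "One row per deposit, as loaded from the backend's core_deposits table."
  columns:
  - name: deposit_id
    description: "Public id of the deposit."

exposures:
- name: uif_transactions_report  # hypothetical report name
  type: application
  description: "Regulatory transactions report produced by generate-es-reports."
  depends_on:
  - ref('stg_core_deposits')
  owner:
    name: Data team
    email: data@example.com
```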
## What are our bottlenecks/issues?
- Dealing with a convoluted output definition (understanding laws + unclear validation procedure with Vicky)
- Translating needed report into SQL transformations of the backend data
- Breaking changes in backend events/entities that break downstream data dependencies
What is NOT a bottleneck
- Data latency
- Scalability of data volume
- Having a pretty interface
## My proposal
- Fall back to a dramatically simpler stack that allows team members working on reports to move fast
- Hardcoded Bitfinex CSV
- Hardcoded Sumsub CSV
- Use another PG as DW, move data with a simple, stateless Python script
- Ignore orchestration, UI delivery, monitoring for now
- Work together with backend to find a convenient solution to have good testing data
- Make an exhaustive list of reports and align with Luis/Vicky on a plan to systematically meet and validate. Track it.
Then, once...
- We have ack from Luis/Vicky that all reports are domain-valid
- We have more clarity on integrations we must run with regulators/government systems after audit
- Backend data model is more stable
... we grab our domain rich, deployment poor setup and we discuss what is the optimal tooling and strategies to deliver with production-grade practices.
## North Star Ideas
- Use an asset based orchestrator, like dagster
- Step away from meltano, use an EL framework such as dlt and combine it with the orchestrator
- Add a visualization tool to the stack, such as evidence, metabase, lightdash

uif_07_notes.md Normal file

@@ -0,0 +1,84 @@
I'm trying to understand which tables in lana's backend I need to look at to find transactions.
These are my first candidates:
- `core_deposit_accounts` ( + `_events` + `_rollup` )
- `core_deposits` ( + `_events` + `_rollup` )
- `core_withdrawals` ( + `_events` + `_rollup` )
- sumsub tables for personal details of the
I would like to perform this flow and find the created data along the way in the DB:
- Create a customer (`alice@alice.com`)
- DB: find her ID
- DB: check that a deposit account got created for her
- Make a deposit of 999 USD into her deposit account
- Make a withdrawal of 333 USD
## Field mapping
---
null as numeroRegistroBancario,
Don't know what the best ID for this is. It should be an int but the example shows otherwise. I would put whatever we have and then fight it out with UIF.
---
null as estacionServicio,
DTO with branch details. Should have hardcoded data for Volcano bank HQ.
- direccionAgencia: Address where the transaction is performed (string)
- idDepartamento: Department based on catalog (string)
- idMunicipio: Municipality based on catalog (string)
---
null as fechaTransaccion,
I would assume this is the date when we confirm the transaction. I guess that would match the posting date in the accounting book.
---
null as tipoPersonaA,
I think this is actually stored in Lana when creating the customer.
---
null as detallesPersonaA,
I need to dig into the info we have in Sumsub
---
null as tipoPersonaB,
Same as persona A.
---
null as detallesPersonaB,
Same approach as detallesPersonaA.
---
null as numeroCuentaPO,
public id for the deposit account
---
null as claseCuentaPO,
Probably some hardcoded value, such as "Cuenta corriente" or the like.
Applies to the Persona Ordenante (personaA in the above fields).
---
null as conceptoTransaccionPO,
In our withdrawals, we can get the reference.
On a deposit, it must be fetched from... somewhere?
---
null as valorOtrosMediosElectronicosPO,
Money amount.
---
null as numeroProductoPB,
Somehow appear in Sumsub I guess?
---
null as claseCuentaPB,
Somehow appear in Sumsub I guess?
---
null as montoTransaccionPB,
Somehow appear in Sumsub I guess?
---
null as valorMedioElectronicoPB,
Somehow appear in Sumsub I guess?
---
null as bancoCuentaDestinatariaPB
Bank name. Where do we get this from?
---
## Q&A
- Where do we get an integer ID for the accounts to be reported in the transactions?
- `core_deposit_accounts` has a public id for each account.
- What state should we report?
- Deposits
- Reversibility
- Which transactions can be reversed? What are the time limits? How does it get managed accounting wise?