68 lines
No EOL
3.9 KiB
Markdown
68 lines
No EOL
3.9 KiB
Markdown
# 20240902-01 - Missing payment details in intermediate
|
||
|
||
# Missing payment details in intermediate
|
||
|
||
Managed by: Pablo
|
||
|
||
## Summary
|
||
|
||
- Components involved: dbt run, dbt test and airbyte
|
||
- Started at: Unknown, probably months ago
|
||
- Detected at: 2024-08-31, 08:16AM CEST
|
||
- Mitigated at: 2024-09-02, 11:57AM CEST
|
||
|
||
The simultaneous trigger of `dbt run` and Airbyte’s incremental jobs at the same time every morning had been leading to data integrity issues in the DWH for a long time. Our current release of scheduled `dbt test` just a few minutes after our scheduled `dbt run` made the problem obvious and we finally got to fix it.
|
||
|
||
## Impact
|
||
|
||
For months, a handful of verification payments were missing money amounts, both in local currency and GBP. This might have made total revenue figures in some reports insignificantly wrong (deviations being smaller than 0.1%).
|
||
|
||
## Timeline
|
||
|
||
All reported times are in CEST timezone.
|
||
|
||
| Time | Event |
|
||
| --- | --- |
|
||
| Some time around March 2024 | Pablo implements the `int_core__verification_payments` model in the `dbt` project. |
|
||
| Some time around August 2024 | As we work on implementing scheduled `dbt` tests, we recurringly observe issues with some `not null` tests on model `int_core__verification_payments` . The behaviour is flaky, so no special attention is paid at first. |
|
||
| 2024-08-31 08:16AM | A `dbt test` fails, showing some null value issues in money related columns in the model `int_core__verification_payments` |
|
||
| 2024-09-01 08:16AM | A `dbt test` fails, showing some null value issues in money related columns in the model `int_core__verification_payments` |
|
||
| 2024-09-02 08:16AM | A `dbt test` fails, showing some null value issues in money related columns in the model `int_core__verification_payments` |
|
||
| 2024-09-02 09:00AM | Data team notices the issue (previous alerts happened on the weekend) and starts investigating. |
|
||
| 2024-09-02 11:30AM | Pablo spots the possible issue and confirms by triggering another `dbt run` NOT at a o’clock and re-running `dbt test`. |
|
||
| 2024-09-02 11:35AM | Pablo changes the schedule of `dbt run` and `dbt test` to ensure that the `dbt run` doesn’t clash with Airbyte jobs and that the `dbt test` runs clearly after `dbt run`. |
|
||
| | End of the incident. |
|
||
|
||
## Root Cause(s)
|
||
|
||
Some Airbyte jobs were running simultaneously with the scheduled runs of our dbt project. This meant the `dbt run` executions were happening as the `sync` layer of some sources were being populated. Because of this, the `dbt run` wasn’t running on a consistent snapshot of the `sync` layer. This caused referential integrity issues downstream.
|
||
|
||
## Resolution and recovery
|
||
|
||
Every day the problem fixed itself for the previous day issues. So, each day we only suffered some row issues from the current day.
|
||
|
||
The same-day recovery was achieved by simply running `dbt run` when Airbyte jobs were NOT running.
|
||
|
||
The proper resolution was achieved by re-arranging the schedule of jobs so that `dbt run` does not happen at the same time as Airbyte.
|
||
|
||
## **Lessons Learned**
|
||
|
||
- What went well
|
||
- Our `dbt test` were super useful to spot this happening every day.
|
||
- What went badly
|
||
- This issue seems to have existed for 6 months. The business impact was tiny, but how long this was alive is highly concerning.
|
||
- Where did we get lucky
|
||
- We got lucky in the impact being tiny.
|
||
|
||
More generally, this is a first symptom of our home-made-bash-orchestration starting to show some wrinkles. A more sophisticated orchestration engine should enable us to link together airbyte and dbt executions, which would allow to prevent this kind of issues and also be more smart in our ELT (lower latency, less redundant jobs, reasonable management downstream when something upstream fails, etc).
|
||
|
||
No need to rush into it, but should be taken into account.
|
||
|
||
## Action Items
|
||
|
||
- [ ] Educate the team on scheduling patterns.
|
||
- [ ] Ensure the new orchestration engine deployment gets the right priority.
|
||
|
||
## Appendix
|
||
|
||
- |