sh-notion/notion_data_team_no_files/20240902-01 - Missing payment details in intermedi f2067416c0824fc686513937b3fbca78.md
Pablo Martin a256b48b01 pages
2025-07-11 16:15:17 +02:00

68 lines
No EOL
3.9 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# 20240902-01 - Missing payment details in intermediate
# Missing payment details in intermediate
Managed by: Pablo
## Summary
- Components involved: dbt run, dbt test and airbyte
- Started at: Unknown, probably months ago
- Detected at: 2024-08-31, 08:16AM CEST
- Mitigated at: 2024-09-02, 11:57AM CEST
The simultaneous trigger of `dbt run` and Airbytes incremental jobs at the same time every morning had been leading to data integrity issues in the DWH for a long time. Our current release of scheduled `dbt test` just a few minutes after our scheduled `dbt run` made the problem obvious and we finally got to fix it.
## Impact
For months, a handful of verification payments were missing money amounts, both in local currency and GBP. This might have made total revenue figures in some reports insignificantly wrong (deviations being smaller than 0.1%).
## Timeline
All reported times are in CEST timezone.
| Time | Event |
| --- | --- |
| Some time around March 2024 | Pablo implements the `int_core__verification_payments` model in the `dbt` project. |
| Some time around August 2024 | As we work on implementing scheduled `dbt` tests, we recurringly observe issues with some `not null` tests on model `int_core__verification_payments` . The behaviour is flaky, so no special attention is paid at first. |
| 2024-08-31 08:16AM | A `dbt test` fails, showing some null value issues in money related columns in the model `int_core__verification_payments` |
| 2024-09-01 08:16AM | A `dbt test` fails, showing some null value issues in money related columns in the model `int_core__verification_payments` |
| 2024-09-02 08:16AM | A `dbt test` fails, showing some null value issues in money related columns in the model `int_core__verification_payments` |
| 2024-09-02 09:00AM | Data team notices the issue (previous alerts happened on the weekend) and starts investigating. |
| 2024-09-02 11:30AM | Pablo spots the possible issue and confirms by triggering another `dbt run` NOT at a oclock and re-running `dbt test`. |
| 2024-09-02 11:35AM | Pablo changes the schedule of `dbt run` and `dbt test` to ensure that the `dbt run` doesnt clash with Airbyte jobs and that the `dbt test` runs clearly after `dbt run`. |
| | End of the incident. |
## Root Cause(s)
Some Airbyte jobs were running simultaneously with the scheduled runs of our dbt project. This meant the `dbt run` executions were happening as the `sync` layer of some sources were being populated. Because of this, the `dbt run` wasnt running on a consistent snapshot of the `sync` layer. This caused referential integrity issues downstream.
## Resolution and recovery
Every day the problem fixed itself for the previous day issues. So, each day we only suffered some row issues from the current day.
The same-day recovery was achieved by simply running `dbt run` when Airbyte jobs were NOT running.
The proper resolution was achieved by re-arranging the schedule of jobs so that `dbt run` does not happen at the same time as Airbyte.
## **Lessons Learned**
- What went well
- Our `dbt test` were super useful to spot this happening every day.
- What went badly
- This issue seems to have existed for 6 months. The business impact was tiny, but how long this was alive is highly concerning.
- Where did we get lucky
- We got lucky in the impact being tiny.
More generally, this is a first symptom of our home-made-bash-orchestration starting to show some wrinkles. A more sophisticated orchestration engine should enable us to link together airbyte and dbt executions, which would allow to prevent this kind of issues and also be more smart in our ELT (lower latency, less redundant jobs, reasonable management downstream when something upstream fails, etc).
No need to rush into it, but should be taken into account.
## Action Items
- [ ] Educate the team on scheduling patterns.
- [ ] Ensure the new orchestration engine deployment gets the right priority.
## Appendix
-