sh-notion/notion_data_team_no_files/20240902-01 - Missing payment details in intermedi f2067416c0824fc686513937b3fbca78.md
Pablo Martin a256b48b01 pages
2025-07-11 16:15:17 +02:00

3.9 KiB
Raw Permalink Blame History

20240902-01 - Missing payment details in intermediate

Missing payment details in intermediate

Managed by: Pablo

Summary

  • Components involved: dbt run, dbt test and airbyte
  • Started at: Unknown, probably months ago
  • Detected at: 2024-08-31, 08:16AM CEST
  • Mitigated at: 2024-09-02, 11:57AM CEST

The simultaneous trigger of dbt run and Airbytes incremental jobs at the same time every morning had been leading to data integrity issues in the DWH for a long time. Our current release of scheduled dbt test just a few minutes after our scheduled dbt run made the problem obvious and we finally got to fix it.

Impact

For months, a handful of verification payments were missing money amounts, both in local currency and GBP. This might have made total revenue figures in some reports insignificantly wrong (deviations being smaller than 0.1%).

Timeline

All reported times are in CEST timezone.

Time Event
Some time around March 2024 Pablo implements the int_core__verification_payments model in the dbt project.
Some time around August 2024 As we work on implementing scheduled dbt tests, we recurringly observe issues with some not null tests on model int_core__verification_payments . The behaviour is flaky, so no special attention is paid at first.
2024-08-31 08:16AM A dbt test fails, showing some null value issues in money related columns in the model int_core__verification_payments
2024-09-01 08:16AM A dbt test fails, showing some null value issues in money related columns in the model int_core__verification_payments
2024-09-02 08:16AM A dbt test fails, showing some null value issues in money related columns in the model int_core__verification_payments
2024-09-02 09:00AM Data team notices the issue (previous alerts happened on the weekend) and starts investigating.
2024-09-02 11:30AM Pablo spots the possible issue and confirms by triggering another dbt run NOT at a oclock and re-running dbt test.
2024-09-02 11:35AM Pablo changes the schedule of dbt run and dbt test to ensure that the dbt run doesnt clash with Airbyte jobs and that the dbt test runs clearly after dbt run.
End of the incident.

Root Cause(s)

Some Airbyte jobs were running simultaneously with the scheduled runs of our dbt project. This meant the dbt run executions were happening as the sync layer of some sources were being populated. Because of this, the dbt run wasnt running on a consistent snapshot of the sync layer. This caused referential integrity issues downstream.

Resolution and recovery

Every day the problem fixed itself for the previous day issues. So, each day we only suffered some row issues from the current day.

The same-day recovery was achieved by simply running dbt run when Airbyte jobs were NOT running.

The proper resolution was achieved by re-arranging the schedule of jobs so that dbt run does not happen at the same time as Airbyte.

Lessons Learned

  • What went well
    • Our dbt test were super useful to spot this happening every day.
  • What went badly
    • This issue seems to have existed for 6 months. The business impact was tiny, but how long this was alive is highly concerning.
  • Where did we get lucky
    • We got lucky in the impact being tiny.

More generally, this is a first symptom of our home-made-bash-orchestration starting to show some wrinkles. A more sophisticated orchestration engine should enable us to link together airbyte and dbt executions, which would allow to prevent this kind of issues and also be more smart in our ELT (lower latency, less redundant jobs, reasonable management downstream when something upstream fails, etc).

No need to rush into it, but should be taken into account.

Action Items

  • Educate the team on scheduling patterns.
  • Ensure the new orchestration engine deployment gets the right priority.

Appendix