Pablo Martin a256b48b01 pages
2025-07-11 16:15:17 +02:00


20240919-01 - dbt test failure because wrong configuration in schema file


Managed by: Uri

Summary

  • Components involved: data-dwh-dbt-project
  • Started at: 2024-09-18 12:41 CEST
  • Detected at: 2024-09-19 08:43 CEST
  • Mitigated at: 2024-09-19 09:01 CEST
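As a quick worked check of the figures above, the durations between the timestamps can be computed directly. This is a minimal Python sketch. It takes "Started at" to be the time of the faulty commit, as the summary defines it; measured from the 16:30 merge instead, detection would take 16h 13m. All timestamps are CEST, so naive datetimes are enough.

```python
from datetime import datetime, timedelta

# Timestamps copied from the summary (all CEST, so no timezone handling is needed).
FMT = "%Y-%m-%d %H:%M"
started = datetime.strptime("2024-09-18 12:41", FMT)
detected = datetime.strptime("2024-09-19 08:43", FMT)
mitigated = datetime.strptime("2024-09-19 09:01", FMT)


def fmt_minutes(delta: timedelta) -> str:
    """Render a timedelta as 'Xh YYm', dropping seconds."""
    total = int(delta.total_seconds() // 60)
    return f"{total // 60}h {total % 60:02d}m"


time_to_detect = detected - started      # faulty commit -> alert noticed
time_to_mitigate = mitigated - detected  # alert noticed -> successful re-run

print("time to detect:", fmt_minutes(time_to_detect))      # 20h 02m
print("time to mitigate:", fmt_minutes(time_to_mitigate))  # 0h 18m
```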

Summary

On 18 September, a buggy change was committed and merged into master without anyone noticing. During the scheduled production run on the morning of the 19th, dbt test failed because it could not compile one of the tests. The fix was to remove the faulty configuration from the schema entry of core__bookings, merge, and re-run dbt test in production.

Impact

Low impact: the only failure was a single test on the core__bookings model in the reporting layer.

Timeline

  • 2024-09-18 12:41 CEST - Faulty commit 923bfa70 is created

  • 2024-09-18 16:30 CEST - The branch containing the faulty commit (pull request !2877) is merged into production

  • 2024-09-19 08:43 CEST - The data team sees the alert in the #data-alerts Slack channel

  • 2024-09-19 08:48 CEST - The data team checks the production logs of the dbt tests and finds the failure:

    Compilation Error in test not_nullgit_core__bookings_id_booking (models/reporting/core/schema.yml)

  • 2024-09-19 08:51 CEST - The faulty commit 923bfa70 is identified. Uri opens a PR to fix the issue.

  • 2024-09-19 09:00 CEST - The fix is merged into production in commit feaedb2a

  • 2024-09-19 09:01 CEST - A re-run of the dbt tests including the fix is launched and succeeds.

Root Cause(s)

An accidental edit changed a line in the schema entry of core__bookings: in the tests section for id_booking, the not_null test became not_nullgit p. This change, in commit 923bfa70, was made after data team members had reviewed and approved PR !2877, so it went unnoticed.
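A cheap guard against this class of typo is a lint step that checks every test name in a schema file against an allow-list before dbt runs at all. This is an illustrative sketch, not something the project has: it assumes the schema file has already been loaded into a Python dict (for example with PyYAML's yaml.safe_load), and the KNOWN_TESTS allow-list and the helper names are hypothetical. A real allow-list would also need the project's custom generic tests and any package tests. The two sample schemas are reduced to the single id_booking entry from this incident.

```python
from typing import Any, Dict, List, Set, Tuple

# Hypothetical allow-list: dbt's four built-in generic tests. A real list would
# also need the project's custom generic tests and any package tests.
KNOWN_TESTS: Set[str] = {"not_null", "unique", "accepted_values", "relationships"}


def entry_name(entry: Any) -> str:
    """A test entry is either a bare name ('not_null') or a one-key mapping
    such as {'accepted_values': {'values': [...]}}."""
    if isinstance(entry, str):
        return entry
    if isinstance(entry, dict) and len(entry) == 1:
        return next(iter(entry))
    raise ValueError(f"unexpected test entry: {entry!r}")


def find_unknown_tests(schema: Dict[str, Any],
                       known: Set[str] = KNOWN_TESTS) -> List[Tuple[str, str, str]]:
    """Return (model, column, test) triples whose test name is not in `known`."""
    problems = []
    for model in schema.get("models") or []:
        for column in model.get("columns") or []:
            # Column tests live under `tests:`; newer dbt versions also accept `data_tests:`.
            entries = (column.get("tests") or []) + (column.get("data_tests") or [])
            for entry in entries:
                name = entry_name(entry)
                if name not in known:
                    problems.append((model["name"], column["name"], name))
    return problems


# Reduced versions of the core__bookings entry before and after the fix.
faulty_schema = {"models": [{"name": "core__bookings",
                             "columns": [{"name": "id_booking", "tests": ["not_nullgit p"]}]}]}
fixed_schema = {"models": [{"name": "core__bookings",
                            "columns": [{"name": "id_booking", "tests": ["not_null"]}]}]}

print(find_unknown_tests(faulty_schema))  # [('core__bookings', 'id_booking', 'not_nullgit p')]
print(find_unknown_tests(fixed_schema))   # []
```

A lint like this only catches misspelled test names; it does not replace compiling the project, which is what actually surfaced the error here.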


Resolution and recovery

The fix was straightforward: change not_nullgit p back to not_null, merge into production, and re-run the dbt tests, which then succeeded.

Lessons Learned

What went well:

  • dbt test alerts work well, and the team promptly checks the channel once an alert is raised.

What went badly:

  • Insufficient self-review and cross-review of code before merging. Personally, I didn't check the PR because Pablo had already approved it, but that approval came before the faulty commit. We should all be more careful and sceptical when merging into production, especially when we have already left an approval on the PR.

Where did we get lucky:

  • Minimal impact: it was a single failing test in the reporting schema, and that test would have passed anyway. The situation could have been worse if the same kind of bug had landed directly in model code.

Action Items

  • Review and re-review PRs regardless of whether they have already been approved.
  • Check any commits pushed after a PR has been approved.
  • When merging into production, run both the regular dbt execution (run_dbt.sh) and the tests (run_tests.sh). This would have surfaced the issue early.
  • Automate CI checks on the dbt project: compile the project (and perhaps also run the tests) on every PR, and block merging if this fails.
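The last action item could start as a small gate script that a CI job runs on every PR. This is a minimal sketch under stated assumptions: it assumes the dbt CLI is on the PATH of the CI runner and that plain dbt compile and dbt test are the right commands. The real targets, selectors and profiles, and the way the CI system blocks the merge, depend on the team's setup. The runner parameter exists only so the logic can be exercised without dbt installed.

```python
import subprocess
import sys
from typing import Callable, Sequence

# Assumed CI steps: compile the whole project, then run the tests.
# Real commands would likely add targets/selectors for the team's environment.
CI_STEPS = [
    ["dbt", "compile"],
    ["dbt", "test"],
]


def run_ci_gate(steps: Sequence[Sequence[str]],
                runner: Callable[..., subprocess.CompletedProcess] = subprocess.run) -> int:
    """Run each step in order and stop at the first failure.

    Returns the failing step's exit code, or 0 if every step passed, so a
    non-zero result fails the CI job and can be used to block the merge.
    """
    for step in steps:
        print("running:", " ".join(step), flush=True)
        result = runner(list(step))
        if result.returncode != 0:
            print("FAILED:", " ".join(step), file=sys.stderr)
            return result.returncode
    return 0


if __name__ == "__main__":
    sys.exit(run_ci_gate(CI_STEPS))
```

Because this incident surfaced as a compilation error, the compile step alone would be expected to fail the PR. Running the tests as well also catches data-level failures, at the cost of giving CI access to the warehouse.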