sh-notion/notion_data_team_no_files/20240621-01 - Failure of Core full-refresh Airbyte 4b308fa051694afe89c8f7147ce5ed27.md
Pablo Martin a256b48b01 pages
2025-07-11 16:15:17 +02:00

5.8 KiB
Raw Blame History

20240621-01 - Failure of Core full-refresh Airbyte jobs

Failure of Core full-refresh Airbyte jobs

Managed by: Pablo

Summary

  • Components involved: dbt, Airbyte, dwh-prd
  • Started at: 2024-06-21 3:09AM CEST
  • Detected at: 2024-06-21 3:09AM CEST
  • Mitigated at: 2024-06-21 11:35AM CEST

Some tests around dbt materialization performed in production by Pablo on 2024-06-20, plus a lack of proper clean-up after them, derived in Airbyte failing to run full-refresh loads from Core due to not being able to delete the tables properly. This left the affected tables and their dependants in DWH for around 9 natural hours/4 business hours.

Impact

The following tables were not refreshed on the nightly run of 2024-06-21, making data remain stale and outdate to the 2024-06-20 state:

Source Schema Table
Core (SQL Server - Live) Integration Integration
Core (SQL Server - Live) Integration IntegrationType
Core (SQL Server - Live) dbo Country
Core (SQL Server - Live) dbo Currency
Core (SQL Server - Live) dbo PaymentStatus
Core (SQL Server - Live) dbo PricePlanChargedByType
Core (SQL Server - Live) dbo User
Core (SQL Server - Live) dbo UserVerificationStatus
Core (SQL Server - Live) dbo VerificationPaymentType
Core (SQL Server - Live) dbo VerificationStatus

Timeline

Timezone: CEST

Time Event
2024-06-21 03:00 A scheduled job (ID: 4544) of the Airbyte sync Superhog - Live - integration → dwh-prd (Full Refresh) begins.
2024-06-21 03:09 After 5 failed attempts, job 4544 is marked as failed and a warning is sent to the Slack channel #data-alerts
2024-06-21 06:00 A scheduled job (ID: 4552) of the Airbyte sync Superhog - Live -dbo → dwh-prd (Full-refresh models) begins.
2024-06-21 06:31 After 5 failed attempts, job 4552 is marked as failed and a warning is sent to the Slack channel #data-alerts
2024-06-21 08:00 The regular, scheduled dbt run happens normally. Since its running with the usual setting of materializing staging models as table, it destroy the dirty views that were left from the previous day.
2024-06-21 09:50 Pablo picks up the alerts and research begins.
2024-06-21 11:21 Pablo triggers the failed syncs manually.
2024-06-21 11:26 The syncs have executed successfully.
2024-06-21 11:31 Pablo triggers a dbt run manually.
2024-06-21 11:35 The dbt run finishes successfully.
End of the incident.

Root Cause(s)

  • Pablo ran some dbt, staging layer models in the DWH as views instead of as tables on 2024-06-20.
  • The views were left in the DWH.
  • The following full-refresh Airbyte jobs on the sync_core schema failed upon trying to run DROP on the sync_core tables that then had dependant views, for Airbyte is running DROP, not DROP CASCADE. This is not a problem usually since we materialize staging as table, because that creates no dependency relationship between the sync_core tables and their staging table counterparties.

Resolution and recovery

  • The scheduled dbt run job that ran on 8:00 CEST deleted the dangling views from the previous day and brought the DWH back to using tables in the staging layer.
  • Without the views in place, it was only necessary to manually trigger the failed Airbyte syncs again to bring the sync_core tables that had been outdated back to being up to date.
  • After that, the dbt run was manually triggered as well to bring all the dependant models back to being up to date.
  • All jobs ran successfully and the DWH was brought back to perfect state without issues.

Lessons Learned

  • What went well
    • Alerts made sure we picked up the problem fast
  • What went badly
    • The manual testing left dirt in the DWH
  • Where did we get lucky
    • The context of the dbt project since Pablo did the testing until the next morning allowed for the upcoming scheduled dbt run to automatically remove all the undesired views from the DWH, making the job of resolving the incident as simple as re-running everything. But things could have been different and some views could have been left in the DWH, which would have made recovery more complex and error-prone.

General lesson: dont test in production.

The tests Pablo was running on 2024-06-20 were related to how to turn our staging layer in DWH from being materialized as table currently to view instead. This incident was a small sample of how this wouldnt work as simply as initially expected given the nature of Airbyte full-refresh behaviour, in combination with Postgres DROP and DROP CASCADE commands.

The following Github issue shows other people having the same discussion: https://github.com/airbytehq/airbyte/issues/35386

That discussion lead to new developments in Postgres which enabled the features that would be necessary to achieve our goal. This doc page explains the features: https://docs.airbyte.com/integrations/destinations/postgres?_gl=1vst8mh_gcl_au*MzcyNzc0OTUzLjE3MTc1MDM2MDI.#creating-dependent-objects

It will be necessary to run version updates on Airbyte to achieve this.

Action Items

  • Design a way to easily replicate the production DWH in order to minimize the need to run tests there.
  • Update Airbyte to pick up the new versions of the Postgres connector and plan, test and implement the change of the staging layer materialization strategy from table to view properly.

Appendix

Logs of the failed Airbyte sync jobs.

default_workspace_job_4552_attempt_5_txt

default_workspace_job_4544_attempt_5_txt

Slack alerts by Airbyte:

Untitled