# 20240621-01 - Failure of Core full-refresh Airbyte jobs

Managed by: Pablo

## Summary

- Components involved: dbt, Airbyte, dwh-prd
- Started at: *2024-06-21 3:09AM CEST*
- Detected at: *2024-06-21 3:09AM CEST*
- Mitigated at: *2024-06-21 11:35AM CEST*

Tests around dbt materialization that Pablo ran in production on 2024-06-20, combined with a lack of proper clean-up afterwards, caused Airbyte to fail its full-refresh loads from Core because it could not drop the target tables. This left the affected tables and their dependents stale in the DWH for around 9 calendar hours (4 business hours).

## Impact

The following tables were not refreshed on the nightly run of 2024-06-21, leaving their data stale at the 2024-06-20 state:

| Source | Schema | Table |
| --- | --- | --- |
| Core (SQL Server - Live) | `Integration` | `Integration` |
| Core (SQL Server - Live) | `Integration` | `IntegrationType` |
| Core (SQL Server - Live) | `dbo` | `Country` |
| Core (SQL Server - Live) | `dbo` | `Currency` |
| Core (SQL Server - Live) | `dbo` | `PaymentStatus` |
| Core (SQL Server - Live) | `dbo` | `PricePlanChargedByType` |
| Core (SQL Server - Live) | `dbo` | `User` |
| Core (SQL Server - Live) | `dbo` | `UserVerificationStatus` |
| Core (SQL Server - Live) | `dbo` | `VerificationPaymentType` |
| Core (SQL Server - Live) | `dbo` | `VerificationStatus` |

## Timeline

Timezone: CEST

| Time | Event |
| --- | --- |
| 2024-06-21 03:00 | A scheduled job (ID: 4544) of the Airbyte sync `Superhog - Live - integration → dwh-prd (Full Refresh)` begins. |
| 2024-06-21 03:09 | After 5 failed attempts, job 4544 is marked as failed and a warning is sent to the Slack channel `#data-alerts`. |
| 2024-06-21 06:00 | A scheduled job (ID: 4552) of the Airbyte sync `Superhog - Live -dbo → dwh-prd (Full-refresh models)` begins. |
| 2024-06-21 06:31 | After 5 failed attempts, job 4552 is marked as failed and a warning is sent to the Slack channel `#data-alerts`. |
| 2024-06-21 08:00 | The regular, scheduled `dbt run` happens normally. Since it runs with the usual setting of materializing `staging` models as `table`, it destroys the dirty views left over from the previous day. |
| 2024-06-21 09:50 | Pablo picks up the alerts and research begins. |
| 2024-06-21 11:21 | Pablo triggers the failed syncs manually. |
| 2024-06-21 11:26 | The syncs finish successfully. |
| 2024-06-21 11:31 | Pablo triggers a `dbt run` manually. |
| 2024-06-21 11:35 | The `dbt run` finishes successfully. End of the incident. |

## Root Cause(s)

- On 2024-06-20, Pablo materialized some dbt `staging` layer models in the DWH as views instead of tables.
- The views were left in the DWH.
- The subsequent full-refresh Airbyte jobs on the `sync_core` schema failed when trying to run `DROP` on the `sync_core` tables, which by then had dependent views, because Airbyte runs `DROP`, not `DROP ... CASCADE`. This is usually not a problem since we materialize `staging` as `table`, which creates no dependency relationship between the `sync_core` tables and their `staging` counterparts.

## Resolution and recovery

- The scheduled `dbt run` job at 08:00 CEST deleted the dangling views from the previous day and brought the DWH back to using tables in the `staging` layer.
- With the views gone, it was only necessary to manually re-trigger the failed Airbyte syncs to bring the outdated `sync_core` tables back up to date.
- After that, a `dbt run` was manually triggered as well to bring all the dependent models back up to date.
- All jobs ran successfully and the DWH was restored to a healthy state without issues.
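The failure mode can be reproduced with a few lines of SQL against any Postgres database. This is a minimal sketch; the table and view names below are illustrative, not the actual DWH objects:

```sql
-- Simulate a sync table with a dependent view (names are hypothetical).
CREATE TABLE sync_core_demo (id int);
CREATE VIEW stg_demo AS SELECT id FROM sync_core_demo;

-- A plain DROP, as issued during an Airbyte full refresh, fails because
-- Postgres tracks the dependency:
--   ERROR: cannot drop table sync_core_demo because other objects depend on it
DROP TABLE sync_core_demo;

-- DROP ... CASCADE would succeed, but it also silently removes the
-- dependent view, which is why it is not a safe default:
DROP TABLE sync_core_demo CASCADE;
```

With `staging` materialized as `table`, dbt copies the data instead of defining a view on top of it, so no dependency is recorded and the plain `DROP` succeeds.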
## Lessons Learned

- What went well
    - Alerts made sure we picked up the problem fast.
- What went badly
    - The manual testing left dirt in the DWH.
- Where we got lucky
    - The state of the `dbt` project between Pablo's testing and the next morning allowed the next scheduled `dbt run` to automatically remove all the undesired views from the DWH, making resolution as simple as re-running everything. Things could have gone differently: some views could have been left in the DWH, which would have made recovery more complex and error-prone.

General lesson: don't test in production.

The tests Pablo was running on 2024-06-20 explored turning the materialization of our `staging` layer in the DWH from the current `table` into `view`. This incident was a small sample of why that is not as simple as initially expected, given how Airbyte's full-refresh behaviour interacts with the Postgres `DROP` and `DROP ... CASCADE` commands. The following GitHub issue shows other people having the same discussion: https://github.com/airbytehq/airbyte/issues/35386

That discussion led to new developments in the Airbyte Postgres destination which enabled the features needed to achieve our goal. This doc page explains them: https://docs.airbyte.com/integrations/destinations/postgres#creating-dependent-objects

It will be necessary to run version updates on Airbyte to pick these up.

## Action Items

- [ ] Design a way to easily replicate the production DWH in order to minimize the need to run tests there.
- [ ] Update Airbyte to pick up the new versions of the Postgres connector, then plan, test and implement the change of the `staging` layer materialization strategy from `table` to `view` properly.

## Appendix

Logs of the failed Airbyte sync jobs:
[default_workspace_job_4552_attempt_5_txt](default_workspace_job_4552_attempt_5_txt.txt)

[default_workspace_job_4544_attempt_5_txt](default_workspace_job_4544_attempt_5_txt.txt)

Slack alerts by Airbyte:

![Untitled](Untitled%2045.png)