sh-notion/notion_data_team_no_files/20240621-01 - Failure of Core full-refresh Airbyte 4b308fa051694afe89c8f7147ce5ed27.md

98 lines
5.8 KiB
Markdown
Raw Permalink Normal View History

2025-07-11 16:15:17 +02:00
# 20240621-01 - Failure of Core full-refresh Airbyte jobs
## Failure of Core full-refresh Airbyte jobs
Managed by: Pablo
## Summary
- Components involved: dbt, Airbyte, dwh-prd
- Started at: *2024-06-21 3:09AM CEST*
- Detected at: *2024-06-21 3:09AM CEST*
- Mitigated at: *2024-06-21 11:35AM CEST*
Some tests around dbt materialization performed in production by Pablo on 2024-06-20, plus a lack of proper clean-up after them, derived in Airbyte failing to run full-refresh loads from Core due to not being able to delete the tables properly. This left the affected tables and their dependants in DWH for around 9 natural hours/4 business hours.
## Impact
The following tables were not refreshed on the nightly run of 2024-06-21, making data remain stale and outdate to the 2024-06-20 state:
| Source | Schema | Table |
| --- | --- | --- |
| Core (SQL Server - Live) | `Integration` | `Integration` |
| Core (SQL Server - Live) | `Integration` | `IntegrationType` |
| Core (SQL Server - Live) | `dbo` | `Country` |
| Core (SQL Server - Live) | `dbo` | `Currency` |
| Core (SQL Server - Live) | `dbo` | `PaymentStatus` |
| Core (SQL Server - Live) | `dbo` | `PricePlanChargedByType` |
| Core (SQL Server - Live) | `dbo` | `User` |
| Core (SQL Server - Live) | `dbo` | `UserVerificationStatus` |
| Core (SQL Server - Live) | `dbo` | `VerificationPaymentType` |
| Core (SQL Server - Live) | `dbo` | `VerificationStatus` |
## Timeline
Timezone: CEST
| Time | Event |
| --- | --- |
| 2024-06-21 03:00 | A scheduled job (ID: 4544) of the Airbyte sync `Superhog - Live - integration → dwh-prd (Full Refresh)` begins. |
| 2024-06-21 03:09 | After 5 failed attempts, job 4544 is marked as failed and a warning is sent to the Slack channel `#data-alerts` |
| 2024-06-21 06:00 | A scheduled job (ID: 4552) of the Airbyte sync `Superhog - Live -dbo → dwh-prd (Full-refresh models)` begins. |
| 2024-06-21 06:31 | After 5 failed attempts, job 4552 is marked as failed and a warning is sent to the Slack channel `#data-alerts` |
| 2024-06-21 08:00 | The regular, scheduled `dbt run` happens normally. Since its running with the usual setting of materializing `staging` models as `table`, it destroy the dirty views that were left from the previous day. |
| 2024-06-21 09:50 | Pablo picks up the alerts and research begins. |
| 2024-06-21 11:21 | Pablo triggers the failed syncs manually. |
| 2024-06-21 11:26 | The syncs have executed successfully. |
| 2024-06-21 11:31 | Pablo triggers a `dbt run` manually. |
| 2024-06-21 11:35 | The `dbt run` finishes successfully. |
| | End of the incident. |
## Root Cause(s)
- Pablo ran some `dbt`, `staging` layer models in the DWH as views instead of as tables on 2024-06-20.
- The views were left in the DWH.
- The following full-refresh Airbyte jobs on the `sync_core` schema failed upon trying to run `DROP` on the `sync_core` tables that then had dependant views, for Airbyte is running `DROP`, not `DROP CASCADE`. This is not a problem usually since we materialize `staging` as `table`, because that creates no dependency relationship between the `sync_core` tables and their `staging` table counterparties.
## Resolution and recovery
- The scheduled `dbt run` job that ran on 8:00 CEST deleted the dangling views from the previous day and brought the DWH back to using tables in the `staging` layer.
- Without the views in place, it was only necessary to manually trigger the failed Airbyte syncs again to bring the `sync_core` tables that had been outdated back to being up to date.
- After that, the `dbt run` was manually triggered as well to bring all the dependant models back to being up to date.
- All jobs ran successfully and the DWH was brought back to perfect state without issues.
## **Lessons Learned**
- What went well
- Alerts made sure we picked up the problem fast
- What went badly
- The manual testing left dirt in the DWH
- Where did we get lucky
- The context of the `dbt` project since Pablo did the testing until the next morning allowed for the upcoming scheduled `dbt run` to automatically remove all the undesired views from the DWH, making the job of resolving the incident as simple as re-running everything. But things could have been different and some views could have been left in the DWH, which would have made recovery more complex and error-prone.
General lesson: dont test in production.
The tests Pablo was running on 2024-06-20 were related to how to turn our `staging` layer in DWH from being materialized as `table` currently to `view` instead. This incident was a small sample of how this wouldnt work as simply as initially expected given the nature of Airbyte full-refresh behaviour, in combination with Postgres `DROP` and `DROP CASCADE` commands.
The following Github issue shows other people having the same discussion: https://github.com/airbytehq/airbyte/issues/35386
That discussion lead to new developments in Postgres which enabled the features that would be necessary to achieve our goal. This doc page explains the features: https://docs.airbyte.com/integrations/destinations/postgres?_gl=1*vst8mh*_gcl_au*MzcyNzc0OTUzLjE3MTc1MDM2MDI.#creating-dependent-objects
It will be necessary to run version updates on Airbyte to achieve this.
## Action Items
- [ ] Design a way to easily replicate the production DWH in order to minimize the need to run tests there.
- [ ] Update Airbyte to pick up the new versions of the Postgres connector and plan, test and implement the change of the `staging` layer materialization strategy from `table` to `view` properly.
## Appendix
Logs of the failed Airbyte sync jobs.
[default_workspace_job_4552_attempt_5_txt](default_workspace_job_4552_attempt_5_txt.txt)
[default_workspace_job_4544_attempt_5_txt](default_workspace_job_4544_attempt_5_txt.txt)
Slack alerts by Airbyte:
![Untitled](Untitled%2045.png)