98 lines
5.8 KiB
Markdown
98 lines
5.8 KiB
Markdown
|
|
# 20240621-01 - Failure of Core full-refresh Airbyte jobs
|
|||
|
|
|
|||
|
|
## Failure of Core full-refresh Airbyte jobs
|
|||
|
|
|
|||
|
|
Managed by: Pablo
|
|||
|
|
|
|||
|
|
## Summary
|
|||
|
|
|
|||
|
|
- Components involved: dbt, Airbyte, dwh-prd
|
|||
|
|
- Started at: *2024-06-21 3:09AM CEST*
|
|||
|
|
- Detected at: *2024-06-21 3:09AM CEST*
|
|||
|
|
- Mitigated at: *2024-06-21 11:35AM CEST*
|
|||
|
|
|
|||
|
|
Some tests around dbt materialization performed in production by Pablo on 2024-06-20, plus a lack of proper clean-up after them, derived in Airbyte failing to run full-refresh loads from Core due to not being able to delete the tables properly. This left the affected tables and their dependants in DWH for around 9 natural hours/4 business hours.
|
|||
|
|
|
|||
|
|
## Impact
|
|||
|
|
|
|||
|
|
The following tables were not refreshed on the nightly run of 2024-06-21, making data remain stale and outdate to the 2024-06-20 state:
|
|||
|
|
|
|||
|
|
| Source | Schema | Table |
|
|||
|
|
| --- | --- | --- |
|
|||
|
|
| Core (SQL Server - Live) | `Integration` | `Integration` |
|
|||
|
|
| Core (SQL Server - Live) | `Integration` | `IntegrationType` |
|
|||
|
|
| Core (SQL Server - Live) | `dbo` | `Country` |
|
|||
|
|
| Core (SQL Server - Live) | `dbo` | `Currency` |
|
|||
|
|
| Core (SQL Server - Live) | `dbo` | `PaymentStatus` |
|
|||
|
|
| Core (SQL Server - Live) | `dbo` | `PricePlanChargedByType` |
|
|||
|
|
| Core (SQL Server - Live) | `dbo` | `User` |
|
|||
|
|
| Core (SQL Server - Live) | `dbo` | `UserVerificationStatus` |
|
|||
|
|
| Core (SQL Server - Live) | `dbo` | `VerificationPaymentType` |
|
|||
|
|
| Core (SQL Server - Live) | `dbo` | `VerificationStatus` |
|
|||
|
|
|
|||
|
|
## Timeline
|
|||
|
|
|
|||
|
|
Timezone: CEST
|
|||
|
|
|
|||
|
|
| Time | Event |
|
|||
|
|
| --- | --- |
|
|||
|
|
| 2024-06-21 03:00 | A scheduled job (ID: 4544) of the Airbyte sync `Superhog - Live - integration → dwh-prd (Full Refresh)` begins. |
|
|||
|
|
| 2024-06-21 03:09 | After 5 failed attempts, job 4544 is marked as failed and a warning is sent to the Slack channel `#data-alerts` |
|
|||
|
|
| 2024-06-21 06:00 | A scheduled job (ID: 4552) of the Airbyte sync `Superhog - Live -dbo → dwh-prd (Full-refresh models)` begins. |
|
|||
|
|
| 2024-06-21 06:31 | After 5 failed attempts, job 4552 is marked as failed and a warning is sent to the Slack channel `#data-alerts` |
|
|||
|
|
| 2024-06-21 08:00 | The regular, scheduled `dbt run` happens normally. Since it’s running with the usual setting of materializing `staging` models as `table`, it destroy the dirty views that were left from the previous day. |
|
|||
|
|
| 2024-06-21 09:50 | Pablo picks up the alerts and research begins. |
|
|||
|
|
| 2024-06-21 11:21 | Pablo triggers the failed syncs manually. |
|
|||
|
|
| 2024-06-21 11:26 | The syncs have executed successfully. |
|
|||
|
|
| 2024-06-21 11:31 | Pablo triggers a `dbt run` manually. |
|
|||
|
|
| 2024-06-21 11:35 | The `dbt run` finishes successfully. |
|
|||
|
|
| | End of the incident. |
|
|||
|
|
|
|||
|
|
## Root Cause(s)
|
|||
|
|
|
|||
|
|
- Pablo ran some `dbt`, `staging` layer models in the DWH as views instead of as tables on 2024-06-20.
|
|||
|
|
- The views were left in the DWH.
|
|||
|
|
- The following full-refresh Airbyte jobs on the `sync_core` schema failed upon trying to run `DROP` on the `sync_core` tables that then had dependant views, for Airbyte is running `DROP`, not `DROP CASCADE`. This is not a problem usually since we materialize `staging` as `table`, because that creates no dependency relationship between the `sync_core` tables and their `staging` table counterparties.
|
|||
|
|
|
|||
|
|
## Resolution and recovery
|
|||
|
|
|
|||
|
|
- The scheduled `dbt run` job that ran on 8:00 CEST deleted the dangling views from the previous day and brought the DWH back to using tables in the `staging` layer.
|
|||
|
|
- Without the views in place, it was only necessary to manually trigger the failed Airbyte syncs again to bring the `sync_core` tables that had been outdated back to being up to date.
|
|||
|
|
- After that, the `dbt run` was manually triggered as well to bring all the dependant models back to being up to date.
|
|||
|
|
- All jobs ran successfully and the DWH was brought back to perfect state without issues.
|
|||
|
|
|
|||
|
|
## **Lessons Learned**
|
|||
|
|
|
|||
|
|
- What went well
|
|||
|
|
- Alerts made sure we picked up the problem fast
|
|||
|
|
- What went badly
|
|||
|
|
- The manual testing left dirt in the DWH
|
|||
|
|
- Where did we get lucky
|
|||
|
|
- The context of the `dbt` project since Pablo did the testing until the next morning allowed for the upcoming scheduled `dbt run` to automatically remove all the undesired views from the DWH, making the job of resolving the incident as simple as re-running everything. But things could have been different and some views could have been left in the DWH, which would have made recovery more complex and error-prone.
|
|||
|
|
|
|||
|
|
General lesson: don’t test in production.
|
|||
|
|
|
|||
|
|
The tests Pablo was running on 2024-06-20 were related to how to turn our `staging` layer in DWH from being materialized as `table` currently to `view` instead. This incident was a small sample of how this wouldn’t work as simply as initially expected given the nature of Airbyte full-refresh behaviour, in combination with Postgres `DROP` and `DROP CASCADE` commands.
|
|||
|
|
|
|||
|
|
The following Github issue shows other people having the same discussion: https://github.com/airbytehq/airbyte/issues/35386
|
|||
|
|
|
|||
|
|
That discussion lead to new developments in Postgres which enabled the features that would be necessary to achieve our goal. This doc page explains the features: https://docs.airbyte.com/integrations/destinations/postgres?_gl=1*vst8mh*_gcl_au*MzcyNzc0OTUzLjE3MTc1MDM2MDI.#creating-dependent-objects
|
|||
|
|
|
|||
|
|
It will be necessary to run version updates on Airbyte to achieve this.
|
|||
|
|
|
|||
|
|
## Action Items
|
|||
|
|
|
|||
|
|
- [ ] Design a way to easily replicate the production DWH in order to minimize the need to run tests there.
|
|||
|
|
- [ ] Update Airbyte to pick up the new versions of the Postgres connector and plan, test and implement the change of the `staging` layer materialization strategy from `table` to `view` properly.
|
|||
|
|
|
|||
|
|
## Appendix
|
|||
|
|
|
|||
|
|
Logs of the failed Airbyte sync jobs.
|
|||
|
|
|
|||
|
|
[default_workspace_job_4552_attempt_5_txt](default_workspace_job_4552_attempt_5_txt.txt)
|
|||
|
|
|
|||
|
|
[default_workspace_job_4544_attempt_5_txt](default_workspace_job_4544_attempt_5_txt.txt)
|
|||
|
|
|
|||
|
|
Slack alerts by Airbyte:
|
|||
|
|
|
|||
|
|

|