# 20241211-01 - DWH scheduled execution has not been launched

Managed by: Uri and Pablo

## Summary

- Components involved: Airbyte VM, Airbyte, dbt, xexe, anaxi, DWH
- Started at: 2024-12-11 05:00:00 UTC
- Detected at: 2024-12-11 07:41:00 UTC
- Mitigated at: 2024-12-11 09:48:00 UTC

Unusually high resource consumption by Airbyte left the Airbyte VM unresponsive for about 5 hours due to lack of memory. The jobs that several of our data platform components run on that machine did not run. We rebooted the machine and re-ran all pending work.

## Impact

The nightly loading and refreshing of data in the DWH was delayed by about 5 hours. This means reporting was stale for business users for around 3 hours during working time (assuming nobody looks at PBI reports at 6 AM. Maybe Joan?).

## Timeline

All times are UTC.

| Time | Event |
| --- | --- |
| 2024-12-11 04:00:00 | The Airbyte VM jumps from having 1.5 GB of free memory to almost none (~50 MB). CPU usage also jumps from ~0% to ~50% and stays stuck there. A sync job for the stream SQL Server incremental to DWH starts (job ID: 19235), but communication with the worker container is lost at 04:01:16. A sync job for the stream SQL Server full refresh to DWH starts (job ID: 19237), but communication with the worker container is lost at 04:01:16. A sync job for the stream Stripe UK to DWH starts (job ID: 19236), but communication with the worker container is lost at 04:01:06. |
| 2024-12-11 04:01:17 | Airbyte jobs scheduled to begin from this point onwards do not start due to lack of resources. The same applies to all cron jobs on the machine, including the dbt, anaxi and xexe jobs. |
| 2024-12-11 07:41:00 | Uri notices Main KPIs are not updated with 10th December data. A check of Data Alerts shows that no alert has been raised. After checking Data Receipts, Uri confirms that the expected scheduled run has not been executed. |
| 2024-12-11 07:43:00 | A message is sent in the Data channel to notify users of an ongoing incident. |
| 2024-12-11 07:50:00 | Uri tries to connect to the SH Data Airbyte machine, unsuccessfully. |
| 2024-12-11 07:56:00 | Something happened around 4 AM UTC: Airbyte resource consumption has fallen to a minimum and stagnated since then. Compared with previous days, this looks out of the ordinary. |
| 2024-12-11 08:07:00 | It looks like a networking issue, but we cannot be 100% sure. Uri suggests restarting the Airbyte machine, although that might not be the best approach. Waiting for Pablo since he’s the expert. |
| 2024-12-11 08:30:00 | Pablo comes in and looks at the situation. He identifies the lack of available RAM in the Airbyte VM and assumes that Airbyte has consumed all available resources and locked the VM in doing so. |
| 2024-12-11 08:47:00 | Pablo triggers a reboot of the Airbyte VM, which completes successfully in a couple of minutes. Memory is freed as part of it, and the VM and container services become responsive once again. |
| 2024-12-11 08:49:00 | Multiple Airbyte jobs start again to catch up with the missed runs. |
| 2024-12-11 09:36:00 | Pablo starts triggering the missed xexe, anaxi and dbt jobs. |
| 2024-12-11 10:25:00 | All due jobs are completed and the DWH state is up to date. |
| | End of mitigation |

## Root Cause(s)

Multiple Airbyte jobs were triggered to run at 04:00:00 UTC. It seems the workload produced by the data volume on 2024-12-11 was enough to chew through all the RAM in the VM and bring it to a deadlocked state. This cascaded into none of the jobs running on the VM (Airbyte, dbt, anaxi and xexe) working until mitigation was put in place.

## Resolution and recovery

We brought things back to normal by rebooting the Airbyte VM so that the machine would stop being deadlocked. Some pending jobs started by themselves. Others were triggered manually.

## **Lessons Learned**

- What went well
    - Azure dashboards allowed us to identify the resource bottleneck easily.
    - The team is alert and notices fishy behaviour fast, even when there are no alerts.
    - Our logs allowed us to understand clearly what ran and what didn’t.
- What went badly
    - We almost forgot to re-run the xexe and anaxi jobs.
    - We had no alerts. We aren’t testing for freshness the right way, so the DWH can go stale without warning.
- Where did we get lucky
    - We got lucky that this had not happened before.

## Action Items

- [x] Change current schedules in Airbyte to avoid the 04:00 AM memory usage peak.
    - Stripe UK and Superhog full refresh have been shifted by a few minutes (25 and 35 minutes later than the original schedule).
- [ ] Discuss the implementation of dbt source freshness tests.
- [ ] Research ways to prevent Airbyte from sucking up all available memory, or at least notify when it happens.

## Appendix
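
As input for the freshness discussion in the action items, the sketch below illustrates the kind of check that could have raised an alert during this incident: compare the time of the most recent load against a maximum allowed lag. It is a minimal illustration in plain Python, not dbt's source freshness feature. The function name `is_stale`, the 26-hour threshold and the last-load timestamp are hypothetical and only serve the example.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_loaded_at: datetime, now: datetime, max_lag: timedelta) -> bool:
    """Return True when the most recent load is older than the allowed lag."""
    return now - last_loaded_at > max_lag


# Hypothetical values: assume the previous nightly load finished on
# 10 December at 04:30 UTC and the check runs at the detection time.
last_loaded_at = datetime(2024, 12, 10, 4, 30, tzinfo=timezone.utc)
checked_at = datetime(2024, 12, 11, 7, 41, tzinfo=timezone.utc)
max_lag = timedelta(hours=26)  # assumed threshold: one daily run plus some slack

if is_stale(last_loaded_at, checked_at, max_lag):
    print("DWH data is stale: raise an alert")
else:
    print("DWH data is fresh")
```

A check like this only helps if it runs somewhere other than the Airbyte VM, since cron jobs on that machine did not start while it was locked.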