7.1 KiB
20241119-01 - CheckIn Cover multi-price problem (again)
CheckIn Cover multi-price problem (again)
Managed by: Pablo
Summary
- Components involved: SQL Server, DWH, superhog-mono-app codebase
- Started at: 2024-11-18 12:49:06 UTC
- Detected at: 2024-11-19 06:34:16 UTC
- Mitigated at: 2024-11-19 17:30:00 UTC
A new stored procedure released on 2024-11-18 mistakenly added records in live.dbo.PaymentValidationSetToCurrency with values 0 for CIH prices and covers. This caused a dimensionality issue in the DWH, which lead to duplicate records in DWH and bogus reporting for CIH in PBI, with inflated sales numbers and other affected data points. Besides that, a seeding script from the application that doesn’t respect the UpdatedDate column of live.dbo.PaymentValidationSetToCurrency caused data drift between SQL Server and DWH, which increased investigation complexity and generated the need for backfills.
This incident is a very close reoccurrence of this one from June: 20240619-01 - CheckIn Cover multi-price problem. The underlying design mistakes that act as a root cause are common across both incidents.
Impact
CIH reporting in the DWH has been displaying incorrect figures for 11 hours. This includes data such as revenue totals, sales counts, funnel and conversion rates metrics, and individual sales records displaying wrong prices.
Timeline
All times are UTC.
| Time | Event |
|---|---|
| Sometime before 2024-11-18 12:49:06 | A release was made on the Superhog backend, which added the migration 202411121235595_CreateCustomBundle.cs |
| 2024-11-18 12:49:06 | Faulty records with 0 value for CIH price and cover got added to live.dbo.PaymentValidationSetToCurrency. We suspect they were added by the stored procedure CreateCustomBundle. |
| 2024-11-18 13:00:10 | One of the hourly Airbyte jobs that syncs between SQL Server and the DWH caught the faulty records and copied them over into the DWH. |
| At some unknown time between 2024-11-18 13:00:10 and 2024-11-19 06:15:00 | The seeding script for CIH prices and covers runs in SQL Server, overriding the faulty records with 0 values. |
| 2024-11-19 06:15:00 | A dbt run was triggered, propagating the faulty records in downstream models and breaking the granularity of some models with duplicate record. From this point on, data in the DWH and the reading PBI reports was wrong. |
| 2024-11-19 06:34:16 | A data test was triggered due to duplicate records in reporting.core__vr_checkin_cover breaking the PK. Data team starts investigating. |
| 2024-11-19 14:00:00 | Pablo realises the issue looks like a duplicate of 20240619-01 - CheckIn Cover multi-price problem. This drives him to quickly spot and confirm the data drift and the faulty records. |
| 2024-11-19 15:30:00 | Pablo discusses with Lawrence and the root cause of the issue is identified. |
| 2024-11-19 17:30:00 | An Airbyte + dbt backfill to fix the data drift and remove the faulty records finishes. From this point on, data in the DWH and PBI is correct again. |
| Incident mitigated. |
Root Cause(s)
The root cause is a combination of the following:
- The true, core root cause is that business logic for CIH across the company assumes that CIH has a single, global price across all Superhog for each currency. Despite this, the database actually allows for different prices per platform user. This design is not fit for our business logic and allows for incidents like this to happen. Should this be redesigned to properly reflect our business logic, neither this incident nor 20240619-01 - CheckIn Cover multi-price problem would have happened.
- In the case of this incident, the trigger of the issue was that the uniqueness of price values per currency in
live.dbo.PaymentValidationSetToCurrencywas not respected by the stored procedure202411121235595_CreateCustomBundle.cs, which set the values for CIH prices and covers of some accounts to0. - This cascaded into breaking the uniqueness of the primary key of table
dwh.intermediate.int_core__check_in_cover_pricesin the DWH, which led to duplicate records in downstream tables related to CIH, and to wrong data being displayed in PBI reports. - Besides that, a seeding script that updates CIH price and cover values ran on top of
live.dbo.PaymentValidationSetToCurrency, overriding prices without respecting theUpdatedDatecolumn. This caused data drift across the DWH and SQL Server.
Resolution and recovery
The short mitigation consisted on:
- The wrong,
0valued records inlive.dbo.PaymentValidationSetToCurrencywhere accidentally reverted back to their proper prices. - Performing a backfill of the table
PaymentValidationSetToCurrencyon Airbyte so that thesynclayer table would stop having duplicated prices. - Execute a
dbt runon the DWH to propagate the fixed data.
Lessons Learned
- What went well
- Automated data alerts in DWH helped us notice the incident fast.
- The post-mortem from the previous incident accelerated a lot investigation and resolution. It made it easy to understand what was happening and fix it, even though the incident is rather tricky as it has many moving parts.
- What went badly
- Our inadequate design for the CIH logic in the backend keeps biting us back.
- The complexity and shared boundaries across squads are causing us to step on each others toes (a change made by the new dash squad changes behaviours on the domain of the guest squad in an uncontrolled way).
- We didn’t take action from the stuff we learned from the previous incident of this type back in June, and so the issues keep on appearing.
- Where did we get lucky
- The CIH prices seeding script fixed the wrong values inserted by the new migration added by the Dash Squad. We removed the wrong values due to sheer luck.
Action Items
- Fix the stored procedure
CreateCustomBundledefined in the migration202411121235595_CreateCustomBundle.csso that it stops creatingPaymentValidationSetToCurrencyrecords with prices different that the canonical ones.- The exact lines that cause the issue can be found here
- Modify the CIH prices seeding script so that it respects the
UpdatedDatecolumn, preventing future data drifts. - Add more specific data tests in the DWH to spot this issue faster (we can add a test that is still not there and that would give away that this issue is happening instantly)
Appendix
Link to the previous occurrence of this issue: 20240619-01 - CheckIn Cover multi-price problem