sh-notion/notion_data_team_no_files/20241119-01 - CheckIn Cover multi-price problem (a 1430446ff9c98088b547dfb0baff6024.md
Pablo Martin a256b48b01 pages
2025-07-11 16:15:17 +02:00

77 lines
No EOL
7.1 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# 20241119-01 - CheckIn Cover multi-price problem (again)
# CheckIn Cover multi-price problem (again)
Managed by: Pablo
## Summary
- Components involved: SQL Server, DWH, superhog-mono-app codebase
- Started at: 2024-11-18 12:49:06 UTC
- Detected at: 2024-11-19 06:34:16 UTC
- Mitigated at: 2024-11-19 17:30:00 UTC
A new stored procedure released on 2024-11-18 mistakenly added records in `live.dbo.PaymentValidationSetToCurrency` with values `0` for CIH prices and covers. This caused a dimensionality issue in the DWH, which lead to duplicate records in DWH and bogus reporting for CIH in PBI, with inflated sales numbers and other affected data points. Besides that, a seeding script from the application that doesnt respect the `UpdatedDate` column of `live.dbo.PaymentValidationSetToCurrency` caused data drift between SQL Server and DWH, which increased investigation complexity and generated the need for backfills.
This incident is a very close reoccurrence of this one from June: [20240619-01 - CheckIn Cover multi-price problem](20240619-01%20-%20CheckIn%20Cover%20multi-price%20problem%20fabd174c34324292963ea52bb921203f.md). The underlying design mistakes that act as a root cause are common across both incidents.
## Impact
CIH reporting in the DWH has been displaying incorrect figures for 11 hours. This includes data such as revenue totals, sales counts, funnel and conversion rates metrics, and individual sales records displaying wrong prices.
## Timeline
All times are UTC.
| Time | Event |
| --- | --- |
| Sometime before 2024-11-18 12:49:06 | A release was made on the Superhog backend, which added the migration `202411121235595_CreateCustomBundle.cs` |
| 2024-11-18 12:49:06 | Faulty records with `0` value for CIH price and cover got added to `live.dbo.PaymentValidationSetToCurrency`. We suspect they were added by the stored procedure `CreateCustomBundle`. |
| 2024-11-18 13:00:10 | One of the hourly Airbyte jobs that syncs between SQL Server and the DWH caught the faulty records and copied them over into the DWH. |
| At some unknown time between 2024-11-18 13:00:10 and 2024-11-19 06:15:00 | The seeding script for CIH prices and covers runs in SQL Server, overriding the faulty records with `0` values. |
| 2024-11-19 06:15:00 | A `dbt run` was triggered, propagating the faulty records in downstream models and breaking the granularity of some models with duplicate record. From this point on, data in the DWH and the reading PBI reports was wrong. |
| 2024-11-19 06:34:16 | A data test was triggered due to duplicate records in `reporting.core__vr_checkin_cover` breaking the PK. Data team starts investigating. |
| 2024-11-19 14:00:00 | Pablo realises the issue looks like a duplicate of [20240619-01 - CheckIn Cover multi-price problem](20240619-01%20-%20CheckIn%20Cover%20multi-price%20problem%20fabd174c34324292963ea52bb921203f.md). This drives him to quickly spot and confirm the data drift and the faulty records. |
| 2024-11-19 15:30:00 | Pablo discusses with Lawrence and the root cause of the issue is identified. |
| 2024-11-19 17:30:00 | An Airbyte + dbt backfill to fix the data drift and remove the faulty records finishes. From this point on, data in the DWH and PBI is correct again. |
| | Incident mitigated. |
## Root Cause(s)
The root cause is a combination of the following:
- The true, core root cause is that business logic for CIH across the company assumes that CIH has a single, global price across all Superhog for each currency. Despite this, the database actually allows for different prices per platform user. This design is not fit for our business logic and allows for incidents like this to happen. Should this be redesigned to properly reflect our business logic, neither this incident nor [20240619-01 - CheckIn Cover multi-price problem](20240619-01%20-%20CheckIn%20Cover%20multi-price%20problem%20fabd174c34324292963ea52bb921203f.md) would have happened.
- In the case of this incident, the trigger of the issue was that the uniqueness of price values per currency in `live.dbo.PaymentValidationSetToCurrency` was not respected by the stored procedure `202411121235595_CreateCustomBundle.cs`, which set the values for CIH prices and covers of some accounts to `0`.
- This cascaded into breaking the uniqueness of the primary key of table `dwh.intermediate.int_core__check_in_cover_prices` in the DWH, which led to duplicate records in downstream tables related to CIH, and to wrong data being displayed in PBI reports.
- Besides that, a seeding script that updates CIH price and cover values ran on top of `live.dbo.PaymentValidationSetToCurrency`, overriding prices without respecting the `UpdatedDate` column. This caused data drift across the DWH and SQL Server.
## Resolution and recovery
The short mitigation consisted on:
- The wrong, `0` valued records in `live.dbo.PaymentValidationSetToCurrency` where accidentally reverted back to their proper prices.
- Performing a backfill of the table `PaymentValidationSetToCurrency` on Airbyte so that the `sync` layer table would stop having duplicated prices.
- Execute a `dbt run` on the DWH to propagate the fixed data.
## **Lessons Learned**
- What went well
- Automated data alerts in DWH helped us notice the incident fast.
- The post-mortem from the previous incident accelerated a lot investigation and resolution. It made it easy to understand what was happening and fix it, even though the incident is rather tricky as it has many moving parts.
- What went badly
- Our inadequate design for the CIH logic in the backend keeps biting us back.
- The complexity and shared boundaries across squads are causing us to step on each others toes (a change made by the new dash squad changes behaviours on the domain of the guest squad in an uncontrolled way).
- We didnt take action from the stuff we learned from the previous incident of this type back in June, and so the issues keep on appearing.
- Where did we get lucky
- The CIH prices seeding script fixed the wrong values inserted by the new migration added by the Dash Squad. We removed the wrong values due to sheer luck.
## Action Items
- [ ] Fix the stored procedure `CreateCustomBundle` defined in the migration `202411121235595_CreateCustomBundle.cs` so that it stops creating `PaymentValidationSetToCurrency` records with prices different that the canonical ones.
- The exact lines that cause the issue [can be found here](https://guardhog.visualstudio.com/Superhog/_git/superhog-mono-app?path=/Guardhog.Data/StoredProcedures/CreateCustomBundle/202411121235595_CreateCustomBundle.cs&version=GBdevelop&line=170&lineEnd=171&lineStartColumn=4&lineEndColumn=29&lineStyle=plain&_a=contents)
- [ ] Modify the CIH prices seeding script so that it respects the `UpdatedDate` column, preventing future data drifts.
- [x] Add more specific data tests in the DWH to spot this issue faster (we can add a test that is still not there and that would give away that this issue is happening instantly)
## Appendix
Link to the previous occurrence of this issue: [20240619-01 - CheckIn Cover multi-price problem](20240619-01%20-%20CheckIn%20Cover%20multi-price%20problem%20fabd174c34324292963ea52bb921203f.md)