sh-notion/notion_data_team_no_files/20240619-01 - CheckIn Cover multi-price problem fabd174c34324292963ea52bb921203f.md

104 lines
10 KiB
Markdown
Raw Normal View History

2025-07-11 16:15:17 +02:00
# 20240619-01 - CheckIn Cover multi-price problem
This page is to track a production bug spotted on 2024-06-19, 12:30 ES time.
The problem was solved on 2024-06-19, 17:30 ES time.
## Executive Summary
- I (Pablo) believe some pieces of data were manually modified in an unproper way in the Core Superhog database due to some tech debt and user mistakes we are living with.
- This propagated into data quality issues in the DWH, which eventually led to wrong reporting in the Checkin Hero reporting and the Business Overview reporting suites, including inflated revenue numbers.
- This specific problem will be fixed by the Data team with some engineering work, but we need to run a postmortem on how it happened and change our way of doing things to avoid **massive business problems in the future**. **Im calling all of us to serious action to avoid more of this in the future.**
## Initial problem
- Pablo spotted a duplicate record in the DWH table `reporting_core__vr_check_in_cover` while running some data tests on the DWH. This table summarizes some details around guest journeys with checkin cover. The `VerificationRequestId` for these records is `749989`.
- The duplicate records showed all fields with same values except for `checkin_cover_limit_amount_local_curr` and `checkin_cover_limit_amount_in_gbp`.
## Root cause research
- Pablos initial suspicion was a duplicate record in the `Payments` table causing the issue, but this was not the case.
- Pulling the thread, Pablo found out that the DWH table `int_core__check_in_cover_prices` had two records for `EUR` and `CAD` .
- This issue was causing the original problem of duplicate records in `reporting.core__vr_check_in_cover` .
- This is because `int_core__check_in_cover_prices` is expected to have only one record per currency.
- This is so because `int_core__check_in_cover_prices` builds a record per price by grouping the `PaymentValidationSetToCurrency` by `CurrencyIso`, `CheckInCoverCost` and `CheckInCoverLimit` .
- So, the next step was to find out why `int_core__check_in_cover_prices` was showing the duplicate records for `EUR` and `CAD`.
- The source table for this data in the DWH is `sync_core.PaymentValidationSetToCurrency`.
- Pablo ran the following query:
```sql
SELECT *
FROM sync_core."PaymentValidationSetToCurrency" pvstc
WHERE ("CheckInCoverCost" != 11
AND "CurrencyIso" = 'EUR')
OR
("CheckInCoverCost" != 13
AND "CurrencyIso" = 'CAD')
```
Which yielded the following output:
```
"Id","Fee","Amount","Waiver","Protection","Reschedule","CreatedDate","CurrencyIso","UpdatedDate","IsFeeRefundable","CheckInCoverCost","CheckInCoverLimit","PaymentValidationSetId","DisabledValidationOptions","_airbyte_raw_id","_airbyte_extracted_at","_airbyte_meta"
29583,0.000000000,690.000000000,21.000000000,48.000000000,,2024-05-16 15:04:23.080,CAD,2024-05-16 15:04:23.080,false,690.000000000,130.000000000,3710,7,f13f779f-e054-465e-b301-aa38e88808e0,2024-05-16 18:00:12.917 +0200,"{""errors"": []}"
31053,46.000000000,930.000000000,110.000000000,760.000000000,,2024-06-13 17:47:04.143,EUR,2024-06-13 17:47:04.143,false,14.000000000,130.000000000,3894,18,"43112786-986a-4cfa-aef6-135c1a1b5067",2024-06-13 21:00:13.103 +0200,"{""errors"": []}"
31085,19.000000000,940.000000000,66.000000000,940.000000000,,2024-06-13 18:52:47.003,EUR,2024-06-13 18:52:47.003,false,14.000000000,130.000000000,3898,19,fbd15fa4-8691-41ea-a8d2-1edb82e4355f,2024-06-13 22:00:12.451 +0200,"{""errors"": []}"
```
- The results show that there are three records of `PaymentValidationSetToCurrency` that dont have the *regular* values for `EUR` and `CAD`.
- This is a major issue, because there was a established contract that, even though CheckIn Cover cost and limit figures appear in different records per `PaymentValidationSet` , the price is supposed to be a global one for all of Superhog. This data breaks the contract.
- The next question that was posed was: is this data looking the same in Superhogs backend?
- I ran the following query in the Core database, `Live`
```sql
SELECT Id, CurrencyIso, Amount, Fee, PaymentValidationSetId, CreatedDate, UpdatedDate, IsFeeRefundable, DisabledValidationOptions, Waiver, Protection, Reschedule, CheckInCoverCost, CheckInCoverLimit
FROM live.dbo.PaymentValidationSetToCurrency
WHERE Id = 31053 OR Id = 31085 OR Id = 29583
```
Which yielded the following output:
```
"Id","CurrencyIso","Amount","Fee","PaymentValidationSetId","CreatedDate","UpdatedDate","IsFeeRefundable","DisabledValidationOptions","Waiver","Protection","Reschedule","CheckInCoverCost","CheckInCoverLimit"
29583,CAD,690.00000,0.00000,3710,2024-05-16 15:04:23.080,2024-05-16 15:04:23.080,0,7,21.00000,48.00000,,13.00000,130.00000
31053,EUR,930.00000,46.00000,3894,2024-06-13 17:47:04.143,2024-06-13 17:47:04.143,0,18,110.00000,760.00000,,11.00000,85.00000
31085,EUR,940.00000,19.00000,3898,2024-06-13 18:52:47.003,2024-06-13 18:52:47.003,0,22,66.00000,940.00000,,11.00000,85.00000
```
- Major problem. Data is not looking the same.
- The `CheckInCoverCost` in `dwh.sync_core.PaymentValidationSetToCurrency` are `690, 14, 14`.
- The `CheckInCoverCost` in `live.dbo.PaymentValidationSetToCurrency` are `13, 11, 11`.
- This is pointing to an issue in the Core <> DWH integration that happens through Airbyte.
Summarizing the issues, from root to effects:
- Some faulty `live.dbo.PaymentValidationSetToCurrency` values somehow came from Core to DWH, and were afterward changed in Core. This must have been done without respecting the `UpdatedDate` field of the table.
- The faulty values broke the intended granularity of `dwh.intermediate.int_core__check_in_cover_prices`, which propagated into `dwh.reporting_.ore__vr_check_in_cover`
- The issue in `dwh.reporting_.ore__vr_check_in_cover` caused (and its still causing) revenue and funnel numbers to show wrong stats. Basically inflating them artificially.
## Remediation
- Short-term: I will have to run a backfill between `live.dbo.PaymentValidationSetToCurrency` and `dwh.sync_core.PaymentValidationSetToCurrency` to ensure that the data across both is the same again.
- Beyond that: we need to understand how this situation came to life and ensure it is not repeated. My (Pablo on the keyboard) **hypothesis** on what happened is the following:
- Someone modified the CheckIn Cover prices in Wilbur for some accounts, in the fields that should NOT be editable yet are (Joan and Lawrence can provide more details on this issue). Could have been an AM experimenting or trying to catter to some host needs perhaps?
- Someone realized this happened and somehow put the necessary dev resources to fix it in the database straight. I mean to say, they literally just brought back the database field values to what they should have been. This is in contrast with simply changing the setting in Wilbur again, which wouldnt have realize solved the problem, for every time this changes are made in Wilbur, a new record gets created in `live.dbo.PaymentValidationSetToCurrency`, meaning the faulty values would still remain there. Given this behaviour, Im pretty confident whoever worked on this understands the bad implications of having multiple prices per currency in that table, and decided to do this database changed consciously to avoid it.
- This was done without updating the `UpdatedDate` fields as the SQL `UPDATE` statement happened.
- Because of this, Airbyte didnt pick up the changes and never brought the new data for those records into the DWH. This is because Airbyte syncs the data of table `live.dbo.PaymentValidationSetToCurrency` incrementally, by only brining over data that was modified since the last Airbyte run. Airbyte infers whether data was modified or not by looking at the `UpdatedDate` field. If the field is not respected when doing updates, Core and DWH end up out of sync.
- I would like to emphasize the importance of preventing this type of issue. The errors caused by this instance were small, but this could turn into massive reporting mistakes. Furthermore, these are by nature very difficult to spot and troubleshoot, meaning that they could live on a long time, leading to TMT and other Managers relying on wrong reporting for their business decision making, investor reporting, etc.
## Final reflection on the mistakes that got us here
- First, we recycled the Cancellation Cover data model for the CheckIn Cover in a rushed way, resulting in Cores data model being completely out of sync with the reality of the service (the data model allows different hosts and currencies to have different CheckIn Cover prices, when the business logic around this service is that theres a single, Superhog-wide price for each currency).
- Second, we allowed the UI of Wilbur to have fields that let users modify these values on a host level, which is again completely out of sync with our business logic because different hosts shouldnt have different prices for this service, and no user should ever change that value.
- Third, some user managed to use the UI-feature-that-shouldnt-exist the wrong way to change the values, even though this should really not be done.
- Fourth, someone modified the database to fix the third mistake, but introduced *another* mistake by failing to respect the `UpdatedDate` field in the process.
This is a long story of tech debt and bad choices bringing us to a costly mistake. We were lucky it didnt cause a big problem, but it could have. I hope we can all learn from this to avoid these issues.
**2024-06-21 update**: after a discussion together with Lawrence, we found out what we think is the cause of the values being “corrected” in the Core database without respecting the `UpdatedDate` field.
This data is being overwritten on every migration as part of the seeding process that the ream runs on deployments, replacing any existing values around the CheckIn Cover with the ones provided by the seed hardcoded values. The faulty values introduced by an user were most probably overwritten again once the team applied a new migration in the database.
Besides that, it was an important finding since we also realized that this seeding process does not update the `UpdatedAt` fields.