Merged PR 2918: Integrates screening API verifications into DWH staging

# Description

This PR integrates screening API verifications into staging.

There are two commits:
* The first one copy-pastes the strategy followed by edeposit and adapts it to the screening API case, which is simpler. Note that the schema entries contain a low number of tests: we only have 7 records in production - and these seem fake anyway :) - so it's hard to extrapolate much. The values I could extrapolate (NoFlags/Flagged) are taken from the data transformation within the PBI report.
* The second one is just a DRY pass: it moves the Cosmos DB deduplication logic into a macro, which I applied to both the screening API and the edeposit verifications. Works well from my POV.
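
For reference, the accepted-values tests mentioned above could look something like this in the model's schema file. This is a sketch only; the model and column names below are illustrative, not the actual ones in this PR:

```yaml
models:
  - name: stg_screening_api__verifications  # hypothetical model name
    columns:
      - name: verification_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          # The two values we could extrapolate from the PBI report
          - accepted_values:
              values: ['NoFlags', 'Flagged']
```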

Since you guys are far more knowledgeable about the APIs' scope, I'll let you take a close look in case I missed something.

# Checklist

- [X] The edited models and their dependents run properly with production data.
- [ ] The edited models are sufficiently documented. **I guess it can be improved, waiting for your comments here**
- [X] The edited models contain PK tests, which I've run and they pass.
- [X] I have checked for DRY opportunities with other models and docs.
- [ ] I've picked the right materialization for the affected models. **Used the default coming from stg_edeposit__verifications**

# Other

- [ ] Check if a full-refresh is required after this PR is merged.

Related work items: #20127
This commit is contained in:
commit 070d067709 by Oriol Roqué Paniagua, 2024-09-25 10:52:50 +00:00, committed by Pablo Martín
5 changed files with 213 additions and 23 deletions


@@ -0,0 +1,31 @@
{% macro cosmos_db_record_deduplication(source_table, primary_key) %}
/*
This macro provides deduplication logic for Cosmos DB entities.
If two or more records share a value in the field that acts as the
primary key, only the most recent record is returned. Records that
are not duplicated are returned as-is.
Inputs:
- source_table: table that acts as source. Should be a sync model.
- primary_key: unique identifier on which the deduplication will be applied.
Output:
- Returns the set of records from the source_table that are unique according
to the primary_key.
*/
select *
from
-- Some thoughts for the future here:
-- The query below is awful performance wise - but data
-- size is tiny today. Let's tackle the problem as it comes.
(
select
*,
row_number() over (
partition by
{{ adapter.quote("documents") }} ->> '{{ primary_key }}'
order by ({{ adapter.quote("documents") }} ->> '_ts')::integer desc
) as rank
from {{ source_table }}
)
where rank = 1
{% endmacro %}
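
A minimal sketch of how this macro might be invoked from a staging model. The source and key names here are hypothetical and only meant to show the call shape; the real models in this PR may differ:

```sql
-- stg_screening_api__verifications.sql (illustrative)
-- The macro emits a full "select ... where rank = 1" query,
-- so it can be wrapped directly in a CTE.
with deduplicated as (

    {{ cosmos_db_record_deduplication(
        source_table=source('screening_api', 'verifications_sync'),
        primary_key='verificationId'
    ) }}

)

select * from deduplicated
```

Since the macro partitions on `documents ->> '<primary_key>'` and orders by the Cosmos DB `_ts` field, the key passed in must be a top-level property of the `documents` JSON column.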