diff --git a/images/example_of_modern_datastack.png b/images/example_of_modern_datastack.png
new file mode 100644
index 0000000..0e5f64d
Binary files /dev/null and b/images/example_of_modern_datastack.png differ
diff --git a/notes/1.md b/notes/1.md
index 23704d4..596ffef 100644
--- a/notes/1.md
+++ b/notes/1.md
@@ -58,3 +58,43 @@ Just use both. A data lake with some DWH layer on top. Pretty much, a swamp of f
 
 ## The modern data stack
 
+Cheaper storage -> We don't mind duplicating data more.
+Faster networking -> We can spread work across more machines and decouple things like storage and processing. We can distribute workloads with distributed storage and compute.
+
+![img.png](../images/example_of_modern_datastack.png)
+
+dbt makes sense nowadays because the modern data stack performs transformations within the data warehouse.
+
+## Slowly Changing Dimensions
+
+- The issue comes when a dimension changes in a way that would break referential integrity.
+- Sometimes, old data can be thrown away. Sometimes, not.
+- There are 4 SCD types.
+  - SCD 0 - Retain original
+    - Do not update data in the DWH. Source data and DWH get out of sync.
+    - You do this when you don't really care about the dimension.
+    - Example: Fax numbers when fax is not used anymore.
+  - SCD 1 - Overwrite
+    - Overwrite old values in the DWH with the new ones. Old values go away.
+    - We only care about the new state. We don't need the history.
+  - SCD 2 - Add new row
+    - Add a new row with `start_date` and `end_date` fields to indicate which values should be looked at depending on time.
+    - Used when a full historical view is important.
+    - Increases the amount of data stored.
+  - SCD 3 - Add new attribute
+    - Keep the current attribute value and the previous value.
+    - It keeps at most one previous value.
+    - Intermediate approach between SCD 2 and SCD 1.
+
+
+## dbt overview
+
+- dbt takes care of the T in ETL/ELT.
+- dbt works within the data warehouse and with SQL.
+- Why not use raw SQL and that's it? Because dbt brings good software practices like modularity, version control, reusability, testing, documentation and such to SQL swamps.
+
+## Case
+
+- ELT in Airbnb.
+- Data from insideairbnb.com/berlin/
+- The project will use Snowflake as a DWH and Preset (managed Superset) as a BI tool
\ No newline at end of file
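
Note on the SCD 2 bullet in the notes above: the `start_date` / `end_date` mechanics can be sketched in a few lines of plain Python. This is only an illustrative sketch, not how dbt or any warehouse implements it. The `DimRow` schema, the helper names `apply_scd2` and `value_as_of`, the half-open validity interval (`start_date` inclusive, `end_date` exclusive), and the sample `host_1` data are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DimRow:
    """One version of a dimension record (hypothetical schema)."""
    key: str                    # natural key from the source system
    value: str                  # the tracked attribute
    start_date: date            # first day this version is valid (inclusive)
    end_date: Optional[date]    # first day it is no longer valid; None = current


def apply_scd2(rows: List[DimRow], key: str, new_value: str, today: date) -> None:
    """Close the current version of `key` if it changed, then append a new one."""
    current = next((r for r in rows if r.key == key and r.end_date is None), None)
    if current is not None:
        if current.value == new_value:
            return  # nothing changed, keep the current version open
        current.end_date = today  # close the old version instead of overwriting it
    rows.append(DimRow(key, new_value, start_date=today, end_date=None))


def value_as_of(rows: List[DimRow], key: str, day: date) -> Optional[str]:
    """Return the value that was valid on `day`, or None if no version covers it."""
    for r in rows:
        still_open = r.end_date is None or day < r.end_date
        if r.key == key and r.start_date <= day and still_open:
            return r.value
    return None


# Made-up sample data: a host dimension whose city changes once.
dim: List[DimRow] = []
apply_scd2(dim, "host_1", "Berlin", date(2020, 1, 1))
apply_scd2(dim, "host_1", "Munich", date(2021, 6, 1))
```

After the two calls, `dim` holds two rows for `host_1`: the closed Berlin version and the open Munich version. That is the "increases the amount of data stored" trade-off from the notes, and `value_as_of` shows how a query picks the right version for a given date. SCD 1 would instead overwrite `value` on the single existing row.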