A few lessons
parent ec0e91f454
commit 2a76bdb73f
2 changed files with 40 additions and 0 deletions
BIN
images/example_of_modern_datastack.png
Normal file
Binary file not shown.
Size: 130 KiB
40
notes/1.md
@@ -58,3 +58,43 @@ Just use both. A data lake with some DWH layer on top. Pretty much, a swamp of f

## The modern data stack

- Cheaper storage -> We can afford to duplicate data more freely.
- Faster networking -> We can spread work across more machines and decouple storage from processing, so storage and compute can each be distributed on their own.

dbt makes sense nowadays because the modern data stack runs its transformations inside the data warehouse.

## Slowly Changing Dimensions

- The issue arises when a dimension changes in a way that would break referential integrity.
- Sometimes old data can be thrown away; sometimes it cannot.
- These notes cover 4 SCD types.
  - SCD 0 - Retain original
    - Do not update data in the DWH. Source data and the DWH get out of sync.
    - You do this when you don't really care about the dimension.
    - Example: Fax numbers when fax is not used anymore.
  - SCD 1 - Overwrite
    - Overwrite the old values in the DWH with the new ones. Old values go away.
    - We only care about the new state. We don't need the history.
  - SCD 2 - Add new row
    - Add a new row with `start_date` and `end_date` fields to indicate which values apply at a given point in time.
    - Used when a full historical view is important.
    - Increases the amount of data stored.
  - SCD 3 - Add new attribute
    - Keep the current attribute value and the previous value.
    - It keeps at most one previous value.
    - Intermediate approach between SCD 1 and SCD 2.
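
To make the difference between the types concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made up for illustration: the `customer_id` and `city` columns, the helper names, and the use of in-memory dicts instead of a real DWH table. It only shows how SCD 1, SCD 2 and SCD 3 treat the same change.

```python
from copy import deepcopy
from datetime import date


def new_row(customer_id, city, start_date):
    # Hypothetical dimension row; end_date=None marks the current version.
    return {"customer_id": customer_id, "city": city,
            "start_date": start_date, "end_date": None}


def apply_scd1(rows, customer_id, new_city):
    """SCD 1: overwrite in place, the old value is lost."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["city"] = new_city
    return rows


def apply_scd2(rows, customer_id, new_city, change_date):
    """SCD 2: close the current row and append a new one."""
    for row in rows:
        if row["customer_id"] == customer_id and row["end_date"] is None:
            row["end_date"] = change_date
    rows.append(new_row(customer_id, new_city, change_date))
    return rows


def apply_scd3(rows, customer_id, new_city):
    """SCD 3: keep the current value plus one previous value."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["previous_city"] = row["city"]
            row["city"] = new_city
    return rows


def city_as_of(rows, customer_id, when):
    """Pick the SCD 2 row whose [start_date, end_date) range contains `when`."""
    for row in rows:
        if row["customer_id"] != customer_id:
            continue
        ends = row["end_date"]
        if row["start_date"] <= when and (ends is None or when < ends):
            return row["city"]
    return None


# The same change (customer 1 moves from Berlin to Madrid) under each type.
base = [new_row(1, "Berlin", date(2020, 1, 1))]
scd1 = apply_scd1(deepcopy(base), 1, "Madrid")
scd2 = apply_scd2(deepcopy(base), 1, "Madrid", date(2022, 6, 1))
scd3 = apply_scd3(deepcopy(base), 1, "Madrid")
```

Only the SCD 2 result can answer "where did this customer live in 2021?", which is why it is the one that grows the table.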

## dbt overview

- dbt takes care of the T in ETL/ELT.
- dbt works within the data warehouse and with SQL.
- Why not use raw SQL and that's it? Because dbt brings good software practices like modularity, version control, reusability, testing and documentation to SQL swamps.
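
A rough sketch of the "transform inside the warehouse with SQL" idea, using Python's built-in `sqlite3` as a stand-in for the DWH. This is not dbt's API: the table `raw_listings`, the view `clean_listings` and the columns are hypothetical names chosen for the example. The point is only that the raw data is loaded first, and the transformation is a named `SELECT` that the database itself runs, followed by a simple not-null check in the spirit of a test.

```python
import sqlite3

# In-memory SQLite plays the role of the warehouse (an assumption of this sketch).
conn = sqlite3.connect(":memory:")

# E + L: land the raw data as it comes, with no cleaning yet.
conn.execute("CREATE TABLE raw_listings (id INTEGER, name TEXT, price TEXT)")
conn.executemany(
    "INSERT INTO raw_listings VALUES (?, ?, ?)",
    [(1, "Cozy flat", "$80.00"), (2, "Loft", "$120.50"), (3, "No price", None)],
)

# T: the transformation is just a SELECT, kept as a reusable, named piece of SQL.
CLEAN_LISTINGS_SQL = """
SELECT
    id,
    name,
    CAST(REPLACE(price, '$', '') AS REAL) AS price_value
FROM raw_listings
WHERE price IS NOT NULL
"""
conn.execute(f"CREATE VIEW clean_listings AS {CLEAN_LISTINGS_SQL}")

# Downstream queries read from the transformed view, not from the raw table.
rows = conn.execute(
    "SELECT id, price_value FROM clean_listings ORDER BY id"
).fetchall()

# A minimal data test: the cleaned column must never be NULL.
null_count = conn.execute(
    "SELECT COUNT(*) FROM clean_listings WHERE price_value IS NULL"
).fetchone()[0]
```

Keeping each transformation as its own named query is what makes modularity, version control and testing possible; dbt adds the tooling around that idea.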

## Case

- ELT in Airbnb.
- Data from insideairbnb.com/berlin/
- The project will use Snowflake as the DWH and Preset (managed Superset) as the BI tool.