notes/sections1-7.md

## Data Maturity Model

Maslow's hierarchy of needs, data version:



### Data collection

You get data from places.

You place it as is in some staging area.

### Data wrangling

Cleaning stuff up, deduplicating, yada yada.

### Data integration

Place everything together nicely in a single, usable model.

## ETL

ETL made sense before: storage was expensive, so you only wanted to load the strictly necessary data into the target DWH.

But this comes with problems:

- You have a lot of statefulness.
- Debugging and testing pipelines is a pain in the ass.
- You need to do transformations outside of your target database.
- Schema changes were nightmares.

ELT is the new shiny toy:

- We read raw data from the source system and load it into our DWH/Data Lake.
- We do our transformations in the target system (see the SQL sketch after this list).
- Schema changes become much more manageable.

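A minimal sketch of the ELT idea, assuming a `raw` schema that the load step filled with untouched source data and an `analytics` schema for the transformed output (all names here are hypothetical):

```sql
-- EL already dumped the source data as-is into raw.listings.
-- The T runs inside the warehouse itself: build a cleaned table from the raw one.
CREATE OR REPLACE TABLE analytics.stg_listings AS
SELECT
    id                            AS listing_id,
    TRIM(LOWER(neighbourhood))    AS neighbourhood,
    CAST(price AS DECIMAL(10, 2)) AS price
FROM raw.listings
WHERE id IS NOT NULL;
```

If the source adds a column, the raw table just gets wider and only the SELECT has to change, which is why schema changes hurt less in ELT.
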
## Datawarehouses and data lakes

### Datawarehouse

DWH -> Any database that:

- We ground our BI and reporting on.
- Is optimized for reads.
- Is typically structured in facts and dimensions (see the sketch after this list).

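A minimal sketch of what "facts and dimensions" means in practice; the table and column names are made up:

```sql
-- Dimension: descriptive attributes, one row per entity.
CREATE TABLE dim_listing (
    listing_key  INTEGER PRIMARY KEY,
    listing_name VARCHAR,
    room_type    VARCHAR
);

-- Fact: measurable events, with foreign keys pointing at the dimensions.
CREATE TABLE fct_booking (
    booking_id   INTEGER PRIMARY KEY,
    listing_key  INTEGER REFERENCES dim_listing (listing_key),
    booking_date DATE,
    amount       DECIMAL(10, 2)
);
```
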
### Data Lake

Decouple storage from compute. Use S3/Blob Storage/Hadoop HDFS for storage; everything gets stored as files. Use a separate query engine, like Athena, Trino or Spark, to read them.

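As an illustration, this is roughly how a query engine like Athena gets pointed at files sitting in object storage; the bucket path and columns are made up:

```sql
-- The table is just metadata; the data stays as Parquet files in S3.
CREATE EXTERNAL TABLE listings (
    id    BIGINT,
    name  STRING,
    price DOUBLE
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/raw/listings/';

-- The engine reads the files at query time; storage and compute stay decoupled.
SELECT name, price FROM listings WHERE price > 100;
```
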
### Data Lakehouse

Just use both: a data lake with some DWH layer on top. Pretty much a swamp of files with a governance/modelling layer sitting on top of it to control access and ease queries.

## The modern data stack

Cheaper storage -> We don't mind duplicating data more.

Faster networking -> We can spread work across more machines and decouple things like storage and processing. We can distribute workloads with distributed storage and compute.



dbt makes sense nowadays because the modern data stack does its transformations within the datawarehouse.

## Slowly Changing Dimensions

- The issue comes when a dimension changes in a way that would break referential integrity.
- Sometimes old data can be thrown away; sometimes not.
- There are 4 SCD types:
- SCD 0 - Retain original
  - Do not update data in the DWH. Source data and the DWH get out of sync.
  - You do this when you truly don't care about the dimension.
  - Example: fax numbers, now that fax is not used anymore.
- SCD 1 - Overwrite
  - Overwrite old values in the DWH with the new ones. Old values go away.
  - We only care about the new state; we don't need the history.
- SCD 2 - Add new row
  - Add a new row with `start_date` and `end_date` fields to indicate which values apply at which point in time (see the SQL sketch after this list).
  - Used when a full historical view is important.
  - Increases the amount of data stored.
- SCD 3 - Add new attribute
  - Keep the current attribute value and the previous value.
  - It keeps only the previous value, at most.
  - Intermediate approach between SCD 2 and SCD 1.

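A minimal SCD 2 sketch (the `dim_host` table and its columns are hypothetical): when an attribute changes, the currently valid row's window is closed and a new row is added:

```sql
-- One row per host per validity window; end_date IS NULL marks the current row.
CREATE TABLE dim_host (
    host_id      INTEGER,
    host_name    VARCHAR,
    is_superhost BOOLEAN,
    start_date   TIMESTAMP,
    end_date     TIMESTAMP
);

-- is_superhost changed for host 42: close the currently valid row...
UPDATE dim_host
SET end_date = CURRENT_TIMESTAMP
WHERE host_id = 42 AND end_date IS NULL;

-- ...and add a new row carrying the new value from now on.
INSERT INTO dim_host (host_id, host_name, is_superhost, start_date, end_date)
VALUES (42, 'Alice', TRUE, CURRENT_TIMESTAMP, NULL);
```
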
## dbt overview

- dbt takes care of the T in ETL/ELT.
- dbt works within the datawarehouse and with SQL.
- Why not use raw SQL and that's it? Because dbt brings good software practices like modularity, version control, reusability, testing and documentation to SQL swamps (see the model sketch after this list).

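A minimal sketch of what a dbt model looks like (the model and column names are made up): each model is just a SELECT statement in its own file, and `ref()` wires models together so dbt can build the dependency graph:

```sql
-- models/dim_listings_cleansed.sql (hypothetical model)
-- dbt wraps this SELECT in the CREATE TABLE/VIEW statement for the target warehouse.
SELECT
    id         AS listing_id,
    TRIM(name) AS listing_name,
    price
FROM {{ ref('raw_listings') }}  -- ref() points at another model and records the dependency
WHERE id IS NOT NULL
```
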
## Case

- ELT on Airbnb data.
- Data from insideairbnb.com/berlin/
- The project will use Snowflake as the DWH and Preset (managed Superset) as the BI tool.

## dbt project structure

- `dbt_project.yml`: the header of the project, with stuff like versioning, the default profile for the project, the paths to the different folders, etc. (see the sketch after this list).

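A rough sketch of what `dbt_project.yml` tends to contain; the project name is hypothetical and the exact keys vary between dbt versions:

```yaml
# dbt_project.yml -- the "header" of the project (illustrative values only)
name: 'dbtlearn'            # hypothetical project name
version: '1.0.0'
config-version: 2

profile: 'dbtlearn'         # which connection profile from profiles.yml to use

model-paths: ["models"]     # where dbt looks for models
seed-paths: ["seeds"]
macro-paths: ["macros"]
test-paths: ["tests"]

models:
  dbtlearn:
    +materialized: view     # default materialization for this project's models
```
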
This is a pic of the data flow we are going to build: 