60 lines
1.5 KiB
Markdown
60 lines
1.5 KiB
Markdown
|
|
|
|
## Data Maturity Model
|
|
|
|
Maslow hierarchy of needs, data version:
|
|
|
|

|
|
|
|
### Data collection
|
|
|
|
You get data from places.
|
|
|
|
You place it as is in some staging area.
|
|
|
|
|
|
### Data wrangling
|
|
|
|
Cleaning stuff up, deduplicating, yadi yada.
|
|
|
|
|
|
### Data integration
|
|
|
|
Place everything together nicely in the same place in a usable model.
|
|
|
|
## ETL
|
|
|
|
ETL made sense before. Storage was expensive and you only wanted to load the strictly necessary data into the target DWH.
|
|
|
|
But this comes with problems:
|
|
- You have a lot of statefulness.
|
|
- Debugging and testing pipelines is a pain in the ass.
|
|
- You need to do transformations outside of your target database.
|
|
- Schema changes were nightmares.
|
|
|
|
ELT is the new shiny toy:
|
|
- We read raw data from source system and load it into our DWH/Data Lake.
|
|
- We do our transformations in the target system.
|
|
- Schema changes become much more manageable.
|
|
|
|
|
|
## Datawarehouses and data lakes
|
|
|
|
### Datawarehouse
|
|
|
|
DWH -> Any database that:
|
|
- We use to ground our BI and reporting on.
|
|
- Optimized for reads.
|
|
- Typically structured in facts and dimensions.
|
|
|
|
### Data Lake
|
|
|
|
Decouple storage from compute. Use S3/Blob Storage/Hadoop HDFS for storage. Everything gets stored as files. Use a separate query engine, like Athena, Trino, Spark.
|
|
|
|
### Data Lakehouse
|
|
|
|
Just use both. A data lake with some DWH layer on top. Pretty much, a swamp of files with some governance, modelling tool sitting on top of it to control access and ease queries.
|
|
|
|
|
|
## The modern data stack
|
|
|