A few lessons.

counterweight 2023-10-17 19:17:31 +02:00
parent aad174c2db
commit ec0e91f454
Signed by: counterweight
GPG key ID: 883EDBAA726BD96C
2 changed files with 60 additions and 0 deletions

notes/1.md Normal file

@@ -0,0 +1,60 @@
## Data Maturity Model
Maslow's hierarchy of needs, data version:
![Pyramid of data needs](../images/pyramid_of_data_needs.png)
### Data collection
You pull data from the source systems.
You land it as-is in some staging area.
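A minimal sketch of that landing step, assuming file-based extracts (all paths and names here are made up):

```python
# Land a source extract untouched in a date-partitioned staging folder.
import shutil
from datetime import date
from pathlib import Path

def land_raw_extract(source: Path, staging_root: Path) -> Path:
    """Copy the extract as-is; no cleaning or parsing happens here."""
    target_dir = staging_root / f"load_date={date.today().isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # byte-for-byte copy
    return target

land_raw_extract(Path("exports/orders.csv"), Path("staging/orders"))
```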
### Data wrangling
Cleaning stuff up: deduplicating, fixing types, yada yada.
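Something like this with pandas, to make it concrete (the column names are invented):

```python
import pandas as pd

raw = pd.read_csv("staging/orders/load_date=2023-10-17/orders.csv")

clean = (
    raw.drop_duplicates(subset=["order_id"])        # dedupe on the business key
       .dropna(subset=["order_id", "customer_id"])  # drop rows missing required keys
       .rename(columns=str.lower)                   # normalize column names
)
clean.to_parquet("clean/orders.parquet", index=False)
```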
### Data integration
Bring everything together in one place, in a usable, consistent model.
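A hypothetical integration step, joining the cleaned extracts into one model:

```python
import pandas as pd

orders = pd.read_parquet("clean/orders.parquet")
customers = pd.read_parquet("clean/customers.parquet")

# One analysis-ready table; validate= asserts the expected key relationship.
model = orders.merge(customers, on="customer_id", how="left", validate="many_to_one")
model.to_parquet("integrated/orders_model.parquet", index=False)
```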
## ETL
ETL made sense in the past: storage was expensive, so you only wanted to load the strictly necessary data into the target DWH.
But this comes with problems:
- You have a lot of statefulness.
- Debugging and testing pipelines is a pain in the ass.
- You need to do transformations outside of your target database.
- Schema changes are a nightmare.
ELT is the new shiny toy:
- We read raw data from the source systems and load it into our DWH/data lake.
- We do our transformations in the target system.
- Schema changes become much more manageable.
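A sketch of the ELT flow, with DuckDB standing in for the target DWH (table, column, and file names are invented):

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")

# Load: raw CSV straight into the warehouse, untouched.
con.execute(
    "CREATE OR REPLACE TABLE raw_orders AS "
    "SELECT * FROM read_csv_auto('staging/orders.csv')"
)

# Transform: done inside the target with plain SQL; easy to version and re-run.
con.execute("""
    CREATE OR REPLACE TABLE orders AS
    SELECT DISTINCT
        order_id,
        customer_id,
        CAST(order_ts AS TIMESTAMP) AS order_ts,
        amount
    FROM raw_orders
    WHERE order_id IS NOT NULL
""")
```

Re-running the transform is just re-running SQL against data that's already loaded, which is a big part of why debugging and schema changes get easier.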
## Data warehouses and data lakes
### Data warehouse
DWH -> Any database that:
- We ground our BI and reporting on.
- Is optimized for reads.
- Is typically structured into facts and dimensions.
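To make the facts-and-dimensions point concrete, a toy star schema (SQLite just to keep it self-contained; all names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        amount REAL
    );
    INSERT INTO dim_customer VALUES (1, 'ES'), (2, 'DE');
    INSERT INTO fact_sales VALUES (10, 1, 9.5), (11, 1, 20.0), (12, 2, 5.0);
""")

# Classic BI read: aggregate the fact table, slice by a dimension attribute.
for country, revenue in con.execute("""
    SELECT d.country, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_customer d USING (customer_key)
    GROUP BY d.country
"""):
    print(country, revenue)
```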
### Data Lake
Decouple storage from compute. Use S3/Blob Storage/Hadoop HDFS for storage. Everything gets stored as files. Use a separate query engine, like Athena, Trino, Spark.
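A sketch of that decoupling, with DuckDB as a stand-in for the query engine and a hypothetical path:

```python
# Storage is just Parquet files; the query engine is a separate component.
import duckdb

top = duckdb.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM 'lake/orders/*.parquet'   -- query the files in place, no load step
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""").df()
```

The files never move; you can swap the engine and the storage layer doesn't care.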
### Data Lakehouse
Just use both: a data lake with some DWH layer on top. Pretty much a swamp of files with a governance/modelling layer sitting on top of it to control access and ease querying.
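One concrete example of such a layer, assuming the `deltalake` Python package (names and paths are made up):

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"order_id": [1, 2], "amount": [9.5, 20.0]})

# Still plain files underneath, but writes go through a transaction log,
# which is what adds schema enforcement and ACID-style guarantees.
write_deltalake("lake/orders_delta", df, mode="append")

orders = DeltaTable("lake/orders_delta").to_pandas()
```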
## The modern data stack