## Data Maturity Model

Maslow's hierarchy of needs, data version:

![img.png](../images/pyramid_:of_data_needs.png)

### Data collection

You get data from places. You place it as is in some staging area.

### Data wrangling

Cleaning stuff up, deduplicating, yada yada.

### Data integration

Place everything together nicely in the same place, in a usable model.

## ETL

ETL made sense back when storage was expensive and you only wanted to load the strictly necessary data into the target DWH. But it comes with problems:

- You have a lot of statefulness.
- Debugging and testing pipelines is a pain in the ass.
- You need to do transformations outside of your target database.
- Schema changes are nightmares.

ELT is the new shiny toy:

- We read raw data from the source system and load it into our DWH/Data Lake.
- We do our transformations in the target system.
- Schema changes become much more manageable.

## Data warehouses and data lakes

### Data warehouse

DWH -> Any database that:

- We use as the foundation for our BI and reporting.
- Is optimized for reads.
- Is typically structured in facts and dimensions.

### Data Lake

Decouple storage from compute. Use S3/Blob Storage/Hadoop HDFS for storage. Everything gets stored as files. Use a separate query engine, like Athena, Trino or Spark.

### Data Lakehouse

Just use both: a data lake with some DWH layer on top. Pretty much a swamp of files with some governance and modelling tooling sitting on top of it to control access and ease queries.

## The modern data stack