


## Data Maturity Model

Maslow's hierarchy of needs, data version:

![img.png](../images/pyramid_:of_data_needs.png)

### Data collection

You get data from the source systems.

You land it as is, untransformed, in some staging area.


### Data wrangling

Cleaning stuff up, deduplicating, yada yada.
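
A minimal sketch of what that can look like, in plain Python. The records, field names and the "keep the first row per key" rule are all made up for illustration; real wrangling depends on the data at hand.

```python
# Hypothetical raw records, as they might land in a staging area.
raw_rows = [
    {"id": "1", "email": " Alice@Example.com ", "amount": "10.50"},
    {"id": "1", "email": "alice@example.com", "amount": "10.50"},  # duplicate of id 1
    {"id": "2", "email": "bob@example.com", "amount": ""},  # missing amount
]


def clean(row):
    """Normalize types and whitespace for a single raw record."""
    return {
        "id": int(row["id"]),
        "email": row["email"].strip().lower(),
        "amount": float(row["amount"]) if row["amount"] else None,
    }


def deduplicate(rows, key):
    """Keep the first row seen for each value of `key`."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], row)
    return list(seen.values())


cleaned_rows = deduplicate([clean(r) for r in raw_rows], key="id")
```

Cleaning runs before deduplicating here so that rows differing only in casing or whitespace collapse into one.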


### Data integration

Bring everything together in one place, in a usable model.

## ETL

ETL (Extract, Transform, Load) made sense before: storage was expensive, so you only wanted to load the strictly necessary data into the target DWH.

But this comes with problems:
- You have a lot of statefulness.
- Debugging and testing pipelines is a pain in the ass.
- You need to do transformations outside of your target database.
- Schema changes are a nightmare.

ELT (Extract, Load, Transform) is the new shiny toy:
- We read raw data from the source systems and load it into our DWH/Data Lake.
- We do our transformations in the target system.
- Schema changes become much more manageable.
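
A rough sketch of the ELT flow, using an in-memory SQLite database as a stand-in for the target DWH. That stand-in, plus the table, view and column names, are assumptions for illustration only.

```python
import sqlite3

# In-memory SQLite stands in for the target DWH (assumption for illustration).
dwh = sqlite3.connect(":memory:")

# Extract + Load: land the source rows untouched, everything as text.
dwh.execute("CREATE TABLE raw_orders (order_id TEXT, status TEXT, amount TEXT)")
source_rows = [
    ("1", "PAID", "10.50"),
    ("2", "paid", "4.00"),
    ("3", "cancelled", "7.25"),
]
dwh.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)

# Transform: done inside the target system, with plain SQL over the raw table.
dwh.execute(
    """
    CREATE VIEW clean_orders AS
    SELECT
        CAST(order_id AS INTEGER) AS order_id,
        LOWER(status)             AS status,
        CAST(amount AS REAL)      AS amount
    FROM raw_orders
    """
)

paid_total = dwh.execute(
    "SELECT SUM(amount) FROM clean_orders WHERE status = 'paid'"
).fetchone()[0]
```

Since the raw table stays around untouched, changing the transformation only means redefining the view. Nothing has to be re-extracted from the source.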


## Data warehouses and data lakes

### Data warehouse

A DWH is any database that:
- We base our BI and reporting on.
- Is optimized for reads.
- Is typically structured in facts and dimensions.
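
To make "facts and dimensions" a bit more concrete, here is a tiny sketch with SQLite standing in for the warehouse. The `fact_orders` and `dim_customer` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension: descriptive attributes, one row per customer.
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,
        country     TEXT
    );
    -- Fact: one row per order, with measures and keys pointing at dimensions.
    CREATE TABLE fact_orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customer (customer_id),
        amount      REAL
    );
    INSERT INTO dim_customer VALUES (1, 'ES'), (2, 'DE');
    INSERT INTO fact_orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 8.0);
    """
)

# Typical reporting query: aggregate a fact measure, sliced by a dimension attribute.
revenue_by_country = conn.execute(
    """
    SELECT d.country, SUM(f.amount) AS revenue
    FROM fact_orders AS f
    JOIN dim_customer AS d ON d.customer_id = f.customer_id
    GROUP BY d.country
    ORDER BY d.country
    """
).fetchall()
```

The fact table holds the numbers you aggregate, and the dimension table holds the attributes you group and filter by.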

### Data Lake

Decouple storage from compute. Use S3, Azure Blob Storage, or HDFS for storage; everything gets stored as files. Use a separate query engine, such as Athena, Trino, or Spark.
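
A toy sketch of the idea: a local temp directory plays the role of the object store, and a small function plays the role of the query engine. Both are stand-ins, and the `events/day=...` layout and file names are assumptions for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Storage: a local temp directory stands in for an object store bucket (assumption).
lake = Path(tempfile.mkdtemp())


def write_partition(day, rows):
    """Store one day's events as a plain CSV file under events/day=<day>/."""
    part_dir = lake / "events" / f"day={day}"
    part_dir.mkdir(parents=True)
    with open(part_dir / "part-0.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user", "clicks"])
        writer.writerows(rows)


write_partition("2024-01-01", [("ana", 3), ("bo", 1)])
write_partition("2024-01-02", [("ana", 2)])


# Compute: a separate "engine" that only knows how to scan files and aggregate.
def total_clicks(table):
    total = 0
    for path in sorted((lake / table).glob("day=*/*.csv")):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                total += int(row["clicks"])
    return total


clicks = total_clicks("events")
```

The writer and the reader only share the file layout, so either side could be swapped out without touching the other.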

### Data Lakehouse

Just use both: a data lake with a DWH layer on top. Pretty much a swamp of files, with governance and modelling tooling sitting on top of it to control access and ease queries.


## The modern data stack