
Data Maturity Model

Maslow's hierarchy of needs, data version:

(image: the data maturity pyramid)

Data collection

You get data from source systems (databases, APIs, files).

You land it as-is in a staging area.
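
A minimal sketch of that "land it as-is" step, in Python. The source, paths, and field names here are made up for illustration:

```python
import json
import pathlib
from datetime import datetime, timezone

STAGING_DIR = pathlib.Path("staging/orders")  # hypothetical staging area

def land_raw(payload: dict) -> pathlib.Path:
    """Write a source payload to staging exactly as received, no cleanup yet."""
    STAGING_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = STAGING_DIR / f"orders_{stamp}.json"
    path.write_text(json.dumps(payload))
    return path

# In practice the payload comes from an API/DB extract; this is a stand-in record.
land_raw({"order_id": "1", "amount": "19.90", "status": " SHIPPED "})
```

The point is that nothing gets cleaned or reshaped here; fixing the messy fields is the next step.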

Data wrangling

Cleaning stuff up: deduplicating, fixing types, normalizing values, and so on.
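
As a concrete sketch of that kind of cleanup, continuing the made-up staged orders from above:

```python
def wrangle(rows: list[dict]) -> list[dict]:
    """Deduplicate on order_id and normalize obviously messy fields."""
    seen, clean = set(), []
    for row in rows:
        if row["order_id"] in seen:  # drop duplicate records
            continue
        seen.add(row["order_id"])
        clean.append({
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]),           # "19.90" -> 19.9
            "status": row["status"].strip().lower(),  # " SHIPPED " -> "shipped"
        })
    return clean

rows = [
    {"order_id": "1", "amount": "19.90", "status": " SHIPPED "},
    {"order_id": "1", "amount": "19.90", "status": " SHIPPED "},  # duplicate
]
print(wrangle(rows))  # [{'order_id': 1, 'amount': 19.9, 'status': 'shipped'}]
```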

Data integration

Place everything together nicely in the same place in a usable model.

ETL

ETL made sense historically: storage was expensive, so you only wanted to load the strictly necessary, already-transformed data into the target DWH.

But this comes with problems:

  • You carry a lot of state outside the warehouse.
  • Debugging and testing pipelines is a pain in the ass.
  • Transformations have to run outside of your target database.
  • Schema changes are a nightmare.

ELT is the new shiny toy:

  • We read raw data from the source systems and load it into our DWH/data lake.
  • We do our transformations inside the target system (sketched below).
  • Schema changes become much more manageable.
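
A minimal ELT sketch, using an in-memory SQLite database as a stand-in for the DWH (table and column names are made up). Note how the transform is just SQL running inside the target; under classic ETL, that same logic would have run in application code before loading:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the target DWH

# Load: raw data lands in the target exactly as extracted, strings and all.
con.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, status TEXT)")
con.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "19.90", " SHIPPED "), ("1", "19.90", " SHIPPED "), ("2", "5.00", "open")],
)

# Transform: a SQL model on top of the raw table, inside the target system.
# (This is essentially what a dbt model is: a SELECT materialized in the DWH.)
con.execute("""
    CREATE VIEW stg_orders AS
    SELECT DISTINCT
        CAST(order_id AS INTEGER) AS order_id,
        CAST(amount AS REAL)      AS amount,
        LOWER(TRIM(status))       AS status
    FROM raw_orders
""")
print(con.execute("SELECT * FROM stg_orders").fetchall())
# [(1, 19.9, 'shipped'), (2, 5.0, 'open')]
```

If the source grows a new column, the raw load keeps working and you only edit the SELECT; that is why schema changes get more manageable.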

Data warehouses and data lakes

Data warehouse

DWH -> any database that:

  • we base our BI and reporting on.
  • is optimized for reads.
  • is typically structured into facts and dimensions (see the sketch below).
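
A tiny facts-and-dimensions (star schema) sketch, again with SQLite standing in for the DWH and made-up names:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Dimension: descriptive attributes, one row per customer.
con.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
# Fact: one row per measurable event, foreign keys into dimensions plus measures.
con.execute("CREATE TABLE fct_orders (order_id INTEGER, customer_id INTEGER, amount REAL)")

con.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                [(1, "Ada", "PT"), (2, "Grace", "ES")])
con.executemany("INSERT INTO fct_orders VALUES (?, ?, ?)",
                [(10, 1, 19.9), (11, 1, 5.0), (12, 2, 7.5)])

# The typical BI read: aggregate the fact table, sliced by dimension attributes.
print(con.execute("""
    SELECT c.country, SUM(f.amount) AS revenue
    FROM fct_orders AS f
    JOIN dim_customer AS c USING (customer_id)
    GROUP BY c.country
""").fetchall())
# e.g. [('ES', 7.5), ('PT', 24.9)]
```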

Data Lake

Decouple storage from compute. Use S3/Blob Storage/Hadoop HDFS for storage; everything gets stored as files. Use a separate query engine, like Athena, Trino, or Spark, to read them.
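
A sketch of that separation, using DuckDB as a stand-in query engine (the file name is made up; in a real lake it would be an s3://... path and the engine might be Athena, Trino, or Spark):

```python
import duckdb

# Storage: plain files that no database server owns.
duckdb.sql("COPY (SELECT 1 AS order_id, 19.9 AS amount) TO 'orders.parquet' (FORMAT PARQUET)")

# Compute: a separate engine pointed at the files; there is no load step.
print(duckdb.sql("SELECT SUM(amount) FROM 'orders.parquet'").fetchall())  # [(19.9,)]
```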

Data Lakehouse

Just use both: a data lake with a DWH-like layer on top. Pretty much a swamp of files with a governance and modelling tool sitting on top of it to control access and ease querying.

The modern data stack