udemy-complete-dbt-bootcamp/notes/8.md

Models

Models are the fundamental concept behind dbt.

They are stored as SQL files in the models folder.

Models can reference each other, which is how dbt maps the dependencies between them.

Materializations

  • Ways in which a model can be stored in the database. There are 4:
    • View: it's just a view
    • Table: the model gets stored as a table
    • Incremental: also stored as a table, but subsequent runs only append new records instead of rebuilding the whole table
    • Ephemeral: actually NOT materialized. The model can be used by downstream models, but it won't exist in the DB; it only becomes a CTE injected into the models that reference it. Mostly for intermediate steps in transformations.

Materializations can be defined at the model level, folder level, and project level. Folder- and project-level settings are defined in the dbt_project.yml file, under the models key.
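As a sketch (the project and folder names are placeholders), folder- and project-level defaults in dbt_project.yml could look like this:

models:
  my_project:                 # project name (placeholder)
    +materialized: view       # project-level default
    marts:                    # subfolder under models/
      +materialized: table    # folder-level override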

To set the materialization config at the model level, add a Jinja block at the start of the file and call dbt's config function. See the example below:

{{
    config(
        materialized = 'incremental',
        on_schema_change = 'fail'
    )
}}

Incremental materializations need a block that defines the logic to apply on incremental loads (as opposed to the 'normal' logic, which gets applied on first runs). See the example below:

[... rest of query ...]
WHERE
    review_text IS NOT NULL
{% if is_incremental() %}
    AND review_date > (SELECT MAX(review_date) FROM {{ this }})
{% endif %}

Bear in mind that defining the strategy that determines what should be loaded is up to the engineer. Any SQL can be placed within the is_incremental() block. In the example above, a date field conveniently signals the most recent date the table has already seen.
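When an incremental model needs to be rebuilt from scratch (for example after a schema change rejected by on_schema_change = 'fail'), the standard dbt CLI flag for that is:

dbt run --full-refresh --select my_model    # drops and rebuilds the table instead of loading incrementally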

Sources and seeds

Seeds are local files that you upload to a DWH from dbt. You place them as CSVs in the seeds folder.
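For example (the file name is hypothetical), dropping seeds/country_codes.csv into the project and running dbt seed loads it as a table, after which it can be referenced like any model:

dbt seed

-- in a model:
SELECT * FROM {{ ref('country_codes') }}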

Sources are an abstraction layer on top of the input tables. They are not strictly necessary, but they help keep the project structured. To create sources, you create a sources.yml file and place it in the models dir. There, you declare the raw input tables (schema and table name) so they can be referenced by name. You can reference sources in other models like this:

{{ source('source_name', 'table_name') }}
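The two arguments to source() map to the source name and the table name declared in sources.yml. A minimal sketch, with placeholder names:

version: 2

sources:
  - name: my_source               # first argument to source()
    schema: raw                   # schema in the DWH
    tables:
      - name: my_table            # second argument to source()
        identifier: raw_my_table  # actual table name, if it differs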

Sources can define freshness constraints that will provide warnings or errors when there is a significant delay.
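A sketch of a freshness config (the thresholds and the loaded_at_field column are placeholders); dbt source freshness is the command that checks it:

sources:
  - name: my_source
    tables:
      - name: my_table
        loaded_at_field: ingested_at
        freshness:
          warn_after: { count: 1, period: hour }
          error_after: { count: 24, period: hour }

dbt source freshness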

Snapshots

Snapshots are a way to build SCD2s. There are two strategies to get this done:

  • Timestamp: all records have a unique key and an updated_at field. dbt will consider a new version of the record necessary in the SCD2 whenever the updated_at field increases.
  • Check: dbt will monitor a set of columns and consider any changes in any of the columns as a new version of the record.

Snapshots get defined with a sql file in the snapshots folder using the snapshot macro block.
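A minimal sketch of a timestamp-strategy snapshot (the snapshot, source and column names are placeholders):

{% snapshot my_table_snapshot %}

{{
    config(
        target_schema = 'snapshots',
        unique_key = 'id',
        strategy = 'timestamp',
        updated_at = 'updated_at'
    )
}}

SELECT * FROM {{ source('my_source', 'my_table') }}

{% endsnapshot %}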

Once snapshots are defined, snapshotting can be triggered at any time by running dbt snapshot. dbt will create the SCD tables in the defined schema and play the dbt_valid_from / dbt_valid_to game whenever changes are detected.

Tests

There are two kinds of tests:

  • Singular tests: any SELECT statement you write. If the query returns any rows, the test is considered failed; if it returns no rows, it is considered passed.
  • Built-in tests: the typical checks: unique, not_null, accepted_values (enum validation) and relationships (referential integrity). See the sketches after this list.
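A singular test is just a .sql file in the tests folder (file and model names here are hypothetical):

-- tests/no_future_dates.sql
SELECT *
FROM {{ ref('my_model') }}
WHERE created_at > CURRENT_DATE

Built-in tests get attached to columns in a yaml file in the models dir (model and column names are placeholders):

version: 2

models:
  - name: my_model
    columns:
      - name: id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'inactive']
      - name: user_id
        tests:
          - relationships:
              to: ref('users')
              field: id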

You can also define your own custom generic tests.
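A sketch of a custom generic test (the name and logic are made up): it lives as a .sql file with a test block and can then be attached to columns in yaml just like the built-in ones.

{% test not_negative(model, column_name) %}

SELECT *
FROM {{ model }}
WHERE {{ column_name }} < 0

{% endtest %}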

Macros

  • Macros are jinja templates.
  • There are many built-in macros in dbt, but you can also write your own (see the sketch after this list).
  • dbt packages exist and provide additional macros and tests that you can install and use.
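A minimal sketch of a user-defined macro (name and logic are assumptions), stored as a .sql file in the macros folder:

{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100, {{ precision }})
{% endmacro %}

It can then be called from any model, e.g. SELECT {{ cents_to_dollars('price_cents') }} AS price FROM ...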

Documentation

  • Documentation is kept in the repo (yay)
  • Documentation can be defined in yaml files or in standalone markdown files. For example, the landing page can be customized with an overview.md file.
  • Documentation can be quick-served with dbt itself, but ideally you should compile it and serve it with a regular web server, like Nginx. The relevant commands are below.
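The standard dbt CLI commands for this:

dbt docs generate    # compiles the documentation site into the target folder
dbt docs serve       # serves it locally for quick inspection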