udemy-complete-dbt-bootcamp/notes/8.md

111 lines
4 KiB
Markdown
Raw Permalink Normal View History

2023-10-30 16:57:30 +01:00
## Models
Models are the fundamental concept behind dbt.
They are stored as SQL files in the `models` folder.
2023-10-30 17:24:01 +01:00
Models can be related between themselves to map dependencies.
## Materializations
- Ways in which a model can be stored in the database. There are 4:
- View: it's just a view
- Table: the model gets stored as a table
- Incremental: also a table, but can only create new records, not update
- Ephemeral: it's actually NOT materializing. The model can be used by dependents, but it won't be materialized in the DB. It will truly only be a CTE that gets used by other models. Mostly for intermediate states in transformations.
2023-10-30 18:04:19 +01:00
Materializations can be defined at the model level, folder level and project level. This can be modified in the `dbt_project.yml` file, under the `models` key.
To set materialization config at the model level, one must make a jinja tag at the start of the file and call the `config` dbt function. See an example below:
```python
{{
config(
materialized = 'incremental',
on_schema_change = 'fail'
)
}}
```
Incremental materializations need to a block that defines the logic to apply in incremental loads (as opposed to the 'normal' logic, that gets apply on first runs). See below an example:
```SQL
[... rest of query ...]
WHERE
review_text IS NOT NULL
{% if is_incremental() %}
AND review_date > (SELECT MAX(review_date) FROM {{ this }})
{% endif %}
```
Bear in mind that how to define the strategy to determine what should be loaded is up to the engineer. Any SQL can be placed within the `if is_incremental()` block. In the example above, we have a date field that easily signals what's the most recent date the table has currently seen.
2023-10-30 17:24:01 +01:00
2023-10-31 17:22:51 +01:00
## Sources and seeds
Seeds are local files that you upload to a DWH from dbt. You place them as CSVs in the `seeds` folder.
2023-11-02 17:05:44 +01:00
Sources are an abstraction layer on top of the input tables. They are not strictly necessary, but can help make the project more structured. To create sources, you create a `sources.yml` file and place it in the `models` dir. Here, you can reference models created in the `models` dir to mark them as sources. You can reference sources in other models like this:
```python
{{ source('domain_name', 'source_name')}}
```
Sources can define _freshness_ constraints that will provide warnings or errors when there is a significant delay.
## Snapshots
Snapshots are a way to build SCD2s. There are two strategies to get this done:
- Timestamp: all records have a unique key and an `update_at` field. dbt will consider a new record is necessary in the SCD2 whenever the `updated_at` field increases.
- Check: dbt will monitor a set of columns and consider any changes in any of the columns as a new version of the record.
Snapshots get defined with a sql file in the `snapshots` folder using the `snapshot` macro block.
Once snapshots are defined, "snapshooting" can be triggered at any time by running `dbt snapshot`. dbt will create the SCD tables in the defined schema and play the `valid_from`, `valid_to` game whenever changes are detected.
## Tests
There are two kinds of tests:
- Singular tests: you make any `SELECT` statement you want. If the `SELECT` statement is run and any data is found, the test is considered failed. If the statement is run and no rows are returned, the test is considered passed.
- Built-in test: just a bunch of typical stuff: uniqueness, nullability, enum validations and relationship (referential integrity)
2023-11-02 18:13:56 +01:00
You can also define your own custom generic tests.
## Macros
- Macros are jinja templates.
- There are many built-in macros in dbt, but you can also use your own macros.
2023-11-03 10:26:11 +01:00
- dbt packages exist and you can use them to have more tests and macros that you can use.
## Documentation
- Documentation is kept in the repo (yay)
- Documentation can be defined in yaml files or in standalone markdown files. For example, the landing page can be customized with an `overview.md` file.
- Documentation can be quick-served with dbt, but ideally you should compile it and serve it with a regular web server, like Nginx.
-