Compare commits
10 commits
3e8423dbd8...cc644be0bc
| Author | SHA1 | Date | |
|---|---|---|---|
| | cc644be0bc | | |
| | fb4534db58 | | |
| | 3fce12819b | | |
| | 045ce1ec45 | | |
| | c2f2739e7e | | |
| | bccbff7205 | | |
| | 1e53b3895c | | |
| | 7480222cc7 | | |
| | 2b6c385b8c | | |
| | f85adbde93 | | |
21 changed files with 490 additions and 7 deletions

@@ -70,6 +70,12 @@ After, you will have to download some CSV files with the data to populate the da
 aws s3 cp s3://dbtlearn/listings.csv listings.csv
 aws s3 cp s3://dbtlearn/reviews.csv reviews.csv
 aws s3 cp s3://dbtlearn/hosts.csv hosts.csv
+
+# or, to avoid using aws
+
+wget http://dbtlearn.s3.amazonaws.com/listings.csv
+wget http://dbtlearn.s3.amazonaws.com/reviews.csv
+wget http://dbtlearn.s3.amazonaws.com/hosts.csv
 ```

 How to put the data into the databases is up to you. I've done it successfully using the import functionality of DBeaver.

@@ -4,6 +4,8 @@ This is the dbt project for the course.

 ## Set up

+Make a venv and install the requirements listed in `requirements.txt`.
+
 You need to place a profile for the local postgres instance in `~/.dbt/profiles.yml`. See below a sample config that should be a good starting point if you follow the instructions in the `database` dir of this project.

 ```yaml
@@ -25,6 +27,8 @@ dbtlearn:

 Once you have set this up and the database as well, you can run `dbt debug` to ensure everything is set up correctly and dbt can reach the database.

+To install the required dbt packages, run `dbt deps`.
+
 You should also delete the lines under the `dbtlearn` key in the `dbt_project.yml` file.

 Also delete the contents of the `models` folder.

@@ -30,5 +30,7 @@ clean-targets: # directories to be removed by `dbt clean`
 models:
   dbtlearn:
     +materialized: view # Default way to materialize is view
+    src:
+      +materialized: ephemeral
     dim:
       +materialized: table

10
code_thingies/dbtlearn/macros/no_nulls_in_columns.sql
Normal file
@@ -0,0 +1,10 @@
+{% macro no_nulls_in_columns(model) %}
+    SELECT *
+    FROM
+        {{ model }}
+    WHERE
+        {% for col in adapter.get_columns_in_relation(model) -%}
+            {{col.column}} IS NULL OR
+        {% endfor %}
+        FALSE
+{% endmacro %}

8
code_thingies/dbtlearn/macros/positive_value.sql
Normal file
@@ -0,0 +1,8 @@
+{% test positive_value(model, column_name) %}
+SELECT
+    *
+FROM
+    {{ model }}
+WHERE
+    {{ column_name }} < 1
+{% endtest %}

@@ -1,3 +1,9 @@
+{{
+    config(
+        materialized = 'view'
+    )
+}}
+
 WITH src_hosts AS(
     SELECT *
     FROM {{ ref('src_hosts') }}

@@ -1,3 +1,9 @@
+{{
+    config(
+        materialized = 'view'
+    )
+}}
+
 WITH src_listings AS (
     SELECT *
     FROM
@@ -10,7 +16,7 @@ SELECT
     CASE
         WHEN minimum_nights = 0 THEN 1
         ELSE minimum_nights
-    END AS mininum_nights,
+    END AS minimum_nights,
     host_id,
     REPLACE(price_str,'$','')::money AS price,
     created_at,

@@ -13,7 +13,7 @@ SELECT
     listings.listing_id,
     listings.listing_name,
     listings.room_type,
-    listings.mininum_nights,
+    listings.minimum_nights,
     listings.price,
     listings.host_id,
     hosts.host_name,

@@ -9,7 +9,9 @@ WITH src_reviews AS (
     FROM
         {{ ref('src_reviews') }}
 )
-SELECT *
+SELECT
+    {{ dbt_utils.surrogate_key(['listing_id', 'review_date', 'reviewer_name', 'review_text']) }} as review_id,
+    *
 FROM
     src_reviews
 WHERE

27
code_thingies/dbtlearn/models/mart/mart_fullmoon_reviews.sql
Normal file
@@ -0,0 +1,27 @@
+{{
+    config(
+        materialized = 'table'
+    )
+}}
+
+WITH fact_reviews AS (
+    SELECT *
+    FROM
+        {{ ref('fact_reviews') }}
+),
+full_moon_dates AS (
+    SELECT *
+    FROM
+        {{ ref('seed_full_moon_dates')}}
+)
+
+SELECT
+    fr.*,
+    CASE
+        WHEN fm.full_moon_date IS NULL THEN 'not full moon'
+        ELSE 'full moon'
+    END AS is_full_moon
+FROM
+    fact_reviews fr
+    LEFT JOIN full_moon_dates fm
+    ON (fr.review_date::date) = (fm.full_moon_date + interval '1' day)

31
code_thingies/dbtlearn/models/schema.yml
Normal file
@@ -0,0 +1,31 @@
+version: 2
+
+models:
+  - name: dim_listings_cleansed
+    columns:
+      - name: listing_id
+        tests:
+          - unique
+          - not_null
+
+      - name: host_id
+        tests:
+          - not_null
+          - relationships:
+              to: ref('dim_hosts_cleansed')
+              field: host_id
+
+      - name: room_type
+        tests:
+          - accepted_values:
+              values: [
+                'Entire home/apt',
+                'Private room',
+                'Shared room',
+                'Hotel room'
+              ]
+
+      - name: minimum_nights
+        tests:
+          - positive_value
+

12
code_thingies/dbtlearn/models/sources.yml
Normal file
@@ -0,0 +1,12 @@
+version: 2
+
+sources:
+  - name: airbnb
+    schema: raw
+    tables:
+      - name: listings
+        identifier: raw_listings
+      - name: hosts
+        identifier: raw_hosts
+      - name: reviews
+        identifier: raw_reviews

@@ -1,6 +1,6 @@
 WITH raw_hosts AS (
     SELECT *
-    FROM raw.raw_hosts
+    FROM {{ source ('airbnb', 'hosts')}}
 )
 SELECT
     id as host_id,

@@ -1,6 +1,6 @@
 WITH raw_listings AS (
     SELECT *
-    FROM raw.raw_listings
+    FROM {{ source ('airbnb', 'listings')}}
 )
 SELECT
     id AS listing_id,

@@ -1,10 +1,11 @@
 WITH raw_reviews AS (
     SELECT *
-    FROM raw.raw_reviews
+    FROM {{ source ('airbnb', 'reviews')}}
 )
 SELECT
     listing_id,
     date AS review_date,
+    reviewer_name AS reviewer_name,
     comments AS review_text,
     sentiment AS review_sentiment
 FROM

3
code_thingies/dbtlearn/packages.yml
Normal file
@@ -0,0 +1,3 @@
+packages:
+  - package: dbt-labs/dbt_utils
+    version: 0.8.0

273
code_thingies/dbtlearn/seeds/seed_full_moon_dates.csv
Normal file
@@ -0,0 +1,273 @@
+full_moon_date
+2009-01-11
+2009-02-09
+2009-03-11
+2009-04-09
+2009-05-09
+2009-06-07
+2009-07-07
+2009-08-06
+2009-09-04
+2009-10-04
+2009-11-02
+2009-12-02
+2009-12-31
+2010-01-30
+2010-02-28
+2010-03-30
+2010-04-28
+2010-05-28
+2010-06-26
+2010-07-26
+2010-08-24
+2010-09-23
+2010-10-23
+2010-11-21
+2010-12-21
+2011-01-19
+2011-02-18
+2011-03-19
+2011-04-18
+2011-05-17
+2011-06-15
+2011-07-15
+2011-08-13
+2011-09-12
+2011-10-12
+2011-11-10
+2011-12-10
+2012-01-09
+2012-02-07
+2012-03-08
+2012-04-06
+2012-05-06
+2012-06-04
+2012-07-03
+2012-08-02
+2012-08-31
+2012-09-30
+2012-10-29
+2012-11-28
+2012-12-28
+2013-01-27
+2013-02-25
+2013-03-27
+2013-04-25
+2013-05-25
+2013-06-23
+2013-07-22
+2013-08-21
+2013-09-19
+2013-10-19
+2013-11-17
+2013-12-17
+2014-01-16
+2014-02-15
+2014-03-16
+2014-04-15
+2014-05-14
+2014-06-13
+2014-07-12
+2014-08-10
+2014-09-09
+2014-10-08
+2014-11-06
+2014-12-06
+2015-01-05
+2015-02-04
+2015-03-05
+2015-04-04
+2015-05-04
+2015-06-02
+2015-07-02
+2015-07-31
+2015-08-29
+2015-09-28
+2015-10-27
+2015-11-25
+2015-12-25
+2016-01-24
+2016-02-22
+2016-03-23
+2016-04-22
+2016-05-21
+2016-06-20
+2016-07-20
+2016-08-18
+2016-09-16
+2016-10-16
+2016-11-14
+2016-12-14
+2017-01-12
+2017-02-11
+2017-03-12
+2017-04-11
+2017-05-10
+2017-06-09
+2017-07-09
+2017-08-07
+2017-09-06
+2017-10-05
+2017-11-04
+2017-12-03
+2018-01-02
+2018-01-31
+2018-03-02
+2018-03-31
+2018-04-30
+2018-05-29
+2018-06-28
+2018-07-27
+2018-08-26
+2018-09-25
+2018-10-24
+2018-11-23
+2018-12-22
+2019-01-21
+2019-02-19
+2019-03-21
+2019-04-19
+2019-05-18
+2019-06-17
+2019-07-16
+2019-08-15
+2019-09-14
+2019-10-13
+2019-11-12
+2019-12-12
+2020-01-10
+2020-02-09
+2020-03-09
+2020-04-08
+2020-05-07
+2020-06-05
+2020-07-05
+2020-08-03
+2020-09-02
+2020-10-01
+2020-10-31
+2020-11-30
+2020-12-30
+2021-01-28
+2021-02-27
+2021-03-28
+2021-04-27
+2021-05-26
+2021-06-24
+2021-07-24
+2021-08-22
+2021-09-21
+2021-10-20
+2021-11-19
+2021-12-19
+2022-01-18
+2022-02-16
+2022-03-18
+2022-04-16
+2022-05-16
+2022-06-14
+2022-07-13
+2022-08-12
+2022-09-10
+2022-10-09
+2022-11-08
+2022-12-08
+2023-01-07
+2023-02-05
+2023-03-07
+2023-04-06
+2023-05-05
+2023-06-04
+2023-07-03
+2023-08-01
+2023-08-31
+2023-09-29
+2023-10-28
+2023-11-27
+2023-12-27
+2024-01-25
+2024-02-24
+2024-03-25
+2024-04-24
+2024-05-23
+2024-06-22
+2024-07-21
+2024-08-19
+2024-09-18
+2024-10-17
+2024-11-15
+2024-12-15
+2025-01-13
+2025-02-12
+2025-03-14
+2025-04-13
+2025-05-12
+2025-06-11
+2025-07-10
+2025-08-09
+2025-09-07
+2025-10-07
+2025-11-05
+2025-12-05
+2026-01-03
+2026-02-01
+2026-03-03
+2026-04-02
+2026-05-01
+2026-05-31
+2026-06-30
+2026-07-29
+2026-08-28
+2026-09-26
+2026-10-26
+2026-11-24
+2026-12-24
+2027-01-22
+2027-02-21
+2027-03-22
+2027-04-21
+2027-05-20
+2027-06-19
+2027-07-18
+2027-08-17
+2027-09-16
+2027-10-15
+2027-11-14
+2027-12-13
+2028-01-12
+2028-02-10
+2028-03-11
+2028-04-09
+2028-05-08
+2028-06-07
+2028-07-06
+2028-08-05
+2028-09-04
+2028-10-03
+2028-11-02
+2028-12-02
+2028-12-31
+2029-01-30
+2029-02-28
+2029-03-30
+2029-04-28
+2029-05-27
+2029-06-26
+2029-07-25
+2029-08-24
+2029-09-22
+2029-10-22
+2029-11-21
+2029-12-20
+2030-01-19
+2030-02-18
+2030-03-19
+2030-04-18
+2030-05-17
+2030-06-15
+2030-07-15
+2030-08-13
+2030-09-11
+2030-10-11
+2030-11-10
+2030-12-09

17
code_thingies/dbtlearn/snapshots/scd_raw_listings.sql
Normal file
@@ -0,0 +1,17 @@
+{% snapshot scd_raw_listings %}
+
+{{
+    config(
+        target_schema = 'dev',
+        unique_key = 'id',
+        strategy = 'timestamp',
+        updated_at = 'updated_at',
+        invalidate_hard_deletes = True
+    )
+}}
+
+SELECT *
+FROM
+    {{ source('airbnb', 'listings')}}
+
+{% endsnapshot %}

9
code_thingies/dbtlearn/tests/consistent_created_at.sql
Normal file
@@ -0,0 +1,9 @@
+SELECT *
+FROM
+    {{ ref('fact_reviews') }} fr
+LEFT JOIN
+    {{ ref('dim_listings_cleansed') }} dl
+ON
+    fr.listing_id = dl.listing_id
+WHERE
+    fr.review_date < dl.created_at

@@ -0,0 +1 @@
+{{ no_nulls_in_columns(ref('dim_listings_cleansed')) }}

67
notes/8.md
@@ -42,4 +42,69 @@ WHERE

 Bear in mind that the strategy for deciding which rows get loaded is up to the engineer. Any SQL can be placed within the `if is_incremental()` block. In the example above, a date field easily signals the most recent date the table has seen so far.

-##
+## Sources and seeds
+
+Seeds are local files that you upload to a DWH from dbt. You place them as CSVs in the `seeds` folder and load them with `dbt seed`.
+
+Sources are an abstraction layer on top of the input tables. They are not strictly necessary, but can help make the project more structured. To create sources, you create a `sources.yml` file and place it in the `models` dir. There, you declare the raw input tables of the DWH as sources. You can reference sources in other models like this:
+
+```sql
+{{ source('domain_name', 'source_name')}}
+```
+
+Sources can define _freshness_ constraints that will provide warnings or errors when the data has not been updated for longer than a configured threshold.
+
+## Snapshots
+
+Snapshots are a way to build SCD2s (type 2 slowly changing dimensions). There are two strategies to get this done:
+- Timestamp: all records have a unique key and an `updated_at` field. dbt considers a new record necessary in the SCD2 whenever the `updated_at` field increases.
+- Check: dbt monitors a set of columns and treats a change in any of them as a new version of the record.
+
+Snapshots are defined in a SQL file in the `snapshots` folder using the `snapshot` block.
+
+Once snapshots are defined, snapshotting can be triggered at any time by running `dbt snapshot`. dbt will create the SCD tables in the defined schema and maintain the `dbt_valid_from` and `dbt_valid_to` columns whenever changes are detected.
+
+## Tests
+
+There are two kinds of tests:
+
+- Singular tests: you write any `SELECT` statement you want. If the statement returns any rows, the test fails. If it returns no rows, the test passes.
+- Built-in generic tests: the usual checks, namely uniqueness (`unique`), nullability (`not_null`), enum validation (`accepted_values`) and referential integrity (`relationships`).
+
+You can also define your own custom generic tests.
+
+## Macros
+
+- Macros are Jinja templates.
+- There are many built-in macros in dbt, but you can also write your own.
+- dbt packages provide additional tests and macros.
+
+## Documentation
+
+- Documentation is kept in the repo (yay).
+- Documentation can be defined in yaml files or in standalone markdown files. For example, the landing page can be customized with an `overview.md` file.
+- Documentation can be quick-served with dbt (`dbt docs serve`), but ideally you should generate it (`dbt docs generate`) and serve it with a regular web server, like Nginx.
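
The notes in `notes/8.md` mention source freshness constraints, but no file in this diff configures one. As a minimal sketch only, assuming the `airbnb` source from `sources.yml` and assuming the `date` column of the raw reviews table is a usable load timestamp, a freshness check could be declared roughly like this. The thresholds are illustrative assumptions, not values taken from the repo.

```yaml
version: 2

sources:
  - name: airbnb
    schema: raw
    tables:
      - name: reviews
        identifier: raw_reviews
        # Assumption: `date` records when each row arrived in the raw table.
        loaded_at_field: date
        freshness:
          # Hypothetical thresholds, chosen only for illustration.
          warn_after: {count: 1, period: hour}
          error_after: {count: 24, period: hour}
```

With a config along these lines, `dbt source freshness` would compare the most recent `date` value against the two thresholds and report a warning or an error accordingly.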