diff --git a/code_thingies/database/README.md b/code_thingies/database/README.md
index f767fff..4dba9cd 100644
--- a/code_thingies/database/README.md
+++ b/code_thingies/database/README.md
@@ -70,12 +70,6 @@ After, you will have to download some CSV files with the data to populate the da
 aws s3 cp s3://dbtlearn/listings.csv listings.csv
 aws s3 cp s3://dbtlearn/reviews.csv reviews.csv
 aws s3 cp s3://dbtlearn/hosts.csv hosts.csv
-
-# or, to avoid using aws
-
-wget http://dbtlearn.s3.amazonaws.com/listings.csv
-wget http://dbtlearn.s3.amazonaws.com/reviews.csv
-wget http://dbtlearn.s3.amazonaws.com/hosts.csv
 ```
 
 How to put the data into the databases is up to you. I've done it successfully using the import functionality of DBeaver.
diff --git a/code_thingies/dbtlearn/README.md b/code_thingies/dbtlearn/README.md
index 0ab7c8e..d470985 100644
--- a/code_thingies/dbtlearn/README.md
+++ b/code_thingies/dbtlearn/README.md
@@ -4,8 +4,6 @@ This is the dbt project for the course.
 
 ## Set up
 
-Make a venv and install the requirements listed in `requirements.txt`.
-
 You need to place a profile for the local postgres instance in `~/.dbt/profiles.yaml`. See below a sample config that should be a good starting point if you follow the instructions in the `database` dir of this project.
 
 ```yaml
@@ -27,8 +25,6 @@ dbtlearn:
 
 Once you have set this up and the database as well, you can run `dbt debug` to ensure everything is set up correctly and dbt can reach the database.
 
-To install the required dbt packages, run `dbt deps`.
-
 You should also delete the lines under the `dbtlearn` key in the `dbt_project.yml` file. Also delete the contents of the `models` folder.
diff --git a/code_thingies/dbtlearn/dbt_project.yml b/code_thingies/dbtlearn/dbt_project.yml
index b8a4515..3c2a10f 100644
--- a/code_thingies/dbtlearn/dbt_project.yml
+++ b/code_thingies/dbtlearn/dbt_project.yml
@@ -30,7 +30,5 @@ clean-targets: # directories to be removed by `dbt clean`
 models:
   dbtlearn:
     +materialized: view # Default way to materialize is view
-    src:
-      +materialized: ephemeral
     dim:
       +materialized: table
diff --git a/code_thingies/dbtlearn/macros/no_nulls_in_columns.sql b/code_thingies/dbtlearn/macros/no_nulls_in_columns.sql
deleted file mode 100644
index 40af5d4..0000000
--- a/code_thingies/dbtlearn/macros/no_nulls_in_columns.sql
+++ /dev/null
@@ -1,10 +0,0 @@
-{% macro no_nulls_in_columns(model) %}
-    SELECT *
-    FROM
-        {{ model }}
-    WHERE
-        {% for col in adapter.get_columns_in_relation(model) -%}
-            {{col.column}} IS NULL OR
-        {% endfor %}
-        FALSE
-{% endmacro %}
\ No newline at end of file
diff --git a/code_thingies/dbtlearn/macros/positive_value.sql b/code_thingies/dbtlearn/macros/positive_value.sql
deleted file mode 100644
index 148a7bc..0000000
--- a/code_thingies/dbtlearn/macros/positive_value.sql
+++ /dev/null
@@ -1,8 +0,0 @@
-{% test positive_value(model, column_name) %}
-SELECT
-    *
-FROM
-    {{ model }}
-WHERE
-    {{ column_name }} < 1
-{% endtest %}
\ No newline at end of file
diff --git a/code_thingies/dbtlearn/models/dim/dim_hosts_cleansed.sql b/code_thingies/dbtlearn/models/dim/dim_hosts_cleansed.sql
index debe188..c489d1b 100644
--- a/code_thingies/dbtlearn/models/dim/dim_hosts_cleansed.sql
+++ b/code_thingies/dbtlearn/models/dim/dim_hosts_cleansed.sql
@@ -1,9 +1,3 @@
-{{
-  config(
-    materialized = 'view'
-  )
-}}
-
 WITH src_hosts AS(
     SELECT *
     FROM {{ ref('src_hosts') }}
diff --git a/code_thingies/dbtlearn/models/dim/dim_listings_cleansed.sql b/code_thingies/dbtlearn/models/dim/dim_listings_cleansed.sql
index de68ced..6e5de55 100644
--- a/code_thingies/dbtlearn/models/dim/dim_listings_cleansed.sql
+++ b/code_thingies/dbtlearn/models/dim/dim_listings_cleansed.sql
@@ -1,9 +1,3 @@
-{{
-  config(
-    materialized = 'view'
-  )
-}}
-
 WITH src_listings AS (
     SELECT *
     FROM
@@ -16,7 +10,7 @@ SELECT
     CASE
         WHEN minimum_nights = 0 THEN 1
         ELSE minimum_nights
-    END AS minimum_nights,
+    END AS mininum_nights,
     host_id,
     REPLACE(price_str,'$','')::money AS price,
     created_at,
diff --git a/code_thingies/dbtlearn/models/dim/dim_listings_with_hosts.sql b/code_thingies/dbtlearn/models/dim/dim_listings_with_hosts.sql
index 6ccf1f0..ee05cc2 100644
--- a/code_thingies/dbtlearn/models/dim/dim_listings_with_hosts.sql
+++ b/code_thingies/dbtlearn/models/dim/dim_listings_with_hosts.sql
@@ -13,7 +13,7 @@ SELECT
     listings.listing_id,
     listings.listing_name,
     listings.room_type,
-    listings.minimum_nights,
+    listings.mininum_nights,
     listings.price,
     listings.host_id,
     hosts.host_name,
diff --git a/code_thingies/dbtlearn/models/fact/fact_reviews.sql b/code_thingies/dbtlearn/models/fact/fact_reviews.sql
index 9fd3dc8..8a707fa 100644
--- a/code_thingies/dbtlearn/models/fact/fact_reviews.sql
+++ b/code_thingies/dbtlearn/models/fact/fact_reviews.sql
@@ -9,9 +9,7 @@ WITH src_reviews AS (
     FROM
         {{ ref('src_reviews') }}
 )
-SELECT
-    {{ dbt_utils.surrogate_key(['listing_id', 'review_date', 'reviewer_name', 'review_text']) }} as review_id,
-    *
+SELECT *
 FROM
     src_reviews
 WHERE
diff --git a/code_thingies/dbtlearn/models/mart/mart_fullmoon_reviews.sql b/code_thingies/dbtlearn/models/mart/mart_fullmoon_reviews.sql
deleted file mode 100644
index 199b808..0000000
--- a/code_thingies/dbtlearn/models/mart/mart_fullmoon_reviews.sql
+++ /dev/null
@@ -1,27 +0,0 @@
-{{
-  config(
-    materialized = 'table'
-  )
-}}
-
-WITH fact_reviews AS (
-    SELECT *
-    FROM
-        {{ ref('fact_reviews') }}
-),
-full_moon_dates AS (
-    SELECT *
-    FROM
-        {{ ref('seed_full_moon_dates')}}
-)
-
-SELECT
-    fr.*,
-    CASE
-        WHEN fm.full_moon_date IS NULL THEN 'not full moon'
-        ELSE 'full moon'
-    END AS is_full_moon
-FROM
-    fact_reviews fr
-    LEFT JOIN full_moon_dates fm
-    ON (fr.review_date::date) = (fm.full_moon_date + interval '1' day)
\ No newline at end of file
diff --git a/code_thingies/dbtlearn/models/schema.yml b/code_thingies/dbtlearn/models/schema.yml
deleted file mode 100644
index cefd2d4..0000000
--- a/code_thingies/dbtlearn/models/schema.yml
+++ /dev/null
@@ -1,31 +0,0 @@
-version: 2
-
-models:
-  - name: dim_listings_cleansed
-    columns:
-      - name: listing_id
-        tests:
-          - unique
-          - not_null
-
-      - name: host_id
-        tests:
-          - not_null
-          - relationships:
-              to: ref('dim_hosts_cleansed')
-              field: host_id
-
-      - name: room_type
-        tests:
-          - accepted_values:
-              values: [
-                'Entire home/apt',
-                'Private room',
-                'Shared room',
-                'Hotel room'
-              ]
-
-      - name: minimum_nights
-        tests:
-          - positive_value
-
diff --git a/code_thingies/dbtlearn/models/sources.yml b/code_thingies/dbtlearn/models/sources.yml
deleted file mode 100644
index 0e2db1c..0000000
--- a/code_thingies/dbtlearn/models/sources.yml
+++ /dev/null
@@ -1,12 +0,0 @@
-version: 2
-
-sources:
-  - name: airbnb
-    schema: raw
-    tables:
-      - name: listings
-        identifier: raw_listings
-      - name: hosts
-        identifier: raw_hosts
-      - name: reviews
-        identifier: raw_reviews
\ No newline at end of file
diff --git a/code_thingies/dbtlearn/models/src/src_hosts.sql b/code_thingies/dbtlearn/models/src/src_hosts.sql
index 9d9d5b3..b33b25d 100644
--- a/code_thingies/dbtlearn/models/src/src_hosts.sql
+++ b/code_thingies/dbtlearn/models/src/src_hosts.sql
@@ -1,6 +1,6 @@
 WITH raw_hosts AS (
     SELECT *
-    FROM {{ source ('airbnb', 'hosts')}}
+    FROM raw.raw_hosts
 )
 SELECT
     id as host_id,
diff --git a/code_thingies/dbtlearn/models/src/src_listings.sql b/code_thingies/dbtlearn/models/src/src_listings.sql
index 4c09b3e..c68838c 100644
--- a/code_thingies/dbtlearn/models/src/src_listings.sql
+++ b/code_thingies/dbtlearn/models/src/src_listings.sql
@@ -1,6 +1,6 @@
 WITH raw_listings AS (
     SELECT *
-    FROM {{ source ('airbnb', 'listings')}}
+    FROM raw.raw_listings
 )
 SELECT
     id AS listing_id,
diff --git a/code_thingies/dbtlearn/models/src/src_reviews.sql b/code_thingies/dbtlearn/models/src/src_reviews.sql
index 2a99151..59d8167 100644
--- a/code_thingies/dbtlearn/models/src/src_reviews.sql
+++ b/code_thingies/dbtlearn/models/src/src_reviews.sql
@@ -1,11 +1,10 @@
 WITH raw_reviews AS (
     SELECT *
-    FROM {{ source ('airbnb', 'reviews')}}
+    FROM raw.raw_reviews
 )
 SELECT
     listing_id,
     date AS review_date,
-    reviewer_name AS reviewer_name,
     comments AS review_text,
     sentiment AS review_sentiment
 FROM
diff --git a/code_thingies/dbtlearn/packages.yml b/code_thingies/dbtlearn/packages.yml
deleted file mode 100644
index 9e74ff0..0000000
--- a/code_thingies/dbtlearn/packages.yml
+++ /dev/null
@@ -1,3 +0,0 @@
-packages:
-  - package: dbt-labs/dbt_utils
-    version: 0.8.0
\ No newline at end of file
diff --git a/code_thingies/dbtlearn/seeds/seed_full_moon_dates.csv b/code_thingies/dbtlearn/seeds/seed_full_moon_dates.csv
deleted file mode 100644
index c9d373e..0000000
--- a/code_thingies/dbtlearn/seeds/seed_full_moon_dates.csv
+++ /dev/null
@@ -1,273 +0,0 @@
-full_moon_date
-2009-01-11
-2009-02-09
-2009-03-11
-2009-04-09
-2009-05-09
-2009-06-07
-2009-07-07
-2009-08-06
-2009-09-04
-2009-10-04
-2009-11-02
-2009-12-02
-2009-12-31
-2010-01-30
-2010-02-28
-2010-03-30
-2010-04-28
-2010-05-28
-2010-06-26
-2010-07-26
-2010-08-24
-2010-09-23
-2010-10-23
-2010-11-21
-2010-12-21
-2011-01-19
-2011-02-18
-2011-03-19
-2011-04-18
-2011-05-17
-2011-06-15
-2011-07-15
-2011-08-13
-2011-09-12
-2011-10-12
-2011-11-10
-2011-12-10
-2012-01-09
-2012-02-07
-2012-03-08
-2012-04-06
-2012-05-06
-2012-06-04
-2012-07-03
-2012-08-02
-2012-08-31
-2012-09-30
-2012-10-29
-2012-11-28
-2012-12-28
-2013-01-27
-2013-02-25
-2013-03-27
-2013-04-25
-2013-05-25
-2013-06-23
-2013-07-22
-2013-08-21
-2013-09-19
-2013-10-19
-2013-11-17
-2013-12-17
-2014-01-16
-2014-02-15
-2014-03-16
-2014-04-15
-2014-05-14
-2014-06-13
-2014-07-12
-2014-08-10
-2014-09-09
-2014-10-08
-2014-11-06
-2014-12-06
-2015-01-05
-2015-02-04
-2015-03-05
-2015-04-04
-2015-05-04
-2015-06-02
-2015-07-02
-2015-07-31
-2015-08-29
-2015-09-28
-2015-10-27
-2015-11-25
-2015-12-25
-2016-01-24
-2016-02-22
-2016-03-23
-2016-04-22
-2016-05-21
-2016-06-20
-2016-07-20
-2016-08-18
-2016-09-16
-2016-10-16
-2016-11-14
-2016-12-14
-2017-01-12
-2017-02-11
-2017-03-12
-2017-04-11
-2017-05-10
-2017-06-09
-2017-07-09
-2017-08-07
-2017-09-06
-2017-10-05
-2017-11-04
-2017-12-03
-2018-01-02
-2018-01-31
-2018-03-02
-2018-03-31
-2018-04-30
-2018-05-29
-2018-06-28
-2018-07-27
-2018-08-26
-2018-09-25
-2018-10-24
-2018-11-23
-2018-12-22
-2019-01-21
-2019-02-19
-2019-03-21
-2019-04-19
-2019-05-18
-2019-06-17
-2019-07-16
-2019-08-15
-2019-09-14
-2019-10-13
-2019-11-12
-2019-12-12
-2020-01-10
-2020-02-09
-2020-03-09
-2020-04-08
-2020-05-07
-2020-06-05
-2020-07-05
-2020-08-03
-2020-09-02
-2020-10-01
-2020-10-31
-2020-11-30
-2020-12-30
-2021-01-28
-2021-02-27
-2021-03-28
-2021-04-27
-2021-05-26
-2021-06-24
-2021-07-24
-2021-08-22
-2021-09-21
-2021-10-20
-2021-11-19
-2021-12-19
-2022-01-18
-2022-02-16
-2022-03-18
-2022-04-16
-2022-05-16
-2022-06-14
-2022-07-13
-2022-08-12
-2022-09-10
-2022-10-09
-2022-11-08
-2022-12-08
-2023-01-07
-2023-02-05
-2023-03-07
-2023-04-06
-2023-05-05
-2023-06-04
-2023-07-03
-2023-08-01
-2023-08-31
-2023-09-29
-2023-10-28
-2023-11-27
-2023-12-27
-2024-01-25
-2024-02-24
-2024-03-25
-2024-04-24
-2024-05-23
-2024-06-22
-2024-07-21
-2024-08-19
-2024-09-18
-2024-10-17
-2024-11-15
-2024-12-15
-2025-01-13
-2025-02-12
-2025-03-14
-2025-04-13
-2025-05-12
-2025-06-11
-2025-07-10
-2025-08-09
-2025-09-07
-2025-10-07
-2025-11-05
-2025-12-05
-2026-01-03
-2026-02-01
-2026-03-03
-2026-04-02
-2026-05-01
-2026-05-31
-2026-06-30
-2026-07-29
-2026-08-28
-2026-09-26
-2026-10-26
-2026-11-24
-2026-12-24
-2027-01-22
-2027-02-21
-2027-03-22
-2027-04-21
-2027-05-20
-2027-06-19
-2027-07-18
-2027-08-17
-2027-09-16
-2027-10-15
-2027-11-14
-2027-12-13
-2028-01-12
-2028-02-10
-2028-03-11
-2028-04-09
-2028-05-08
-2028-06-07
-2028-07-06
-2028-08-05
-2028-09-04
-2028-10-03
-2028-11-02
-2028-12-02
-2028-12-31
-2029-01-30
-2029-02-28
-2029-03-30
-2029-04-28
-2029-05-27
-2029-06-26
-2029-07-25
-2029-08-24
-2029-09-22
-2029-10-22
-2029-11-21
-2029-12-20
-2030-01-19
-2030-02-18
-2030-03-19
-2030-04-18
-2030-05-17
-2030-06-15
-2030-07-15
-2030-08-13
-2030-09-11
-2030-10-11
-2030-11-10
-2030-12-09
diff --git a/code_thingies/dbtlearn/snapshots/scd_raw_listings.sql b/code_thingies/dbtlearn/snapshots/scd_raw_listings.sql
deleted file mode 100644
index 4108a21..0000000
--- a/code_thingies/dbtlearn/snapshots/scd_raw_listings.sql
+++ /dev/null
@@ -1,17 +0,0 @@
-{% snapshot scd_raw_listings %}
-
-{{
-    config(
-        target_schema = 'dev',
-        unique_key = 'id',
-        strategy = 'timestamp',
-        updated_at = 'updated_at',
-        invalidate_hard_deletes = True
-    )
-}}
-
-SELECT *
-FROM
-    {{ source('airbnb', 'listings')}}
-
-{% endsnapshot %}
diff --git a/code_thingies/dbtlearn/tests/consistent_created_at.sql b/code_thingies/dbtlearn/tests/consistent_created_at.sql
deleted file mode 100644
index 2fb6893..0000000
--- a/code_thingies/dbtlearn/tests/consistent_created_at.sql
+++ /dev/null
@@ -1,9 +0,0 @@
-SELECT *
-FROM
-    {{ ref('fact_reviews') }} fr
-LEFT JOIN
-    {{ ref('dim_listings_cleansed') }} dl
-ON
-    fr.listing_id = dl.listing_id
-WHERE
-    fr.review_date < dl.created_at
\ No newline at end of file
diff --git a/code_thingies/dbtlearn/tests/no_nulls_in_dim_listings.sql b/code_thingies/dbtlearn/tests/no_nulls_in_dim_listings.sql
deleted file mode 100644
index e2a6fb5..0000000
--- a/code_thingies/dbtlearn/tests/no_nulls_in_dim_listings.sql
+++ /dev/null
@@ -1 +0,0 @@
-{{ no_nulls_in_columns(ref('dim_listings_cleansed')) }}
\ No newline at end of file
diff --git a/notes/8.md b/notes/8.md
index 33003e5..0d1c83f 100644
--- a/notes/8.md
+++ b/notes/8.md
@@ -42,69 +42,4 @@ WHERE
 
 Bear in mind that how to define the strategy to determine what should be loaded is up to the engineer. Any SQL can be placed within the `if is_incremental()` block. In the example above, we have a date field that easily signals what's the most recent date the table has currently seen.
 
-## Sources and seeds
-
-Seeds are local files that you upload to a DWH from dbt. You place them as CSVs in the `seeds` folder.
-
-
-Sources are an abstraction layer on top of the input tables. They are not strictly necessary, but can help make the project more structured. To create sources, you create a `sources.yml` file and place it in the `models` dir. Here, you can reference models created in the `models` dir to mark them as sources. You can reference sources in other models like this:
-
-```python
-{{ source('domain_name', 'source_name')}}
-```
-
-Sources can define _freshness_ constraints that will provide warnings or errors when there is a significant delay.
-
-
-## Snapshots
-
-Snapshots are a way to build SCD2s. There are two strategies to get this done:
- - Timestamp: all records have a unique key and an `update_at` field. dbt will consider a new record is necessary in the SCD2 whenever the `updated_at` field increases.
- - Check: dbt will monitor a set of columns and consider any changes in any of the columns as a new version of the record.
-
-Snapshots get defined with a sql file in the `snapshots` folder using the `snapshot` macro block.
-
-Once snapshots are defined, "snapshooting" can be triggered at any time by running `dbt snapshot`. dbt will create the SCD tables in the defined schema and play the `valid_from`, `valid_to` game whenever changes are detected.
-
-## Tests
-
-There are two kinds of tests:
-
-- Singular tests: you make any `SELECT` statement you want. If the `SELECT` statement is run and any data is found, the test is considered failed. If the statement is run and no rows are returned, the test is considered passed.
-- Built-in test: just a bunch of typical stuff: uniqueness, nullability, enum validations and relationship (referential integrity)
-
-You can also define your own custom generic tests.
-
-
-## Macros
-
-- Macros are jinja templates.
-- There are many built-in macros in dbt, but you can also use your own macros.
-- dbt packages exist and you can use them to have more tests and macros that you can use.
-
-
-## Documentation
-
-- Documentation is kept in the repo (yay)
-- Documentation can be defined in yaml files or in standalone markdown files. For example, the landing page can be customized with an `overview.md` file.
-- Documentation can be quick-served with dbt, but ideally you should compile it and serve it with a regular web server, like Nginx.
--
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+##
\ No newline at end of file