Compare commits

...

10 commits

Author SHA1 Message Date
Pablo Martin
cc644be0bc Few small upgrades 2023-11-28 18:06:41 +01:00
Pablo Martin
fb4534db58 Alternative to AWS download 2023-11-27 17:58:10 +01:00
Pablo Martin
3fce12819b Thingies 2023-11-03 10:26:11 +01:00
Pablo Martin
045ce1ec45 Fix typo in col name, move test to generic test 2023-11-02 18:24:30 +01:00
Pablo Martin
c2f2739e7e Thingies 2023-11-02 18:13:56 +01:00
Pablo Martin
bccbff7205 Thingies 2023-11-02 17:45:20 +01:00
Pablo Martin
1e53b3895c Thingies 2023-11-02 17:05:44 +01:00
Pablo Martin
7480222cc7 Thingies 2023-10-31 17:22:51 +01:00
Pablo Martin
2b6c385b8c Add csv 2023-10-31 17:00:58 +01:00
Pablo Martin
f85adbde93 More thingies 2023-10-30 18:59:39 +01:00
21 changed files with 490 additions and 7 deletions

@ -70,6 +70,12 @@ After, you will have to download some CSV files with the data to populate the da
aws s3 cp s3://dbtlearn/listings.csv listings.csv
aws s3 cp s3://dbtlearn/reviews.csv reviews.csv
aws s3 cp s3://dbtlearn/hosts.csv hosts.csv
# or, to avoid using aws
wget http://dbtlearn.s3.amazonaws.com/listings.csv
wget http://dbtlearn.s3.amazonaws.com/reviews.csv
wget http://dbtlearn.s3.amazonaws.com/hosts.csv
```
How to put the data into the databases is up to you. I've done it successfully using the import functionality of DBeaver.

@ -4,6 +4,8 @@ This is the dbt project for the course.
## Set up
Make a venv and install the requirements listed in `requirements.txt`.
You need to place a profile for the local postgres instance in `~/.dbt/profiles.yml`. See below a sample config that should be a good starting point if you follow the instructions in the `database` dir of this project.
```yaml
@ -25,6 +27,8 @@ dbtlearn:
Once you have set this up and the database as well, you can run `dbt debug` to ensure everything is set up correctly and dbt can reach the database.
To install the required dbt packages, run `dbt deps`.
You should also delete the lines under the `dbtlearn` key in the `dbt_project.yml` file.
Also delete the contents of the `models` folder.

@ -30,5 +30,7 @@ clean-targets: # directories to be removed by `dbt clean`
models:
  dbtlearn:
    +materialized: view # Default way to materialize is view
    src:
      +materialized: ephemeral
    dim:
      +materialized: table

@ -0,0 +1,10 @@
{% macro no_nulls_in_columns(model) %}
SELECT *
FROM
{{ model }}
WHERE
{% for col in adapter.get_columns_in_relation(model) -%}
{{col.column}} IS NULL OR
{% endfor %}
FALSE
{% endmacro %}

@ -0,0 +1,8 @@
{% test positive_value(model, column_name) %}
SELECT
*
FROM
{{ model }}
WHERE
{{ column_name }} < 1
{% endtest %}

@ -1,3 +1,9 @@
{{
config(
materialized = 'view'
)
}}
WITH src_hosts AS(
SELECT *
FROM {{ ref('src_hosts') }}

@ -1,3 +1,9 @@
{{
config(
materialized = 'view'
)
}}
WITH src_listings AS (
SELECT *
FROM
@ -10,7 +16,7 @@ SELECT
CASE
WHEN minimum_nights = 0 THEN 1
ELSE minimum_nights
-END AS mininum_nights,
+END AS minimum_nights,
host_id,
REPLACE(price_str,'$','')::money AS price,
created_at,

@ -13,7 +13,7 @@ SELECT
listings.listing_id,
listings.listing_name,
listings.room_type,
-listings.mininum_nights,
+listings.minimum_nights,
listings.price,
listings.host_id,
hosts.host_name,

@ -9,7 +9,9 @@ WITH src_reviews AS (
FROM
{{ ref('src_reviews') }}
)
-SELECT *
+SELECT
+{{ dbt_utils.surrogate_key(['listing_id', 'review_date', 'reviewer_name', 'review_text']) }} as review_id,
+*
FROM
src_reviews
WHERE

@ -0,0 +1,27 @@
{{
config(
materialized = 'table'
)
}}
WITH fact_reviews AS (
SELECT *
FROM
{{ ref('fact_reviews') }}
),
full_moon_dates AS (
SELECT *
FROM
{{ ref('seed_full_moon_dates')}}
)
SELECT
fr.*,
CASE
WHEN fm.full_moon_date IS NULL THEN 'not full moon'
ELSE 'full moon'
END AS is_full_moon
FROM
fact_reviews fr
LEFT JOIN full_moon_dates fm
ON (fr.review_date::date) = (fm.full_moon_date + interval '1' day)

@ -0,0 +1,31 @@
version: 2
models:
  - name: dim_listings_cleansed
    columns:
      - name: listing_id
        tests:
          - unique
          - not_null
      - name: host_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_hosts_cleansed')
              field: host_id
      - name: room_type
        tests:
          - accepted_values:
              values: [
                'Entire home/apt',
                'Private room',
                'Shared room',
                'Hotel room'
              ]
      - name: minimum_nights
        tests:
          - positive_value

@ -0,0 +1,12 @@
version: 2
sources:
  - name: airbnb
    schema: raw
    tables:
      - name: listings
        identifier: raw_listings
      - name: hosts
        identifier: raw_hosts
      - name: reviews
        identifier: raw_reviews

@ -1,6 +1,6 @@
WITH raw_hosts AS (
SELECT *
-FROM raw.raw_hosts
+FROM {{ source ('airbnb', 'hosts')}}
)
SELECT
id as host_id,

@ -1,6 +1,6 @@
WITH raw_listings AS (
SELECT *
-FROM raw.raw_listings
+FROM {{ source ('airbnb', 'listings')}}
)
SELECT
id AS listing_id,

@ -1,10 +1,11 @@
WITH raw_reviews AS (
SELECT *
-FROM raw.raw_reviews
+FROM {{ source ('airbnb', 'reviews')}}
)
SELECT
listing_id,
date AS review_date,
+reviewer_name AS reviewer_name,
comments AS review_text,
sentiment AS review_sentiment
FROM

@ -0,0 +1,3 @@
packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.0

@ -0,0 +1,273 @@
full_moon_date
2009-01-11
2009-02-09
2009-03-11
2009-04-09
2009-05-09
2009-06-07
2009-07-07
2009-08-06
2009-09-04
2009-10-04
2009-11-02
2009-12-02
2009-12-31
2010-01-30
2010-02-28
2010-03-30
2010-04-28
2010-05-28
2010-06-26
2010-07-26
2010-08-24
2010-09-23
2010-10-23
2010-11-21
2010-12-21
2011-01-19
2011-02-18
2011-03-19
2011-04-18
2011-05-17
2011-06-15
2011-07-15
2011-08-13
2011-09-12
2011-10-12
2011-11-10
2011-12-10
2012-01-09
2012-02-07
2012-03-08
2012-04-06
2012-05-06
2012-06-04
2012-07-03
2012-08-02
2012-08-31
2012-09-30
2012-10-29
2012-11-28
2012-12-28
2013-01-27
2013-02-25
2013-03-27
2013-04-25
2013-05-25
2013-06-23
2013-07-22
2013-08-21
2013-09-19
2013-10-19
2013-11-17
2013-12-17
2014-01-16
2014-02-15
2014-03-16
2014-04-15
2014-05-14
2014-06-13
2014-07-12
2014-08-10
2014-09-09
2014-10-08
2014-11-06
2014-12-06
2015-01-05
2015-02-04
2015-03-05
2015-04-04
2015-05-04
2015-06-02
2015-07-02
2015-07-31
2015-08-29
2015-09-28
2015-10-27
2015-11-25
2015-12-25
2016-01-24
2016-02-22
2016-03-23
2016-04-22
2016-05-21
2016-06-20
2016-07-20
2016-08-18
2016-09-16
2016-10-16
2016-11-14
2016-12-14
2017-01-12
2017-02-11
2017-03-12
2017-04-11
2017-05-10
2017-06-09
2017-07-09
2017-08-07
2017-09-06
2017-10-05
2017-11-04
2017-12-03
2018-01-02
2018-01-31
2018-03-02
2018-03-31
2018-04-30
2018-05-29
2018-06-28
2018-07-27
2018-08-26
2018-09-25
2018-10-24
2018-11-23
2018-12-22
2019-01-21
2019-02-19
2019-03-21
2019-04-19
2019-05-18
2019-06-17
2019-07-16
2019-08-15
2019-09-14
2019-10-13
2019-11-12
2019-12-12
2020-01-10
2020-02-09
2020-03-09
2020-04-08
2020-05-07
2020-06-05
2020-07-05
2020-08-03
2020-09-02
2020-10-01
2020-10-31
2020-11-30
2020-12-30
2021-01-28
2021-02-27
2021-03-28
2021-04-27
2021-05-26
2021-06-24
2021-07-24
2021-08-22
2021-09-21
2021-10-20
2021-11-19
2021-12-19
2022-01-18
2022-02-16
2022-03-18
2022-04-16
2022-05-16
2022-06-14
2022-07-13
2022-08-12
2022-09-10
2022-10-09
2022-11-08
2022-12-08
2023-01-07
2023-02-05
2023-03-07
2023-04-06
2023-05-05
2023-06-04
2023-07-03
2023-08-01
2023-08-31
2023-09-29
2023-10-28
2023-11-27
2023-12-27
2024-01-25
2024-02-24
2024-03-25
2024-04-24
2024-05-23
2024-06-22
2024-07-21
2024-08-19
2024-09-18
2024-10-17
2024-11-15
2024-12-15
2025-01-13
2025-02-12
2025-03-14
2025-04-13
2025-05-12
2025-06-11
2025-07-10
2025-08-09
2025-09-07
2025-10-07
2025-11-05
2025-12-05
2026-01-03
2026-02-01
2026-03-03
2026-04-02
2026-05-01
2026-05-31
2026-06-30
2026-07-29
2026-08-28
2026-09-26
2026-10-26
2026-11-24
2026-12-24
2027-01-22
2027-02-21
2027-03-22
2027-04-21
2027-05-20
2027-06-19
2027-07-18
2027-08-17
2027-09-16
2027-10-15
2027-11-14
2027-12-13
2028-01-12
2028-02-10
2028-03-11
2028-04-09
2028-05-08
2028-06-07
2028-07-06
2028-08-05
2028-09-04
2028-10-03
2028-11-02
2028-12-02
2028-12-31
2029-01-30
2029-02-28
2029-03-30
2029-04-28
2029-05-27
2029-06-26
2029-07-25
2029-08-24
2029-09-22
2029-10-22
2029-11-21
2029-12-20
2030-01-19
2030-02-18
2030-03-19
2030-04-18
2030-05-17
2030-06-15
2030-07-15
2030-08-13
2030-09-11
2030-10-11
2030-11-10
2030-12-09

@ -0,0 +1,17 @@
{% snapshot scd_raw_listings %}
{{
config(
target_schema = 'dev',
unique_key = 'id',
strategy = 'timestamp',
updated_at = 'updated_at',
invalidate_hard_deletes = True
)
}}
SELECT *
FROM
{{ source('airbnb', 'listings')}}
{% endsnapshot %}

@ -0,0 +1,9 @@
SELECT *
FROM
{{ ref('fact_reviews') }} fr
LEFT JOIN
{{ ref('dim_listings_cleansed') }} dl
ON
fr.listing_id = dl.listing_id
WHERE
fr.review_date < dl.created_at

@ -0,0 +1 @@
{{ no_nulls_in_columns(ref('dim_listings_cleansed')) }}

@ -42,4 +42,69 @@ WHERE
Bear in mind that the strategy for determining what should be loaded is up to the engineer. Any SQL can be placed within the `if is_incremental()` block. In the example above, we have a date field that easily signals the most recent date the table has currently seen.
## Sources and seeds
Seeds are local files that you upload to a DWH from dbt. You place them as CSVs in the `seeds` folder and load them with `dbt seed`.
Sources are an abstraction layer on top of the input tables. They are not strictly necessary, but they help structure the project. To create sources, add a `sources.yml` file to the `models` dir and declare in it the raw tables that already exist in the DWH, giving each one a source name. You can then reference a source from any model like this:
```sql
{{ source('domain_name', 'source_name')}}
```
Sources can define _freshness_ constraints that will provide warnings or errors when there is a significant delay.
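As a rough illustration of what a freshness constraint checks, the sketch below compares the most recent load timestamp against a warning and an error threshold. It is plain Python and not dbt's implementation; `check_freshness`, `warn_after` and `error_after` are names chosen for this example, and the timestamps are made up.

```python
from datetime import datetime, timedelta


def check_freshness(latest_loaded_at, now, warn_after, error_after):
    """Classify a source as 'pass', 'warn' or 'error' based on its age."""
    age = now - latest_loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"


now = datetime(2023, 11, 28, 18, 0)
status = check_freshness(
    latest_loaded_at=datetime(2023, 11, 28, 15, 0),  # last row arrived 3 hours ago
    now=now,
    warn_after=timedelta(hours=1),    # hypothetical threshold
    error_after=timedelta(hours=24),  # hypothetical threshold
)
print(status)  # prints: warn
```

The error threshold is checked first so that a very stale source reports the more severe status.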
## Snapshots
Snapshots are a way to build SCD2s. There are two strategies to get this done:
- Timestamp: all records have a unique key and an `updated_at` field. dbt creates a new record in the SCD2 whenever the `updated_at` field increases.
- Check: dbt will monitor a set of columns and consider any changes in any of the columns as a new version of the record.
Snapshots get defined with a SQL file in the `snapshots` folder using the `snapshot` macro block.
Once snapshots are defined, snapshotting can be triggered at any time by running `dbt snapshot`. dbt will create the SCD tables in the defined schema and maintain the `dbt_valid_from` and `dbt_valid_to` columns whenever changes are detected.
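To make the timestamp strategy more concrete, here is a minimal Python sketch of the bookkeeping it implies. It is an illustration under assumptions, not how dbt works internally: `apply_snapshot` is a hypothetical helper, rows are plain dicts, and the validity columns are called `valid_from` and `valid_to` for brevity. The last loop mimics what the `invalidate_hard_deletes` option is meant to achieve.

```python
from datetime import datetime


def apply_snapshot(history, source_rows, snapshot_time):
    """Timestamp strategy: open a new version when `updated_at` increases."""
    # Versions that are still open (no end date) at the start of this run
    current = {r["id"]: r for r in history if r["valid_to"] is None}
    for row in source_rows:
        live = current.get(row["id"])
        if live is not None and row["updated_at"] <= live["updated_at"]:
            continue  # nothing newer than the version we already hold
        if live is not None:
            live["valid_to"] = row["updated_at"]  # close the old version
        history.append({**row, "valid_from": row["updated_at"], "valid_to": None})
    # Hard deletes: close versions whose key vanished from the source
    seen = {row["id"] for row in source_rows}
    for key, live in current.items():
        if key not in seen:
            live["valid_to"] = snapshot_time
    return history


history = []
day1, day2, day3 = datetime(2023, 11, 1), datetime(2023, 11, 2), datetime(2023, 11, 3)
apply_snapshot(history, [{"id": 1, "minimum_nights": 2, "updated_at": day1}], day1)
apply_snapshot(history, [{"id": 1, "minimum_nights": 30, "updated_at": day2}], day2)
apply_snapshot(history, [], day3)  # the listing disappeared from the source
for version in history:
    print(version["minimum_nights"], version["valid_from"], version["valid_to"])
```

After the three runs the history holds two versions of listing 1: the first closed when the update arrived, the second closed when the row was deleted.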
## Tests
There are two kinds of tests:
- Singular tests: you write any `SELECT` statement you want. If it returns any rows, the test fails; if it returns no rows, the test passes.
- Generic tests: parametrized tests that you attach to models and columns in a yaml file. dbt ships four built-in ones: `unique`, `not_null`, `accepted_values` and `relationships` (referential integrity).
You can also define your own custom generic tests.
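The "rows returned means failure" rule can be tried out without a dbt project. The sketch below uses Python's built-in `sqlite3` as a stand-in for the DWH; `run_test` is a hypothetical helper and the three rows are invented sample data. The query has the same shape as the `positive_value` test in this change, with the model and column already filled in.

```python
import sqlite3


def run_test(conn, select_sql):
    """A test passes when its SELECT returns no rows."""
    failing_rows = conn.execute(select_sql).fetchall()
    return ("pass" if not failing_rows else "fail", failing_rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_listings_cleansed (listing_id INTEGER, minimum_nights INTEGER)"
)
conn.executemany(
    "INSERT INTO dim_listings_cleansed VALUES (?, ?)",
    [(1, 2), (2, 0), (3, 30)],  # listing 2 violates the rule on purpose
)

status, rows = run_test(
    conn, "SELECT * FROM dim_listings_cleansed WHERE minimum_nights < 1"
)
print(status, rows)  # prints: fail [(2, 0)]
```

Returning the offending rows along with the status is a convenience of this sketch; it makes it easy to see why a test failed.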
## Macros
- Macros are jinja templates.
- There are many built-in macros in dbt, but you can also use your own macros.
- dbt packages provide additional tests and macros. List them in `packages.yml` and install them with `dbt deps`.
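Since a macro is just a template that expands into SQL, it can help to see the expansion spelled out. The Python sketch below builds roughly the text that the `no_nulls_in_columns` macro in this change would produce for two columns. It only imitates the macro's for-loop: `no_nulls_in_columns_sql` is a hypothetical function, the column list is passed in by hand (in dbt it comes from `adapter.get_columns_in_relation`), and the relation name is shown unqualified.

```python
def no_nulls_in_columns_sql(relation, columns):
    """Build the query the macro's for-loop would expand to for these columns."""
    # One "<col> IS NULL OR" line per column, as in the Jinja loop
    conditions = "".join(f"    {col} IS NULL OR\n" for col in columns)
    # The trailing FALSE terminates the chain of ORs so the SQL stays valid
    return f"SELECT *\nFROM\n    {relation}\nWHERE\n{conditions}    FALSE"


sql = no_nulls_in_columns_sql("dim_listings_cleansed", ["listing_id", "host_id"])
print(sql)
```

The closing `FALSE` is what lets the loop emit `OR` after every column without special-casing the last one.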
## Documentation
- Documentation is kept in the repo (yay)
- Documentation can be defined in yaml files or in standalone markdown files. For example, the landing page can be customized with an `overview.md` file.
- Documentation can be quick-served with dbt (`dbt docs serve`), but ideally you should generate it (`dbt docs generate`) and serve it with a regular web server, like Nginx.