Merged PR 5617: Orphan Models Script
# Description
This PR adds a script to look for orphan models in the DWH. The `README.md` has been expanded to explain how to use and schedule this script.
commit
717590513f
2 changed files with 181 additions and 0 deletions
31
README.md
@@ -134,6 +134,37 @@ Once you build the docs with `run_docs.sh`, you will have a bunch of files. To o

This goes beyond the scope of this project: to understand how you can serve these, refer to our [infra script repo](https://guardhog.visualstudio.com/Data/_git/data-infra-script). Specifically, the bits around the web gateway setup.

## Detecting (and dropping) orphan models in the DWH

If you remove a model from the dbt project after it has already been materialized as a table or view in the DWH, the DWH object won't go away on its own. You'll have to explicitly drop it.

To make your life easier, this repo includes a utility script for this purpose: `find_orphan_models_in_db.sh`.

You can use this script to detect and identify any orphan models. It can be run as a one-off or scheduled with Slack messaging, so you get automated alerts any time an orphan model appears.

The script is designed to be called from the same machine where you execute the regular `dbt run` calls. You can try to use it on your local machine, but there are multiple gotchas which might lead to confusion.

To use it:
- *Note that this assumes you've set up the project in the VM as described in previous sections. If you deviate in naming, paths, etc., you'll probably have to adjust some references here.*
- In the VM, copy it from the project repo into the home folder: `cp find_orphan_models_in_db.sh ~/find_orphan_models_in_db.sh` and make it executable: `chmod 700 ~/find_orphan_models_in_db.sh`.
- The script takes two positional arguments: a comma-separated list of schemas to review, and a path to dbt's `manifest.json`.
- Typically, if you call from the VM, you would do: `./find_orphan_models_in_db.sh staging,intermediate,reporting data-dwh-dbt-project/target/manifest.json`.
- There is an optional `--slack` flag that will send success/failure messages to Slack channels. The necessary configuration is the same as described in the "How to schedule" section, so if you've already set up the dbt run, test and docs commands, you don't need to take any other steps to start sending Slack messages.
- Example usage: `./find_orphan_models_in_db.sh --slack staging,intermediate,reporting data-dwh-dbt-project/target/manifest.json`.

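Conceptually, the check is a set difference: every table or view found in the listed schemas, minus every `schema.alias` declared in the manifest. The script itself loops over the DWH objects and looks each one up with `grep -Fxq`; the sketch below shows the same idea with `comm` instead. The `list_orphans` helper and the hard-coded object names are made up for illustration and are not part of the script.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical helper (not in the repo): prints every line of the first list
# that does not appear in the second one. Both arguments are
# newline-separated strings of schema.name entries.
list_orphans() {
    local db_objects="$1"
    local dbt_objects="$2"
    # comm -23 keeps only the lines unique to the first input.
    # Both inputs must be sorted, hence the sort -u on each side.
    comm -23 <(printf '%s\n' "$db_objects" | sort -u) \
             <(printf '%s\n' "$dbt_objects" | sort -u)
}

# Made-up example data, in the same schema.name format the script compares.
DB_OBJECTS=$'staging.stg_users\nstaging.stg_old_orders\nreporting.daily_sales'
DBT_OBJECTS=$'staging.stg_users\nreporting.daily_sales'

# Prints: staging.stg_old_orders
list_orphans "$DB_OBJECTS" "$DBT_OBJECTS"
```

Anything this difference returns exists in the DWH but is unknown to dbt, which is exactly what the script reports as stale.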

How to schedule:
- Simply add a cronjob in the VM with the command:

```bash
COMMAND="0 9 * * * /bin/bash /home/azureuser/find_orphan_models_in_db.sh --slack staging,intermediate,reporting /home/azureuser/data-dwh-dbt-project/target/manifest.json"
(crontab -u $USER -l; echo "$COMMAND" ) | crontab -u $USER -
```

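Note that the cron command above appends the line every time it is run, so running it twice would schedule the job twice. One possible guard is sketched here with a hypothetical `append_line_once` helper. It works on any text stream, so it can be tried out without touching the real crontab; the "other job" entry is made up for the dry run.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical helper (not in the repo): reads an existing crontab from stdin
# and prints it back, appending the given line only if an identical line is
# not already present.
append_line_once() {
    local new_line="$1"
    local current
    current="$(cat)"
    if [[ -n "$current" ]]; then
        printf '%s\n' "$current"
    fi
    # -F: fixed string, -x: whole-line match, -q: quiet.
    if ! grep -Fxq -- "$new_line" <<< "$current"; then
        printf '%s\n' "$new_line"
    fi
}

COMMAND="0 9 * * * /bin/bash /home/azureuser/find_orphan_models_in_db.sh --slack staging,intermediate,reporting /home/azureuser/data-dwh-dbt-project/target/manifest.json"

# Dry run against a made-up crontab: applying the helper twice
# still leaves a single copy of the new entry.
FAKE_CRONTAB="30 2 * * * /bin/bash /home/azureuser/some_other_job.sh"
ONCE="$(printf '%s\n' "$FAKE_CRONTAB" | append_line_once "$COMMAND")"
TWICE="$(printf '%s\n' "$ONCE" | append_line_once "$COMMAND")"
printf '%s\n' "$TWICE"
```

Assuming the user already has a crontab, the real call would look something like `crontab -u "$USER" -l | append_line_once "$COMMAND" | crontab -u "$USER" -`.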

Note some caveats:
- `sync` models are not checked.
- If, for any reason, you add tables or views that are unrelated to the dbt project to the monitored schemas, this script will flag them as orphans. Be careful: you might drop them accidentally if you don't pay attention. The simple solution is to not use dbt schemas for non-dbt purposes.

## CI

CI can be set up to review PRs and make the developer experience more solid and less error-prone.

150
find_orphan_models_in_db.sh
Normal file
@@ -0,0 +1,150 @@
#!/bin/bash
set -euo pipefail

STARTING_DIR="/home/azureuser"
cd "$STARTING_DIR"

# === CONFIGURATION ===
DBT_PROJECT="dwh_dbt"
DBT_TARGET="prd"
PROFILE_YML="$STARTING_DIR/.dbt/profiles.yml"

# === Flag defaults ===
SEND_SLACK=false

# === Parse flags ===
while [[ $# -gt 0 ]]; do
    case "$1" in
        -s|--slack)
            SEND_SLACK=true
            shift
            ;;
        -*)
            echo "❌ Unknown option: $1"
            exit 1
            ;;
        *)
            break
            ;;
    esac
done

# === Positional arguments ===
SCHEMAS="${1:?missing first argument: comma-separated list of schemas}"
MANIFEST_PATH="${2:?missing second argument: path to manifest.json}"
shift 2
IFS=',' read -r -a SCHEMA_ARRAY <<< "$SCHEMAS"

# === Tool check/install ===
install_tool_if_missing() {
    TOOL_CALL_NAME=$1
    TOOL_APT_NAME=$2
    if ! command -v "$TOOL_CALL_NAME" &>/dev/null; then
        echo "🔧 Installing missing tool: $TOOL_APT_NAME"
        sudo apt-get update -qq
        sudo apt-get install -y "$TOOL_APT_NAME"
    else
        echo "✅ $TOOL_APT_NAME is installed"
    fi
}

install_tool_if_missing jq jq
install_tool_if_missing yq yq
install_tool_if_missing psql postgresql-client

# === Slack webhook setup ===
script_dir=$(dirname "$0")
webhooks_file="slack_webhook_urls.txt"
env_file="$script_dir/$webhooks_file"

if [ -f "$env_file" ]; then
    export $(grep -v '^#' "$env_file" | xargs)
else
    echo "Error: $webhooks_file file not found in the script directory."
    exit 1
fi

# === Load DB credentials from profiles.yml ===
echo "🔐 Loading DB credentials from $PROFILE_YML..."
DB_NAME=$(yq e ".${DBT_PROJECT}.outputs.${DBT_TARGET}.dbname" "$PROFILE_YML")
DB_USER=$(yq e ".${DBT_PROJECT}.outputs.${DBT_TARGET}.user" "$PROFILE_YML")
DB_HOST=$(yq e ".${DBT_PROJECT}.outputs.${DBT_TARGET}.host" "$PROFILE_YML")
DB_PORT=$(yq e ".${DBT_PROJECT}.outputs.${DBT_TARGET}.port" "$PROFILE_YML")
export PGPASSWORD=$(yq e ".${DBT_PROJECT}.outputs.${DBT_TARGET}.pass" "$PROFILE_YML")

# === Get list of tables/views from Postgres ===
echo "🗃️ Reading current tables/views from PostgreSQL..."

POSTGRES_OBJECTS=()
for SCHEMA in "${SCHEMA_ARRAY[@]}"; do
    echo "🔎 Scanning schema: $SCHEMA"
    TABLES=$(psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" -Atc "
        SELECT table_schema || '.' || table_name
        FROM information_schema.tables
        WHERE table_schema = '$SCHEMA'
        AND table_type IN ('BASE TABLE', 'VIEW');
    ")
    while IFS= read -r tbl; do
        [[ -n "$tbl" ]] && POSTGRES_OBJECTS+=("${tbl,,}")
    done <<< "$TABLES"
done

POSTGRES_OBJECTS=($(printf "%s\n" "${POSTGRES_OBJECTS[@]}" | sort -u))

# === Parse manifest.json for dbt model output names ===
echo "📦 Extracting model output names from dbt manifest..."

DBT_OBJECTS=()
DBT_ENTRIES=$(jq -r '
    .nodes | to_entries[] |
    select(.value.resource_type == "model" or .value.resource_type == "seed") |
    .value.schema + "." + .value.alias
' "$MANIFEST_PATH")

while IFS= read -r entry; do
    [[ -n "$entry" ]] && DBT_OBJECTS+=("${entry,,}")
done <<< "$DBT_ENTRIES"

DBT_OBJECTS=($(printf "%s\n" "${DBT_OBJECTS[@]}" | sort -u))

# === Compare ===
echo "📊 Comparing DBT models vs Postgres state..."

RELEVANT_MODELS=()
STALE_MODELS=()

for pg_obj in "${POSTGRES_OBJECTS[@]}"; do
    # Feed grep via process substitution: with `pipefail`, `printf | grep -q`
    # can report failure when grep exits early on a match and printf gets SIGPIPE.
    if grep -Fxq -- "$pg_obj" < <(printf "%s\n" "${DBT_OBJECTS[@]}"); then
        RELEVANT_MODELS+=("$pg_obj")
    else
        STALE_MODELS+=("$pg_obj")
    fi
done

# === Output ===
echo ""
echo "✅ Relevant models (in both DB and DBT):"
printf "%s\n" "${RELEVANT_MODELS[@]}" | sort

echo ""
echo "⚠️ Stale models (in DB but NOT in DBT):"
printf "%s\n" "${STALE_MODELS[@]}" | sort

# === Format stale models for Slack ===
if [ "$SEND_SLACK" = true ]; then
    echo "✅ Sending slack message with results."
    if [ ${#STALE_MODELS[@]} -eq 0 ]; then
        SLACK_MSG=":white_check_mark::white_check_mark::white_check_mark: dbt models reviewed. No stale models found in the database! :white_check_mark::white_check_mark::white_check_mark:"
        curl -X POST -H 'Content-type: application/json' \
            --data "{\"text\":\"$SLACK_MSG\"}" \
            "$SLACK_RECEIPT_WEBHOOK_URL"
    else
        SLACK_MSG=":rotating_light::rotating_light::rotating_light: Stale models detected in Postgres (not in dbt manifest): :rotating_light::rotating_light::rotating_light:\n"
        for model in "${STALE_MODELS[@]}"; do
            SLACK_MSG+="- \`$model\`\n"
        done
        curl -X POST -H 'Content-type: application/json' \
            --data "{\"text\":\"$SLACK_MSG\"}" \
            "$SLACK_ALERT_WEBHOOK_URL"
    fi
fi