Pablo Martin a256b48b01 pages
2025-07-11 16:15:17 +02:00


# DE survival without Pablo
Reference page for when Pablo is not around and something goes wrong.
# General stuff
- Remember, the deployment instructions for the Data infra [live here](https://guardhog.visualstudio.com/Data/_git/data-infra-script?path=/platform-overview.md). When in doubt about how something is set up, or how it could be re-deployed, check there. I'll skip a lot of details on how things are built here because you can find them there.
- If you feel unsure about touching stuff you don't understand in Azure, ask Ben Robinson for help.
# DWH
| *when…* | *then…* |
| --- | --- |
| DWH is not responsive | - Troubleshoot to find out whether it's a network issue (you can't reach the DWH) or whether the DWH itself is turned off or locked.<br>- If it's a VPN issue, go to the VPN section.<br>- If it's a DWH issue, try rebooting it from the Azure portal.<br>- If that doesn't do the trick, try restoring a backup. The DWH has 1 backup per day, going back up to 7 days. |
| DWH runs out of space | - Visit the Azure Portal and increase the disk size of the database. |
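
The "network issue or DWH issue?" triage in the first row can be sketched as two TCP probes plus a small decision function. This is only a sketch: it assumes `nc` (netcat) is installed on your laptop, and the hosts and port in the example comment are placeholders that do not come from this page.

```shell
# Return 0 if a TCP connection to host:port succeeds within 5 seconds.
# Assumes netcat (`nc`) is available; host and port are supplied by the caller.
port_open() {
    nc -z -w 5 "$1" "$2" >/dev/null 2>&1
}

# Map two probe results (0 = ok, anything else = failed) to the next step
# from the table above. First argument: VPN probe, second: DWH probe.
next_step() {
    vpn_ok="$1"
    dwh_ok="$2"
    if [ "$vpn_ok" -ne 0 ]; then
        echo "network: go to the VPN section"
    elif [ "$dwh_ok" -ne 0 ]; then
        echo "dwh: reboot from the Azure portal, then consider restoring a backup"
    else
        echo "ok: DWH is reachable"
    fi
}

# Example with placeholder values (replace them with the real ones):
#   port_open "<jumphost-private-ip>" 22; vpn=$?
#   port_open "<dwh-host>" "<dwh-port>"; dwh=$?
#   next_step "$vpn" "$dwh"
```

Probing something that is only reachable through the VPN first is what separates the two cases: if that probe fails too, the DWH itself is probably fine.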
# VPN/Jumphost machine
| *when…* | *then…* |
| --- | --- |
| VPN becomes unresponsive | - Try rebooting the jumphost machine. WireGuard is set up as a `systemd` service, so it should start again on boot.<br>- If that doesn't work, you'll have to check the logs to understand what's wrong. SSH into the machine and run `sudo journalctl -u wg-quick@wg0.service` to do so.<br>- You can run `sudo systemctl status wg-quick@wg0.service` to simply check whether the service is running. If it's running fine, you should see a green `active`. |
| You need to give someone new/another device access | - Create a new key pair.<br>- Use the existing key configurations or the infra script documentation to understand how to add it on both the VPN server and the client device.<br>- Do not try to share your existing key with more devices: each key pair should only be active on one device at a time. |
| You lock yourself out of the VPN because it stopped working and you can't SSH into the jumphost | Don't despair, this is still solvable. Add an exception rule to the Azure Network Security Group (NSG) that the jumphost is using to **TEMPORARILY** allow yourself to SSH on port 22 on the public IP.<br>**REMEMBER TO REMOVE THE EXCEPTION ONCE YOU ARE DONE!!!!!!**<br>You can check the details on how to do this in the `data-infra-script` repository. |
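
For the "create a new key pair" step, a minimal sketch using the standard `wg` tool from wireguard-tools could look like this. The function names, file names and the client IP are made up for illustration. The infra script documentation remains the reference for the values this setup actually uses.

```shell
# Generate a key pair for ONE device. Writes <name>.key (private) and
# <name>.pub (public) in the current directory. Requires wireguard-tools.
new_keypair() {
    name="$1"
    # Restrictive umask so the private key is not world-readable.
    ( umask 077 && wg genkey > "$name.key" )
    wg pubkey < "$name.key" > "$name.pub"
}

# Print the [Peer] stanza to add to the server-side WireGuard config.
# The client IP is an assumption: pick a free address in the VPN's range.
peer_stanza() {
    public_key="$1"
    client_ip="$2"
    printf '[Peer]\nPublicKey = %s\nAllowedIPs = %s/32\n' "$public_key" "$client_ip"
}

# Example (hypothetical device name and IP):
#   new_keypair alice-laptop
#   peer_stanza "$(cat alice-laptop.pub)" "10.0.0.9"
```

With `wg-quick@wg0`, the server config is normally `/etc/wireguard/wg0.conf`. After adding the stanza, restarting the service with `sudo systemctl restart wg-quick@wg0.service` should load the new peer, but double-check against the existing entries first.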
# dbt
| *when…* | *then…* |
| --- | --- |
| You need to execute `dbt` on demand | - SSH into the `airbyte-prd` machine.<br>- Execute the following command: `/bin/bash /home/azureuser/run_dbt.sh`<br>- You can check the execution logs in `/home/azureuser/dbt_run.log` |
| The `dbt run` is failing and you don't understand why | - SSH into the `airbyte-prd` machine.<br>- Check the execution logs in `/home/azureuser/dbt_run.log` to find the errors.<br>- Pull the thread from there. |
| You need to run a full refresh | - I would suggest making a sneaky `dbt run` from your own laptop, pointing to `prd` and with the `--full-refresh` flag active. Be careful since this can trigger very heavy runs. |
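
Two of the rows above can be sketched as small helpers: one to pull the most recent error lines out of the dbt log, and one to build the full-refresh command. Treat both as assumptions rather than the official procedure: matching on the word "error" is just a heuristic, and the target name `prd` assumes your local dbt profile defines a target with that name.

```shell
# Show the last N lines (default: 20) of a log file that mention "error",
# case-insensitively and with line numbers. This is a heuristic filter,
# not a guarantee about how dbt formats its logs.
recent_errors() {
    log_file="$1"
    max_lines="${2:-20}"
    grep -n -i 'error' "$log_file" | tail -n "$max_lines"
}

# Print (without running it) a full-refresh command, optionally limited to
# a single model. Assumes a dbt target named `prd` exists in your profile.
full_refresh_cmd() {
    model="${1:-}"
    if [ -n "$model" ]; then
        echo "dbt run --full-refresh --target prd --select $model"
    else
        echo "dbt run --full-refresh --target prd"
    fi
}

# Example:
#   recent_errors /home/azureuser/dbt_run.log
#   full_refresh_cmd some_model    # hypothetical model name
```

Printing the command instead of executing it is deliberate: full refreshes can be very heavy, so it's worth reading the command once before pasting it. Limiting the run with `--select` keeps the refresh to the model you actually need.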
# Airbyte
| *when…* | *then…* |
| --- | --- |
| You need to add a new table from Core into the DWH | - Pick the right existing connection depending on the source schema and incrementality (full refresh vs incremental).<br>- Add the table stream. |
| A sync job is failing | - Visit the Airbyte UI.<br>- Find the failed job and read the logs.<br>- Pull the thread from there. |
| Upstream changes in a data model are creating conflicts | - This is a tricky area because there are many ways to handle it.<br>- One option is to convince whoever owns the upstream data model to go back to the previous state. If so, the issue fixes itself without changing anything in Airbyte.<br>- If this isn't possible, you might decide to reset the stream and sync everything from the source again. Bear in mind this can't be easily reverted and might break downstream `dbt` models. |
| Data between Core and Airbyte has gone out of sync because of some mistake, for instance violated `UpdatedDate` fields in Core | - Visit the affected connection and streams and reset them. |
# `airbyte-prd` machine
This machine is where Airbyte, dbt and `xexe` run.
- Airbyte is deployed as a series of docker containers orchestrated with docker compose. The docker compose file can be found in `/home/azureuser/airbyte/`
- Both dbt and `xexe` run on a schedule. Their execution is triggered by `cron` (the commands live in `azureuser`'s crontab) and the scripts they run are `/home/azureuser/run_dbt.sh` and `/home/azureuser/run_xexe.sh`
This is typically uber-stable and nothing goes wrong.
If something smells fishy, a simple reboot should do the trick, as all services will start working on boot.
If any of the services stop working, here's where you can go to investigate:
- Airbyte → See the logs of the containers. If the UI still works, you can also read the logs there.
- dbt → Check the file `/home/azureuser/dbt_run.log`
- `xexe` → Check the file `/home/azureuser/xexe_run.log`
- `anaxi` → Check the file `/home/azureuser/anaxi_run.log`
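
The log locations above can be wrapped in a tiny helper so you don't have to remember the paths. This is a sketch with made-up function names. The commented commands at the end assume you are logged in as `azureuser` and that the machine uses the `docker compose` plugin (older setups use `docker-compose` instead).

```shell
# Map a service name to its log file, using the paths listed above.
log_path_for() {
    case "$1" in
        dbt)   echo "/home/azureuser/dbt_run.log" ;;
        xexe)  echo "/home/azureuser/xexe_run.log" ;;
        anaxi) echo "/home/azureuser/anaxi_run.log" ;;
        *)     echo "unknown service: $1" >&2; return 1 ;;
    esac
}

# Show the tail of a service's log (default: last 50 lines).
show_log() {
    log_file="$(log_path_for "$1")" || return 1
    tail -n "${2:-50}" "$log_file"
}

# Other checks worth running on airbyte-prd:
#   crontab -l                                                      # scheduled commands
#   (cd /home/azureuser/airbyte && docker compose ps)               # container status
#   (cd /home/azureuser/airbyte && docker compose logs --tail 100)  # recent Airbyte logs
```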
# Power BI Gateway
The PBI Gateway software is running on a Windows VM named `pbi-gateway-prd`.
The software is just installed there. I have no clue how anything could go wrong: the service has been working like a charm since day one.