2025-07-11 16:15:17 +02:00
# DE survival without Pablo
Reference page for when Pablo is not around and something goes wrong.
# General stuff
- Remember, the deployment instructions for the Data infra [live here](https://guardhog.visualstudio.com/Data/_git/data-infra-script?path=/platform-overview.md). When in doubt about how something is set up, or how it could be re-deployed, check there. I'll skip a lot of details on how things are built here because you can find them there.
- If you feel unsure about touching stuff you don't understand in Azure, ask Ben Robinson for help.
# DWH
| *when…* | *then…* |
| --- | --- |
| DWH is not responsive | - Troubleshoot to find out whether it's a network issue (you can't reach the DWH) or the DWH itself is turned off or locked.<br>- If it's a VPN issue, go to the VPN/Jumphost machine section.<br>- If it's a DWH issue, try rebooting it from the Azure portal.<br>- If that doesn't do the trick, try restoring a backup. The DWH has 1 backup per day, going up to 7 days into the past. |
| DWH runs out of space | - Visit the Azure Portal and increase the disk size of the database. |
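
To tell a network problem apart from a DWH problem, a quick TCP check from your laptop (with the VPN on) can help. This is only a minimal sketch: the host name `dwh.example.internal` and the port `5432` are placeholders I made up, so take the real values from the infra script documentation.

```shell
# Sketch only: host and port used in the example call are hypothetical.

# Return 0 if a TCP connection to host:port opens within 3 seconds.
# Uses bash's /dev/tcp feature, so bash must be installed.
check_tcp() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Print a rough diagnosis based on whether the port is reachable.
diagnose_dwh() {
  local host="$1" port="$2"
  if check_tcp "$host" "$port"; then
    echo "port reachable: the network is fine, look at the DWH itself"
  else
    echo "port unreachable: check the VPN first, then the DWH state in the Azure portal"
  fi
}

# Example call (placeholder values):
# diagnose_dwh dwh.example.internal 5432
```

An unreachable port does not prove the DWH is down, it only tells you where to look first.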
# VPN/Jumphost machine
| *when…* | *then…* |
| --- | --- |
| VPN becomes unresponsive | - Try rebooting the jumphost machine. WireGuard is set up as a `systemd` service, so it should start again on boot.<br>- If that doesn't work, you'll have to check the logs to understand what's wrong. SSH into the machine and run `sudo journalctl -u wg-quick@wg0.service` to do so.<br>- You can run `sudo systemctl status wg-quick@wg0.service` to simply check whether the service is running. If it's running fine, you should see a green `active`. |
| You need to give someone new/another device access | - Create a new key pair.<br>- Use the existing key configurations or the infra script documentation to understand how to add it on both the VPN server and the client device.<br>- Do not try to share your existing key with more devices: each key pair should only be active on one device at a time. |
| You lock yourself out of the VPN because it stopped working and you can't SSH into the jumphost | Don't despair, this is still solvable. Add an exception rule in the Azure Network Security Group (NSG) that the jumphost is using to **TEMPORARILY** allow yourself to SSH on port 22 on the public IP.<br>**REMEMBER TO REMOVE THE EXCEPTION ONCE YOU ARE DONE!!!!!!**<br>You can check details on how to do this in the `data-infra-script` repository. |
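
The key pair and NSG rows above can be sketched as a few shell helpers. Treat this as a rough sketch, not the official procedure: the resource group, NSG and rule names are hypothetical, priority `100` assumes no existing rule already uses it, and the `data-infra-script` repository remains the reference. It assumes `wg` and the Azure CLI (`az`) are installed and that you are logged in.

```shell
# Sketch only. RG, NSG and RULE are hypothetical names: look up the real
# ones in the Azure portal or the data-infra-script repository.
RG="my-resource-group"
NSG="my-jumphost-nsg"
RULE="tmp-allow-ssh"

# Generate a fresh WireGuard key pair for one new device.
# Writes <name>.key (private, owner-only) and <name>.pub, prints the public key.
new_wg_keypair() {
  local name="$1"
  (umask 077; wg genkey > "${name}.key")
  wg pubkey < "${name}.key" | tee "${name}.pub"
}

# TEMPORARILY allow SSH (port 22) from a single source IP.
open_ssh_exception() {
  local my_ip="$1"
  az network nsg rule create \
    --resource-group "$RG" --nsg-name "$NSG" --name "$RULE" \
    --priority 100 --direction Inbound --access Allow --protocol Tcp \
    --source-address-prefixes "${my_ip}/32" \
    --destination-port-ranges 22
}

# Remove the exception again. Do not skip this step.
close_ssh_exception() {
  az network nsg rule delete \
    --resource-group "$RG" --nsg-name "$NSG" --name "$RULE"
}
```

Restricting the rule to your own IP keeps the exposure small while the rule exists. A typical session would be `open_ssh_exception <your public IP>`, fix WireGuard over SSH, then `close_ssh_exception`.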
# dbt
| *when…* | *then…* |
| --- | --- |
| You need to execute `dbt` on demand | - SSH into the `airbyte-prd` machine.<br>- Execute the following command: `/bin/bash /home/azureuser/run_dbt.sh`<br>- You can check the execution logs in `/home/azureuser/dbt_run.log` |
| The `dbt run` is failing and you don't understand why | - SSH into the `airbyte-prd` machine.<br>- Check the execution logs in `/home/azureuser/dbt_run.log` to find the errors.<br>- Pull the thread from there. |
| You need to run a full refresh | - I would suggest making a sneaky `dbt run` from your own laptop, pointing to `prd` and with the `--full-refresh` flag active. Be careful since this can trigger very heavy runs. |
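
The rows above can be sketched as shell helpers. Several things here are my assumptions: that `airbyte-prd` resolves as an SSH host for `azureuser`, that errors in the log contain the words "error" or "fail", and that the production target in your local dbt profile is called `prd`.

```shell
# Sketch only: the SSH host alias, the error patterns and the target name
# are assumptions, adjust them to your setup.

# Trigger the scheduled dbt script by hand on the production machine.
run_dbt_now() {
  ssh azureuser@airbyte-prd '/bin/bash /home/azureuser/run_dbt.sh'
}

# Print the last error-looking lines of a dbt log file (default: 20 lines).
dbt_log_errors() {
  local logfile="$1" max="${2:-20}"
  grep -n -i -E 'error|fail' "$logfile" | tail -n "$max"
}

# Full refresh limited to the selected model(s), to keep the run light.
full_refresh_model() {
  dbt run --target prd --full-refresh --select "$1"
}

# On airbyte-prd itself:
# dbt_log_errors /home/azureuser/dbt_run.log
```

Passing a model name to `--select` instead of refreshing everything is one way to avoid the very heavy runs mentioned above.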
# Airbyte
| *when…* | *then…* |
| --- | --- |
| You need to add a new table from Core into the DWH | - Pick the right existing connection depending on the source schema and incrementality (full refresh vs incremental).<br>- Add the table stream. |
| A sync job is failing | - Visit the Airbyte UI.<br>- Find the failed job and read the logs.<br>- Pull the thread from there. |
| Upstream changes in a data model are creating conflicts | - This is a tricky area because there are many ways to handle it.<br>- One option is to convince whoever owns the upstream data model to go back to the previous state. If so, the issue fixes itself without changing anything in Airbyte.<br>- If this isn't possible, you might decide to reset the stream and sync everything from the source again. Bear in mind this can't be easily reverted and might break downstream `dbt` models. |
| Data between Core and Airbyte has gone out of sync because of some mistake, for instance violated `UpdatedDate` fields in Core | - Visit the affected connection and streams and reset them. |
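
Why does a reset help in the last case? Assuming the incremental streams use `UpdatedDate` as their cursor (my assumption, check the connection settings), each sync only picks up rows whose cursor value is newer than the last one seen. A row changed in Core without bumping `UpdatedDate` is never picked up again until the stream is reset. The toy below mimics that with made-up rows. It is a simplification, not Airbyte's actual logic.

```shell
# Toy illustration with made-up data, not Airbyte's actual code.
# Columns: id, UpdatedDate, note
rows='1 2025-07-01 untouched
2 2025-07-05 edited_and_date_bumped
3 2025-07-01 edited_but_date_not_bumped'

# Emulate an incremental sync: print the ids of rows newer than the cursor.
# ISO dates compare correctly as plain strings.
incremental_pick() {
  local cursor="$1"
  printf '%s\n' "$rows" | awk -v cursor="$cursor" '$2 > cursor { print $1 }'
}

# With the cursor at 2025-07-03 only row 2 is picked; row 3 is silently missed.
incremental_pick 2025-07-03
```

A reset throws the stored cursor away, which is why everything gets synced from the source again.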
# `airbyte-prd` machine
This machine is where Airbyte, dbt and `xexe` run.
- Airbyte is deployed as a series of docker containers orchestrated with docker compose. The docker compose file can be found in `/home/azureuser/airbyte/`
- Both dbt and `xexe` run on a schedule. Their execution is triggered by `cron` (the commands live in `azureuser`'s crontab) and the scripts they run are `/home/azureuser/run_dbt.sh` and `/home/azureuser/run_xexe.sh`
This is typically uber-stable and nothing goes wrong.
If something smells fishy, a simple reboot should do the trick, as all services will start working on boot.
If any of the services stop working, here's where you can go to research:
- Airbyte → See the logs of the containers. If the UI still works, you can also read the logs there.
- dbt → Check the file `/home/azureuser/dbt_run.log`
- `xexe` → Check the file `/home/azureuser/xexe_run.log`
- `anaxi` → Check the file `/home/azureuser/anaxi_run.log`
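
The places listed above can be wrapped in a few helpers to run on `airbyte-prd`. This is a sketch: the paths come from this page, but the `docker compose` invocation is an assumption (an older install may need `docker-compose` instead), and the helper names are made up.

```shell
# Sketch only: helper names are hypothetical, paths come from this page.

# Map a service name to its log file.
log_path_for() {
  case "$1" in
    dbt|xexe|anaxi) echo "/home/azureuser/${1}_run.log" ;;
    *) echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Show the last lines of a service log (default: 50 lines).
tail_service_log() {
  local path
  path="$(log_path_for "$1")" || return 1
  tail -n "${2:-50}" "$path"
}

# Show recent logs of the Airbyte containers.
airbyte_logs() {
  (cd /home/azureuser/airbyte/ && docker compose logs --tail 100)
}

# Show the scheduled jobs (run this as azureuser).
show_schedule() {
  crontab -l
}
```

For example, `tail_service_log xexe 100` prints the last 100 lines of the `xexe` log.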
# Power BI Gateway
The PBI Gateway software is running on a Windows VM named `pbi-gateway-prd`.
The software is just installed there. I have no clue how anything could go wrong: the service has been working like a charm since day one.