Pablo Martin a256b48b01 pages
2025-07-11 16:15:17 +02:00

DE survival without Pablo

Reference page for when Pablo is not around and something goes wrong.

General stuff

  • Remember, the deployment instructions for the Data infra live here. When in doubt about how something is set up, or how it could be re-deployed, check there. I'll skip a lot of details on how things are built because you can find them there.
  • If you feel unsure about touching stuff you don't understand in Azure, ask Ben Robinson for help.

DWH

When: DWH is not responsive
Then:
  • Troubleshoot to find out whether it's a network issue (you can't reach the DWH) or whether the DWH itself is effectively turned off or locked.
  • If it's a VPN issue, go to the VPN section.
  • If it's a DWH issue, try rebooting it from the Azure Portal.
  • If that doesn't do the trick, try restoring a backup. The DWH has 1 backup per day, going back up to 7 days.

When: DWH runs out of space
Then:
  • Visit the Azure Portal and increase the disk size of the database.
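The "network issue or DWH issue" triage above can be sketched as a two-hop check. This is only a rough sketch under assumptions: the jumphost address inside the VPN, the DWH host and the DWH port are placeholders you have to supply, and it assumes the jumphost answers ping over the VPN and that nc is installed on your laptop.

```shell
# Pure decision step: takes the exit codes of the two probes (0 = reachable).
classify_outage() {
    vpn_rc=$1
    dwh_rc=$2
    if [ "$vpn_rc" -ne 0 ]; then
        echo "vpn"      # cannot even reach the jumphost: go to the VPN section
    elif [ "$dwh_rc" -ne 0 ]; then
        echo "dwh"      # VPN is fine but the DWH does not answer
    else
        echo "ok"
    fi
}

# Probe both hops. All three arguments are placeholders (assumptions):
# the jumphost address inside the VPN, the DWH host, and the DWH port.
diagnose_dwh() {
    jumphost_ip=$1
    dwh_host=$2
    dwh_port=$3
    vpn_rc=0
    ping -c 1 "$jumphost_ip" >/dev/null 2>&1 || vpn_rc=$?
    dwh_rc=0
    nc -z -w 3 "$dwh_host" "$dwh_port" >/dev/null 2>&1 || dwh_rc=$?
    classify_outage "$vpn_rc" "$dwh_rc"
}
```

Calling diagnose_dwh with your own three values prints vpn, dwh or ok, which tells you which part of this page to continue with.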

VPN/Jumphost machine

When: VPN becomes unresponsive
Then:
  • Try rebooting the jumphost machine. WireGuard is set up as a systemd service, so it should start again on boot.
  • If that doesn't work, you'll have to check the logs to understand what's wrong. SSH into the machine and run sudo journalctl -u wg-quick@wg0.service to do so.
  • To simply check whether the service is running, run sudo systemctl status wg-quick@wg0.service. If it's running fine, you should see a green "active" in the output.

When: You need to give someone new, or another device, access
Then:
  • Create a new key pair.
  • Use the existing key configurations or the infra script documentation to understand how to add it on both the VPN server and the client device.
  • Do not share your existing key with more devices: each key pair should only be active on one device at a time.

When: You lock yourself out of the VPN because it stopped working and you can't SSH into the jumphost
Then:
  • Don't despair, this is still solvable. Add an exceptional rule in the Azure Network Security Group (NSG) that the jumphost is using to TEMPORARILY allow yourself to SSH on port 22 on the public IP.
  • REMEMBER TO REMOVE THE EXCEPTION ONCE YOU ARE DONE!
  • You can check the details on how to do this in the data-infra-script repository.
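For the lockout case, the temporary NSG exception could be scripted with the Azure CLI roughly as follows. Treat this as a sketch, not the documented procedure: the resource group, NSG name and rule name are placeholders, priority 100 is an assumption that may clash with an existing rule, and the data-infra-script repository remains the reference. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
# All three names are placeholders (assumptions), not the real resource names.
RESOURCE_GROUP="${RESOURCE_GROUP:-my-resource-group}"
NSG_NAME="${NSG_NAME:-my-jumphost-nsg}"
RULE_NAME="${RULE_NAME:-tmp-allow-ssh-emergency}"

# Print the command instead of executing it when DRY_RUN=1.
run_or_print() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Temporarily allow inbound SSH (port 22) from a single source IP.
# $1 is your own public IP address.
open_ssh_exception() {
    my_ip=$1
    run_or_print az network nsg rule create \
        --resource-group "$RESOURCE_GROUP" \
        --nsg-name "$NSG_NAME" \
        --name "$RULE_NAME" \
        --priority 100 \
        --direction Inbound \
        --access Allow \
        --protocol Tcp \
        --source-address-prefixes "$my_ip" \
        --destination-port-ranges 22
}

# Remove the temporary rule again. Do not skip this step.
close_ssh_exception() {
    run_or_print az network nsg rule delete \
        --resource-group "$RESOURCE_GROUP" \
        --nsg-name "$NSG_NAME" \
        --name "$RULE_NAME"
}
```

Keeping the open and close steps as a pair under one fixed rule name makes it harder to forget the cleanup: run open_ssh_exception with your IP, fix WireGuard over SSH, then run close_ssh_exception.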

dbt

When: You need to execute dbt on demand
Then:
  • SSH into the airbyte-prd machine.
  • Execute the following command: /bin/bash /home/azureuser/run_dbt.sh
  • You can check the execution logs in /home/azureuser/dbt_run.log

When: The dbt run is failing and you don't understand why
Then:
  • SSH into the airbyte-prd machine.
  • Check the execution logs in /home/azureuser/dbt_run.log to find the errors.
  • Pull the thread from there.

When: You need to run a full refresh
Then:
  • I would suggest making a sneaky dbt run from your own laptop, pointing to prd and with the --full-refresh flag active. Be careful, since this can trigger very heavy runs.
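Because a full refresh can be very heavy, one cautious approach is to print the command and review it before running it. The helper below is a hypothetical sketch: it assumes the production target in your dbt profile is called prd, and the model names passed to --select are whatever you choose to limit the run to.

```shell
# Print (do not run) a dbt full-refresh command so it can be reviewed first.
# $1 is the dbt target name; "prd" is an assumption about your profiles.yml.
# Any further arguments are model names passed to --select to limit the run.
full_refresh_cmd() {
    target=$1
    shift
    if [ "$#" -gt 0 ]; then
        echo "dbt run --full-refresh --target $target --select $*"
    else
        echo "dbt run --full-refresh --target $target"
    fi
}

# Example: full_refresh_cmd prd my_model   (my_model is a made-up name)
```

Limiting the refresh to the models that actually need it keeps the run much lighter than refreshing the whole project.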

Airbyte

When: You need to add a new table from Core into the DWH
Then:
  • Pick the right existing connection depending on the source schema and incrementality (full refresh vs incremental).
  • Add the table stream.

When: A sync job is failing
Then:
  • Visit the Airbyte UI.
  • Find the failed job and read the logs.
  • Pull the thread from there.

When: Upstream changes in a data model are creating conflicts
Then:
  • This is a tricky area because there are many ways to handle it.
  • One option is to convince whoever owns the upstream data model to go back to the previous state. If they do, the issue fixes itself without changing anything in Airbyte.
  • If this isn't possible, you might decide to Reset the stream and sync everything from the source again. Bear in mind this can't be easily reverted and might break downstream dbt models.

When: Data between Core and Airbyte has gone out of sync because of some mistake, for instance violated UpdatedDate fields in Core
Then:
  • Visit the affected connection and streams and reset them.

airbyte-prd machine

This machine is where Airbyte, dbt and xexe run.

  • Airbyte is deployed as a series of docker containers orchestrated with docker compose. The docker compose file can be found in /home/azureuser/airbyte/
  • Both dbt and xexe run on a schedule. Their execution is triggered by cron (the commands live in azureuser's crontab) and the scripts that run are /home/azureuser/run_dbt.sh and /home/azureuser/run_xexe.sh
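To see when dbt and xexe are actually scheduled, you can filter the crontab listing down to the two scripts named above. A small sketch, assuming the crontab entries reference the scripts by their full paths:

```shell
# Read a crontab listing on stdin and keep only the active (non-commented)
# entries that trigger the dbt or xexe run scripts.
scheduled_jobs() {
    grep -v '^[[:space:]]*#' | grep -E '/home/azureuser/run_(dbt|xexe)\.sh'
}

# On the airbyte-prd machine, logged in as azureuser:
#   crontab -l | scheduled_jobs
```

If this prints nothing, the schedule has been removed or commented out, which would explain runs that silently stopped happening.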

This is typically uber-stable and nothing goes wrong. If something smells fishy, a simple reboot should do the trick, as all services will start working on boot.

If any of the services stop working, here's where you can look:

  • Airbyte → See the logs of the containers. If the UI still works, you can also read the logs there.
  • dbt → Check the file /home/azureuser/dbt_run.log
  • xexe → Check the file /home/azureuser/xexe_run.log
  • anaxi → Check the file /home/azureuser/anaxi_run.log
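When one of these log files is long, a quick first pass is to pull out only the lines that mention an error or failure. A minimal sketch; the "error|fail" pattern is a guess at what the interesting lines look like, so adjust it to what the logs actually contain:

```shell
# Show the most recent lines of a log file that mention an error or failure,
# prefixed with their line numbers.
# $1 is the log file, $2 the maximum number of matches to show (default 20).
recent_errors() {
    log_file=$1
    max_lines=${2:-20}
    grep -n -i -E 'error|fail' "$log_file" | tail -n "$max_lines"
}

# On the airbyte-prd machine:
#   recent_errors /home/azureuser/dbt_run.log
#   recent_errors /home/azureuser/xexe_run.log 50
```

The line numbers in the output make it easy to jump to the surrounding context in the full log.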

Power BI Gateway

The PBI Gateway software is running on a Windows VM named pbi-gateway-prd.

The software is simply installed there. I have no idea what could go wrong with it: the service has been working like a charm since day one.