DE survival without Pablo
Reference page for when Pablo is not around and something goes wrong.
General stuff
- Remember, the deployment instructions for the Data infra live here. When in doubt on how something is set up, or how something could be re-deployed, check there. I’ll skip a lot of details on how things are built here because you can find them there.
- If you feel unsure about touching stuff you don’t understand in Azure, ask Ben Robinson for help.
DWH
| when… | then… |
|---|---|
| DWH is not responsive | - Troubleshoot to find out if it’s a network issue (can’t reach the DWH) or if the DWH is effectively turned off/locked.<br>- If it’s a VPN issue, go to the VPN section.<br>- If it’s a DWH issue, try rebooting from the Azure portal.<br>- If that doesn’t do the trick, try restoring a backup. The DWH has 1 backup per day, up to 7 days into the past. |
| DWH runs out of space | - Visit the Azure Portal and increase the disk size of the database. |
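
The first troubleshooting step (network issue vs. DWH issue) can be sketched as a quick reachability check from your laptop. This is only a sketch, not part of the existing tooling: the host names and ports in the usage comments are placeholders, and it assumes another machine behind the VPN (for example `airbyte-prd` over SSH) can serve as a reference point.

```shell
#!/usr/bin/env bash
# Sketch only: hosts and ports in the usage comments are placeholders.

# can_connect HOST PORT
# Returns 0 if a TCP connection to HOST:PORT opens within 3 seconds.
can_connect() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# triage REFERENCE_OK DWH_OK
# Takes two exit codes (0 = reachable) and prints where to look next.
triage() {
  local reference_ok=$1 dwh_ok=$2
  if [ "$dwh_ok" -eq 0 ]; then
    echo "DWH reachable"
  elif [ "$reference_ok" -eq 0 ]; then
    echo "DWH issue"   # VPN works, DWH does not answer
  else
    echo "VPN issue"   # nothing behind the VPN answers
  fi
}

# Usage (replace the placeholders with the real addresses and ports):
#   can_connect <airbyte-prd-address> 22; ref=$?
#   can_connect <dwh-address> <dwh-port>; dwh=$?
#   triage "$ref" "$dwh"
```

If the result is "VPN issue", it is still worth ruling out a problem with your own internet connection before touching the jumphost.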
VPN/Jumphost machine
| when… | then… |
|---|---|
| VPN becomes unresponsive | - Try to run a reboot on the jumphost machine. Wireguard is set as a systemd service, so it should start again on boot.<br>- If that doesn’t work, you’ll have to check the logs to understand what’s wrong. SSH into the machine and run `sudo journalctl -u wg-quick@wg0.service` to do so.<br>- You can run `sudo systemctl status wg-quick@wg0.service` to simply check if the service is running. If it’s running fine, you should see a green `active`. |
| You need to give someone new/another device access | - Create a new key pair.<br>- Use the existing key configurations or the infra script documentation to understand how to add it on both the VPN server and the client device.<br>- Do not try to share your existing key with more devices: each keypair should only be active in one device at a time. |
| You lock yourself out of the VPN because it stopped working and you can’t SSH into the jumphost | Don’t despair, this is still solvable. Add an exceptional rule in the Azure Network Security Group (NSG) that the Jumphost is using to TEMPORARILY allow yourself to SSH on port 22 on the public IP.<br>REMEMBER TO REMOVE THE EXCEPTION ONCE YOU ARE DONE!!!!!!<br>You can check details on how to do this in the data-infra-script repository. |
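
For the lock-out case, the temporary NSG exception could also be made with the Azure CLI instead of the portal. The sketch below only prints the commands so you can review them before running anything. The resource group, NSG name, rule name and priority are all hypothetical placeholders; the data-infra-script repository remains the reference for the real values.

```shell
#!/usr/bin/env bash
# Sketch only: resource group, NSG name, rule name and priority are placeholders.
# Prints (does not run) the Azure CLI commands to open and later close
# a temporary SSH exception for a single source IP.
print_ssh_exception_cmds() {
  local rg=$1 nsg=$2 my_ip=$3
  local rule="tmp-ssh-exception"   # hypothetical rule name
  cat <<EOF
# 1) Open: allow SSH (22/tcp) from your IP only
az network nsg rule create --resource-group $rg --nsg-name $nsg --name $rule --priority 100 --direction Inbound --access Allow --protocol Tcp --source-address-prefixes $my_ip/32 --destination-port-ranges 22
# 2) Close: remove the exception once you are done
az network nsg rule delete --resource-group $rg --nsg-name $nsg --name $rule
EOF
}

# Usage:
#   print_ssh_exception_cmds <resource-group> <nsg-name> <your-public-ip>
```

Priority 100 assumes no existing rule in the NSG already uses that number; pick another free value otherwise. Keeping the delete command next to the create command makes it harder to forget the cleanup.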
dbt
| when… | then… |
|---|---|
| You need to execute dbt on demand | - SSH into the `airbyte-prd` machine.<br>- Execute the following command: `/bin/bash /home/azureuser/run_dbt.sh`<br>- You can check the execution logs in `/home/azureuser/dbt_run.log` |
| The `dbt run` is failing and you don’t understand why | - SSH into the `airbyte-prd` machine.<br>- Check the execution logs in `/home/azureuser/dbt_run.log` to find the errors.<br>- Pull the thread from there. |
| You need to run a full refresh | - I would suggest making a sneaky `dbt run` from your own laptop, pointing to `prd` and with the `--full-refresh` flag active. Be careful since this can trigger very heavy runs. |
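
When hunting for errors in the dbt log, a small filter can save some scrolling. This is a minimal sketch: the match pattern (`error` or `fail`, case-insensitive) is a guess at what failures look like in the log, so adjust it to what `dbt_run.log` actually contains.

```shell
#!/usr/bin/env bash
# Sketch only: the match pattern is an assumption about the log contents.

# dbt_log_errors [LOG_FILE] [MAX_LINES]
# Prints the last MAX_LINES lines that mention an error or failure,
# each prefixed with its line number in the log.
dbt_log_errors() {
  local log_file=${1:-/home/azureuser/dbt_run.log}
  local max_lines=${2:-20}
  grep -inE 'error|fail' "$log_file" | tail -n "$max_lines"
}

# Usage on airbyte-prd:
#   dbt_log_errors                  # default log, last 20 matches
#   dbt_log_errors /home/azureuser/dbt_run.log 5
```

The line numbers in the output let you jump to the surrounding context in the full log.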
Airbyte
| when… | then… |
|---|---|
| You need to add a new table from Core into the DWH | - Pick the right existing connection depending on the source schema and incrementality (full refresh vs incremental).<br>- Add the table stream. |
| A sync job is failing | - Visit the Airbyte UI.<br>- Find the failed job and read the logs.<br>- Pull the thread from there. |
| Upstream changes in a data model are creating conflicts | - This is a tricky area because there are many ways to handle it.<br>- One option is to convince whoever owns the upstream data model to go back to the previous state. If so, the issue fixes itself without changing anything in Airbyte.<br>- If this isn’t possible, you might decide to Reset the stream and sync everything from source again. Bear in mind this can’t be easily reverted and might break downstream `dbt` models. |
| Data between Core and Airbyte has fallen out of sync because of some mistake, for instance violated `UpdatedDate` fields in Core | - Visit the affected connection and streams and reset them. |
airbyte-prd machine
This machine is where Airbyte, dbt and xexe run.
- Airbyte is deployed as a series of docker containers orchestrated with docker compose. The docker compose file can be found in `/home/azureuser/airbyte/`
- Both dbt and `xexe` run on a scheduled basis. Their execution is triggered by `cron` (commands live in `azureuser`'s crontab) and the running scripts are at `/home/azureuser/run_dbt.sh` and `/home/azureuser/run_xexe.sh`
This is typically uber-stable and nothing goes wrong. If something smells fishy, a simple reboot should do the trick, as all services will start working on boot.
If any of the services stop working, here’s where you can go to research:
- Airbyte → See the logs of the containers. If the UI still works, you can also read the logs there.
- dbt → Check the file `/home/azureuser/dbt_run.log`
- `xexe` → Check the file `/home/azureuser/xexe_run.log`
- `anaxi` → Check the file `/home/azureuser/anaxi_run.log`
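
A quick way to spot a scheduled job that has silently stopped is to check whether its log file is still being written. The sketch below assumes each run updates its log file; the 24-hour default is a guess, since the actual cron schedule is not stated on this page (check `crontab -l` as `azureuser`).

```shell
#!/usr/bin/env bash
# Sketch only: the default threshold (1440 min) is an assumption.

# log_is_stale LOG_FILE [MAX_AGE_MINUTES]
# Returns 0 (stale) if LOG_FILE is missing or was last modified more than
# MAX_AGE_MINUTES ago, and 1 (fresh) otherwise.
log_is_stale() {
  local log_file=$1 max_age=${2:-1440}
  [ -f "$log_file" ] || return 0
  [ -n "$(find "$log_file" -mmin +"$max_age")" ]
}

# Usage on airbyte-prd:
#   for f in dbt_run.log xexe_run.log anaxi_run.log; do
#     if log_is_stale "/home/azureuser/$f"; then echo "STALE: $f"; fi
#   done
```

A stale log only tells you the job has not run recently; the log contents and the crontab are still where to look for the reason.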
Power BI Gateway
The PBI Gateway software is running on a Windows VM named pbi-gateway-prd.
The software is just installed there. I have no clue on how anything could go wrong: the service has been working like a charm since day one.