# DE survival without Pablo

Reference page for when Pablo is not around and something goes wrong.

# General stuff

- Remember, the deployment instructions for the Data infra [live here](https://guardhog.visualstudio.com/Data/_git/data-infra-script?path=/platform-overview.md). When in doubt about how something is set up, or how it could be re-deployed, check there. I’ll skip a lot of details on how things are built here because you can find them there.
- If you feel unsure about touching stuff you don’t understand in Azure, ask Ben Robinson for help.

# DWH

| *when…* | *then…* |
| --- | --- |
| DWH is not responsive | - Troubleshoot to find out if it’s a network issue (can’t reach DWH) or if the DWH is effectively turned off/locked.<br>- If it’s a VPN issue, go to VPN section.<br>- If it’s a DWH issue, try rebooting from Azure portal.<br>- If that doesn’t do the trick, try restoring a backup. DWH has 1 backup per day, up to 7 days into the past. |
| DWH runs out of space | - Visit the Azure Portal and increase the disk size of the database. |

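
The first troubleshooting step for an unresponsive DWH (network issue vs. DWH issue) can be sketched as a quick TCP check run from your laptop with the VPN up. This is only a minimal sketch: the host name and port are placeholders (nothing on this page states the real DWH endpoint), and `port_is_open` and `diagnose_dwh` are hypothetical helper names. It relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command.

```shell
#!/usr/bin/env bash
# Placeholders (assumptions): set these to the real DWH endpoint before use.
DWH_HOST="${DWH_HOST:-dwh.example.internal}"
DWH_PORT="${DWH_PORT:-5432}"

# Succeeds if a TCP connection to host:port opens within 5 seconds.
# Host and port are passed as arguments to the inner bash ($0 and $1).
port_is_open() {
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Prints a one-line verdict to guide the next step.
diagnose_dwh() {
  if port_is_open "$1" "$2"; then
    echo "reachable: network path is fine, look at the DWH itself"
  else
    echo "unreachable: check the VPN first, then whether the DWH is stopped"
  fi
}
```

Usage would be `diagnose_dwh "$DWH_HOST" "$DWH_PORT"`. An "unreachable" verdict alone does not separate a VPN problem from a stopped DWH, so it is worth checking whether other machines behind the VPN still answer before rebooting anything.
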
# VPN/Jumphost machine

| *when…* | *then…* |
| --- | --- |
| VPN becomes unresponsive | - Try rebooting the jumphost machine. WireGuard is set up as a `systemd` service, so it should start again on boot.<br>- If that doesn’t work, you’ll have to check the logs to understand what’s wrong. SSH into the machine and run `sudo journalctl -u wg-quick@wg0.service` to do so.<br>- You can run `sudo systemctl status wg-quick@wg0.service` to simply check if the service is running. If it’s running fine, you should see a green `active`. |
| You need to give someone new/another device access | - Create a new key pair.<br>- Use the existing key configurations or the infra script documentation to understand how to add it on both the VPN server and the client device.<br>- Do not try to share your existing key with more devices: each key pair should only be active on one device at a time. |
| You lock yourself out of the VPN because it stopped working and you can’t SSH into the jumphost | Don’t despair, this is still solvable. Add an exceptional rule in the Azure Network Security Group (NSG) that the Jumphost is using to **TEMPORARILY** allow yourself to SSH on port 22 on the public IP.<br><br>**REMEMBER TO REMOVE THE EXCEPTION ONCE YOU ARE DONE!!!!!!**<br><br>You can check details on how to do this in the `data-infra-script` repository. |

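
Two of the VPN chores above boil down to a handful of commands: creating a key pair and opening (then closing) the temporary SSH exception. The sketch below is one way to do it, not the procedure from the `data-infra-script` repository. The resource group, NSG name, rule name and priority are all made-up placeholders, and `run`, `new_wg_keypair`, `open_ssh_exception` and `close_ssh_exception` are hypothetical helper names. It assumes the `wg` tool and the Azure CLI (`az`) are installed and that you are logged in. By default the `az` commands are only printed, not executed.

```shell
#!/usr/bin/env bash
# Hypothetical names: look up the real ones in the Azure portal or the
# data-infra-script repository before running anything.
RESOURCE_GROUP="${RESOURCE_GROUP:-my-resource-group}"
NSG_NAME="${NSG_NAME:-my-jumphost-nsg}"
RULE_NAME="${RULE_NAME:-tmp-allow-ssh}"

# Prints the command by default; set DRY_RUN=0 to really execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Generate a WireGuard key pair for ONE new device.
# Writes NAME.key (private) and NAME.pub (public) in the current directory.
new_wg_keypair() {
  (
    umask 077   # keep the private key readable only by you
    wg genkey | tee "$1.key" | wg pubkey > "$1.pub"
  )
}

# Temporarily allow SSH (port 22) from a single public IP.
# The priority (100) is an assumption: it must not clash with an existing rule.
open_ssh_exception() {
  run az network nsg rule create \
    --resource-group "$RESOURCE_GROUP" \
    --nsg-name "$NSG_NAME" \
    --name "$RULE_NAME" \
    --priority 100 \
    --direction Inbound \
    --access Allow \
    --protocol Tcp \
    --source-address-prefixes "$1/32" \
    --destination-port-ranges 22
}

# Remove the exception again. Do not skip this step.
close_ssh_exception() {
  run az network nsg rule delete \
    --resource-group "$RESOURCE_GROUP" \
    --nsg-name "$NSG_NAME" \
    --name "$RULE_NAME"
}
```

A typical session would be `open_ssh_exception <your public IP>`, fix WireGuard over SSH, then `close_ssh_exception`. Review the printed commands first and only then re-run with `DRY_RUN=0`. Restricting the rule to your own IP rather than opening port 22 to everyone keeps the exposure small while the exception exists.
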
# dbt

| *when…* | *then…* |
| --- | --- |
| You need to execute `dbt` on demand | - SSH into the `airbyte-prd` machine.<br>- Execute the following command: `/bin/bash /home/azureuser/run_dbt.sh`<br>- You can check the execution logs in `/home/azureuser/dbt_run.log` |
| The `dbt run` is failing and you don’t understand why | - SSH into the `airbyte-prd` machine.<br>- Check the execution logs in `/home/azureuser/dbt_run.log` to find the errors.<br>- Pull the thread from there. |
| You need to run a full refresh | - I would suggest making a sneaky `dbt run` from your own laptop, pointing to `prd` and with the `--full-refresh` flag active. Be careful since this can trigger very heavy runs. |

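
The first two rows above can be sketched as a couple of shell helpers to use once you are logged into `airbyte-prd`. The two paths come from this page; everything else is an assumption: `dbt_errors` and `run_dbt_now` are hypothetical names, and the `error|fail` pattern is a guess at what the log contains, since I don't know exactly what `run_dbt.sh` writes.

```shell
#!/usr/bin/env bash
# Paths taken from this page; they only exist on the airbyte-prd machine.
DBT_SCRIPT="/home/azureuser/run_dbt.sh"
DBT_LOG="/home/azureuser/dbt_run.log"

# Show the last N (default 20) log lines that mention an error or failure,
# prefixed with their line number in the log file.
# The pattern is a guess: adjust it to what the log actually contains.
dbt_errors() {
  grep -n -i -E 'error|fail' "${1:-$DBT_LOG}" | tail -n "${2:-20}"
}

# Trigger a run on demand, then print whatever looks like an error.
run_dbt_now() {
  /bin/bash "$DBT_SCRIPT"
  dbt_errors "$DBT_LOG"
}
```

The line numbers printed by `dbt_errors` make it easy to jump to the surrounding context in the log. For the full refresh row, the command from a laptop would be something like `dbt run --full-refresh --target prd`, assuming your dbt profile defines a target named `prd`. Adding `--select <model>` limits the refresh to the models you actually need, which helps with the heavy-run risk.
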
# Airbyte

| *when…* | *then…* |
| --- | --- |
| You need to add a new table from Core into the DWH | - Pick the right existing connection depending on the source schema and incrementality (full refresh vs incremental).<br>- Add the table stream. |
| A sync job is failing | - Visit the Airbyte UI.<br>- Find the failed job and read the logs.<br>- Pull the thread from there. |
| Upstream changes in a data model are creating conflicts | - This is a tricky area because there are many ways to handle it.<br>- One option is to convince whoever owns the upstream data model to go back to the previous state. If so, the issue fixes itself without changing anything in Airbyte.<br>- If this isn’t possible, you might decide to reset the stream and sync everything from the source again. Bear in mind this can’t be easily reverted and might break downstream `dbt` models. |
| Data between Core and Airbyte has fallen out of sync because of some mistake, for instance because `UpdatedDate` fields in Core were not maintained properly | - Visit the affected connection and streams and reset them. |

# `airbyte-prd` machine

This machine is where Airbyte, dbt and `xexe` run.

- Airbyte is deployed as a series of Docker containers orchestrated with Docker Compose. The Docker Compose file can be found in `/home/azureuser/airbyte/`
- Both dbt and `xexe` run on a schedule. Their execution is triggered by `cron` (the commands live in `azureuser`'s crontab) and the scripts are at `/home/azureuser/run_dbt.sh` and `/home/azureuser/run_xexe.sh`

This is typically uber-stable and nothing goes wrong.
If something smells fishy, a simple reboot should do the trick, as all services will start working on boot.

If any of the services stop working, here’s where you can go to research:

- Airbyte → See the logs of the containers. If the UI still works, you can also read the logs there.
- dbt → Check the file `/home/azureuser/dbt_run.log`
- `xexe` → Check the file `/home/azureuser/xexe_run.log`
- `anaxi` → Check the file `/home/azureuser/anaxi_run.log`

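
A quick way to see which scheduled job has gone quiet is to check when each log file above was last written. The helpers below are a minimal sketch under a few assumptions: `log_age_minutes` and `report_logs` are hypothetical names, the scripts are assumed to append to their log on every run, and `stat -c %Y` is the GNU/Linux form of `stat`.

```shell
#!/usr/bin/env bash
# Minutes since the given file was last modified (GNU stat assumed).
log_age_minutes() {
  now=$(date +%s)
  modified=$(stat -c %Y "$1") || return 1
  echo $(( (now - modified) / 60 ))
}

# One line per log file: how stale it is, or that it is missing.
report_logs() {
  for f in "$@"; do
    if [ -f "$f" ]; then
      echo "$f: last written $(log_age_minutes "$f") min ago"
    else
      echo "$f: missing"
    fi
  done
}
```

On `airbyte-prd` this could be called as `report_logs /home/azureuser/dbt_run.log /home/azureuser/xexe_run.log /home/azureuser/anaxi_run.log` and compared against the schedule shown by `crontab -l` (run as `azureuser`). For Airbyte itself, running `docker compose ps` and `docker compose logs --tail 100` from `/home/azureuser/airbyte/` shows container status and recent output. Older installs may need the `docker-compose` spelling instead.
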
# Power BI Gateway

The PBI Gateway software is running on a Windows VM named `pbi-gateway-prd`.

The software is just installed there. I have no idea what could go wrong: the service has been working like a charm since day one.