more readme

Pablo Martin 2024-08-16 11:10:35 +02:00
parent 14bdf7dcbe
commit 64977b8fe7

@ -37,6 +37,8 @@ Different streams are independent from each other. Their runs won't affect them
To set up a stream, `anaxi` expects to find a file called `streams.yml` in the path `~/.anaxi/streams.yml`. You can check the example file in this repo named `example-streams.yml` to understand how to build this file. Each entry in the file represents one stream. The `cosmos_database_id` field and `postgres_database` field in each stream entry should be filled in with values that you have defined in the `cosmos-db.yml` and `postgres.yml` files.
Also, you will need to create a folder named `checkpoints` in the path `~/.anaxi/checkpoints`. The state of the checkpoints for each stream will be kept there in different files.
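A minimal way to create that folder, assuming a POSIX shell on the machine where `anaxi` runs:

```shell
# Create ~/.anaxi (if missing) and the checkpoints folder inside it.
# -p makes this a no-op when the folders already exist.
mkdir -p "$HOME/.anaxi/checkpoints"

# Confirm the folder is there.
ls -ld "$HOME/.anaxi/checkpoints"
```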
Once you have configured the `streams.yml` file, you can use `anaxi` to execute the syncs. See more details in the next section.
### Calling `anaxi`
@ -59,6 +61,28 @@ To run a sync job, you can use:
```
anaxi sync-stream --stream-id <your-stream-name>
```
### Deploying for Superhog infra
`anaxi` is simply a Python CLI app coupled with some configs stored in the file system. It can be run in many different ways. This section provides some guidance on deploying it for the specific needs we have as of today.
- Setup
  - Prepare a Linux VM.
  - Follow the instructions from the previous sections of this readme to get the tool ready to run.
- Scheduling
  - Up next, schedule the execution in the Linux VM to fit your needs.
  - Specifics are up to you and your circumstances.
  - A general pattern would be to create a little bash script that calls the tool with the right parameters. You can find an example that I like in the root of this repo named `run_anaxi.sh`, but it is opinionated and adjusted to my needs at the time of writing this. Adapt it to your environment or start from scratch if necessary. The script is designed to be placed in `~/run_anaxi.sh`.
  - The script is designed to send both success and failure messages to Slack channels upon completion. To properly set this up, you will need to place a file called `slack_webhook_urls.txt` in the same path where you drop `run_anaxi.sh`. The file should have two lines: `SLACK_ALERT_WEBHOOK_URL=<url-of-webhook-for-failures>` and `SLACK_RECEIPT_WEBHOOK_URL=<url-of-webhook-for-successful-runs>`. Setting up the Slack channels and webhooks is outside of the scope of this readme.
  - Create a cron entry with `crontab -e` that runs the script. For example: `0 2 * * * /bin/bash /home/azureuser/run_anaxi.sh` to run syncs every day at 2 AM.
  - If you want to run syncs at different frequencies, you can make different copies of `run_anaxi.sh` and schedule them independently.
- Backfilling and first runs
  - When running the first sync for a stream, `anaxi` will by default start reading records from the beginning of the source Cosmos DB container. In some cases, this is what you want, and you don't need to take any special action.
  - On the other hand, if you want to sync only from a specific point in time, you can achieve this by creating a file in the checkpoints folder. The file should be named `<name-of-your-stream>.yml` and contain a single key named `highest_synced_timestamp`, with the value being the timestamp from which you want the sync to begin (in UTC!). For example, if I wanted to start at the time I'm writing this, this would be the content of the file:
    ```yml
    highest_synced_timestamp: '2024-08-16T09:02:23+00:00'
    ```
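The file-based steps above (the Slack webhook file and the seeded checkpoint) can be sketched as a small shell script. This is only an illustration and is not the `run_anaxi.sh` script from the repo. The stream name `my-stream` is hypothetical, the webhook values are the placeholders from this readme rather than real URLs, and the timestamp layout is assumed from the example above. Both files are only written if they do not already exist, so an existing checkpoint is never reset.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stream name; replace it with a stream id from your streams.yml.
STREAM_NAME="my-stream"

# Paths taken from this readme: the webhook file sits next to ~/run_anaxi.sh,
# and checkpoints live in ~/.anaxi/checkpoints.
WEBHOOK_FILE="$HOME/slack_webhook_urls.txt"
CHECKPOINT_DIR="$HOME/.anaxi/checkpoints"
CHECKPOINT_FILE="$CHECKPOINT_DIR/$STREAM_NAME.yml"

# 1. Webhook file with the two expected lines (placeholders, edit before use).
if [ ! -e "$WEBHOOK_FILE" ]; then
  cat > "$WEBHOOK_FILE" <<'EOF'
SLACK_ALERT_WEBHOOK_URL=<url-of-webhook-for-failures>
SLACK_RECEIPT_WEBHOOK_URL=<url-of-webhook-for-successful-runs>
EOF
  # Webhook URLs act like secrets, so keep the file private to this user.
  chmod 600 "$WEBHOOK_FILE"
fi

# 2. Seed checkpoint so the first sync starts from "now" (UTC) instead of
#    the beginning of the container.
mkdir -p "$CHECKPOINT_DIR"
if [ -e "$CHECKPOINT_FILE" ]; then
  echo "Checkpoint already exists, leaving it untouched: $CHECKPOINT_FILE"
else
  START_AT="$(date -u +%Y-%m-%dT%H:%M:%S+00:00)"
  printf "highest_synced_timestamp: '%s'\n" "$START_AT" > "$CHECKPOINT_FILE"
fi

cat "$CHECKPOINT_FILE"
```

To start from a different moment, replace the `START_AT` value with the UTC timestamp you need before running the script for the first time.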
## Relevant internals and implementation details
### Tracking checkpoints