Merged PR 2581: Fixes small typos

Fixes small typos in the Readme.

Very exhaustive and clear readme btw :)
Oriol Roqué Paniagua 2024-08-19 12:51:46 +00:00 committed by Pablo Martín
commit cc1449074c


@@ -29,13 +29,13 @@ For Postgres databases: `anaxi` expects to find a file called `postgres.yml` in
### Set up streams
-`anaxi` works through streams. A stream is the link between one specific Csomos DB container in some database, and a table in a Postgres database. Once you configure a stream properly, you can use `anaxi` to move the data from the container to the table. Each time you do that, you are syncing.
+`anaxi` works through streams. A stream is the link between one specific Cosmos DB container in some database, and a table in a Postgres database. Once you configure a stream properly, you can use `anaxi` to move the data from the container to the table. Each time you do that, you are syncing.
Syncs are incremental in nature. `anaxi` keeps track of the most up-to-date timestamp it has seen in the last run for the stream, and will only bring over Cosmos DB documents created or edited since then. We call this stored point in time the *checkpoint*. When you run a sync, `anaxi` will increase the checkpoint as it sees more data. Even if an error happens in the middle of a job, `anaxi` will keep track of the *checkpoint* up to which it can be fully confident that the data has been delivered to the target table.
Different streams are independent of each other. Their runs won't affect one another in any way.
-To set up a stream, `anaxi` expects to find a file called `streams.yml` in the path `~/.anaxi/streams.yml`. You can check the example file in this repo named `example-streams.yml` to understand how to build this file. Each entry in the file represents one stream. The `cosmos_database_id` field and `postgres_database` field in each stream entry should be filled in with values that you have informed in the `cosmos-db.yml` and `postgres.yml` files. The `cutoff_timestamp` field allows you to specify a timestamp (ISO 8601) that should be used as the first data to read data from if no checkpoint is available. You can leave it empty to read all records from the start of the container history.
+To set up a stream, `anaxi` expects to find a file called `streams.yml` in the path `~/.anaxi/streams.yml`. You can check the example file in this repo named `example-streams.yml` to understand how to build this file. Each entry in the file represents one stream. The `cosmos_database_id` field and `postgres_database` field in each stream entry should be filled in with values that you have provided in the `cosmos-db.yml` and `postgres.yml` files, respectively. The `cutoff_timestamp` field allows you to specify a timestamp (ISO 8601) that should be used as the first date to read data from if no checkpoint is available. You can leave it empty to read all records from the start of the container history.
Also, you will need to create a folder named `checkpoints` in the path `~/.anaxi/checkpoints`. The state of the checkpoints for each stream will be kept there in different files.
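If it helps to picture the two pieces together, here is a rough sketch of that setup, not taken from the repo: only the `~/.anaxi/streams.yml` and `~/.anaxi/checkpoints` paths and the `cosmos_database_id`, `postgres_database` and `cutoff_timestamp` fields come from this readme; the stream key and the `cosmos_container_id`/`postgres_table` names are placeholders I'm assuming, so defer to `example-streams.yml` for the real schema.

```bash
# Sketch only: create the folders anaxi expects and append a hypothetical stream entry.
mkdir -p ~/.anaxi/checkpoints

cat >> ~/.anaxi/streams.yml <<'EOF'
orders-to-postgres:                          # assumed: entries keyed by the stream name/id
  cosmos_database_id: my-cosmos-db           # should match a database configured in cosmos-db.yml
  cosmos_container_id: orders                # assumed field name for the source container
  postgres_database: analytics               # should match a database configured in postgres.yml
  postgres_table: orders_raw                 # assumed field name for the target table
  cutoff_timestamp: "2024-01-01T00:00:00Z"   # ISO 8601; leave empty to read the full container history
EOF
```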
@@ -63,14 +63,14 @@ anaxi sync-stream --stream-id <your-stream-name>
### Deploying for Superhog infra
-`anaxi` is simply a Python CLI app coupled with some configs stored in the file system. It can be run in many different ways. This section will provide some guidance in deploying it for the specifics needs we have as of today.
+`anaxi` is simply a Python CLI app coupled with some configs stored in the file system. It can be run in many different ways. This section will provide some guidance in deploying it for the specific needs we have as of today.
- Setup
  - Prepare a Linux VM.
  - Follow the instructions from the previous sections of this readme to get the tool ready to run.
- Scheduling
  - Up next, schedule the execution in the Linux VM to fit your needs.
-  - Specifics are up to you and your circunstances.
+  - Specifics are up to you and your circumstances.
  - A general pattern would be to create a little bash script that calls the tool with the right parameters. You can find an example that I like in the root of this repo named `run_anaxi.sh`, but that's opinionated and adjusted to my needs at the time of writing this. Adapt it to your environment or start from scratch if necessary. The script is designed to be placed in `~/run_anaxi.sh`.
  - The script is designed to send both success and failure messages to Slack channels upon completion. To properly set this up, you will need to place a file called `slack_webhook_urls.txt` on the same path where you drop `run_anaxi.sh`. The file should have two lines: `SLACK_ALERT_WEBHOOK_URL=<url-of-webhook-for-failures>` and `SLACK_RECEIPT_WEBHOOK_URL=<url-of-webhook-for-successful-runs>` (see the sketch after this list). Setting up the Slack channels and webhooks is outside the scope of this readme.
  - Create a cron entry with `crontab -e` that runs the script. For example: `0 2 * * * /bin/bash /home/azureuser/run_anaxi.sh` to run syncs every day at 2AM.
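To make the last two bullets concrete, here is a minimal sketch, assuming the script lives at `~/run_anaxi.sh` as above; the webhook URLs are placeholders for your own Slack webhooks.

```bash
# Webhook file next to run_anaxi.sh (file name and variable names as described above;
# the URLs are placeholders).
cat > ~/slack_webhook_urls.txt <<'EOF'
SLACK_ALERT_WEBHOOK_URL=https://hooks.slack.com/services/REPLACE/ME/failures
SLACK_RECEIPT_WEBHOOK_URL=https://hooks.slack.com/services/REPLACE/ME/receipts
EOF

# Install the daily 2AM schedule non-interactively (same effect as adding the line via crontab -e).
( crontab -l 2>/dev/null; echo "0 2 * * * /bin/bash /home/azureuser/run_anaxi.sh" ) | crontab -
```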
@@ -96,7 +96,7 @@ This implies two important facts:
- If the checkpoint file gets deleted, altered, corrupted, or whatever dramatic event happens to it, the checkpoint will be lost.
- You can modify the file to manipulate the behaviour of the next sync for any given stream. For example, if you want to run a full-refresh, you can simply delete the file.
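For instance, a full refresh of a single stream boils down to something like the sketch below; the exact checkpoint file name per stream is an assumption on my side, so list the folder first to see what your files are actually called.

```bash
# See which checkpoint files exist, then drop the one for the stream you want to fully refresh.
ls ~/.anaxi/checkpoints
rm ~/.anaxi/checkpoints/<your-stream-name>*   # assumed naming; with no checkpoint, the next sync falls back to cutoff_timestamp (or the start of history)

# Re-run the stream; anaxi never deletes on destination, so clear the target table yourself if you don't want duplicates.
anaxi sync-stream --stream-id <your-stream-name>
```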
-On the other hand, `anaxi` will never delete anything on destination in any situation. Be careful when runnning full-refreshes, since you will need to decided if you want to remove data from destination or are happy having duplicate records there.
+On the other hand, `anaxi` will never delete anything on destination in any situation. Be careful when running full-refreshes, since you will need to decide if you want to remove data from destination or if you are happy having duplicate records there.
### Deletes
@@ -115,7 +115,7 @@ If you don't really care about why this is the case, you can skip the rest of th
- A new sync gets triggered, and documents that have a timestamp equal to the checkpoint get synced *again*.
- This behaviour will actually repeat in further syncs until a new change feed event with a higher timestamp appears.
-Note that, for some unknown reason, this only happens to SOME records. Meaning, if checkpoint is at point in time *t*, and there are 10 documents with timestamp being *t*, it can happen that only 7 of them get repeteadly loaded again and again, while the other 3 only got loaded on the first sync. Why? Only Bill Gates might know I guess.
+Note that, for some unknown reason, this only happens to SOME records. Meaning, if the checkpoint is at point in time *t*, and there are 10 documents with timestamp *t*, it can happen that only 7 of them get repeatedly loaded again and again, while the other 3 only got loaded on the first sync. Why? Only Bill Gates might know I guess.
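If you want to check whether your target table is picking up these repeated loads, a quick duplicate count is usually enough; the connection string, table name and `id` column below are placeholders, since `anaxi` gets the real values from your `postgres.yml` and stream config.

```bash
# Count how many copies of each document id ended up in the target table (placeholder names).
psql "host=<your-postgres-host> dbname=<postgres_database> user=<user>" -c "
  SELECT id, count(*) AS copies
  FROM <your-target-table>
  GROUP BY id
  HAVING count(*) > 1
  ORDER BY copies DESC;"
```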
### Multiple containers with same name
@@ -125,7 +125,7 @@ If you encounter this situation: that's an issue. You are going to have to modif
### Missing intermediate steps due to Live mode
-Cosmos DB databases have different modes for the Change Feed. In databases that run in the *Latest version* mode, not *every* single change in documents will be picked up by `anaxi`. Instead, every time you run a sync, you will receive __the latest state__ of any documents that have been created or documented since the checkpoint you are using. Intermediate states will not be read through the feed.
+Cosmos DB databases have different modes for the Change Feed. In databases that run in the *Latest version* mode, not *every* single change in documents will be picked up by `anaxi`. Instead, every time you run a sync, you will receive __the latest state__ of any document that has been created or edited since the checkpoint you are using. Intermediate states will not be read through the feed.
You can read more here to understand all the nuances and implications of the different modes: <https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/change-feed-modes>