diff --git a/README.md b/README.md
index 1606340..a43e9de 100644
--- a/README.md
+++ b/README.md
@@ -21,28 +21,61 @@ Regarding Cosmos DB databases: `anaxi` expects to find a file called `cosmos-db.
 
 For Postgres databases: `anaxi` expects to find a file called `postgres.yml` in the path `~/.anaxi/postgres.yml`. The file should specify one or more Postgres databases, along with the required secrets to interact with them. You can check the example file in this repo named `example-postgres.yml` to understand how to build this file. Once you've done that, you can check if any database is reachable with the `postgres-healthcheck` command. See more in the `General Usage` section below.
 
-### General Usage
+### Target database preparation
+
+`anaxi` assumes that the destination Postgres database and schema already exist before it runs, and will not try to create them if they can't be found. Instead, the sync will simply fail. You are expected to create them yourself.
+
+`anaxi` will use the source container name as the name for the destination table. So, if you are reading from a container named `oranges`, `anaxi` will move the documents into a table named `oranges`. The table will be created for you on the first run, so you don't need to take care of that.
+
+### Set up streams
+
+`anaxi` works through streams. A stream is the link between one specific Cosmos DB container in some database and a table in a Postgres database. Once you configure a stream properly, you can use `anaxi` to move the data from the container to the table. Each time you do that, you are syncing.
+
+Syncs are incremental in nature. `anaxi` keeps track of the most recent timestamp it has seen in the last run of the stream, and will only bring over Cosmos DB documents created or edited since then. We call this stored point in time the *checkpoint*. When you run a sync, `anaxi` will advance the checkpoint as it sees more data.
Even if an error happens in the middle of a job, `anaxi` will keep track of the *checkpoint* up to which it can be fully confident that the data has been delivered to the target table.
+
+Different streams are independent of each other. A run of one stream won't affect the others in any way.
+
+To set up a stream, `anaxi` expects to find a file called `streams.yml` in the path `~/.anaxi/streams.yml`. You can check the example file in this repo named `example-streams.yml` to understand how to build this file. Each entry in the file represents one stream. The `cosmos_database_id` field and `postgres_database` field in each stream entry should be filled in with values that you have defined in the `cosmos-db.yml` and `postgres.yml` files.
+
+Once you have configured the `streams.yml` file, you can use `anaxi` to execute the syncs. See more details in the next section.
+
+### Calling `anaxi`
 
 You can run a healthcheck against any Cosmos DB database like this:
 
 ```bash
-anaxis cosmos-db-healthcheck --cosmos-db-id
+anaxi cosmos-db-healthcheck --cosmos-db-id
 ```
 
 You can run a healthcheck against Postgres databases like this:
 
 ```bash
-anaxis postgres-healthcheck --postgres-database
+anaxi postgres-healthcheck --postgres-database
 ```
 
-## Incremental scopes
+To run a sync job, you can use:
 
-- [X] Callable CLI app
-- [X] Healthchecks against Cosmos DB doable
-- [X] Healthchecks against DWH doable
-- [ ] Reading from Cosmos DB
-- [ ] Writing into DWH
-- [ ] Refactors and improvements
+```bash
+anaxi sync-stream --stream-id
+```
+
+## Relevant internals and implementation details
+
+### Tracking checkpoints
+
+`anaxi` keeps track of the most recent timestamp it has committed (the checkpoint) for each stream to keep syncs incremental. Checkpoints are stored in `~/.anaxi/checkpoints/`, with one file dedicated to each stream. Each file is named after the stream it tracks (stream `some-stream` will have its checkpoint stored at `~/.anaxi/checkpoints/some-stream.yml`).
The timestamp is stored in UTC.
+
+This implies two important facts:
+
+- If the checkpoint file is deleted, altered, or corrupted in any way, the checkpoint will be lost.
+- You can modify the file to manipulate the behaviour of the next sync for any given stream. For example, if you want to run a full refresh, you can simply delete the file.
+
+On the other hand, `anaxi` will never delete anything at the destination in any situation. Be careful when running full refreshes, since you will need to decide whether to remove data from the destination or accept duplicate records there.
+
+- More-than-once delivery
+- deletes
+- missing intermediate steps due to live mode
+- same-container-name in different databases
 
 ## Development
 
@@ -55,7 +88,6 @@ You can find instructions here: - - - ## What's with the name `Anaxi` is short for Anaximander. [Anaximander of Miletus](https://en.wikipedia.org/wiki/Anaximander) was a pre-Socratic Greek philosopher who lived in Miletus. He is often called the "Father of Cosmology" and founder of astronomy.
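
Reviewer's sketch for the `Tracking checkpoints` section added in this diff: the checkpoint behaviour it describes can be pictured roughly as below. This is not `anaxi`'s implementation. The `checkpoint:` key inside the file, the `updated_at` document field, and every function name are assumptions made for illustration; only the one-file-per-stream layout and the UTC timestamp come from the README text.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Iterable, Optional


def checkpoint_path(base_dir: Path, stream_id: str) -> Path:
    # One file per stream, named after the stream (as the README describes).
    return base_dir / "checkpoints" / f"{stream_id}.yml"


def load_checkpoint(base_dir: Path, stream_id: str) -> Optional[datetime]:
    """Return the stored checkpoint, or None when the file is missing."""
    path = checkpoint_path(base_dir, stream_id)
    if not path.exists():
        return None  # no checkpoint: the next sync behaves as a full refresh
    # Assumed file layout: one line such as
    # "checkpoint: 2024-01-01T00:00:00+00:00" (hypothetical key name).
    _, _, raw = path.read_text().strip().partition(": ")
    return datetime.fromisoformat(raw)


def save_checkpoint(base_dir: Path, stream_id: str, value: datetime) -> None:
    """Persist the checkpoint in UTC."""
    path = checkpoint_path(base_dir, stream_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"checkpoint: {value.astimezone(timezone.utc).isoformat()}\n")


def sync(
    base_dir: Path,
    stream_id: str,
    documents: Iterable[dict],
    deliver: Callable[[dict], None],
) -> int:
    """Deliver documents newer than the checkpoint, oldest first."""
    checkpoint = load_checkpoint(base_dir, stream_id)
    pending = sorted(
        (d for d in documents if checkpoint is None or d["updated_at"] > checkpoint),
        key=lambda d: d["updated_at"],
    )
    for doc in pending:
        # If deliver() raises, the checkpoint stays at the last delivered document.
        deliver(doc)
        save_checkpoint(base_dir, stream_id, doc["updated_at"])
    return len(pending)
```

In this sketch, deleting a stream's checkpoint file makes the next call to `sync` deliver every document again, which is where duplicate records at the destination would come from if nothing is removed there first.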