Anaxi

Anaxi is Superhog's tool to perform Extract-Load (EL) syncs between our multiple Cosmos DB databases and our DWH.

How to use the tool

Note: the app has only been used so far in a Linux environment. Windows support is dubious.

Install

  • Ensure you have Python 3.10 or later and poetry installed.
  • Run poetry install to install dependencies.
  • Activate the project's virtual environment. You can use poetry shell.
  • Test that everything is working by running anaxi smoke-test. You should see a happy pig.

Set up credentials

anaxi needs a few configs and secrets to run.

For Cosmos DB databases: anaxi expects to find a file called cosmos-db.yml in the path ~/.anaxi/cosmos-db.yml. The file should specify one or more Cosmos DB databases, along with the required secrets to interact with them. You can check the example file in this repo named example-cosmos-db.yml to understand how to build this file. Once you've done that, you can check if any database is reachable with the cosmos-db-healthcheck command. See more in the Calling anaxi section below.

For Postgres databases: anaxi expects to find a file called postgres.yml in the path ~/.anaxi/postgres.yml. The file should specify one or more Postgres databases, along with the required secrets to interact with them. You can check the example file in this repo named example-postgres.yml to understand how to build this file. Once you've done that, you can check if any database is reachable with the postgres-healthcheck command. See more in the Calling anaxi section below.
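Before running the healthchecks, it can help to confirm that both files are where anaxi looks for them. The snippet below is a minimal sketch, not part of anaxi: the file names and the ~/.anaxi location come from this README, while the missing_config_files helper is a hypothetical name made up for illustration.

```python
from pathlib import Path

# File names and location are taken from this README.
# The helper itself is hypothetical and not shipped with anaxi.
CONFIG_DIR = Path.home() / ".anaxi"
EXPECTED_FILES = ("cosmos-db.yml", "postgres.yml")


def missing_config_files(config_dir: Path = CONFIG_DIR) -> list[str]:
    """Return the names of expected config files that are absent from config_dir."""
    return [name for name in EXPECTED_FILES if not (config_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_config_files()
    if missing:
        print(f"Missing in {CONFIG_DIR}: {', '.join(missing)}")
    else:
        print("Both credential files are in place.")
```

This only checks that the files exist. Whether their contents are valid is what the healthcheck commands are for.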

Target database preparation

anaxi assumes that the destination Postgres database and schema are already present before executing, and will not try to create them if they can't be found. Instead, things will simply fail. You are expected to take care of creating them.

anaxi will use the source container name as the name for the destination table. So, if you are reading from a container named oranges, anaxi will move the documents into a table named oranges. The table will be created for you on the first run, so you don't need to take care of that.
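Since creating the destination schema is up to you, here is a minimal sketch of how the statement could be generated. It assumes you run the resulting SQL yourself with whatever Postgres client you prefer. The create_schema_statement helper and the schema name in the demo are hypothetical and not part of anaxi.

```python
def create_schema_statement(schema_name: str) -> str:
    """Build a CREATE SCHEMA statement with the identifier safely double-quoted."""
    # Postgres escapes a double quote inside a quoted identifier by doubling it.
    quoted = '"' + schema_name.replace('"', '""') + '"'
    return f"CREATE SCHEMA IF NOT EXISTS {quoted};"


if __name__ == "__main__":
    # "sync_raw" is a made-up schema name, used only for illustration.
    print(create_schema_statement("sync_raw"))
```

The database itself has to be created separately, since Postgres has no IF NOT EXISTS form of CREATE DATABASE.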

Set up streams

anaxi works through streams. A stream is the link between one specific Cosmos DB container in some database, and a table in a Postgres database. Once you configure a stream properly, you can use anaxi to move the data from the container to the table. Each time you do that, you are syncing.

Syncs are incremental. anaxi records the most recent timestamp it saw in the previous run of the stream, and only brings over Cosmos DB documents created or edited since then. We call this stored point in time the checkpoint. During a sync, anaxi advances the checkpoint as it sees more data. Even if an error happens in the middle of a job, anaxi keeps the checkpoint at the point up to which it is fully confident that the data has been delivered to the target table.
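The checkpoint idea can be illustrated with a small, self-contained sketch. This is not anaxi's actual code: the updated_at field, the batching, and the deliver callback are all assumptions made for the example. It only shows the principle that the checkpoint moves forward after a batch has been delivered, never before.

```python
from datetime import datetime, timezone
from typing import Callable, Iterable


def sync_incrementally(
    documents: Iterable[dict],
    checkpoint: datetime,
    deliver: Callable[[list[dict]], None],
    batch_size: int = 2,
) -> datetime:
    """Deliver documents newer than checkpoint in batches; return the new checkpoint.

    The checkpoint only advances after a batch has been delivered, so a failure
    mid-run leaves it at the last fully delivered batch.
    """
    # "updated_at" is a hypothetical field name standing in for the document timestamp.
    pending = sorted(
        (doc for doc in documents if doc["updated_at"] > checkpoint),
        key=lambda doc: doc["updated_at"],
    )
    for start in range(0, len(pending), batch_size):
        batch = pending[start : start + batch_size]
        try:
            deliver(batch)
        except Exception:
            # Stop here: everything up to `checkpoint` is known to be delivered.
            break
        checkpoint = batch[-1]["updated_at"]
    return checkpoint


if __name__ == "__main__":
    def at(hour: int) -> datetime:
        return datetime(2024, 8, 1, hour, tzinfo=timezone.utc)

    docs = [{"id": n, "updated_at": at(n)} for n in range(1, 6)]
    delivered: list[dict] = []
    new_checkpoint = sync_incrementally(docs, at(2), delivered.extend)
    print([doc["id"] for doc in delivered], new_checkpoint.isoformat())
```

A real implementation also has to decide what happens when several documents share a timestamp across a batch boundary, which this sketch ignores.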

Streams are independent from each other. Running one stream does not affect the others in any way.

To set up a stream, anaxi expects to find a file called streams.yml in the path ~/.anaxi/streams.yml. You can check the example file in this repo named example-streams.yml to understand how to build this file. Each entry in the file represents one stream. The cosmos_database_id field and postgres_database field in each stream entry should be filled in with values that you have defined in the cosmos-db.yml and postgres.yml files.
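The cross-reference between the three files can be checked with a few lines of Python. The sketch below is hypothetical and works on plain dictionaries. It assumes streams are held as a mapping from stream name to entry, which may not match the real layout shown in example-streams.yml. Only the cosmos_database_id and postgres_database field names come from this README.

```python
def unknown_references(
    streams: dict[str, dict],
    cosmos_ids: set[str],
    postgres_names: set[str],
) -> list[str]:
    """List problems where a stream points at a database that is not configured."""
    problems = []
    for stream_name, entry in streams.items():
        if entry.get("cosmos_database_id") not in cosmos_ids:
            problems.append(f"{stream_name}: unknown cosmos_database_id")
        if entry.get("postgres_database") not in postgres_names:
            problems.append(f"{stream_name}: unknown postgres_database")
    return problems


if __name__ == "__main__":
    # All names below are made up for illustration.
    example_streams = {
        "oranges-stream": {"cosmos_database_id": "fruit-db", "postgres_database": "dwh"},
        "typo-stream": {"cosmos_database_id": "friut-db", "postgres_database": "dwh"},
    }
    for problem in unknown_references(example_streams, {"fruit-db"}, {"dwh"}):
        print(problem)
```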

Once you have configured the streams.yml file, you can use anaxi to execute the syncs. See more details in the next section.

Calling anaxi

You can run a healthcheck against any Cosmos DB database like this:

anaxi cosmos-db-healthcheck --cosmos-db-id <your-db-id>

You can run a healthcheck against Postgres databases like this:

anaxi postgres-healthcheck --postgres-database <your-db-name>

To run a sync job, you can use:

anaxi sync-stream --stream-id <your-stream-name>
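If you have several streams, one possible way to run them in sequence is a small wrapper around the sync-stream command shown above. This is a sketch, not something shipped with anaxi: the stream names are made up, and it assumes that anaxi is on your PATH and that it exits with a non-zero code when a sync fails, which this README does not confirm.

```python
import subprocess


def sync_command(stream_id: str) -> list[str]:
    """Build the argument list for the sync-stream call shown above."""
    return ["anaxi", "sync-stream", "--stream-id", stream_id]


def sync_all(stream_ids: list[str]) -> dict[str, int]:
    """Run each sync in turn and map each stream id to its process return code."""
    results = {}
    for stream_id in stream_ids:
        # check=False so that one failing stream does not stop the others.
        completed = subprocess.run(sync_command(stream_id), check=False)
        results[stream_id] = completed.returncode
    return results


if __name__ == "__main__":
    # Hypothetical stream names; use the ones defined in your streams.yml.
    for name in ["oranges-stream", "lemons-stream"]:
        print(" ".join(sync_command(name)))
```

Running the streams one after another matches the fact that streams are independent: a failure in one leaves the rest untouched.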

Relevant internals and implementation details

Tracking checkpoints

anaxi keeps track of the most recent timestamp it has committed (the checkpoint) for each stream to keep syncs incremental. Checkpoints are stored in ~/.anaxi/checkpoints/, with one file dedicated to each stream. Each file is named after the stream it tracks (stream some-stream will have its checkpoint stored at ~/.anaxi/checkpoints/some-stream.yml). The timestamp is stored in UTC.

This implies two important facts:

  • If the checkpoint file gets deleted, altered, corrupted, or whatever dramatic event happens to it, the checkpoint will be lost.
  • You can modify the file to manipulate the behaviour of the next sync for any given stream. For example, if you want to run a full-refresh, you can simply delete the file.

On the other hand, anaxi will never delete anything on destination in any situation. Be careful when running full-refreshes, since you will need to decide whether you want to remove data from destination or are happy having duplicate records there.
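Triggering a full-refresh by deleting the checkpoint file can be sketched as follows. The directory and the naming rule come from this README, while the checkpoint_path and reset_checkpoint helpers are hypothetical names, not anaxi functions. The sketch does nothing about data already present in the destination table.

```python
from pathlib import Path

# Location taken from this README.
CHECKPOINT_DIR = Path.home() / ".anaxi" / "checkpoints"


def checkpoint_path(stream_name: str, checkpoint_dir: Path = CHECKPOINT_DIR) -> Path:
    """Return the checkpoint file for a stream, following the naming rule above."""
    return checkpoint_dir / f"{stream_name}.yml"


def reset_checkpoint(stream_name: str, checkpoint_dir: Path = CHECKPOINT_DIR) -> bool:
    """Delete a stream's checkpoint file; return True if a file was removed."""
    path = checkpoint_path(stream_name, checkpoint_dir)
    if path.is_file():
        path.unlink()
        return True
    return False


if __name__ == "__main__":
    # Prints the path only; call reset_checkpoint("some-stream") to actually delete it.
    print(checkpoint_path("some-stream"))
```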

  • More-than-once delivery
  • deletes
  • missing intermediate steps due to live mode
  • same-container-name in different databases

Development

Local Cosmos DB

Microsoft provides tools to run a local emulator of Cosmos DB. The bad news is that we have been unable to make it work so far: it always breaks for one reason or another.

You can find instructions here:

What's with the name

Anaxi is short for Anaximander. Anaximander was a pre-Socratic Greek philosopher who lived in Miletus. He is often called the "Father of Cosmology" and founder of astronomy.