Careful with the DB: How to work in SQL Server without giving Pablo a stroke
This is a brief guide on how to work with the SQL Server DB we have in our backend without giving the Data Team a hard time.
TLDR
Short and sweet:
- Give all tables a Primary Key.
- `UpdatedDate` fields are massively useful for the Data Team. Do include them where possible, especially in tables that grow large over time.
- Respect `UpdatedDate` fields.
- If you don't respect the `UpdatedDate`, let the Data Team know. No hard feelings; hiding it will just be worse.
- Before deleting/renaming tables and columns, please give us a call to check that the change won't cause incidents in downstream dependencies.
SQL Server and the Data Team
What we do
The Data Team centralizes a lot of Superhog's data sources in the DWH. There are good reasons for this that go beyond the scope of this document. Our SQL Server in the backend (which we usually refer to as Core) is one of these sources.
To achieve this, we replicate Core's data in the DWH, regularly syncing at different frequencies, from daily to hourly.
How we do it
We use a tool called Airbyte. This replication is very simple: we just move the data as-is, without any kind of transformation at all. It’s not the typical ETL, but rather just EL (extract, load).
There are two ways we can replicate each individual table from Core in the DWH:
- Full refresh: on a scheduled basis, we destroy the replicated table in the DWH and read all the data from the source again.
- Incremental: we only pick up new changes. If a record gets updated in Core, we update it in the DWH as well. If a record gets created in Core, we create it in the DWH as well.
As you might guess, we prefer incremental loads: they are faster and lighter on both source and destination. But they are only possible for tables that have a specified PK and a well-maintained `UpdatedDate` field.
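To make that concrete, here is roughly what an incremental sync boils down to. This is a hypothetical sketch, not Airbyte's actual implementation, and the table, columns and cursor variable are made-up names:

```sql
-- Hypothetical sketch of an incremental sync (not Airbyte's actual code).
-- @LastCursor is the highest UpdatedDate value seen in the previous run.
DECLARE @LastCursor DATETIME2 = '2024-01-01T00:00:00';

SELECT Id, GuestName, Status, UpdatedDate
FROM   dbo.Booking
WHERE  UpdatedDate > @LastCursor  -- only rows changed since the last sync
ORDER BY UpdatedDate;             -- results are upserted into the DWH by PK
```

If `UpdatedDate` is missing or not maintained, there is nothing to filter on, and the only safe option left is a full refresh.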
We highly appreciate tables created with `UpdatedDate` fields (or `CreatedDate`, if the table is fully immutable). This is especially important for tables that are large or will eventually grow large (just to throw a number out for the sake of not being ambiguous, let's say a million records makes a table large).
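As a minimal sketch of what such a table could look like (all names here are illustrative, not a prescribed convention):

```sql
-- Illustrative replication-friendly table: a Primary Key for upserts plus
-- CreatedDate/UpdatedDate kept in UTC.
CREATE TABLE dbo.Booking
(
    Id          INT IDENTITY(1, 1) NOT NULL,
    GuestName   NVARCHAR(200)      NOT NULL,
    Status      NVARCHAR(50)       NOT NULL,
    CreatedDate DATETIME2 NOT NULL CONSTRAINT DF_Booking_CreatedDate DEFAULT (GETUTCDATE()),
    UpdatedDate DATETIME2 NOT NULL CONSTRAINT DF_Booking_UpdatedDate DEFAULT (GETUTCDATE()),
    CONSTRAINT PK_Booking PRIMARY KEY CLUSTERED (Id)
);
```

Keep in mind that a `DEFAULT` constraint only fires on `INSERT`; keeping `UpdatedDate` honest on updates is up to application code (or a trigger), which is exactly where the mistakes below come from.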
Mistakes, ways to make them and remediation
Not respecting UpdatedDate
This is the most common situation. It feels easy to avoid, but there are sneaky ways to get it wrong. Here are some examples of things that have led to it in the past:
- Code bugs.
- You ran an `UPDATE` manually in production because some emergency required it, but forgot to include the `UpdatedDate` field with a `GETUTCDATE()` (see the sketch after this list).
- You've created a nice seeding method within the migrations of the database, but the seeding statement doesn't update the `UpdatedDate`.
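For the manual-update case, here is the emergency fix done right (table, columns and values are made up):

```sql
-- Emergency data fix, done right: touch UpdatedDate in the same statement.
-- dbo.Booking, Status and Id are illustrative names.
UPDATE dbo.Booking
SET    Status      = 'Cancelled',
       UpdatedDate = GETUTCDATE()  -- without this line the DWH never sees the change
WHERE  Id = 12345;
```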
The technical impact is that records where the field was not honored will not be replicated in the DWH, causing the data in the DWH and Core to drift. The business impact can range from none at all to multiplying the revenue in a TMT report by 25.
If it does happen, please get in touch and let us know. We can solve the situation by running a full refresh between Core and the DWH. Depending on the size of the table, this may be problematic and cause some disruption, but leaving the drift in place will surely be worse.
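One rough way to gauge the size of the drift before committing to a full refresh (purely illustrative: run the same query against Core and against the DWH replica and compare the results):

```sql
-- Rough drift gauge: compare row count and latest change timestamp
-- between Core and the DWH copy of the same table (illustrative names).
SELECT COUNT(*)         AS TotalRows,
       MAX(UpdatedDate) AS LastChange
FROM   dbo.Booking;
```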
Breaking schema changes (deletes/renames) without alerting the Data Team
If you remove or rename a table or column that we are replicating in the DWH without notice, our pipelines and all their downstream dependencies will break.
The easiest way to prevent this is to keep us in the loop. We do our best to let you know quickly whether your change needs coordination with us. 95% of the time we will tell you on the same day that the change doesn't affect us and give you a green light straight away.
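When a rename genuinely has to happen, one pattern that avoids breaking replication mid-flight is expand-and-contract: add the new column, backfill it, and only drop the old one once everything downstream has switched over. A minimal sketch, assuming illustrative names and prior coordination with us:

```sql
-- Step 1: add the new column alongside the old one.
ALTER TABLE dbo.Booking ADD GuestFullName NVARCHAR(200) NULL;
GO  -- separate batch, so the new column is visible to the UPDATE below

-- Step 2: backfill while the old column keeps serving the pipelines.
UPDATE dbo.Booking
SET    GuestFullName = GuestName,
       UpdatedDate   = GETUTCDATE();  -- so the backfill itself replicates

-- Step 3, later, once the Data Team confirms nothing reads GuestName:
-- ALTER TABLE dbo.Booking DROP COLUMN GuestName;
```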
If this goes uncommunicated, the remediation will be fully on our side: we will surely notice, because smoke will start coming out of the pipelines and the DWH. We will be happy to run a post-mortem together to understand what went wrong and prevent it in the future.