sh-notion/notion_data_team_no_files/Careful with the DB How to work in SQL Server with 405c497b76c74bb29dcc790bc59928fd.md
Pablo Martin a256b48b01 pages
2025-07-11 16:15:17 +02:00

56 lines
No EOL
4.1 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Careful with the DB: How to work in SQL Server without giving Pablo a stroke
This is a brief guide in how to work on the SQL Server DB we have for our backend without giving the Data Team a hard time.
# TLDR
Short and sweet:
- Give all tables a Primary Key.
- `UpdatedDate` fields are massively useful for the Data team. Do include them where possible, specially in tables that grow large over time.
- Respect `UpdatedDate` fields.
- If you dont respect the `UpdatedDate`, let the Data team know. No hard feelings, hiding it will just be worse.
- Before deleting/renaming tables and columns, please give us a call to check that the change wont cause incidents in downstream dependencies.
# SQL Server and the Data Team
## What we do
The Data Team centralizes a lot of Superhogs data sources in the DWH. There are good reasons for this that go beyond the scope of this document. Our SQL Server in the backend (which we usually refer to as Core) is one of theses sources.
To achieve this, we replicate Cores data in the DWH, regularly syncing data on different frequencies, from daily to hourly.
## How we do it
We use a tool called Airbyte. This replication is very simple: we just move the data as-is, without any kind of transformation at all. Its not the typical ETL, but rather just EL (extract, load).
There are two ways we can replicate each individual table from Core in the DWH:
- **Full refresh:** on a scheduled basis, we destroy the replicated table in the DWH and read all the data from the source again.
- **Incremental**: we only pick up new changes. If a record gets updated in Core, we update it in the DWH as well. If a record gets created in Core, we create it in the DWH as well.
As you might be guessing, we prefer incremental loads: they are faster and lighter on both source and destination. But these are only possible for tables that have a specified PK and a well-maintained `UpdatedDate` field.
We highly appreciate creating tables with `UpdatedDate` fields (or `CreatedDate`, if the table is fully immutable). This is specially important for tables that are large or will eventually grow into being large (just to throw a number here for the sake of not being ambiguous, lets say a million records makes a table large).
## Mistakes, ways to make them and remediation
### Not respecting `UpdatedDate`
Most common situation. This feels easy to avoid, but there are sneaky ways to make mistakes with this. Heres a list of examples of things that led to this happening in the past:
- Code bugs.
- You went to run an `UPDATE` manually in production because some emergency required it, but you didnt forget to include the `UpdatedDate` field with a `GETUTCDATE()`.
- Youve created a nice seeding method within the migrations of the database, but the seeding statement doesnt update the `UpdatedDate`
The technical impact of this is that the records where the field was not honored will *not* be replicated in the DWH, causing the data in the DWH and Core to drift. The business impact can go all the way from none at all to multiply the revenue in a TMT report by 25.
If it does happen, please, get in touch and let us know. We can solve the situation by running a full-refresh between Core and the DWH. Depending on the size of the table, this may be problematic and cause some disruption. But leaving the drift be will surely be worse.
### Breaking schema changes (deletes/renames) without alerting the Data Team
If you remove or rename a table or column that we are replicating in the DWH without notice, our pipelines and all their downstream dependencies will break.
The easiest way to prevent this is to have us in the loop. We make our best effort to quickly let you know if your change needs coordination with us. 95% we will just let you know on the same day that the change doesnt affect us and give you a green light straight away.
If this goes uncommunicated, the remediation will be fully on our side: we will surely notice because smoke will start coming out of the pipelines and DWH. We will be happy to run a post-mortem together to understand what went wrong to prevent it in the future.