---
title: "Great Expectations in Lolamarket"
slug: "great-expectations-in-lolamarket"
tags: "stack,data,great expectations,data quality"
author: "Pablo Martin"
---
# Great Expectations in Lolamarket
Hi there, and welcome to the start of a new series on our blog: the Great Expectations series. In this series, we will discuss how we use Great Expectations, a data quality package.
## Introducing Great Expectations
As a data engineer, ensuring the quality of your data is crucial to the success of your company's applications. Unfortunately, keeping data accurate, complete, and consistent gets harder as the number of data sources grows. At Lolamarket, we have faced several problems related to data quality, such as bad data slipping into our data warehouse unnoticed, pipelines breaking due to bad data, and a lack of centralized documentation describing our tables and what the data in them should look like. These challenges pushed us to look for a solution that would help us ensure the quality of our data, and that's where Great Expectations comes in.
![](https://images.ctfassets.net/ycwst8v1r2x5/jbrHhqGtdpbZFhki5MqBp/e6a5f6b567173b39430a1a18d060cb8e/gx_logo_horiz_color.png?w=1032&h=275&q=50&fm=png)
[Great Expectations](https://greatexpectations.io/) is an open-source tool for ensuring data quality that has become increasingly popular in the data engineering and analytics communities. It comes in the form of a [Python package](https://pypi.org/project/great-expectations/). The tool is designed to help data teams set, manage, and validate expectations for their data across all of their pipelines, and ensure that their data meets those expectations. With Great Expectations, data teams can proactively monitor their data, detect issues before they become critical, and quickly respond to data-related problems.
Great Expectations is a flexible tool that can be used in a variety of settings, including data warehousing, ETL pipelines, and analytics workflows. It supports a wide range of data sources, including SQL databases, NoSQL databases, and data lakes. The tool is also highly customizable, with a flexible API that allows data teams to create and execute their tests, integrate with other tools in their data stack, and extend the tool's functionality.
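To give a feel for what this looks like in code, here is a minimal sketch, assuming a recent release of the library; the `orders.csv` file and `order_id` column are hypothetical placeholders, not part of our actual pipelines:
```python
import great_expectations as gx

# Sketch only: "orders.csv" and "order_id" are hypothetical placeholders.
context = gx.get_context()
validator = context.sources.pandas_default.read_csv("orders.csv")

# Each expectation is evaluated against the loaded batch as soon as it is declared.
result = validator.expect_column_values_to_not_be_null("order_id")
print(result.success)
```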
## The problems that come with bad data
At Lolamarket, we faced a number of problems related to data quality. These are common problems that any data team can run into, and they will probably feel familiar if you work in the industry.
One of the biggest problems we faced was that bad data would sometimes slip into our data warehouse silently, without anyone noticing. This caused significant problems downstream, as the business would see incorrect data in our data products. It led to poor decisions and eroded trust in our data, and it wasted time and resources on top of being plain frustrating.
Another challenge we faced was that bad data could also cause an ETL job to fail. This meant that our engineers had to reproduce the faulty environment, including the data, to debug the problem. This was a cumbersome and time-consuming process that required a lot of data engineering firepower. We knew we had to improve our logging so that, whenever one of these failures happened, spotting the issue would be as straightforward as possible.
![https://i.kym-cdn.com/photos/images/original/001/430/367/007.gif](https://i.kym-cdn.com/photos/images/original/001/430/367/007.gif)
*My colleague trying to rubber-duck the ETL failure he has been debugging for 3 days straight with me.*
Finally, we realized that without documentation on what our data should look like, new joiners and other colleagues had to ask data team members about it. This added another layer of communication that was not always necessary, and it wasted valuable time that could be spent on more important tasks.
## Using Great Expectations at Lolamarket
At Lolamarket, we are now using Great Expectations to ensure the quality of the data throughout our infrastructure.
First and foremost, we use Great Expectations to set expectations for our data at different stages of the ETL pipeline. This includes expectations for our data sources, the transformation logic, and the final data product. By setting these expectations, we can proactively monitor our data, detect issues before they become critical, and ensure that our data meets the criteria we've set. This has been instrumental in helping us catch issues before they impact our applications, which has been critical to the success of our business.
```python
# These column-level expectations are declared on a validator (e.g. one obtained
# from a data context), which evaluates them against a batch of data.
# Expect the "age" column to contain no null values
validator.expect_column_values_to_not_be_null("age")
# Expect values in the "name" column to contain no whitespace characters
validator.expect_column_values_to_not_match_regex("name", r"\s+")
# Expect values in the "email" column to look like valid email addresses
validator.expect_column_values_to_match_regex("email", r"[^@]+@[^@]+\.[^@]+")
```
*A simple example of how Great Expectations allows you to define what your data should be like.*
In addition to setting expectations for our data, we also use Great Expectations to quickly notify stakeholders when any of our tests fail. This includes owners of data sources, the owner of the ETL, or internal customers that can be impacted by the issue. By quickly notifying stakeholders, we can reduce the amount of time it takes to identify and fix issues, which ultimately helps us save time and resources.
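As a rough sketch of how that alerting can be wired up, the snippet below defines the kind of action list that gets attached to a checkpoint; the names and the webhook variable are placeholders, and the exact configuration keys may vary between Great Expectations releases:
```python
# A hedged sketch of the Slack alerting piece: a list like this is passed as the
# action_list of a checkpoint, so every validation result is stored and a Slack
# message is posted whenever a validation fails. Names and the webhook variable
# are hypothetical placeholders.
slack_alert_actions = [
    {
        "name": "store_validation_result",
        "action": {"class_name": "StoreValidationResultAction"},
    },
    {
        # Posts a summary to Slack only when a validation fails.
        "name": "send_slack_notification",
        "action": {
            "class_name": "SlackNotificationAction",
            "slack_webhook": "${SLACK_ALERTS_WEBHOOK}",
            "notify_on": "failure",
            "renderer": {
                "module_name": "great_expectations.render.renderer.slack_renderer",
                "class_name": "SlackRenderer",
            },
        },
    },
]
```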
Great Expectations also helps us increase the observability of issues, making debugging much easier and reducing the workload of the engineer looking into the issue. By setting up tests at different points of the ETL flow, we can quickly identify where issues are occurring and diagnose the root cause. For some of the data tests we run, we also store the faulty data in a quarantine schema whenever the test fails, so we can easily inspect the data that caused the issue. This makes it easier to pinpoint problems and quickly reach a solution, which has been instrumental in helping us maintain the reliability of our applications.
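The quarantine step itself is not a Great Expectations feature, just plain Python glued around the validation result. A simplified sketch of the idea, assuming a pandas DataFrame, a SQLAlchemy engine, a `quarantine` schema (all hypothetical names), a validation result converted to a plain dict, and expectations run with `result_format="COMPLETE"` so the failing row indexes are included:
```python
import pandas as pd
import sqlalchemy as sa

# Hypothetical connection string and schema name, for illustration only.
engine = sa.create_engine("mysql+pymysql://etl_user:***@dwh-host/analytics")

def quarantine_failed_rows(df: pd.DataFrame, validation_result: dict, table: str) -> None:
    """Copy rows that failed any column-map expectation into the quarantine schema."""
    bad_indexes: set = set()
    for res in validation_result["results"]:
        if not res["success"]:
            # unexpected_index_list is populated when the expectation was run
            # with result_format="COMPLETE".
            bad_indexes.update(res["result"].get("unexpected_index_list", []))
    if bad_indexes:
        # Append the offending rows to quarantine.<table> so they can be inspected later.
        df.loc[sorted(bad_indexes)].to_sql(
            table, engine, schema="quarantine", if_exists="append", index=False
        )
```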
![[ge_alert_example.png]]
*An example of our beloved Slack alerts bot letting us know that a data test failed.*
## Why we chose Great Expectations
At Lolamarket, we evaluated a variety of tools for ensuring data quality before deciding to go with Great Expectations. While there are many comparable tools on the market, a few factors tipped the balance in favor of Great Expectations.
One of the biggest reasons we chose Great Expectations is its flexibility. The tool is open source and highly customizable, with a flexible API that allows us to create and execute our own tests, integrate with other tools in our data stack, and extend its functionality. This has been critical in helping us tailor the tool to our specific needs and integrate it seamlessly into our existing data infrastructure. We really dislike getting locked into tools that we can't extend when we need to, so Great Expectations had the upper hand here over commercial alternatives.
The previous point also leads us to money: Great Expectations didn't come with much of a bill for our team. It played nicely out of the box with our existing infrastructure, and its resource consumption in our cloud is too small to even worry about. We can say that, besides the time and effort invested in adopting it, Great Expectations didn't really cost us anything. We like to keep things lean.
![[money-shop.gif]]
*Our finance team, counting the money we didn't spend on a Data Quality tool.*
Another factor that played a big role in our decision to choose Great Expectations is its ease of use. The tool is well documented, with plenty of resources that help users get up and running quickly. This has been critical in helping us get the most out of it while minimizing the time and resources we need to invest in training our team.
Finally, the tool plays nicely with our stack. We can easily integrate Great Expectations checkpoints into our Prefect flows since it's all just Python. Great Expectations has no trouble reading data from MySQL, the most common database server at Lolamarket. And configuring an S3 bucket to hold Great Expectations metadata and logs was trivial.
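As an illustration of the Prefect side, here is a minimal sketch of a task that runs a pre-configured checkpoint and fails the flow run if the validation does not pass; the checkpoint and flow names are hypothetical, and it assumes a file-backed Great Expectations project is available to the flow:
```python
import great_expectations as gx
from prefect import flow, task

@task
def validate_orders() -> None:
    # Load the on-disk Great Expectations project and run a pre-configured checkpoint.
    context = gx.get_context()
    result = context.run_checkpoint(checkpoint_name="orders_checkpoint")  # hypothetical name
    if not result.success:
        # Raising here fails the Prefect task, so the rest of the flow does not
        # run on top of bad data.
        raise ValueError("Data quality checkpoint 'orders_checkpoint' failed")

@flow
def nightly_orders_etl():
    validate_orders()

if __name__ == "__main__":
    nightly_orders_etl()
```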
## More to come
In this post, we've introduced you to Great Expectations, an open-source tool for ensuring data quality, and how we're using it at Lolamarket to overcome data quality problems. In future posts, we'll dive deeper into our use of Great Expectations, including how we deploy it and how we use it to enforce data contracts with our backend teams. We'll also share tips and tricks we've learned along the way. Stay tuned!