everything

This commit is contained in:
pablo 2023-11-17 15:13:27 +01:00
commit 8f7278c6aa
105 changed files with 1206534 additions and 0 deletions

notes/blog/README.md

@@ -0,0 +1,53 @@
This repo stores blog entries from the Lolamarket Engineering Blog.
Series:
- Prefect series
  - How we use Prefect
  - Our Prefect deployment
  - Where Prefect shines
  - Where Prefect doesn't shine
  - How to organize flows
  - DRY and our complementary toolbox
- Great Expectations series
  - [x] GE in Lolamarket
  - Our deployment with AWS and Slack
  - Data Contracts with backend
  - Where GE shines
  - Where GE doesn't shine
- Trino series
  - [x] Trino in Lolamarket
  - Challenges we have faced with Trino
  - Where Trino shines
  - Where Trino doesn't shine
  - Tips and tricks when working with Trino
  - Differences between Trino SQL and MySQL SQL
  - What our deployment looks like
- Looker series
  - How we use Looker
  - Migrating from Metabase to Looker
  - Why Looker beats other visualization tools
  - Challenges we have faced with Looker
- Interviewing series
  - How we do interviews
  - How to prepare for an interview
- Business-oriented series
  - Year 2022 in numbers
- Airbyte series
  - Airbyte in Lolamarket
  - Deployment of Airbyte
  - Where Airbyte shines
  - Where Airbyte does not shine
---
## ChatGPT context helper
A few things for you to take into account:
- I work as a Data Engineer.
- My company is called Lolamarket.
- This post will be published on our tech and engineering blog.
- The purpose of the post is to show the world how we use our technologies, hopefully impressing and attracting potential candidates for technical positions.
- I want the post to showcase great technical skill and knowledge.


@@ -0,0 +1,80 @@
---
title: "Great Expectations in Lolamarket"
slug: "great-expectations-in-lolamarket"
tags: "stack,data,great expectations,data quality"
author: "Pablo Martin"
---
# Great Expectations in Lolamarket
Hi there, and welcome to the start of a new series in our blog: the Great Expectations series. In this series, we will discuss how we use Great Expectations, a data quality package.
## Introducing Great Expectations
As a data engineer, ensuring the quality of your data is crucial to the success of your company's applications. Unfortunately, it can be challenging to make sure that your data is accurate, complete, and consistent, especially as the number of data sources grows. At Lolamarket, we have faced several problems related to data quality, such as bad data slipping into our data warehouse unnoticed, pipelines breaking due to bad data, and no centralized documentation for tables and what the data in them should look like. These challenges inspired us to seek a solution that would help us ensure the quality of our data, and that's where Great Expectations comes in.
![](https://images.ctfassets.net/ycwst8v1r2x5/jbrHhqGtdpbZFhki5MqBp/e6a5f6b567173b39430a1a18d060cb8e/gx_logo_horiz_color.png?w=1032&h=275&q=50&fm=png)
[Great Expectations](https://greatexpectations.io/) is an open-source tool for ensuring data quality that has become increasingly popular in the data engineering and analytics communities. It comes in the form of a [Python package](https://pypi.org/project/great-expectations/). The tool is designed to help data teams set, manage, and validate expectations for their data across all of their pipelines, and ensure that their data meets those expectations. With Great Expectations, data teams can proactively monitor their data, detect issues before they become critical, and quickly respond to data-related problems.
Great Expectations is a flexible tool that can be used in a variety of settings, including data warehousing, ETL pipelines, and analytics workflows. It supports a wide range of data sources, including SQL databases, NoSQL databases, and data lakes. The tool is also highly customizable, with a flexible API that allows data teams to create and execute their tests, integrate with other tools in their data stack, and extend the tool's functionality.
## The problems that come with bad data
At Lolamarket, we faced a number of problems related to data quality. These are common problems that any other data team can face and will probably feel familiar if you work in the industry.
One of the biggest problems we faced was that bad data would sometimes slip into our data warehouse without anyone noticing. This caused significant problems downstream, as the business would see incorrect data in our data products, leading to poor decisions and eroding trust in our data. It was frustrating and caused us to waste time and resources.
Another challenge we faced was that bad data could also cause an ETL job to fail. This meant that our engineers had to reproduce the faulty environment, including data, to debug the problem. This was a cumbersome and time-consuming process that required a lot of data engineering firepower. We knew we had to find a way to improve our logging capabilities so that, whenever one of these failures would happen, spotting the issue would be as straightforward as possible.
![https://i.kym-cdn.com/photos/images/original/001/430/367/007.gif](https://i.kym-cdn.com/photos/images/original/001/430/367/007.gif)
*My colleague trying to rubber-duck with me the ETL failure he has been debugging for 3 days straight.*
Finally, we realized that without documentation on what our data should look like, new joiners and other colleagues had to ask data team members about it. This added another layer of communication that was not always necessary, and it wasted valuable time that could be spent on more important tasks.
## Using Great Expectations at Lolamarket
At Lolamarket, we are now using Great Expectations to ensure the quality of the data throughout our infrastructure.
First and foremost, we use Great Expectations to set expectations for our data at different stages of the ETL pipeline. This includes expectations for our data sources, the transformation logic, and the final data product. By setting these expectations, we can proactively monitor our data, detect issues before they become critical, and ensure that our data meets the criteria we've set. This has been instrumental in helping us catch issues before they impact our applications, which has been critical to the success of our business.
```python
# `validator` is a Great Expectations Validator wrapping a batch of data (setup omitted)
# Expect the "age" column to contain no null values
validator.expect_column_values_to_not_be_null("age")
# Expect values in the "name" column to contain no whitespace characters
validator.expect_column_values_to_not_match_regex("name", r"\s+")
# Expect the "email" column to contain valid email addresses
validator.expect_column_values_to_match_regex("email", r"[^@]+@[^@]+\.[^@]+")
```
*A simple example of how Great Expectations allows you to define what your data should be like.*
In addition to setting expectations for our data, we also use Great Expectations to quickly notify stakeholders when any of our tests fail. This includes owners of data sources, the owner of the ETL, or internal customers that can be impacted by the issue. By quickly notifying stakeholders, we can reduce the amount of time it takes to identify and fix issues, which ultimately helps us save time and resources.
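Roughly, the wiring for such a notification can be as simple as checking a validation result and posting to a Slack incoming webhook. Here is a minimal sketch using the classic Pandas interface; the webhook URL, the `validate_and_alert` helper and the expectations themselves are illustrative rather than our production setup, which runs through checkpoints:
```python
import great_expectations as ge
import pandas as pd
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # hypothetical incoming webhook

def validate_and_alert(df: pd.DataFrame, table_name: str) -> bool:
    """Run a couple of expectations on a DataFrame and ping Slack if any of them fail."""
    dataset = ge.from_pandas(df)  # classic Pandas-based interface
    dataset.expect_column_values_to_not_be_null("age")
    dataset.expect_column_values_to_match_regex("email", r"[^@]+@[^@]+\.[^@]+")

    result = dataset.validate()  # evaluates every expectation registered above
    if not result.success:
        failed = [
            r.expectation_config.expectation_type
            for r in result.results
            if not r.success
        ]
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f"Data test failed on `{table_name}`: {failed}"},
            timeout=10,
        )
    return result.success
```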
Great Expectations also helps us increase the observability of issues, making debugging much easier and reducing the workload of the engineer looking into the issue. By setting up tests at different points of the ETL flow, we can quickly identify where issues are occurring and diagnose the root cause. Also, in some of the data tests we run, we will store the faulty data in a quarantine schema whenever the test fails. This way, we can easily check the data that caused the issue. This makes it easier to identify issues and quickly reach a solution, which has been instrumental in helping us maintain the reliability of our applications.
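To make that quarantine step a bit more tangible, here is a simplified sketch of the pattern: validate a batch, and if an expectation fails, park the offending rows in a quarantine table before letting the clean rows continue. The connection string, table names and the `quarantine_bad_rows` helper are illustrative:
```python
import great_expectations as ge
import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection string; in practice this would come from our secrets management.
engine = create_engine("mysql+pymysql://etl_user:password@warehouse-host/quarantine")

def quarantine_bad_rows(df: pd.DataFrame, table_name: str) -> pd.DataFrame:
    """Validate a batch and move rows that break the expectation into a quarantine schema."""
    dataset = ge.from_pandas(df)
    # result_format="COMPLETE" asks Great Expectations to also return the offending row indexes.
    check = dataset.expect_column_values_to_match_regex(
        "email", r"[^@]+@[^@]+\.[^@]+", result_format="COMPLETE"
    )
    if check.success:
        return df

    bad_index = check.result["unexpected_index_list"]
    # Park the faulty rows in the quarantine schema for later inspection.
    df.loc[bad_index].to_sql(f"{table_name}_quarantine", engine, if_exists="append", index=False)
    # Only the healthy rows move on to the next stage of the pipeline.
    return df.drop(index=bad_index)
```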
![[ge_alert_example.png]]
*An example of our beloved Slack alerts bot letting us know that a data test failed.*
## Why we chose Great Expectations
At Lolamarket, we evaluated a variety of tools for ensuring data quality before deciding to go with Great Expectations. While there are many comparable tools on the market, we ultimately chose Great Expectations for a few reasons.
One of the biggest reasons we chose Great Expectations is its flexibility. The tool is open source and highly customizable, with a flexible API that allows us to create and execute our own tests, integrate with other tools in our data stack, and extend the tool's functionality. This has been critical to helping us tailor the tool to our specific needs and integrate it seamlessly into our existing data infrastructure. We really dislike getting locked in with tools that we can't extend when we need to, so Great Expectations had the upper hand here against commercial alternatives.
The previous point also leads us to money: Great Expectations didn't come with much of a bill for our team. It played nicely out of the box with our existing infrastructure, and its resource consumption in our cloud is not enough for us to even worry about. We can say that, besides the time and effort invested in adopting it, Great Expectations didn't really cost us anything. We like to keep things lean.
![[money-shop.gif]]
*Our finance team, counting the money we didn't spend on a Data Quality tool.*
Another factor that played a big role in our decision is Great Expectations' ease of use. It is well documented, with plenty of resources to help users get up and running quickly. This has been critical in helping us get the most out of the tool while minimizing the time and resources we need to invest in training our team.
Finally, the tool plays nicely with our stack. We can easily integrate Great Expectations checkpoints within our Prefect flows since it's just Python. Great Expectations has no trouble reading data from MySQL, which is the most frequent database server in Lolamarket. And configuring an S3 bucket to hold different metadata and logs from Great Expectations was trivial.
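To give you an idea of how little plumbing this requires, here is a rough sketch of a Prefect 2 flow with a validation step in the middle. The checkpoint name, task names and the surrounding extract/load steps are made up for the example; the only assumption is that a Data Context and a checkpoint already exist:
```python
import great_expectations as gx
from prefect import flow, task

@task(retries=2)
def extract_orders():
    ...  # placeholder for the real extraction step

@task
def validate_orders() -> None:
    """Run a pre-configured Great Expectations checkpoint and fail the task if it fails."""
    context = gx.get_context()
    result = context.run_checkpoint(checkpoint_name="orders_checkpoint")
    if not result.success:
        raise ValueError("Data quality checkpoint failed for orders")

@task
def load_orders():
    ...  # placeholder for the real load step

@flow(name="orders-etl")
def orders_etl():
    extract_orders()
    validate_orders()  # the flow stops here if the data does not meet expectations
    load_orders()
```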
## More to come
In this post, we've introduced you to Great Expectations, an open-source tool for ensuring data quality, and how we're using it at Lolamarket to overcome data quality problems. In future posts, we'll dive deeper into our use of Great Expectations, including how we deploy it and how we use it to enforce data contracts with our backend teams. We'll also share tips and tricks we've learned along the way. Stay tuned!


@@ -0,0 +1,84 @@
---
title: "Trino in Lolamarket"
slug: "trino-in-lolamarket"
tags: "stack,data,trino"
author: "Pablo Martin"
---
# Trino in Lolamarket
Hi there, and welcome to the start of a new series in our blog: the Trino series. In this series, we will discuss how we use Trino, the distributed query engine, in our company.
## What is Trino
If you are familiar with Trino (formerly known as *PrestoSQL*), you can probably skip this section. If that's not the case, stay with us and we will introduce you to it.
![](https://trino.io/assets/trino-og.png)
According to [the official Trino page](https://trino.io/):
>Trino is a *fast distributed SQL query engine*.
There's a lot to unpack in there, so let's break it down:
- **Query Engine**: Trino is basically a piece of software that can throw queries at many different databases and get the results back for you. This is an interesting concept because it decouples the storage system you are using from how you query it, allowing you to abstract away its details.
- **Distributed**: Trino is designed to be horizontally scalable and follows a [coordinator-worker architecture](https://trino.io/docs/current/overview/concepts.html#server-types). You can have many Trino workers that split the effort of executing your queries. This allows for interesting performance properties, since many queries allow for high degrees of parallelism. It also makes Trino play nicely with distributed storage engines like [Apache Hive](https://hive.apache.org/), which lend themselves to highly parallelised access to data. Finally, one massive feature that comes with Trino is the ability to execute queries that fetch data across several data sources, even if they run on completely different technologies (picture joining data from MySQL, MongoDB and Hive in a single query).
- **SQL**: Trino implements a single SQL interface for all your queries. That means that, regardless of the underlying database you are fetching data from, Trino exposes the same SQL dialect. This means that anyone with SQL knowledge can go ahead and query any database that your Trino deployment has access to, even if the database is NoSQL.
- **Fast**: Trino boasts breakneck query speeds. In our opinion, this claim should come with a few disclaimers, since Trino is not a magic wand that will make *anything* run fast as hell. Nevertheless, it is true that Trino can enable very fast queries on large datasets with the right deployment and configuration. We will discuss this point in more detail in further posts of this series.
That's a lot to take in, eh? If all of this sounds great, it is because it is great. Now, just to make Trino something a bit more tangible in your mind, let us explain a bit how you connect to Trino and make queries. In Lolamarket, we generally connect to Trino in three ways:
**#1 Through SQL clients**
In Lolamarket, one of our go-to SQL clients is [DBeaver](https://dbeaver.io/). In DBeaver, connecting to Trino follows the same steps as connecting to any regular database, and Trino looks and feels like one. Once this is done, you can just query it the same way you would query any other database.
![](https://i.imgur.com/6CbH25H.jpeg)
**#2 With the Python client**
Most of our ETLs are written in Python. Thus, in order to connect to Trino, we use the `trino-python-client` ([github](https://github.com/trinodb/trino-python-client)). This package is compliant with the [Python DBAPI spec](https://peps.python.org/pep-0249/), so it mostly behaves as your usual SQL connector for Python. Since a code example is worth a thousand words, this is the quickstart example from the official repo:
```python
from trino.dbapi import connect
conn = connect(
    host="<host>",
    port=<port>,
    user="<username>",
    catalog="<catalog>",
    schema="<schema>",
)
cur = conn.cursor()
cur.execute("SELECT * FROM system.runtime.nodes")
rows = cur.fetchall()
```
**#3 Through Looker**
[Looker](https://www.looker.com/) is our go-to solution for visualization, reports and dashboards in Lolamarket. Luckily for us, Trino is one of [Looker's supported connectors](https://cloud.google.com/looker/docs/dialects). So, by connecting our Looker instance to Trino, we can build dashboards that fetch data from any of the databases where Trino itself is connected.
## Why we chose Trino
Trino is a complex tool that can suit many different needs and roles. In our case, Trino came into our stack for one very specific need and has slowly grown to play a bigger role in our data architecture.
Without going into much detail on the internals of our app backend, the system behind [Mercadão](https://mercadao.pt/) uses two different databases for persistence: a MongoDB instance and a MySQL instance. All the interesting info about our business and operations (customers, orders, delivery locations, shoppers, etc.) is split between these two. This posed a problem: questions that spanned data across the two different technologies needed a way to fetch information from both and combine it.
In the very early days of Mercadão, this was done by manually exporting data from both sources and crossing them as needed on someone's laptop. As you can imagine, as our company and operations scaled, this became a growing headache. Plus, as we evolved into a more mature data stack, we wanted to be able to connect our main visualization tool at the time ([Metabase](https://www.metabase.com/)) to data without lousy manual exports being a critical part of the process.
It was then that we identified Trino as a great solution for this problem. Trino offered us the ability to act as an abstraction layer on top of our MySQL+MongoDB combination, which translated into being able to join data across both databases as if they were a single one. Metabase had a connector to Trino, so we could integrate the two without further plumbing. As a result, by deploying Trino and connecting it to our datasources and Metabase, we were able to easily build queries and reports across both MySQL and MongoDB. This was great for productivity, and it also enabled savvy business users to access data without having to care about the details of the underlying infrastructure, which would be confusing and bizarre for them.
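To give a feel for what that looks like in practice, here is the kind of query this unlocks, submitted through the same Python client we show later in this post. The `mysql` and `mongodb` prefixes are Trino catalogs, each pointing at a different underlying database; the schema, table and column names are invented for the example:
```python
from trino.dbapi import connect

conn = connect(host="trino.internal", port=8080, user="analytics")
cur = conn.cursor()
# A single query joining a MySQL table with a MongoDB collection through Trino catalogs.
cur.execute(
    """
    SELECT o.order_id, o.total_amount, c.email
    FROM mysql.shop.orders AS o
    JOIN mongodb.app.customers AS c
      ON o.customer_id = c.customer_id
    WHERE o.created_at >= DATE '2023-01-01'
    """
)
rows = cur.fetchall()
```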
![](https://i.imgur.com/natqd2i.png)
In the end, it wasn't its distributed nature nor the promises on speed that led us to Trino, but simply the ability to hide two different databases under a single SQL connector. Time has passed since those early stages and some things have changed, but this feature is still the fundamental reason why Trino lives in our data infrastructure.
## How Trino fits in our architecture
Today, Trino is one of the pillars of our architecture. There are three main use-cases for Trino in our company:
1. **Act as a bridge between databases in ETL processes**: Trino is a great tool for our ETL system (which we will cover in other posts on the blog). Trino covers the gap between two different databases with something as simple as running an `INSERT INTO database_1.some_table (...) SELECT (...) FROM database_2.some_table` (see the sketch right after this list). This way, if you can write a SQL query, you can move data between the different databases in our company, as simple as that.
2. **Offering cross-db capabilities in Looker**: as we mentioned in the previous section, Trino enables queries across databases, which opens up a world of possibilities and convenience for our analysts who are working hard to build products in Looker to serve the business. Although not every report or dashboard in Lolamarket goes through Trino (we will cover some performance issues with Trino in future posts), a fair share of them fetch their data through Trino.
3. **Empower analysts**: analysts and advanced users can connect to Trino through SQL clients like DBeaver to do their own research, debug data-related issues, and generally access most data within the company from a single point. This way, we simplify their lives by allowing queries across different databases and by centralizing access and permissions. If we spin up a new database within the company, simply adding it to Trino makes it accessible to our team without requiring any action from the users themselves.
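Here is the sketch promised in point 1: a single `INSERT INTO ... SELECT` submitted through the Python client that reads from one catalog and writes into another. As before, the catalog, schema and table names are illustrative, not our real ones:
```python
from trino.dbapi import connect

conn = connect(host="trino.internal", port=8080, user="etl")
cur = conn.cursor()
# Read from the operational MySQL catalog and write into an analytics catalog in one statement.
cur.execute(
    """
    INSERT INTO lake.analytics.daily_orders (order_date, orders, revenue)
    SELECT CAST(created_at AS DATE), COUNT(*), SUM(total_amount)
    FROM mysql.shop.orders
    WHERE created_at >= DATE '2023-01-01'
    GROUP BY CAST(created_at AS DATE)
    """
)
cur.fetchall()  # for an INSERT, Trino returns the number of rows written
```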
## Just the beginning
Today's post was a brief intro to how Trino came into Lolamarket and what role it plays in our architecture. But there are many more details, insights and lessons learned that we can share. This post is just the beginning of our Trino series, where we will keep sharing our adventures with Trino. Stay tuned if you want to learn more!