---
title: "Trino in Lolamarket"
slug: "trino-in-lolamarket"
tags: "stack,data,trino"
author: "Pablo Martin"
---

# Trino in Lolamarket

Hi there, and welcome to the start of a new series on our blog: the Trino series. In this series, we will discuss how we use Trino, the distributed query engine, in our company.

## What is Trino

If you are already familiar with Trino (formerly known as *Presto*), you can probably skip this section. If that's not the case, stay with us and we will introduce you to it.

![A brief introduction to Trino](/assets/images/what-is-trino.png)

According to [the official Trino page](https://trino.io/):

>Trino is a *fast distributed SQL query engine*.

There's a lot to unpack in there, so let's break it down:

- **Query Engine**: Trino is basically a piece of software that can throw queries at many different databases and get the results back for you. This is an interesting concept because it decouples the storage system you are using from how you query it, allowing you to abstract away its details.

- **Distributed**: Trino is designed to be horizontally scalable and follows a [coordinator-worker architecture](https://trino.io/docs/current/overview/concepts.html#server-types). You can have many Trino workers that split the effort of executing your queries. This allows for interesting performance properties, since many queries allow for high degrees of parallelism. It also makes Trino play nicely with distributed storage engines like [Apache Hive](https://hive.apache.org/), which lend themselves to highly parallelised access to data. Finally, one massive feature that comes with Trino is the ability to execute queries that fetch data across several datasources, even if they run on completely different technologies (picture joining data from MySQL, MongoDB and Hive in a single query; there is a small sketch of this right after the list).

- **SQL**: Trino implements a single SQL interface for all your queries. That means that, regardless of the underlying database you are fetching data from, you always write the same SQL dialect. Anyone with SQL knowledge can go ahead and query any database that your Trino has access to, even if that database is NoSQL.

- **Fast**: Trino boasts breakneck query speeds. In our opinion, this claim should come with a few disclaimers, since Trino is not a magic wand that will make *anything* run fast as hell. Nevertheless, it is true that Trino can enable very fast queries on large datasets with the right deployment and configuration. We will discuss this point in more detail in future posts of this series.
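
To make the cross-datasource idea a bit more concrete, here is a minimal sketch of what such a federated query could look like. The catalog, schema and table names (`mysql`, `mongodb`, `shop`, `customers`, `orders`) are made up for illustration; in a real setup they would match the catalogs configured in your Trino deployment.

```sql
-- Illustrative federated query: each catalog points to a different backend.
-- All catalog, schema and table names here are hypothetical.
SELECT
    c.email,
    count(o.order_id) AS total_orders
FROM mysql.shop.customers AS c
JOIN mongodb.shop.orders AS o
    ON o.customer_id = c.customer_id
GROUP BY c.email
ORDER BY total_orders DESC
LIMIT 10;
```

Trino takes care of fetching the data from each source and performing the join itself, so the query reads as if everything lived in a single database.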

That's a lot to take in, eh? If all of this sounds great, it is because it is great. Now, just to make Trino a bit more tangible in your mind, let us explain how you connect to Trino and run queries. In Lolamarket, we generally connect to Trino in three ways:

**#1 Through SQL clients**

In Lolamarket, one of our go-to SQL clients is [DBeaver](https://dbeaver.io/). In DBeaver, connecting to Trino follows the same steps as connecting to any regular database, and Trino looks and feels like one. Once this is done, you can just query it the same way you would query any other database.

![Trino connection in DBeaver](/assets/images/trino-in-dbeaver.png)

**#2 With the Python client**

Most of our ETLs are written in Python. Thus, in order to connect to Trino, we use the `trino-python-client` ([github](https://github.com/trinodb/trino-python-client)). This package is compliant with the [Python DBAPI spec](https://peps.python.org/pep-0249/), so it mostly behaves like your usual SQL connector for Python. Since a code example is worth a thousand words, this is the quickstart example from the official repo:

```python
from trino.dbapi import connect

conn = connect(
    host="<host>",
    port=<port>,
    user="<username>",
    catalog="<catalog>",
    schema="<schema>",
)
cur = conn.cursor()
cur.execute("SELECT * FROM system.runtime.nodes")
rows = cur.fetchall()
```

**#3 Through Looker**

[Looker](https://www.looker.com/) is our go-to solution for visualization, reports and dashboards in Lolamarket. Luckily for us, Trino is one of [Looker's supported connectors](https://cloud.google.com/looker/docs/dialects). So, by connecting our Looker instance to Trino, we can build dashboards that fetch data from any of the databases that Trino itself is connected to.

## Why we chose Trino

Trino is a complex tool that can suit many different needs and roles. In our case, Trino came into our architecture for one very specific need and has slowly grown to play a bigger role in our data stack.

Without going into much detail on the internals of our app backend, the system behind [Mercadão](https://mercadao.pt/) uses two different databases for persistence: a MongoDB instance and a MySQL instance. All the interesting info about our business and operations (customers, orders, delivery locations, shoppers, etc.) is split between these two. This posed a problem: questions that spanned data across the two different technologies needed a way to fetch information from both and combine it.

In the very early days of Mercadão, this was done by manually exporting data from both sources and crossing them as needed on someone's laptop. As you can imagine, as our company and operations scaled, this became a growing headache. Plus, as we evolved into a more mature data stack, we wanted to be able to connect our main visualization tool at the time ([Metabase](https://www.metabase.com/)) to the data without lousy manual exports being a critical part of the process.

It was then that we identified Trino as a great solution for this problem. Trino offered us the ability to act as an abstraction layer on top of our MySQL+MongoDB combination, which translated into being able to join data across both databases as if they were a single one. Metabase had a connector for Trino, so we could integrate them without further plumbing. As a result, by deploying Trino and connecting it to our datasources and Metabase, we were able to easily build queries and reports across both MySQL and MongoDB. This was great for productivity, and also enabled savvy business users to access data without having to care about the details of the underlying infrastructure, which would be confusing and bizarre to them.

![Trino acting as a bridge between MySQL and MongoDB](/assets/images/trino-as-a-bridge.png)

In the end, it wasn't its distributed nature nor its promises of speed that led us to Trino, but simply the ability to hide two different databases behind a single SQL connector. Time has passed since those early stages and some things have changed, but this feature is still the fundamental reason why Trino lives in our data infrastructure.

## How Trino fits in our architecture

Today, Trino is one of the pillars of our architecture. There are three main use cases for Trino in our company:

1. **Act as a bridge between databases in ETL processes**: Trino is a great tool for our ETL system (which we will cover in other posts on the blog). Trino covers the gap between two different databases with something as simple as running an `INSERT INTO database_1.some_table (...) SELECT (...) FROM database_2.some_table`. This way, if you can write a SQL query, you can move data between the different databases in our company, as simple as that (there is a small sketch of this right after the list).

2. **Offer cross-db capabilities in Looker**: as we mentioned in the previous section, Trino enables queries across databases, which opens up a world of possibilities and convenience for our analysts, who are working hard to build products in Looker to serve the business. Although not every report or dashboard in Lolamarket goes through Trino (we will cover some performance issues with Trino in future posts), a fair share of them fetch their data through Trino.

3. **Empower analysts**: analysts and advanced users can connect to Trino through SQL clients like DBeaver to do their own research, debug data-related issues, and generally access most data within the company from a single point. This simplifies their lives both for querying across different databases and for managing access and permissions. If we spin up a new database within the company, simply adding it to Trino makes it accessible to our team without requiring any action from the users themselves.
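
As a rough sketch of the ETL pattern from point 1, a cross-database copy boils down to a single statement. The catalog, schema and table names below are again hypothetical and only meant to illustrate the shape of such a query.

```sql
-- Illustrative cross-catalog copy: read from one catalog, write into another.
-- "mysql" and "hive" are example catalog names; adjust them to your own setup.
INSERT INTO hive.analytics.orders_snapshot (order_id, customer_id, total_amount, created_at)
SELECT order_id, customer_id, total_amount, created_at
FROM mysql.shop.orders
WHERE created_at >= DATE '2022-01-01';
```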

## Just the beginning

Today's post was a brief intro to how Trino came into Lolamarket and what role it plays in our architecture. But there are many more details, insights and lessons learned that we can share. This post is just the beginning of our Trino series, where we will keep sharing our adventures with Trino. Stay tuned if you want to learn more!