20220711
Today was my first day at XL. I started the day by briefly chatting with Dani. He is as energetic as always and was already trying to get my hands dirty with stuff. It will be fun to work with him, but I need to make sure he doesn't kidnap me completely.
I also had a brief call with Maria Gorostiza. She was trying to reach me by phone, but I called her through Slack. It felt a bit weird that she was trying to get hold of me through my personal phone. I must confess that the vibes I get from the Portugal-based team are very different from the ones I get from the Spain-based team, and this is an example of that. Anyway, she just told me that I will need to put my working hours in Factorial and also that I should request my holidays through there. She also asked if I could put a real picture of myself in Slack. No trouble from my side, but getting asked to do it felt quite childish, to be honest.
The rest of the day I mostly spent with João discussing the organization and the business. There are a lot of details and nuances flying all over the place, but I'm old enough now to know that it's ok if I don't retain many of them. The important things will inevitably find their place in my brain.
I'm very happy with how things look. The service is interesting. It has a lot of data and decisions that need to be made, so Data work will definitely be fun. The new business opening in Poland sounds very exciting and feels like a greenfield opportunity. And the team is also looking great. The culture is very different from ACN. More laid back. Less pompous. No fancy words. Strong common sense.
It's very early to tell, but I think I've made a great choice coming over here.
We also discussed with the team that we could maybe meet in Porto in September. It would be great to fly there, spend a few days with the team, and chill around Porto in the afternoons.
20220712
Ana Pinto
- 10 years of experience as a Data Analyst
- Started in insurance for a year; did not like the business
- Joined Continente for 5 years as part of the loyalty program
- Moved to Talkdesk for 2 years
Infra with Dani
- He is playing around with Prefect and Trino
- Redshift will soon be dropped
20220713
Ana Januario
- Math; Continente; Topdesk
- Ad hoc analysis and Looker
- Will be working on moving Spanish data from Tableau to Looker
How to create a user in AWS:
- Go to IAM
- Select AWS credential type: Password
- Assign to the data-team user group
- Skip the tags section
- Hit the send email option to contact the user
User name,Password,Access key ID,Secret access key,Console login link
pablo.martin,EP}A}_WlF2mTV-t,,,https://fonte.signin.aws.amazon.com/console
Access key ID,Secret access key
AKIAVNZZGXD4KNPQPMH5,0j02JWq9mnQF3d2G9A600dlgt70PDAiiguJizVfD
You have been signed up for a Meraki account. You are now authorized to use 514195843 @ FONTE - NEGÓCIOS ONLINE @ 156654047.
Here is your login information:
Email address: pablo.martin@lolamarket.com
Password: 4bCphthn
You can manage your account at https://account.network-auth.com/.
Data modeling with Dani
- Orders model where we integrate both Mercadao and Lola https://pdofonte.atlassian.net/wiki/spaces/DATA/pages/2301067280/Orders
- snake_case, as in Python
Shipment from María:
- Mouse
- Keyboard
- Backpack
- Card
- Mobile app and watch the video
Medical check-up, PRL guide, PRL course (PRL = occupational risk prevention)
Query challenge
- Total order groups served in LolaMarket in 2021
- Dani showed me that order groups are not really an entity in Lola. Instead, there is a field called "tag" that appears in carts. Multiple carts with the same "tag" form a complete order group.
- I learned with Dani that the cart table also contains all the failed and cancelled carts. Those carts have no tag.
- ANSWER: 148975 (a query sketch is below, after this list)
- Total orders (visits to supermarket) made in Mercadao in June 2022
- AOV for orders made in Lolamarket, in May 2022, by shop brand
- Largest basket ever sold in Mercadao
- Smallest basket ever sold in LolaMarket
- Look for my first LolaMarket purchase and understand it in full detail
- Find the customer that spent the most in Mercadao in 2021
- See if there is any user (identified by email) that appears in both LolaMarket and Mercadao
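A sketch of how I'd compute the first challenge with what Dani explained (hedged: the catalog and column names are taken from the cart table as it appears in the Trino plans later in these notes, and treating status = 'delivered' as "served" is my assumption):
-- Order groups served in 2021: distinct non-null tags on delivered carts.
-- date_delivered is a unix timestamp in seconds, interpreted in Europe/Madrid.
select count(distinct c.tag) as order_groups_2021
from app_lm_mysql.comprea.cart c
where c.status = 'delivered'
  and c.tag is not null
  and year(from_unixtime(cast(c.date_delivered as bigint), 'Europe/Madrid')) = 2021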
Dani's user id -> 3799702
20220714
Tour through Mercadao tables with João (a quick sanity-check query is sketched after this list)
- chargedamount -> what the customer pays (originaltotalprice * 1.03)
- originaltotalprice -> original value of the goods (with discounts applied)
- totalprice -> final cost of the picked goods
- totaldiscount -> discount of goods (from the retailer)
- totalwshipping -> totalprice + delivery fee
- subtotal -> should be originaltotalprice + totaldiscount (but it is flawed; ignore it)
- status might be null (user reaches checkout payment screen but doesn't pay)
- For status flow, review confluence ->
- PickingLocationID -> suggested picking location
- We have both the original and final delivery slot
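A sanity check I could run over these fields once I confirm which Mercadao table they live on (md_orders below is a placeholder name, not a real table; the 3% factor and the subtotal formula are the ones João described above):
-- md_orders is a placeholder; swap in the real Mercadao order table.
select
    count(*) as orders_checked,
    sum(case when abs(chargedamount - originaltotalprice * 1.03) > 0.01 then 1 else 0 end) as markup_mismatches,
    sum(case when abs(subtotal - (originaltotalprice + totaldiscount)) > 0.01 then 1 else 0 end) as flawed_subtotals
from md_orders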
20220719
With María:
- Confirmation messages will arrive one hour before each of the two slots. I have to accept them or the slots will be taken away from me.
Meeting marathon
Projects
- Project manager across the company
- Working mostly on Mercadao
- Projects
- With Glovo
- Mercadao does the picking
- Glovo riders do the delivery
- BOOM
- B2B Horeca project
- With Recheio
- Larger baskets
- https://express.recheio.pt/
- Picking is very different. Vans are needed instead of cars.
- With Glovo
Tech
- Missing
Finance
- "Congratulations for joining João's team."
- Luis Cabral -> Head of Finance for XL
- Helps the head of each local finance do their work properly
- Bridges financial comms to Glovo-DH
- Each country has their own finance team
What systems do you have for:
- Accounting:
  - Sage in Spain
  - Sage in Portugal
  - Forced to move to Glovo's SAP HANA (starting on 2023-01-01)
- Reporting / P&L:
  - They use Looker, Metabase and a few Excel sheets
Ops
- Manages the Portuguese Quality Assurance, operations, availability and onboarding teams
- Starting to assume the Spanish operations
- She is clearly very enthusiastic
CX
- Customer support and LiveOps
- Jaime
- Started a year ago
- Runs customer support
- Is absorbing LiveOps (the team that helps the shoppers)
- New CRM, Freshdesk, for Portugal and Spain
- 3 supervisors and 8 agents
- They mediate between customers and shoppers
- Data
- App ratings, surveys?
- KPI
- Reduction of the shopper contact rate
Commercial
- Negotiate campaigns of the banners in the Marketplace
- Only food. Currently exploring other services
- Selling reporting to brands (cross-selling topics, AOV per brand, frequency and repetition). Currently only banners and clicks (CTR, click-through rate)
- Campaign impact on sales deltas for different products
- Working with Looker. Needs a bit of help with exporting data
- Runs Iberia (Carlota and Gloria work for him in Spain, Ines)
- For Poland: let's run some large operations in there before selling our services to brands there.
Marketing
20220720
Shopper notes
Today I went out as a shopper for the first time. I took a few notes during the day and also have a few ideas in mind:
- In the morning, the order proposal appeared on my screen. I hit accept as the order seemed suitable, but then the app said that another shopper had taken it in the meantime (makes sense, since I was driving and didn't have my phone handy). But then, when I went to my orders page, I could see that order in my schedule, so it seems it was assigned to me regardless of that warning message. I noticed this, so no problem, but this could lead to a shopper getting an order assigned without realizing it.
- When I was in Mercadona doing my first order, one item from the shopping list appeared in the Frozen section within the app. That item was not actually in the frozen section of the supermarket but rather in a normal open fridge, which was quite frustrating since I was trying to find it in the freezers. Funnily enough, there was another product on the shopping list that actually was in the frozen section of the supermarket but did not appear in the frozen section of the app's list. I felt a bit like the shopper app was playing with me.
- During my first order, there was this lemonade product that had a content of 2L. In the app, if you opened up the picture, you could also see it had 2L of volume. But then, in the app metadata, it showed the product had 1.5L. I was a bit confused here: should I get 5 bottles because the customer asked for 5 units? Or should I change the number of bottles given that the customer shopped while thinking that each bottle was 1.5L? Or wait, what did the customer follow: the metadata that said 1.5L or the picture of the product that said 2L? In the end, I got 4 bottles instead of 5.
- First delivery: shopping took way less time than the app suggested. This, on top of me being in the supermarket early to compensate for my expected clumsiness, meant I was ready to pay at the POS around 20 minutes earlier than planned. I didn't want to be way too early at the customer's location, but I also didn't want to wait near there since the weather is super hot and frozen and cold products would get awful. I decided to wait... inside the supermarket (without picking the cold products yet). Pretty sure some Mercadona folks were scratching their heads :). Once the time passed, I went ahead, paid and drove to the customer location.
- My first two orders had a pretty nasty overlap. Order #1 had delivery scheduled at something like 11:20, while Order #2 instructed me to start picking at 11:00 in a supermarket which was about 20min away from Order #1's customer location. I managed because I was able to finish Order #1 much earlier than planned. Otherwise, customer #2 would have had a hefty delay, for sure outside of their chosen timeslot.
- This is a rather lengthy one, and the biggest issue I faced with the app. It could have easily turned into a super late delivery for the customer. It happened in order #3. Step by step:
- I looked at the details of order #3 and saw that the picking location was a Lidl store. I clicked on the supermarket icon to open Google Maps and see the location.
- The shopper app transitioned into Google Maps and provided me with the location. The problem? There is no Lidl supermarket there. I knew because I'm very familiar with the area where the app was indicating the Lidl was. But again, I knew there was no Lidl there. I went back to the shopper app and checked the address. The street name was right. I was scratching my head for a couple of minutes before I realised what was going on. The address was Carrer Apel·les Mestres, 109. The thing is, there is a Carrer Apel·les Mestres in Barcelona, but there is also another one with the exact same name in El Prat de Llobregat (a nearby town to BCN, kind of like Madrid-Pozuelo or Porto-Matosinhos). I realised because the postcode was not a Barcelona city postcode, and then I found the Lidl that does exist in Carrer Apel·les Mestres in El Prat.
- I checked the customer location and realised there was a large Lidl store about 500m away from there, so I decided to go there. Checked beforehand around 10 times that the store was actually there, because by this time I was both a bit confused and a bit skeptical.
- In the end everything went just fine. But, had I not been familiar with the streets, I would not have spotted the issue straight away and would probably have suffered a +30min delay.
- I was also a bit puzzled by the fact that the shopper app didn't suggest the Lidl location right next to the customer location.
- The screenshots at the end show the exact address and what I was being shown in Google Maps (aka the false Lidl location)
- Trying to send pictures to the client through the chat was a pain in the a** and I'm not even sure that it worked in the end. Issues:
- I would hit the camera icon, take the picture, and then return to the chat. No trace of it.
- After a few tries, the picture "appeared". There was an empty white box in the chat, which I guess should be the picture. But again, on my screen it only appeared as an empty box.
- A suggestion: it would be nice if multiple pictures could be taken at once (instead of: picture -> chat -> picture -> chat, etc.). I was trying to give the customer 3 alternatives to a stocked-out product, so taking the three pictures in one go would have been more convenient.
- My Android device is configured in English. The whole shopper app appears in English, which I didn't mind. What I did mind was that the auto-suggested chat messages to the customer appeared in English. This doesn't make much sense to me, since I would assume that our default stance should be to either address the user in Spanish or, if the user can somehow indicate their preferred language, in whatever the user has indicated.
A few additional details, a mix of fun and useful:
- The customer for order #1 was this very nice woman who was using Lola for the first time. She told me she was undergoing chemo and felt pretty sick all the time, so it was super convenient for her that we brought the groceries to her place. Hearing her story was touching.
- I would advise against having the empty Lola backpack on your back if you go on the highway faster than 100 km/h on a motorbike. The box acts kind of funny and pulls your arms in weird ways.
- Paper bags at Lidl are quite crappy. I was very much afraid that some of them would break during the delivery.
- Silly thing you don't even think about before starting as a shopper: the euro coin for the supermarket cart! I was lucky enough to have one, but in order #2, with all the rush, I forgot to recover it. The guys at the supermarket for order #3 were nice enough to unlock a cart for me because I had no coins by then.
- The barcode reader fails quite a bit (by failing I mean it reads the barcode and tells you it's the wrong product, even when you can very clearly see from the product description and image that you are picking the right one).
- Entering a supermarket you have never been in with a very long list of products creates this peculiar initial paralysis. It left me wondering which option would be better:
- Current way: we show the entire list to the shopper, the shopper goes around picking in whatever order he wants.
- My idea: we show the shopper items by small batches which are all related to the same product category. For instance, if the customer has chosen a few yogurt and cheese products, we show only those, and hence the shopper can focus on getting that done. Then the next batch could be beers and wines, the next one cleaning products, etc. This could also be used to leave cold and frozen products for the end, ensuring the shopper picks those after all the other ones.
20220721
Tech
I have the onboarding meeting with Berlana that we had to postpone.
Questions:
- How are you organized?
- How can we keep up with releases and changes?
- Who is who?
- How is the whole instaleap thing working out?
Query thingy
- Get access directly to Lola MySQL to compare performance.
- Keep on researching
- Document how to connect to the lolamarket MySQL -> Noted
- Document how to get the Glovo free-benefits ->
- Send the equipment paperwork to Maria
- Get into the Notion thing and document it
My shopper ID is 8025
20220722
- Check that fetchall is actually fetching all.
- Understand
20220725
I'm going to compare in detail the execution of the same query in both engines and see what I can get out of it. The query I'm using is Orders finished yesterday with less than 10 items.
The execution ID I'm looking into in Trino is this: 20220725_080932_00086_yu7k5
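From the plan below, the query seems to have this rough shape (my reconstruction, not the literal SQL I ran; the aliases, the timezone handling and the LIMIT 100 are read off the plan):
-- Carts delivered on 2022-07-24 (Europe/Madrid) holding fewer than 10 items in total,
-- joined back to the cart table to bring out all of its columns.
select agg.item_count, c.*
from (
    select cp.cart_id, sum(cp.quantity) as item_count
    from app_lm_mysql.comprea.cart_product cp
    inner join app_lm_mysql.comprea.cart cf on cf.id = cp.cart_id
    where cf.status = 'delivered'
      and cast(from_unixtime(cast(cf.date_delivered as bigint), 'Europe/Madrid') as date) = date '2022-07-24'
    group by cp.cart_id
    having sum(cp.quantity) < 10
    limit 100
) agg
left join app_lm_mysql.comprea.cart c on c.id = agg.cart_id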
For Trino, this is the output of the EXPLAIN:
Fragment 0 [SINGLE]
Output layout: [sum, id_105, cash_order_id_126, shopper_timeslot_id_80, shop_id_79, address_id_78, total_cost_92, total_price_117, total_price_discount_140, delivery_price_141, delivery_price_discount_112, delivery_type_95, status_108, modified_76, last_update_110, date_shopping_119, date_delivering_125, date_delivered_133, note_127, date_started_134, weight_99, last_overweight_notification_121, shopper_total_price_123, shopper_total_cost_135, margin_98, shopper_weight_106, comprea_note_83, next_shopper_timeslot_id_120, shopper_algorithm_124, loyalty_tip_131, date_loyalty_tip_115, ebitda_104, numcalls_84, manual_charge_111, commission_100, last_no_times_available_109, comprea_note_driver_128, num_user_changes_130, expected_shopping_time_139, expected_delivering_time_97, date_deliveries_requested_81, date_call_93, percentage_opened_hours_82, percentage_opened_hours_in_day_113, fraud_rating_114, tag_94, tag_color_90, expected_eta_time_102, promo_hours_137, numcallsclient_77, frozen_products_103, driver_timeslot_id_107, expected_delivering_distance_118, lola_id_136, comprea_note_warehouse_86, uuid_88, date_almost_delivered_132, date_ticket_pending_96, date_waiting_driver_138, num_products_116, num_products_taken_91, total_saving_89, percentage_opened_hours_in_87, cart_progress_branch_url_122, real_shop_id_129, gps_locked_85, tag_signature_101]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
Output[item_count, id, cash_order_id, shopper_timeslot_id, shop_id, address_id, total_cost, total_price, total_price_discount, delivery_price, delivery_price_discount, delivery_type, status, modified, last_update, date_shopping, date_delivering, date_delivered, note, date_started, weight, last_overweight_notification, shopper_total_price, shopper_total_cost, margin, shopper_weight, comprea_note, next_shopper_timeslot_id, shopper_algorithm, loyalty_tip, date_loyalty_tip, ebitda, numcalls, manual_charge, commission, last_no_times_available, comprea_note_driver, num_user_changes, expected_shopping_time, expected_delivering_time, date_deliveries_requested, date_call, percentage_opened_hours, percentage_opened_hours_in_day, fraud_rating, tag, tag_color, expected_eta_time, promo_hours, numcallsclient, frozen_products, driver_timeslot_id, expected_delivering_distance, lola_id, comprea_note_warehouse, uuid, date_almost_delivered, date_ticket_pending, date_waiting_driver, num_products, num_products_taken, total_saving, percentage_opened_hours_in_10, cart_progress_branch_url, real_shop_id, gps_locked, tag_signature]
│ Layout: [sum:bigint, id_105:integer, cash_order_id_126:integer, shopper_timeslot_id_80:integer, shop_id_79:integer, address_id_78:integer, total_cost_92:decimal(8,2), total_price_117:decimal(8,2), total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2), delivery_price_discount_112:decimal(8,2), delivery_type_95:char(9), status_108:char(14), modified_76:tinyint, last_update_110:integer, date_shopping_119:integer, date_delivering_125:integer, date_delivered_133:integer, note_127:varchar, date_started_134:integer, weight_99:decimal(6,3), last_overweight_notification_121:integer, shopper_total_price_123:decimal(8,2), shopper_total_cost_135:decimal(8,2), margin_98:decimal(8,2), shopper_weight_106:decimal(6,3), comprea_note_83:varchar, next_shopper_timeslot_id_120:integer, shopper_algorithm_124:varchar, loyalty_tip_131:decimal(8,2), date_loyalty_tip_115:integer, ebitda_104:decimal(8,2), numcalls_84:integer, manual_charge_111:tinyint, commission_100:integer, last_no_times_available_109:integer, comprea_note_driver_128:varchar, num_user_changes_130:integer, expected_shopping_time_139:integer, expected_delivering_time_97:integer, date_deliveries_requested_81:integer, date_call_93:integer, percentage_opened_hours_82:decimal(5,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, tag_94:varchar(20), tag_color_90:varchar(15), expected_eta_time_102:integer, promo_hours_137:integer, numcallsclient_77:integer, frozen_products_103:tinyint, driver_timeslot_id_107:integer, expected_delivering_distance_118:integer, lola_id_136:varchar(50), comprea_note_warehouse_86:varchar, uuid_88:varchar(10), date_almost_delivered_132:integer, date_ticket_pending_96:integer, date_waiting_driver_138:integer, num_products_116:integer, num_products_taken_91:integer, total_saving_89:decimal(8,2), percentage_opened_hours_in_87:decimal(5,2), cart_progress_branch_url_122:varchar(255), real_shop_id_129:integer, gps_locked_85:tinyint, tag_signature_101:varchar(4)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ item_count := sum
│ id := id_105
│ cash_order_id := cash_order_id_126
│ shopper_timeslot_id := shopper_timeslot_id_80
│ shop_id := shop_id_79
│ address_id := address_id_78
│ total_cost := total_cost_92
│ total_price := total_price_117
│ total_price_discount := total_price_discount_140
│ delivery_price := delivery_price_141
│ delivery_price_discount := delivery_price_discount_112
│ delivery_type := delivery_type_95
│ status := status_108
│ modified := modified_76
│ last_update := last_update_110
│ date_shopping := date_shopping_119
│ date_delivering := date_delivering_125
│ date_delivered := date_delivered_133
│ note := note_127
│ date_started := date_started_134
│ weight := weight_99
│ last_overweight_notification := last_overweight_notification_121
│ shopper_total_price := shopper_total_price_123
│ shopper_total_cost := shopper_total_cost_135
│ margin := margin_98
│ shopper_weight := shopper_weight_106
│ comprea_note := comprea_note_83
│ next_shopper_timeslot_id := next_shopper_timeslot_id_120
│ shopper_algorithm := shopper_algorithm_124
│ loyalty_tip := loyalty_tip_131
│ date_loyalty_tip := date_loyalty_tip_115
│ ebitda := ebitda_104
│ numcalls := numcalls_84
│ manual_charge := manual_charge_111
│ commission := commission_100
│ last_no_times_available := last_no_times_available_109
│ comprea_note_driver := comprea_note_driver_128
│ num_user_changes := num_user_changes_130
│ expected_shopping_time := expected_shopping_time_139
│ expected_delivering_time := expected_delivering_time_97
│ date_deliveries_requested := date_deliveries_requested_81
│ date_call := date_call_93
│ percentage_opened_hours := percentage_opened_hours_82
│ percentage_opened_hours_in_day := percentage_opened_hours_in_day_113
│ fraud_rating := fraud_rating_114
│ tag := tag_94
│ tag_color := tag_color_90
│ expected_eta_time := expected_eta_time_102
│ promo_hours := promo_hours_137
│ numcallsclient := numcallsclient_77
│ frozen_products := frozen_products_103
│ driver_timeslot_id := driver_timeslot_id_107
│ expected_delivering_distance := expected_delivering_distance_118
│ lola_id := lola_id_136
│ comprea_note_warehouse := comprea_note_warehouse_86
│ uuid := uuid_88
│ date_almost_delivered := date_almost_delivered_132
│ date_ticket_pending := date_ticket_pending_96
│ date_waiting_driver := date_waiting_driver_138
│ num_products := num_products_116
│ num_products_taken := num_products_taken_91
│ total_saving := total_saving_89
│ percentage_opened_hours_in_10 := percentage_opened_hours_in_87
│ cart_progress_branch_url := cart_progress_branch_url_122
│ real_shop_id := real_shop_id_129
│ gps_locked := gps_locked_85
│ tag_signature := tag_signature_101
└─ RemoteSource[1]
Layout: [sum:bigint, id_105:integer, cash_order_id_126:integer, shopper_timeslot_id_80:integer, shop_id_79:integer, address_id_78:integer, total_cost_92:decimal(8,2), total_price_117:decimal(8,2), total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2), delivery_price_discount_112:decimal(8,2), delivery_type_95:char(9), status_108:char(14), modified_76:tinyint, last_update_110:integer, date_shopping_119:integer, date_delivering_125:integer, date_delivered_133:integer, note_127:varchar, date_started_134:integer, weight_99:decimal(6,3), last_overweight_notification_121:integer, shopper_total_price_123:decimal(8,2), shopper_total_cost_135:decimal(8,2), margin_98:decimal(8,2), shopper_weight_106:decimal(6,3), comprea_note_83:varchar, next_shopper_timeslot_id_120:integer, shopper_algorithm_124:varchar, loyalty_tip_131:decimal(8,2), date_loyalty_tip_115:integer, ebitda_104:decimal(8,2), numcalls_84:integer, manual_charge_111:tinyint, commission_100:integer, last_no_times_available_109:integer, comprea_note_driver_128:varchar, num_user_changes_130:integer, expected_shopping_time_139:integer, expected_delivering_time_97:integer, date_deliveries_requested_81:integer, date_call_93:integer, percentage_opened_hours_82:decimal(5,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, tag_94:varchar(20), tag_color_90:varchar(15), expected_eta_time_102:integer, promo_hours_137:integer, numcallsclient_77:integer, frozen_products_103:tinyint, driver_timeslot_id_107:integer, expected_delivering_distance_118:integer, lola_id_136:varchar(50), comprea_note_warehouse_86:varchar, uuid_88:varchar(10), date_almost_delivered_132:integer, date_ticket_pending_96:integer, date_waiting_driver_138:integer, num_products_116:integer, num_products_taken_91:integer, total_saving_89:decimal(8,2), percentage_opened_hours_in_87:decimal(5,2), cart_progress_branch_url_122:varchar(255), real_shop_id_129:integer, gps_locked_85:tinyint, tag_signature_101:varchar(4)]
Fragment 1 [HASH]
Output layout: [sum, id_105, cash_order_id_126, shopper_timeslot_id_80, shop_id_79, address_id_78, total_cost_92, total_price_117, total_price_discount_140, delivery_price_141, delivery_price_discount_112, delivery_type_95, status_108, modified_76, last_update_110, date_shopping_119, date_delivering_125, date_delivered_133, note_127, date_started_134, weight_99, last_overweight_notification_121, shopper_total_price_123, shopper_total_cost_135, margin_98, shopper_weight_106, comprea_note_83, next_shopper_timeslot_id_120, shopper_algorithm_124, loyalty_tip_131, date_loyalty_tip_115, ebitda_104, numcalls_84, manual_charge_111, commission_100, last_no_times_available_109, comprea_note_driver_128, num_user_changes_130, expected_shopping_time_139, expected_delivering_time_97, date_deliveries_requested_81, date_call_93, percentage_opened_hours_82, percentage_opened_hours_in_day_113, fraud_rating_114, tag_94, tag_color_90, expected_eta_time_102, promo_hours_137, numcallsclient_77, frozen_products_103, driver_timeslot_id_107, expected_delivering_distance_118, lola_id_136, comprea_note_warehouse_86, uuid_88, date_almost_delivered_132, date_ticket_pending_96, date_waiting_driver_138, num_products_116, num_products_taken_91, total_saving_89, percentage_opened_hours_in_87, cart_progress_branch_url_122, real_shop_id_129, gps_locked_85, tag_signature_101]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
LeftJoin[("cart_id" = "id_105")][$hashvalue, $hashvalue_151]
│ Layout: [sum:bigint, modified_76:tinyint, numcallsclient_77:integer, address_id_78:integer, shop_id_79:integer, shopper_timeslot_id_80:integer, date_deliveries_requested_81:integer, percentage_opened_hours_82:decimal(5,2), comprea_note_83:varchar, numcalls_84:integer, gps_locked_85:tinyint, comprea_note_warehouse_86:varchar, percentage_opened_hours_in_87:decimal(5,2), uuid_88:varchar(10), total_saving_89:decimal(8,2), tag_color_90:varchar(15), num_products_taken_91:integer, total_cost_92:decimal(8,2), date_call_93:integer, tag_94:varchar(20), delivery_type_95:char(9), date_ticket_pending_96:integer, expected_delivering_time_97:integer, margin_98:decimal(8,2), weight_99:decimal(6,3), commission_100:integer, tag_signature_101:varchar(4), expected_eta_time_102:integer, frozen_products_103:tinyint, ebitda_104:decimal(8,2), id_105:integer, shopper_weight_106:decimal(6,3), driver_timeslot_id_107:integer, status_108:char(14), last_no_times_available_109:integer, last_update_110:integer, manual_charge_111:tinyint, delivery_price_discount_112:decimal(8,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, date_loyalty_tip_115:integer, num_products_116:integer, total_price_117:decimal(8,2), expected_delivering_distance_118:integer, date_shopping_119:integer, next_shopper_timeslot_id_120:integer, last_overweight_notification_121:integer, cart_progress_branch_url_122:varchar(255), shopper_total_price_123:decimal(8,2), shopper_algorithm_124:varchar, date_delivering_125:integer, cash_order_id_126:integer, note_127:varchar, comprea_note_driver_128:varchar, real_shop_id_129:integer, num_user_changes_130:integer, loyalty_tip_131:decimal(8,2), date_almost_delivered_132:integer, date_delivered_133:integer, date_started_134:integer, shopper_total_cost_135:decimal(8,2), lola_id_136:varchar(50), promo_hours_137:integer, date_waiting_driver_138:integer, expected_shopping_time_139:integer, total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ Distribution: PARTITIONED
├─ RemoteSource[2]
│ Layout: [cart_id:integer, sum:bigint, $hashvalue:bigint]
└─ LocalExchange[HASH][$hashvalue_151] ("id_105")
│ Layout: [modified_76:tinyint, numcallsclient_77:integer, address_id_78:integer, shop_id_79:integer, shopper_timeslot_id_80:integer, date_deliveries_requested_81:integer, percentage_opened_hours_82:decimal(5,2), comprea_note_83:varchar, numcalls_84:integer, gps_locked_85:tinyint, comprea_note_warehouse_86:varchar, percentage_opened_hours_in_87:decimal(5,2), uuid_88:varchar(10), total_saving_89:decimal(8,2), tag_color_90:varchar(15), num_products_taken_91:integer, total_cost_92:decimal(8,2), date_call_93:integer, tag_94:varchar(20), delivery_type_95:char(9), date_ticket_pending_96:integer, expected_delivering_time_97:integer, margin_98:decimal(8,2), weight_99:decimal(6,3), commission_100:integer, tag_signature_101:varchar(4), expected_eta_time_102:integer, frozen_products_103:tinyint, ebitda_104:decimal(8,2), id_105:integer, shopper_weight_106:decimal(6,3), driver_timeslot_id_107:integer, status_108:char(14), last_no_times_available_109:integer, last_update_110:integer, manual_charge_111:tinyint, delivery_price_discount_112:decimal(8,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, date_loyalty_tip_115:integer, num_products_116:integer, total_price_117:decimal(8,2), expected_delivering_distance_118:integer, date_shopping_119:integer, next_shopper_timeslot_id_120:integer, last_overweight_notification_121:integer, cart_progress_branch_url_122:varchar(255), shopper_total_price_123:decimal(8,2), shopper_algorithm_124:varchar, date_delivering_125:integer, cash_order_id_126:integer, note_127:varchar, comprea_note_driver_128:varchar, real_shop_id_129:integer, num_user_changes_130:integer, loyalty_tip_131:decimal(8,2), date_almost_delivered_132:integer, date_delivered_133:integer, date_started_134:integer, shopper_total_cost_135:decimal(8,2), lola_id_136:varchar(50), promo_hours_137:integer, date_waiting_driver_138:integer, expected_shopping_time_139:integer, total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2), $hashvalue_151:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: ?}
└─ RemoteSource[6]
Layout: [modified_76:tinyint, numcallsclient_77:integer, address_id_78:integer, shop_id_79:integer, shopper_timeslot_id_80:integer, date_deliveries_requested_81:integer, percentage_opened_hours_82:decimal(5,2), comprea_note_83:varchar, numcalls_84:integer, gps_locked_85:tinyint, comprea_note_warehouse_86:varchar, percentage_opened_hours_in_87:decimal(5,2), uuid_88:varchar(10), total_saving_89:decimal(8,2), tag_color_90:varchar(15), num_products_taken_91:integer, total_cost_92:decimal(8,2), date_call_93:integer, tag_94:varchar(20), delivery_type_95:char(9), date_ticket_pending_96:integer, expected_delivering_time_97:integer, margin_98:decimal(8,2), weight_99:decimal(6,3), commission_100:integer, tag_signature_101:varchar(4), expected_eta_time_102:integer, frozen_products_103:tinyint, ebitda_104:decimal(8,2), id_105:integer, shopper_weight_106:decimal(6,3), driver_timeslot_id_107:integer, status_108:char(14), last_no_times_available_109:integer, last_update_110:integer, manual_charge_111:tinyint, delivery_price_discount_112:decimal(8,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, date_loyalty_tip_115:integer, num_products_116:integer, total_price_117:decimal(8,2), expected_delivering_distance_118:integer, date_shopping_119:integer, next_shopper_timeslot_id_120:integer, last_overweight_notification_121:integer, cart_progress_branch_url_122:varchar(255), shopper_total_price_123:decimal(8,2), shopper_algorithm_124:varchar, date_delivering_125:integer, cash_order_id_126:integer, note_127:varchar, comprea_note_driver_128:varchar, real_shop_id_129:integer, num_user_changes_130:integer, loyalty_tip_131:decimal(8,2), date_almost_delivered_132:integer, date_delivered_133:integer, date_started_134:integer, shopper_total_cost_135:decimal(8,2), lola_id_136:varchar(50), promo_hours_137:integer, date_waiting_driver_138:integer, expected_shopping_time_139:integer, total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2), $hashvalue_152:bigint]
Fragment 2 [SINGLE]
Output layout: [cart_id, sum, $hashvalue_143]
Output partitioning: HASH [cart_id][$hashvalue_143]
Stage Execution Strategy: UNGROUPED_EXECUTION
Limit[100]
│ Layout: [cart_id:integer, sum:bigint, $hashvalue_143:bigint]
│ Estimates: {rows: 100 (2.25kB), cpu: ?, memory: ?, network: ?}
└─ LocalExchange[SINGLE] ()
│ Layout: [cart_id:integer, sum:bigint, $hashvalue_143:bigint]
│ Estimates: {rows: 100 (2.25kB), cpu: ?, memory: ?, network: ?}
└─ RemoteSource[3]
Layout: [cart_id:integer, sum:bigint, $hashvalue_144:bigint]
Fragment 3 [HASH]
Output layout: [cart_id, sum, $hashvalue_145]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
LimitPartial[100]
│ Layout: [cart_id:integer, $hashvalue_145:bigint, sum:bigint]
│ Estimates: {rows: 100 (2.25kB), cpu: ?, memory: ?, network: ?}
└─ Filter[filterPredicate = ("sum" < BIGINT '10')]
│ Layout: [cart_id:integer, $hashvalue_145:bigint, sum:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
└─ Aggregate(FINAL)[cart_id][$hashvalue_145]
│ Layout: [cart_id:integer, $hashvalue_145:bigint, sum:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ sum := sum("sum_142")
└─ LocalExchange[HASH][$hashvalue_145] ("cart_id")
│ Layout: [cart_id:integer, sum_142:row(bigint, boolean, bigint, boolean), $hashvalue_145:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
└─ Aggregate(PARTIAL)[cart_id][$hashvalue_146]
│ Layout: [cart_id:integer, $hashvalue_146:bigint, sum_142:row(bigint, boolean, bigint, boolean)]
│ sum_142 := sum("expr")
└─ Project[]
│ Layout: [cart_id:integer, expr:bigint, $hashvalue_146:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ expr := CAST("quantity" AS bigint)
└─ InnerJoin[("cart_id" = "id_0")][$hashvalue_146, $hashvalue_148]
│ Layout: [cart_id:integer, quantity:integer, $hashvalue_146:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ Distribution: PARTITIONED
│ dynamicFilterAssignments = {id_0 -> #df_778}
├─ RemoteSource[4]
│ Layout: [cart_id:integer, quantity:integer, $hashvalue_146:bigint]
└─ LocalExchange[HASH][$hashvalue_148] ("id_0")
│ Layout: [id_0:integer, $hashvalue_148:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: ?}
└─ RemoteSource[5]
Layout: [id_0:integer, $hashvalue_149:bigint]
Fragment 4 [SOURCE]
Output layout: [cart_id, quantity, $hashvalue_147]
Output partitioning: HASH [cart_id][$hashvalue_147]
Stage Execution Strategy: UNGROUPED_EXECUTION
ScanFilterProject[table = app_lm_mysql:comprea.cart_product comprea.cart_product columns=[cart_id:integer:INT, quantity:integer:INT], grouped = false, filterPredicate = true, dynamicFilters = {"cart_id" = #df_778}]
Layout: [cart_id:integer, quantity:integer, $hashvalue_147:bigint]
Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}
$hashvalue_147 := combine_hash(bigint '0', COALESCE("$operator$hash_code"("cart_id"), 0))
cart_id := cart_id:integer:INT
quantity := quantity:integer:INT
Fragment 5 [SOURCE]
Output layout: [id_0, $hashvalue_150]
Output partitioning: HASH [id_0][$hashvalue_150]
Stage Execution Strategy: UNGROUPED_EXECUTION
ScanFilterProject[table = app_lm_mysql:comprea.cart comprea.cart constraint on [status] columns=[id:integer:INT, status:char(14):ENUM, date_delivered:integer:INT], grouped = false, filterPredicate = (("status" = CAST('delivered' AS char(14))) AND (CAST(with_timezone(date_add('second', CAST("date_delivered" AS bigint), TIMESTAMP '1970-01-01 00:00:00'), 'Europe/Madrid') AS date) = DATE '2022-07-24'))]
Layout: [id_0:integer, $hashvalue_150:bigint]
Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}
$hashvalue_150 := combine_hash(bigint '0', COALESCE("$operator$hash_code"("id_0"), 0))
date_delivered := date_delivered:integer:INT
id_0 := id:integer:INT
status := status:char(14):ENUM
Fragment 6 [SOURCE]
Output layout: [modified_76, numcallsclient_77, address_id_78, shop_id_79, shopper_timeslot_id_80, date_deliveries_requested_81, percentage_opened_hours_82, comprea_note_83, numcalls_84, gps_locked_85, comprea_note_warehouse_86, percentage_opened_hours_in_87, uuid_88, total_saving_89, tag_color_90, num_products_taken_91, total_cost_92, date_call_93, tag_94, delivery_type_95, date_ticket_pending_96, expected_delivering_time_97, margin_98, weight_99, commission_100, tag_signature_101, expected_eta_time_102, frozen_products_103, ebitda_104, id_105, shopper_weight_106, driver_timeslot_id_107, status_108, last_no_times_available_109, last_update_110, manual_charge_111, delivery_price_discount_112, percentage_opened_hours_in_day_113, fraud_rating_114, date_loyalty_tip_115, num_products_116, total_price_117, expected_delivering_distance_118, date_shopping_119, next_shopper_timeslot_id_120, last_overweight_notification_121, cart_progress_branch_url_122, shopper_total_price_123, shopper_algorithm_124, date_delivering_125, cash_order_id_126, note_127, comprea_note_driver_128, real_shop_id_129, num_user_changes_130, loyalty_tip_131, date_almost_delivered_132, date_delivered_133, date_started_134, shopper_total_cost_135, lola_id_136, promo_hours_137, date_waiting_driver_138, expected_shopping_time_139, total_price_discount_140, delivery_price_141, $hashvalue_153]
Output partitioning: HASH [id_105][$hashvalue_153]
Stage Execution Strategy: UNGROUPED_EXECUTION
ScanProject[table = app_lm_mysql:comprea.cart comprea.cart columns=[modified:tinyint:TINYINT, numCallsClient:integer:INT, address_id:integer:INT, shop_id:integer:INT, shopper_timeslot_id:integer:INT, date_deliveries_requested:integer:INT, percentage_opened_hours:decimal(5,2):DECIMAL, comprea_note:varchar:LONGTEXT, numCalls:integer:INT, gps_locked:tinyint:TINYINT, comprea_note_warehouse:varchar:LONGTEXT, percentage_opened_hours_in_10:decimal(5,2):DECIMAL, uuid:varchar(10):VARCHAR, total_saving:decimal(8,2):DECIMAL, tag_color:varchar(15):VARCHAR, num_products_taken:integer:INT, total_cost:decimal(8,2):DECIMAL, date_call:integer:INT, tag:varchar(20):VARCHAR, delivery_type:char(9):ENUM, date_ticket_pending:integer:INT, expected_delivering_time:integer:INT, margin:decimal(8,2):DECIMAL, weight:decimal(6,3):DECIMAL, commission:integer:INT, tag_signature:varchar(4):VARCHAR, expected_eta_time:integer:INT, frozen_products:tinyint:TINYINT, ebitda:decimal(8,2):DECIMAL, id:integer:INT, shopper_weight:decimal(6,3):DECIMAL, driver_timeslot_id:integer:INT, status:char(14):ENUM, last_no_times_available:integer:INT, last_update:integer:INT, manual_charge:tinyint:TINYINT, delivery_price_discount:decimal(8,2):DECIMAL, percentage_opened_hours_in_day:decimal(5,2):DECIMAL, fraud_rating:integer:INT, date_loyalty_tip:integer:INT, num_products:integer:INT, total_price:decimal(8,2):DECIMAL, expected_delivering_distance:integer:INT, date_shopping:integer:INT, next_shopper_timeslot_id:integer:INT, last_overweight_notification:integer:INT, cart_progress_branch_url:varchar(255):VARCHAR, shopper_total_price:decimal(8,2):DECIMAL, shopper_algorithm:varchar:LONGTEXT, date_delivering:integer:INT, cash_order_id:integer:INT, note:varchar:LONGTEXT, comprea_note_driver:varchar:LONGTEXT, real_shop_id:integer:INT, num_user_changes:integer:INT, loyalty_tip:decimal(8,2):DECIMAL, date_almost_delivered:integer:INT, date_delivered:integer:INT, date_started:integer:INT, shopper_total_cost:decimal(8,2):DECIMAL, lola_id:varchar(50):VARCHAR, promo_hours:integer:INT, date_waiting_driver:integer:INT, expected_shopping_time:integer:INT, total_price_discount:decimal(8,2):DECIMAL, delivery_price:decimal(8,2):DECIMAL], grouped = false]
Layout: [modified_76:tinyint, numcallsclient_77:integer, address_id_78:integer, shop_id_79:integer, shopper_timeslot_id_80:integer, date_deliveries_requested_81:integer, percentage_opened_hours_82:decimal(5,2), comprea_note_83:varchar, numcalls_84:integer, gps_locked_85:tinyint, comprea_note_warehouse_86:varchar, percentage_opened_hours_in_87:decimal(5,2), uuid_88:varchar(10), total_saving_89:decimal(8,2), tag_color_90:varchar(15), num_products_taken_91:integer, total_cost_92:decimal(8,2), date_call_93:integer, tag_94:varchar(20), delivery_type_95:char(9), date_ticket_pending_96:integer, expected_delivering_time_97:integer, margin_98:decimal(8,2), weight_99:decimal(6,3), commission_100:integer, tag_signature_101:varchar(4), expected_eta_time_102:integer, frozen_products_103:tinyint, ebitda_104:decimal(8,2), id_105:integer, shopper_weight_106:decimal(6,3), driver_timeslot_id_107:integer, status_108:char(14), last_no_times_available_109:integer, last_update_110:integer, manual_charge_111:tinyint, delivery_price_discount_112:decimal(8,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, date_loyalty_tip_115:integer, num_products_116:integer, total_price_117:decimal(8,2), expected_delivering_distance_118:integer, date_shopping_119:integer, next_shopper_timeslot_id_120:integer, last_overweight_notification_121:integer, cart_progress_branch_url_122:varchar(255), shopper_total_price_123:decimal(8,2), shopper_algorithm_124:varchar, date_delivering_125:integer, cash_order_id_126:integer, note_127:varchar, comprea_note_driver_128:varchar, real_shop_id_129:integer, num_user_changes_130:integer, loyalty_tip_131:decimal(8,2), date_almost_delivered_132:integer, date_delivered_133:integer, date_started_134:integer, shopper_total_cost_135:decimal(8,2), lola_id_136:varchar(50), promo_hours_137:integer, date_waiting_driver_138:integer, expected_shopping_time_139:integer, total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2), $hashvalue_153:bigint]
Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}
$hashvalue_153 := combine_hash(bigint '0', COALESCE("$operator$hash_code"("id_105"), 0))
weight_99 := weight:decimal(6,3):DECIMAL
date_deliveries_requested_81 := date_deliveries_requested:integer:INT
fraud_rating_114 := fraud_rating:integer:INT
id_105 := id:integer:INT
date_ticket_pending_96 := date_ticket_pending:integer:INT
numcalls_84 := numCalls:integer:INT
real_shop_id_129 := real_shop_id:integer:INT
expected_shopping_time_139 := expected_shopping_time:integer:INT
total_cost_92 := total_cost:decimal(8,2):DECIMAL
tag_signature_101 := tag_signature:varchar(4):VARCHAR
delivery_price_discount_112 := delivery_price_discount:decimal(8,2):DECIMAL
tag_94 := tag:varchar(20):VARCHAR
date_shopping_119 := date_shopping:integer:INT
date_call_93 := date_call:integer:INT
manual_charge_111 := manual_charge:tinyint:TINYINT
driver_timeslot_id_107 := driver_timeslot_id:integer:INT
promo_hours_137 := promo_hours:integer:INT
shopper_timeslot_id_80 := shopper_timeslot_id:integer:INT
shopper_weight_106 := shopper_weight:decimal(6,3):DECIMAL
note_127 := note:varchar:LONGTEXT
total_saving_89 := total_saving:decimal(8,2):DECIMAL
gps_locked_85 := gps_locked:tinyint:TINYINT
percentage_opened_hours_in_day_113 := percentage_opened_hours_in_day:decimal(5,2):DECIMAL
num_products_taken_91 := num_products_taken:integer:INT
commission_100 := commission:integer:INT
last_no_times_available_109 := last_no_times_available:integer:INT
percentage_opened_hours_82 := percentage_opened_hours:decimal(5,2):DECIMAL
total_price_117 := total_price:decimal(8,2):DECIMAL
date_delivering_125 := date_delivering:integer:INT
expected_eta_time_102 := expected_eta_time:integer:INT
ebitda_104 := ebitda:decimal(8,2):DECIMAL
address_id_78 := address_id:integer:INT
shopper_algorithm_124 := shopper_algorithm:varchar:LONGTEXT
shopper_total_price_123 := shopper_total_price:decimal(8,2):DECIMAL
shop_id_79 := shop_id:integer:INT
expected_delivering_time_97 := expected_delivering_time:integer:INT
date_waiting_driver_138 := date_waiting_driver:integer:INT
loyalty_tip_131 := loyalty_tip:decimal(8,2):DECIMAL
delivery_type_95 := delivery_type:char(9):ENUM
numcallsclient_77 := numCallsClient:integer:INT
date_almost_delivered_132 := date_almost_delivered:integer:INT
date_started_134 := date_started:integer:INT
total_price_discount_140 := total_price_discount:decimal(8,2):DECIMAL
uuid_88 := uuid:varchar(10):VARCHAR
frozen_products_103 := frozen_products:tinyint:TINYINT
comprea_note_warehouse_86 := comprea_note_warehouse:varchar:LONGTEXT
last_overweight_notification_121 := last_overweight_notification:integer:INT
cart_progress_branch_url_122 := cart_progress_branch_url:varchar(255):VARCHAR
last_update_110 := last_update:integer:INT
comprea_note_driver_128 := comprea_note_driver:varchar:LONGTEXT
delivery_price_141 := delivery_price:decimal(8,2):DECIMAL
lola_id_136 := lola_id:varchar(50):VARCHAR
date_delivered_133 := date_delivered:integer:INT
num_products_116 := num_products:integer:INT
modified_76 := modified:tinyint:TINYINT
status_108 := status:char(14):ENUM
next_shopper_timeslot_id_120 := next_shopper_timeslot_id:integer:INT
shopper_total_cost_135 := shopper_total_cost:decimal(8,2):DECIMAL
tag_color_90 := tag_color:varchar(15):VARCHAR
date_loyalty_tip_115 := date_loyalty_tip:integer:INT
margin_98 := margin:decimal(8,2):DECIMAL
cash_order_id_126 := cash_order_id:integer:INT
num_user_changes_130 := num_user_changes:integer:INT
comprea_note_83 := comprea_note:varchar:LONGTEXT
percentage_opened_hours_in_87 := percentage_opened_hours_in_10:decimal(5,2):DECIMAL
expected_delivering_distance_118 := expected_delivering_distance:integer:INT
Trino results
Starting the measuring session.
Query 'Smoke test query' took 1 seconds to run and returned 1 rows.
Query 'Orders and GMV by Store and Month' took 256 seconds to run and returned 62 rows.
Query 'First order by customer' took 264 seconds to run and returned 113939 rows.
Query 'Sales by country, city, month for the last 1 months' took 257 seconds to run and returned 16 rows.
Query 'Sales by country, city, month for the last 3 months' took 234 seconds to run and returned 40 rows.
Query 'Sales by country, city, month for the last 6 months' took 237 seconds to run and returned 76 rows.
Query 'Sales by country, city, month for the last 12 months' took 230 seconds to run and returned 150 rows.
Query 'Sales by country, city, month for the last 24 months' took 225 seconds to run and returned 306 rows.
Query 'Sales by country, city, month for the last 36 months' took 227 seconds to run and returned 455 rows.
Finished the measuring session.
MySQL
Starting the measuring session.
Opening up an SSH tunnel to pre.internal.lolamarket.com
SSH tunnel is now open.
Query 'Smoke test query' took 0 seconds to run and returned 1 rows.
Query 'Orders and GMV by Store and Month' took 14 seconds to run and returned 62 rows.
Query 'First order by customer' took 324 seconds to run and returned 113940 rows.
Query 'Sales by country, city, month for the last 1 months' took 161 seconds to run and returned 16 rows.
Query 'Sales by country, city, month for the last 3 months' took 169 seconds to run and returned 40 rows.
Query 'Sales by country, city, month for the last 6 months' took 171 seconds to run and returned 76 rows.
Query 'Sales by country, city, month for the last 12 months' took 178 seconds to run and returned 150 rows.
Query 'Sales by country, city, month for the last 24 months' took 208 seconds to run and returned 161 rows.
Query 'Sales by country, city, month for the last 36 months' took 234 seconds to run and returned 167 rows.
Finished the measuring session.
Closing down the SSH tunnel... SSH tunnel is now closed.
20220726
Meeting with Ricardo:
- What operating model are we planning on running in Poland?
- What do you think are the problems with Spain?
- Why drop in-house tech and go for Instaleap?
- How do you keep the frugal philosophy while being inside Glovo?
- I'm surprised we don't have slightly more "controlling"-style metrics
- What is the story behind Mercadao?
- What is your personal policy? Are you in it for the money, for the fun?
- What do you like and what don't you like about the data team as of today?
"Only measure things that matter."
Predict demand by SKU.
My ideas:
- Keep deliveries decentralized, while being very smart in optimizing workload. The Portuguese approach seems a bit too rigid. The Spanish approach is not smart enough (plus it doesn't care much about shoppers)
- Define a holy-grail set of KPIs for operational excellence (something like €/delivery + shopper's €/hour) and start doing A/B testing of different picking+delivery tactics:
- Split picking and delivering
- Double or even triple deliveries
- Picnic approach (full-refrigerated truck, all day)
- Make a different repository for experiments
- Run final experiments with everything there
- Clean up the package repository
- Update confluence page with results
- Prepare script for training session
- What is the package used for and example
  - Run any query against MySQL or Trino
  - Measure how long it takes (wall time)
  - Used through CLI
  - Sessions are defined through a JSON config
- How to install
- Tips and tricks
- Further ideas
  - Make a richer output format
  - Include features geared towards comparison (compare several versions of a query, or the same query across engines)
Questions:
- Can it write results to a table?
20220727
Hi Fer,
I'm writing you a bit of a wall of text. We can discuss it here or jump on a call if you prefer.
Summary: we want to propose adding some indices in MySQL to improve the performance of a few queries. This is about figuring out how we coordinate.
Long version: in Data we want to build some dashboards for Spanish ops. They are dashboards for day-to-day topics like shopper availability, today's/tomorrow's orders, etc. So they need to refresh frequently and be fairly interactive.
The thing is that queries on the cart table are pretty deadly. It's typical to filter to see only orders in a given status, or with a specific date_delivered or date_started. But neither the status column nor any of the date columns has an index, so every query drags on.
We want to explore with you what options there are, figure out the smartest approach, and do it.
Let me know, thanks.
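For my own reference, the kind of DDL we would be proposing (a sketch; the exact columns, single vs. composite indices and the index names are precisely what we want to agree on with Fer):
-- Candidate indices on the cart table (MySQL). Illustrative only.
alter table cart add index idx_cart_status_date_delivered (status, date_delivered);
alter table cart add index idx_cart_date_started (date_started);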
Tables that are being used:
- address
- cart
- cart_product
- cash_order
- company
- postalcode
- shop
- shopper_timeslot
- shopper_postalcode
- user
20220728
- Try to connect my DBeaver to DW through the Jumphost indicated by Carlos
- Build script to replicate tables
- Replicate tables
- Run inserts from one DB to another through Trino
- Run the tests
- Put new indices in place
- Run the test again
- Compare
CAST(status AS CHAR(14))
CAST(delivery_type AS CHAR(9))
id = 3548 -> the troublesome cart
Stuff I discovered during my debugging of the bloody enums
- ENUMS + not strict mode -> errors turn into 0/''
- ENUMS + strict mode -> Insert breaks with unhelpful message
- LM MySQL doesn't have strict mode activated
- DW does have strict mode activated
- Easiest workaround for inserting faulty ENUM values from LM MySQL to DW:
- When defining the table in DW, make the empty string ('') one of the possible ENUM values in the field. That way, when Trino tries to write an empty string coming from LM MySQL, DW sees it as a valid value (a DDL sketch follows this list).
- For now, there seems to be no downside to this. Trouble would only happen if someone exported the table with the ENUMs as integers instead of strings and tried to compare it with LM MySQL data. But that seems unlikely.
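Roughly what that workaround looks like when creating the DW copy (a sketch: the column set and the ENUM value lists are illustrative, the only point being the leading ''):
-- DW-side sketch: allowing '' lets the 0/'' values produced by LM MySQL (non-strict mode)
-- insert cleanly under DW's strict mode. Real value lists are longer than shown here.
create table cart (
    id int not null,
    status enum('', 'pending', 'delivered', 'cancelled') not null,
    delivery_type enum('', 'express', 'programmed') not null,
    primary key (id)
);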
Retro
How can we improve as a team
- Everyone provides ideas
- We vote
- We discuss the most voted
- We take action points
20220816
- Load the missing three tables.
- Run the queries again, both through Trino and directly to DW.
DW meeting with Dani
- We both agree that we should judge more carefully before we discard MySQL
20220817
Details of DW on AWS:
- Region: eu-central-1 (Frankfurt)
- Name: data-prod-mysql
- VPC: pdo-prod (vpc-6a231802)
- VPC Security Groups: default (sg-86e633ec)
Details of the EC2 I created:
- Region: eu-central-1 (Frankfurt)
- Name: performance-query-host
- Size: t2.small
Details of the SSH key created:
- query-performance-host-key
Table design scenarios
- Base scenario: tables as they are today
- Second scenario: index on status
- Third scenario: index on status and date_delivered
- Fourth scenario: partition over status (ACTIVE) and over year of date_delivered
- Fifth scenario: index on status, year and month partitioning on date_delivered
Can't partition over date_delivered because it has null values and it would have to be part of the PK.
20220818
My update:
- Managed to connect to DW
- Playing with indices and partitions, results soon
- Spoilers:
- DW is wicked fast compared to the replica (could server size be the explanation?)
- Indices seem to help when querying DW directly
- Indices seem to have no effect when querying through Trino
- Will sit today and probably tomorrow with Pinto to build a more exhaustive query set
- Want to develop the Python package a bit more because the complexity of the results is growing exponentially
Meeting with Pinto
- Pinto shares the products query that can never be run for more than 1-2 months because it dies
- She also shares another problematic one on similar tables
- I also asked for a handful of representative, business-relevant queries, even if they don't have performance problems as of today.
Partitioning pains (a sketch of the blocked DDL follows this list):
- Columns used in the partitioning key must be part of the PK
- The table can't have foreign keys
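For the record, roughly the DDL the fifth scenario implies and where it collides with these constraints (a sketch with illustrative partition names and year-level boundaries; date_delivered is unix seconds):
-- The index on status is fine on its own.
alter table cart add index idx_cart_status (status);
-- The partitioning part is what gets blocked: date_delivered would have to join the PK
-- (it can't, since it has NULLs) and the table would have to drop its foreign keys.
alter table cart
    drop primary key,
    add primary key (id, date_delivered);
alter table cart
    partition by range (date_delivered) (
        partition p2020 values less than (1609459200),  -- 2021-01-01 UTC
        partition p2021 values less than (1640995200),  -- 2022-01-01 UTC
        partition pmax  values less than maxvalue
    );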
20220819
- Update results in confluence
- Update slide
- Update readme.md in directory
- Make new feature in package
- Develop some kind of test for the package?
20220822
- upload to s3
- register with prefect
- every flow must be in a project
- I need to set up Bash for Windows
20220823
- Fill the slide comparing old LM vs new LM vs DW
- Update the confluence page
- Update the board
- Update the readme
Stuff that is missing in the docs on how to set up Prefect
- Including the MFA line in the credentials file
- Editing the .bashrc file with the line at the end
- That prefect 1.2.2 should be installed, not prefect 2
- Set up the backend.toml and prefect.toml
- Missing botocore and boto3 packages
- Permissions in S3 for the bucket
1:1 João:
- Discuss Data Quality ideas with the team
- Look at dates for the Oporto trip:
- Pensar en team building stuff:
- Personal Development:
Steps:
- Flow that runs
- Flow that runs and logs something
- Flow that runs and queries Trino
- Flow that runs and queries Trino and prints something
- Flow that runs and queries DW
- Flow that runs and queries DW and prints something
- Flow that runs and makes an update in DW
20220825
- Update Prefect setup documentation
- Schedule training
- Oporto trip
  - Agree on dates with Dani
- Get Trino and Rancher user from Dani
- Put picture in Google account
- Schedule Data Quality session with the team
- Prepare whiteboard for Great Expectations explanation
- Think about ideas for the agenda of the Oporto trip
20220829
- Share personalities
- Post DevOps offer in Python barcelona meetup
- Prepare Great Expectations demo for the team
- Prepare roadmap proposal for GE
20220830
- Tasks in Jira
- Share materials with the team
- Write a long form in Confluence
- Drop status point on Jira task
- Write summary of performance task
Re-structuring meeting with Gonçalo
- Lidl is ending its partnership with LolaMarket
- Effective 1/09
- Lidl has not been happy with the numbers
- Lidl invested 2 million in LolaMarket. Due to the share and ownership movements, in practical terms they have lost 1 million
- Bad press because of the Rider Law
- 40% of Lola Market's business
- Spend of 200,000€ / month. The numbers don't add up, and will add up even less with the partnership ending.
- An unavoidable restructuring of the business in Spain
- It is certain that the Spanish team will have to shrink
- Tech and Data hang from HQ and are immune to this
- Pilot with Carrefour in November to become their shadow operation for same-day delivery
- Gonçalo sees a potential market of €75M in Spain alongside them.
- The opportunity to save Lola
- Next steps
  - Close the restructuring plan with Glovo and share it ASAP
20220831
The next two:
- Spot new guest users from Mercadao and insert them in the DW
- Delete sensitive data for users that have been flagged as right-to-be-forgotten
Use checksum table to test existing users equality? Could the "autoincrement" for everything policy be a slippery slope?
20220901
- Map out user ETL
- Spanish ETL
- Portuguese ETL
- Set up user etl development environment
- Share meetup ideas with João
- Create tasks for all flows
20220902
- Finish mercadao new users ETL
20220903
- Discuss deployment (all together, or deploy asap?)
- Finish my write-up on my development
20220905
- Check this to handle SSH tunnel https://discourse.prefect.io/t/how-to-clean-up-resources-used-in-a-flow/84
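The linked thread is about cleaning up resources opened inside a flow. A rough sketch of that pattern for an SSH tunnel, assuming Prefect 1.x and the `sshtunnel` package; hosts, user and key path are made up:

```python
from prefect import Flow, task
from prefect.triggers import all_finished
from sshtunnel import SSHTunnelForwarder

@task
def open_tunnel():
    tunnel = SSHTunnelForwarder(
        ("bastion.example.internal", 22),
        ssh_username="etl",
        ssh_pkey="~/.ssh/etl_key",
        remote_bind_address=("mysql.example.internal", 3306),
    )
    tunnel.start()
    return tunnel

@task
def run_queries(tunnel):
    # Connect to 127.0.0.1:tunnel.local_bind_port here and do the actual work.
    return tunnel.local_bind_port

@task(trigger=all_finished)  # runs even if upstream tasks fail, so the tunnel always closes
def close_tunnel(tunnel):
    tunnel.stop()

with Flow("ssh-tunnel-cleanup") as flow:
    tunnel = open_tunnel()
    work = run_queries(tunnel)
    close_tunnel(tunnel, upstream_tasks=[work])
```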
20220906
- Ask about health insurance -> Liliana
- Make specific goals
- Dani changes the user table in DW
- Update inserts flow
- Work with Carlos Matias to create a new SSH key for ETLs
- Get
  - All the users where (the pendings)
    - Not in app
    - Yes in DB
    - System is mercadao
    - rtbf = 0
  - 10 users where (the gones)
    - Not in app
    - Yes in DB
    - System is mercadao
    - rtbf = 1
- Check that the gones
  - [x] Do not appear in the sensitive table
  - [ ] Have different created and update times <- NOPE
- Check that the pendings
  - [x] Appear in the sensitive table
- Backup the data from the pendings
- Execute the pipeline
- Check that the pendings
  - [ ] Do not appear in the sensitive table
  - [ ] Have different created and update times
20220908
- Add retries to Trino
- Finish deleted users ETL
- Update SQL of deleted users staging tables
20220908
- Modify create table to lengthen password field
- Send to João the points to review
20220912
- Sign-up in servicedesk
Insert guests
Below is the SQL to insert a guest user. My question now is: on the normal users flow, what is preventing these same users from going in? Answer: guest users don't appear in the users table in MongoDB. They simply aren't there.
-- Insert guests users
truncate table data_dw.staging.p006_pt21_new_guests_users
insert into data_dw.staging.p006_pt21_new_guests_users (
email,
id_user_internal,
n_date,
n_row,
temp_id
)
select
a.email,
b.id_user_internal,
date_format(now(), '%y%m%d%H%i%s') as n_date,
row_number() over (partition by null order by a.email) n_row,
'guest' || cast(date_format(now(), '%y%m%d%H%i%s') as varchar) || cast(row_number() over (partition by null order by a.email) as varchar) temp_id
from
(select
distinct
lower(og.customeremail) as email
from
app_md_mysql.pdo.ordergroup og
where
og.customerid is null
and og.status is not null
) a
left join data_dw.dw_xl.dim_user_sensitive b on (a.email = lower(b.email) and b.id_data_source = 1)
where
b.id_user_internal is null
and a.email is not null
order by 1
Ok. Now we have all the guests whose email does not appear in the dw already.
Next up, get DW ids for them.
insert into data_dw.staging.p006_c01_auto_increment (
id_data_source,
key_for_anything
)
select 1 as id_data_source, temp_id as id_app_user
from data_dw.staging.p006_pt21_new_guests_users
Finally, get them inside the user table.
insert into data_dw.dw_xl.dim_user_sensitive (
)
select
from
data_dw.staging.p006_pt21_new_guests_users c
inner join data_dw.staging.p006_c01_auto_increment a on (c.temp_id = a.key_for_anything)
insert into data_dw.dw_xl.dim_user (
id_user_internal,
id_data_source,
id_app_user,
flag_is_guest_account,
flag_is_social_account,
flag_agreed_campaigns,
flag_agreed_history_usage,
flag_agreed_phonecoms,
flag_agreed_terms,
flag_email_verified,
flag_is_active,
user_role,
flag_rtbf_deleted,
name,
email,
created_at,
updated_at
)
select
a.id_user_internal,
1 as id_data_source,
c.temp_id as id_app_user,
1 as flag_is_guest_account,
0 as flag_is_social_account,
0 as flag_agreed_campaigns,
0 as flag_agreed_history_usage,
0 as flag_agreed_phonecoms,
0 as flag_agreed_terms,
0 as flag_email_verified,
1 as flag_is_active,
'ROLE_USER' as user_role,
0 as flag_rtbf_deleted,
'guest account' as name,
c.email as email,
CAST(NOW() AS TIMESTAMP) as created_at,
CAST(NOW() AS TIMESTAMP) as updated_at
from
data_dw.staging.p006_pt21_new_guests_users c
inner join data_dw.staging.p006_c01_auto_increment a on (c.temp_id = a.key_for_anything)
- Modify inserts ETL to use SSH key from AWS
20220913
- Failure on new users flow
  - Review trino retry policy
  - Review why flow appears as successful
- Work on the special guest flow
Upgrade guests into normals
- We look for users that:
  - Appear in MongoDB and do not appear in DW (new-ish)
  - BUT there is an existing guest user (`flag_is_guest_account`) with the same email.
- We then update the record behind the existing guest's internal DW id with the new data from the full-blown user we have seen in MongoDB (see the sketch below).
- Staging table: `staging.p006_pt41_users_with_existing_guest`
- `c.id_user_internal as id_user_internal_guest`
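A rough sketch of that upgrade step, not the actual flow: the staging table name comes from these notes, but the connection objects, the exact column list, and the idea that the DW update goes through a plain DB-API connection are all assumptions.

```python
def upgrade_existing_guests(trino_conn, dw_conn):
    """Flip existing guest rows into full accounts, keeping their internal DW id."""
    cur = trino_conn.cursor()
    # Pairs prepared by the staging step: the guest's internal id plus the new user data.
    cur.execute("""
        SELECT id_user_internal_guest, id_app_user, name, email
        FROM data_dw.staging.p006_pt41_users_with_existing_guest
    """)
    dw = dw_conn.cursor()
    for id_guest, id_app_user, name, email in cur.fetchall():
        dw.execute(
            """
            UPDATE dw_xl.dim_user
            SET id_app_user = %s, name = %s, email = %s,
                flag_is_guest_account = 0, updated_at = NOW()
            WHERE id_user_internal = %s
            """,
            (id_app_user, name, email, id_guest),
        )
    dw_conn.commit()
```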
- Set up the meeting for the demo and Great Expectations
Team guidelines
- Each flow should have an owner. Delegating is always possible, but we should prevent bystander effect.
- Support work is work
- Issues should be issues?
1 person, 1 week. Make a task in our board.
Vacations - ask in our internal chat
Slack alerts
- Go to api.slack.com
- Create new app
- Activate Incoming Webhooks
- Create a webhook for the channel (a minimal posting sketch follows the channel codes below)
Codes for each channel
- Data team: https://hooks.slack.com/services/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
- Data team alerts: https://hooks.slack.com/services/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
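A minimal sketch of posting through one of these incoming webhooks with `requests`; reading the URL from an environment variable (the variable name here is an assumption) keeps the secret out of the code:

```python
import os
import requests

def send_slack_alert(text: str) -> None:
    webhook_url = os.environ["SLACK_DATA_TEAM_ALERTS_WEBHOOK"]  # hypothetical env var
    response = requests.post(webhook_url, json={"text": text}, timeout=10)
    response.raise_for_status()

# send_slack_alert("Flow `new users` finished with state: Failed")
```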
20220914
- Review Dani's flow
- Password as SHA
- Hardcoding of the number 2 as Lola's id data source
- `t_004`: last line of the query — what is its purpose?
- `t_004`: is the WHERE necessary? Or the new-user flag?
- Document slack thingy
- General stuff
- Link to example flow
- As part of Prefect, talk about the file in `env` with the webhook URLs
- Describe the usage of triggers
- Review weird errors in flow 1
- Check if the current S3 flow is buggy
- Learn how to avoid false Successes -> Flow reference tasks https://docs-v1.prefect.io/api/latest/core/flow.html#flow-2 (see the sketch below this list)
- Apply slack notification to all flows
- Modify upgrade guest flow with João suggestion + review
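A small sketch of the reference-tasks idea from the Prefect v1 docs linked above. The scenario is hypothetical: a downstream task with an `all_finished` trigger succeeds even when the load fails, and since the flow state follows the terminal task by default, the run looks green.

```python
from prefect import Flow, task
from prefect.triggers import all_finished

@task
def load():
    raise RuntimeError("insert failed")  # pretend the DW insert blew up

@task(trigger=all_finished)
def cleanup():
    return "cleaned"  # runs and succeeds even when `load` fails

with Flow("reference-tasks-demo") as flow:
    loaded = load()
    cleanup(upstream_tasks=[loaded])

# Without the line below, the flow state follows the terminal `cleanup` task and
# reports Success; pinning the reference task to `load` surfaces the failure.
flow.set_reference_tasks([loaded])
```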
20220915
- Review weird errors in flow 1
  - Check if the current S3 flow is buggy
  - Learn how to avoid false Successes -> Flow reference tasks https://docs-v1.prefect.io/api/latest/core/flow.html#flow-2
- Modify update flow so that it checks all fields, even those that nowadays are fixed
- Modify upgrade guest flow with João's suggestion regarding `date_registered` + review the same concept on the insert user flow
- Put docstrings in functions in flows
- Refactor all flows to not use the first staging table
- Flow 01
- Flow 02
- Flow 03 ISSUE WITH THE SHA, THIS ONE STILL USES IT. BUT NO TROUBLE IF IT'S THE ONLY ONE
- Flow 04
- Flow 05
-
Apply slack notification to all flows
- Flow 01
- Flow 02
- Flow 03
- Flow 04
- Flow 05
-
Schedule all flows
- Flow 01
- Flow 02
- Flow 03
- Flow 04
- Flow 05
20220919
- Finish docs
- General
- Flow
- 01 -> Review references
- 02
- 03
- 04
- 05
- Table Scripts
- staging.p006_c01_auto_increment
- staging.p006_pt41_users_with_existing_guest
- staging.p006_pt21_new_guests_users
- staging.p006_pt01_current_users
- staging.p006_pt11_modified_users
- staging.p006_pt31_deleted_users
- staging.p006_pt02_current_users_new
- Share Datacamp account with Ana Martins
20220920
- Orders
  - Basic MD flow
- Check how to force Prefect to return true in a flow regardless of anything
1:1 João
- Barcelona Meetup
- Berlana and the blog?
- I'll talk now with Liliana
- Feedback
- Any news from Poland?
1:1 Liliana
- PR for tech recruiting
  - Blog
  - Meetup
- Relationship with unis
  - We can get creative here
  - I have a regular flow of smart Management and Economics students in my agenda
- She comes from another start-up (Digital Therapy Software)
Python meetup Veriff
- How to propose a talk?
- Veriff is originally from Estonia
Fraud detection
- Rivo, Fraud Engineering Lead
- The founder got the idea from faking his own age. Lol.
- Do you implement a red team / blue team approach?
- Testing feels very difficult because your system is very stateful. How do you deal with it?
- In-memory data for fast queries with millions of records... how much memory were you using?
QR Codes
Ismael Benito
QR codes are two-dimensional bar codes
How could we use QR codes in Lola?
I need to do cool QR codes for my classes at UPF
- How well supported is it? Is my phone stupid?
- Any applications of this besides having fun? -> The chromatic ink stuff
20220921
- I have made the following query:
SELECT o.id as id_order,
og.id as id_order_group,
UPPER(og.customerid) as id_user,
u.id_user_internal,
u.id_app_user,
gu.id_user_internal,
gu.id_app_user
FROM app_md_mysql.pdo."order" o
LEFT JOIN app_md_mysql.pdo.ordergroup og
ON o.ordergroupid = og.id
LEFT OUTER JOIN data_dw.dw_xl.dim_user u
ON UPPER(og.customerid) = u.id_app_user
LEFT OUTER JOIN data_dw.dw_xl.dim_user gu
ON lower(og.customeremail) = gu.email
WHERE u.id_data_source = 1
AND gu.id_data_source = 1
AND u.flag_is_guest_account = 0
AND gu.flag_is_guest_account = 1
And it returns records like this:
This is strange. Each order should match with a user OR a guest, but not both.
I'm going to start looking into specific cases. The following one is the first:
| id_order | id_order_group | id_user | id_user_internal | id_app_user | id_user_internal | id_app_user |
|---|---|---|---|---|---|---|
| 1136592 | 1159257 | 6140A68762E0C1003FD59CD3 | 6694660 | 6140A68762E0C1003FD59CD3 | 7183239 | guest22091216010722939 |
STEP I want to:
- Check all the orders made by `6329A8BF6ED53300400DBD5E`
- Check the entries of both `dw_user_id=8669726` and `dw_guest_id=7182737` to explore what they look like.
RESULT Users that have changed their email and match with a guest user.
I must match orders with users, first by user, then by guest.
To achieve it by user:
- Match `og.customerid` with `dim_user.id_app_user`
- For those that don't match this way: match `og.customeremail` with guest accounts
The Bug shared by Janu
These are the faulty users:
| id_user_internal | id_data_source | id_app_user | date_joined | date_registered | date_uninstall | flag_is_active | user_role | gender | id_app_source | ip | app_version | user_agent | locale | priority | substitution_preference | default_payment_method | flag_gdpr_accepted | flag_agreed_campaigns | flag_agreed_history_usage | flag_agreed_phonecoms | flag_agreed_terms | flag_email_verified | flag_rtbf_deleted | flag_is_guest_account | flag_is_social_account | login | name | password | phone_number | id_facebook | id_stripe | id_third_party | poupamais_card | created_at | updated_at | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7177183 | 1 | guest2209121601074003 | 1 | ROLE_USER | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | brucacajope1@gmail.com | guest account | 2022-09-12 16:01:17.000 | 2022-09-12 16:01:17.000 | ||||||||||||||||||||
| 6722529 | 1 | 5EFCBE721D257E004C026D07 | 1 | pt | 0 | REPLACE | CREDIT_CARD | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | brucacajope1@gmail.com | 10219829617349375@facebook | Miriam Vaz | 87435096261dd1a22e25e55c91f045789a67ac85 | 963844518 | 10219829617349375 | 2446061022274 | 2022-09-09 09:21:42.000 | 2022-09-09 09:21:42.000 | ||||||||||||
| 7178877 | 1 | guest22091216010721636 | 1 | ROLE_USER | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | raquelsantosmota@gmail.com | guest account | 2022-09-12 16:01:17.000 | 2022-09-12 16:01:17.000 | ||||||||||||||||||||
| 6708840 | 1 | 60244B928C1878004124EDBE | 2021-02-10 21:09:38.000 | 1 | pt | 0 | REPLACE | CREDIT_CARD | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | raquelsantosmota@gmail.com | raquelmota@agpico.edu.pt | Raquel Mota | fc46053c49bc85c2d919dbdde184254b6dd27156 | 965182320 | 2446010860094 | 2022-09-09 09:21:42.000 | 2022-09-09 09:21:42.000 | ||||||||||||
| 7174727 | 1 | guest22091216010720657 | 1 | ROLE_USER | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | pedro.carreira74@gmail.com | guest account | 2022-09-12 16:01:17.000 | 2022-09-12 16:01:17.000 | ||||||||||||||||||||
| 6800726 | 1 | 62D1A2CAA8426C003F1975F2 | 2022-07-15 17:24:26.000 | 1 | pt | 0 | REPLACE | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | pedro.carreira74@gmail.com | 10224176747459562@facebook | Pedro Carreira | 939867084 | 10224176747459562 | 2022-09-09 09:21:42.000 | 2022-09-09 09:21:42.000 |
Things I observe:
- For each of them, there is one guest and one user version.
- The guest versions have been created after the user version was created.
Hypothesis
- The new guest flow is not prepared for cases where an existing user makes a purchase as a guest.
- It should check if the guest email exists in `dim_user`, but it doesn't, hence it creates the user again.
- Given that the user never appears as new, since it already exists in DW, the guest version will never be upgraded to user.
Result The hypothesis is rejected.
The new guest flow does watch out for existing users and guests in dim_user properly. It does so by matching by email.
- Alicia buys as guest
- Alicia gets inserted as guest in DW
- Alicia registers as user
- Alicia's existing guest gets upgraded to user
- Alicia buys as guest
- Alicia gets inserted as guest in DW
20220922
- Review Dani's flows
  - 02 Update users
    - `ETL_PATH` with env, otherwise the team can't run it
    - `p006_lm21_gest_users` <- typo
    - Pass slack channel from parameter, see new orders ETL flow for inspiration (Parameter sketch after this list)
    - `user_agent` gets capped to `VARCHAR(250)`. Is that length reasonable, or could it fall short?
    - Lowercase all emails so that comparisons are always proper (email is case insensitive)
    - The flow seems to do steps 1-4 for both users and guests, but then task 5 only updates users? Where are guests updated then?
- 01
  - lower email
  - slack noti + scheduled param
  - `__file__`
  - `date_joined` and `date_registered`
  - `flag_is_active`
- 02
  - lower email
  - slack noti
  - + scheduled param
  - `__file__`
  - `date_joined` and `date_registered`
  - `flag_is_active`
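A tiny sketch of the "pass slack channel from parameter" point above, assuming Prefect 1.x; the default value and the notify body are placeholders:

```python
from prefect import Flow, Parameter, task

@task
def notify(channel: str):
    # Look up the webhook URL for `channel` and post the run summary there.
    print(f"would notify {channel}")

with Flow("update-users") as flow:
    slack_channel = Parameter("slack_channel", default="#data-team-alerts")
    notify(slack_channel)

# flow.run(parameters={"slack_channel": "#data-team"})
```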
20220927
- Review the failed flows
  - 01
  - 02
  - 03 - was it changed?
- Request a user for Francisco
- Follow the instructions to get Github access
- Keep on with fixing user flows
  - 03
    - lower email <<<<<<<<<<<<<<<<<-------- same query returns different amount of rows in Trino and Python???? I left a SQL script (script-15) where both queries are there. I can use it for debugging.
      - Could the lower function be different in Trino and MySQL?
        - It seems that `LOWER` in Trino works just fine.
        - In MySQL, `LOWER` and `LCASE` are supposed to do the same thing. I wrote a query to test it and
    - slack noti
    - + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
    - EXTRACT DATE
    - UPDATE_AT in update
  - 04
    - lower email
    - slack noti + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
  - 05
    - lower email
    - slack noti + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
- Work on orders
Meeting with Gonçalo
- Focus on operations because Poland removes our front-end
- What freedom/opportunities/limitations do we have to market biedronka.pl
- 100€ free delivery... inflation
- Game theory: kidnap and squeeze
- White label is just a commodity
- Technology is shared with Instaleap.
- What does Instaleap want from us?
- How happy is Berlana with Instaleap? Quality issues?
Meeting with Ricardo
- We review the roadmap
- Ricardo explains that they have discussed with Jerónimo Martins around catalogs per store and stockouts.
  - We only have one catalog per retailer, even if we know that each store does not carry exactly the same catalog
  - JM argues that stockouts happen because our catalog is simplified, not because their operations are flawed
  - To settle the discussion, the catalogs for a couple of stores have been fixed manually to be hyper-accurate. With this, we test the following hypothesis: does having an accurate catalog reduce stockouts for that store significantly?
- Biedronka launch
  - Single shopper vs picker-driver
  - Order of Biedronka expansion by geography
  - Recurrence
  - Best drivers
For next one:
- How can we improve AOV, CPO and Stockouts
- How much money are we losing to credit card fees in Spain?
20220929
- Document user flow 006
- Follow the instructions to get Github access
- Keep on with fixing user flows
  - 03
    - lower email <<<<<<<<<<<<<<<<<-------- same query returns different amount of rows in Trino and Python???? I left a SQL script (script-15) where both queries are there. I can use it for debugging.
      - Could the lower function be different in Trino and MySQL?
        - It seems that `LOWER` in Trino works just fine.
        - In MySQL, `LOWER` and `LCASE` are supposed to do the same thing. I wrote a query to test it and
    - slack noti
    - + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
    - EXTRACT DATE
    - UPDATE_AT in update
  - 04
    - lower email
    - slack noti + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
  - 05
    - lower email
    - slack noti + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
Work on the order ETL
Okay. Since I haven't worked on this for days, I actually kind of forgot what the issue was and what I was doing about it. I'll start again from the initial point: trying to join orders to customers in the orders ETL.
Customers can be linked to orders in two ways:
- Through Mercadão's user ID.
- Through the order email.
Theoretically, with an up-to-date dim_user table:
- All orders should be linked to a customer by one of the two methods.
- Orders should only be matched by one method (if we find a user id, we should not match again by email)
- I'm going to take a look at the query I was building to check if this is the case.
After running the query with a few adjustments, I come to the following observations:
- The query matches the following way:
(screenshot of the match-type counts omitted)
- Most match by ID.
- Double matches don't worry me, since we can just prioritise using the app id match instead of the user email one.
- The not matched are the ones that worry me. I need to look into those.
Hypothesis I'll explore the following way:
- First, confirm that all non-matching orders do not have a user app id. This will confirm that we can narrow down the issue to email matching.
- If the previous is true:
- I'll get one of the non-matching orders and save the email.
- I'll go into dim user and see if I can find the email with a wrong case.
- My expectation is that I can find it. If this is not the case... I need to get creative to think what comes next.
Result
- All customer ids are null. So, I confirm that all non-matching orders are guest orders that should be matched by email.
- I pick order `106889`
  - The customer email is `conceicaorento@sapo.pt`
  - The email appears in `dim_user`, associated to a registered user.
Aha-moment: when matching by email, I should NOT only do that for guest users. There are guest orders that should be matched by email to upgraded users. The order doesn't have a customer id because it was made from a guest identity, but later we upgraded the guest to user; so there will be no matching guest account, and matching through the app user id is not possible because the order was placed in the guest past.
So, matching should be like:
- If the order has a user id, match that way
- If not:
- If order matches email with user account, use user account
- If order matches email with no user account, use guest account
- If no account, problem
Hypothesis Applying the previous matching procedure should result in no orders unmatched.
Result Unmatched orders are down from 79,174 to 1,989. We have reduced the problem, but issues still remain.
Hypothesis I'll explore the following way:
- First, confirm that all non-matching orders do not have a user app id. This will confirm that we can narrow down the issue to email matching.
- If the previous is true:
  - I'll get one of the non-matching orders and save the email.
  - I'll go into dim user and see if I can find the email with a wrong case.
  - My expectation is that I can find it. If this is not the case... I need to get creative to think what comes next.
Result
- Okay, first, for the orders that have a `customerid`:
  - There are non-matched orders with customer ids, and their IDs don't appear in `dim_user`. My guess is that the following is happening:
    - User registers
    - User places orders
    - User deletes itself from Mercadão
    - ...time passes...
    - We start feeding `dim_user`
    - End result: we never had the chance to store the user in `dim_user` since it was deleted before we ran it.
  - Options to move ahead:
    - Drop the orders: ugly as hell, potentially acceptable since it's only 2,000 orders of deleted users.
    - Make `id_user = -1` for these orders. Not that intuitive for analysts. And if we have another exception some day, we will have to do `id_user = -2` or something like that and things will start tangling up.
    - Create a new user flow for Mercadão where we infer deleted users from orders and include them in `dim_user`. A bit more work, but I think it's clean and will pay off long term.
- For the customers that don't:
  - I select the order with id `66017`
  - The customer email is `SONIASTC@SAPO.PT`
  - And I look for it in `dim_user`... and it's there, with the same case. Why didn't it match??? Well, obviously, because the `dim_user` email is in UPPER, and my query is lowercasing stuff.
  - If I rerun again ensuring cases are fine:
    - Does the same user appear? -> No
    - How many non-matched-by-email users appear? -> only 22
    - Out of these 22:
      - 6 are recent orders that were placed after the last Users ETL run, so that's fine
      - The other 16 are super old orders and all of them come from users with an email from Jerónimo Martins. When we insert guests into `dim_user`, we ignore order groups with status null. I'll do the same for this ETL.
Query for debugging in case I need it again in the future:
SELECT
o.id AS id_order,
og.id AS id_order_group,
UPPER(mdu.id_app_user) AS id_app_user,
mdug.email AS guest_user_email,
CASE
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NULL THEN mdu.id_user_internal
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NOT NULL THEN mdug.id_user_internal
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NOT NULL THEN mdu.id_user_internal
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NULL THEN -1
END AS dw_user_id,
CASE
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NULL THEN 'matched_by_id'
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NOT NULL THEN 'matched_by_email'
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NOT NULL THEN 'double_match'
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NULL THEN 'not_matched'
ELSE 'wut???'
END AS match_type,
o.*,
og.*
FROM
app_md_mysql.pdo."order" o
INNER JOIN app_md_mysql.pdo.ordergroup og
ON
o.ordergroupid = og.id
LEFT JOIN
(
SELECT
du.id_user_internal,
du.id_app_user
FROM
data_dw.dw_xl.dim_user AS du
WHERE
du.id_data_source = 1
AND du.flag_is_guest_account = 0
) AS mdu
ON
UPPER(og.customerid) = mdu.id_app_user
LEFT JOIN
(
SELECT
du.id_user_internal,
du.email
FROM
data_dw.dw_xl.dim_user AS du
WHERE
du.id_data_source = 1
) AS mdug
ON
LOWER(og.customeremail) = LOWER(mdug.email)
WHERE
o.status IS NOT NULL
AND og.status IS NOT NULL
AND CASE
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NULL THEN 'matched_by_id'
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NOT NULL THEN 'matched_by_email'
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NOT NULL THEN 'double_match'
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NULL THEN 'not_matched'
ELSE 'wut???'
END = 'not_matched'
USE PARTITIONING TO SOLVE DOUBLE MATCHES BY PRIORITISING THE USER
Snippet to add deleted users to dim_user
https://drive.google.com/file/d/1Iz_NypqRx2WgtgxNxWit0E5XggS85K7P/view?usp=sharing
20220929
- Documenting and running the deleted users thingy
- Follow the instructions to get Github access
- Keep on with fixing user flows
  - 03
    - lower email <<<<<<<<<<<<<<<<<-------- same query returns different amount of rows in Trino and Python???? I left a SQL script (script-15) where both queries are there. I can use it for debugging.
      - Could the lower function be different in Trino and MySQL?
        - It seems that `LOWER` in Trino works just fine.
        - In MySQL, `LOWER` and `LCASE` are supposed to do the same thing. I wrote a query to test it and
    - slack noti
    - + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
    - EXTRACT DATE
    - UPDATE_AT in update
  - 04
    - lower email
    - slack noti + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
  - 05
    - lower email
    - slack noti + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
20221003
- Document user flow 006
- Follow the instructions to get Github access
- Keep on with fixing user flows
  - 03 WATCH OUT, THE COMPARISON IS NOT CASE SENSITIVE
    - lower email <<<<<<<<<<<<<<<<<-------- same query returns different amount of rows in Trino and Python???? I left a SQL script (script-15) where both queries are there. I can use it for debugging.
      - Could the lower function be different in Trino and MySQL?
        - It seems that `LOWER` in Trino works just fine.
        - In MySQL, `LOWER` and `LCASE` are supposed to do the same thing. I wrote a query to test it and
    - slack noti
    - + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
    - EXTRACT DATE
    - UPDATE_AT in update
  - 04
    - lower email
    - slack noti + scheduled param
    - Set the `date_joined` of existing guests to the date of their first order. How?
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
  - 05
    - lower email
    - slack noti + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
      - Current flow is leaving `date_registered` null (bad)
      - Current flow is "destroying" `date_joined` from the previous guest user entry. Should it be simply carried over?
    - `flag_is_active`
- Run an update on the email column to lowercase the shit out of it
- Solve the partitioning problem
| id_order | id_order_group | id_user |
|---|---|---|
| 1084918 | 1113194 | 9166273 |
| 588939 | 654522 | 9246797 |
| 664680 | 728505 | 9021328 |
Query to review if there are any uppercase emails in `dim_user`
SELECT dim_user_email_has_upper, COUNT(1)
FROM
(SELECT
a.email AS pdo_email,
b.email AS dim_user_email,
BINARY a.email <> BINARY LOWER(a.email) AS pdo_email_has_upper,
BINARY b.email <> BINARY LOWER(b.email) AS dim_user_email_has_upper,
a.email = b.email AS naive_equality,
lower(a.email) = lower(b.email) AS lowered_equality
from
staging.p006_pt01_current_users a
left join dw_xl.dim_user b on (UPPER(a.id_app_user) = UPPER(b.id_app_user) and a.id_data_source = b.id_data_source)
where
b.id_user_internal is not null
and b.id_app_user is not NULL
) email_stuff
GROUP BY dim_user_email_has_upper
20221006
- Remove unnecessary OVER PARTITION in status and date
- Make a note of the parameter change to use `updatedat` in the MD Order flow
Now, on the `orderwithuser` part, a couple of questions:
- You don't actually need to do the over partition and then the distinct. You can just do the max() and a group by o.id. It should be faster than performing the over partition.
- The `mdug` subquery, by bringing the email on the select and joining by the lower, can actually return more than 1 row on the select if the email gets lowercased and is not in the table. I know this is probably handled on dim_users, and will be solved with the max on the main query. So no worries, just to be aware of this.
Not sure about this, but we should probably limit `updatedat` to not be today, otherwise you will get users that don't exist in dim_user yet.
20221013
- Index
  - Modify table definition to include index
  - Change code back to deleting by id + fetch by creation date
- Discuss with João the issue with using `update_at` to limit the scope of the order flow
- Drop large tables from query experiment
20221014
Order columns batch 2
For each column below, the sub-tasks are: created col in `staging`, created col in `dw_xl`, included in first query, included in second query, approved.
- `id_picking_location`
- `id_retailer`
- `id_team`
- `date_created`
- `date_updated`
- `status_order`
- `total_price_ordered`
- `date_delivery_slot_start`
- `date_delivery_slot_end`
- `date_last_delivery_slot_start`
- `date_last_delivery_slot_end`
- `delivery_fee`
- `delivery_fee_discount`
- `delivery_fee_original`
- `delivery_type`
20221017
- Order LM V1
The stupid non-null case
What I was doing
- I was running an `INSERT INTO` a table.
- This included inserting into column `X`. Column `X` is nullable. I was aware that some of the values I was inserting were nulls... but that shouldn't be a problem since column `X` is nullable.
- The query was being run on Trino.
What I expected to happen: that the data just gets inserted properly.
What actually happened
- The query failed with an exception.
- The error message read: `trino.exceptions.TrinoUserError: TrinoUserError(type=USER_ERROR, name=CONSTRAINT_VIOLATION, message="NULL value not allowed for NOT NULL column: X", query_id=20221017_135640_00663_f6qyk)`
- I was puzzled since, again, column `X` is nullable.
Weird, let's research
- After toying around a bit, I ran the same query without inserting into column `X`, but keeping many other columns in place.
- After running, I got the same error message I was getting for column `X`, but this time for column `Y`.
- This was also puzzling, because column `Y` is nullable as well.
- I kept taking nullable columns out of the `INSERT`, and the error message kept jumping to some other column every time.
Finding the root cause and solution
- Eventually, I was left with 4 columns, `A`, `B`, `C` and `D`, that were NOT nullable.
- After running the `INSERT` once more, I got the same error for column `A`.
- I checked the data I was trying to insert and there were null values for column `A`. Hence, this time the error made sense: there were null values going into `A`, but `A` is not nullable.
- I modified the schema definition to make `A` nullable and ran the insert again with just `A`, `B`, `C` and `D`. It inserted properly and threw no errors.
- I then re-executed my original query with all the columns I wanted to insert. Now it worked just fine!
Take-away
If Trino ever gives you an error like `trino.exceptions.TrinoUserError: TrinoUserError(type=USER_ERROR, name=CONSTRAINT_VIOLATION, message="NULL value not allowed for NOT NULL column: X", query_id=20221017_135640_00663_f6qyk)`, completely ignore the column being mentioned. Trino will just lie to your face repeatedly. Instead, restrict your operation to only the columns that can't be null and find what part of the data is breaking the constraint.
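A hedged sketch of that debugging approach: instead of trusting the column Trino names, check which genuinely NOT NULL target columns contain nulls in the data about to be inserted. The connection, table and column names are placeholders.

```python
NOT_NULL_COLUMNS = ["a", "b", "c", "d"]  # the columns that really are NOT NULL in the target
SOURCE_TABLE = "data_dw.staging.some_staging_table"

def find_null_violations(trino_conn):
    """Return the NOT NULL columns that would break the constraint, with their null counts."""
    checks = ", ".join(f"count_if({col} IS NULL) AS {col}_nulls" for col in NOT_NULL_COLUMNS)
    cur = trino_conn.cursor()
    cur.execute(f"SELECT {checks} FROM {SOURCE_TABLE}")
    counts = dict(zip(NOT_NULL_COLUMNS, cur.fetchone()))
    return {col: n for col, n in counts.items() if n > 0}
```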
It looks like the LM order ETL is joining several users to one single order or tag. I need to look into it. Maybe Dani made some mistake in the user ETL.
20221026
- Review João's review
- `VARCHAR(36)` agreement
  - Write convention on `VARCHAR(36)` for ids
  - Modify existing columns and table scripts for `fact_order`
- The trip expenses
- Fill in details of tasks created by João
- For each column below, the sub-tasks are: created col in `staging`, created col in `dw_xl`, included in first query, included in second query, approved.
- `id_picking_location`
- `id_retailer`
- `id_team`
- `date_created`
- `date_updated`
- `status_order` (naive version)
- `total_price_ordered`
- `date_delivery_slot_start`
- `date_delivery_slot_end`
- `date_last_delivery_slot_start`
- `date_last_delivery_slot_end`
- `delivery_fee`
- `delivery_fee_discount`
- `delivery_fee_original`
- `delivery_type`
- Status
  - Check cash order
  - Check with Bruna
  - Make design proposal
- Deal with the multicurrency bit
  - Local currency should come from the currency field, not be hardcoded
  - Apply the exchange rate. See https://bi.mercadao.pt/question/1344-lm-cart.
- MAKE USER A `NOT NULL` FIELD AGAIN! (Right now it's not, because the LM ETL for `dim_user` is buggy)
- Ask specifically for Dani to review with a focus on the delivery slots fields
20221103
Shutting down
Lolamarket operations will die today.
- Is the company still up? No.
- Who will stick around? Data and tech.
- Completely shut down? Today?
- Branding. Are we also going to kill the brand? The brand continues.
- Timeline and paperwork
  - Timeline?
  - Difference between a brand-new company and a subsidiary?
  - As long as we do it calmly, all good
20221104
- Ricardo's request: add % lost sales to the dashboard
- Find out how we could do the "dirty read thingy"
- Try to make an environment where I can work with the GE notebooks
Given a ready-to-insert/update subset of data (Mirror) and a table (Target), there is no expectation that applies to Target but not to Mirror. Hence, the same expectation suite that fully validates the Mirror in an ETL flow can also fully validate the Target in a monitoring flow.
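A small sketch of that Mirror/Target idea with the legacy Great Expectations API: the suite is built once against the mirror and the exact same suite is reused to monitor the target. The DataFrames and column names below are placeholders.

```python
import great_expectations as ge
import pandas as pd

mirror_df = pd.DataFrame({"email": ["a@x.pt", "b@x.pt"], "id_user_internal": [1, 2]})
target_df = pd.DataFrame({"email": ["a@x.pt", None], "id_user_internal": [1, 3]})

# Build the suite on the Mirror (ETL-side validation).
mirror = ge.from_pandas(mirror_df)
mirror.expect_column_values_to_not_be_null("email")
mirror.expect_column_values_to_be_unique("id_user_internal")
suite = mirror.get_expectation_suite()

# Reuse the exact same suite on the Target (monitoring-side validation).
target = ge.from_pandas(target_df)
result = target.validate(expectation_suite=suite)
print(result.success)  # False here, because of the null email in the target
```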
20221107
- Clean up Sandbox
- Review Pinto's comment
  - Re-run and check if the duplicate persists -> The error persists.
  - If it persists, see what part of the monster JOIN is the one joining more than once and change what's necessary -> I just noticed it's from Mercadão, not Lolamarket. Change of plans.
  - First, I'm gonna check how many orders in each system have duplications. -> It's only Mercadão.
  - I have noticed that the Mercadão flow is still doing the truncate with Trino. This could be the reason behind the error. I'm going to check if the staging table has duplicates. -> I fixed it and it looks like that was the guilty bit. Re-executing returned no duplicates, so I'll consider it good for now unless Ana spots anything again.
20221108
Q&A for Promotech SL death
- "All seniority will be taken into account."
- What is a "Mercadão Entity"? Is it an SL company in Spain owned 100% by Mercadão? -> It's just Mercadão itself with legal presence in Spain.
- Do we have a timeline? Not that I'm in a hurry, just out of curiosity -> "Our goal is for your transition to be completed at most by mid December"
20221110
- Review Janu's expectations
- Use unique
- Explain trick on query vs table to sum all flags together
- Explain REGEX to id_app_user
- Share this: https://legacy.docs.greatexpectations.io/en/latest/reference/glossary_of_expectations.html
20221115
The Great Fixing of the Order ETLs
Put the trim in the Lolamarket order status
- Solved
Check the DELIVERING order having null status for payment
- Solved
Another thing, if you look at the orders from October, some of the orders don't have the correct status... For example: `select * from data_dw.dw_xl.fact_order where id_order = 2041500;` In this case, the status on the operational table is `delivered` and in `fact_order` it is `LOYALTY_CARD_ERROR`
- TLDR: I can't reproduce the issue
- I checked and `fact_order.original_app_status = 'deleted'` and `fact_order.id_operational_status = 4`, which correctly corresponds to `cancelled`
- Query: `select op_status.*, o.* from dw_xl.fact_order o LEFT JOIN dw_xl.dim_operational_status op_status ON o.id_operational_status = op_status.id where id_order = 2041500`
Why is the date delivered of this order `2022-09-30` instead of `2022-10-01`? `select * from data_dw.dw_xl.fact_order where id_order = 9715131;`
- Could you elaborate on the rationale behind why `date_delivered_closed` should be `2022-10-01`?
Why for order `2857238` do we have differences in the fields `delivery_fee_original_eur` and `delivery_fee_discount_eur`? (I'm using the `public.exchange_rate_eur` for the calculation)
- I can't find any problem. Could you explain what you think the following numbers should be? (These are for order `2857238`)
| delivery_fee_eur | delivery_fee_original_eur | delivery_fee_discount_eur | delivery_fee_local | delivery_fee_original_local | delivery_fee_discount_local |
|---|---|---|---|---|---|
| 0.3400 | 4.0000 | 3.6500 | 1.7200 | 19.9900 | 18.2700 |
- Check why it doesn't add up to 4. Solved: it was because the hardcoded values were not of type `DECIMAL` and Trino is an idiot.
- Trim `ROLE_USER`
- No orders from users with `ROLE_TEST`
  - Option A: we remove `ROLE_TEST` users from `dim_user`
  - Option B: we keep `ROLE_TEST` users in `dim_user` and add a filter in the `fact_order` ETL to remove those users
- Add the timezone field to orders
  - LM
  - MD
- Should we remove `cart = deleted`? YES
- Find an alternative `date_created` for Lolamarket orders where the current logic doesn't have a value:
  - FROM_UNIXTIME(cart.date_deliveries_requested) AS date_created,
- Do the change about the 100€ cart and other details of the delivery fee
Check if dates are in UTC. Checked. All is UTC atm.
I don't know if I already said this, but for this old order (`3127`) at Mercadão, we should consider filling the `deliveryslotdate` using the date of the real delivery. In these cases, we shouldn't consider the `date_last_delivery_slot_end`. It should be null, as should the `date_last_delivery_slot_start`, because these values don't make sense: `select * from data_dw.dw_xl.fact_order where date_last_delivery_slot_start is null and id_data_source = 1;`
- What is the rule to obtain the "date of the real delivery"?
- "Should be null as the `date_last_delivery_slot_start` because ..." <- Sorry, I'm not following. Order `3127` does not have a null value in `date_last_delivery_slot_start`
- It's only this order: manually copy over the value from the `last` field to the original one.
I'm not sure if we talked about these specific cases: `select * from data_dw.dw_xl.fact_order where date_last_delivery_slot_start is null and id_data_source = 1;` But I think that the `date_last_delivery_slot_end` should be null because, if you check, it doesn't make sense: we don't have slots finishing at 00:30. Maybe you could add this to the manual fixes.
- Manual fix
- Check why these 5 orders have no `date_delivered_closed` even though they are in status `delivered`
SELECT *
FROM data_dw.dw_xl.fact_order fo
WHERE fo.date_delivered_closed IS NULL
AND original_app_status = 'delivered'
20221123
- User ETL
  - Refactor to only pass the config from the expectations and the query
  - Make new bucket
  - Make new schema
  - Copy over in other flows
    - Update MD
    - Guest MD
    - Upgrade MD
    - New lola
    - Guest lola
    - Update Lola
- Per-field checks:
  - `id_user_internal`: usual
  - `id_data_source`: 2
  - `id_app_user`: like `id_user_internal`
  - `date_joined`: usual
  - `date_registered`: usual
  - `user_role`: {"ROLE_USER", "ROLE_WARNING", "ROLE_TEST", "ROLE_SUPER_WARNING", "ROLE_BANNED", "ROLE_INCIDENCES", "ROLE_ADMIN", "ROLE_SUPER_ADMIN", "ROLE_AMBASSADOR", "ROLE_SUPER_AMBASSADOR", "ROLE_B2B"}; not null
  - `gender`: {"masculine", "femenine", "none"}
  - `id_app_source`: {"1", "2", "3"}; not null
  - `ip`: ip regex
  - `locale`: "es"
  - `priority`: int between 0 and 100
  - `substitution_preference`: {"call", "nothing", "shopper"}; not null
  - `email`: regex
  - `login`: regex
  - `id_stripe`: the regex
  - `poupamais_card`: null
  - `created_at`: usual date stuff
  - `updated_at`: usual date stuff
  - Flags: `flag_is_active`, `flag_gdpr_accepted`, `flag_agreed_campaigns`, `flag_agreed_history_usage`, `flag_agreed_phonecoms`, `flag_agreed_terms`, `flag_email_verified`, `flag_rtbf_deleted`, `flag_is_guest_account`, `flag_is_social_account`
  - `phone_number`: regex
Performance Review Guidelines
- 3 moments
  - Write self-review
  - Manager (João) provides feedback
  - 1:1 with manager
- Timeline?
- Every year the same?
20221213
Docker image research
Chapter 1 - Pulling the image
If we want to be able to freely modify and update our prefect-flow images, first we must understand what's inside the current one, since we need to keep the current production flows up and running. And if we want to look into it, first we need to have it on a local machine.
My trouble began here: I tried installing docker in my Ubuntu WSL, but whenever I tried to run any docker command, I got the error message: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?. Googling this was a pain in the ass. There are a thousand reasons for this error to happen and there are a million combinations of Windows + WSL + Docker setups which apparently have all sorts of different behaviours.
I finally managed to get something working by doing this:
- Get docker working
- I installed Docker desktop in Windows.
- I installed the Debian WSL from the Microsoft Store.
- In the Docker Desktop GUI, I went to Settings (the little gear, top right) -> Resources -> WSL Integration and enabled the integration with the Debian engine.
- Debian can now use docker (as long as this Docker Desktop GUI is up and running).
- Note: our current Ubuntu WSL is not usable for this because it is a WSL Version 1 (don't ask me what this means). On the other hand, the Debian I installed is a WSL Version 2, which is correctly identified by the Docker Desktop GUI.
- Get permission to pull the image from AWS ECR
- Whoever wants to pull an image from our ECR must have the following policy in AWS: `AmazonEC2ContainerRegistryReadOnly` (note that there are higher-level policies such as `AmazonElasticContainerRegistryPublicPowerUser` or `AmazonElasticContainerRegistryPublicReadOnly` that would also work, but the read-only one is enough for this purpose).
- Get `awsume` set up in the Debian env (you can follow instructions here: https://awsu.me/general/quickstart.html)
- Get your credentials set up in `~/.aws/credentials` and use `awsume`
- Login into our AWS Docker Registry with this command: `aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin 373245262072.dkr.ecr.eu-central-1.amazonaws.com`
- You should be able to pull images from the repo like this: `docker pull 373245262072.dkr.ecr.eu-central-1.amazonaws.com/pdo-data-prefect:latest`
- Whoever wants to pull an image from our ECR must have the following policy in AWS:
A good resource to understand the AWS login stuff that's going on under the hood, so that it doesn't feel like dark magic:
- Video explaining how to login to the AWS CLI with MFA: https://www.youtube.com/watch?v=EsSYFNcdDm8
Chapter 2 - Understanding what does our image have
Once the image was in my hands, the next part was understanding what was inside. I researched a bit and apparently there are a couple of ways to reverse-engineer a Dockerfile from an image (https://appfleet.com/blog/reverse-engineer-docker-images-into-dockerfiles-with-dedockify/, https://stackoverflow.com/questions/48716536/how-to-show-a-dockerfile-of-image-docker).
To do so, I ran the following command and got the following result:
$ docker history 223871741b8a
IMAGE CREATED CREATED BY SIZE COMMENT
223871741b8a 6 months ago /bin/sh -c #(nop) ENTRYPOINT ["tini" "-g" "… 0B
<missing> 6 months ago /bin/sh -c #(nop) COPY file:6068d9a0511b2a94… 795B
<missing> 6 months ago /bin/sh -c pip install trino 267kB
<missing> 7 months ago /bin/sh -c #(nop) ENV LANG=C.UTF-8 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV LC_ALL=C.UTF-8 0B
<missing> 7 months ago /bin/sh -c #(nop) ENTRYPOINT ["tini" "-g" "… 0B
<missing> 7 months ago /bin/sh -c #(nop) COPY file:e1bbbe4447dfaf1e… 795B
<missing> 7 months ago |4 BUILD_DATE=2022-04-27T21:09:36Z EXTRAS=al… 505MB
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.bu… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.vc… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.ve… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.ur… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.na… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.sc… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL io.prefect.python-v… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL maintainer=help@pre… 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV LANG=C.UTF-8 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV LC_ALL=C.UTF-8 0B
<missing> 7 months ago /bin/sh -c #(nop) ARG BUILD_DATE 0B
<missing> 7 months ago /bin/sh -c #(nop) ARG GIT_SHA 0B
<missing> 7 months ago /bin/sh -c #(nop) ARG EXTRAS 0B
<missing> 7 months ago /bin/sh -c #(nop) ARG PREFECT_VERSION 0B
<missing> 7 months ago /bin/sh -c #(nop) CMD ["python3"] 0B
<missing> 7 months ago /bin/sh -c set -eux; savedAptMark="$(apt-m… 11.4MB
<missing> 7 months ago /bin/sh -c #(nop) ENV PYTHON_GET_PIP_SHA256… 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV PYTHON_GET_PIP_URL=ht… 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV PYTHON_SETUPTOOLS_VER… 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV PYTHON_PIP_VERSION=22… 0B
<missing> 7 months ago /bin/sh -c set -eux; for src in idle3 pydoc… 32B
<missing> 7 months ago /bin/sh -c set -eux; savedAptMark="$(apt-m… 28.1MB
<missing> 7 months ago /bin/sh -c #(nop) ENV PYTHON_VERSION=3.7.13 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV GPG_KEY=0D96DF4D4110E… 0B
<missing> 7 months ago /bin/sh -c set -eux; apt-get update; apt-g… 3.11MB
<missing> 7 months ago /bin/sh -c #(nop) ENV LANG=C.UTF-8 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV PATH=/usr/local/bin:/… 0B
<missing> 7 months ago /bin/sh -c #(nop) CMD ["bash"] 0B
<missing> 7 months ago /bin/sh -c #(nop) ADD file:8b1e79f91081eb527… 80.4MB
I quickly realised most of the image didn't look like something Ana might have modified manually, but rather standard layers. I read the Prefect docs on the topic (https://docs-v1.prefect.io/orchestration/flow_config/docker.html) and then everything clicked in my mind:
- Prefect provides a standard image to run the flows in.
- The only thing Ana had done was to extend that image by installing the Trino Python client in it.
This is great news because it means we don't need to do any black magic to make new images that can be used by the old flows. We can simply grab the standard prefect image, add our required packages and add any other changes we do and that's it.
Chapter 3 - From now on
- Try to make a new image and check if it works -> Yes!!!
- Design our image building pipeline
- Create a Git repository to document our images.
https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html
- Do the performance review self-assessment
- Make new repository and organize image modification pipeline
- Manage to get a custom package inside our image
- Start packaging things
  - Connections
  - Slack alerts
  - General bloat
  - Great Expectations
  - Common patterns
Hand-off with Dani
- DMS (Data Migration Service)
- The thingy that moves data from LM Replica to Redshift
- DMS is configured to be real-time
20221214
Payment process for Poland
- How will the shopper invoicing work in Poland?
- Which data should we look at in the DB?
  - Shopper Bill?
  - New stuff in the Flex schema?
- What is "the division between Delivery fees and CPO"?
20230102
- Fix the bug with the LM update users pipeline
- Finish PR
- Tag production image and clarify in docs that latest and production are synonymous
- Finish the cute drawing
- Review CVs
- Make the hello-world package
- Uploading to pip
  - Create new S3 bucket for Python packages
  - Write generic guide on how to make any package uploadable to the S3 bucket
  - Make specific guide on how to upload `bom-dia-mundo` to S3
- Modify build process to include custom packages
  - Create config file with list of packages and versions
  - Create script to make temp copy of it
  - Manage to get it installed
  - Run bom-dia-mundo
  - Document and discuss with João
- Prepare development plan with João
- Review the data architecture summary with João
- Document Metabase downtime, update, etc.
- Store creds for AWS
- Fix lm update users
- Take a look at API thing
- Document procedure to open connections in Looker for future reference
- Ideas for Dani's gift
  - A haircomb and a hairdresser giftcard
Interview methodology
- Technical interview
- Small E2E, case, common sense
- Checklist of tool questions
- Silly coding task, automated
20230103
Looker connection to data_dw
- RDS: https://eu-central-1.console.aws.amazon.com/rds/home?region=eu-central-1#database:id=data-prod-mysql;is-cluster=false
- Security Group: https://eu-central-1.console.aws.amazon.com/ec2/v2/home?region=eu-central-1#SecurityGroup:groupId=sg-0275d4096c2890cea
- Subnets
- https://eu-central-1.console.aws.amazon.com/vpc/home?region=eu-central-1#SubnetDetails:subnetId=subnet-42945a29
- https://eu-central-1.console.aws.amazon.com/vpc/home?region=eu-central-1#SubnetDetails:subnetId=subnet-8a955be1
- https://eu-central-1.console.aws.amazon.com/vpc/home?region=eu-central-1#SubnetDetails:subnetId=subnet-0eaed173
- https://eu-central-1.console.aws.amazon.com/vpc/home?region=eu-central-1#SubnetDetails:subnetId=subnet-79b4b803
- VPC: https://eu-central-1.console.aws.amazon.com/vpc/home?region=eu-central-1#VpcDetails:VpcId=vpc-6a231802
20230109
- Chase Carlos for the connection
- Send code assignment to review
- Fix the data_dw config myself
  - Stop the instance
  - Make a backup of the instance -> https://eu-central-1.console.aws.amazon.com/rds/home?region=eu-central-1#db-snapshot:engine=mysql;id=snapshot-for-subnet-config-changes-20230103
  - Start the instance again (otherwise the networking stuff can't be modified)
  - Move it to the default-vpc-4012372b (pdo-uat) subnet group
  - Move it back to the pdo-prod-public subnet group
  - Add again the security group with the IP exceptions (it was removed with the subnet group migrations)
  - Delete the snapshot?
- Send message to Liliana
- Upgrade of the replica to MySQL 8.0
  - Create chained replica of existing replica on MySQL 8.0
    - When creating this, I couldn't specify the MySQL version to be 8.0. It's either not allowed, or I missed it. Instead, I created the chained replica with 5.7 and afterwards upgraded it to 8.0
- Upgrade the second replica to MySQL 8.0
- Check that it replicates properly
- Modify Looker LM connection to point to second replica
- (Exploring, testing and debugging can start at this point, with some potential downtime while we do changes in the guts of MySQL)
- Perform a recoverable snapshot/backup/something of the old read replica
- Test recovering from the snapshot/backup
- CHECKPOINT only move forward if both of these are true
- We have confirmed that Looker + MySQL 8 connector satisfies our needs
- We feel comfortable rolling back the MySQL old replica from v8 to v5 if we screw up
- Execute upgrade of old replica to MySQL 8
- Switch Looker connection again to point to old replica
- Check (if something below breaks, grab logs and rollback)
- ETLs reading from old replica work fine
- Looker explore works fine
- Nothing else has broken
- Cleanup actions
- Remove second read replica
- After some prudent time (couple weeks, a month?), remove snapshots/backups from old replica in MySQL 5 version
- Related doc
- General info on read replicas https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_MySQL.Replication.ReadReplicas.html
- General info on upgrades https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_UpgradeDBInstance.MySQL.html
The busy day
- Create chained replica of existing replica on MySQL 8.0
-
Validate old dependencies on second replica
- Point Trino - Backup old connection config ofapp_lm_mysql.properties-connector.name=mysql connection-url=jdbc:mysql://lolamarket-rr2.cnlivtclari7.eu-west-1.rds.amazonaws.com connection-user=trino-bi connection-password=WdRpC6aHS4n7UnG2K9z case-insensitive-name-matching=true- Modify connection URL - After this, Trino is still reading from the old replica (validated by throwing queries through Trino and observing the monitoring pages of both replicas) - Reboot Trino somehow - We used the redeploy button in Rancher on the worker workload - It worked. After a couple of minutes, the redeploy was successful. Queries now hit the second replica instead of the old one.- Point ETL - It is done automatically after Trino points there. The ETLs never connect directly to the LM replica - Just run the ETLs and check if it runs- Point Redshift Migration
Backup current endpoint configuration so we can go back to it -> it's simply the RDS endpointStop the database migration tasks (otherwise we can't modify the endpoint)Modify endpoint to point to second replica instead of old oneStart again the task- It breaks with an error
Last Error Failed to connect to database. Task error notification received from subtask 0, thread 0 [reptask/replicationtask.c:2800] [1020414] Error 2019 (Can't initialize character set unknown (path: compiled_in)) connecting to MySQL server 'lolamarket-rr-mysql8-test.cnlivtclari7.eu-west-1.rds.amazonaws.com'; Errors in MySQL server binary logging configuration. Follow all prerequisites for 'MySQL as a source in DMS' from https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html or'MySQL as a target in DMS' from https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.MySQL.html ; Failed while preparing stream component 'st_0_KURC747DBGGXBD44ND6JF6DFZ4'.; Cannot initialize subtask; Stream component 'st_0_KURC747DBGGXBD44ND6JF6DFZ4' terminated [reptask/replicationtask.c:2808] [1020414] Stop Reason RECOVERABLE_ERROR Error Level RECOVERABLE- We decide to:
    - Make a new parameter group and set binlog_format to ROW (https://eu-west-1.console.aws.amazon.com/rds/home?region=eu-west-1#parameter-groups-detail:ids=lolamarket-replica-8-custom-params;type=instance) (see the sketch after this checklist)
    - Assign this parameter group to the second replica
    - Reboot the replica
    - Restart the DMS task
- Pivot. We decide to leave DMS dead.
- Before closing on the DMS topic, repoint the DMS endpoint to the old replica, knowing that it will fail once it gets upgraded to v8
- Validate... how?
- By checking if new carts appear in Redshift
Check that Metabase can connect to MySQL 8 -> It can
- Point old dependencies back to old replica
- Point Trino to old replica
  - Change config back to how it was
  - Redeploy workers
  - Validate with the query game again
- Upgrade old replica to MySQL 8
  - Take snapshot while in 5.7
  - Go to the Modify menu and bump to the latest version
  - Give it some time
- Validate that everything is fine
- Connect Looker to old replica on MySQL 8
- Remove second replica
- ~~Make stories for things that we could deprecate~~
  - ~~Redshift and DMS~~
  - ~~Metabase contents reading from redshift~~
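Going back to the binlog_format step in the checklist above: we did it through the console, but this is roughly the equivalent scripted with boto3. A minimal sketch only; the instance identifier is an assumption, and only the parameter group name comes from the notes.

```python
# Hedged sketch: create a custom parameter group with binlog_format=ROW and attach
# it to the second read replica. The instance identifier below is illustrative.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")

GROUP = "lolamarket-replica-8-custom-params"
REPLICA = "lolamarket-rr-mysql8-test"  # assumed instance identifier

rds.create_db_parameter_group(
    DBParameterGroupName=GROUP,
    DBParameterGroupFamily="mysql8.0",
    Description="Custom params for the MySQL 8 chained replica",
)
rds.modify_db_parameter_group(
    DBParameterGroupName=GROUP,
    Parameters=[{
        "ParameterName": "binlog_format",
        "ParameterValue": "ROW",
        "ApplyMethod": "immediate",
    }],
)
rds.modify_db_instance(
    DBInstanceIdentifier=REPLICA,
    DBParameterGroupName=GROUP,
    ApplyImmediately=True,
)
# Reboot so the new parameter group takes effect on the instance.
rds.reboot_db_instance(DBInstanceIdentifier=REPLICA)
```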
-
Read requirement documents and add my comments
- Capacity: Setting app
- I agree with Carlos Matias on the non-overlapping of the delivery areas. Implementing this in a web UI might be a nightmare. We might be better off using some GIS software such as QGIS and writing a little guide on how to create the polygons with it + implementing some kind of validator to prevent user mistakes.
- If there are areas within areas, a strong hierarchy control must be in place. Each child area should only have one parent.
- The data team should have a way to obtain the polygon files from the areas. Something programmatic would be ideal, since as we grow the number of areas might make manual downloads infeasible.
- It's key that the history of areas is kept. Removing or modifying an area should not delete the old data, but rather archive the old version. If we fail to do this, looking at historical data will be a mess and we won't be able to answer questions regarding operational areas properly.
- Orchestration: Dispatching orders
- What happens to orders that have been rejected by all shoppers?
- How are orders assigned to a fleet?
- Scenario: customer Alice places an order. Shopper Zac sees the order and accepts it. Some time afterwards, customer Alice cancels the order. How does shopper Zac learn about this? Is there any notification? Does shopper Zac have access to a list of cancelled-by-the-user orders? The rationale for my question is: as a shopper in the field, you are not proactively checking your entire order list constantly (you get a bit of tunnel vision and focus only on the order at hand and perhaps the next). I think if there is no notification to push the cancel "in the face" of the shopper, it could go unnoticed and the shopper might be planning his day assuming that order still needs to happen.
- Capacity: Setting app
-
Draft technical interview case
-
Code
Ygor Gomes
- From North Brasil
- 7 years as DevOps
- Knows Hive, Databricks, Hadoop
- Married, 2 kids
- He asks for docs. Boss
20230116
Interview with Afonso
-
Information Systems
-
Learn about ETLs and Dashboarding
-
Data Structures and Algorithms
-
Databases and relational models
-
IESE Business school (Summer)
-
AESE Business school (Summer)
-
Shopper in Mercadão (#7)
- Suggested a change in ticketing systems
-
Logistics startup
- Fill empty trucks of different providers
Java, Python, Prolog, Assembly, HTML/CSS, Databases, T-SQL
My Impression
- Overall
- Afonso is a university student with a good profile. He has the gaps that one would expect from not having work experience, but he has the beginner skills and knowledge that are needed to start in our world and seems to be smart. Hiring him will mean investing time and effort in building his skillset. But I think he is smart enough to learn fast and eventually become a good engineer.
- A disclaimer: because of the previous points, I wouldn't advise hiring Afonso on a part-time basis. Afonso will face a non-trivial ramp-up period until he is productive within the team, so working part-time could translate into waiting months before he is net positive for the team.
- Even though I genuinely think Afonso could work with us as a Junior member, I think we should try to assess slightly more experienced candidates who would be comfortable with our compensation. As much as Afonso shows great potential for a fresh graduate, the right profile with 1-2 years of experience could be orders of magnitude more useful to the team.
- On the case
-
Was the candidate able to draw a complete solution from the data sources (Pingo Doce, Orders database) to the business users (Finance team)?
- Afonso was able to draw a rough but complete processing and monitoring scheme to go from raw data to the final data that would be needed.
- On the other hand, he struggled to find the right component to host data to be accessed from Looker, and didn't suggest how to turn email data into S3 files.
-
Was the candidate curious about the business requirements? Did he make an effort to deeply understand what his business colleagues want to achieve so that he can propose the most suitable solution? Did he think out of the box to realize what was "the request behind the request"?
- Afonso asked good initial questions to better understand the situation, both data and architecture wise. He asked early if data was available to check and get an idea of what he needs to work on.
- On the other hand, Afonso didn't dive a lot into how the solution would change the finance team's way of working. I would have expected a bit more curiosity around finance's problem and brainstorming over what the solution would be from their point of view, regardless of how we built it in the backend.
-
Were the tools and techniques chosen by the candidate the best ones for each component in the system?
- Afonso made good proposals on using lambda functions to host Python code and S3 buckets to handle the data as flow of files.
- On the other hand, he didn't cover how to fetch an email into S3 and, more importantly, couldn't correctly identify the need for a SQL database in order to store the processed information for Looker to access it.
-
Did the candidate challenge the fact that Pingo Doce sends the orders through email, and explored the idea of having an alternative ingestion method that is more powerful and consistent?
- No.
-
Did the candidate realize that a very simple, version 1 solution could be delivered to the finance team to provide early value, while more advanced solutions could be worked on?
- No.
-
Did the candidate ask to see examples of the raw data?
- Yes, early. And made good remarks, paying attention to what info was contained, what were the primary keys and how it should be turned into a more structured format than excel.
-
Did the candidate show an interest for the size of the data being handled, in order to take scalability matters into account when designing the system?
- No.
-
Was the candidate able to decompose the request into different, independent user stories? Such as: ingest and clean the data, automate the detection of unmatched or inconsistent orders, automate the suggestion of data correction.
- No.
-
- Additional points
- Surprisingly, Afonso didn't have a single question to ask at the end of the interview.
20230208
Interview with Timóteo
He's from Oporto
Started in micro IT helpdesk, scaled up to IT infrastructure. Then went into big data support:
- Bash
- Ansible
- "I did mostly scripts and firmware upgrades"
- Azure, Cloudera
Now in Jumia:
- Building ETL pipelines in Airflow
- Python and SQL coding, first experiences "I'm doing basic pipelines"
- Doing Udemy course in Python
- "I consider myself a junior"
- Acts as a bit of an admin on rancher
- Apache Nifi
- AWS
- He's not sure about many things. Red flags
He will be fired in Jumia in February.
Tools
Checklist
- Python
- Git
- SQL
- Trino
- Github
- Jira
- Confluence
- Looker
- AWS
My Impression
-
Overall
- Timóteo has very little experience in Data Engineering. His knowledge is mostly about administering and maintaining infrastructure rather than designing and developing systems. He has trouble coming up with detailed designs for how to store, move and transform data. His approach to the challenge was unstructured, and most of the time he had trouble communicating his proposal in a clear and understandable manner.
- I would propose not moving forward with Timóteo. I think his experience adds little value towards the position and I don't think his soft skills are enough to compensate for it.
-
On the case
-
Was the candidate able to draw a complete solution from the data sources (Pingo Doce, Orders database) to the business users (Finance team)?
-
Was the candidate curious about the business requirements? Did he make an effort to deeply understand what his business colleagues want to achieve so that he can propose the most suitable solution? Did he think out of the box to realize what was "the request behind the request"?
-
Were the tools and techniques chosen by the candidate the best ones for each component in the system?
-
Did the candidate challenge the fact that Pingo Doce sends the orders through email, and explored the idea of having an alternative ingestion method that is more powerful and consistent?
-
Did the candidate realize that a very simple, version 1 solution could be delivered to the finance team to provide early value, while more advanced solutions could be worked on?
-
Did the candidate ask to see examples of the raw data?
-
Did the candidate show an interest for the size of the data being handled, in order to take scalability matters into account when designing the system?
-
Was the candidate able to decompose the request into different, independent user stories? Such as: ingest and clean the data, automate the detection of unmatched or inconsistent orders, automate the suggestion of data correction.
-
-
Additional points
20230214
Meeting with Ygor
-
Databases
- Lola backend
- Mercadão backend
- DW
-
Trino
-
Prefect
-
Other stuff in AWS
- ECR
- S3 buckets
- for flows
- for great expectations
- for python packages
-
Wishlist
- Improve Trino uptime
- UAT Prefect server
- Improve automations on CI (docker, python packages)
- Airbyte
20230217
Interview with Ricardo
- Ricardo's profile
- Python
- Machine Learning
- Exact optimization, metaheuristics and simulation
- Professional experience and role at UPF
- Macroeconomics
- IB at UPF and Computational Engineering and Mathematics at UOC
- Banc Sabadell
- BI
- About me
- About the course
Only afternoons, from 3pm onwards. Seminars.
Interview with Rui
- From Porto, lives in Lisbon
- Started working at Caixa General as a Business Intelligence Data Engineer
- First project: started the infrastructure from scratch with Apache Nifi
- Groovy, Java
- Sonae, Azure cloud
20230221
Silent failing flows for MD users
-
Why are they failing?
  - Because the connection object passed to the great expectations task doesn't have the raw_user attribute anymore (it used to).
  - This was introduced in the last release of the dim user project (1.0.3), when the connections stopped being managed by the hardcoded tasks and were refactored to use lolafect, which does not implement the raw_user and raw_password trick.
  - Because of this, connecting to run the GE test fails.
-
Why are they not sending alerts?
  - Because of how we named the transactions and basically built the flow, none of the final_tasks fails when t_00X fails.
  - The slack messaging is configured to send an alert only if any of the final_tasks fails, which is why no message is getting sent.
-
How to fix
  - The simplest option is to migrate the data test to lolafect already.
  - It's important that the final_tasks design mistake gets fixed so that, in the future, the slack messages get sent.
  - I was about to do this, but I see you have already started work beyond the open release and are implementing this, so I'll just leave it be and let you carry on once you are back (or we can discuss if I should take over).
Fix
Okay, we are going to work together. Here is some context for you to be aware of.
- We will be writing code for a Prefect 1 flow.
- We are using Great Expectations to evaluate some data on a MySQL database.
Review user flows failing because lolafect does not support the raw user and password trick, and why the slack alerts do not happen
- Play around with transactions in the Python Trino client to understand them and how we can use them better in ETLs
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
20230223
Discussion on Afonso and Rui
-
Liliana
- She thinks both are good culture-fit wise
- Afonso pro: operations knowledge
- Afonso pro: ambitious, eager
- Afonso dis: he might leave if he is not challenged enough
- Rui pro: knowledge and experience
- Rui cons: more expensive
-
Me
- Let's go for Rui if we want someone who knows his stuff, let's go for Afonso if we want to truly have a junior
Rui questions (deadline: yesterday). Afonso: ambitious, multinational.
20230308
Airbyte vs Fivetran
- Criteria
- Cost
- Fivetran: 5 digits per year. Susceptible to large bills due to developer mistakes. + DevOps cost for integration.
- Airbyte: infra cost (maybe 2K per year or something like that) + DevOps cost for maintenance.
- Winner: airbyte will easily be 10 times cheaper with current needs. As needs increase, the gap will widen.
- Security
- Fivetran: data travels through their infra. It does so encrypted, which theoretically means they see nothing. But we still need to keep an eye on the topic to ensure we are happy with however they are doing things. Credentials storage???
- Airbyte: everything stays at home. We do have the responsibility of security ourselves, but it shouldn't be much of a headache. Airbyte should only be accessible within our VPN and with password based access.
- Winner: airbyte makes this much simpler.
- Connectors:
- Fivetran:
- Has connectors to everything we need.
- The evolution will probably be slow (they only have so much software firepower)
- Developing custom connectors is off the table.
- Airbyte:
- Has connectors to everything we need. Some of them are in alpha and beta though. Regardless, it should probably mature fast.
- The evolution will probably be fast (open source, everyone and their mother will join)
- Developing custom connectors is on the table. It's designed for it.
- Winner: short term, it's a tie. Long term, Airbyte has the upper hand.
- Fivetran:
- Transformations: both work with dbt. I don't see any significant differences on this point.
- Other
- By simple game theory, Airbyte will outpace Fivetran. It's a solid, production-ready, open-source copycat of Fivetran. Adoption will eventually outpace Fivetran and the network effect of open source will generate a positive spiral.
- No lock-in: we do not really depend on Airbyte (as a company) for anything. If some day they go stupid, we can still use the older versions. Also, a community fork would probably appear with the general interest in mind (see PrestoSQL and Trino as an example)
Guillermina
She works at Kadre
Tether, previously Luma.
About me
- Education and experience
- Writing professional grade Python for many years
Offer
- I am familiar with the problem of grid balancing, and I feel attracted by it
- Position
- Team,
- who do I report to,
- who else is there
- Why is there an army of interns?
- (I see you are hiring for many positions)
- Selection process
- Next: get to know the client, with Martim or Luis
- Technical interview
- Where is the company at, maturity wise
- Business model
Founders: Luis and Martim
Connecting to thousands of chargers in real time. Only two people + 5 interns. They are looking for two profiles: backend/devops. Direct hiring with the company.
Why Luma
- Interesting problem
Why not lolamarket
4f0215f77a0b4e33b1ca802fc21fa6cf
20230320
Townhall
- Ricardo and Gonçalo are leaving the company.
20230324
Q&A
- Will XL be an independent division? -> Yes.
- Do we need to go to the office like Glovo guys -> No.
Slack messages not sent
Initial symptoms
The following flows failed due to a Trino outage, but the slack warnings were not sent:
006_md_03, 006_md_04, 006_md_05, 003_md_01, 003_lmpl_01, 004_md_01, 006_lmpl_01, 006_lmpl_02, 006_lmpl_03, 013_010, 007, 017_pt_01
The following flows failed due to the same outage, but did send the slack message correctly:
011_md_01, 011_lm_01, 015_md_02, 001_20
First exploration
To explore this issue, I focused on the flow 013_010, version 1.0.3 (the version that was scheduled in the prefect server).
- The logs of the failed run showed that the slack message task had been skipped:
Task 'SendSlackMessageTask': Finished task run for task with final state: 'Skipped'
- I decided to reproduce the error:
- First I ran the flow locally as-is: the execution was successful.
- I then executed again the flow locally while manually introducing an exception in the same task that failed in the scheduled run due to Trino. The Slack message was sent successfully.
- This was confusing. I was expecting the same behaviour as with the failed prefect server run
- I reviewed the logs of the prefect server run that failed and didn't send the slack message again.
- I noticed that, besides the slack message task, the ssh tunnel closure task also had been skipped:
Task 'close_ssh_tunnel': Finished task run for task with final state: 'Skipped'
- This was reasonable, since the ssh tunnel is not used in the prefect server. Nevertheless, it made me wonder: is the usage of the ssh tunnel responsible for the differences between local runs and server runs? There was no a priori reason to think so, but it was the only difference I could find between my local environment and the prefect server one.
- I then observed the code once more and realized that the slack message task had as its upstream tasks all the final tasks.
final_tasks = [transaction_end, trino_closed, dw_closed, tunnel_closed, t_090]
[...]
send_message = send_warning_message_on_any_failure(
webhook_url=channel_webhook,
text_to_send=warning_message,
upstream_tasks=final_tasks
)
- And that the closure of the SSH tunnel was considered one of the final tasks.
- Theoretically, this shouldn't matter. The slack message task is configured with a trigger of any_failed, so whether the ssh tunnel closure task was successfully run or skipped was irrelevant as long as some other final task was failing (which was the case).
- Nevertheless, I decided to try the following:
  - Modify the code to remove tunnel_closed from the final tasks.
  - Upload this version to the prefect server, along with a manually induced exception on the same task that failed previously due to Trino.
  - Observe the logs.
- Result: it worked! This time, the slack message task triggered properly and sent the message.
In conclusion, it seems that having the tunnel closed task in the final tasks somehow forces the slack message task to skip.
Open questions
- Why the hell does that happen?
- Is this the reason for all the silent failing flows we've seen? Or is this only specific to flow 013-010?
- Are there other flows that do not have tunnel_closed as part of the final tasks, and hence are sending the slack messages correctly when failing?
Next actions
- Answer the open questions
- Apply the fix to all flows that have tunnel_closed in their final tasks.
Answering open questions
- Is this the reason for all the silent failing flows we've seen? + Are there other flows that do not have tunnel_closed as part of the final tasks, and hence are sending the slack messages correctly when failing?
  - I pick a sample of the other flows to check this.
  - Faulty flows:
    - 006_md_03 -> Has tunnel_closed in final_tasks
    - 003_md_01 -> Has tunnel_closed in final_tasks
    - 006_lmpl_01 -> Has tunnel_closed in final_tasks
    - 007 -> NA, extremely outdated, doesn't use lolafect
  - Properly working flows:
    - 011_md_01 -> Has tunnel_closed in final_tasks
    - 011_lm_01 -> Has tunnel_closed in final_tasks
    - 015_md_02 -> Has tunnel_closed in final_tasks
    - 001_20 -> Has tunnel_closed in final_tasks
  - The hypothesis is not holding. tunnel_closed can be in final_tasks and everything will work fine.
  - But I have observed the following difference between the flows in both blocks: the faulty flows are using two case blocks to deal with connecting with or without an SSH tunnel. The working flows are using the pre-lolafect approach.
- Why the hell does that happen?
  - Given the answer to the previous questions, it seems that the following pattern in flows is the one that causes the issue (see the sketch below):
    - The ssh tunnel opening happens within a case block that doesn't activate when running on the prefect server.
    - The output of the ssh tunnel opening task gets referenced in the ssh tunnel closure task.
    - The output of the ssh tunnel closure task is listed as a final task.
    - The list of final tasks is passed to the slack warning task as the list of upstream tasks.
  - The solution seems easy:
    - Either put the ssh tunnel opening output outside of the case block,
    - or do not include the ssh tunnel closure output in the final tasks.
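To make the pattern concrete, here is a minimal, hypothetical Prefect 1 flow that mirrors the structure described above. All task names, the tunnel logic and the webhook are placeholders, not the real flow code.

```python
# Hedged sketch of the problematic wiring in a Prefect 1 flow.
from prefect import Flow, Parameter, case, task
from prefect.triggers import any_failed


@task
def open_ssh_tunnel():
    return "tunnel-handle"  # placeholder


@task
def close_ssh_tunnel(tunnel):
    pass  # placeholder


@task
def t_001():
    raise RuntimeError("simulated Trino outage")  # the failing task


# NOTE: Prefect 1 tasks default to skip_on_upstream_skip=True, so a Skipped upstream
# (like tunnel_closed below on the server) can skip this task even with trigger=any_failed.
@task(trigger=any_failed)
def send_warning_message(webhook_url, text_to_send):
    pass  # would post to Slack here


with Flow("sketch") as flow:
    use_tunnel = Parameter("use_tunnel", default=False)

    with case(use_tunnel, True):
        # On the prefect server this branch never activates, so the tunnel tasks
        # end up Skipped and drag the downstream alert into Skipped as well.
        tunnel = open_ssh_tunnel()
    tunnel_closed = close_ssh_tunnel(tunnel)

    work = t_001()

    final_tasks = [tunnel_closed, work]
    send_warning_message(
        webhook_url="https://hooks.slack.com/...",  # placeholder
        text_to_send="flow failed",
        upstream_tasks=final_tasks,
    )
```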
Dummy SSH tunnel task test
I decided to test the following hypothesis:
- If, instead of only creating the task output ssh_tunnel within the case block, we also create a dummy output with the same name earlier, outside of the case, does it fix the issue?
- No, it doesn't. This is puzzling me even harder.
- Fixed the evil slack error that prevented slack messages from being sent on failure.
Fixing flows
- 006_md_03: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- 006_md_04: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- 006_md_05: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- 003_md_01: https://github.com/lolamarket/data-003-etl-xl-order/pull/9
- 003_lmpl_01: https://github.com/lolamarket/data-003-etl-xl-order/pull/9
- 004_md_01: done
- 006_lmpl_01: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- 006_lmpl_02: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- 006_lmpl_03: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- 013_010: https://github.com/lolamarket/data-013-etl-xl-exchange-rate/pull/8
- 017_pt_01
Update users failed double flow
-
The flow failed
-
It's because the GE data test failed
-
The GE data test failed on an expectation on uniqueness for id_app_user
The quarantine was not made because the code has a bug. The trigger is not the right one. This happens in (at least) all the MD user flows.
-
In which PR was this introduced?
- It seems that it was in 1.1.2.
- It also affects LMPL 2 and 3.
- It affects all MD flows
- Send EVO payment
- User for Looker
- Check data issue on MD03
- Review PR for João https://github.com/lolamarket/data-003-etl-xl-order/pull/10
- Keep working on the evil slack bug
- Clean up quarantine flow
- Add onboarding steps for Afonso
Martim
-
My thousand questions
- Background
- What is the story of Tether? When did you start, what has happened since then, where are you today?
- What is your roadmap?
- How is the equity of the company structured?
- Product/Service
- Who pays who for what?
- Frequency Containment Reserves:
- TSO: how much power are you going to produce
- They act as a bidder on the power market
- CPO Charge Point Operator
- Owners get smart charging (charge when price is cheap)
- What is the potential for this market?
- How the hell are you going to control thousands of charging stations? Isn't it tremendously expensive to run such field operation? What do the numbers look like?
- How are you going to make the experience for car owners not a complete pain in the ass?
- What is your geographical scope?
- How does the evolution of the electric vehicle fleet impact your own roadmap?
- Who pays who for what?
- Team and position
- What is the current team composition? What other positions are you hiring for?
- Why do I only see a bunch of interns?
- How would I fit in? What would my work look like during my first year?
- What is your personal work culture? What do you like and don't? Could you describe me people and projects you enjoyed and people and projects you despised?
- Background
-
Martim
- Portuguese
- Bored about uni, wanted to go abroad
- Luis, meet in the masters, worked in energy in the US
The Pitch podcast - Bodyguard of the grid
Beta version running on computers. Scheduling from partners. Test version by end of the year (30-40 vehicles at first, up to 8000 at some point). Revenue at the beginning of the year. Hiring for a devops engineer.
Cloud advisor, business advisor, ML advisor (might join the team)
-
Optimization Engineer <- me
-
Devops Engineer
-
Managing next employees
Onelabel databases
-
Fer are going for a relational model, probably Aurora MySQL
-
How does this play with events?
-
Conventions for data types (money, dates)
What I would like:
- Clear documentation
- Representation layer decoupled from
- Something airbyte-able
20230517
Glovo Data Platform meeting
Briefing
We meet with:
- Kiran
- Simone
- Charly
They are somehow related to Glovo's event data platform. We meet to learn about how they work so we can get some inspiration. Opening questions:
- How the data platform works and how it is structured
- What type of access we need to drive events there and handle them
- What is possible to do and what is not
- What kind of dependencies there will be (PR approvals, team roadmap dependencies, etc.)
Notes
Simone and Charly come around.
Charly -> Manages the central data engineering team. Simone -> Manager in one of the central data teams (data platform creation team (ingesting and transforming data))
Streaming vs Batch If we don't need to mix streaming with batch data, life will be better.
Simone:
- Keep it with streaming native technologies as far as you can
- When it comes to dumping events:
- We are on Kafka (Confluent managed). We used to be on Kinesis.
- We used Kafka connect to dump data on S3. We can share the tool we use with you to dump if you live under the same infrastructure as us.
Who should we talk to about the Kafka?
What would be the ideal stack for streaming?
- Enrichment and transformation engine (Flink, KSQL, Materialized, near-real time materialization engines)
- After that, two options:
- Specialized database to dump (for example Druid)
- Trino/Starbust/Presto to throw queries
Glovo Data Platform client team
Briefing
We talk with this team, who apparently are using the data platform, to learn what their experience has been like.
Attendants
- Pablo Butron: data product manager
Notes
His team is focused on consuming customer activity events on the front-end with the goal of reproducing the behaviour of users in the app.
He thinks we should talk with Engineering, not with the Data Platform. Engineering takes care of designing the backend. Data platform only puts the infra, but doesn't "fill it" with events in any way.
They are creating a Data Mesh. They had a monolith database that is being produced. Tier structure. Tier 0 is core stuff with strong SLAs (like orders).
They are not constrained or permissioned by Data Platform. They can implement their own databases and infrastructure. They share databases with other data teams (so intermediate states of )
Declarative data products. Build products in less than a couple of hours.
Data Platform provides importers that you need to access core data.
They use Amplitude and Looker for reporting to end-users.
Most of the products serve internal reporting or external partners (like McDonalds).
They don't write back into the event streams themselves. They only act as consumers of the event bus.
20230518
- ~~Pinto says:~~
  - ~~I'm looking at the order items table, and for this order 2383028, the replacement products are missing. I think it's because you add the info when the order was created, but the delivery was a few days after. So I think in the process, maybe you should only add data on orders that were already delivered.~~
20230601
Onelabel infra with Marcos
- Event driven architecture
- Three gateways: one for retailers, another one for the backoffice apps, another one for the shopper apps
- Auth0 for authentication
- Assortment: reusing PHP service existing in Lolamarket. A re-implementation will be done.
- One database per service. AuroraMySQL 8.0
- Source of truth:
- For orders, order service
- For order planning, orchestrator service
- For fulfillment, order fulfillment
- For shopper data, shopper service
- For configuration of operations, operations config service. This includes capacity.
- Capacity is defined in operations config. Availability is computed in the order orchestrator.
- Redis for reservation expiration. Reservations only live in redis while they are not confirmed. Once they are confirmed, they get persisted.
- Glovo's Kafka will replicate all the events received by the Onelabel kafka
Questions
- Go through gateway, or have access to private network?
- We should be able to go inside the private network
- One instance per client, right?
- Not clear.
- This needs to become clearer over time.
- The most probable answer is no: multiple retailers per instance.
- Aurora CDC replicas are possible?
- It can make replicas.
- Timezones, multicurrency management, geographical data
- There are conventions and they are documented.
- For currencies: only local currency.
- For time: everything in UTC.
- For geo: the map projection is
- Is there data that only lives in the databases and doesn't get emitted as events because it is not relevant for other services?
- Yes.
- We might need to modify what gets published as events to prevent having to query the services directly.
- Are you going to NEVER have orders querying each other's databases?
- Rule of thumb is never.
- But exceptions might be done.
- Ledger mentality, mutability of records, CRUD?
- Mutable records
- Event schema registry, database models. Process views.
- They have this XL event catalog that contains everything.
- Unsure about API access.
- What strategy will you have for versioning events and databases.
- Versioning of event schemas
- Versioning of APIs
- Using events vs db vs apis?
- Skewed towards events.
Requests and next steps
- Access to the miro board
- Conventions
- GlovoXL EventCatalog
- Data contracts driven by data
20230608
The duplication of product-catalogue
- Catalogue ID:
5a71ba59bbe9c6000f7fe360 - SKU:
2147483647
But that SKU does not appear in mongodb. Wtf?
Ok, the previous thing was because of the int field size. Now I'm left with 3 bad combinations.
- #1
id_catalogue: 6009b142f27478003e846b6asku: 6
- #2
id_catalogue: 6009b142f27478003e846b6asku: 5
- #3
id_catalogue: 5a1ed1b0f777bd000f7a2ef6sku: 4
I surrender. The ID is going to be a varchar and that's it.
If I run again, I shouldn't get duplicates.
Ok. Now my issue is that checking for uniqueness on the string combo of SKU + catalogueID takes AGES.
I'm hacking to see if we can somehow use hashes to speed up this uniqueness check, because dropping it could be very dangerous. This data comes from partner catalogues, so the chances of receiving shit data are sky high.
Okay. I did the MD5 trick. It works.
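For the record, the "MD5 trick" was roughly this idea (my reconstruction, not the actual ETL code): hash the SKU + catalogue ID combo and check uniqueness on the fixed-length digest instead of the raw string combos.

```python
# Hedged sketch of the MD5 uniqueness check. Column names and the DataFrame source
# are placeholders, not the real pipeline code.
import hashlib

import pandas as pd

df = pd.DataFrame({
    "sku": ["6", "5", "4", "6"],
    "id_catalogue": ["6009b142f27478003e846b6a", "6009b142f27478003e846b6a",
                     "5a1ed1b0f777bd000f7a2ef6", "6009b142f27478003e846b6a"],
})

# Hash the concatenated business key into a short, fixed-length digest.
df["key_md5"] = (
    (df["sku"].astype(str) + "|" + df["id_catalogue"].astype(str))
    .map(lambda key: hashlib.md5(key.encode("utf-8")).hexdigest())
)

# Checking uniqueness on the digest is much cheaper than on the raw string combo.
duplicated = df[df.duplicated("key_md5", keep=False)]
print(duplicated)
```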
Now I still have to deal with some duplicate ids. Review time it is again.
Ok. The agreed procedure is to pick some values at random.
Finally, none of this was necessary once I fixed type issues in the different steps of the ETL.
The order with thousands quantity
id_order: 2409563
sku: 699569
Performance review meeting with Liliana
- H1'23
- Will be done through factorial
- June 12th to 23rd
- I rate João
- I rate other colleagues
- 26-30 June
- Each manager judges their own team members
- Optionally: self-assessment
- 3-7 July
- Calibration: I don't understand this even after the explanation
- Finished in July 24th
- Results will be visible in Factorial
A lot of pep talk on the idea that we might develop ourselves outside of the company if necessary. Is the ship sinking?
Is this going to happen every six months?
20230613
Meeting with Charly
He joined Glovo in February 2022. He lives in Sant Cugat.
Data Engineering Manager at HQ. Previously in Growth/Marketing. A veterinarian by background, he moved towards bioinformatics, then to pure data. He was at Mercado Libre as a senior data manager.
20230623
Catalogue expansion of DW
Should we first:
-
Add MD catalogue to DW (regardless of whether that's done by expanding dim_product or creating dim_catalogue)
- This is useful for name and retailer
-
Or should we include LMPL in fact_orderproduct and dim_product?
- Important to keep the model generic enough
-
Catalogue
- Check fields in Mercadão and LM
- Map them
- Share with them
- Settle for final list and model
-
In MD, the only things that seem to be interesting enough about the catalogue are the name and the retailer id
-
In LM, it seems the uniqueness of products comes from associating them with a specific store. I'm not finding the catalogue abstraction anywhere
We conclude, quick and dirty
20230627
Metabase UAT upgrade
-
Autoscaling group name: awseb-e-qguaiqvtw7-stack-AWSEBAutoScalingGroup-1E973Q3A4RYIR https://www.youtube.com/watch?v=yUXV34RrVmQ
-
Existing EC2 instance in UAT: i-07129f5b2695f6a4b
-
Existing version in UAT: v0.43.3
-
EC2 instance after reboot: i-0682d52f80e19e3bd
-
Version in UAT after reboot: v0.46.5
20230801
Trino in Metabase
https://github.com/starburstdata/metabase-driver https://www.metabase.com/docs/latest/developers-guide/partner-and-community-drivers https://docs.starburst.io/data-consumer/clients/metabase.html
How to access the machine to drop files
- Obtain the UAT SSH key
- Set up putty to open an SSH tunnel to the Metabase machine through the jumphost
- Connect through the tunnel with Putty and/or Filezilla
How to place the Starburst drivers inside the container
- First, download to your laptop the JAR file which is the driver itself. These are in the github repo from Starburst.
- Now, use Filezilla to drop the driver somewhere in the EC2 host. For example, if the driver file is called starburst-3.0.1.metabase-driver.jar, you can drop it so that it lives in /home/ec2-user/starburst-3.0.1.metabase-driver.jar
- Metabase is running as a docker container within the EC2 host.
- The path within the container where the driver file should live is /plugins. So, again, if the driver file is starburst-3.0.1.metabase-driver.jar, it should be placed at /plugins/starburst-3.0.1.metabase-driver.jar.
ISSUE: we need to restart the Metabase container so it picks up the new plugin.
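A hedged sketch of the copy-and-restart step using the Docker SDK for Python (the container name and file names are assumptions; the same can be done with docker cp plus docker restart from the shell):

```python
# Hedged sketch: copy the driver JAR into the running Metabase container and restart it.
import io
import tarfile

import docker

DRIVER_PATH = "/home/ec2-user/starburst-3.0.1.metabase-driver.jar"
CONTAINER_NAME = "metabase"  # hypothetical name; check with `docker ps`

client = docker.from_env()
container = client.containers.get(CONTAINER_NAME)

# put_archive expects a tar stream, so wrap the single JAR file in one.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    tar.add(DRIVER_PATH, arcname="starburst-3.0.1.metabase-driver.jar")
buffer.seek(0)

# Drop the driver into /plugins inside the container, then restart so Metabase loads it.
container.put_archive("/plugins", buffer.getvalue())
container.restart()
```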
Redshift and DMS deprecation
Current state as of 20230609
In the Lolamarket AWS account (251404039695):
- Redshift
- There is an existing redshift cluster
- Name: lola-market-bi
- id: 36d07598-37ab-46a6-9a60-cbb0f231fa7d
- lola-market-bi.c3ircn2vj5i5.eu-west-1.redshift.amazonaws.com:5439/lolamarket
- It is up and running. I thought we had stopped it.
- It is receiving queries on a daily basis. Some of them seem to originate from metabase.
- DMS
- The RS cluster exists as a destination point in the DMS configuration.
- There are two active replication tasks
Plan
- Create Redshift cluster snapshot: https://eu-west-1.console.aws.amazon.com/redshiftv2/home?region=eu-west-1#snapshot-details?snapshot=backup-20230614
- Test recovery
- Raise new cluster from snapshot
- Run a query in both the original cluster and the replica and check that values are identical
- Delete recovery test
- Stop Redshift cluster
- Delete DMS tasks
- Delete DMS sources and destinations
- Delete DMS Replication Instance (https://eu-west-1.console.aws.amazon.com/dms/v2/home?region=eu-west-1#replicationInstanceDetails/rds-mysql-to-redshift-instance)
- Wait some time
- Delete Redshift cluster
User issues with Pinto
-
What happens when a guest user has two possible registered users to match with? https://glovoapp.eu.looker.com/explore/XL-Biz/realtime_pt_orders?qid=bNR5i4o4UgvuTTQEes5PFX&toggle=fil
- I'm going to research in the code to understand how this gets handled.
- Flow 006-md-05-upgrade-guest-users.py is responsible for spotting:
  - Registered users that exist already in DW
  - That have the same email as some guest user that already exists in DW
- Then, DW updates the existing guest user record in DW and assigns the
-
Users that disappear with first created order (red ones in Ana's excel)
- Create story
- Store data in sandbox
- Set up time next week with Ana to review
Sunday runs issues
Current state
Currently, the order flow is scheduled with a lookback strategy. That means that, whenever you start a flow run, you must specify how many days into the past the ETL should go. For instance, running with lookback_days=7 means you will fetch orders from today up to 7 days in the past.
More specifically, this gets used in the following way in a Trino query to get orders from the source:
[... some more SQL code]
from
app_md_mysql.pdo."order" o
INNER JOIN app_md_mysql.pdo.ordergroup og ON o.ordergroupid = og.id
WHERE
og.status IS NOT NULL
AND o.status IS NOT NULL
AND o.updatedat > DATE '2023-07-01' -- <---- This would be the result of a lookback of 7 days.
AND o.updatedat < DATE '2023-07-07' -- <---- This would be today
),
[... some more SQL code]
As you can see in the previous code, what counts is the updatedat field in the Mercadão backend.
Issues with this implementation
There is one caveat to how this implementation works and one corner case which could potentially explain the issues we have seen with orders created on Monday not appearing in DW on Monday.
The caveat is that, with the current implementation, orders updated on the day on which the ETL runs are NOT included in the ETL. This can be seen here:
AND o.updatedat < DATE '2023-07-07' -- <---- This would be today
Simple. If we run on the 7th of July of 2023, orders updated on the 7th of July of 2023 are not included. The most recent ones will be the ones updated on the 6th of July of 2023.
This could be an issue for the following reasons:
- If there is any update to the cart in the night between Sunday and Monday that happens already in Monday hours, then the cart wouldn't be included.
- I am not fully aware of what timezone the Prefect container that runs the flow lives in, and this is very relevant. It could be that we are scheduling the flow at a time when in Portugal it's already Monday, but inside the Prefect machine it's still Sunday. That would leave out all the orders created on Sunday.
Solutions
- The first and obvious solution is to push the schedule deeper into Monday so that there is no doubt that it's running on Monday, even if there might be some hour differences due to timezones. Before, the flow was scheduled at 2AM UTC. Now it's scheduled at 6AM UTC.
- The second solution is to push the filter from today to tomorrow. That ensures that, irrespective of weird hour and timezone games, orders updated today are included in the ETL. This simply requires a small code change that is WIP (see the sketch below).
I am not 100% sure that these issues and caveats are what caused the past problems, since this topic is very hard to debug due to the stateful nature of both the origin sources and DW itself. Nevertheless, I'll apply both solutions and we will observe next week whether the issue is still taking place.
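A small sketch of what the second fix amounts to, assuming the flow computes the filter dates in Python before rendering the query (the parameter and variable names are illustrative, not the real flow code):

```python
# Hedged sketch: compute the lookback window so that orders updated "today" are included.
from datetime import date, timedelta

lookback_days = 7
today = date.today()

lower_bound = today - timedelta(days=lookback_days)
upper_bound = today + timedelta(days=1)  # tomorrow, so today's updates are not cut off

where_clause = (
    f"AND o.updatedat > DATE '{lower_bound.isoformat()}' "
    f"AND o.updatedat < DATE '{upper_bound.isoformat()}'"
)
print(where_clause)
```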
Guest orders get assigned to different registered users
Status log for João
20230627
- Asked Marcos for:
- Access to schema catalog -> Not deployed. Only got pointed to https://github.com/lolamarket/xl-bus-events/tree/main/src/events
- Data model/docs for the order's service database -> Not existing
Call with Víctor
- CBRE
  - Business
    - What has changed since I left? Are transactions still the core?
      - More diversified
      - New strategic consulting line
      - There is no board anymore. Just all the directors together.
    - How have you handled the last few years?
    - How is research doing?
    - Is Carlos Casado still the boss above D&T?
  - D&T
    - Let's draw the map
- The position
  - What do you call it?
  - Which skills are most important
  - Profiles of the people in charge and where they are
  - Clients of the position
    - What data products are built on top of this DWH -> We are as we were. Nothing critical, dashboards and so on, but everything very fluffy.
    - If the cable got pulled on a Monday morning, who would call me that week and why
  - Little things on the roadmap for this position
  - Things about my profile that do or don't fit for you
- Conditions
  - Money
  - Working hours
- Things that bug me
  - Building cool stuff that doesn't get used
  - Office flexibility and having to put on a suit
  - Last-minute crap because some C-level has lost their mind
  - Wasting time with meetings and lack of clarity in the roadmap. Have a plan, stick to it, keep delivering.
- Stack
  - Airflow -> Prefect
  - Data docs -> Amundsen
  - AWS -> OK
  - Query engine?
  - Visualization -> Tableau? Metabase?
  - Data Quality -> Great expectations
  - PostgreSQL -> You lucky bastards, I'm tired of MySQL
  - ELT -> Airbyte
- Others
  - How is BDC 3D doing?
  - Who from the original gang is still there?
  - How is your wife's business doing?
  - I had to leave for you guys to finally build a proper stack, huh
20230805
DP V1 Training
- V1 vs V2:
- V1 depends on the old monolithical redshift
- Kinesis is gone, now it's Confluent Kafka.
- IDP (staging) contains info as raw as possible
- ODP (data marts) contains transformed stuff
(Figure: the actual map of how we will do things.)
Two environments, dev and prod. Dev gets deployed automatically,
We need a monster laptop to be able to deploy DP1 locally.
- How do we coordinate with Data Analysts?
- Looker explores are not part of the DP
- They shouldn't care about DPs for now
- If I make a monster pipeline and smash run it every minute, where are the monster $ bills gonna come from?
- Hello-world weather forecast
DDP Training
- Hardcode table names and columns
- One DAG node or multiple -> One DAG
- Arbitrary python code, like read from a public API and write into a SQL
- Git version
Data Platform Onboarding steps
Github
Onelabel DP checklist
- Be included in glovo's github:
- Follow this guide: https://glovoapp.atlassian.net/wiki/spaces/ITS/pages/3097985253/How+to+Get+Access+to+Github
- In case the glovo people get picky, reference one of these tickets:
- Get added to the "All" team (IT support from Glovo)
- Get added to onelabel data group: https://github.com/orgs/Glovo/teams/onelabel-data
- Get added to this repository (either by Joan or João): https://github.com/Glovo/onelabel-data-mesh
AWS VPN
- General instructions: https://glovoapp.atlassian.net/wiki/spaces/TECH/pages/3468329771/AWS+Client+VPN+Access+Set-up
- If you don't have permission to get the VPN config file listed there, you can get it in our team Google Drive: Data Drive > 90 Useful > 20 vpn config files
- Once you set it up, you can try one of these links to check if things are working. Note that the important thing is that you reach the page, even if you get an access denied message (each of these services has their own permission system, independent of the VPN).
- https://notebooks.g8s-data-platform-prod.glovoint.com/
- https://starburst.g8s-data-platform-prod.glovoint.com/ui/insights/login
- https://datamesh-workflow.g8s-data-platform-dev.glovoint.com/
- https://datamesh-workflow.g8s-data-platform-prod.glovoint.com/
- https://datahub.g8s-data-platform-prod.glovoint.com/
- Known issues from my side:
- If, when you try to start the VPN, you get redirected to a Glovo OneLogin page that says access denied, you are probably missing a role in the Onelogin system. Open an IT ticket to request: "AWS VPN - All"
Call with Joan Heredia
- What does DP offer?
- Expose events and make them consumable?
- Environments
- Permissions
- Contacts
He's on Alena's team. He can point to trainings and accesses. He
20230908
Details about our new office. (Attachment: New Office - Guidelines.pdf)
Onelabel
Waiting for Onelabel team to modify events thingies. We will only have events in prod. Everything will be deployed there. Team was able to do a hello world in production.
20230912
Bitwarden migration
I should check with you if it's fine for me to simply follow the steps described here: https://pdofonte.atlassian.net/wiki/spaces/DM/pages/2569469953/Account+Migration+to+Glovo+-+2023+08
Tip from David Clemente to avoid having to create the account:
I've logged in with the old account, an error showed saying that I needed to leave all organisations in order to accept the glovo invite. I left the mercadao organisation (the only one I had in my account). Then logged in again through the invitation email, and got a message saying "Invite successfully accepted, an admin will need to accept your account" (something like that). Some minutes later I had the glovo organisation in my account. My personal vault was always there throughout the process.
According to what I've read in Slack, there should be an email somewhere in my inbox inviting me to the right Glovo-Bitwarden orgs.
And I should add this info provided to João somewhere in Confluence for the next time an employee joins the team.
For future new user access, and requests for new collections, an automated servicedesk was created, please check how to do it here. In the future, if you have any issue with the Bitwarden access or any other Bitwarden issue, please create a ticket for Security.
20230922
Airbyte
Header
- Deploy Airbyte locally with Terraform
- Deploy Airbyte on AWS manually
- Source an EC2 instance on AWS with Terraform
- Deploy Airbyte on the UAT EC2 instance with Terraform
Deploying Airbyte in UAT manually
I'm going to use a private subnet in UAT.
- VPC: # pdo-uat - vpc-4012372b
- Subnet: pdo-uat-private-1a - subnet-37858a5c
- Instance type:
t2.medium - Keypair:
pdo-uat
The instance I created: https://eu-central-1.console.aws.amazon.com/ec2/home?region=eu-central-1#InstanceDetails:instanceId=i-01be407713b011383
Prepare SSH tunnels and jump into the new instance.
Install docker and docker compose. The docker compose install must be with the modern plugin (as in, docker compose, not docker-compose). See useful link.
Follow the airbyte instructions.
Prepare SSH tunnel for web access.
Modify .env to set credentials.
Define webhooks for alerts.
Useful links
- Airbyte deployment docs: https://docs.airbyte.com/category/deploy-airbyte
- How to update Airbyte: https://docs.airbyte.com/operator-guides/upgrading-airbyte#upgrading-on-docker
- Terraform tutorial: https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli
- Fixing the issue with the local docker example in the Terraform tutorial: https://github.com/kreuzwerker/terraform-provider-docker/issues/44
- Running a shell script on an EC2 machine: https://brad-simonin.medium.com/learning-how-to-execute-a-bash-script-from-terraform-for-aws-b7fe513b6406
- Install Docker and Docker Compose on AMI 2023: https://medium.com/@fredmanre/how-to-configure-docker-docker-compose-in-aws-ec2-amazon-linux-2023-ami-ab4d10b2bcdc
- How to install Docker Compose Plugin (as in, docker compose, not docker-compose): https://stackoverflow.com/a/73680537/8776339
- On how to resize a Linux partition after an EBS volume has been enlarged: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html
Missing stuff
- Finish confluence docs
- Add docs on ssh tunnels
- Schedule session with the team
- Send pre-session material to the team
- Store videos
- Ensure all credentials are in Bitwarden
20230927
Call with Vini
- What the hell do you do? Integrations team. They are centralizing a lot of solutions and making them Infra as code.
Move data from streaming (Kafka connector) to an S3 bucket in an hourly fashion (SLA is 90 minutes).
Files are compacted in parquet. Partitioned by day.
Are there duplications?
New events won't be added automatically. We will have to ask them to ingest it and make it available for us.
Datadog to monitor ingestion. Datahub for schemas and metadata.
- How will we manage breaking changes?
- What flows of info should there be between us?
- Any documentation you can share?
20231018
Hello world in DDP
-
I can't access the FAQ on DDP.
-
Is it possible to install custom packages in our Notebook environment? If so, how?
-
Data discoverability
- Intermediate tables (our)
- Look at Airflow logs
- Datahub
- Sources
- Not really very helpful
- Intermediate tables (our)
-
Git, versioning, rolling back
- Things have changed
- We can use github
-
Multiple tables at the end?
- Yes
-
What is one DP, what are multiple DP, philosophy
- Very flexible
-
Can a DP only generate intermediate tables?
- Yes
-
Can the output of a DP be the input of another DP?
- Yes
-
We basically can combine in anyway we want
-
Going off-piste with DDP
Making the hello world
- Check the demo table on Starburst
- Make a new product
- Simply make a select star and copy it with full refresh mode
- Run
- Check Airflow
- Check resulting table in Starburst
- Experiment with all the refresh modes
- Understand full refresh
- Understand merge
- Understand XXX
- Intermediate data products
- Build a data product that only generates intermediate tables
- Read it from a different data product
- Make a session with Pinto to try to read things from one of the data products
- Delete everything to keep things clean
Counting delivered orders
Goal
Make a table in the public schema in delta with the following schema:
| Column | Description | Type | PK |
|---|---|---|---|
| date_utc | The date | date | X |
| orders_delivered_count | The number of orders delivered in that date | integer |
The table should contain all the history of Onelabel. The table should be refreshed on an hourly basis.
Design
Sources
- Table
"hive"."desert"."orders_v0__com_glovoxl_events_orders_v0_orderdelivered"
Final tables
- Table
"XXX"."YYY"."whatever_something_that_thing__delivered_orders_per_day"
Schedule
30 * * * *
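For reference, this is roughly the aggregation the data product needs, thrown here through the Python Trino client for a quick validation. The connection details and output handling are placeholders and authentication is omitted; only the source table comes from the notes above.

```python
# Hedged sketch: validate the delivered-orders-per-day aggregation via the Trino client.
import trino

conn = trino.dbapi.connect(
    host="starburst.g8s-data-platform-prod.glovoint.com",  # assumed host
    port=443,
    user="my.user",  # placeholder; real access needs authentication
    http_scheme="https",
    catalog="hive",
    schema="desert",
)

query = """
select
    cast(createdat as date) as date_utc,
    count(*) as orders_delivered_count
from hive.desert.orders_v0__com_glovoxl_events_orders_v0_orderdelivered
group by 1
order by 1
"""

cur = conn.cursor()
cur.execute(query)
for date_utc, orders_delivered_count in cur.fetchall():
    print(date_utc, orders_delivered_count)
```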
We had an issue with permissions. The DDP user doesn't have permission to read from the desert schema. Instructions from Joan to fix this:
You can find the DDP user in the Airflow rendered template, in the query_to_table operator, if you search for b_dp_
It's something like this: b_dp_10a8aaa3-252e-4295-a9eb-26c3f3475e06
Insert the user in b_desert_migration_tmp
This is the user I found: b_dp_9e1363ee-4444-4ad2-8c87-15ed05a76e71
Event questions with Marcos
-
What is the id_company field?
Do the events have IDs?
- Not really
-
Picking Location
  - IDs that exist in changed events but not in added
-
Orders
  - Two timestamps: createdAt and the kafka timestamp. What to use?
    - Doubt.
-
Timezones
- Kafka and createdAt are always UTC
-
Currency
- What currency codes are you following?
- https://en.wikipedia.org/wiki/ISO_4217
We get the VPN and more permissions in Confluent Cloud to read events
-
Documentation of lifecycle of orderings and picking
- Not there
-
- Change management flow for versioning, especially breaking changes
- How do we do?
- How will we know when the different versions stop/start?
-
Dirty data in prod. Will you remove it? If not, how do we tell it apart?
- Fresh start at some point
-
Will events be forever?
- Initially yes, let's wait for us to run out of space to consider a change
20231019
003-etl-xl-order - 003-md-01-refresh-orders - Uncontrolled enum change
TLDR: the Mercadão backend is breaching our unwritten data contract by having the value DELIVERED in the delivery_type field of orders.
Context
The flow 003-md-01-refresh-orders has an expectation for the field delivery_type that restricts the valid values to the set {"DELIVERY", "CLICK_AND_COLLECT"}
Issue
In this flow run (https://prefecthq-ui.mercadao.pt/mercadao/flow-run/786b017b-1ccf-49dc-886c-4c25d4b0a137), two records contained the value DELIVERED for the field delivery_type.
The batch of data containing the invalid values is stored in the quarantine table QUARANTINETABLE HERE. The specific records can be spotted with the following query: QUERY
This is a risk for upstream code in looker that might have hardcodes that rely on delivery_type only having {"DELIVERY", "CLICK_AND_COLLECT"} as valid values.
Possible courses of action
- Talk with engineering team to understand what happened and make them change the values.
- Ideally, they go back to only using
{"DELIVERY", "CLICK_AND_COLLECT"}. - If not, we must adapt on our side***.
- Ideally, they go back to only using
- Ignore the tech team altogether and simply adapt on our side***.
***Options to adapt on our side:
- Play with the ETL to turn DELIVERED values into DELIVERY (see the sketch below).
- Add DELIVERED as a valid value for the field delivery_type and adapt Looker and other reports.
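A minimal sketch of the first option, assuming the batch is handled as a pandas DataFrame at some point in the ETL (the function name and column handling are assumptions, not the real pipeline code):

```python
# Hedged sketch: normalize the unexpected enum value before the expectation runs.
import pandas as pd

VALID_DELIVERY_TYPES = {"DELIVERY", "CLICK_AND_COLLECT"}


def normalize_delivery_type(batch: pd.DataFrame) -> pd.DataFrame:
    """Map the rogue DELIVERED value back to DELIVERY, leaving valid values untouched."""
    batch = batch.copy()
    batch["delivery_type"] = batch["delivery_type"].replace({"DELIVERED": "DELIVERY"})

    # Fail loudly if the backend starts sending yet another unexpected value.
    unexpected = set(batch["delivery_type"].unique()) - VALID_DELIVERY_TYPES
    if unexpected:
        raise ValueError(f"Unexpected delivery_type values after mapping: {unexpected}")
    return batch
```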
Aftermath
Someone had messed around with the faulty orders manually in the backoffice. The values were fixed manually, again in the backoffice, and the ETL ran just fine afterwards.
20231025
Data Quality incident on 020_ETL_XL_Products_020_md_01_refresh_product on 20231024
Context
The flow has failed due to not passing the data test.
Only one expectation was violated: the uniqueness of the combination of id_product and id_catalogue.
The quarantine data reveals that all the values have been duplicated several times. Each combination appears four times. This is very, very weird.
Diagnostic
As a first step, I try to simply rerun the flow without any change to see if the duplication persists.
The rerun generates the same problem without the slightest change.
I have checked the repository and it seems João made some changes to the pipeline last week with release 0.4.2.
I'm going to run the suspicious query for versions 0.4.2 and 0.4.1 and compare the output.
After running the transformation step query for both 0.4.2 and 0.4.1, I can conclude that the issue appeared in 0.4.2. The query from 0.4.1 on the same data does not generate duplicate values.
Furthermore, I can see that query 0.4.2 introduced two joins in the query. These new joins are most probably the source of the duplications.
Conclusion: changes from 0.4.2 are responsible for duplicating values in the pipeline.
Courses of action
We can:
- Roll back to 0.4.1 ASAP to keep the product table refreshed, even if the new category_XXX columns that we introduced in release 0.4.2 will remain empty.
- Simultaneously, we should apply a fix and make a new release 0.4.3 that includes the category_XXX columns but does not introduce duplicate values mistakenly.
Links and references
- First flow failure: https://prefecthq-ui.mercadao.pt/mercadao/flow-run/feb5ab1e-cba6-4064-8e4c-bc16e3659e16
- Great expectations validation output with failed expectations: https://s3.console.aws.amazon.com/s3/object/pdo-prod-great-expectations?region=eu-central-1&prefix=validations/020_ETL_XL_Products_020_md_01_refresh_product_suite/020_ETL_XL_Products_020_md_01_refresh_product_suite_checkpoint/20231024T061732.662931Z/3c859ea3c1fa0158eeceadd14d3d5950.json
- Quarantine table: quarantine.020_md_01_refresh_product_20231024_061838
Work log
- Make story about coding fix
- Rollback deployed flow to 0.4.1
- Re-run with flow 0.4.1
Ongoing
DATA-1182
https://pdofonte.atlassian.net/jira/software/c/projects/DATA/boards/6?selectedIssue=DATA-1182
First look
Can I derive all the fields required in the task from the query that João already composed?
| Task Field | Query peer | Comments |
|---|---|---|
| Order ID | id_order | |
| Status | status | |
| Shopper ID | id_shopper | |
| Team / Fleet | NA | We must obtain it through a very messy process from the shopper. I would drop it for now. |
| Date Created | date_order_received | |
| Date Delivered | date_order_delivered | |
| Local Currency | local_currency | |
| Total Amount Ordered | charged_amount | |
| Total Amount Delivered | NA | It's unclear how to obtain this. Must it be derived from adding up all the "Article picked" events, and subtracting all the removed articles? The events around articles are also confusing (difference between stockout and not available? difference between unpicked and removed?) |
| Delivery Time Slot Start | delivery_time_slot_start | |
| Delivery Time Slot End | delivery_time_slot_end | |
| Picking Location ID | id_picking_location |
A couple of them are ambiguous. I'll drop them for now and get ahold of Marcos to clarify how to obtain them.
Design
We are going to make a pipeline that generates the final table in a single step.
This is the query:
select
rec.orderref as id_order,
coalesce(can.status, del.status, oit.status, cod.status, pic.status, allo.status, con.status, rej.status, rec.status) as status,
rec.kafka_record_timestamp as date_order_received,
del.createdat as date_order_delivered,
con.pickinglocation.id as id_picking_location,
allo.shopperid as id_shopper,
coalesce(con.totals.amountcharged.amount, rec.totals.amountcharged.amount) as charged_amount,
coalesce(con.totals.amountcharged.currency, rec.totals.amountcharged.currency) as local_currency,
rec.servicedate.fromdate as delivery_time_slot_start,
    rec.servicedate.todate as delivery_time_slot_end
from
delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderreceived rec
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderconfirmed con on (rec.id = con.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderrejected rej on (rec.id = rej.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderallocated allo on (rec.id = allo.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderinpicking pic on (rec.id = pic.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_ordercheckoutdone cod on (rec.id = cod.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderintransit oit on (rec.id = oit.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderdelivered del on (rec.id = del.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_ordercanceled can on (rec.id = can.id)
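Given the duplicates incident above, before wiring this into the pipeline I would add a quick fan-out check on the same source tables: each event table should contribute at most one row per order id. A minimal sketch for one of the joins (the remaining event tables can be checked the same way):
-- Fan-out check: any order id producing more than one row after this join
-- violates the one-event-per-order assumption behind the query above.
select
    rec.id,
    count(*) as rows_produced
from delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderreceived rec
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderdelivered del
    on (rec.id = del.id)
group by rec.id
having count(*) > 1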
Which means that the
Onelabel ways of working
- Alerts
- FX data
- Team being able to modify pipeline
- Data Quality Expectations
Departure Management
Open fronts list
- lolafect
  - On the usage of lolafect, I think the team is up to date.
  - On the internals of lolafect, that's a different story. I think I'm the only one fully familiar with the internals of the package. Perhaps it would be a good idea to hold a session on it, and perhaps to go for a couple of silly-but-real stories that implement something in the package, just so that you and/or Afonso go through the entire lifecycle of adding something new to lolafect and making a new release with it.
- DDP
  - I generally think we are quite aligned.
  - I will document all the knowledge I have and the ongoing work with the orders pipeline.
  - I would suggest planning some transfer sessions in my last week for whatever WIP I still have.
- AWS
  - I think we are in sync and I don't have any knowledge that the team doesn't. If you think there's any foggy area that I should transfer or document, let me know.
- Prefect
  - I think we are in sync and I don't have any knowledge that the team doesn't. If you think there's any foggy area that I should transfer or document, let me know.
- Airbyte
  - I think we are in sync and I don't have any knowledge that the team doesn't. If you think there's any foggy area that I should transfer or document, let me know.
- Trino
  - I think we are in sync and I don't have any knowledge that the team doesn't. If you think there's any foggy area that I should transfer or document, let me know.
Call with Joan to improve ways of working
- Overcoming notebooks
  - Option 1: share notebooks and just survive
  - Option 2: repository
    - Centralizes development in a repository with a GitHub Action that publishes to production
    - Is it a notebook, or how do you structure the code?
    - One repo for one DDP, or one repo for all DDPs?
  - Option 3: same as option 2, but with an individual repository for onelabel
- Data quality checks
  - Use the data quality checks from Glovo, not GE, not dbt-expectations
  - Docs for that?
  - We will read the tutorial notebooks and come back with questions
  - Control flow? No
- Alerts on Airflow failures
  - https://glovoapp.atlassian.net/servicedesk/customer/portal/26/PHC-11171?created=true
Artifactory instructions setup: https://glovoapp.atlassian.net/wiki/spaces/TECH/pages/1124565144/Configure+access+to+Artifactory+repository cmVmdGtuOjAxOjE3MzA4OTcyNjA6NGtQaTJRbDk3QTFObmxHa25QNWY2NHc2R2x5
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
- Browser hack (Charly)
- DQ for each ODP output
- Add SLO to add_sql_transformation
OIDC Token Refresh bug
- Package version: glovo-data-platform-meshub-client==0.1.68
- Summary: the meshub client fails to correctly store a valid access token when using a refresh token to obtain it. The new access token is obtained correctly and stored in memory, but it never reaches the credentials file. The issue is caused by an exception, jwt.exceptions.ImmatureSignatureError: The token is not yet valid (iat), that gets triggered in the following line of code: 34eb43822f/meshub-backend/service/glovo_data_platform/meshub_client/authentication/oidc_tokens.py (L37).
- Current workaround: copy the new access token from an in-memory variable by debugging with PyCharm and paste it into the credentials file, along with some faked-out access_issued_at and access_expires_at. Obviously, we won't get very far this way.
Tag: Joan, @dp-transformations-greenflag.
This is what my credentials file right now looks like:
[oidc meshub]
refresh_token = eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6Ii1FY0U2TlE0MFJtaUJRQUt6Zm4wYnFheXhZY2JXOXdaaURmZ28wQm52a3cifQ.eyJqdGkiOiJWWTZlZFFUZV9pTU9mRWhzM1d0cDciLCJzdWIiOiI5MjQ0MjU4NSIsImlzcyI6Imh0dHBzOi8vZ2xvdm9hcHAub25lbG9naW4uY29tL29pZGMvMiIsImlhdCI6MTY5OTQ1NDQ0MiwiZXhwIjoxNzAyMDQ2NDQyLCJzY29wZSI6Im9wZW5pZCIsImF1ZCI6ImNjOTc5NmIwLTc3ZDgtMDEzYi1jOThmLTA2NThkMDRkMjM2NjM3ODE1In0.hpp8bKfSSBpivMVl3zwwPXeDtGzOrPETAI-HRsy-hsgVqG13eahdw8MAHgDKNUdXQ-l01uqGG90RiYXn3CCU8b5Bx3QEh90FMQvrzAOJXWZufSVhR9WNKwvmh7lr568Xxg__3Ux6JVau8Qo7PH7KCcPQTNbrf9aV2v3rSSczkNMgKKUO5GN8w9UYFs1vN6DX8olIE8voVbDhWEuidMRhl8EZWDJG2rRiY3EvLlAl3QFbQZZdGTbxd6o7tyH_DEPDyIQ0Mhk5CK3qGDEx7w5ySSwoVC_uxI_BcC1cAtha2klL0Dz4OT06d_5DIRLCHLqrGjGuM75yXBc6rOaiLUus_g
refresh_issued_at = 2023-11-08T15:40:42
refresh_expires_at = 2023-12-08T15:40:42
This is what the glovo code has read when running:
Docs
- Clone repos
- Clone personal notes
- Send messages to
  - ~~Slack general~~
  - Maria
  - Dani
  - Team
  - Charly
Future
There is no future anymore.