20220711
Today was my first day at XL. I started the day by briefly chatting with Dani. He is as energetic as always and was already trying to get my hands dirty with stuff. It will be fun to work with him, but I need to make sure he doesn't kidnap me completely.
I also had a brief call with Maria Gorostiza. She was trying to reach me by phone, but I called her through Slack. It felt a bit weird that she was trying to get hold of me through my personal phone. I must confess that the vibes I get from the Portugal-based team are very different from the ones I get from the Spain-based team, and this is an example of that. Anyway, she just told me that I will need to put my working hours in Factorial and also that I should request my holidays through there. She also asked if I could put a real picture of myself in Slack. No trouble from my side, but getting asked to do it felt quite childish, to be honest.
The rest of the day I mostly spent with João discussing the organization and the business. There are a lot of details and nuances flying all over the place, but I'm old enough now to know that it's ok if I don't retain many of them. The important things will inevitably find their place in my brain.
I'm very happy with how things look. The service is interesting. It has a lot of data and decisions that need to be made, so Data work will definitely be fun. The new business opening in Poland sounds very exciting and feels like a greenfield opportunity. And the team is also looking great. The culture is very different from ACN. More laid back. Less pompous. No fancy words. Strong common sense.
It's very early to tell, but I think I've made a great choice coming over here.
We also discussed with the team that we could maybe meet in Porto in September. It would be great to fly there, spend a few days with the team, and chill around Porto in the afternoons.
20220712
Ana Pinto
- 10 years of experience as a Data Analyst
- Started in insurance for a year; did not like the business
- Joined Continente for 5 years as part of the loyalty program
- Moved to Talkdesk for 2 years
Infra with Dani
- He is playing around with Prefect and Trino
- Redshift will soon be dropped
20220713
Ana Januario
- Math; Continente; Topdesk
- Ad hoc analysis and Looker
- Will be working on moving Spanish data from Tableau to Looker
How to create a user in AWS:
- Go to IAM
- Select AWS credential type: Password
- Assign to the data-team user group
- Skip the tags section
- Hit the send email option to contact the user
User name,Password,Access key ID,Secret access key,Console login link
pablo.martin,EP}A}_WlF2mTV-t,,,https://fonte.signin.aws.amazon.com/console
Access key ID,Secret access key
AKIAVNZZGXD4KNPQPMH5,0j02JWq9mnQF3d2G9A600dlgt70PDAiiguJizVfD
You have been signed up for a Meraki account. You are now authorized to use 514195843 @ FONTE - NEGÓCIOS ONLINE @ 156654047.
Here is your login information:
Email address: pablo.martin@lolamarket.com
Password: 4bCphthn
You can manage your account at https://account.network-auth.com/.
Data modeling with Dani
- Orders model where we integrate both Mercadao and Lola https://pdofonte.atlassian.net/wiki/spaces/DATA/pages/2301067280/Orders
- snake_case, as in Python
Shipment from María:
- Mouse
- Keyboard
- Backpack
- Card
- Mobile app and watch the video
Medical check-up, PRL guide, PRL course (PRL = occupational risk prevention)
Query challenge
- Total order groups served in LolaMarket in 2021
- Dani showed me that order groups are not really an entity in Lola. Instead, there is a field called "tag" that appears in carts. Multiple carts with the same "tag" form a complete order group.
- I learned with Dani that the cart table also contains all the failed and cancelled carts. Those carts have no tag.
- ANSWER: 148975 (a query sketch is below, after this list)
- Total orders (visits to supermarket) made in Mercadao in June 2022
- AOV for orders made in Lolamarket, in May 2022, by shop brand
- Largest basket ever sold in Mercadao
- Smallest basket ever sold in LolaMarket
- Look for my first LolaMarket purchase and understand it in full detail
- Find the customer that spent the most in Mercadao in 2021
- See if there is any user (identified by email) that appears in both LolaMarket and Mercadao
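A sketch of how I'd compute the first challenge with what Dani explained (hedged: the catalog and column names are taken from the cart table as it appears in the Trino plans later in these notes, and treating status = 'delivered' as "served" is my assumption):
-- Order groups served in 2021: distinct non-null tags on delivered carts.
-- date_delivered is a unix timestamp in seconds, interpreted in Europe/Madrid.
select count(distinct c.tag) as order_groups_2021
from app_lm_mysql.comprea.cart c
where c.status = 'delivered'
  and c.tag is not null
  and year(from_unixtime(cast(c.date_delivered as bigint), 'Europe/Madrid')) = 2021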
Dani's user id -> 3799702
20220714
Tour through Mercadao tables with João (a quick sanity-check query is sketched after this list)
- chargedamount -> what the customer pays (originaltotalprice * 1.03)
- originaltotalprice -> original value of the goods (with discounts applied)
- totalprice -> final cost of the picked goods
- totaldiscount -> discount of goods (from the retailer)
- totalwshipping -> totalprice + delivery fee
- subtotal -> should be originaltotalprice + totaldiscount (but it is flawed; ignore it)
- status might be null (user reaches checkout payment screen but doesn't pay)
- For status flow, review confluence ->
- PickingLocationID -> suggested picking location
- We have both the original and final delivery slot
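A sanity check I could run over these fields once I confirm which Mercadao table they live on (md_orders below is a placeholder name, not a real table; the 3% factor and the subtotal formula are the ones João described above):
-- md_orders is a placeholder; swap in the real Mercadao order table.
select
    count(*) as orders_checked,
    sum(case when abs(chargedamount - originaltotalprice * 1.03) > 0.01 then 1 else 0 end) as markup_mismatches,
    sum(case when abs(subtotal - (originaltotalprice + totaldiscount)) > 0.01 then 1 else 0 end) as flawed_subtotals
from md_orders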
20220719
With María:
- Confirmation messages will arrive one hour before each of the two slots. I have to accept them or the slots will be taken away from me.
Meeting marathon
Projects
- Project manager across the company
- Working mostly on Mercadao
- Projects
- With Glovo
- Mercadao does the picking
- Glovo riders do the delivery
- BOOM
- B2B Horeca project
- With Recheio
- Larger baskets
- https://express.recheio.pt/
- Picking is very different. Vans are needed instead of cars.
- With Glovo
Tech
- Missing
Finance
- "Congratulations for joining João's team."
- Luis Cabral -> Head of Finance for XL
- Helps the head of each local finance do their work properly
- Bridges financial comms to Glovo-DH
- Each country has their own finance team
What systems do you have for:
- Accounting:
  - Sage in Spain
  - Sage in Portugal
  - Forced to move to Glovo's SAP HANA (starting on 2023-01-01)
- Reporting / P&L:
  - They use Looker, Metabase and a few Excel sheets
Ops
- Manages the Portuguese Quality Assurance, operations, availability and onboarding teams
- Starting to assume the Spanish operations
- She is clearly very enthusiastic
CX
- Customer support and LiveOps
- Jaime
- Started a year ago
- Runs customer support
- Is absorbing LiveOps (the team that helps the shoppers)
- New CRM, Freshdesk, for Portugal and Spain
- 3 supervisors and 8 agents
- They mediate between customers and shoppers
- Data
- App ratings, surveys?
- KPI
- Reduction of the shopper contact rate
Commercial
- Negotiate campaigns of the banners in the Marketplace
- Only food. Currently exploring other services
- Selling reporting to brands (cross-selling topics, AOV per brand, frequency and repetition). Currently only banners and clicks (CTR, click-through rate)
- Campaign impact on sales deltas for different products
- Working with Looker. Needs a bit of help with exporting data
- Runs Iberia (Carlota and Gloria work for him in Spain, Ines)
- For Poland: let's run some large operations in there before selling our services to brands there.
Marketing
20220720
Shopper notes
Today I went out as a shopper for the first time. I took a few notes during the day and also have a few ideas in mind:
- In the morning, the order proposal appeared on my screen. I hit accept as the order seemed suitable, but then the app said that another shopper had taken it in the meantime (makes sense, since I was driving and didn't have my phone handy). But then, when I went to my orders page, I could see that order in my schedule, so it seems it was assigned to me regardless of that warning message. I noticed this, so no problem, but this could lead to a shopper getting an order assigned without realizing it.
- When I was in Mercadona doing my first order, one item from the shopping list appeared in the Frozen section within the app. That item was not actually in the frozen section of the supermarket but rather in a normal open fridge, which was quite frustrating since I was trying to find it in the freezers. Funnily enough, there was another product on the shopping list that actually was in the frozen section of the supermarket but did not appear in the frozen section of the app's list. I felt a bit like the shopper app was playing with me.
- During my first order, there was this lemonade product that had a content of 2L. In the app, if you opened up the picture, you could also see it had 2L of volume. But then, in the app metadata, it showed the product had 1.5L. I was a bit confused here: should I get 5 bottles because the customer asked for 5 units? Or should I change the number of bottles given that the customer shopped while thinking that each bottle was 1.5L? Or wait, what did the customer follow: the metadata that said 1.5L or the picture of the product that said 2L? In the end, I got 4 bottles instead of 5.
- First delivery: shopping took way less time than the app suggested. This, on top of me being in the supermarket early to compensate for my expected clumsiness, meant I was ready to pay at the POS around 20 minutes earlier than planned. I didn't want to be way too early at the customer's location, but I also didn't want to wait near there since the weather is super hot and frozen and cold products would get awful. I decided to wait... inside the supermarket (without picking the cold products yet). Pretty sure some Mercadona folks were scratching their heads :). Once the time passed, I went ahead, paid and drove to the customer location.
- My first two orders had a pretty nasty overlap. Order #1 had delivery scheduled at something like 11:20, while Order #2 instructed me to start picking at 11:00 in a supermarket which was about 20min away from Order #1's customer location. I managed because I was able to finish Order #1 much earlier than planned. Otherwise, customer #2 would have had a hefty delay, for sure outside of their chosen timeslot.
- This is a rather lengthy one, and the biggest issue I faced with the app. It could have easily turned into a super late delivery for the customer. It happened in order #3. Step by step:
- I looked at the details of order #3 and saw that the picking location was a Lidl store. I clicked on the supermarket icon to open Google Maps and see the location.
- The shopper app transitioned into Google Maps and provided me with the location. The problem? There is no Lidl supermarket there. I knew because I'm very familiar with the area where the app was indicating the Lidl was. But again, I knew there was no Lidl there. I went back to the shopper app and checked the address. The street name was right. I was scratching my head for a couple of minutes before I realised what was going on. The address was Carrer Apel·les Mestres, 109. The thing is, there is a Carrer Apel·les Mestres in Barcelona, but there is also another one with the exact same name in El Prat de Llobregat (a nearby town to BCN, kind of like Madrid-Pozuelo or Porto-Matosinhos). I realised because the postcode was not a Barcelona city postcode, and then I found the Lidl that does exist in Carrer Apel·les Mestres in El Prat.
- I checked the customer location and realised there was a large Lidl store about 500m away from there, so I decided to go there. Checked beforehand around 10 times that the store was actually there, because by this time I was both a bit confused and a bit skeptical.
- In the end everything went just fine. But, had I not been familiar with the streets, I would not have spotted the issue straight away and would probably have suffered a +30min delay.
- I was also a bit puzzled by the fact that the shopper app didn't suggest the Lidl location right next to the customer location.
- The screenshots at the end show the exact address and what I was being shown in Google Maps (aka the false Lidl location)
- Trying to send pictures to the client through the chat was a pain in the a** and I'm not even sure that it worked in the end. Issues:
- I would hit the camera icon, take the picture, and then return to the chat. No trace of it.
- After a few tries, the picture "appeared". There was an empty white box in the chat, which I guess should be the picture. But again, on my screen it only appeared as an empty box.
- A suggestion: it would be nice if multiple pictures could be taken at once (instead of: picture -> chat -> picture -> chat, etc.). I was trying to give the customer 3 alternatives to a stocked-out product, so taking the three pictures in one go would have been more convenient.
- My Android device is configured in English. The whole shopper app appears in English, which I didn't mind. What I did mind was that the auto-suggested chat messages to the customer appeared in English. This doesn't make much sense to me, since I would assume that our default stance should be to either address the user in Spanish or, if the user can somehow indicate their preferred language, in whatever the user has indicated.
A few additional details, a mix of fun and useful:
- The customer for order #1 was this very nice woman who was using Lola for the first time. She told me she was undergoing chemo and felt pretty sick all the time, so it was super convenient for her that we brought the groceries to her place. Hearing her story was touching.
- I would advise against having the empty Lola backpack on your back if you go on the highway faster than 100 km/h on a motorbike. The box acts kind of funny and pulls your arms in weird ways.
- Paper bags at Lidl are quite crappy. I was very much afraid that some of them would break during the delivery.
- Silly thing you don't even think about before starting as a shopper: the euro coin for the supermarket cart! I was lucky enough to have one, but in order #2, with all the rush, I forgot to recover it. The guys at the supermarket for order #3 were nice enough to unlock a cart for me because I had no coins by then.
- The barcode reader fails quite a bit (by failing I mean it reads the barcode and tells you it's the wrong product, even when you can very clearly see from the product description and image that you are picking the right one).
- Entering a supermarket you have never been in with a very long list of products creates this peculiar initial paralysis. It left me wondering which option would be better:
- Current way: we show the entire list to the shopper, the shopper goes around picking in whatever order he wants.
- My idea: we show the shopper items by small batches which are all related to the same product category. For instance, if the customer has chosen a few yogurt and cheese products, we show only those, and hence the shopper can focus on getting that done. Then the next batch could be beers and wines, the next one cleaning products, etc. This could also be used to leave cold and frozen products for the end, ensuring the shopper picks those after all the other ones.
20220721
Tech
I have the onboarding meeting with Berlana that we had to postpone.
Questions:
- How are you organized?
- How can we keep up with releases and changes?
- Who is who?
- How is the whole instaleap thing working out?
Query thingy
- Get access directly to Lola MySQL to compare performance.
- Keep on researching
- Document how to connect to the lolamarket MySQL -> Noted
- Document how to get the Glovo free-benefits ->
- Send the equipment paperwork to Maria
- Get into the Notion thing and document it
My shopper ID is 8025
20220722
- Check that fetchall is actually fetching all.
- Understand
20220725
I'm going to compare in detail the execution of the same query in both engines and see what I can get out of it. The query I'm using is Orders finished yesterday with less than 10 items.
The execution ID I'm looking into in Trino is this: 20220725_080932_00086_yu7k5
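From the plan below, the query seems to have this rough shape (my reconstruction, not the literal SQL I ran; the aliases, the timezone handling and the LIMIT 100 are read off the plan):
-- Carts delivered on 2022-07-24 (Europe/Madrid) holding fewer than 10 items in total,
-- joined back to the cart table to bring out all of its columns.
select agg.item_count, c.*
from (
    select cp.cart_id, sum(cp.quantity) as item_count
    from app_lm_mysql.comprea.cart_product cp
    inner join app_lm_mysql.comprea.cart cf on cf.id = cp.cart_id
    where cf.status = 'delivered'
      and cast(from_unixtime(cast(cf.date_delivered as bigint), 'Europe/Madrid') as date) = date '2022-07-24'
    group by cp.cart_id
    having sum(cp.quantity) < 10
    limit 100
) agg
left join app_lm_mysql.comprea.cart c on c.id = agg.cart_id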
For Trino, this is the output of the EXPLAIN:
Fragment 0 [SINGLE]
Output layout: [sum, id_105, cash_order_id_126, shopper_timeslot_id_80, shop_id_79, address_id_78, total_cost_92, total_price_117, total_price_discount_140, delivery_price_141, delivery_price_discount_112, delivery_type_95, status_108, modified_76, last_update_110, date_shopping_119, date_delivering_125, date_delivered_133, note_127, date_started_134, weight_99, last_overweight_notification_121, shopper_total_price_123, shopper_total_cost_135, margin_98, shopper_weight_106, comprea_note_83, next_shopper_timeslot_id_120, shopper_algorithm_124, loyalty_tip_131, date_loyalty_tip_115, ebitda_104, numcalls_84, manual_charge_111, commission_100, last_no_times_available_109, comprea_note_driver_128, num_user_changes_130, expected_shopping_time_139, expected_delivering_time_97, date_deliveries_requested_81, date_call_93, percentage_opened_hours_82, percentage_opened_hours_in_day_113, fraud_rating_114, tag_94, tag_color_90, expected_eta_time_102, promo_hours_137, numcallsclient_77, frozen_products_103, driver_timeslot_id_107, expected_delivering_distance_118, lola_id_136, comprea_note_warehouse_86, uuid_88, date_almost_delivered_132, date_ticket_pending_96, date_waiting_driver_138, num_products_116, num_products_taken_91, total_saving_89, percentage_opened_hours_in_87, cart_progress_branch_url_122, real_shop_id_129, gps_locked_85, tag_signature_101]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
Output[item_count, id, cash_order_id, shopper_timeslot_id, shop_id, address_id, total_cost, total_price, total_price_discount, delivery_price, delivery_price_discount, delivery_type, status, modified, last_update, date_shopping, date_delivering, date_delivered, note, date_started, weight, last_overweight_notification, shopper_total_price, shopper_total_cost, margin, shopper_weight, comprea_note, next_shopper_timeslot_id, shopper_algorithm, loyalty_tip, date_loyalty_tip, ebitda, numcalls, manual_charge, commission, last_no_times_available, comprea_note_driver, num_user_changes, expected_shopping_time, expected_delivering_time, date_deliveries_requested, date_call, percentage_opened_hours, percentage_opened_hours_in_day, fraud_rating, tag, tag_color, expected_eta_time, promo_hours, numcallsclient, frozen_products, driver_timeslot_id, expected_delivering_distance, lola_id, comprea_note_warehouse, uuid, date_almost_delivered, date_ticket_pending, date_waiting_driver, num_products, num_products_taken, total_saving, percentage_opened_hours_in_10, cart_progress_branch_url, real_shop_id, gps_locked, tag_signature]
│ Layout: [sum:bigint, id_105:integer, cash_order_id_126:integer, shopper_timeslot_id_80:integer, shop_id_79:integer, address_id_78:integer, total_cost_92:decimal(8,2), total_price_117:decimal(8,2), total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2), delivery_price_discount_112:decimal(8,2), delivery_type_95:char(9), status_108:char(14), modified_76:tinyint, last_update_110:integer, date_shopping_119:integer, date_delivering_125:integer, date_delivered_133:integer, note_127:varchar, date_started_134:integer, weight_99:decimal(6,3), last_overweight_notification_121:integer, shopper_total_price_123:decimal(8,2), shopper_total_cost_135:decimal(8,2), margin_98:decimal(8,2), shopper_weight_106:decimal(6,3), comprea_note_83:varchar, next_shopper_timeslot_id_120:integer, shopper_algorithm_124:varchar, loyalty_tip_131:decimal(8,2), date_loyalty_tip_115:integer, ebitda_104:decimal(8,2), numcalls_84:integer, manual_charge_111:tinyint, commission_100:integer, last_no_times_available_109:integer, comprea_note_driver_128:varchar, num_user_changes_130:integer, expected_shopping_time_139:integer, expected_delivering_time_97:integer, date_deliveries_requested_81:integer, date_call_93:integer, percentage_opened_hours_82:decimal(5,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, tag_94:varchar(20), tag_color_90:varchar(15), expected_eta_time_102:integer, promo_hours_137:integer, numcallsclient_77:integer, frozen_products_103:tinyint, driver_timeslot_id_107:integer, expected_delivering_distance_118:integer, lola_id_136:varchar(50), comprea_note_warehouse_86:varchar, uuid_88:varchar(10), date_almost_delivered_132:integer, date_ticket_pending_96:integer, date_waiting_driver_138:integer, num_products_116:integer, num_products_taken_91:integer, total_saving_89:decimal(8,2), percentage_opened_hours_in_87:decimal(5,2), cart_progress_branch_url_122:varchar(255), real_shop_id_129:integer, gps_locked_85:tinyint, tag_signature_101:varchar(4)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ item_count := sum
│ id := id_105
│ cash_order_id := cash_order_id_126
│ shopper_timeslot_id := shopper_timeslot_id_80
│ shop_id := shop_id_79
│ address_id := address_id_78
│ total_cost := total_cost_92
│ total_price := total_price_117
│ total_price_discount := total_price_discount_140
│ delivery_price := delivery_price_141
│ delivery_price_discount := delivery_price_discount_112
│ delivery_type := delivery_type_95
│ status := status_108
│ modified := modified_76
│ last_update := last_update_110
│ date_shopping := date_shopping_119
│ date_delivering := date_delivering_125
│ date_delivered := date_delivered_133
│ note := note_127
│ date_started := date_started_134
│ weight := weight_99
│ last_overweight_notification := last_overweight_notification_121
│ shopper_total_price := shopper_total_price_123
│ shopper_total_cost := shopper_total_cost_135
│ margin := margin_98
│ shopper_weight := shopper_weight_106
│ comprea_note := comprea_note_83
│ next_shopper_timeslot_id := next_shopper_timeslot_id_120
│ shopper_algorithm := shopper_algorithm_124
│ loyalty_tip := loyalty_tip_131
│ date_loyalty_tip := date_loyalty_tip_115
│ ebitda := ebitda_104
│ numcalls := numcalls_84
│ manual_charge := manual_charge_111
│ commission := commission_100
│ last_no_times_available := last_no_times_available_109
│ comprea_note_driver := comprea_note_driver_128
│ num_user_changes := num_user_changes_130
│ expected_shopping_time := expected_shopping_time_139
│ expected_delivering_time := expected_delivering_time_97
│ date_deliveries_requested := date_deliveries_requested_81
│ date_call := date_call_93
│ percentage_opened_hours := percentage_opened_hours_82
│ percentage_opened_hours_in_day := percentage_opened_hours_in_day_113
│ fraud_rating := fraud_rating_114
│ tag := tag_94
│ tag_color := tag_color_90
│ expected_eta_time := expected_eta_time_102
│ promo_hours := promo_hours_137
│ numcallsclient := numcallsclient_77
│ frozen_products := frozen_products_103
│ driver_timeslot_id := driver_timeslot_id_107
│ expected_delivering_distance := expected_delivering_distance_118
│ lola_id := lola_id_136
│ comprea_note_warehouse := comprea_note_warehouse_86
│ uuid := uuid_88
│ date_almost_delivered := date_almost_delivered_132
│ date_ticket_pending := date_ticket_pending_96
│ date_waiting_driver := date_waiting_driver_138
│ num_products := num_products_116
│ num_products_taken := num_products_taken_91
│ total_saving := total_saving_89
│ percentage_opened_hours_in_10 := percentage_opened_hours_in_87
│ cart_progress_branch_url := cart_progress_branch_url_122
│ real_shop_id := real_shop_id_129
│ gps_locked := gps_locked_85
│ tag_signature := tag_signature_101
└─ RemoteSource[1]
Layout: [sum:bigint, id_105:integer, cash_order_id_126:integer, shopper_timeslot_id_80:integer, shop_id_79:integer, address_id_78:integer, total_cost_92:decimal(8,2), total_price_117:decimal(8,2), total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2), delivery_price_discount_112:decimal(8,2), delivery_type_95:char(9), status_108:char(14), modified_76:tinyint, last_update_110:integer, date_shopping_119:integer, date_delivering_125:integer, date_delivered_133:integer, note_127:varchar, date_started_134:integer, weight_99:decimal(6,3), last_overweight_notification_121:integer, shopper_total_price_123:decimal(8,2), shopper_total_cost_135:decimal(8,2), margin_98:decimal(8,2), shopper_weight_106:decimal(6,3), comprea_note_83:varchar, next_shopper_timeslot_id_120:integer, shopper_algorithm_124:varchar, loyalty_tip_131:decimal(8,2), date_loyalty_tip_115:integer, ebitda_104:decimal(8,2), numcalls_84:integer, manual_charge_111:tinyint, commission_100:integer, last_no_times_available_109:integer, comprea_note_driver_128:varchar, num_user_changes_130:integer, expected_shopping_time_139:integer, expected_delivering_time_97:integer, date_deliveries_requested_81:integer, date_call_93:integer, percentage_opened_hours_82:decimal(5,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, tag_94:varchar(20), tag_color_90:varchar(15), expected_eta_time_102:integer, promo_hours_137:integer, numcallsclient_77:integer, frozen_products_103:tinyint, driver_timeslot_id_107:integer, expected_delivering_distance_118:integer, lola_id_136:varchar(50), comprea_note_warehouse_86:varchar, uuid_88:varchar(10), date_almost_delivered_132:integer, date_ticket_pending_96:integer, date_waiting_driver_138:integer, num_products_116:integer, num_products_taken_91:integer, total_saving_89:decimal(8,2), percentage_opened_hours_in_87:decimal(5,2), cart_progress_branch_url_122:varchar(255), real_shop_id_129:integer, gps_locked_85:tinyint, tag_signature_101:varchar(4)]
Fragment 1 [HASH]
Output layout: [sum, id_105, cash_order_id_126, shopper_timeslot_id_80, shop_id_79, address_id_78, total_cost_92, total_price_117, total_price_discount_140, delivery_price_141, delivery_price_discount_112, delivery_type_95, status_108, modified_76, last_update_110, date_shopping_119, date_delivering_125, date_delivered_133, note_127, date_started_134, weight_99, last_overweight_notification_121, shopper_total_price_123, shopper_total_cost_135, margin_98, shopper_weight_106, comprea_note_83, next_shopper_timeslot_id_120, shopper_algorithm_124, loyalty_tip_131, date_loyalty_tip_115, ebitda_104, numcalls_84, manual_charge_111, commission_100, last_no_times_available_109, comprea_note_driver_128, num_user_changes_130, expected_shopping_time_139, expected_delivering_time_97, date_deliveries_requested_81, date_call_93, percentage_opened_hours_82, percentage_opened_hours_in_day_113, fraud_rating_114, tag_94, tag_color_90, expected_eta_time_102, promo_hours_137, numcallsclient_77, frozen_products_103, driver_timeslot_id_107, expected_delivering_distance_118, lola_id_136, comprea_note_warehouse_86, uuid_88, date_almost_delivered_132, date_ticket_pending_96, date_waiting_driver_138, num_products_116, num_products_taken_91, total_saving_89, percentage_opened_hours_in_87, cart_progress_branch_url_122, real_shop_id_129, gps_locked_85, tag_signature_101]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
LeftJoin[("cart_id" = "id_105")][$hashvalue, $hashvalue_151]
│ Layout: [sum:bigint, modified_76:tinyint, numcallsclient_77:integer, address_id_78:integer, shop_id_79:integer, shopper_timeslot_id_80:integer, date_deliveries_requested_81:integer, percentage_opened_hours_82:decimal(5,2), comprea_note_83:varchar, numcalls_84:integer, gps_locked_85:tinyint, comprea_note_warehouse_86:varchar, percentage_opened_hours_in_87:decimal(5,2), uuid_88:varchar(10), total_saving_89:decimal(8,2), tag_color_90:varchar(15), num_products_taken_91:integer, total_cost_92:decimal(8,2), date_call_93:integer, tag_94:varchar(20), delivery_type_95:char(9), date_ticket_pending_96:integer, expected_delivering_time_97:integer, margin_98:decimal(8,2), weight_99:decimal(6,3), commission_100:integer, tag_signature_101:varchar(4), expected_eta_time_102:integer, frozen_products_103:tinyint, ebitda_104:decimal(8,2), id_105:integer, shopper_weight_106:decimal(6,3), driver_timeslot_id_107:integer, status_108:char(14), last_no_times_available_109:integer, last_update_110:integer, manual_charge_111:tinyint, delivery_price_discount_112:decimal(8,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, date_loyalty_tip_115:integer, num_products_116:integer, total_price_117:decimal(8,2), expected_delivering_distance_118:integer, date_shopping_119:integer, next_shopper_timeslot_id_120:integer, last_overweight_notification_121:integer, cart_progress_branch_url_122:varchar(255), shopper_total_price_123:decimal(8,2), shopper_algorithm_124:varchar, date_delivering_125:integer, cash_order_id_126:integer, note_127:varchar, comprea_note_driver_128:varchar, real_shop_id_129:integer, num_user_changes_130:integer, loyalty_tip_131:decimal(8,2), date_almost_delivered_132:integer, date_delivered_133:integer, date_started_134:integer, shopper_total_cost_135:decimal(8,2), lola_id_136:varchar(50), promo_hours_137:integer, date_waiting_driver_138:integer, expected_shopping_time_139:integer, total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ Distribution: PARTITIONED
├─ RemoteSource[2]
│ Layout: [cart_id:integer, sum:bigint, $hashvalue:bigint]
└─ LocalExchange[HASH][$hashvalue_151] ("id_105")
│ Layout: [modified_76:tinyint, numcallsclient_77:integer, address_id_78:integer, shop_id_79:integer, shopper_timeslot_id_80:integer, date_deliveries_requested_81:integer, percentage_opened_hours_82:decimal(5,2), comprea_note_83:varchar, numcalls_84:integer, gps_locked_85:tinyint, comprea_note_warehouse_86:varchar, percentage_opened_hours_in_87:decimal(5,2), uuid_88:varchar(10), total_saving_89:decimal(8,2), tag_color_90:varchar(15), num_products_taken_91:integer, total_cost_92:decimal(8,2), date_call_93:integer, tag_94:varchar(20), delivery_type_95:char(9), date_ticket_pending_96:integer, expected_delivering_time_97:integer, margin_98:decimal(8,2), weight_99:decimal(6,3), commission_100:integer, tag_signature_101:varchar(4), expected_eta_time_102:integer, frozen_products_103:tinyint, ebitda_104:decimal(8,2), id_105:integer, shopper_weight_106:decimal(6,3), driver_timeslot_id_107:integer, status_108:char(14), last_no_times_available_109:integer, last_update_110:integer, manual_charge_111:tinyint, delivery_price_discount_112:decimal(8,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, date_loyalty_tip_115:integer, num_products_116:integer, total_price_117:decimal(8,2), expected_delivering_distance_118:integer, date_shopping_119:integer, next_shopper_timeslot_id_120:integer, last_overweight_notification_121:integer, cart_progress_branch_url_122:varchar(255), shopper_total_price_123:decimal(8,2), shopper_algorithm_124:varchar, date_delivering_125:integer, cash_order_id_126:integer, note_127:varchar, comprea_note_driver_128:varchar, real_shop_id_129:integer, num_user_changes_130:integer, loyalty_tip_131:decimal(8,2), date_almost_delivered_132:integer, date_delivered_133:integer, date_started_134:integer, shopper_total_cost_135:decimal(8,2), lola_id_136:varchar(50), promo_hours_137:integer, date_waiting_driver_138:integer, expected_shopping_time_139:integer, total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2), $hashvalue_151:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: ?}
└─ RemoteSource[6]
Layout: [modified_76:tinyint, numcallsclient_77:integer, address_id_78:integer, shop_id_79:integer, shopper_timeslot_id_80:integer, date_deliveries_requested_81:integer, percentage_opened_hours_82:decimal(5,2), comprea_note_83:varchar, numcalls_84:integer, gps_locked_85:tinyint, comprea_note_warehouse_86:varchar, percentage_opened_hours_in_87:decimal(5,2), uuid_88:varchar(10), total_saving_89:decimal(8,2), tag_color_90:varchar(15), num_products_taken_91:integer, total_cost_92:decimal(8,2), date_call_93:integer, tag_94:varchar(20), delivery_type_95:char(9), date_ticket_pending_96:integer, expected_delivering_time_97:integer, margin_98:decimal(8,2), weight_99:decimal(6,3), commission_100:integer, tag_signature_101:varchar(4), expected_eta_time_102:integer, frozen_products_103:tinyint, ebitda_104:decimal(8,2), id_105:integer, shopper_weight_106:decimal(6,3), driver_timeslot_id_107:integer, status_108:char(14), last_no_times_available_109:integer, last_update_110:integer, manual_charge_111:tinyint, delivery_price_discount_112:decimal(8,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, date_loyalty_tip_115:integer, num_products_116:integer, total_price_117:decimal(8,2), expected_delivering_distance_118:integer, date_shopping_119:integer, next_shopper_timeslot_id_120:integer, last_overweight_notification_121:integer, cart_progress_branch_url_122:varchar(255), shopper_total_price_123:decimal(8,2), shopper_algorithm_124:varchar, date_delivering_125:integer, cash_order_id_126:integer, note_127:varchar, comprea_note_driver_128:varchar, real_shop_id_129:integer, num_user_changes_130:integer, loyalty_tip_131:decimal(8,2), date_almost_delivered_132:integer, date_delivered_133:integer, date_started_134:integer, shopper_total_cost_135:decimal(8,2), lola_id_136:varchar(50), promo_hours_137:integer, date_waiting_driver_138:integer, expected_shopping_time_139:integer, total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2), $hashvalue_152:bigint]
Fragment 2 [SINGLE]
Output layout: [cart_id, sum, $hashvalue_143]
Output partitioning: HASH [cart_id][$hashvalue_143]
Stage Execution Strategy: UNGROUPED_EXECUTION
Limit[100]
│ Layout: [cart_id:integer, sum:bigint, $hashvalue_143:bigint]
│ Estimates: {rows: 100 (2.25kB), cpu: ?, memory: ?, network: ?}
└─ LocalExchange[SINGLE] ()
│ Layout: [cart_id:integer, sum:bigint, $hashvalue_143:bigint]
│ Estimates: {rows: 100 (2.25kB), cpu: ?, memory: ?, network: ?}
└─ RemoteSource[3]
Layout: [cart_id:integer, sum:bigint, $hashvalue_144:bigint]
Fragment 3 [HASH]
Output layout: [cart_id, sum, $hashvalue_145]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
LimitPartial[100]
│ Layout: [cart_id:integer, $hashvalue_145:bigint, sum:bigint]
│ Estimates: {rows: 100 (2.25kB), cpu: ?, memory: ?, network: ?}
└─ Filter[filterPredicate = ("sum" < BIGINT '10')]
│ Layout: [cart_id:integer, $hashvalue_145:bigint, sum:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
└─ Aggregate(FINAL)[cart_id][$hashvalue_145]
│ Layout: [cart_id:integer, $hashvalue_145:bigint, sum:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ sum := sum("sum_142")
└─ LocalExchange[HASH][$hashvalue_145] ("cart_id")
│ Layout: [cart_id:integer, sum_142:row(bigint, boolean, bigint, boolean), $hashvalue_145:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
└─ Aggregate(PARTIAL)[cart_id][$hashvalue_146]
│ Layout: [cart_id:integer, $hashvalue_146:bigint, sum_142:row(bigint, boolean, bigint, boolean)]
│ sum_142 := sum("expr")
└─ Project[]
│ Layout: [cart_id:integer, expr:bigint, $hashvalue_146:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ expr := CAST("quantity" AS bigint)
└─ InnerJoin[("cart_id" = "id_0")][$hashvalue_146, $hashvalue_148]
│ Layout: [cart_id:integer, quantity:integer, $hashvalue_146:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ Distribution: PARTITIONED
│ dynamicFilterAssignments = {id_0 -> #df_778}
├─ RemoteSource[4]
│ Layout: [cart_id:integer, quantity:integer, $hashvalue_146:bigint]
└─ LocalExchange[HASH][$hashvalue_148] ("id_0")
│ Layout: [id_0:integer, $hashvalue_148:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: ?}
└─ RemoteSource[5]
Layout: [id_0:integer, $hashvalue_149:bigint]
Fragment 4 [SOURCE]
Output layout: [cart_id, quantity, $hashvalue_147]
Output partitioning: HASH [cart_id][$hashvalue_147]
Stage Execution Strategy: UNGROUPED_EXECUTION
ScanFilterProject[table = app_lm_mysql:comprea.cart_product comprea.cart_product columns=[cart_id:integer:INT, quantity:integer:INT], grouped = false, filterPredicate = true, dynamicFilters = {"cart_id" = #df_778}]
Layout: [cart_id:integer, quantity:integer, $hashvalue_147:bigint]
Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}
$hashvalue_147 := combine_hash(bigint '0', COALESCE("$operator$hash_code"("cart_id"), 0))
cart_id := cart_id:integer:INT
quantity := quantity:integer:INT
Fragment 5 [SOURCE]
Output layout: [id_0, $hashvalue_150]
Output partitioning: HASH [id_0][$hashvalue_150]
Stage Execution Strategy: UNGROUPED_EXECUTION
ScanFilterProject[table = app_lm_mysql:comprea.cart comprea.cart constraint on [status] columns=[id:integer:INT, status:char(14):ENUM, date_delivered:integer:INT], grouped = false, filterPredicate = (("status" = CAST('delivered' AS char(14))) AND (CAST(with_timezone(date_add('second', CAST("date_delivered" AS bigint), TIMESTAMP '1970-01-01 00:00:00'), 'Europe/Madrid') AS date) = DATE '2022-07-24'))]
Layout: [id_0:integer, $hashvalue_150:bigint]
Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}
$hashvalue_150 := combine_hash(bigint '0', COALESCE("$operator$hash_code"("id_0"), 0))
date_delivered := date_delivered:integer:INT
id_0 := id:integer:INT
status := status:char(14):ENUM
Fragment 6 [SOURCE]
Output layout: [modified_76, numcallsclient_77, address_id_78, shop_id_79, shopper_timeslot_id_80, date_deliveries_requested_81, percentage_opened_hours_82, comprea_note_83, numcalls_84, gps_locked_85, comprea_note_warehouse_86, percentage_opened_hours_in_87, uuid_88, total_saving_89, tag_color_90, num_products_taken_91, total_cost_92, date_call_93, tag_94, delivery_type_95, date_ticket_pending_96, expected_delivering_time_97, margin_98, weight_99, commission_100, tag_signature_101, expected_eta_time_102, frozen_products_103, ebitda_104, id_105, shopper_weight_106, driver_timeslot_id_107, status_108, last_no_times_available_109, last_update_110, manual_charge_111, delivery_price_discount_112, percentage_opened_hours_in_day_113, fraud_rating_114, date_loyalty_tip_115, num_products_116, total_price_117, expected_delivering_distance_118, date_shopping_119, next_shopper_timeslot_id_120, last_overweight_notification_121, cart_progress_branch_url_122, shopper_total_price_123, shopper_algorithm_124, date_delivering_125, cash_order_id_126, note_127, comprea_note_driver_128, real_shop_id_129, num_user_changes_130, loyalty_tip_131, date_almost_delivered_132, date_delivered_133, date_started_134, shopper_total_cost_135, lola_id_136, promo_hours_137, date_waiting_driver_138, expected_shopping_time_139, total_price_discount_140, delivery_price_141, $hashvalue_153]
Output partitioning: HASH [id_105][$hashvalue_153]
Stage Execution Strategy: UNGROUPED_EXECUTION
ScanProject[table = app_lm_mysql:comprea.cart comprea.cart columns=[modified:tinyint:TINYINT, numCallsClient:integer:INT, address_id:integer:INT, shop_id:integer:INT, shopper_timeslot_id:integer:INT, date_deliveries_requested:integer:INT, percentage_opened_hours:decimal(5,2):DECIMAL, comprea_note:varchar:LONGTEXT, numCalls:integer:INT, gps_locked:tinyint:TINYINT, comprea_note_warehouse:varchar:LONGTEXT, percentage_opened_hours_in_10:decimal(5,2):DECIMAL, uuid:varchar(10):VARCHAR, total_saving:decimal(8,2):DECIMAL, tag_color:varchar(15):VARCHAR, num_products_taken:integer:INT, total_cost:decimal(8,2):DECIMAL, date_call:integer:INT, tag:varchar(20):VARCHAR, delivery_type:char(9):ENUM, date_ticket_pending:integer:INT, expected_delivering_time:integer:INT, margin:decimal(8,2):DECIMAL, weight:decimal(6,3):DECIMAL, commission:integer:INT, tag_signature:varchar(4):VARCHAR, expected_eta_time:integer:INT, frozen_products:tinyint:TINYINT, ebitda:decimal(8,2):DECIMAL, id:integer:INT, shopper_weight:decimal(6,3):DECIMAL, driver_timeslot_id:integer:INT, status:char(14):ENUM, last_no_times_available:integer:INT, last_update:integer:INT, manual_charge:tinyint:TINYINT, delivery_price_discount:decimal(8,2):DECIMAL, percentage_opened_hours_in_day:decimal(5,2):DECIMAL, fraud_rating:integer:INT, date_loyalty_tip:integer:INT, num_products:integer:INT, total_price:decimal(8,2):DECIMAL, expected_delivering_distance:integer:INT, date_shopping:integer:INT, next_shopper_timeslot_id:integer:INT, last_overweight_notification:integer:INT, cart_progress_branch_url:varchar(255):VARCHAR, shopper_total_price:decimal(8,2):DECIMAL, shopper_algorithm:varchar:LONGTEXT, date_delivering:integer:INT, cash_order_id:integer:INT, note:varchar:LONGTEXT, comprea_note_driver:varchar:LONGTEXT, real_shop_id:integer:INT, num_user_changes:integer:INT, loyalty_tip:decimal(8,2):DECIMAL, date_almost_delivered:integer:INT, date_delivered:integer:INT, date_started:integer:INT, shopper_total_cost:decimal(8,2):DECIMAL, lola_id:varchar(50):VARCHAR, promo_hours:integer:INT, date_waiting_driver:integer:INT, expected_shopping_time:integer:INT, total_price_discount:decimal(8,2):DECIMAL, delivery_price:decimal(8,2):DECIMAL], grouped = false]
Layout: [modified_76:tinyint, numcallsclient_77:integer, address_id_78:integer, shop_id_79:integer, shopper_timeslot_id_80:integer, date_deliveries_requested_81:integer, percentage_opened_hours_82:decimal(5,2), comprea_note_83:varchar, numcalls_84:integer, gps_locked_85:tinyint, comprea_note_warehouse_86:varchar, percentage_opened_hours_in_87:decimal(5,2), uuid_88:varchar(10), total_saving_89:decimal(8,2), tag_color_90:varchar(15), num_products_taken_91:integer, total_cost_92:decimal(8,2), date_call_93:integer, tag_94:varchar(20), delivery_type_95:char(9), date_ticket_pending_96:integer, expected_delivering_time_97:integer, margin_98:decimal(8,2), weight_99:decimal(6,3), commission_100:integer, tag_signature_101:varchar(4), expected_eta_time_102:integer, frozen_products_103:tinyint, ebitda_104:decimal(8,2), id_105:integer, shopper_weight_106:decimal(6,3), driver_timeslot_id_107:integer, status_108:char(14), last_no_times_available_109:integer, last_update_110:integer, manual_charge_111:tinyint, delivery_price_discount_112:decimal(8,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, date_loyalty_tip_115:integer, num_products_116:integer, total_price_117:decimal(8,2), expected_delivering_distance_118:integer, date_shopping_119:integer, next_shopper_timeslot_id_120:integer, last_overweight_notification_121:integer, cart_progress_branch_url_122:varchar(255), shopper_total_price_123:decimal(8,2), shopper_algorithm_124:varchar, date_delivering_125:integer, cash_order_id_126:integer, note_127:varchar, comprea_note_driver_128:varchar, real_shop_id_129:integer, num_user_changes_130:integer, loyalty_tip_131:decimal(8,2), date_almost_delivered_132:integer, date_delivered_133:integer, date_started_134:integer, shopper_total_cost_135:decimal(8,2), lola_id_136:varchar(50), promo_hours_137:integer, date_waiting_driver_138:integer, expected_shopping_time_139:integer, total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2), $hashvalue_153:bigint]
Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}
$hashvalue_153 := combine_hash(bigint '0', COALESCE("$operator$hash_code"("id_105"), 0))
weight_99 := weight:decimal(6,3):DECIMAL
date_deliveries_requested_81 := date_deliveries_requested:integer:INT
fraud_rating_114 := fraud_rating:integer:INT
id_105 := id:integer:INT
date_ticket_pending_96 := date_ticket_pending:integer:INT
numcalls_84 := numCalls:integer:INT
real_shop_id_129 := real_shop_id:integer:INT
expected_shopping_time_139 := expected_shopping_time:integer:INT
total_cost_92 := total_cost:decimal(8,2):DECIMAL
tag_signature_101 := tag_signature:varchar(4):VARCHAR
delivery_price_discount_112 := delivery_price_discount:decimal(8,2):DECIMAL
tag_94 := tag:varchar(20):VARCHAR
date_shopping_119 := date_shopping:integer:INT
date_call_93 := date_call:integer:INT
manual_charge_111 := manual_charge:tinyint:TINYINT
driver_timeslot_id_107 := driver_timeslot_id:integer:INT
promo_hours_137 := promo_hours:integer:INT
shopper_timeslot_id_80 := shopper_timeslot_id:integer:INT
shopper_weight_106 := shopper_weight:decimal(6,3):DECIMAL
note_127 := note:varchar:LONGTEXT
total_saving_89 := total_saving:decimal(8,2):DECIMAL
gps_locked_85 := gps_locked:tinyint:TINYINT
percentage_opened_hours_in_day_113 := percentage_opened_hours_in_day:decimal(5,2):DECIMAL
num_products_taken_91 := num_products_taken:integer:INT
commission_100 := commission:integer:INT
last_no_times_available_109 := last_no_times_available:integer:INT
percentage_opened_hours_82 := percentage_opened_hours:decimal(5,2):DECIMAL
total_price_117 := total_price:decimal(8,2):DECIMAL
date_delivering_125 := date_delivering:integer:INT
expected_eta_time_102 := expected_eta_time:integer:INT
ebitda_104 := ebitda:decimal(8,2):DECIMAL
address_id_78 := address_id:integer:INT
shopper_algorithm_124 := shopper_algorithm:varchar:LONGTEXT
shopper_total_price_123 := shopper_total_price:decimal(8,2):DECIMAL
shop_id_79 := shop_id:integer:INT
expected_delivering_time_97 := expected_delivering_time:integer:INT
date_waiting_driver_138 := date_waiting_driver:integer:INT
loyalty_tip_131 := loyalty_tip:decimal(8,2):DECIMAL
delivery_type_95 := delivery_type:char(9):ENUM
numcallsclient_77 := numCallsClient:integer:INT
date_almost_delivered_132 := date_almost_delivered:integer:INT
date_started_134 := date_started:integer:INT
total_price_discount_140 := total_price_discount:decimal(8,2):DECIMAL
uuid_88 := uuid:varchar(10):VARCHAR
frozen_products_103 := frozen_products:tinyint:TINYINT
comprea_note_warehouse_86 := comprea_note_warehouse:varchar:LONGTEXT
last_overweight_notification_121 := last_overweight_notification:integer:INT
cart_progress_branch_url_122 := cart_progress_branch_url:varchar(255):VARCHAR
last_update_110 := last_update:integer:INT
comprea_note_driver_128 := comprea_note_driver:varchar:LONGTEXT
delivery_price_141 := delivery_price:decimal(8,2):DECIMAL
lola_id_136 := lola_id:varchar(50):VARCHAR
date_delivered_133 := date_delivered:integer:INT
num_products_116 := num_products:integer:INT
modified_76 := modified:tinyint:TINYINT
status_108 := status:char(14):ENUM
next_shopper_timeslot_id_120 := next_shopper_timeslot_id:integer:INT
shopper_total_cost_135 := shopper_total_cost:decimal(8,2):DECIMAL
tag_color_90 := tag_color:varchar(15):VARCHAR
date_loyalty_tip_115 := date_loyalty_tip:integer:INT
margin_98 := margin:decimal(8,2):DECIMAL
cash_order_id_126 := cash_order_id:integer:INT
num_user_changes_130 := num_user_changes:integer:INT
comprea_note_83 := comprea_note:varchar:LONGTEXT
percentage_opened_hours_in_87 := percentage_opened_hours_in_10:decimal(5,2):DECIMAL
expected_delivering_distance_118 := expected_delivering_distance:integer:INT
Trino results
Starting the measuring session.
Query 'Smoke test query' took 1 seconds to run and returned 1 rows.
Query 'Orders and GMV by Store and Month' took 256 seconds to run and returned 62 rows.
Query 'First order by customer' took 264 seconds to run and returned 113939 rows.
Query 'Sales by country, city, month for the last 1 months' took 257 seconds to run and returned 16 rows.
Query 'Sales by country, city, month for the last 3 months' took 234 seconds to run and returned 40 rows.
Query 'Sales by country, city, month for the last 6 months' took 237 seconds to run and returned 76 rows.
Query 'Sales by country, city, month for the last 12 months' took 230 seconds to run and returned 150 rows.
Query 'Sales by country, city, month for the last 24 months' took 225 seconds to run and returned 306 rows.
Query 'Sales by country, city, month for the last 36 months' took 227 seconds to run and returned 455 rows.
Finished the measuring session.
MySQL
Starting the measuring session.
Opening up an SSH tunnel to pre.internal.lolamarket.com
SSH tunnel is now open.
Query 'Smoke test query' took 0 seconds to run and returned 1 rows.
Query 'Orders and GMV by Store and Month' took 14 seconds to run and returned 62 rows.
Query 'First order by customer' took 324 seconds to run and returned 113940 rows.
Query 'Sales by country, city, month for the last 1 months' took 161 seconds to run and returned 16 rows.
Query 'Sales by country, city, month for the last 3 months' took 169 seconds to run and returned 40 rows.
Query 'Sales by country, city, month for the last 6 months' took 171 seconds to run and returned 76 rows.
Query 'Sales by country, city, month for the last 12 months' took 178 seconds to run and returned 150 rows.
Query 'Sales by country, city, month for the last 24 months' took 208 seconds to run and returned 161 rows.
Query 'Sales by country, city, month for the last 36 months' took 234 seconds to run and returned 167 rows.
Finished the measuring session.
Closing down the SSH tunnel... SSH tunnel is now closed.
20220726
Meeting with Ricardo:
- What operating model are we planning on running in Poland?
- What do you think are the problems with Spain?
- Why drop in-house tech and go for Instaleap?
- How do you keep the frugal philosophy while being inside Glovo?
- I'm surprised we don't have slightly more "controlling"-style metrics
- What is the story behind Mercadao?
- What is your personal policy? Are you in it for the money, for the fun?
- What do you like and what don't you like about the data team as of today?
"Only measure things that matter."
Predict demand by SKU.
My ideas:
- Keep deliveries decentralized, while being very smart in optimizing workload. The Portuguese approach seems a bit too rigid. The Spanish approach is not smart enough (plus it doesn't care much about shoppers)
- Define a holy-grail set of KPIs for operational excellence (something like €/delivery + shopper's €/hour) and start doing A/B testing of different picking+delivery tactics:
- Split picking and delivering
- Double or even triple deliveries
- Picnic approach (full-refrigerated truck, all day)
- Make a different repository for experiments
- Run final experiments with everything there
- Clean up the package repository
- Update confluence page with results
- Prepare script for training session
- What is the package used for and example
  - Run any query against MySQL or Trino
  - Measure how long it takes (wall time)
  - Used through CLI
  - Sessions are defined through a JSON config
- How to install
- Tips and tricks
- Further ideas
  - Make a richer output format
  - Include features geared towards comparison (compare several versions of a query, or the same query across engines)
Questions:
- Can it write results to a table?
20220727
Hi Fer,
I'm writing you a bit of a wall of text. We can discuss it here or jump on a call if you prefer.
Summary: we want to propose adding some indices in MySQL to improve the performance of a few queries. This is about figuring out how we coordinate.
Long version: in Data we want to build some dashboards for Spanish ops. They are dashboards for day-to-day topics like shopper availability, today's/tomorrow's orders, etc. So they need to refresh frequently and be fairly interactive.
The thing is that queries on the cart table are pretty deadly. It's typical to filter to see only orders in a given status, or with a specific date_delivered or date_started. But neither the status column nor any of the date columns has an index, so every query drags on.
We want to explore with you what options there are, figure out the smartest approach, and do it.
Let me know, thanks.
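For my own reference, the kind of DDL we would be proposing (a sketch; the exact columns, single vs. composite indices and the index names are precisely what we want to agree on with Fer):
-- Candidate indices on the cart table (MySQL). Illustrative only.
alter table cart add index idx_cart_status_date_delivered (status, date_delivered);
alter table cart add index idx_cart_date_started (date_started);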
Tables that are being used:
- address
- cart
- cart_product
- cash_order
- company
- postalcode
- shop
- shopper_timeslot
- shopper_postalcode
- user
20220728
- Try to connect my DBeaver to DW through the Jumphost indicated by Carlos
- Build script to replicate tables
- Replicate tables
- Run inserts from one DB to another through Trino
- Run the tests
- Put new indices in place
- Run the test again
- Compare
CAST(status AS CHAR(14))
CAST(delivery_type AS CHAR(9))
id = 3548 -> the troublesome cart
Stuff I discovered during my debugging of the bloody enums
- ENUMS + not strict mode -> errors turn into 0/''
- ENUMS + strict mode -> Insert breaks with unhelpful message
- LM MySQL doesn't have strict mode activated
- DW does have strict mode activated
- Easiest workaround for inserting faulty ENUM values from LM MySQL to DW:
- When defining the table in DW, make the empty string ('') one of the possible ENUM values in the field. That way, when Trino tries to write an empty string coming from LM MySQL, DW sees it as a valid value (a DDL sketch follows this list).
- For now, there seems to be no downside to this. Trouble would only happen if someone exported the table with the ENUMs as integers instead of strings and tried to compare it with LM MySQL data. But that seems unlikely.
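Roughly what that workaround looks like when creating the DW copy (a sketch: the column set and the ENUM value lists are illustrative, the only point being the leading ''):
-- DW-side sketch: allowing '' lets the 0/'' values produced by LM MySQL (non-strict mode)
-- insert cleanly under DW's strict mode. Real value lists are longer than shown here.
create table cart (
    id int not null,
    status enum('', 'pending', 'delivered', 'cancelled') not null,
    delivery_type enum('', 'express', 'programmed') not null,
    primary key (id)
);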
Retro
How can we improve as a team
- Everyone provides ideas
- We vote
- We discuss the most voted
- We take action points
20220816
- Load the missing three tables.
- Run the queries again, both through Trino and directly to DW.
DW meeting with Dani
- We both agree that we should judge more carefully before we discard MySQL
20220817
Details of DW on AWS:
- Region: eu-central-1 (Frankfurt)
- Name: data-prod-mysql
- VPC: pdo-prod (vpc-6a231802)
- VPC Security Groups: default (sg-86e633ec)
Details of the EC2 I created:
- Region: eu-central-1 (Frankfurt)
- Name: performance-query-host
- Size: t2.small
Details of the SSH key created:
- query-performance-host-key
Table design scenarios
- Base scenario: tables as they are today
- Second scenario: index on status
- Third scenario: index on status and date_delivered
- Fourth scenario: partition over status (ACTIVE) and over year of date_delivered
- Fifth scenario: index on status, year and month partitioning on date_delivered
Can't partition over date_delivered because it has null values and it would have to be part of the PK.
20220818
My update:
- Managed to connect to DW
- Playing with indices and partitions, results soon
- Spoilers:
- DW is wicked fast compared to the replica (could server size be the explanation?)
- Indices seem to help when querying DW directly
- Indices seem to have no effect when querying through Trino
- Will sit today and probably tomorrow with Pinto to build a more exhaustive query set
- Want to develop the Python package a bit more because the complexity of the results is growing exponentially
Meeting with Pinto
- Pinto shares the products query that can never be run for more than 1-2 months because it dies
- She also shares another problematic one on similar tables
- I also asked for a handful of representative, business-relevant queries, even if they don't have performance problems as of today.
Partitioning pains (a sketch of the blocked DDL follows this list):
- Columns used in the partitioning key must be part of the PK
- The table can't have foreign keys
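For the record, roughly the DDL the fifth scenario implies and where it collides with these constraints (a sketch with illustrative partition names and year-level boundaries; date_delivered is unix seconds):
-- The index on status is fine on its own.
alter table cart add index idx_cart_status (status);
-- The partitioning part is what gets blocked: date_delivered would have to join the PK
-- (it can't, since it has NULLs) and the table would have to drop its foreign keys.
alter table cart
    drop primary key,
    add primary key (id, date_delivered);
alter table cart
    partition by range (date_delivered) (
        partition p2020 values less than (1609459200),  -- 2021-01-01 UTC
        partition p2021 values less than (1640995200),  -- 2022-01-01 UTC
        partition pmax  values less than maxvalue
    );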
20220819
- Update results in confluence
- Update slide
- Update readme.md in directory
- Make new feature in package
- Develop some kind of test for the package?
20220822
- upload to s3
- register with prefect
- every flow must be in a project
- I need to set up Bash for Windows
20220823
- Fill the slide comparing old LM vs new LM vs DW
- Update the confluence page
- Update the board
- Update the readme
Stuff that is missing in the docs on how to set up Prefect
- Including the MFA line in the credentials file
- Editing the .bashrc file with the line at the end
- That prefect 1.2.2 should be installed, not prefect 2
- Set up the backend.toml and prefect.toml
- Missing botocore and boto3 packages
- Permissions in S3 for the bucket
1:1 João:
- Discuss Data Quality ideas with the team
- Look at dates for the Oporto trip:
- Pensar en team building stuff:
- Personal Development:
Steps:
- Flow that runs
- Flow that runs and logs something
- Flow that runs and queries Trino
- Flow that runs and queries Trino and prints something
- Flow that runs and queries DW
- Flow that runs and queries DW and prints something
- Flow that runs and makes an update in DW
20220825
- Update Prefect setup documentation
- Schedule training
- Oporto trip
  - Agree on dates with Dani
- Get Trino and Rancher user from Dani
- Put picture in Google account
- Schedule Data Quality session with the team
- Prepare whiteboard for Great Expectations explanation
- Think about ideas for the agenda of the Oporto trip
20220829
- Share personalities
- Post DevOps offer in Python barcelona meetup
- Prepare Great Expectations demo for the team
- Prepare roadmap proposal for GE
20220830
- Tasks in Jira
- Share materials with the team
- Write a long form in Confluence
- Drop status point on Jira task
- Write summary of performance task
Re-structuring meeting with Gonçalo
- Lidl is ending its partnership with LolaMarket
- Effective 1/09
- Lidl has not been happy with the numbers
- Lidl invested 2 million in LolaMarket. Due to the share and ownership movements, in practical terms they have lost 1 million
- Bad press because of the Rider Law
- 40% of Lola Market's business
- Spend of 200,000€ / month. The numbers don't add up, and will add up even less with the partnership ending.
- An unavoidable restructuring of the business in Spain
- It is certain that the Spanish team will have to shrink
- Tech and Data hang from HQ and are immune to this
- Pilot with Carrefour in November to become their shadow operation for same-day delivery
- Gonçalo sees a potential market of €75M in Spain alongside them.
- The opportunity to save Lola
- Next steps
  - Close the restructuring plan with Glovo and share it ASAP
20220831
The next two:
- Spot new guest users from Mercadao and insert them in the DW
- Delete sensitive data for users that have been flagged as right-to-be-forgotten
Use checksum table to test existing users equality? Could the "autoincrement" for everything policy be a slippery slope?
20220901
- Map out user ETL
- Spanish ETL
- Portuguese ETL
- Set up user etl development environment
- Share meetup ideas with João
- Create tasks for all flows
20220902
- Finish mercadao new users ETL
20220903
- Discuss deployment (all together, or deploy asap?)
- Finish my write-up on my development
20220905
- Check this to handle SSH tunnel https://discourse.prefect.io/t/how-to-clean-up-resources-used-in-a-flow/84
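The linked thread is about cleaning up resources opened inside a flow. A rough sketch of that pattern for an SSH tunnel, assuming Prefect 1.x and the `sshtunnel` package; hosts, user and key path are made up:

```python
from prefect import Flow, task
from prefect.triggers import all_finished
from sshtunnel import SSHTunnelForwarder

@task
def open_tunnel():
    tunnel = SSHTunnelForwarder(
        ("bastion.example.internal", 22),
        ssh_username="etl",
        ssh_pkey="~/.ssh/etl_key",
        remote_bind_address=("mysql.example.internal", 3306),
    )
    tunnel.start()
    return tunnel

@task
def run_queries(tunnel):
    # Connect to 127.0.0.1:tunnel.local_bind_port here and do the actual work.
    return tunnel.local_bind_port

@task(trigger=all_finished)  # runs even if upstream tasks fail, so the tunnel always closes
def close_tunnel(tunnel):
    tunnel.stop()

with Flow("ssh-tunnel-cleanup") as flow:
    tunnel = open_tunnel()
    work = run_queries(tunnel)
    close_tunnel(tunnel, upstream_tasks=[work])
```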
20220906
- Ask about health insurance -> Liliana
- Make specific goals
- Dani changes the user table in DW
- Update inserts flow
- Work with Carlos Matias to create a new SSH key for ETLs
- Get
  - All the users where (the pendings)
    - Not in app
    - Yes in DB
    - System is mercadao
    - rtbf = 0
  - 10 users where (the gones)
    - Not in app
    - Yes in DB
    - System is mercadao
    - rtbf = 1
- Check that the gones
  - [x] Do not appear in the sensitive table
  - [ ] Have different created and update times <- NOPE
- Check that the pendings
  - [x] Appear in the sensitive table
- Backup the data from the pendings
- Execute the pipeline
- Check that the pendings
  - [ ] Do not appear in the sensitive table
  - [ ] Have different created and update times
20220908
- Add retries to Trino
- Finish deleted users ETL
- Update SQL of deleted users staging tables
20220908
- Modify create table to lengthen password field
- Send to João the points to review
20220912
- Sign-up in servicedesk
Insert guests
Below is the SQL to insert a guest user. My question now is: on the normal users flow, what is preventing these same users from going in? Answer: guest users don't appear in the users table in MongoDB. They simply aren't there.
-- Insert guests users
truncate table data_dw.staging.p006_pt21_new_guests_users
insert into data_dw.staging.p006_pt21_new_guests_users (
email,
id_user_internal,
n_date,
n_row,
temp_id
)
select
a.email,
b.id_user_internal,
date_format(now(), '%y%m%d%H%i%s') as n_date,
row_number() over (partition by null order by a.email) n_row,
'guest' || cast(date_format(now(), '%y%m%d%H%i%s') as varchar) || cast(row_number() over (partition by null order by a.email) as varchar) temp_id
from
(select
distinct
lower(og.customeremail) as email
from
app_md_mysql.pdo.ordergroup og
where
og.customerid is null
and og.status is not null
) a
left join data_dw.dw_xl.dim_user_sensitive b on (a.email = lower(b.email) and b.id_data_source = 1)
where
b.id_user_internal is null
and a.email is not null
order by 1
Ok. Now we have all the guests whose email does not appear in the dw already.
Next up, get DW ids for them.
insert into data_dw.staging.p006_c01_auto_increment (
id_data_source,
key_for_anything
)
select 1 as id_data_source, temp_id as id_app_user
from data_dw.staging.p006_pt21_new_guests_users
Finally, get them inside the user table.
insert into data_dw.dw_xl.dim_user_sensitive (
)
select
from
data_dw.staging.p006_pt21_new_guests_users c
inner join data_dw.staging.p006_c01_auto_increment a on (c.temp_id = a.key_for_anything)
insert into data_dw.dw_xl.dim_user (
id_user_internal,
id_data_source,
id_app_user,
flag_is_guest_account,
flag_is_social_account,
flag_agreed_campaigns,
flag_agreed_history_usage,
flag_agreed_phonecoms,
flag_agreed_terms,
flag_email_verified,
flag_is_active,
user_role,
flag_rtbf_deleted,
name,
email,
created_at,
updated_at
)
select
a.id_user_internal,
1 as id_data_source,
c.temp_id as id_app_user,
1 as flag_is_guest_account,
0 as flag_is_social_account,
0 as flag_agreed_campaigns,
0 as flag_agreed_history_usage,
0 as flag_agreed_phonecoms,
0 as flag_agreed_terms,
0 as flag_email_verified,
1 as flag_is_active,
'ROLE_USER' as user_role,
0 as flag_rtbf_deleted,
'guest account' as name,
c.email as email,
CAST(NOW() AS TIMESTAMP) as created_at,
CAST(NOW() AS TIMESTAMP) as updated_at
from
data_dw.staging.p006_pt21_new_guests_users c
inner join data_dw.staging.p006_c01_auto_increment a on (c.temp_id = a.key_for_anything)
- Modify inserts ETL to use SSH key from AWS
20220913
- Failure on new users flow
  - Review trino retry policy
  - Review why flow appears as successful
- Work on the special guest flow
Upgrade guests into normals
- We look for users that:
  - Appear in MongoDB and do not appear in DW (new-ish)
  - BUT there is an existing guest user (`flag_is_guest_account`) with the same email.
- We then update the record behind the existing guest's internal DW id with the new data from the full-blown user we have seen in MongoDB (see the sketch below).
- Staging table: `staging.p006_pt41_users_with_existing_guest`
- `c.id_user_internal as id_user_internal_guest`
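A rough sketch of that upgrade step, not the actual flow: the staging table name comes from these notes, but the connection objects, the exact column list, and the idea that the DW update goes through a plain DB-API connection are all assumptions.

```python
def upgrade_existing_guests(trino_conn, dw_conn):
    """Flip existing guest rows into full accounts, keeping their internal DW id."""
    cur = trino_conn.cursor()
    # Pairs prepared by the staging step: the guest's internal id plus the new user data.
    cur.execute("""
        SELECT id_user_internal_guest, id_app_user, name, email
        FROM data_dw.staging.p006_pt41_users_with_existing_guest
    """)
    dw = dw_conn.cursor()
    for id_guest, id_app_user, name, email in cur.fetchall():
        dw.execute(
            """
            UPDATE dw_xl.dim_user
            SET id_app_user = %s, name = %s, email = %s,
                flag_is_guest_account = 0, updated_at = NOW()
            WHERE id_user_internal = %s
            """,
            (id_app_user, name, email, id_guest),
        )
    dw_conn.commit()
```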
- Set up the meeting for the demo and Great Expectations
Team guidelines
- Each flow should have an owner. Delegating is always possible, but we should prevent bystander effect.
- Support work is work
- Issues should be issues?
1 person, 1 week. Make a task in our board.
Vacations - ask in our internal chat
Slack alerts
- Go to api.slack.com
- Create new app
- Activate Incoming Webhooks
- Create a webhook for the channel (a minimal posting sketch follows the channel codes below)
Codes for each channel
- Data team: https://hooks.slack.com/services/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
- Data team alerts: https://hooks.slack.com/services/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
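A minimal sketch of posting through one of these incoming webhooks with `requests`; reading the URL from an environment variable (the variable name here is an assumption) keeps the secret out of the code:

```python
import os
import requests

def send_slack_alert(text: str) -> None:
    webhook_url = os.environ["SLACK_DATA_TEAM_ALERTS_WEBHOOK"]  # hypothetical env var
    response = requests.post(webhook_url, json={"text": text}, timeout=10)
    response.raise_for_status()

# send_slack_alert("Flow `new users` finished with state: Failed")
```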
20220914
- Review Dani's flow
- Password as SHA
- Hardcoding of the number 2 as Lola's id data source
- `t_004`: last line of the query — what is its purpose?
- `t_004`: is the WHERE necessary? Or the new-user flag?
- Document slack thingy
- General stuff
- Link to example flow
- As part of Prefect, talk about the file in `env` with the webhook URLs
- Describe the usage of triggers
- Review weird errors in flow 1
- Check if the current S3 flow is buggy
- Learn how to avoid false Successes -> Flow reference tasks https://docs-v1.prefect.io/api/latest/core/flow.html#flow-2 (see the sketch below this list)
- Apply slack notification to all flows
- Modify upgrade guest flow with João suggestion + review
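A small sketch of the reference-tasks idea from the Prefect v1 docs linked above. The scenario is hypothetical: a downstream task with an `all_finished` trigger succeeds even when the load fails, and since the flow state follows the terminal task by default, the run looks green.

```python
from prefect import Flow, task
from prefect.triggers import all_finished

@task
def load():
    raise RuntimeError("insert failed")  # pretend the DW insert blew up

@task(trigger=all_finished)
def cleanup():
    return "cleaned"  # runs and succeeds even when `load` fails

with Flow("reference-tasks-demo") as flow:
    loaded = load()
    cleanup(upstream_tasks=[loaded])

# Without the line below, the flow state follows the terminal `cleanup` task and
# reports Success; pinning the reference task to `load` surfaces the failure.
flow.set_reference_tasks([loaded])
```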
20220915
- Review weird errors in flow 1
  - Check if the current S3 flow is buggy
  - Learn how to avoid false Successes -> Flow reference tasks https://docs-v1.prefect.io/api/latest/core/flow.html#flow-2
- Modify update flow so that it checks all fields, even those that nowadays are fixed
- Modify upgrade guest flow with João's suggestion regarding `date_registered` + review the same concept on the insert user flow
- Put docstrings in functions in flows
- Refactor all flows to not use the first staging table
- Flow 01
- Flow 02
- Flow 03 ISSUE WITH THE SHA, THIS ONE STILL USES IT. BUT NO TROUBLE IF IT'S THE ONLY ONE
- Flow 04
- Flow 05
-
Apply slack notification to all flows
- Flow 01
- Flow 02
- Flow 03
- Flow 04
- Flow 05
-
Schedule all flows
- Flow 01
- Flow 02
- Flow 03
- Flow 04
- Flow 05
20220919
- Finish docs
- General
- Flow
- 01 -> Review references
- 02
- 03
- 04
- 05
- Table Scripts
- staging.p006_c01_auto_increment
- staging.p006_pt41_users_with_existing_guest
- staging.p006_pt21_new_guests_users
- staging.p006_pt01_current_users
- staging.p006_pt11_modified_users
- staging.p006_pt31_deleted_users
- staging.p006_pt02_current_users_new
- Share Datacamp account with Ana Martins
20220920
- Orders
  - Basic MD flow
- Check how to force Prefect to return true in a flow regardless of anything
1:1 João
- Barcelona Meetup
- Berlana and the blog?
- I'll talk now with Liliana
- Feedback
- Any news from Poland?
1:1 Liliana
- PR for tech recruiting
  - Blog
  - Meetup
- Relationship with unis
  - We can get creative here
  - I have a regular flow of smart Management and Economics students in my agenda
- She comes from another start-up (Digital Therapy Software)
Python meetup Veriff
- How to propose a talk?
- Veriff is originally from Estonia
Fraud detection
- Rivo, Fraud Engineering Lead
- The founder got the idea from faking his own age. Lol.
- Do you implement a red team / blue team approach?
- Testing feels very difficult because your system is very stateful. How do you deal with it?
- In-memory data for fast queries with millions of records... how much memory were you using?
QR Codes
Ismael Benito
QR codes are two-dimensional bar codes
How could we use QR codes in Lola?
I need to do cool QR codes for my classes at UPF
- How well supported is it? Is my phone stupid?
- Any applications of this besides having fun? -> The chromatic ink stuff
20220921
- I have made the following query:
SELECT o.id as id_order,
og.id as id_order_group,
UPPER(og.customerid) as id_user,
u.id_user_internal,
u.id_app_user,
gu.id_user_internal,
gu.id_app_user
FROM app_md_mysql.pdo."order" o
LEFT JOIN app_md_mysql.pdo.ordergroup og
ON o.ordergroupid = og.id
LEFT OUTER JOIN data_dw.dw_xl.dim_user u
ON UPPER(og.customerid) = u.id_app_user
LEFT OUTER JOIN data_dw.dw_xl.dim_user gu
ON lower(og.customeremail) = gu.email
WHERE u.id_data_source = 1
AND gu.id_data_source = 1
AND u.flag_is_guest_account = 0
AND gu.flag_is_guest_account = 1
And it returns records like this:
This is strange. Each order should match with a user OR a guest, but not both.
I'm going to start looking into specific cases. The following one is the first:
| id_order | id_order_group | id_user | id_user_internal | id_app_user | id_user_internal | id_app_user |
|---|---|---|---|---|---|---|
| 1136592 | 1159257 | 6140A68762E0C1003FD59CD3 | 6694660 | 6140A68762E0C1003FD59CD3 | 7183239 | guest22091216010722939 |
STEP I want to:
- Check all the orders made by `6329A8BF6ED53300400DBD5E`
- Check the entries of both `dw_user_id=8669726` and `dw_guest_id=7182737` to explore what they look like.
RESULT Users that have changed their email and match with a guest user.
I must match orders with users, first by user, then by guest.
To achieve it by user:
- Match `og.customerid` with `dim_user.id_app_user`
- For those that don't match this way: match `og.customeremail` with guest accounts
The Bug shared by Janu
These are the faulty users:
| id_user_internal | id_data_source | id_app_user | date_joined | date_registered | date_uninstall | flag_is_active | user_role | gender | id_app_source | ip | app_version | user_agent | locale | priority | substitution_preference | default_payment_method | flag_gdpr_accepted | flag_agreed_campaigns | flag_agreed_history_usage | flag_agreed_phonecoms | flag_agreed_terms | flag_email_verified | flag_rtbf_deleted | flag_is_guest_account | flag_is_social_account | login | name | password | phone_number | id_facebook | id_stripe | id_third_party | poupamais_card | created_at | updated_at | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7177183 | 1 | guest2209121601074003 | 1 | ROLE_USER | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | brucacajope1@gmail.com | guest account | 2022-09-12 16:01:17.000 | 2022-09-12 16:01:17.000 | ||||||||||||||||||||
| 6722529 | 1 | 5EFCBE721D257E004C026D07 | 1 | pt | 0 | REPLACE | CREDIT_CARD | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | brucacajope1@gmail.com | 10219829617349375@facebook | Miriam Vaz | 87435096261dd1a22e25e55c91f045789a67ac85 | 963844518 | 10219829617349375 | 2446061022274 | 2022-09-09 09:21:42.000 | 2022-09-09 09:21:42.000 | ||||||||||||
| 7178877 | 1 | guest22091216010721636 | 1 | ROLE_USER | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | raquelsantosmota@gmail.com | guest account | 2022-09-12 16:01:17.000 | 2022-09-12 16:01:17.000 | ||||||||||||||||||||
| 6708840 | 1 | 60244B928C1878004124EDBE | 2021-02-10 21:09:38.000 | 1 | pt | 0 | REPLACE | CREDIT_CARD | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | raquelsantosmota@gmail.com | raquelmota@agpico.edu.pt | Raquel Mota | fc46053c49bc85c2d919dbdde184254b6dd27156 | 965182320 | 2446010860094 | 2022-09-09 09:21:42.000 | 2022-09-09 09:21:42.000 | ||||||||||||
| 7174727 | 1 | guest22091216010720657 | 1 | ROLE_USER | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | pedro.carreira74@gmail.com | guest account | 2022-09-12 16:01:17.000 | 2022-09-12 16:01:17.000 | ||||||||||||||||||||
| 6800726 | 1 | 62D1A2CAA8426C003F1975F2 | 2022-07-15 17:24:26.000 | 1 | pt | 0 | REPLACE | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | pedro.carreira74@gmail.com | 10224176747459562@facebook | Pedro Carreira | 939867084 | 10224176747459562 | 2022-09-09 09:21:42.000 | 2022-09-09 09:21:42.000 |
Things I observe:
- For each of them, there is one guest and one user version.
- The guest versions have been created after the user version was created.
Hypothesis
- The new guest flow is not prepared for cases where an existing user makes a purchase as a guest.
- It should check if the guest email exists in `dim_user`, but it doesn't, hence it creates the user again.
- Given that the user never appears as new, since it already exists in DW, the guest version will never be upgraded to user.
Result The hypothesis is rejected.
The new guest flow does watch out for existing users and guests in dim_user properly. It does so by matching by email.
- Alicia buys as guest
- Alicia gets inserted as guest in DW
- Alicia registers as user
- Alicia's existing guest gets upgraded to user
- Alicia buys as guest
- Alicia gets inserted as guest in DW
20220922
- Review Dani's flows
  - 02 Update users
    - `ETL_PATH` with env, otherwise the team can't run it
    - `p006_lm21_gest_users` <- typo
    - Pass slack channel from parameter, see new orders ETL flow for inspiration (Parameter sketch after this list)
    - `user_agent` gets capped to `VARCHAR(250)`. Is that length reasonable, or could it fall short?
    - Lowercase all emails so that comparisons are always proper (email is case insensitive)
    - The flow seems to do steps 1-4 for both users and guests, but then task 5 only updates users? Where are guests updated then?
- 01
  - lower email
  - slack noti + scheduled param
  - `__file__`
  - `date_joined` and `date_registered`
  - `flag_is_active`
- 02
  - lower email
  - slack noti
  - + scheduled param
  - `__file__`
  - `date_joined` and `date_registered`
  - `flag_is_active`
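A tiny sketch of the "pass slack channel from parameter" point above, assuming Prefect 1.x; the default value and the notify body are placeholders:

```python
from prefect import Flow, Parameter, task

@task
def notify(channel: str):
    # Look up the webhook URL for `channel` and post the run summary there.
    print(f"would notify {channel}")

with Flow("update-users") as flow:
    slack_channel = Parameter("slack_channel", default="#data-team-alerts")
    notify(slack_channel)

# flow.run(parameters={"slack_channel": "#data-team"})
```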
20220927
- Review the failed flows
  - 01
  - 02
  - 03 - was it changed?
- Request a user for Francisco
- Follow the instructions to get Github access
- Keep on with fixing user flows
  - 03
    - lower email <<<<<<<<<<<<<<<<<-------- same query returns different amount of rows in Trino and Python???? I left a SQL script (script-15) where both queries are there. I can use it for debugging.
      - Could the lower function be different in Trino and MySQL?
        - It seems that `LOWER` in Trino works just fine.
        - In MySQL, `LOWER` and `LCASE` are supposed to do the same thing. I wrote a query to test it and
    - slack noti
    - + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
    - EXTRACT DATE
    - UPDATE_AT in update
  - 04
    - lower email
    - slack noti + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
  - 05
    - lower email
    - slack noti + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
- Work on orders
Meeting with Gonçalo
- Focus on operations because Poland removes our front-end
- What freedom/opportunities/limitations do we have to market biedronka.pl
- 100€ free delivery... inflation
- Game theory: kidnap and squeeze
- White label is just a commodity
- Technology is shared with Instaleap.
- What does Instaleap want from us?
- How happy is Berlana with Instaleap? Quality issues?
Meeting with Ricardo
- We review the roadmap
- Ricardo explains that they have discussed with Jerónimo Martins around catalogs per store and stockouts.
  - We only have one catalog per retailer, even if we know that each store does not carry exactly the same catalog
  - JM argues that stockouts happen because our catalog is simplified, not because their operations are flawed
  - To settle the discussion, the catalogs for a couple of stores have been fixed manually to be hyper-accurate. With this, we test the following hypothesis: does having an accurate catalog reduce stockouts for that store significantly?
- Biedronka launch
  - Single shopper vs picker-driver
  - Order of Biedronka expansion by geography
  - Recurrence
  - Best drivers
For next one:
- How can we improve AOV, CPO and Stockouts
- How much money are we losing to credit card fees in Spain?
20220929
- Document user flow 006
- Follow the instructions to get Github access
- Keep on with fixing user flows
  - 03
    - lower email <<<<<<<<<<<<<<<<<-------- same query returns different amount of rows in Trino and Python???? I left a SQL script (script-15) where both queries are there. I can use it for debugging.
      - Could the lower function be different in Trino and MySQL?
        - It seems that `LOWER` in Trino works just fine.
        - In MySQL, `LOWER` and `LCASE` are supposed to do the same thing. I wrote a query to test it and
    - slack noti
    - + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
    - EXTRACT DATE
    - UPDATE_AT in update
  - 04
    - lower email
    - slack noti + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
  - 05
    - lower email
    - slack noti + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
Work on the order ETL
Okay. Since I haven't worked on this for days, I actually kind of forgot what the issue was and what I was doing about it. I'll start again from the initial point: trying to join orders to customers in the orders ETL.
Customers can be linked to orders in two ways:
- Through Mercadão's user ID.
- Through the order email.
Theoretically, with an up-to-date dim_user table:
- All orders should be linked to a customer by one of the two methods.
- Orders should only be matched by one method (if we find a user id, we should not match again by email)
- I'm going to take a look at the query I was building to check if this is the case.
After running the query with a few adjustments, I come to the following observations:
- The query matches the following way:
(screenshot of the match-type counts omitted)
- Most match by ID.
- Double matches don't worry me, since we can just prioritise using the app id match instead of the user email one.
- The not matched are the ones that worry me. I need to look into those.
Hypothesis I'll explore the following way:
- First, confirm that all non-matching orders do not have a user app id. This will confirm that we can narrow down the issue to email matching.
- If the previous is true:
- I'll get one of the non-matching orders and save the email.
- I'll go into dim user and see if I can find the email with a wrong case.
- My expectation is that I can find it. If this is not the case... I need to get creative to think what comes next.
Result
- All customer ids are null. So, I confirm that all non-matching orders are guest orders that should be matched by email.
- I pick order `106889`
  - The customer email is `conceicaorento@sapo.pt`
  - The email appears in `dim_user`, associated to a registered user.
Aha-moment: when matching by email, I should NOT only do that for guest users. There are guest orders that should be matched by email to upgraded users. The order doesn't have a customer id because it was made from a guest identity, but later we upgraded the guest to user; so there will be no matching guest account, and matching through the app user id is not possible because the order was placed in the guest past.
So, matching should be like:
- If the order has a user id, match that way
- If not:
- If order matches email with user account, use user account
- If order matches email with no user account, use guest account
- If no account, problem
Hypothesis Applying the previous matching procedure should result in no orders unmatched.
Result Unmatched orders are down from 79,174 to 1,989. We have reduced the problem, but issues still remain.
Hypothesis I'll explore the following way:
- First, confirm that all non-matching orders do not have a user app id. This will confirm that we can narrow down the issue to email matching.
- If the previous is true:
  - I'll get one of the non-matching orders and save the email.
  - I'll go into dim user and see if I can find the email with a wrong case.
  - My expectation is that I can find it. If this is not the case... I need to get creative to think what comes next.
Result
- Okay, first, for the orders that have a `customerid`:
  - There are non-matched orders with customer ids, and their IDs don't appear in `dim_user`. My guess is that the following is happening:
    - User registers
    - User places orders
    - User deletes itself from Mercadão
    - ...time passes...
    - We start feeding `dim_user`
    - End result: we never had the chance to store the user in `dim_user` since it was deleted before we ran it.
  - Options to move ahead:
    - Drop the orders: ugly as hell, potentially acceptable since it's only 2,000 orders of deleted users.
    - Make `id_user = -1` for these orders. Not that intuitive for analysts. And if we have another exception some day, we will have to do `id_user = -2` or something like that and things will start tangling up.
    - Create a new user flow for Mercadão where we infer deleted users from orders and include them in `dim_user`. A bit more work, but I think it's clean and will pay off long term.
- For the customers that don't:
  - I select the order with id `66017`
  - The customer email is `SONIASTC@SAPO.PT`
  - And I look for it in `dim_user`... and it's there, with the same case. Why didn't it match??? Well, obviously, because the `dim_user` email is in UPPER, and my query is lowercasing stuff.
  - If I rerun again ensuring cases are fine:
    - Does the same user appear? -> No
    - How many non-matched-by-email users appear? -> only 22
    - Out of these 22:
      - 6 are recent orders that were placed after the last Users ETL run, so that's fine
      - The other 16 are super old orders and all of them come from users with an email from Jerónimo Martins. When we insert guests into `dim_user`, we ignore order groups with status null. I'll do the same for this ETL.
Query for debugging in case I need it again in the future:
SELECT
o.id AS id_order,
og.id AS id_order_group,
UPPER(mdu.id_app_user) AS id_app_user,
mdug.email AS guest_user_email,
CASE
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NULL THEN mdu.id_user_internal
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NOT NULL THEN mdug.id_user_internal
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NOT NULL THEN mdu.id_user_internal
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NULL THEN -1
END AS dw_user_id,
CASE
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NULL THEN 'matched_by_id'
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NOT NULL THEN 'matched_by_email'
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NOT NULL THEN 'double_match'
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NULL THEN 'not_matched'
ELSE 'wut???'
END AS match_type,
o.*,
og.*
FROM
app_md_mysql.pdo."order" o
INNER JOIN app_md_mysql.pdo.ordergroup og
ON
o.ordergroupid = og.id
LEFT JOIN
(
SELECT
du.id_user_internal,
du.id_app_user
FROM
data_dw.dw_xl.dim_user AS du
WHERE
du.id_data_source = 1
AND du.flag_is_guest_account = 0
) AS mdu
ON
UPPER(og.customerid) = mdu.id_app_user
LEFT JOIN
(
SELECT
du.id_user_internal,
du.email
FROM
data_dw.dw_xl.dim_user AS du
WHERE
du.id_data_source = 1
) AS mdug
ON
LOWER(og.customeremail) = LOWER(mdug.email)
WHERE
o.status IS NOT NULL
AND og.status IS NOT NULL
AND CASE
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NULL THEN 'matched_by_id'
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NOT NULL THEN 'matched_by_email'
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NOT NULL THEN 'double_match'
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NULL THEN 'not_matched'
ELSE 'wut???'
END = 'not_matched'
USE PARTITIONING TO SOLVE DOUBLE MATCHES BY PRIORITISING THE USER
Snippet to add deleted users to dim_user
https://drive.google.com/file/d/1Iz_NypqRx2WgtgxNxWit0E5XggS85K7P/view?usp=sharing
20220929
- Documenting and running the deleted users thingy
- Follow the instructions to get Github access
- Keep on with fixing user flows
  - 03
    - lower email <<<<<<<<<<<<<<<<<-------- same query returns different amount of rows in Trino and Python???? I left a SQL script (script-15) where both queries are there. I can use it for debugging.
      - Could the lower function be different in Trino and MySQL?
        - It seems that `LOWER` in Trino works just fine.
        - In MySQL, `LOWER` and `LCASE` are supposed to do the same thing. I wrote a query to test it and
    - slack noti
    - + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
    - EXTRACT DATE
    - UPDATE_AT in update
  - 04
    - lower email
    - slack noti + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
  - 05
    - lower email
    - slack noti + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
20221003
- Document user flow 006
- Follow the instructions to get Github access
- Keep on with fixing user flows
  - 03 WATCH OUT, THE COMPARISON IS NOT CASE SENSITIVE
    - lower email <<<<<<<<<<<<<<<<<-------- same query returns different amount of rows in Trino and Python???? I left a SQL script (script-15) where both queries are there. I can use it for debugging.
      - Could the lower function be different in Trino and MySQL?
        - It seems that `LOWER` in Trino works just fine.
        - In MySQL, `LOWER` and `LCASE` are supposed to do the same thing. I wrote a query to test it and
    - slack noti
    - + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
    - EXTRACT DATE
    - UPDATE_AT in update
  - 04
    - lower email
    - slack noti + scheduled param
    - Set the `date_joined` of existing guests to the date of their first order. How?
    - `__file__`
    - `date_joined` and `date_registered`
    - `flag_is_active`
  - 05
    - lower email
    - slack noti + scheduled param
    - `__file__`
    - `date_joined` and `date_registered`
      - Current flow is leaving `date_registered` null (bad)
      - Current flow is "destroying" `date_joined` from the previous guest user entry. Should it be simply carried over?
    - `flag_is_active`
- Run an update on the email column to lowercase the shit out of it
- Solve the partitioning problem
| id_order | id_order_group | id_user |
|---|---|---|
| 1084918 | 1113194 | 9166273 |
| 588939 | 654522 | 9246797 |
| 664680 | 728505 | 9021328 |
Query to review if there are any uppercase emails in `dim_user`
SELECT dim_user_email_has_upper, COUNT(1)
FROM
(SELECT
a.email AS pdo_email,
b.email AS dim_user_email,
BINARY a.email <> BINARY LOWER(a.email) AS pdo_email_has_upper,
BINARY b.email <> BINARY LOWER(b.email) AS dim_user_email_has_upper,
a.email = b.email AS naive_equality,
lower(a.email) = lower(b.email) AS lowered_equality
from
staging.p006_pt01_current_users a
left join dw_xl.dim_user b on (UPPER(a.id_app_user) = UPPER(b.id_app_user) and a.id_data_source = b.id_data_source)
where
b.id_user_internal is not null
and b.id_app_user is not NULL
) email_stuff
GROUP BY dim_user_email_has_upper
20221006
- Remove unnecessary OVER PARTITION in status and date
- Make a note of the parameter change to use `updatedat` in the MD Order flow
Now, on the `orderwithuser` part, a couple of questions:
- You don't actually need to do the over partition and then the distinct. You can just do the max() and a group by o.id. It should be faster than performing the over partition.
- The `mdug` subquery, by bringing the email on the select and joining by the lower, can actually return more than 1 row on the select if the email gets lowercased and is not in the table. I know this is probably handled on dim_users, and will be solved with the max on the main query. So no worries, just to be aware of this.
Not sure about this, but we should probably limit `updatedat` to not be today, otherwise you will get users that don't exist in dim_user yet.
20221013
- Index
  - Modify table definition to include index
  - Change code back to deleting by id + fetch by creation date
- Discuss with João the issue with using `update_at` to limit the scope of the order flow
- Drop large tables from query experiment
20221014
Order columns batch 2
For each column below, the sub-tasks are: created col in `staging`, created col in `dw_xl`, included in first query, included in second query, approved.
- `id_picking_location`
- `id_retailer`
- `id_team`
- `date_created`
- `date_updated`
- `status_order`
- `total_price_ordered`
- `date_delivery_slot_start`
- `date_delivery_slot_end`
- `date_last_delivery_slot_start`
- `date_last_delivery_slot_end`
- `delivery_fee`
- `delivery_fee_discount`
- `delivery_fee_original`
- `delivery_type`
20221017
- Order LM V1
The stupid non-null case
What I was doing
- I was running an `INSERT INTO` a table.
- This included inserting into column `X`. Column `X` is nullable. I was aware that some of the values I was inserting were nulls... but that shouldn't be a problem since column `X` is nullable.
- The query was being run on Trino.
What I expected to happen: that the data just gets inserted properly.
What actually happened
- The query failed with an exception.
- The error message read: `trino.exceptions.TrinoUserError: TrinoUserError(type=USER_ERROR, name=CONSTRAINT_VIOLATION, message="NULL value not allowed for NOT NULL column: X", query_id=20221017_135640_00663_f6qyk)`
- I was puzzled since, again, column `X` is nullable.
Weird, let's research
- After toying around a bit, I ran the same query without inserting into column `X`, but keeping many other columns in place.
- After running, I got the same error message I was getting for column `X`, but this time for column `Y`.
- This was also puzzling, because column `Y` is nullable as well.
- I kept taking nullable columns out of the `INSERT`, and the error message kept jumping to some other column every time.
Finding the root cause and solution
- Eventually, I was left with 4 columns, `A`, `B`, `C` and `D`, that were NOT nullable.
- After running the `INSERT` once more, I got the same error for column `A`.
- I checked the data I was trying to insert and there were null values for column `A`. Hence, this time the error made sense: there were null values going into `A`, but `A` is not nullable.
- I modified the schema definition to make `A` nullable and ran the insert again with just `A`, `B`, `C` and `D`. It inserted properly and threw no errors.
- I then re-executed my original query with all the columns I wanted to insert. Now it worked just fine!
Take-away
If Trino ever gives you an error like `trino.exceptions.TrinoUserError: TrinoUserError(type=USER_ERROR, name=CONSTRAINT_VIOLATION, message="NULL value not allowed for NOT NULL column: X", query_id=20221017_135640_00663_f6qyk)`, completely ignore the column being mentioned. Trino will just lie to your face repeatedly. Instead, restrict your operation to only the columns that can't be null and find what part of the data is breaking the constraint.
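A hedged sketch of that debugging approach: instead of trusting the column Trino names, check which genuinely NOT NULL target columns contain nulls in the data about to be inserted. The connection, table and column names are placeholders.

```python
NOT_NULL_COLUMNS = ["a", "b", "c", "d"]  # the columns that really are NOT NULL in the target
SOURCE_TABLE = "data_dw.staging.some_staging_table"

def find_null_violations(trino_conn):
    """Return the NOT NULL columns that would break the constraint, with their null counts."""
    checks = ", ".join(f"count_if({col} IS NULL) AS {col}_nulls" for col in NOT_NULL_COLUMNS)
    cur = trino_conn.cursor()
    cur.execute(f"SELECT {checks} FROM {SOURCE_TABLE}")
    counts = dict(zip(NOT_NULL_COLUMNS, cur.fetchone()))
    return {col: n for col, n in counts.items() if n > 0}
```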
It looks like the LM order ETL is joining several users to one single order or tag. I need to look into it. Maybe Dani made some mistake in the user ETL.
20221026
- Review João's review
- `VARCHAR(36)` agreement
  - Write convention on `VARCHAR(36)` for ids
  - Modify existing columns and table scripts for `fact_order`
- The trip expenses
- Fill in details of tasks created by João
- For each column below, the sub-tasks are: created col in `staging`, created col in `dw_xl`, included in first query, included in second query, approved.
- `id_picking_location`
- `id_retailer`
- `id_team`
- `date_created`
- `date_updated`
- `status_order` (naive version)
- `total_price_ordered`
- `date_delivery_slot_start`
- `date_delivery_slot_end`
- `date_last_delivery_slot_start`
- `date_last_delivery_slot_end`
- `delivery_fee`
- `delivery_fee_discount`
- `delivery_fee_original`
- `delivery_type`
- Status
  - Check cash order
  - Check with Bruna
  - Make design proposal
- Deal with the multicurrency bit
  - Local currency should come from the currency field, not be hardcoded
  - Apply the exchange rate. See https://bi.mercadao.pt/question/1344-lm-cart.
- MAKE USER A `NOT NULL` FIELD AGAIN! (Right now it's not, because the LM ETL for `dim_user` is buggy)
- Ask specifically for Dani to review with a focus on the delivery slots fields
20221103
Shutting down
Lolamarket operations will die today.
- Is the company still up? No.
- Who will stick around? Data and tech.
- Completely shut down? Today?
- Branding. Are we also going to kill the brand? The brand continues.
- Timeline and paperwork
  - Timeline?
  - Difference between a brand-new company and a subsidiary?
  - As long as we do it calmly, all good
20221104
- Ricardo's request: add % lost sales to the dashboard
- Find out how we could do the "dirty read thingy"
- Try to make an environment where I can work with the GE notebooks
Given a ready-to-insert/update subset of data (Mirror) and a table (Target), there is no expectation that applies to Target but not to Mirror. Hence, the same expectation suite that fully validates the Mirror in an ETL flow can also fully validate the Target in a monitoring flow.
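A small sketch of that Mirror/Target idea with the legacy Great Expectations API: the suite is built once against the mirror and the exact same suite is reused to monitor the target. The DataFrames and column names below are placeholders.

```python
import great_expectations as ge
import pandas as pd

mirror_df = pd.DataFrame({"email": ["a@x.pt", "b@x.pt"], "id_user_internal": [1, 2]})
target_df = pd.DataFrame({"email": ["a@x.pt", None], "id_user_internal": [1, 3]})

# Build the suite on the Mirror (ETL-side validation).
mirror = ge.from_pandas(mirror_df)
mirror.expect_column_values_to_not_be_null("email")
mirror.expect_column_values_to_be_unique("id_user_internal")
suite = mirror.get_expectation_suite()

# Reuse the exact same suite on the Target (monitoring-side validation).
target = ge.from_pandas(target_df)
result = target.validate(expectation_suite=suite)
print(result.success)  # False here, because of the null email in the target
```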
20221107
- Clean up Sandbox
- Review Pinto's comment
  - Re-run and check if the duplicate persists -> The error persists.
  - If it persists, see what part of the monster JOIN is the one joining more than once and change what's necessary -> I just noticed it's from Mercadão, not Lolamarket. Change of plans.
  - First, I'm gonna check how many orders in each system have duplications. -> It's only Mercadão.
  - I have noticed that the Mercadão flow is still doing the truncate with Trino. This could be the reason behind the error. I'm going to check if the staging table has duplicates. -> I fixed it and it looks like that was the guilty bit. Re-executing returned no duplicates, so I'll consider it good for now unless Ana spots anything again.
20221108
Q&A for Promotech SL death
- "All seniority will be taken into account."
- What is a "Mercadão Entity"? Is it an SL company in Spain owned 100% by Mercadão? -> It's just Mercadão itself with legal presence in Spain.
- Do we have a timeline? Not that I'm in a hurry, just out of curiosity -> "Our goal is for your transition to be completed at most by mid December"
20221110
- Review Janu's expectations
- Use unique
- Explain trick on query vs table to sum all flags together
- Explain REGEX to id_app_user
- Share this: https://legacy.docs.greatexpectations.io/en/latest/reference/glossary_of_expectations.html
20221115
The Great Fixing of the Order ETLs
Put the trim in the Lolamarket order status
- Solved
Check the DELIVERING order having null status for payment
- Solved
Another thing, if you look at the orders from October, some of the orders don't have the correct status... For example: `select * from data_dw.dw_xl.fact_order where id_order = 2041500;` In this case, the status on the operational table is `delivered` and in `fact_order` it is `LOYALTY_CARD_ERROR`
- TLDR: I can't reproduce the issue
- I checked and `fact_order.original_app_status = 'deleted'` and `fact_order.id_operational_status = 4`, which correctly corresponds to `cancelled`
- Query: `select op_status.*, o.* from dw_xl.fact_order o LEFT JOIN dw_xl.dim_operational_status op_status ON o.id_operational_status = op_status.id where id_order = 2041500`
Why is the date delivered of this order `2022-09-30` instead of `2022-10-01`? `select * from data_dw.dw_xl.fact_order where id_order = 9715131;`
- Could you elaborate on the rationale behind why `date_delivered_closed` should be `2022-10-01`?
Why for order `2857238` do we have differences in the fields `delivery_fee_original_eur` and `delivery_fee_discount_eur`? (I'm using the `public.exchange_rate_eur` for the calculation)
- I can't find any problem. Could you explain what you think the following numbers should be? (These are for order `2857238`)
| delivery_fee_eur | delivery_fee_original_eur | delivery_fee_discount_eur | delivery_fee_local | delivery_fee_original_local | delivery_fee_discount_local |
|---|---|---|---|---|---|
| 0.3400 | 4.0000 | 3.6500 | 1.7200 | 19.9900 | 18.2700 |
- Check why it doesn't add up to 4. Solved: it was because the hardcoded values were not of type `DECIMAL` and Trino is an idiot.
- Trim `ROLE_USER`
- No orders from users with `ROLE_TEST`
  - Option A: we remove `ROLE_TEST` users from `dim_user`
  - Option B: we keep `ROLE_TEST` users in `dim_user` and add a filter in the `fact_order` ETL to remove those users
- Add the timezone field to orders
  - LM
  - MD
- Should we remove `cart = deleted`? YES
- Find an alternative `date_created` for Lolamarket orders where the current logic doesn't have a value:
  - FROM_UNIXTIME(cart.date_deliveries_requested) AS date_created,
- Do the change about the 100€ cart and other details of the delivery fee
Check if dates are in UTC. Checked. All is UTC atm.
I don't know if I already said this, but for this old order (`3127`) at Mercadão, we should consider filling the `deliveryslotdate` using the date of the real delivery. In these cases, we shouldn't consider the `date_last_delivery_slot_end`. It should be null, as should the `date_last_delivery_slot_start`, because these values don't make sense: `select * from data_dw.dw_xl.fact_order where date_last_delivery_slot_start is null and id_data_source = 1;`
- What is the rule to obtain the "date of the real delivery"?
- "Should be null as the `date_last_delivery_slot_start` because ..." <- Sorry, I'm not following. Order `3127` does not have a null value in `date_last_delivery_slot_start`
- It's only this order: manually copy over the value from the `last` field to the original one.
I'm not sure if we talked about these specific cases: `select * from data_dw.dw_xl.fact_order where date_last_delivery_slot_start is null and id_data_source = 1;` But I think that the `date_last_delivery_slot_end` should be null because, if you check, it doesn't make sense: we don't have slots finishing at 00:30. Maybe you could add this to the manual fixes.
- Manual fix
- Check why these 5 orders have no `date_delivered_closed` even though they are in status `delivered`
SELECT *
FROM data_dw.dw_xl.fact_order fo
WHERE fo.date_delivered_closed IS NULL
AND original_app_status = 'delivered'
20221123
- User ETL
  - Refactor to only pass the config from the expectations and the query
  - Make new bucket
  - Make new schema
  - Copy over in other flows
    - Update MD
    - Guest MD
    - Upgrade MD
    - New lola
    - Guest lola
    - Update Lola
- Per-field checks:
  - `id_user_internal`: usual
  - `id_data_source`: 2
  - `id_app_user`: like `id_user_internal`
  - `date_joined`: usual
  - `date_registered`: usual
  - `user_role`: {"ROLE_USER", "ROLE_WARNING", "ROLE_TEST", "ROLE_SUPER_WARNING", "ROLE_BANNED", "ROLE_INCIDENCES", "ROLE_ADMIN", "ROLE_SUPER_ADMIN", "ROLE_AMBASSADOR", "ROLE_SUPER_AMBASSADOR", "ROLE_B2B"}; not null
  - `gender`: {"masculine", "femenine", "none"}
  - `id_app_source`: {"1", "2", "3"}; not null
  - `ip`: ip regex
  - `locale`: "es"
  - `priority`: int between 0 and 100
  - `substitution_preference`: {"call", "nothing", "shopper"}; not null
  - `email`: regex
  - `login`: regex
  - `id_stripe`: the regex
  - `poupamais_card`: null
  - `created_at`: usual date stuff
  - `updated_at`: usual date stuff
  - Flags: `flag_is_active`, `flag_gdpr_accepted`, `flag_agreed_campaigns`, `flag_agreed_history_usage`, `flag_agreed_phonecoms`, `flag_agreed_terms`, `flag_email_verified`, `flag_rtbf_deleted`, `flag_is_guest_account`, `flag_is_social_account`
  - `phone_number`: regex
Performance Review Guidelines
- 3 moments
  - Write self-review
  - Manager (João) provides feedback
  - 1:1 with manager
- Timeline?
- Every year the same?
20221213
Docker image research
Chapter 1 - Pulling the image
If we want to be able to freely modify and update our prefect-flow images, first we must understand what's inside the current one, since we need to keep the current production flows up and running. And if we want to look into it, first we need to have it on a local machine.
My trouble began here: I tried installing docker in my Ubuntu WSL, but whenever I tried to run any docker command, I got the error message: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?. Googling this was a pain in the ass. There are a thousand reasons for this error to happen and there are a million combinations of Windows + WSL + Docker setups which apparently have all sorts of different behaviours.
I finally managed to get something working by doing this:
- Get docker working
- I installed Docker desktop in Windows.
- I installed the Debian WSL from the Microsoft Store.
- In the Docker Desktop GUI, I went to Settings (the little gear, top right) -> Resources -> WSL Integration and enabled the integration with the Debian engine.
- Debian can now use docker (as long as this Docker Desktop GUI is up and running).
- Note: our current Ubuntu WSL is not usable for this because it is a WSL Version 1 (don't ask me what this means). On the other hand, the Debian I installed is a WSL Version 2, which is correctly identified by the Docker Desktop GUI.
- Get permission to pull the image from AWS ECR
- Whoever wants to pull an image from our ECR must have the following policy in AWS: `AmazonEC2ContainerRegistryReadOnly` (note that there are higher-level policies such as `AmazonElasticContainerRegistryPublicPowerUser` or `AmazonElasticContainerRegistryPublicReadOnly` that would also work, but the read-only one is enough for this purpose).
- Get `awsume` set up in the Debian env (you can follow instructions here: https://awsu.me/general/quickstart.html)
- Get your credentials set up in `~/.aws/credentials` and use `awsume`
- Login into our AWS Docker Registry with this command: `aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin 373245262072.dkr.ecr.eu-central-1.amazonaws.com`
- You should be able to pull images from the repo like this: `docker pull 373245262072.dkr.ecr.eu-central-1.amazonaws.com/pdo-data-prefect:latest`
- Whoever wants to pull an image from our ECR must have the following policy in AWS:
A good resource to understand the AWS login stuff that's going on under the hood, so that it doesn't feel like dark magic:
- Video explaining how to login to the AWS CLI with MFA: https://www.youtube.com/watch?v=EsSYFNcdDm8
Chapter 2 - Understanding what does our image have
Once the image was in my hands, the next part was understanding what was inside. I researched a bit and apparently there are a couple of ways to reverse-engineer a Dockerfile from an image (https://appfleet.com/blog/reverse-engineer-docker-images-into-dockerfiles-with-dedockify/, https://stackoverflow.com/questions/48716536/how-to-show-a-dockerfile-of-image-docker).
To do so, I ran the following command and got the following result:
$ docker history 223871741b8a
IMAGE CREATED CREATED BY SIZE COMMENT
223871741b8a 6 months ago /bin/sh -c #(nop) ENTRYPOINT ["tini" "-g" "… 0B
<missing> 6 months ago /bin/sh -c #(nop) COPY file:6068d9a0511b2a94… 795B
<missing> 6 months ago /bin/sh -c pip install trino 267kB
<missing> 7 months ago /bin/sh -c #(nop) ENV LANG=C.UTF-8 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV LC_ALL=C.UTF-8 0B
<missing> 7 months ago /bin/sh -c #(nop) ENTRYPOINT ["tini" "-g" "… 0B
<missing> 7 months ago /bin/sh -c #(nop) COPY file:e1bbbe4447dfaf1e… 795B
<missing> 7 months ago |4 BUILD_DATE=2022-04-27T21:09:36Z EXTRAS=al… 505MB
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.bu… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.vc… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.ve… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.ur… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.na… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.sc… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL io.prefect.python-v… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL maintainer=help@pre… 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV LANG=C.UTF-8 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV LC_ALL=C.UTF-8 0B
<missing> 7 months ago /bin/sh -c #(nop) ARG BUILD_DATE 0B
<missing> 7 months ago /bin/sh -c #(nop) ARG GIT_SHA 0B
<missing> 7 months ago /bin/sh -c #(nop) ARG EXTRAS 0B
<missing> 7 months ago /bin/sh -c #(nop) ARG PREFECT_VERSION 0B
<missing> 7 months ago /bin/sh -c #(nop) CMD ["python3"] 0B
<missing> 7 months ago /bin/sh -c set -eux; savedAptMark="$(apt-m… 11.4MB
<missing> 7 months ago /bin/sh -c #(nop) ENV PYTHON_GET_PIP_SHA256… 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV PYTHON_GET_PIP_URL=ht… 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV PYTHON_SETUPTOOLS_VER… 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV PYTHON_PIP_VERSION=22… 0B
<missing> 7 months ago /bin/sh -c set -eux; for src in idle3 pydoc… 32B
<missing> 7 months ago /bin/sh -c set -eux; savedAptMark="$(apt-m… 28.1MB
<missing> 7 months ago /bin/sh -c #(nop) ENV PYTHON_VERSION=3.7.13 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV GPG_KEY=0D96DF4D4110E… 0B
<missing> 7 months ago /bin/sh -c set -eux; apt-get update; apt-g… 3.11MB
<missing> 7 months ago /bin/sh -c #(nop) ENV LANG=C.UTF-8 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV PATH=/usr/local/bin:/… 0B
<missing> 7 months ago /bin/sh -c #(nop) CMD ["bash"] 0B
<missing> 7 months ago /bin/sh -c #(nop) ADD file:8b1e79f91081eb527… 80.4MB
I quickly realised most of the image didn't look like something Ana might have modified manually, but rather standard layers. I read the Prefect docs on the topic (https://docs-v1.prefect.io/orchestration/flow_config/docker.html) and then everything clicked in my mind:
- Prefect provides a standard image to run the flows in.
- The only thing Ana had done was to extend that image by installing the Trino Python client in it.
This is great news because it means we don't need to do any black magic to make new images that can be used by the old flows. We can simply grab the standard prefect image, add our required packages and add any other changes we do and that's it.
Chapter 3 - From now on
- Try to make a new image and check if it works -> Yes!!!
- Design our image building pipeline
- Create a Git repository to document our images.
https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html
- Do the performance review self-assessment
- Make new repository and organize image modification pipeline
- Manage to get a custom package inside our image
- Start packaging things
  - Connections
  - Slack alerts
  - General bloat
  - Great Expectations
  - Common patterns
Hand-off with Dani
- DMS (Data Migration Service)
- The thingy that moves data from LM Replica to Redshift
- DMS is configured to be real-time
20221214
Payment process for Poland
- How will the shopper invoicing work in Poland?
- Which data should we look at in the DB?
  - Shopper Bill?
  - New stuff in the Flex schema?
- What is "the division between Delivery fees and CPO"?
20230102
- Fix the bug with the LM update users pipeline
- Finish PR
- Tag production image and clarify in docs that latest and production are synonymous
- Finish the cute drawing
- Review CVs
- Make the hello-world package
- Uploading to pip
  - Create new S3 bucket for Python packages
  - Write generic guide on how to make any package uploadable to the S3 bucket
  - Make specific guide on how to upload `bom-dia-mundo` to S3
- Modify build process to include custom packages
  - Create config file with list of packages and versions
  - Create script to make temp copy of it
  - Manage to get it installed
  - Run bom-dia-mundo
  - Document and discuss with João
- Prepare development plan with João
- Review the data architecture summary with João
- Document Metabase downtime, update, etc.
- Store creds for AWS
- Fix lm update users
- Take a look at API thing
- Document procedure to open connections in Looker for future reference
- Ideas for Dani's gift
  - A haircomb and a hairdresser giftcard
Interview methodology
- Technical interview
- Small E2E, case, common sense
- Checklist of tool questions
- Silly coding task, automated
20230103
Looker connection to data_dw
- RDS: https://eu-central-1.console.aws.amazon.com/rds/home?region=eu-central-1#database:id=data-prod-mysql;is-cluster=false
- Security Group: https://eu-central-1.console.aws.amazon.com/ec2/v2/home?region=eu-central-1#SecurityGroup:groupId=sg-0275d4096c2890cea
- Subnets
- https://eu-central-1.console.aws.amazon.com/vpc/home?region=eu-central-1#SubnetDetails:subnetId=subnet-42945a29
- https://eu-central-1.console.aws.amazon.com/vpc/home?region=eu-central-1#SubnetDetails:subnetId=subnet-8a955be1
- https://eu-central-1.console.aws.amazon.com/vpc/home?region=eu-central-1#SubnetDetails:subnetId=subnet-0eaed173
- https://eu-central-1.console.aws.amazon.com/vpc/home?region=eu-central-1#SubnetDetails:subnetId=subnet-79b4b803
- VPC: https://eu-central-1.console.aws.amazon.com/vpc/home?region=eu-central-1#VpcDetails:VpcId=vpc-6a231802
20230109
- Chase Carlos for the connection
- Send code assignment to review
- Fix the data_dw config myself
  - Stop the instance
  - Make a backup of the instance -> https://eu-central-1.console.aws.amazon.com/rds/home?region=eu-central-1#db-snapshot:engine=mysql;id=snapshot-for-subnet-config-changes-20230103
  - Start the instance again (otherwise the networking stuff can't be modified)
  - Move it to the default-vpc-4012372b (pdo-uat) subnet group
  - Move it back to the pdo-prod-public subnet group
  - Add again the security group with the IP exceptions (it was removed with the subnet group migrations)
  - Delete the snapshot?
- Send message to Liliana
- Upgrade of the replica to MySQL 8.0
  - Create chained replica of existing replica on MySQL 8.0
    - When creating this, I couldn't specify the MySQL version to be 8.0. It's either not allowed, or I missed it. Instead, I created the chained replica with 5.7 and afterwards upgraded it to 8.0
- Upgrade the second replica to MySQL 8.0
- Check that it replicates properly
- Modify Looker LM connection to point to second replica
- (Exploring, testing and debugging can start at this point, with some potential downtime while we do changes in the guts of MySQL)
- Perform a recoverable snapshot/backup/something of the old read replica
- Test recovering from the snapshot/backup
- CHECKPOINT only move forward if both of these are true
- We have confirmed that Looker + MySQL 8 connector satisfies our needs
- We feel comfortable rolling back the MySQL old replica from v8 to v5 if we screw up
- Execute upgrade of old replica to MySQL 8
- Switch Looker connection again to point to old replica
- Check (if something below breaks, grab logs and rollback)
- ETLs reading from old replica work fine
- Looker explore works fine
- Nothing else has broken
- Cleanup actions
- Remove second read replica
- After some prudent time (couple weeks, a month?), remove snapshots/backups from old replica in MySQL 5 version
- Related doc
- General info on read replicas https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_MySQL.Replication.ReadReplicas.html
- General info on upgrades https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_UpgradeDBInstance.MySQL.html
The busy day
- Create chained replica of existing replica on MySQL 8.0
-
Validate old dependencies on second replica
- Point Trino - Backup old connection config ofapp_lm_mysql.properties-connector.name=mysql connection-url=jdbc:mysql://lolamarket-rr2.cnlivtclari7.eu-west-1.rds.amazonaws.com connection-user=trino-bi connection-password=WdRpC6aHS4n7UnG2K9z case-insensitive-name-matching=true- Modify connection URL - After this, Trino is still reading from the old replica (validated by throwing queries through Trino and observing the monitoring pages of both replicas) - Reboot Trino somehow - We used the redeploy button in Rancher on the worker workload - It worked. After a couple of minutes, the redeploy was successful. Queries now hit the second replica instead of the old one.- Point ETL - It is done automatically after Trino points there. The ETLs never connect directly to the LM replica - Just run the ETLs and check if it runs- Point Redshift Migration
Backup current endpoint configuration so we can go back to it -> it's simply the RDS endpointStop the database migration tasks (otherwise we can't modify the endpoint)Modify endpoint to point to second replica instead of old oneStart again the task- It breaks with an error
Last Error Failed to connect to database. Task error notification received from subtask 0, thread 0 [reptask/replicationtask.c:2800] [1020414] Error 2019 (Can't initialize character set unknown (path: compiled_in)) connecting to MySQL server 'lolamarket-rr-mysql8-test.cnlivtclari7.eu-west-1.rds.amazonaws.com'; Errors in MySQL server binary logging configuration. Follow all prerequisites for 'MySQL as a source in DMS' from https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html or'MySQL as a target in DMS' from https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.MySQL.html ; Failed while preparing stream component 'st_0_KURC747DBGGXBD44ND6JF6DFZ4'.; Cannot initialize subtask; Stream component 'st_0_KURC747DBGGXBD44ND6JF6DFZ4' terminated [reptask/replicationtask.c:2808] [1020414] Stop Reason RECOVERABLE_ERROR Error Level RECOVERABLE- We decide to:
    - Make a new parameter group and set binlog_format to ROW (https://eu-west-1.console.aws.amazon.com/rds/home?region=eu-west-1#parameter-groups-detail:ids=lolamarket-replica-8-custom-params;type=instance) (see the sketch after this checklist)
    - Assign this parameter group to the second replica
    - Reboot the replica
    - Restart the DMS task
- Pivot. We decide to leave DMS dead.
- Before closing on the DMS topic, repoint the DMS endpoint to the old replica, knowing that it will fail once it gets upgraded to v8
- Validate... how?
- By checking if new carts appear in Redshift
Check that Metabase can connect to MySQL 8 -> It can
- Point old dependencies back to old replica
- Point Trino to old replica
  - Change config back to how it was
  - Redeploy workers
  - Validate with the query game again
- Upgrade old replica to MySQL 8
  - Take snapshot while in 5.7
  - Go to the Modify menu and bump to the latest version
  - Give it some time
- Validate that everything is fine
- Connect Looker to old replica on MySQL 8
- Remove second replica
- ~~Make stories for things that we could deprecate~~
  - ~~Redshift and DMS~~
  - ~~Metabase contents reading from redshift~~
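Going back to the binlog_format step in the checklist above: we did it through the console, but this is roughly the equivalent scripted with boto3. A minimal sketch only; the instance identifier is an assumption, and only the parameter group name comes from the notes.

```python
# Hedged sketch: create a custom parameter group with binlog_format=ROW and attach
# it to the second read replica. The instance identifier below is illustrative.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")

GROUP = "lolamarket-replica-8-custom-params"
REPLICA = "lolamarket-rr-mysql8-test"  # assumed instance identifier

rds.create_db_parameter_group(
    DBParameterGroupName=GROUP,
    DBParameterGroupFamily="mysql8.0",
    Description="Custom params for the MySQL 8 chained replica",
)
rds.modify_db_parameter_group(
    DBParameterGroupName=GROUP,
    Parameters=[{
        "ParameterName": "binlog_format",
        "ParameterValue": "ROW",
        "ApplyMethod": "immediate",
    }],
)
rds.modify_db_instance(
    DBInstanceIdentifier=REPLICA,
    DBParameterGroupName=GROUP,
    ApplyImmediately=True,
)
# Reboot so the new parameter group takes effect on the instance.
rds.reboot_db_instance(DBInstanceIdentifier=REPLICA)
```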
-
Read requirement documents and add my comments
- Capacity: Setting app
- I agree with Carlos Matias on the non-overlapping of the delivery areas. Implementing this in a web UI might be a nightmare. We might be better off using some GIS software such as QGIS and writing a little guide on how to create the polygons with it + implementing some kind of validator to prevent user mistakes.
- If there are areas within areas, a strong hierarchy control must be in place. Each child area should only have one parent.
- The data team should have a way to obtain the polygon files from the areas. Something programmatic would be ideal, since as we grow the number of areas might make manual downloads infeasible.
- It's key that the history of areas is kept. Removing or modifying an area should not delete the old data, but rather archive the old version. If we fail to do this, looking at historical data will be a mess and we won't be able to answer questions regarding operational areas properly.
- Orchestration: Dispatching orders
- What happens to orders that have been rejected by all shoppers?
- How are orders assigned to a fleet?
- Scenario: customer Alice places an order. Shopper Zac sees the order and accepts it. Some time afterwards, customer Alice cancels the order. How does shopper Zac learn about this? Is there any notification? Does shopper Zac have access to a list of cancelled-by-the-user orders? The rationale for my question is: as a shopper in the field, you are not proactively checking your entire order list constantly (you get a bit of tunnel vision and focus only on the order at hand and perhaps the next). I think if there is no notification to push the cancel "in the face" of the shopper, it could go unnoticed and the shopper might be planning his day assuming that order still needs to happen.
- Capacity: Setting app
-
Draft technical interview case
-
Code
Ygor Gomes
- From North Brasil
- 7 years as DevOps
- Knows Hive, Databricks, Hadoop
- Married, 2 kids
- He asks for docs. Boss
20230116
Interview with Afonso
-
Information Systems
-
Learn about ETLs and Dashboarding
-
Data Structures and Algorithms
-
Databases and relational models
-
IESE Business school (Summer)
-
AESE Business school (Summer)
-
Shopper in Mercadão (#7)
- Suggested a change in ticketing systems
-
Logistics startup
- Fill empty trucks of different providers
Java, Python, Prolog, Assembly, HTML/CSS, Databases, T-SQL
My Impression
- Overall
- Afonso is a university student with a good profile. He has the gaps that one would expect from not having work experience, but he has the beginner skills and knowledge that are needed to start in our world and seems to be smart. Hiring him will mean investing time and effort in building his skillset. But I think he is smart enough to learn fast and eventually become a good engineer.
- A disclaimer: because of the previous points, I wouldn't advise hiring Afonso on a part-time basis. Afonso will face a non-trivial ramp-up period until he is productive within the team, so working part-time could translate into waiting months before he is net positive for the team.
- Even though I genuinely think Afonso could work with us as a Junior member, I think we should try to assess slightly more experienced candidates who would be comfortable with our compensation. As much as Afonso shows great potential for a fresh graduate, the right profile with 1-2 years of experience could be orders of magnitude more useful to the team.
- On the case
-
Was the candidate able to draw a complete solution from the data sources (Pingo Doce, Orders database) to the business users (Finance team)?
- Afonso was able to draw a rough but complete processing and monitoring scheme to go from raw data to the final data that would be needed.
- On the other hand, he struggled to find the right component to host data to be accessed from Looker, and didn't suggest how to turn email data into S3 files.
-
Was the candidate curious about the business requirements? Did he make an effort to deeply understand what his business colleagues want to achieve so that he can propose the most suitable solution? Did he think out of the box to realize what was "the request behind the request"?
- Afonso asked good initial questions to better understand the situation, both data and architecture wise. He asked early if data was available to check and get an idea of what he needs to work on.
- On the other hand, Afonso didn't dive a lot into how the solution would change the finance team's way of working. I would have expected a bit more curiosity around finance's problem and brainstorming over what the solution would be from their point of view, regardless of how we built it in the backend.
-
Were the tools and techniques chosen by the candidate the best ones for each component in the system?
- Afonso made good proposals on using lambda functions to host Python code and S3 buckets to handle the data as flow of files.
- On the other hand, he didn't cover how to fetch an email into S3 and, more importantly, couldn't correctly identify the need for a SQL database in order to store the processed information for Looker to access it.
-
Did the candidate challenge the fact that Pingo Doce sends the orders through email, and explored the idea of having an alternative ingestion method that is more powerful and consistent?
- No.
-
Did the candidate realize that a very simple, version 1 solution could be delivered to the finance team to provide early value, while more advanced solutions could be worked on?
- No.
-
Did the candidate ask to see examples of the raw data?
- Yes, early. And made good remarks, paying attention to what info was contained, what were the primary keys and how it should be turned into a more structured format than excel.
-
Did the candidate show an interest for the size of the data being handled, in order to take scalability matters into account when designing the system?
- No.
-
Was the candidate able to decompose the request into different, independent user stories? Such as: ingest and clean the data, automate the detection of unmatched or inconsistent orders, automate the suggestion of data correction.
- No.
-
- Additional points
- Surprisingly, Afonso didn't have a single question to ask at the end of the interview.
20230208
Interview with Timóteo
He's from Oporto
Started in micro IT helpdesk, scaled up to IT infrastructure. Then went into big data support:
- Bash
- Ansible
- "I did mostly scripts and firmware upgrades"
- Azure, Cloudera
Now in Jumia:
- Building ETL pipelines in Airflow
- Python and SQL coding, first experiences "I'm doing basic pipelines"
- Doing Udemy course in Python
- "I consider myself a junior"
- Acts as a bit of an admin on rancher
- Apache Nifi
- AWS
- He's not sure about many things. Red flags
He will be fired in Jumia in February.
Tools
Checklist
- Python
- Git
- SQL
- Trino
- Github
- Jira
- Confluence
- Looker
- AWS
My Impression
-
Overall
- Timóteo has very little experience in Data Engineering. His knowledge is mostly about administering and maintaining infrastructure rather than designing and developing systems. He has trouble coming up with detailed designs for how to store, move and transform data. His approach to the challenge was unstructured, and most of the time he had trouble communicating his proposal in a clear and understandable manner.
- I would propose not moving forward with Timóteo. I think his experience adds little value towards the position and I don't think his soft skills are enough to compensate for it.
-
On the case
-
Was the candidate able to draw a complete solution from the data sources (Pingo Doce, Orders database) to the business users (Finance team)?
-
Was the candidate curious about the business requirements? Did he make an effort to deeply understand what his business colleagues want to achieve so that he can propose the most suitable solution? Did he think out of the box to realize what was "the request behind the request"?
-
Were the tools and techniques chosen by the candidate the best ones for each component in the system?
-
Did the candidate challenge the fact that Pingo Doce sends the orders through email, and explored the idea of having an alternative ingestion method that is more powerful and consistent?
-
Did the candidate realize that a very simple, version 1 solution could be delivered to the finance team to provide early value, while more advanced solutions could be worked on?
-
Did the candidate ask to see examples of the raw data?
-
Did the candidate show an interest for the size of the data being handled, in order to take scalability matters into account when designing the system?
-
Was the candidate able to decompose the request into different, independent user stories? Such as: ingest and clean the data, automate the detection of unmatched or inconsistent orders, automate the suggestion of data correction.
-
-
Additional points
20230214
Meeting with Ygor
-
Databases
- Lola backend
- Mercadão backend
- DW
-
Trino
-
Prefect
-
Other stuff in AWS
- ECR
- S3 buckets
- for flows
- for great expectations
- for python packages
-
Wishlist
- Improve Trino uptime
- UAT Prefect server
- Improve automations on CI (docker, python packages)
- Airbyte
20230217
Interview with Ricardo
- Ricardo's profile
- Python
- Machine Learning
- Exact optimization, metaheuristics and simulation
- Professional experience and role at UPF
- Macroeconomics
- IB at UPF and Computational Engineering and Mathematics at UOC
- Banc Sabadell
- BI
- About me
- About the course
Only afternoons, from 3pm onwards. Seminars.
Interview with Rui
- From Porto, lives in Lisbon
- Started working at Caixa General as a Business Intelligence Data Engineer
- First project: started the infrastructure from scratch with Apache Nifi
- Groovy, Java
- Sonae, Azure cloud
20230221
Silent failing flows for MD users
-
Why are they failing?
  - Because the connection object passed to the great expectations task doesn't have the raw_user attribute anymore (it used to).
  - This was introduced in the last release of the dim user project (1.0.3), when the connections stopped being managed by the hardcoded tasks and were refactored to use lolafect, which does not implement the raw_user and raw_password trick.
  - Because of this, connecting to run the GE test fails.
-
Why are they not sending alerts?
  - Because of how we named the transactions and basically built the flow, none of the final_tasks fails when t_00X fails.
  - The slack messaging is configured to send an alert only if any of the final_tasks fails, which is why no message is getting sent.
-
How to fix
  - The simplest option is to migrate the data test to lolafect already.
  - It's important that the final_tasks design mistake gets fixed so that, in the future, the slack messages get sent.
  - I was about to do this, but I see you have already started work beyond the open release and are implementing this, so I'll just leave it be and let you carry on once you are back (or we can discuss if I should take over).
Fix
Okay, we are going to work together. Here is some context for you to be aware of.
- We will be writing code for a Prefect 1 flow.
- We are using Great Expectations to evaluate some data on a MySQL database.
Review user flows failing because lolafect does not support the raw user and password trick, and why the slack alerts do not happen
- Play around with transactions in the Python Trino client to understand them and how we can use them better in ETLs
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
20230223
Discussion on Afonso and Rui
-
Liliana
- She thinks both are good culture-fit wise
- Afonso pro: operations knowledge
- Afonso pro: ambitious, eager
- Afonso dis: he might leave if he is not challenged enough
- Rui pro: knowledge and experience
- Rui cons: more expensive
-
Me
- Let's go for Rui if we want someone who knows his stuff, let's go for Afonso if we want to truly have a junior
Rui questions (deadline: yesterday). Afonso: ambitious, multinational.
20230308
Airbyte vs Fivetran
- Criteria
- Cost
- Fivetran: 5 digits per year. Susceptible to large bills due to developer mistakes. + DevOps cost for integration.
- Airbyte: infra cost (maybe 2K per year or something like that) + DevOps cost for maintenance.
- Winner: airbyte will easily be 10 times cheaper with current needs. As needs increase, the gap will widen.
- Security
- Fivetran: data travels through their infra. It does so encrypted, which theoretically means they see nothing. But we still need to keep an eye on the topic to ensure we are happy with however they are doing things. Credentials storage???
- Airbyte: everything stays at home. We do have the responsibility of security ourselves, but it shouldn't be much of a headache. Airbyte should only be accessible within our VPN and with password based access.
- Winner: airbyte makes this much simpler.
- Connectors:
- Fivetran:
- Has connectors to everything we need.
- The evolution will probably be slow (they only have so much software firepower)
- Developing custom connectors is off the table.
- Airbyte:
- Has connectors to everything we need. Some of them are in alpha and beta though. Regardless, it should probably mature fast.
- The evolution will probably be fast (open source, everyone and their mother will join)
- Developing custom connectors is on the table. It's designed for it.
- Winner: short term, it's a tie. Long term, Airbyte has the upper hand.
- Fivetran:
- Transformations: both work with dbt. I don't see any significant differences on this point.
- Other
- By simple game theory, Airbyte will outpace Fivetran. It's a solid, production-ready, open-source copycat of Fivetran. Adoption will eventually outpace Fivetran and the network effect of open source will generate a positive spiral.
- No lock-in: we do not really depend on Airbyte (as a company) for anything. If some day they go stupid, we can still use the older versions. Also, a community fork would probably appear with the general interest in mind (see PrestoSQL and Trino as an example)
Guillermina
She works at Kadre
Tether, previously Luma.
About me
- Education and experience
- Writing professional grade Python for many years
Offer
- I am familiar with the problem of grid balancing, and I feel attracted by it
- Position
- Team,
- who do I report to,
- who else is there
- Why is there an army of interns?
- (I see you are hiring for many positions)
- Selection process
- Next: get to know the client, with Martim or Luis
- Technical interview
- Where is the company at, maturity wise
- Business model
Founders: Luis and Martim
Connecting to thousands of chargers in real time. Only two people + 5 interns. They are looking for two profiles: backend/devops. Direct hiring with the company.
Why Luma
- Interesting problem
Why not lolamarket
4f0215f77a0b4e33b1ca802fc21fa6cf
20230320
Townhall
- Ricardo and Gonçalo are leaving the company.
20230324
Q&A
- Will XL be an independent division? -> Yes.
- Do we need to go to the office like Glovo guys -> No.
Slack messages not sent
Initial symptoms
The following flows failed due to a Trino outage, but the slack warnings were not sent:
006_md_03, 006_md_04, 006_md_05, 003_md_01, 003_lmpl_01, 004_md_01, 006_lmpl_01, 006_lmpl_02, 006_lmpl_03, 013_010, 007, 017_pt_01
The following flows failed due to the same outage, but did send the slack message correctly:
011_md_01, 011_lm_01, 015_md_02, 001_20
First exploration
To explore this issue, I focused on the flow 013_010, version 1.0.3 (the version that was scheduled in the prefect server).
- The logs of the failed run showed that the slack message task had been skipped:
Task 'SendSlackMessageTask': Finished task run for task with final state: 'Skipped'
- I decided to reproduce the error:
- First I ran the flow locally as-is: the execution was successful.
- I then executed again the flow locally while manually introducing an exception in the same task that failed in the scheduled run due to Trino. The Slack message was sent successfully.
- This was confusing. I was expecting the same behaviour as with the failed prefect server run
- I reviewed the logs of the prefect server run that failed and didn't send the slack message again.
- I noticed that, besides the slack message task, the ssh tunnel closure task also had been skipped:
Task 'close_ssh_tunnel': Finished task run for task with final state: 'Skipped'
- This was reasonable, since the ssh tunnel is not used in the prefect server. Nevertheless, it made me wonder: is the usage of the ssh tunnel responsible for the differences between local runs and server runs? There was no a priori reason to think so, but it was the only difference I could find between my local environment and the prefect server one.
- I then observed the code once more and realized that the slack message task had as its upstream tasks all the final tasks.
final_tasks = [transaction_end, trino_closed, dw_closed, tunnel_closed, t_090]
[...]
send_message = send_warning_message_on_any_failure(
webhook_url=channel_webhook,
text_to_send=warning_message,
upstream_tasks=final_tasks
)
- And that the closure of the SSH tunnel was considered one of the final tasks.
- Theoretically, this shouldn't matter. The slack message task is configured with a trigger of any_failed, so whether the ssh tunnel closure task was successfully run or skipped was irrelevant as long as some other final task was failing (which was the case).
- Nevertheless, I decided to try the following:
  - Modify the code to remove tunnel_closed from the final tasks.
  - Upload this version to the prefect server, along with a manually induced exception on the same task that failed previously due to Trino.
  - Observe the logs.
- Result: it worked! This time, the slack message task triggered properly and sent the message.
In conclusion, it seems that having the tunnel closed task in the final tasks somehow forces the slack message task to skip.
Open questions
- Why the hell does that happen?
- Is this the reason for all the silent failing flows we've seen? Or is this only specific to flow 013-010?
- Are there other flows that do not have tunnel_closed as part of the final tasks, and hence are sending the slack messages correctly when failing?
Next actions
- Answer the open questions
- Apply the fix to all flows that have tunnel_closed in their final tasks.
Answering open questions
- Is this the reason for all the silent failing flows we've seen? + Are there other flows that do not have tunnel_closed as part of the final tasks, and hence are sending the slack messages correctly when failing?
  - I pick a sample of the other flows to check this.
  - Faulty flows:
    - 006_md_03 -> Has tunnel_closed in final_tasks
    - 003_md_01 -> Has tunnel_closed in final_tasks
    - 006_lmpl_01 -> Has tunnel_closed in final_tasks
    - 007 -> NA, extremely outdated, doesn't use lolafect
  - Properly working flows:
    - 011_md_01 -> Has tunnel_closed in final_tasks
    - 011_lm_01 -> Has tunnel_closed in final_tasks
    - 015_md_02 -> Has tunnel_closed in final_tasks
    - 001_20 -> Has tunnel_closed in final_tasks
  - The hypothesis is not holding. tunnel_closed can be in final_tasks and everything will work fine.
  - But I have observed the following difference between the flows in both blocks: the faulty flows are using two case blocks to deal with connecting with or without an SSH tunnel. The working flows are using the pre-lolafect approach.
- Why the hell does that happen?
  - Given the answer to the previous questions, it seems that the following pattern in flows is the one that causes the issue (see the sketch below):
    - The ssh tunnel opening happens within a case block that doesn't activate when running on the prefect server.
    - The output of the ssh tunnel opening task gets referenced in the ssh tunnel closure task.
    - The output of the ssh tunnel closure task is listed as a final task.
    - The list of final tasks is passed to the slack warning task as the list of upstream tasks.
  - The solution seems easy:
    - Either put the ssh tunnel opening output outside of the case block,
    - or do not include the ssh tunnel closure output in the final tasks.
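To make the pattern concrete, here is a minimal, hypothetical Prefect 1 flow that mirrors the structure described above. All task names, the tunnel logic and the webhook are placeholders, not the real flow code.

```python
# Hedged sketch of the problematic wiring in a Prefect 1 flow.
from prefect import Flow, Parameter, case, task
from prefect.triggers import any_failed


@task
def open_ssh_tunnel():
    return "tunnel-handle"  # placeholder


@task
def close_ssh_tunnel(tunnel):
    pass  # placeholder


@task
def t_001():
    raise RuntimeError("simulated Trino outage")  # the failing task


# NOTE: Prefect 1 tasks default to skip_on_upstream_skip=True, so a Skipped upstream
# (like tunnel_closed below on the server) can skip this task even with trigger=any_failed.
@task(trigger=any_failed)
def send_warning_message(webhook_url, text_to_send):
    pass  # would post to Slack here


with Flow("sketch") as flow:
    use_tunnel = Parameter("use_tunnel", default=False)

    with case(use_tunnel, True):
        # On the prefect server this branch never activates, so the tunnel tasks
        # end up Skipped and drag the downstream alert into Skipped as well.
        tunnel = open_ssh_tunnel()
    tunnel_closed = close_ssh_tunnel(tunnel)

    work = t_001()

    final_tasks = [tunnel_closed, work]
    send_warning_message(
        webhook_url="https://hooks.slack.com/...",  # placeholder
        text_to_send="flow failed",
        upstream_tasks=final_tasks,
    )
```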
Dummy SSH tunnel task test
I decided to test the following hypothesis:
- If, instead of only creating the task output ssh_tunnel within the case block, we also create a dummy output with the same name earlier, outside of the case, does it fix the issue?
- No, it doesn't. This is puzzling me even harder.
- Fixed the evil slack error that prevented slack messages from being sent on failure.
Fixing flows
- 006_md_03: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- 006_md_04: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- 006_md_05: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- 003_md_01: https://github.com/lolamarket/data-003-etl-xl-order/pull/9
- 003_lmpl_01: https://github.com/lolamarket/data-003-etl-xl-order/pull/9
- 004_md_01: done
- 006_lmpl_01: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- 006_lmpl_02: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- 006_lmpl_03: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- 013_010: https://github.com/lolamarket/data-013-etl-xl-exchange-rate/pull/8
- 017_pt_01
Update users failed double flow
-
The flow failed
-
It's because the GE data test failed
-
The GE data test failed on an expectation on uniqueness for id_app_user
The quarantine was not made because the code has a bug. The trigger is not the right one. This happens in (at least) all the MD user flows.
-
In which PR was this introduced?
- It seems that it was in 1.1.2.
- It also affects LMPL 2 and 3.
- It affects all MD flows
- Send EVO payment
- User for Looker
- Check data issue on MD03
- Review PR for João https://github.com/lolamarket/data-003-etl-xl-order/pull/10
- Keep working on the evil slack bug
- Clean up quarantine flow
- Add onboarding steps for Afonso
Martim
-
My thousand questions
- Background
- What is the story of Tether? When did you start, what has happened since then, where are you today?
- What is your roadmap?
- How is the equity of the company structured?
- Product/Service
- Who pays who for what?
- Frequency Containment Reserves:
- TSO: how much power are you going to produce
- They act as a bidder on the power market
- CPO Charge Point Operator
- Owners get smart charging (charge when price is cheap)
- What is the potential for this market?
- How the hell are you going to control thousands of charging stations? Isn't it tremendously expensive to run such field operation? What do the numbers look like?
- How are you going to make the experience for car owners not a complete pain in the ass?
- What is your geographical scope?
- How does the evolution of the electric vehicle fleet impact your own roadmap?
- Who pays who for what?
- Team and position
- What is the current team composition? What other positions are you hiring for?
- Why do I only see a bunch of interns?
- How would I fit in? What would my work look like during my first year?
- What is your personal work culture? What do you like and don't? Could you describe me people and projects you enjoyed and people and projects you despised?
- Background
-
Martim
- Portuguese
- Bored about uni, wanted to go abroad
- Luis, meet in the masters, worked in energy in the US
The Pitch podcast - Bodyguard of the grid
Beta version running on computers. Scheduling from partners. Test version by end of the year (30-40 vehicles at first, up to 8000 at some point). Revenue at the beginning of the year. Hiring for a devops engineer.
Cloud advisor, business advisor, ML advisor (might join the team)
-
Optimization Engineer <- me
-
Devops Engineer
-
Managing next employees
Onelabel databases
-
Fer are going for a relational model, probably Aurora MySQL
-
How does this play with events?
-
Conventions for data types (money, dates)
What I would like:
- Clear documentation
- Representation layer decoupled from
- Something airbyte-able
20230517
Glovo Data Platform meeting
Briefing
We meet with:
- Kiran
- Simone
- Charly
They are somehow related to Glovo's event data platform. We meet to learn about how they work so we can get some inspiration. Opening questions:
- How the data platform works and how it is structured
- What type of access we need to drive events there and handle them
- What is possible to do and what is not
- What kind of dependencies there will be (PR approvals, team roadmap dependencies, etc.)
Notes
Simone and Charly come around.
Charly -> Manages the central data engineering team. Simone -> Manager in one of the central data teams (data platform creation team (ingesting and transforming data))
Streaming vs Batch If we don't need to mix streaming with batch data, life will be better.
Simone:
- Keep it with streaming native technologies as far as you can
- When it comes to dumping events:
- We are on Kafka (Confluent managed). We used to be on Kinesis.
- We used Kafka connect to dump data on S3. We can share the tool we use with you to dump if you live under the same infrastructure as us.
Who should we talk to about the Kafka?
What would be the ideal stack for streaming?
- Enrichment and transformation engine (Flink, KSQL, Materialized, near-real time materialization engines)
- After that, two options:
- Specialized database to dump (for example Druid)
- Trino/Starbust/Presto to throw queries
Glovo Data Platform client team
Briefing
We talk with this team, who apparently are using the data platform, to learn what their experience has been like.
Attendants
- Pablo Butron: data product manager
Notes
His team is focused on consuming customer activity events on the front-end with the goal of reproducing the behaviour of users in the app.
He thinks we should talk with Engineering, not with the Data Platform. Engineering takes care of designing the backend. Data platform only puts the infra, but doesn't "fill it" with events in any way.
They are creating a Data Mesh. They had a monolith database that is being produced. Tier structure. Tier 0 is core stuff with strong SLAs (like orders).
They are not constrained or permissioned by Data Platform. They can implement their own databases and infrastructure. They share databases with other data teams (so intermediate states of )
Declarative data products. Build products in less than a couple of hours.
Data Platform provides importers that you need to access core data.
They use Amplitude and Looker for reporting to end-users.
Most of the products serve internal reporting or external partners (like McDonalds).
They don't write back into the event streams themselves. They only act as consumers of the event bus.
20230518
- ~~Pinto says:~~
  - ~~I'm looking at the order items table, and for this order 2383028, the replacement products are missing. I think it's because you add the info when the order was created, but the delivery was a few days after. So I think in the process, maybe you should only add data on orders that were already delivered.~~
20230601
Onelabel infra with Marcos
- Event driven architecture
- Three gateways: one for retailers, another one for the backoffice apps, another one for the shopper apps
- Auth0 for authentication
- Assortment: reusing PHP service existing in Lolamarket. A re-implementation will be done.
- One database per service. AuroraMySQL 8.0
- Source of truth:
- For orders, order service
- For order planning, orchestrator service
- For fulfillment, order fulfillment
- For shopper data, shopper service
- For configuration of operations, operations config service. This includes capacity.
- Capacity is defined in operations config. Availability is computed in the order orchestrator.
- Redis for reservation expiration. Reservations only live in redis while they are not confirmed. Once they are confirmed, they get persisted.
- Glovo's Kafka will replicate all the events received by the Onelabel kafka
Questions
- Go through gateway, or have access to private network?
- We should be able to go inside the private network
- One instance per client, right?
- Not clear.
- This needs to become clearer over time.
- The most probable answer is no: multiple retailers per instance.
- Aurora CDC replicas are possible?
- It can make replicas.
- Timezones, multicurrency management, geographical data
- There are conventions and they are documented.
- For currencies: only local currency.
- For time: everything in UTC.
- For geo: the map projection is
- Is there data that only lives in the databases and doesn't get emitted as events because it is not relevant for other services?
- Yes.
- We might need to modify what gets published as events to prevent having to query the services directly.
- Are you going to NEVER have orders querying each other's databases?
- Rule of thumb is never.
- But exceptions might be done.
- Ledger mentality, mutability of records, CRUD?
- Mutable records
- Event schema registry, database models. Process views.
- They have this XL event catalog that contains everything.
- Unsure about API access.
- What strategy will you have for versioning events and databases.
- Versioning of event schemas
- Versioning of APIs
- Using events vs db vs apis?
- Skewed towards events.
Requests and next steps
- Access to the miro board
- Conventions
- GlovoXL EventCatalog
- Data contracts driven by data
20230608
The duplication of product-catalogue
- Catalogue ID:
5a71ba59bbe9c6000f7fe360 - SKU:
2147483647
But that SKU does not appear in mongodb. Wtf?
Ok, the previous thing was because of the int field size. Now I'm left with 3 bad combinations.
- #1
id_catalogue: 6009b142f27478003e846b6asku: 6
- #2
id_catalogue: 6009b142f27478003e846b6asku: 5
- #3
id_catalogue: 5a1ed1b0f777bd000f7a2ef6sku: 4
I surrender. The ID is going to be a varchar and that's it.
If I run again, I shouldn't get duplicates.
Ok. Now my issue is that checking for uniqueness on the string combo of SKU + catalogueID takes AGES.
I'm hacking to see if we can somehow use hashes to speed up this uniqueness check, because dropping it could be very dangerous. This data comes from partner catalogues, so the chances of receiving shit data are sky high.
Okay. I did the MD5 trick. It works.
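For the record, the "MD5 trick" was roughly this idea (my reconstruction, not the actual ETL code): hash the SKU + catalogue ID combo and check uniqueness on the fixed-length digest instead of the raw string combos.

```python
# Hedged sketch of the MD5 uniqueness check. Column names and the DataFrame source
# are placeholders, not the real pipeline code.
import hashlib

import pandas as pd

df = pd.DataFrame({
    "sku": ["6", "5", "4", "6"],
    "id_catalogue": ["6009b142f27478003e846b6a", "6009b142f27478003e846b6a",
                     "5a1ed1b0f777bd000f7a2ef6", "6009b142f27478003e846b6a"],
})

# Hash the concatenated business key into a short, fixed-length digest.
df["key_md5"] = (
    (df["sku"].astype(str) + "|" + df["id_catalogue"].astype(str))
    .map(lambda key: hashlib.md5(key.encode("utf-8")).hexdigest())
)

# Checking uniqueness on the digest is much cheaper than on the raw string combo.
duplicated = df[df.duplicated("key_md5", keep=False)]
print(duplicated)
```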
Now I still have to deal with some duplicate ids. Review time it is again.
Ok. The agreed procedure is to pick some values at random.
Finally, none of this was necessary once I fixed type issues in the different steps of the ETL.
The order with thousands quantity
id_order: 2409563
sku: 699569
Performance review meeting with Liliana
- H1'23
- Will be done through factorial
- June 12th to 23rd
- I rate João
- I rate other colleagues
- 26-30 June
- Each manager judges their own team members
- Optionally: self-assessment
- 3-7 July
- Calibration: I don't understand this even after the explanation
- Finished in July 24th
- Results will be visible in Factorial
A lot of pep talk on the idea that we might develop ourselves outside of the company if necessary. Is the ship sinking?
Is this going to happen every six months?
20230613
Meeting with Charly
He joined Glovo in February 2022. He lives in Sant Cugat.
Data Engineering Manager at HQ. Previously in Growth/Marketing. A veterinarian by background, he moved towards bioinformatics, then to pure data. He was at Mercado Libre as a senior data manager.
20230623
Catalogue expansion of DW
Should we first:
-
Add MD catalogue to DW (regardless of whether that's done by expanding dim_product or creating dim_catalogue)
- This is useful for name and retailer
-
Or should we include LMPL in fact_orderproduct and dim_product?
- Important to keep the model generic enough
-
Catalogue
- Check fields in Mercadão and LM
- Map them
- Share with them
- Settle for final list and model
-
In MD, the only things that seem to be interesting enough about the catalogue are the name and the retailer id
-
In LM, it seems the uniqueness of products comes from associating them with a specific store. I'm not finding the catalogue abstraction anywhere
We conclude, quick and dirty
20230627
Metabase UAT upgrade
-
Autoscaling group name: awseb-e-qguaiqvtw7-stack-AWSEBAutoScalingGroup-1E973Q3A4RYIR https://www.youtube.com/watch?v=yUXV34RrVmQ
-
Existing EC2 instance in UAT: i-07129f5b2695f6a4b
-
Existing version in UAT: v0.43.3
-
EC2 instance after reboot: i-0682d52f80e19e3bd
-
Version in UAT after reboot: v0.46.5
20230801
Trino in Metabase
https://github.com/starburstdata/metabase-driver https://www.metabase.com/docs/latest/developers-guide/partner-and-community-drivers https://docs.starburst.io/data-consumer/clients/metabase.html
How to access the machine to drop files
- Obtain the UAT SSH key
- Set up putty to open an SSH tunnel to the Metabase machine through the jumphost
- Connect through the tunnel with Putty and/or Filezilla
How to place the Starburst drivers inside the container
- First, download to your laptop the JAR file which is the driver itself. These are in the github repo from Starburst.
- Now, use Filezilla to drop the driver somewhere in the EC2 host. For example, if the driver file is called starburst-3.0.1.metabase-driver.jar, you can drop it so that it lives in /home/ec2-user/starburst-3.0.1.metabase-driver.jar
- Metabase is running as a docker container within the EC2 host.
- The path within the container where the driver file should live is /plugins. So, again, if the driver file is starburst-3.0.1.metabase-driver.jar, it should be placed at /plugins/starburst-3.0.1.metabase-driver.jar.
ISSUE: we need to restart the Metabase container so it picks up the new plugin.
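A hedged sketch of the copy-and-restart step using the Docker SDK for Python (the container name and file names are assumptions; the same can be done with docker cp plus docker restart from the shell):

```python
# Hedged sketch: copy the driver JAR into the running Metabase container and restart it.
import io
import tarfile

import docker

DRIVER_PATH = "/home/ec2-user/starburst-3.0.1.metabase-driver.jar"
CONTAINER_NAME = "metabase"  # hypothetical name; check with `docker ps`

client = docker.from_env()
container = client.containers.get(CONTAINER_NAME)

# put_archive expects a tar stream, so wrap the single JAR file in one.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    tar.add(DRIVER_PATH, arcname="starburst-3.0.1.metabase-driver.jar")
buffer.seek(0)

# Drop the driver into /plugins inside the container, then restart so Metabase loads it.
container.put_archive("/plugins", buffer.getvalue())
container.restart()
```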
Redshift and DMS deprecation
Current state as of 20230609
In the Lolamarket AWS account (251404039695):
- Redshift
- There is an existing redshift cluster
- Name: lola-market-bi
- id: 36d07598-37ab-46a6-9a60-cbb0f231fa7d
- lola-market-bi.c3ircn2vj5i5.eu-west-1.redshift.amazonaws.com:5439/lolamarket
- It is up and running. I thought we had stopped it.
- It is receiving queries on a daily basis. Some of them seem to originate from metabase.
- DMS
- The RS cluster exists as a destination point in the DMS configuration.
- There are two active replication tasks
Plan
- Create Redshift cluster snapshot: https://eu-west-1.console.aws.amazon.com/redshiftv2/home?region=eu-west-1#snapshot-details?snapshot=backup-20230614
- Test recovery
- Raise new cluster from snapshot
- Run a query in both the original cluster and the replica and check that values are identical
- Delete recovery test
- Stop Redshift cluster
- Delete DMS tasks
- Delete DMS sources and destinations
- Delete DMS Replication Instance (https://eu-west-1.console.aws.amazon.com/dms/v2/home?region=eu-west-1#replicationInstanceDetails/rds-mysql-to-redshift-instance)
- Wait some time
- Delete Redshift cluster
User issues with Pinto
-
What happens when a guest user has two possible registered users to match with? https://glovoapp.eu.looker.com/explore/XL-Biz/realtime_pt_orders?qid=bNR5i4o4UgvuTTQEes5PFX&toggle=fil
- I'm going to research in the code to understand how this gets handled.
- Flow 006-md-05-upgrade-guest-users.py is responsible for spotting:
  - Registered users that exist already in DW
  - That have the same email as some guest user that already exists in DW
- Then, DW updates the existing guest user record in DW and assigns the
-
Users that disappear with first created order (red ones in Ana's excel)
- Create story
- Store data in sandbox
- Set up time next week with Ana to review
Sunday runs issues
Current state
Currently, the order flow is scheduled with a lookback strategy. That means that, whenever you start a flow run, you must specify how many days into the past the ETL should go. For instance, running with lookback_days=7 means you will fetch orders from today up to 7 days in the past.
More specifically, this gets used in the following way in a Trino query to get orders from the source:
[... some more SQL code]
from
app_md_mysql.pdo."order" o
INNER JOIN app_md_mysql.pdo.ordergroup og ON o.ordergroupid = og.id
WHERE
og.status IS NOT NULL
AND o.status IS NOT NULL
AND o.updatedat > DATE '2023-07-01' -- <---- This would be the result of a lookback of 7 days.
AND o.updatedat < DATE '2023-07-07' -- <---- This would be today
),
[... some more SQL code]
As you can see in the previous code, what counts is the updatedat field in the Mercadão backend.
Issues with this implementation
There is one caveat to how this implementation works and one corner case which could potentially explain the issues we have seen with orders created on Monday not appearing in DW on Monday.
The caveat is that, with the current implementation, orders updated on the day on which the ETL runs are NOT included in the ETL. This can be seen here:
AND o.updatedat < DATE '2023-07-07' -- <---- This would be today
Simple. If we run on the 7th of July of 2023, orders updated on the 7th of July of 2023 are not included. The most recent ones will be the ones updated on the 6th of July of 2023.
This could be an issue for the following reasons:
- If there is any update to the cart in the night between Sunday and Monday that happens already in Monday hours, then the cart wouldn't be included.
- I am not fully aware of what timezone the Prefect container that runs the flow lives in, and this is very relevant. It could be that we are scheduling the flow at a time when in Portugal it's already Monday, but inside the Prefect machine it's still Sunday. That would leave out all the orders created on Sunday.
Solutions
- The first and obvious solution is to push the schedule deeper into Monday so that there is no doubt that it's running on Monday, even if there might be some hour differences due to timezones. Before, the flow was scheduled at 2AM UTC. Now it's scheduled at 6AM UTC.
- The second solution is to push the filter from today to tomorrow. That ensures that, irrespective of weird hour and timezone games, orders updated today are included in the ETL. This simply requires a small code change that is WIP (see the sketch below).
I am not 100% sure that these issues and caveats are what caused the past problems, since this topic is very hard to debug due to the stateful nature of both the origin sources and DW itself. Nevertheless, I'll apply both solutions and we will observe next week whether the issue is still taking place.
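A small sketch of what the second fix amounts to, assuming the flow computes the filter dates in Python before rendering the query (the parameter and variable names are illustrative, not the real flow code):

```python
# Hedged sketch: compute the lookback window so that orders updated "today" are included.
from datetime import date, timedelta

lookback_days = 7
today = date.today()

lower_bound = today - timedelta(days=lookback_days)
upper_bound = today + timedelta(days=1)  # tomorrow, so today's updates are not cut off

where_clause = (
    f"AND o.updatedat > DATE '{lower_bound.isoformat()}' "
    f"AND o.updatedat < DATE '{upper_bound.isoformat()}'"
)
print(where_clause)
```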
Guest orders get assigned to different registered users
Status log for João
20230627
- Asked Marcos for:
- Access to schema catalog -> Not deployed. Only got pointed to https://github.com/lolamarket/xl-bus-events/tree/main/src/events
- Data model/docs for the order's service database -> Not existing
Call with Víctor
- CBRE
  - Business
    - What has changed since I left? Are transactions still the core?
      - More diversified
      - New strategic consulting line
      - There is no board anymore. Just all the directors together.
    - How have you handled the last few years?
    - How is research doing?
    - Is Carlos Casado still the boss above D&T?
  - D&T
    - Let's draw the map
- The position
  - What do you call it?
  - Which skills are most important
  - Profiles of the people in charge and where they are
  - Clients of the position
    - What data products are built on top of this DWH -> We are as we were. Nothing critical, dashboards and so on, but everything very fluffy.
    - If the cable got pulled on a Monday morning, who would call me that week and why
  - Little things on the roadmap for this position
  - Things about my profile that do or don't fit for you
- Conditions
  - Money
  - Working hours
- Things that bug me
  - Building cool stuff that doesn't get used
  - Office flexibility and having to put on a suit
  - Last-minute crap because some C-level has lost their mind
  - Wasting time with meetings and lack of clarity in the roadmap. Have a plan, stick to it, keep delivering.
- Stack
  - Airflow -> Prefect
  - Data docs -> Amundsen
  - AWS -> OK
  - Query engine?
  - Visualization -> Tableau? Metabase?
  - Data Quality -> Great expectations
  - PostgreSQL -> You lucky bastards, I'm tired of MySQL
  - ELT -> Airbyte
- Others
  - How is BDC 3D doing?
  - Who from the original gang is still there?
  - How is your wife's business doing?
  - I had to leave for you guys to finally build a proper stack, huh
20230805
DP V1 Training
- V1 vs V2:
- V1 depends on the old monolithical redshift
- Kinesis is gone, now it's Confluent Kafka.
- IDP (staging) contains info as raw as possible
- ODP (data marts) contains transformed stuff
(Figure: the actual map of how we will do things.)
Two environments, dev and prod. Dev gets deployed automatically,
We need a monster laptop to be able to deploy DP1 locally.
- How do we coordinate with Data Analysts?
- Looker explores are not part of the DP
- They shouldn't care about DPs for now
- If I make a monster pipeline and smash run it every minute, where are the monster $ bills gonna come from?
- Hello-world weather forecast
DDP Training
- Hardcode table names and columns
- One DAG node or multiple -> One DAG
- Arbitrary python code, like read from a public API and write into a SQL
- Git version
Data Platform Onboarding steps
Github
Onelabel DP checklist
- Be included in glovo's github:
- Follow this guide: https://glovoapp.atlassian.net/wiki/spaces/ITS/pages/3097985253/How+to+Get+Access+to+Github
- In case the glovo people get picky, reference one of these tickets:
- Get added to the "All" team (IT support from Glovo)
- Get added to onelabel data group: https://github.com/orgs/Glovo/teams/onelabel-data
- Get added to this repository (either by Joan or João): https://github.com/Glovo/onelabel-data-mesh
AWS VPN
- General instructions: https://glovoapp.atlassian.net/wiki/spaces/TECH/pages/3468329771/AWS+Client+VPN+Access+Set-up
- If you don't have permission to get the VPN config file listed there, you can get it in our team Google Drive: Data Drive > 90 Useful > 20 vpn config files
- Once you set it up, you can try one of these links to check if things are working. Note that the important thing is that you reach the page, even if you get an access denied message (each of these services has their own permission system, independent of the VPN).
- https://notebooks.g8s-data-platform-prod.glovoint.com/
- https://starburst.g8s-data-platform-prod.glovoint.com/ui/insights/login
- https://datamesh-workflow.g8s-data-platform-dev.glovoint.com/
- https://datamesh-workflow.g8s-data-platform-prod.glovoint.com/
- https://datahub.g8s-data-platform-prod.glovoint.com/
- Known issues from my side:
- If, when you try to start the VPN, you get redirected to a Glovo OneLogin page that says access denied, you are probably missing a role in the Onelogin system. Open an IT ticket to request: "AWS VPN - All"
Call with Joan Heredia
- What does DP offer?
- Expose events and make them consumable?
- Environments
- Permissions
- Contacts
He's on Alena's team. He can point to trainings and accesses. He
20230908
Details about our new office. (Attachment: New Office - Guidelines.pdf)
Onelabel
Waiting for Onelabel team to modify events thingies. We will only have events in prod. Everything will be deployed there. Team was able to do a hello world in production.
20230912
Bitwarden migration
I should check with you if it's fine for me to simply follow the steps described here: https://pdofonte.atlassian.net/wiki/spaces/DM/pages/2569469953/Account+Migration+to+Glovo+-+2023+08
Tip from David Clemente to avoid having to create the account:
I've logged in with the old account, an error showed saying that I needed to leave all organisations in order to accept the glovo invite. I left the mercadao organisation (the only one I had in my account). Then logged in again through the invitation email, and got a message saying "Invite successfully accepted, an admin will need to accept your account" (something like that). Some minutes later I had the glovo organisation in my account. My personal vault was always there throughout the process.
According to what I've read in Slack, there should be an email somewhere in my inbox inviting me to the right Glovo-Bitwarden orgs.
And I should add this info provided to João somewhere in Confluence for the next time an employee joins the team.
For future new user access, and requests for new collections, an automated servicedesk was created, please check how to do it here. In the future, if you have any issue with the Bitwarden access or any other Bitwarden issue, please create a ticket for Security.
20230922
Airbyte
Header
- Deploy Airbyte locally with Terraform
- Deploy Airbyte on AWS manually
- Source an EC2 instance on AWS with Terraform
- Deploy Airbyte on the UAT EC2 instance with Terraform
Deploying Airbyte in UAT manually
I'm going to use a private subnet in UAT.
- VPC: # pdo-uat - vpc-4012372b
- Subnet: pdo-uat-private-1a - subnet-37858a5c
- Instance type:
t2.medium - Keypair:
pdo-uat
The instance I created: https://eu-central-1.console.aws.amazon.com/ec2/home?region=eu-central-1#InstanceDetails:instanceId=i-01be407713b011383
Prepare SSH tunnels and jump into the new instance.
Install docker and docker compose. The docker compose install must be with the modern plugin (as in, docker compose, not docker-compose). See useful link.
Follow the airbyte instructions.
Prepare SSH tunnel for web access.
Modify .env to set credentials.
Define webhooks for alerts.
Useful links
- Airbyte deployment docs: https://docs.airbyte.com/category/deploy-airbyte
- How to update Airbyte: https://docs.airbyte.com/operator-guides/upgrading-airbyte#upgrading-on-docker
- Terraform tutorial: https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli
- Fixing the issue with the local docker example in the Terraform tutorial: https://github.com/kreuzwerker/terraform-provider-docker/issues/44
- Running a shell script on an EC2 machine: https://brad-simonin.medium.com/learning-how-to-execute-a-bash-script-from-terraform-for-aws-b7fe513b6406
- Install Docker and Docker Compose on AMI 2023: https://medium.com/@fredmanre/how-to-configure-docker-docker-compose-in-aws-ec2-amazon-linux-2023-ami-ab4d10b2bcdc
- How to install Docker Compose Plugin (as in, docker compose, not docker-compose): https://stackoverflow.com/a/73680537/8776339
- On how to resize a Linux partition after an EBS volume has been enlarged: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html
Missing stuff
- Finish confluence docs
- Add docs on ssh tunnels
- Schedule session with the team
- Send pre-session material to the team
- Store videos
- Ensure all credentials are in Bitwarden
20230927
Call with Vini
- What the hell do you do? Integrations team. They are centralizing a lot of solutions and making them Infra as code.
Move data from streaming (Kafka connector) to an S3 bucket in an hourly fashion (SLA is 90 minutes).
Files are compacted in parquet. Partitioned by day.
Are there duplications?
New events won't be added automatically. We will have to ask them to ingest it and make it available for us.
Datadog to monitor ingestion. Datahub for schemas and metadata.
- How will we manage breaking changes?
- What flows of info should there be between us?
- Any documentation you can share?
20231018
Hello world in DDP
-
I can't access the FAQ on DDP.
-
Is it possible to install custom packages in our Notebook environment? If so, how?
-
Data discoverability
- Intermediate tables (our)
- Look at Airflow logs
- Datahub
- Sources
- Not really very helpful
- Intermediate tables (our)
-
Git, versioning, rolling back
- Things have changed
- We can use github
-
Multiple tables at the end?
- Yes
-
What is one DP, what are multiple DP, philosophy
- Very flexible
-
Can a DP only generate intermediate tables?
- Yes
-
Can the output of a DP be the input of another DP?
- Yes
-
We basically can combine in anyway we want
-
Going off-piste with DDP
Making the hello world
- Check the demo table on Starburst
- Make a new product
- Simply make a select star and copy it with full refresh mode
- Run
- Check Airflow
- Check resulting table in Starburst
- Experiment with all the refresh modes
- Understand full refresh
- Understand merge
- Understand XXX
- Intermediate data products
- Build a data product that only generates intermediate tables
- Read it from a different data product
- Make a session with Pinto to try to read things from one of the data products
- Delete everything to keep things clean
Counting delivered orders
Goal
Make a table in the public schema in delta with the following schema:
| Column | Description | Type | PK |
|---|---|---|---|
| date_utc | The date | date | X |
| orders_delivered_count | The number of orders delivered in that date | integer |
The table should contain all the history of Onelabel. The table should be refreshed on an hourly basis.
Design
Sources
- Table
"hive"."desert"."orders_v0__com_glovoxl_events_orders_v0_orderdelivered"
Final tables
- Table
"XXX"."YYY"."whatever_something_that_thing__delivered_orders_per_day"
Schedule
30 * * * *
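For reference, this is roughly the aggregation the data product needs, thrown here through the Python Trino client for a quick validation. The connection details and output handling are placeholders and authentication is omitted; only the source table comes from the notes above.

```python
# Hedged sketch: validate the delivered-orders-per-day aggregation via the Trino client.
import trino

conn = trino.dbapi.connect(
    host="starburst.g8s-data-platform-prod.glovoint.com",  # assumed host
    port=443,
    user="my.user",  # placeholder; real access needs authentication
    http_scheme="https",
    catalog="hive",
    schema="desert",
)

query = """
select
    cast(createdat as date) as date_utc,
    count(*) as orders_delivered_count
from hive.desert.orders_v0__com_glovoxl_events_orders_v0_orderdelivered
group by 1
order by 1
"""

cur = conn.cursor()
cur.execute(query)
for date_utc, orders_delivered_count in cur.fetchall():
    print(date_utc, orders_delivered_count)
```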
We had an issue with permissions. The DDP user doesn't have permission to read from the desert schema. Instructions from Joan to fix this:
You can find the DDP user in the Airflow rendered template, in the query_to_table operator, if you search for b_dp_
It's something like this: b_dp_10a8aaa3-252e-4295-a9eb-26c3f3475e06
Insert the user in b_desert_migration_tmp
This is the user I found: b_dp_9e1363ee-4444-4ad2-8c87-15ed05a76e71
Event questions with Marcos
-
What is the id_company field?
Do the events have IDs?
- Not really
-
Picking Location
  - IDs that exist in changed events but not in added
-
Orders
  - Two timestamps: createdAt and the kafka timestamp. What to use?
    - Doubt.
-
Timezones
- Kafka and createdAt are always UTC
-
Currency
- What currency codes are you following?
- https://en.wikipedia.org/wiki/ISO_4217
We get the VPN and more permissions in Confluent Cloud to read events
-
Documentation of lifecycle of orderings and picking
- Not there
-
- Change management flow for versioning, especially breaking changes
- How do we do?
- How will we know when the different versions stop/start?
-
Dirty data in prod. Will you remove it? If not, how do we tell it apart?
- Fresh start at some point
-
Will events be forever?
- Initially yes, let's wait for us to run out of space to consider a change
20231019
003-etl-xl-order - 003-md-01-refresh-orders - Uncontrolled enum change
TLDR: the Mercadão backend is breaching our unwritten data contract by having the value DELIVERED in the delivery_type field of orders.
Context
The flow 003-md-01-refresh-orders has an expectation for the field delivery_type that restricts the valid values to the set {"DELIVERY", "CLICK_AND_COLLECT"}
Issue
In this flow run (https://prefecthq-ui.mercadao.pt/mercadao/flow-run/786b017b-1ccf-49dc-886c-4c25d4b0a137), two records contained the value DELIVERED for the field delivery_type.
The batch of data containing the invalid values is stored in the quarantine table QUARANTINETABLE HERE. The specific records can be spotted with the following query: QUERY
This is a risk for upstream code in looker that might have hardcodes that rely on delivery_type only having {"DELIVERY", "CLICK_AND_COLLECT"} as valid values.
Possible courses of action
- Talk with engineering team to understand what happened and make them change the values.
- Ideally, they go back to only using
{"DELIVERY", "CLICK_AND_COLLECT"}. - If not, we must adapt on our side***.
- Ideally, they go back to only using
- Ignore the tech team altogether and simply adapt on our side***.
***Options to adapt on our side:
- Play with the ETL to turn DELIVERED values into DELIVERY (see the sketch below).
- Add DELIVERED as a valid value for the field delivery_type and adapt Looker and other reports.
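A minimal sketch of the first option, assuming the batch is handled as a pandas DataFrame at some point in the ETL (the function name and column handling are assumptions, not the real pipeline code):

```python
# Hedged sketch: normalize the unexpected enum value before the expectation runs.
import pandas as pd

VALID_DELIVERY_TYPES = {"DELIVERY", "CLICK_AND_COLLECT"}


def normalize_delivery_type(batch: pd.DataFrame) -> pd.DataFrame:
    """Map the rogue DELIVERED value back to DELIVERY, leaving valid values untouched."""
    batch = batch.copy()
    batch["delivery_type"] = batch["delivery_type"].replace({"DELIVERED": "DELIVERY"})

    # Fail loudly if the backend starts sending yet another unexpected value.
    unexpected = set(batch["delivery_type"].unique()) - VALID_DELIVERY_TYPES
    if unexpected:
        raise ValueError(f"Unexpected delivery_type values after mapping: {unexpected}")
    return batch
```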
Aftermath
Someone had messed around with the faulty orders manually in the backoffice. The values were fixed manually, again in the backoffice, and the ETL ran just fine afterwards.
20231025
Data Quality incident on 020_ETL_XL_Products_020_md_01_refresh_product on 20231024
Context
The flow has failed due to not passing the data test.
Only one expectation was violated: the uniqueness of the combination of id_product and id_catalogue.
The quarantine data reveals that all the values have been duplicated several times. Each combination appears four times. This is very, very weird.
Diagnostic
As a first step, I try to simply rerun the flow without any change to see if the duplication persists.
The rerun generates the same problem without the slightest change.
I have checked the repository and it seems João made some changes to the pipeline last week with release 0.4.2.
I'm going to run the suspicious query for versions 0.4.2 and 0.4.1 and compare the output.
After running the transformation step query for both 0.4.2 and 0.4.1, I can conclude that the issue appeared in 0.4.2. The query from 0.4.1 on the same data does not generate duplicate values.
Furthermore, I can see that query 0.4.2 introduced two joins in the query. These new joins are most probably the source of the duplications.
Conclusion: changes from 0.4.2 are responsible for duplicating values in the pipeline.
Courses of action
We can:
- Roll back to 0.4.1 ASAP to keep the product table refreshed, even if the new category_XXX columns that we introduced in release 0.4.2 will remain empty.
- Simultaneously, we should apply a fix and make a new release 0.4.3 that includes the category_XXX columns but does not introduce duplicate values mistakenly.
Links and references
- First flow failure: https://prefecthq-ui.mercadao.pt/mercadao/flow-run/feb5ab1e-cba6-4064-8e4c-bc16e3659e16
- Great expectations validation output with failed expectations: https://s3.console.aws.amazon.com/s3/object/pdo-prod-great-expectations?region=eu-central-1&prefix=validations/020_ETL_XL_Products_020_md_01_refresh_product_suite/020_ETL_XL_Products_020_md_01_refresh_product_suite_checkpoint/20231024T061732.662931Z/3c859ea3c1fa0158eeceadd14d3d5950.json
- Quarantine table: quarantine.020_md_01_refresh_product_20231024_061838
Work log
- Make story about coding fix
- Rollback deployed flow to 0.4.1
- Re-run with flow 0.4.1
Ongoing
DATA-1182
https://pdofonte.atlassian.net/jira/software/c/projects/DATA/boards/6?selectedIssue=DATA-1182
First look
Can I derive all the fields required in the task from the query that João already composed?
| Task Field | Query peer | Comments |
|---|---|---|
| Order ID | id_order | |
| Status | status | |
| Shopper ID | id_shopper | |
| Team / Fleet | NA | We must obtain it through a very messy process from the shopper. I would drop it for now. |
| Date Created | date_order_received | |
| Date Delivered | date_order_delivered | |
| Local Currency | local_currency | |
| Total Amount Ordered | charged_amount | |
| Total Amount Delivered | NA | It's unclear how to obtain this. Must it be derived from adding up all the "Article picked" events, and subtracting all the removed articles? The events around articles are also confusing (difference between stockout and not available? difference between unpicked and removed?) |
| Delivery Time Slot Start | delivery_time_slot_start | |
| Delivery Time Slot End | delivery_time_slot_end | |
| Picking Location ID | id_picking_location |
A couple of them are ambiguous. I'll drop them for now and get ahold of Marcos to clarify how to obtain them.
Design
We are going to make a pipeline that generates the final table in a single step.
This is the query:
select
rec.orderref as id_order,
coalesce(can.status, del.status, oit.status, cod.status, pic.status, allo.status, con.status, rej.status, rec.status) as status,
rec.kafka_record_timestamp as date_order_received,
del.createdat as date_order_delivered,
con.pickinglocation.id as id_picking_location,
allo.shopperid as id_shopper,
coalesce(con.totals.amountcharged.amount, rec.totals.amountcharged.amount) as charged_amount,
coalesce(con.totals.amountcharged.currency, rec.totals.amountcharged.currency) as local_currency,
rec.servicedate.fromdate as delivery_time_slot_start,
    rec.servicedate.todate as delivery_time_slot_end
from
delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderreceived rec
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderconfirmed con on (rec.id = con.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderrejected rej on (rec.id = rej.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderallocated allo on (rec.id = allo.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderinpicking pic on (rec.id = pic.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_ordercheckoutdone cod on (rec.id = cod.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderintransit oit on (rec.id = oit.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderdelivered del on (rec.id = del.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_ordercanceled can on (rec.id = can.id)
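Given the duplicates incident above, before wiring this into the pipeline I would add a quick fan-out check on the same source tables: each event table should contribute at most one row per order id. A minimal sketch for one of the joins (the remaining event tables can be checked the same way):
-- Fan-out check: any order id producing more than one row after this join
-- violates the one-event-per-order assumption behind the query above.
select
    rec.id,
    count(*) as rows_produced
from delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderreceived rec
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderdelivered del
    on (rec.id = del.id)
group by rec.id
having count(*) > 1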
Which means that the
Onelabel ways of working
- Alerts
- FX data
- Team being able to modify pipeline
- Data Quality Expectations
Departure Management
Open fronts list
- lolafect
  - On the usage of lolafect, I think the team is up to date.
  - On the internals of lolafect, that's a different story. I think I'm the only one fully familiar with the internals of the package. Perhaps it would be a good idea to hold a session on it, and perhaps to go for a couple of silly-but-real stories that implement something in the package, just so that you and/or Afonso go through the entire lifecycle of adding something new to lolafect and making a new release with it.
- DDP
  - I generally think we are quite aligned.
  - I will document all the knowledge I have and the ongoing work with the orders pipeline.
  - I would suggest planning some transfer sessions in my last week for whatever WIP I still have.
- AWS
  - I think we are in sync and I don't have any knowledge that the team doesn't. If you think there's any foggy area that I should transfer or document, let me know.
- Prefect
  - I think we are in sync and I don't have any knowledge that the team doesn't. If you think there's any foggy area that I should transfer or document, let me know.
- Airbyte
  - I think we are in sync and I don't have any knowledge that the team doesn't. If you think there's any foggy area that I should transfer or document, let me know.
- Trino
  - I think we are in sync and I don't have any knowledge that the team doesn't. If you think there's any foggy area that I should transfer or document, let me know.
Call with Joan to improve ways of working
- Overcoming notebooks
  - Option 1: share notebooks and just survive
  - Option 2: repository
    - Centralizes development in a repository with a GitHub Action that publishes to production
    - Is it a notebook, or how do you structure the code?
    - One repo for one DDP, or one repo for all DDPs?
  - Option 3: same as option 2, but with an individual repository for onelabel
- Data quality checks
  - Use the data quality checks from Glovo, not GE, not dbt-expectations
  - Docs for that?
  - We will read the tutorial notebooks and come back with questions
  - Control flow? No
- Alerts on Airflow failures
  - https://glovoapp.atlassian.net/servicedesk/customer/portal/26/PHC-11171?created=true
Artifactory instructions setup: https://glovoapp.atlassian.net/wiki/spaces/TECH/pages/1124565144/Configure+access+to+Artifactory+repository cmVmdGtuOjAxOjE3MzA4OTcyNjA6NGtQaTJRbDk3QTFObmxHa25QNWY2NHc2R2x5
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
- Browser hack (Charly)
- DQ for each ODP output
- Add SLO to add_sql_transformation
OIDC Token Refresh bug
- Package version: glovo-data-platform-meshub-client==0.1.68
- Summary: the meshub client fails to correctly store a valid access token when using a refresh token to obtain it. The new access token is obtained correctly and stored in memory, but it never reaches the credentials file. The issue is caused by an exception, jwt.exceptions.ImmatureSignatureError: The token is not yet valid (iat), that gets triggered in the following line of code: 34eb43822f/meshub-backend/service/glovo_data_platform/meshub_client/authentication/oidc_tokens.py (L37).
- Current workaround: copy the new access token from an in-memory variable by debugging with PyCharm and paste it into the credentials file, along with some faked-out access_issued_at and access_expires_at. Obviously, we won't get very far this way.
Tag: Joan, @dp-transformations-greenflag.
This is what my credentials file right now looks like:
[oidc meshub]
refresh_token = eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6Ii1FY0U2TlE0MFJtaUJRQUt6Zm4wYnFheXhZY2JXOXdaaURmZ28wQm52a3cifQ.eyJqdGkiOiJWWTZlZFFUZV9pTU9mRWhzM1d0cDciLCJzdWIiOiI5MjQ0MjU4NSIsImlzcyI6Imh0dHBzOi8vZ2xvdm9hcHAub25lbG9naW4uY29tL29pZGMvMiIsImlhdCI6MTY5OTQ1NDQ0MiwiZXhwIjoxNzAyMDQ2NDQyLCJzY29wZSI6Im9wZW5pZCIsImF1ZCI6ImNjOTc5NmIwLTc3ZDgtMDEzYi1jOThmLTA2NThkMDRkMjM2NjM3ODE1In0.hpp8bKfSSBpivMVl3zwwPXeDtGzOrPETAI-HRsy-hsgVqG13eahdw8MAHgDKNUdXQ-l01uqGG90RiYXn3CCU8b5Bx3QEh90FMQvrzAOJXWZufSVhR9WNKwvmh7lr568Xxg__3Ux6JVau8Qo7PH7KCcPQTNbrf9aV2v3rSSczkNMgKKUO5GN8w9UYFs1vN6DX8olIE8voVbDhWEuidMRhl8EZWDJG2rRiY3EvLlAl3QFbQZZdGTbxd6o7tyH_DEPDyIQ0Mhk5CK3qGDEx7w5ySSwoVC_uxI_BcC1cAtha2klL0Dz4OT06d_5DIRLCHLqrGjGuM75yXBc6rOaiLUus_g
refresh_issued_at = 2023-11-08T15:40:42
refresh_expires_at = 2023-12-08T15:40:42
This is what the glovo code has read when running:
Docs
- Clone repos
- Clone personal notes
- Send messages to
  - ~~Slack general~~
  - Maria
  - Dani
  - Team
  - Charly
Future
There is no future anymore.