lolamarket-notes/notes/Logbook.md

# 20220711
Today was my first day at XL. I started the day by briefly chatting with Dani. He is as energetic as always and was already trying to get my hands dirty with stuff. It will be fun to work with him, but I need to make sure he doesn't kidnap me completely.
I also had a brief call with Maria Gorostiza. She was trying to reach me by phone, but I called her through Slack. It felt a bit weird that she was trying to get hold of me through my personal phone. I must confess that the vibes I get from the Portugal-based team are very different from the ones I get from the Spain-based team, and this is an example of that. Anyway, she just told me that I will need to put my working hours in Factorial and also that I should request my holidays through there. She also asked if I could put a real picture of myself in Slack. No trouble from my side, but getting asked to do it felt quite childish, to be honest.
The rest of the day I mostly spent with João discussing the organization and the business. There are a lot of details and nuances flying all over the place, but I'm old enough now to know that it's ok if I don't retain many of them. The important things will inevitably find their place in my brain.
I'm very happy with how things look. The service is interesting. It has a lot of data and decisions that need to be made, so Data work will definitely be fun. The new business opening in Poland sounds very exciting and feels like a greenfield opportunity. And the team is also looking great. The culture is very different from ACN. More laid back. Less pompous. No fancy words. Strong common sense.
It's very early to tell, but I think I've made a great choice coming over here.
We also discussed with the team that we could maybe meet in Porto in September. It would be great to fly there and spend a few days with the team and chill around Porto in the afternoons.
# 20220712
Ana Pinto,
10 years of experience as a Data Analyst
Started in insurance for a year, did not like the business.
Joined Continente for 5 years as part of the loyalty program
Moved to Talkdesk for 2 years
Infra with Dani
- He is playing around with Prefect and Trino
- Redshift will soon be dropped
# 20220713
Ana Januario
Math
Continente
Topdesk
Adhoc analysis and Looker
Will be working on moving Spanish data from Tableau to Looker
How to create user in AWS:
- Go to IAM
- Select AWS credential type Password
- Assign to data-team user group
- Skip tags sections
- Hit send email option to contact
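The same steps can also be scripted. A minimal boto3 sketch, assuming AWS credentials are already configured locally; the `data-team` group name comes from the list above, everything else (username, password handling) is illustrative, and the "send email" step stays console-only:

```python
# Hedged sketch of the console steps above using boto3 (assumes configured AWS credentials).
import secrets
import boto3

iam = boto3.client("iam")

def create_data_team_user(username: str) -> str:
    """Create a console user, add it to the data-team group, return the temporary password."""
    temp_password = secrets.token_urlsafe(16)
    iam.create_user(UserName=username)
    # "AWS credential type: Password" -> console login profile, force reset on first login
    iam.create_login_profile(UserName=username, Password=temp_password, PasswordResetRequired=True)
    # Assign to the data-team user group
    iam.add_user_to_group(GroupName="data-team", UserName=username)
    return temp_password

if __name__ == "__main__":
    print(create_data_team_user("some.new.user"))
```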
| User name | Password | Access key ID | Secret access key | Console login link |
| --- | --- | --- | --- | --- |
| pablo.martin | EP}A}_WlF2mTV-t | | | https://fonte.signin.aws.amazon.com/console |

| Access key ID | Secret access key |
| --- | --- |
| AKIAVNZZGXD4KNPQPMH5 | 0j02JWq9mnQF3d2G9A600dlgt70PDAiiguJizVfD |
You have been signed up for a Meraki account. You are now authorized to use 514195843 @ FONTE - NEGÓCIOS ONLINE @ 156654047.
Here is your login information:
Email address: [pablo.martin@lolamarket.com](mailto:pablo.martin@lolamarket.com)
Password: 4bCphthn
You can manage your account at [https://account.network-auth.com/](https://account.network-auth.com/).
## Data modeling with Dani
- Orders model where we integrate both Mercadao and Lola https://pdofonte.atlassian.net/wiki/spaces/DATA/pages/2301067280/Orders
- snake_case, as in Python
## Shipment from María
- Mouse
- Keyboard
- Backpack
- Card
- Mobile app and watch the video
Medical check-up
PRL (occupational risk prevention) guide
PRL course
## Query challenge
- [x] Total order groups served in LolaMarket in 2021
- Dani showed me that order groups are not really an entity in Lola. Instead, there is a field called "tag" that appears in carts. Multiple carts with the same "tag" are a complete order group.
- I learned with Dani that the cart table has all the failed and cancelled carts. Those carts have no tag.
- ANSWER: 148975 (see the query sketch at the end of this section)
- [ ] Total orders (visits to supermarket) made in Mercadao in June 2022
- [ ]
- [ ] AOV for orders made in Lolamarket, in May 2022, by shop brand
- [ ] Largest basket ever sold in Mercadao
- [ ] Smallest basket ever sold in LolaMarket
- [ ] Look for my first LolaMarket purchase and understand it in full detail
- [ ] Find the customer that spent the most in Mercadao in 2021
- [ ] See if there is any user (identified by email) that appears in both LolaMarket and Mercadao
Dani's user id -> 3799702
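A minimal sketch of the order-group count from the first challenge item: carts sharing a `tag` form one order group, and failed/cancelled carts carry no tag. The Trino host, catalog/schema and the choice of `date_delivered` as the 2021 filter are assumptions.

```python
# Hedged sketch: count 2021 LolaMarket order groups as distinct cart tags.
# Host, catalog/schema and the date_delivered filter are assumptions.
import trino

conn = trino.dbapi.connect(
    host="trino.mercadao.pt", port=443, user="pablo.martin",
    http_scheme="https", catalog="app_lm_mysql", schema="comprea",
)
cur = conn.cursor()
cur.execute("""
    SELECT count(DISTINCT tag)
    FROM cart
    WHERE tag IS NOT NULL
      AND from_unixtime(date_delivered) >= TIMESTAMP '2021-01-01'
      AND from_unixtime(date_delivered) <  TIMESTAMP '2022-01-01'
""")
print(cur.fetchone()[0])  # should land close to the 148975 noted above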
# 20220714
## Tour through Mercadao tables with João
- chargedamount -> what the customer pays (originaltotalprice * 1.03)
- originaltotalprice -> original value of the goods (with discounts applied)
- totalprice -> final cost of the picked goods
- totaldiscount -> discount of goods (from the retailer)
- totalwshipping -> totalprice+delivery fee
- subtotal -> should be originaltotalprice + totaldiscount (but is flawed. Ignore)
- status might be null (user reaches checkout payment screen but doesn't pay)
- For status flow, review confluence ->
- PickingLocationID -> suggested picking location
- We have both the original and final delivery slot
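A small sketch to sanity-check the field relations above on recent order groups, assuming the `app_md_mysql.pdo.ordergroup` path that shows up in the guest-user SQL further down; column spellings beyond the ones listed above are guesses.

```python
# Hedged sketch: spot-check chargedamount and subtotal against the relations noted above.
import trino

conn = trino.dbapi.connect(host="trino.mercadao.pt", port=443, user="pablo.martin",
                           http_scheme="https")
cur = conn.cursor()
cur.execute("""
    SELECT
        id,
        chargedamount,
        round(originaltotalprice * 1.03, 2) AS expected_chargedamount,
        subtotal,
        originaltotalprice + totaldiscount  AS expected_subtotal  -- known to be flawed, per the notes
    FROM app_md_mysql.pdo.ordergroup
    ORDER BY id DESC
    LIMIT 20
""")
for row in cur.fetchall():
    print(row)
```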
# 20220719
With María:
- Confirmation messages will arrive one hour before each of the two slots. I have to accept them or they will be taken away from me.
## Meeting marathon
### Projects
- Project manager across the company
- Working mostly on Mercadao
- Projects
- With Glovo
- Mercadao does the picking
- Glovo riders do the delivery
- BOOM
-
- B2B Horeca project
- With Recheio
- Larger baskets
- [https://express.recheio.pt/](https://express.recheio.pt/)
- Picking is very different. Vans are needed instead of cars.
-
-
### Tech
- Missing
### Finance
- "Congratulations for joining João's team."
- Luis Cabral -> Head of Finance for XL
- Helps the head of each local finance do their work properly
- Bridges financial comms to Glovo-DH
- Each country has their own finance team
What systems do you have for:
- Accounting:
- Sage in Spain
- Sage in Portugal
- Forced to move to Glovo's SAP Hana (starting on 2023-01-01)
-
- Reporting
- P&L
- They use Looker, Metabase and a few Excel sheets
### Ops
- Manages the Portuguese Quality Assurance, operations, availability and onboarding teams
- Starting to take over the Spanish operations as well
- She is clearly very enthusiastic
### CX
- Customer support and LiveOps
- Jaime
- Started a year ago
- Runs customer support
- Is absorbing LiveOps (the team that helps the shoppers)
- New Freshdesk CRM for Portugal and Spain
- 3 supervisors and 8 agents
- They mediate between customers and shoppers
- Data
- App ratings, surveys?
-
- KPI
- Reduction of the shopper contact rate
### Commercial
- Negotiates the banner campaigns in the Marketplace
- Only food. Currently exploring other services
- Selling reporting to brands (cross-selling topics, AOV per brand, frequency and repetition). Currently only banners and clicks (CTR, click-through-rate)
- Campaign impact on sales deltas for different products
- Working with Looker. Needs a bit of help with exporting data
- Runs Iberia (Carlota and Gloria work for him in Spain, Ines)
- For Poland: let's run some large operations in there before selling our services to brands there.
### Marketing
# 20220720
## Shopper notes
Today I went out as a shopper for the first time. I took a few notes during the day and also have a few ideas in mind:
- In the morning, the order proposal appeared on my screen. I hit accept as the order seemed suitable, but then the app said that another shopper had taken it in the meantime (makes sense since I was driving and I didn't have my phone handy). But then, when I went to my orders page, I could see that order in my schedule, so it seems that it was assigned to me regardless of that warning message. I noticed this, so no problems, but this could lead to a shopper getting an order assigned without realizing it.
- When I was in Mercadona doing my first order, one item from the shopping list appeared in the Frozen section within the app. That item was not actually in the frozen section in the supermarket but rather in a normal open fridge, which was quite frustrating since I was trying to find it in the freezers. Funnily enough, there was another product on the shopping list that actually was in the frozen section in the supermarket, but did not appear in the frozen section in the app's list. I felt a bit like the shopper app was playing with me.
- During my first order, there was this lemonade product that had a content of 2L. In the app, if you opened up the picture, you could also see it had 2L of volume. But then, in the app metadata, it showed the product had 1.5L. I was a bit confused here: should I get 5 bottles because the customer asked for 5 units? Or should I change the number of bottles given that the customer shopped while thinking that each bottle was 1.5L? Or wait, what did the customer follow: the metadata that said 1.5L or the picture of the product that said 2L? In the end, I got 4 bottles instead of 5.
- First delivery: shopping took way less time than the app suggested. This, on top of me being at the supermarket early to compensate for my expected clumsiness, meant I was ready to pay at the POS around 20 minutes earlier than I should have been. I didn't want to be way too early at the customer's location, but I also didn't want to wait near there since the weather was super hot and frozen and cold products would get awful. I decided to wait... inside the supermarket (without picking the cold products yet). Pretty sure some Mercadona folks were scratching their heads :). Once the time passed, I went ahead, paid and drove to the customer location.
- My first two orders had a pretty nasty overlap. Order #1 had delivery scheduled at something like 11:20, while Order #2 instructed me to start picking at 11:00 in a supermarket which was about 20min away from Order #1's customer location. I managed because I was able to finish Order #1 much earlier than planned. Otherwise, customer #2 would have had a hefty delay, for sure outside of their chosen timeslot.
- This is a rather lengthy one, and the biggest issue I faced with the app. It could have easily turned into a super late delivery for the customer. It happened in order #3. Step by step:
- I looked at the details of order #3 and saw that the picking location was a Lidl store. I clicked on the supermarket icon to open Google Maps and see the location.
- The shopper app transitioned into Google Maps and provided me with the location. The problem? There is no Lidl supermarket there. I knew because I'm very familiar with the area where the app was indicating that the Lidl was. But again, I knew there was no Lidl there. I went back to the shopper app and checked the address. The street name was right. I was scratching my head for a couple of minutes before I realised what was going on. The address was Carrer Apel·les Mestres, 109. The thing is, there is a Carrer Apel·les Mestres in Barcelona, but there is also another one with the exact same name in El Prat de Llobregat (a nearby town to BCN, kind of like Madrid-Pozuelo or Porto-Matosinhos). I realised because the postcode was not a Barcelona city postcode, and then I found the Lidl that does exist in Carrer Apel·les Mestres in El Prat.
- I checked the customer location and realised there was a large Lidl store about 500m away from there, so I decided to go there. Checked beforehand around 10 times that the store was actually there, because by this time I was both a bit confused and a bit skeptical.
- In the end everything went just fine. But, had I not been familiar with the streets, I would not have spotted the issue straight away and I would probably have suffered a +30min delay.
- I was also a bit puzzled by the fact that the shopper app didn't suggest the Lidl location right next to the customer location.
- The screenshots at the end show the exact address and what I was being shown in Google Maps (aka the false Lidl location)
- Trying to send pictures to the client through the chat was a pain in the a** and I'm not even sure that it worked in the end. Issues:
- I would hit the camera icon, take the picture, and then return to the chat. No trace of it.
- After a few tries, the picture "appeared". There was an empty white box in the chat, which I guess should be the picture. But again, on my screen it only appeared as an empty box.
- A suggestion: it would be nice if multiple pictures could be taken at once (instead of: picture -> chat -> picture -> chat, etc). I was trying to give the customer 3 alternatives to a stocked-out product, so taking the three pictures in one go would have been more convenient.
- My android device is configured in English. The whole shopper app appears in English, which I didn't mind. What I did mind was that the auto-suggested chat messages to the customer appeared in English. This doesn't make much sense to me, since I would assume that our default stance should be to either address the user in Spanish or, if the user can somehow indicate their preferred language, in whatever the user has indicated.
A few additional details, a mix of fun and useful ones:
- The customer for order #1 was this very nice woman who was using Lola for the first time. She told me she was undergoing chemo and felt pretty sick all the time, so it was super convenient for her that we brought the groceries to her place. Hearing her story was touching.
- I would advise against having the empty Lola backpack on your back if you ride a motorbike on the highway faster than 100 km/h. The box acts kind of funny and pulls your arms in weird ways.
- Paper bags at Lidl are quite crappy. I was very much afraid that some of them would break during the delivery.
- Silly thing you don't even think about before starting as a shopper: the euro coin for the supermarket cart! I was lucky enough to have one, but in order #2, with all the rush, I forgot to recover it. The guys at the supermarket for order #3 were nice enough to unlock a cart for me because I had no coins by then.
- The barcode reader fails quite a bit (by failing I mean it reads the barcode and tells you it's the wrong product, even when you can very clearly see from the product description and image that you are picking the right one).
- Entering a supermarket you have never been into with a very long list of products creates this peculiar initial paralysis. It left me wondering which option would be better:
- Current way: we show the entire list to the shopper, the shopper goes around picking in whatever order he wants.
- My idea: we show the shopper items by small batches which are all related to the same product category. For instance, if the customer has chosen a few yogurt and cheese products, we show only those, and hence the shopper can focus on getting that done. Then the next batch could be beers and wines, the next one cleaning products, etc. This could also be used to leave cold and frozen products for the end, ensuring the shopper picks those after all the other ones.
![[1658335654251.jpg]]
![[1658335654261.jpg]]
![[1658335654273.jpg]]
![[1658335747385.jpg]]
![[1658335848404 1.jpg]]
# 20220721
## Tech
I have the onboarding meeting with Berlana that we had to postpone.
Questions:
- How are you organized?
- How can we keep up with releases and changes?
- Who is who?
- How is the whole instaleap thing working out?
## Query thingy
- Get access directly to Lola MySQL to compare performance.
- Keep on researching
- Document how to connect to the lolamarket MySQL -> Noted
- Document how to get the Glovo free benefits ->
- Send the equipment paperwork to Maria
- Get into the Notion thing and document it
My shopper ID is 8025
# 20220722
- Check that fetchall is actually fetching all.
- Understand
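One way to check the fetchall point, a minimal sketch assuming the Trino connection used elsewhere in these notes: compare the number of rows the client hands back with a server-side count of the same filter.

```python
# Hedged sketch: verify fetchall() is not silently truncating results.
import trino

conn = trino.dbapi.connect(host="trino.mercadao.pt", port=443, user="pablo.martin", http_scheme="https")
cur = conn.cursor()

cur.execute("SELECT id FROM app_lm_mysql.comprea.cart WHERE status = 'delivered'")
fetched = len(cur.fetchall())

cur.execute("SELECT count(*) FROM app_lm_mysql.comprea.cart WHERE status = 'delivered'")
counted = cur.fetchone()[0]

print(fetched, counted, fetched == counted)
```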
# 20220725
I'm going to compare in detail the execution of the same query in both engines and see what I can get out of it. The query I'm using is `Orders finished yesterday with less than 10 items`.
The execution ID I'm looking into in Trino is this: [20220725_080932_00086_yu7k5](https://trino.mercadao.pt/ui/query.html?20220725_080932_00086_yu7k5)
For Trino, this is the output of the EXPLAIN:
```
Fragment 0 [SINGLE]
Output layout: [sum, id_105, cash_order_id_126, shopper_timeslot_id_80, shop_id_79, address_id_78, total_cost_92, total_price_117, total_price_discount_140, delivery_price_141, delivery_price_discount_112, delivery_type_95, status_108, modified_76, last_update_110, date_shopping_119, date_delivering_125, date_delivered_133, note_127, date_started_134, weight_99, last_overweight_notification_121, shopper_total_price_123, shopper_total_cost_135, margin_98, shopper_weight_106, comprea_note_83, next_shopper_timeslot_id_120, shopper_algorithm_124, loyalty_tip_131, date_loyalty_tip_115, ebitda_104, numcalls_84, manual_charge_111, commission_100, last_no_times_available_109, comprea_note_driver_128, num_user_changes_130, expected_shopping_time_139, expected_delivering_time_97, date_deliveries_requested_81, date_call_93, percentage_opened_hours_82, percentage_opened_hours_in_day_113, fraud_rating_114, tag_94, tag_color_90, expected_eta_time_102, promo_hours_137, numcallsclient_77, frozen_products_103, driver_timeslot_id_107, expected_delivering_distance_118, lola_id_136, comprea_note_warehouse_86, uuid_88, date_almost_delivered_132, date_ticket_pending_96, date_waiting_driver_138, num_products_116, num_products_taken_91, total_saving_89, percentage_opened_hours_in_87, cart_progress_branch_url_122, real_shop_id_129, gps_locked_85, tag_signature_101]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
Output[item_count, id, cash_order_id, shopper_timeslot_id, shop_id, address_id, total_cost, total_price, total_price_discount, delivery_price, delivery_price_discount, delivery_type, status, modified, last_update, date_shopping, date_delivering, date_delivered, note, date_started, weight, last_overweight_notification, shopper_total_price, shopper_total_cost, margin, shopper_weight, comprea_note, next_shopper_timeslot_id, shopper_algorithm, loyalty_tip, date_loyalty_tip, ebitda, numcalls, manual_charge, commission, last_no_times_available, comprea_note_driver, num_user_changes, expected_shopping_time, expected_delivering_time, date_deliveries_requested, date_call, percentage_opened_hours, percentage_opened_hours_in_day, fraud_rating, tag, tag_color, expected_eta_time, promo_hours, numcallsclient, frozen_products, driver_timeslot_id, expected_delivering_distance, lola_id, comprea_note_warehouse, uuid, date_almost_delivered, date_ticket_pending, date_waiting_driver, num_products, num_products_taken, total_saving, percentage_opened_hours_in_10, cart_progress_branch_url, real_shop_id, gps_locked, tag_signature]
│ Layout: [sum:bigint, id_105:integer, cash_order_id_126:integer, shopper_timeslot_id_80:integer, shop_id_79:integer, address_id_78:integer, total_cost_92:decimal(8,2), total_price_117:decimal(8,2), total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2), delivery_price_discount_112:decimal(8,2), delivery_type_95:char(9), status_108:char(14), modified_76:tinyint, last_update_110:integer, date_shopping_119:integer, date_delivering_125:integer, date_delivered_133:integer, note_127:varchar, date_started_134:integer, weight_99:decimal(6,3), last_overweight_notification_121:integer, shopper_total_price_123:decimal(8,2), shopper_total_cost_135:decimal(8,2), margin_98:decimal(8,2), shopper_weight_106:decimal(6,3), comprea_note_83:varchar, next_shopper_timeslot_id_120:integer, shopper_algorithm_124:varchar, loyalty_tip_131:decimal(8,2), date_loyalty_tip_115:integer, ebitda_104:decimal(8,2), numcalls_84:integer, manual_charge_111:tinyint, commission_100:integer, last_no_times_available_109:integer, comprea_note_driver_128:varchar, num_user_changes_130:integer, expected_shopping_time_139:integer, expected_delivering_time_97:integer, date_deliveries_requested_81:integer, date_call_93:integer, percentage_opened_hours_82:decimal(5,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, tag_94:varchar(20), tag_color_90:varchar(15), expected_eta_time_102:integer, promo_hours_137:integer, numcallsclient_77:integer, frozen_products_103:tinyint, driver_timeslot_id_107:integer, expected_delivering_distance_118:integer, lola_id_136:varchar(50), comprea_note_warehouse_86:varchar, uuid_88:varchar(10), date_almost_delivered_132:integer, date_ticket_pending_96:integer, date_waiting_driver_138:integer, num_products_116:integer, num_products_taken_91:integer, total_saving_89:decimal(8,2), percentage_opened_hours_in_87:decimal(5,2), cart_progress_branch_url_122:varchar(255), real_shop_id_129:integer, gps_locked_85:tinyint, tag_signature_101:varchar(4)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ item_count := sum
│ id := id_105
│ cash_order_id := cash_order_id_126
│ shopper_timeslot_id := shopper_timeslot_id_80
│ shop_id := shop_id_79
│ address_id := address_id_78
│ total_cost := total_cost_92
│ total_price := total_price_117
│ total_price_discount := total_price_discount_140
│ delivery_price := delivery_price_141
│ delivery_price_discount := delivery_price_discount_112
│ delivery_type := delivery_type_95
│ status := status_108
│ modified := modified_76
│ last_update := last_update_110
│ date_shopping := date_shopping_119
│ date_delivering := date_delivering_125
│ date_delivered := date_delivered_133
│ note := note_127
│ date_started := date_started_134
│ weight := weight_99
│ last_overweight_notification := last_overweight_notification_121
│ shopper_total_price := shopper_total_price_123
│ shopper_total_cost := shopper_total_cost_135
│ margin := margin_98
│ shopper_weight := shopper_weight_106
│ comprea_note := comprea_note_83
│ next_shopper_timeslot_id := next_shopper_timeslot_id_120
│ shopper_algorithm := shopper_algorithm_124
│ loyalty_tip := loyalty_tip_131
│ date_loyalty_tip := date_loyalty_tip_115
│ ebitda := ebitda_104
│ numcalls := numcalls_84
│ manual_charge := manual_charge_111
│ commission := commission_100
│ last_no_times_available := last_no_times_available_109
│ comprea_note_driver := comprea_note_driver_128
│ num_user_changes := num_user_changes_130
│ expected_shopping_time := expected_shopping_time_139
│ expected_delivering_time := expected_delivering_time_97
│ date_deliveries_requested := date_deliveries_requested_81
│ date_call := date_call_93
│ percentage_opened_hours := percentage_opened_hours_82
│ percentage_opened_hours_in_day := percentage_opened_hours_in_day_113
│ fraud_rating := fraud_rating_114
│ tag := tag_94
│ tag_color := tag_color_90
│ expected_eta_time := expected_eta_time_102
│ promo_hours := promo_hours_137
│ numcallsclient := numcallsclient_77
│ frozen_products := frozen_products_103
│ driver_timeslot_id := driver_timeslot_id_107
│ expected_delivering_distance := expected_delivering_distance_118
│ lola_id := lola_id_136
│ comprea_note_warehouse := comprea_note_warehouse_86
│ uuid := uuid_88
│ date_almost_delivered := date_almost_delivered_132
│ date_ticket_pending := date_ticket_pending_96
│ date_waiting_driver := date_waiting_driver_138
│ num_products := num_products_116
│ num_products_taken := num_products_taken_91
│ total_saving := total_saving_89
│ percentage_opened_hours_in_10 := percentage_opened_hours_in_87
│ cart_progress_branch_url := cart_progress_branch_url_122
│ real_shop_id := real_shop_id_129
│ gps_locked := gps_locked_85
│ tag_signature := tag_signature_101
└─ RemoteSource[1]
Layout: [sum:bigint, id_105:integer, cash_order_id_126:integer, shopper_timeslot_id_80:integer, shop_id_79:integer, address_id_78:integer, total_cost_92:decimal(8,2), total_price_117:decimal(8,2), total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2), delivery_price_discount_112:decimal(8,2), delivery_type_95:char(9), status_108:char(14), modified_76:tinyint, last_update_110:integer, date_shopping_119:integer, date_delivering_125:integer, date_delivered_133:integer, note_127:varchar, date_started_134:integer, weight_99:decimal(6,3), last_overweight_notification_121:integer, shopper_total_price_123:decimal(8,2), shopper_total_cost_135:decimal(8,2), margin_98:decimal(8,2), shopper_weight_106:decimal(6,3), comprea_note_83:varchar, next_shopper_timeslot_id_120:integer, shopper_algorithm_124:varchar, loyalty_tip_131:decimal(8,2), date_loyalty_tip_115:integer, ebitda_104:decimal(8,2), numcalls_84:integer, manual_charge_111:tinyint, commission_100:integer, last_no_times_available_109:integer, comprea_note_driver_128:varchar, num_user_changes_130:integer, expected_shopping_time_139:integer, expected_delivering_time_97:integer, date_deliveries_requested_81:integer, date_call_93:integer, percentage_opened_hours_82:decimal(5,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, tag_94:varchar(20), tag_color_90:varchar(15), expected_eta_time_102:integer, promo_hours_137:integer, numcallsclient_77:integer, frozen_products_103:tinyint, driver_timeslot_id_107:integer, expected_delivering_distance_118:integer, lola_id_136:varchar(50), comprea_note_warehouse_86:varchar, uuid_88:varchar(10), date_almost_delivered_132:integer, date_ticket_pending_96:integer, date_waiting_driver_138:integer, num_products_116:integer, num_products_taken_91:integer, total_saving_89:decimal(8,2), percentage_opened_hours_in_87:decimal(5,2), cart_progress_branch_url_122:varchar(255), real_shop_id_129:integer, gps_locked_85:tinyint, tag_signature_101:varchar(4)]
Fragment 1 [HASH]
Output layout: [sum, id_105, cash_order_id_126, shopper_timeslot_id_80, shop_id_79, address_id_78, total_cost_92, total_price_117, total_price_discount_140, delivery_price_141, delivery_price_discount_112, delivery_type_95, status_108, modified_76, last_update_110, date_shopping_119, date_delivering_125, date_delivered_133, note_127, date_started_134, weight_99, last_overweight_notification_121, shopper_total_price_123, shopper_total_cost_135, margin_98, shopper_weight_106, comprea_note_83, next_shopper_timeslot_id_120, shopper_algorithm_124, loyalty_tip_131, date_loyalty_tip_115, ebitda_104, numcalls_84, manual_charge_111, commission_100, last_no_times_available_109, comprea_note_driver_128, num_user_changes_130, expected_shopping_time_139, expected_delivering_time_97, date_deliveries_requested_81, date_call_93, percentage_opened_hours_82, percentage_opened_hours_in_day_113, fraud_rating_114, tag_94, tag_color_90, expected_eta_time_102, promo_hours_137, numcallsclient_77, frozen_products_103, driver_timeslot_id_107, expected_delivering_distance_118, lola_id_136, comprea_note_warehouse_86, uuid_88, date_almost_delivered_132, date_ticket_pending_96, date_waiting_driver_138, num_products_116, num_products_taken_91, total_saving_89, percentage_opened_hours_in_87, cart_progress_branch_url_122, real_shop_id_129, gps_locked_85, tag_signature_101]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
LeftJoin[("cart_id" = "id_105")][$hashvalue, $hashvalue_151]
│ Layout: [sum:bigint, modified_76:tinyint, numcallsclient_77:integer, address_id_78:integer, shop_id_79:integer, shopper_timeslot_id_80:integer, date_deliveries_requested_81:integer, percentage_opened_hours_82:decimal(5,2), comprea_note_83:varchar, numcalls_84:integer, gps_locked_85:tinyint, comprea_note_warehouse_86:varchar, percentage_opened_hours_in_87:decimal(5,2), uuid_88:varchar(10), total_saving_89:decimal(8,2), tag_color_90:varchar(15), num_products_taken_91:integer, total_cost_92:decimal(8,2), date_call_93:integer, tag_94:varchar(20), delivery_type_95:char(9), date_ticket_pending_96:integer, expected_delivering_time_97:integer, margin_98:decimal(8,2), weight_99:decimal(6,3), commission_100:integer, tag_signature_101:varchar(4), expected_eta_time_102:integer, frozen_products_103:tinyint, ebitda_104:decimal(8,2), id_105:integer, shopper_weight_106:decimal(6,3), driver_timeslot_id_107:integer, status_108:char(14), last_no_times_available_109:integer, last_update_110:integer, manual_charge_111:tinyint, delivery_price_discount_112:decimal(8,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, date_loyalty_tip_115:integer, num_products_116:integer, total_price_117:decimal(8,2), expected_delivering_distance_118:integer, date_shopping_119:integer, next_shopper_timeslot_id_120:integer, last_overweight_notification_121:integer, cart_progress_branch_url_122:varchar(255), shopper_total_price_123:decimal(8,2), shopper_algorithm_124:varchar, date_delivering_125:integer, cash_order_id_126:integer, note_127:varchar, comprea_note_driver_128:varchar, real_shop_id_129:integer, num_user_changes_130:integer, loyalty_tip_131:decimal(8,2), date_almost_delivered_132:integer, date_delivered_133:integer, date_started_134:integer, shopper_total_cost_135:decimal(8,2), lola_id_136:varchar(50), promo_hours_137:integer, date_waiting_driver_138:integer, expected_shopping_time_139:integer, total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ Distribution: PARTITIONED
├─ RemoteSource[2]
│ Layout: [cart_id:integer, sum:bigint, $hashvalue:bigint]
└─ LocalExchange[HASH][$hashvalue_151] ("id_105")
│ Layout: [modified_76:tinyint, numcallsclient_77:integer, address_id_78:integer, shop_id_79:integer, shopper_timeslot_id_80:integer, date_deliveries_requested_81:integer, percentage_opened_hours_82:decimal(5,2), comprea_note_83:varchar, numcalls_84:integer, gps_locked_85:tinyint, comprea_note_warehouse_86:varchar, percentage_opened_hours_in_87:decimal(5,2), uuid_88:varchar(10), total_saving_89:decimal(8,2), tag_color_90:varchar(15), num_products_taken_91:integer, total_cost_92:decimal(8,2), date_call_93:integer, tag_94:varchar(20), delivery_type_95:char(9), date_ticket_pending_96:integer, expected_delivering_time_97:integer, margin_98:decimal(8,2), weight_99:decimal(6,3), commission_100:integer, tag_signature_101:varchar(4), expected_eta_time_102:integer, frozen_products_103:tinyint, ebitda_104:decimal(8,2), id_105:integer, shopper_weight_106:decimal(6,3), driver_timeslot_id_107:integer, status_108:char(14), last_no_times_available_109:integer, last_update_110:integer, manual_charge_111:tinyint, delivery_price_discount_112:decimal(8,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, date_loyalty_tip_115:integer, num_products_116:integer, total_price_117:decimal(8,2), expected_delivering_distance_118:integer, date_shopping_119:integer, next_shopper_timeslot_id_120:integer, last_overweight_notification_121:integer, cart_progress_branch_url_122:varchar(255), shopper_total_price_123:decimal(8,2), shopper_algorithm_124:varchar, date_delivering_125:integer, cash_order_id_126:integer, note_127:varchar, comprea_note_driver_128:varchar, real_shop_id_129:integer, num_user_changes_130:integer, loyalty_tip_131:decimal(8,2), date_almost_delivered_132:integer, date_delivered_133:integer, date_started_134:integer, shopper_total_cost_135:decimal(8,2), lola_id_136:varchar(50), promo_hours_137:integer, date_waiting_driver_138:integer, expected_shopping_time_139:integer, total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2), $hashvalue_151:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: ?}
└─ RemoteSource[6]
Layout: [modified_76:tinyint, numcallsclient_77:integer, address_id_78:integer, shop_id_79:integer, shopper_timeslot_id_80:integer, date_deliveries_requested_81:integer, percentage_opened_hours_82:decimal(5,2), comprea_note_83:varchar, numcalls_84:integer, gps_locked_85:tinyint, comprea_note_warehouse_86:varchar, percentage_opened_hours_in_87:decimal(5,2), uuid_88:varchar(10), total_saving_89:decimal(8,2), tag_color_90:varchar(15), num_products_taken_91:integer, total_cost_92:decimal(8,2), date_call_93:integer, tag_94:varchar(20), delivery_type_95:char(9), date_ticket_pending_96:integer, expected_delivering_time_97:integer, margin_98:decimal(8,2), weight_99:decimal(6,3), commission_100:integer, tag_signature_101:varchar(4), expected_eta_time_102:integer, frozen_products_103:tinyint, ebitda_104:decimal(8,2), id_105:integer, shopper_weight_106:decimal(6,3), driver_timeslot_id_107:integer, status_108:char(14), last_no_times_available_109:integer, last_update_110:integer, manual_charge_111:tinyint, delivery_price_discount_112:decimal(8,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, date_loyalty_tip_115:integer, num_products_116:integer, total_price_117:decimal(8,2), expected_delivering_distance_118:integer, date_shopping_119:integer, next_shopper_timeslot_id_120:integer, last_overweight_notification_121:integer, cart_progress_branch_url_122:varchar(255), shopper_total_price_123:decimal(8,2), shopper_algorithm_124:varchar, date_delivering_125:integer, cash_order_id_126:integer, note_127:varchar, comprea_note_driver_128:varchar, real_shop_id_129:integer, num_user_changes_130:integer, loyalty_tip_131:decimal(8,2), date_almost_delivered_132:integer, date_delivered_133:integer, date_started_134:integer, shopper_total_cost_135:decimal(8,2), lola_id_136:varchar(50), promo_hours_137:integer, date_waiting_driver_138:integer, expected_shopping_time_139:integer, total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2), $hashvalue_152:bigint]
Fragment 2 [SINGLE]
Output layout: [cart_id, sum, $hashvalue_143]
Output partitioning: HASH [cart_id][$hashvalue_143]
Stage Execution Strategy: UNGROUPED_EXECUTION
Limit[100]
│ Layout: [cart_id:integer, sum:bigint, $hashvalue_143:bigint]
│ Estimates: {rows: 100 (2.25kB), cpu: ?, memory: ?, network: ?}
└─ LocalExchange[SINGLE] ()
│ Layout: [cart_id:integer, sum:bigint, $hashvalue_143:bigint]
│ Estimates: {rows: 100 (2.25kB), cpu: ?, memory: ?, network: ?}
└─ RemoteSource[3]
Layout: [cart_id:integer, sum:bigint, $hashvalue_144:bigint]
Fragment 3 [HASH]
Output layout: [cart_id, sum, $hashvalue_145]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
LimitPartial[100]
│ Layout: [cart_id:integer, $hashvalue_145:bigint, sum:bigint]
│ Estimates: {rows: 100 (2.25kB), cpu: ?, memory: ?, network: ?}
└─ Filter[filterPredicate = ("sum" < BIGINT '10')]
│ Layout: [cart_id:integer, $hashvalue_145:bigint, sum:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
└─ Aggregate(FINAL)[cart_id][$hashvalue_145]
│ Layout: [cart_id:integer, $hashvalue_145:bigint, sum:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ sum := sum("sum_142")
└─ LocalExchange[HASH][$hashvalue_145] ("cart_id")
│ Layout: [cart_id:integer, sum_142:row(bigint, boolean, bigint, boolean), $hashvalue_145:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
└─ Aggregate(PARTIAL)[cart_id][$hashvalue_146]
│ Layout: [cart_id:integer, $hashvalue_146:bigint, sum_142:row(bigint, boolean, bigint, boolean)]
│ sum_142 := sum("expr")
└─ Project[]
│ Layout: [cart_id:integer, expr:bigint, $hashvalue_146:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ expr := CAST("quantity" AS bigint)
└─ InnerJoin[("cart_id" = "id_0")][$hashvalue_146, $hashvalue_148]
│ Layout: [cart_id:integer, quantity:integer, $hashvalue_146:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ Distribution: PARTITIONED
│ dynamicFilterAssignments = {id_0 -> #df_778}
├─ RemoteSource[4]
│ Layout: [cart_id:integer, quantity:integer, $hashvalue_146:bigint]
└─ LocalExchange[HASH][$hashvalue_148] ("id_0")
│ Layout: [id_0:integer, $hashvalue_148:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: ?}
└─ RemoteSource[5]
Layout: [id_0:integer, $hashvalue_149:bigint]
Fragment 4 [SOURCE]
Output layout: [cart_id, quantity, $hashvalue_147]
Output partitioning: HASH [cart_id][$hashvalue_147]
Stage Execution Strategy: UNGROUPED_EXECUTION
ScanFilterProject[table = app_lm_mysql:comprea.cart_product comprea.cart_product columns=[cart_id:integer:INT, quantity:integer:INT], grouped = false, filterPredicate = true, dynamicFilters = {"cart_id" = #df_778}]
Layout: [cart_id:integer, quantity:integer, $hashvalue_147:bigint]
Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}
$hashvalue_147 := combine_hash(bigint '0', COALESCE("$operator$hash_code"("cart_id"), 0))
cart_id := cart_id:integer:INT
quantity := quantity:integer:INT
Fragment 5 [SOURCE]
Output layout: [id_0, $hashvalue_150]
Output partitioning: HASH [id_0][$hashvalue_150]
Stage Execution Strategy: UNGROUPED_EXECUTION
ScanFilterProject[table = app_lm_mysql:comprea.cart comprea.cart constraint on [status] columns=[id:integer:INT, status:char(14):ENUM, date_delivered:integer:INT], grouped = false, filterPredicate = (("status" = CAST('delivered' AS char(14))) AND (CAST(with_timezone(date_add('second', CAST("date_delivered" AS bigint), TIMESTAMP '1970-01-01 00:00:00'), 'Europe/Madrid') AS date) = DATE '2022-07-24'))]
Layout: [id_0:integer, $hashvalue_150:bigint]
Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}
$hashvalue_150 := combine_hash(bigint '0', COALESCE("$operator$hash_code"("id_0"), 0))
date_delivered := date_delivered:integer:INT
id_0 := id:integer:INT
status := status:char(14):ENUM
Fragment 6 [SOURCE]
Output layout: [modified_76, numcallsclient_77, address_id_78, shop_id_79, shopper_timeslot_id_80, date_deliveries_requested_81, percentage_opened_hours_82, comprea_note_83, numcalls_84, gps_locked_85, comprea_note_warehouse_86, percentage_opened_hours_in_87, uuid_88, total_saving_89, tag_color_90, num_products_taken_91, total_cost_92, date_call_93, tag_94, delivery_type_95, date_ticket_pending_96, expected_delivering_time_97, margin_98, weight_99, commission_100, tag_signature_101, expected_eta_time_102, frozen_products_103, ebitda_104, id_105, shopper_weight_106, driver_timeslot_id_107, status_108, last_no_times_available_109, last_update_110, manual_charge_111, delivery_price_discount_112, percentage_opened_hours_in_day_113, fraud_rating_114, date_loyalty_tip_115, num_products_116, total_price_117, expected_delivering_distance_118, date_shopping_119, next_shopper_timeslot_id_120, last_overweight_notification_121, cart_progress_branch_url_122, shopper_total_price_123, shopper_algorithm_124, date_delivering_125, cash_order_id_126, note_127, comprea_note_driver_128, real_shop_id_129, num_user_changes_130, loyalty_tip_131, date_almost_delivered_132, date_delivered_133, date_started_134, shopper_total_cost_135, lola_id_136, promo_hours_137, date_waiting_driver_138, expected_shopping_time_139, total_price_discount_140, delivery_price_141, $hashvalue_153]
Output partitioning: HASH [id_105][$hashvalue_153]
Stage Execution Strategy: UNGROUPED_EXECUTION
ScanProject[table = app_lm_mysql:comprea.cart comprea.cart columns=[modified:tinyint:TINYINT, numCallsClient:integer:INT, address_id:integer:INT, shop_id:integer:INT, shopper_timeslot_id:integer:INT, date_deliveries_requested:integer:INT, percentage_opened_hours:decimal(5,2):DECIMAL, comprea_note:varchar:LONGTEXT, numCalls:integer:INT, gps_locked:tinyint:TINYINT, comprea_note_warehouse:varchar:LONGTEXT, percentage_opened_hours_in_10:decimal(5,2):DECIMAL, uuid:varchar(10):VARCHAR, total_saving:decimal(8,2):DECIMAL, tag_color:varchar(15):VARCHAR, num_products_taken:integer:INT, total_cost:decimal(8,2):DECIMAL, date_call:integer:INT, tag:varchar(20):VARCHAR, delivery_type:char(9):ENUM, date_ticket_pending:integer:INT, expected_delivering_time:integer:INT, margin:decimal(8,2):DECIMAL, weight:decimal(6,3):DECIMAL, commission:integer:INT, tag_signature:varchar(4):VARCHAR, expected_eta_time:integer:INT, frozen_products:tinyint:TINYINT, ebitda:decimal(8,2):DECIMAL, id:integer:INT, shopper_weight:decimal(6,3):DECIMAL, driver_timeslot_id:integer:INT, status:char(14):ENUM, last_no_times_available:integer:INT, last_update:integer:INT, manual_charge:tinyint:TINYINT, delivery_price_discount:decimal(8,2):DECIMAL, percentage_opened_hours_in_day:decimal(5,2):DECIMAL, fraud_rating:integer:INT, date_loyalty_tip:integer:INT, num_products:integer:INT, total_price:decimal(8,2):DECIMAL, expected_delivering_distance:integer:INT, date_shopping:integer:INT, next_shopper_timeslot_id:integer:INT, last_overweight_notification:integer:INT, cart_progress_branch_url:varchar(255):VARCHAR, shopper_total_price:decimal(8,2):DECIMAL, shopper_algorithm:varchar:LONGTEXT, date_delivering:integer:INT, cash_order_id:integer:INT, note:varchar:LONGTEXT, comprea_note_driver:varchar:LONGTEXT, real_shop_id:integer:INT, num_user_changes:integer:INT, loyalty_tip:decimal(8,2):DECIMAL, date_almost_delivered:integer:INT, date_delivered:integer:INT, date_started:integer:INT, shopper_total_cost:decimal(8,2):DECIMAL, lola_id:varchar(50):VARCHAR, promo_hours:integer:INT, date_waiting_driver:integer:INT, expected_shopping_time:integer:INT, total_price_discount:decimal(8,2):DECIMAL, delivery_price:decimal(8,2):DECIMAL], grouped = false]
Layout: [modified_76:tinyint, numcallsclient_77:integer, address_id_78:integer, shop_id_79:integer, shopper_timeslot_id_80:integer, date_deliveries_requested_81:integer, percentage_opened_hours_82:decimal(5,2), comprea_note_83:varchar, numcalls_84:integer, gps_locked_85:tinyint, comprea_note_warehouse_86:varchar, percentage_opened_hours_in_87:decimal(5,2), uuid_88:varchar(10), total_saving_89:decimal(8,2), tag_color_90:varchar(15), num_products_taken_91:integer, total_cost_92:decimal(8,2), date_call_93:integer, tag_94:varchar(20), delivery_type_95:char(9), date_ticket_pending_96:integer, expected_delivering_time_97:integer, margin_98:decimal(8,2), weight_99:decimal(6,3), commission_100:integer, tag_signature_101:varchar(4), expected_eta_time_102:integer, frozen_products_103:tinyint, ebitda_104:decimal(8,2), id_105:integer, shopper_weight_106:decimal(6,3), driver_timeslot_id_107:integer, status_108:char(14), last_no_times_available_109:integer, last_update_110:integer, manual_charge_111:tinyint, delivery_price_discount_112:decimal(8,2), percentage_opened_hours_in_day_113:decimal(5,2), fraud_rating_114:integer, date_loyalty_tip_115:integer, num_products_116:integer, total_price_117:decimal(8,2), expected_delivering_distance_118:integer, date_shopping_119:integer, next_shopper_timeslot_id_120:integer, last_overweight_notification_121:integer, cart_progress_branch_url_122:varchar(255), shopper_total_price_123:decimal(8,2), shopper_algorithm_124:varchar, date_delivering_125:integer, cash_order_id_126:integer, note_127:varchar, comprea_note_driver_128:varchar, real_shop_id_129:integer, num_user_changes_130:integer, loyalty_tip_131:decimal(8,2), date_almost_delivered_132:integer, date_delivered_133:integer, date_started_134:integer, shopper_total_cost_135:decimal(8,2), lola_id_136:varchar(50), promo_hours_137:integer, date_waiting_driver_138:integer, expected_shopping_time_139:integer, total_price_discount_140:decimal(8,2), delivery_price_141:decimal(8,2), $hashvalue_153:bigint]
Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}
$hashvalue_153 := combine_hash(bigint '0', COALESCE("$operator$hash_code"("id_105"), 0))
weight_99 := weight:decimal(6,3):DECIMAL
date_deliveries_requested_81 := date_deliveries_requested:integer:INT
fraud_rating_114 := fraud_rating:integer:INT
id_105 := id:integer:INT
date_ticket_pending_96 := date_ticket_pending:integer:INT
numcalls_84 := numCalls:integer:INT
real_shop_id_129 := real_shop_id:integer:INT
expected_shopping_time_139 := expected_shopping_time:integer:INT
total_cost_92 := total_cost:decimal(8,2):DECIMAL
tag_signature_101 := tag_signature:varchar(4):VARCHAR
delivery_price_discount_112 := delivery_price_discount:decimal(8,2):DECIMAL
tag_94 := tag:varchar(20):VARCHAR
date_shopping_119 := date_shopping:integer:INT
date_call_93 := date_call:integer:INT
manual_charge_111 := manual_charge:tinyint:TINYINT
driver_timeslot_id_107 := driver_timeslot_id:integer:INT
promo_hours_137 := promo_hours:integer:INT
shopper_timeslot_id_80 := shopper_timeslot_id:integer:INT
shopper_weight_106 := shopper_weight:decimal(6,3):DECIMAL
note_127 := note:varchar:LONGTEXT
total_saving_89 := total_saving:decimal(8,2):DECIMAL
gps_locked_85 := gps_locked:tinyint:TINYINT
percentage_opened_hours_in_day_113 := percentage_opened_hours_in_day:decimal(5,2):DECIMAL
num_products_taken_91 := num_products_taken:integer:INT
commission_100 := commission:integer:INT
last_no_times_available_109 := last_no_times_available:integer:INT
percentage_opened_hours_82 := percentage_opened_hours:decimal(5,2):DECIMAL
total_price_117 := total_price:decimal(8,2):DECIMAL
date_delivering_125 := date_delivering:integer:INT
expected_eta_time_102 := expected_eta_time:integer:INT
ebitda_104 := ebitda:decimal(8,2):DECIMAL
address_id_78 := address_id:integer:INT
shopper_algorithm_124 := shopper_algorithm:varchar:LONGTEXT
shopper_total_price_123 := shopper_total_price:decimal(8,2):DECIMAL
shop_id_79 := shop_id:integer:INT
expected_delivering_time_97 := expected_delivering_time:integer:INT
date_waiting_driver_138 := date_waiting_driver:integer:INT
loyalty_tip_131 := loyalty_tip:decimal(8,2):DECIMAL
delivery_type_95 := delivery_type:char(9):ENUM
numcallsclient_77 := numCallsClient:integer:INT
date_almost_delivered_132 := date_almost_delivered:integer:INT
date_started_134 := date_started:integer:INT
total_price_discount_140 := total_price_discount:decimal(8,2):DECIMAL
uuid_88 := uuid:varchar(10):VARCHAR
frozen_products_103 := frozen_products:tinyint:TINYINT
comprea_note_warehouse_86 := comprea_note_warehouse:varchar:LONGTEXT
last_overweight_notification_121 := last_overweight_notification:integer:INT
cart_progress_branch_url_122 := cart_progress_branch_url:varchar(255):VARCHAR
last_update_110 := last_update:integer:INT
comprea_note_driver_128 := comprea_note_driver:varchar:LONGTEXT
delivery_price_141 := delivery_price:decimal(8,2):DECIMAL
lola_id_136 := lola_id:varchar(50):VARCHAR
date_delivered_133 := date_delivered:integer:INT
num_products_116 := num_products:integer:INT
modified_76 := modified:tinyint:TINYINT
status_108 := status:char(14):ENUM
next_shopper_timeslot_id_120 := next_shopper_timeslot_id:integer:INT
shopper_total_cost_135 := shopper_total_cost:decimal(8,2):DECIMAL
tag_color_90 := tag_color:varchar(15):VARCHAR
date_loyalty_tip_115 := date_loyalty_tip:integer:INT
margin_98 := margin:decimal(8,2):DECIMAL
cash_order_id_126 := cash_order_id:integer:INT
num_user_changes_130 := num_user_changes:integer:INT
comprea_note_83 := comprea_note:varchar:LONGTEXT
percentage_opened_hours_in_87 := percentage_opened_hours_in_10:decimal(5,2):DECIMAL
expected_delivering_distance_118 := expected_delivering_distance:integer:INT
```
## Trino results
Starting the measuring session.
Query 'Smoke test query' took 1 seconds to run and returned 1 rows.
Query 'Orders and GMV by Store and Month' took 256 seconds to run and returned 62 rows.
Query 'First order by customer' took 264 seconds to run and returned 113939 rows.
Query 'Sales by country, city, month for the last 1 months' took 257 seconds to run and returned 16 rows.
Query 'Sales by country, city, month for the last 3 months' took 234 seconds to run and returned 40 rows.
Query 'Sales by country, city, month for the last 6 months' took 237 seconds to run and returned 76 rows.
Query 'Sales by country, city, month for the last 12 months' took 230 seconds to run and returned 150 rows.
Query 'Sales by country, city, month for the last 24 months' took 225 seconds to run and returned 306 rows.
Query 'Sales by country, city, month for the last 36 months' took 227 seconds to run and returned 455 rows.
Finished the measuring session.
## MySQL
Starting the measuring session.
Opening up an SSH tunnel to pre.internal.lolamarket.com
SSH tunnel is now open.
Query 'Smoke test query' took 0 seconds to run and returned 1 rows.
Query 'Orders and GMV by Store and Month' took 14 seconds to run and returned 62 rows.
Query 'First order by customer' took 324 seconds to run and returned 113940 rows.
Query 'Sales by country, city, month for the last 1 months' took 161 seconds to run and returned 16 rows.
Query 'Sales by country, city, month for the last 3 months' took 169 seconds to run and returned 40 rows.
Query 'Sales by country, city, month for the last 6 months' took 171 seconds to run and returned 76 rows.
Query 'Sales by country, city, month for the last 12 months' took 178 seconds to run and returned 150 rows.
Query 'Sales by country, city, month for the last 24 months' took 208 seconds to run and returned 161 rows.
Query 'Sales by country, city, month for the last 36 months' took 234 seconds to run and returned 167 rows.
Finished the measuring session.
Closing down the SSH tunnel...
SSH tunnel is now closed.
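For reference, the core of the measuring sessions above is just a wall-time loop around each named query. A minimal sketch of that loop; the real package adds the CLI, the JSON session config and the SSH tunnel handling.

```python
# Hedged sketch of the measuring loop that produces the session logs above.
import time

def run_session(connection, queries: dict) -> None:
    """Run each named query on an open DBAPI connection and report wall time and row count."""
    print("Starting the measuring session.")
    cur = connection.cursor()
    for name, sql in queries.items():
        start = time.monotonic()
        cur.execute(sql)
        rows = cur.fetchall()
        elapsed = round(time.monotonic() - start)
        print(f"Query '{name}' took {elapsed} seconds to run and returned {len(rows)} rows.")
    print("Finished the measuring session.")
```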
# 20220726
Meeting with Ricardo:
- What operating model are we planning on running in Poland?
- What do you think are the problems with Spain?
- Why drop in-house tech and go for instaleap?
- How do you keep the frugal philosophy being inside Glovo?
- I'm surprised we don't have slightly more "controlling" style metrics
- What is the story behind Mercadao?
- What is your personal policy? Are you in for the money, for the fun?
- What do you like and what don't you like about the data team as of today?
"Only measure things that matter."
Predict demand by SKU.
My ideas:
- Keep deliveries decentralized, while being very smart about optimizing workload. The Portuguese approach seems a bit too rigid. The Spanish approach is not smart enough (and not very considerate of shoppers).
- Define a holy-grail set of KPIs for operational excellence (something like €/delivery + shopper's €/hour) and start doing A/B testing of different picking+delivery tactics:
- Split picking and delivering
- Double or even triple deliveries
- Picnic approach (full-refrigerated truck, all day)
-
---
- [x] Make a different repository for experiments
- [ ] Run final experiments with everything there
- [x] Clean up the package repository
- [x] Update confluence page with results
- [ ] Prepare script for training session
- What is the package used for and example
- Run any query against MySQL or Trino
- Measure how long it takes (Wall time)
- Used through CLI
- Sessions are defined through a JSON config
- How to install
- Tips and tricks
- Further ideas
- Make a richer output format
- Include features geared towards comparison (compare several versions of a query, or same query across engines)
Questions:
- Can it write results to a table?
-
# 20220727
Hi Fer,
I'm writing you a bit of a wall of text. We can discuss it here, or we can jump on a call if you prefer.
Summary: we want to propose adding some indices in MySQL to improve the performance of certain queries. This is about seeing how we coordinate on it.
Long version: in Data we want to build some dashboards for Spanish ops. They are dashboards for day-to-day topics like shopper availability, today's/tomorrow's orders, etc. So they need to refresh frequently and be fairly interactive.
The issue is that queries against the `cart` table are pretty brutal. It's typical to filter to see only orders in a given `status` or belonging to a specific `date_delivered` or `date_started`. But neither the `status` column nor any of the date columns has an index, so every query drags on.
We want to explore with you what the options are, see what would be the smartest approach, and do it.
Let me know, thanks.
Tables that are being used:
- address
- cart
- cart_product
- cash_order
- company
- postalcode
- shop
- shopper_timeslot
- shopper_postalcode
- user
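A sketch of what the proposed indices could look like, to be tried on the replica/test copy first; connection details are placeholders (through the SSH tunnel to pre.internal.lolamarket.com) and the exact index set is still to be agreed with Fer.

```python
# Hedged sketch: add candidate indices on cart as discussed above (run on a test copy first).
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=3306,  # local end of the SSH tunnel
                       user="data", password="...", database="comprea")
with conn.cursor() as cur:
    cur.execute("ALTER TABLE cart ADD INDEX idx_cart_status (status)")
    cur.execute("ALTER TABLE cart ADD INDEX idx_cart_date_delivered (date_delivered)")
    cur.execute("ALTER TABLE cart ADD INDEX idx_cart_date_started (date_started)")
conn.commit()
```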
# 20220728
- [x] Try to connect my DBeaver to DW through the Jumphost indicated by Carlos
- [x] Build script to replicate tables
- [x] Replicate tables
- [ ] Run inserts from one DB to another through Trino
- [ ] Run the tests
- [ ] Put new indices in place
- [ ] Run the test again
- [ ] Compare
CAST(status AS CHAR(14))
CAST(delivery_type AS CHAR(9))
id = 3548 -> the troublesome cart
## Stuff I discovered during my debugging of the bloody enums
- ENUMS + not strict mode -> errors turn into 0/''
- ENUMS + strict mode -> Insert breaks with unhelpful message
- LM MySQL doesn't have strict mode activated
- DW does have strict mode activated
- Easiest workaround for inserting faulty ENUM values from LM MySQL to DW:
- When defining the table in DW, make empty string ('') one of the possible ENUM values in the field. That way, when Trino tries to write an empty string from LM MySQL, DW sees it as a valid value.
- For now, there seems to be no down-side to this. Trouble would only happen if someone exported the table with the ENUMs as integer instead of strings, and tried to compare it with LM MySQL data. But that seems unlikely.
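A sketch of the workaround, assuming a trimmed-down replica table in DW; the ENUM value lists are illustrative (only 'delivered' shows up in these notes), the point is only the extra '' member.

```python
# Hedged sketch: recreate a cart-like table in DW with '' as a valid ENUM member, so rows carrying
# the 0/'' artifacts produced by the non-strict LM MySQL insert cleanly under DW's strict mode.
import pymysql

ddl = """
CREATE TABLE staging.cart_replica (
    id             INT PRIMARY KEY,
    status         ENUM('', 'delivered', 'cancelled', 'pending'),   -- '' added on purpose
    delivery_type  ENUM('', 'express', 'scheduled'),                -- '' added on purpose
    date_delivered INT
)
"""
conn = pymysql.connect(host="127.0.0.1", port=3306, user="data", password="...", database="staging")
with conn.cursor() as cur:
    cur.execute(ddl)
```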
## Retro
How can we improve as a team
- Everyone provides ideas
- We vote
- We discuss the most voted
- We take action points
# 20220816
- ~~Load the missing three tables.~~
- Run the queries again, both through Trino and directly to DW.
DW meeting with Dani
- We both agree that we should judge more carefully before we discard MySQL
# 20220817
Details of DW on AWS:
- Region: `eu-central-1` (Frankfurt)
- Name: data-prod-mysql
- VPC: pdo-prod (vpc-6a231802)
- VPC Security Groups: default (sg-86e633ec)
Details of the EC2 I created:
- Region: `eu-central-1` (Frankfurt)
- name: performance-query-host
- size: t2.small
Details of the SSH key created:
- query-performance-host-key
### Table design scenarios
1. Base scenario - tables as they are today
2. Second scenario - index on `status`
3. Third scenario - index on `status` and `date_delivered`
4. Fourth scenario - partition over `status` ~~partition over year `date_delivered`~~ **ACTIVE**
5. Fifth scenario - index on `status`, year and month partitioning on `date_delivered`
Can't partition over `date_delivered` because it has null values and it would need to be part of the PK (see the DDL sketch below).
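A DDL sketch of scenarios 3 and 5 above as they would run on the test copy (the `cart_s3`/`cart_s5` table names are hypothetical); literal epoch boundaries are used because MySQL partitions this integer column directly, and the PK change only works once the NULLs in `date_delivered` are dealt with.

```python
# Hedged sketch of scenarios 3 and 5: plain indices vs index + range partitioning on date_delivered.
# MySQL requires every column in the partitioning key to be part of every unique key (hence the PK
# change) and partitioned tables cannot have foreign keys.
import pymysql

statements = [
    # Scenario 3: secondary indices only
    "ALTER TABLE cart_s3 ADD INDEX idx_status (status), ADD INDEX idx_date_delivered (date_delivered)",
    # Scenario 5: index on status + date_delivered pulled into the PK so the table can be partitioned on it
    "ALTER TABLE cart_s5 DROP PRIMARY KEY, ADD PRIMARY KEY (id, date_delivered), ADD INDEX idx_status (status)",
    """ALTER TABLE cart_s5 PARTITION BY RANGE (date_delivered) (
           PARTITION p2020 VALUES LESS THAN (1609459200),  -- 2021-01-01 UTC
           PARTITION p2021 VALUES LESS THAN (1640995200),  -- 2022-01-01 UTC
           PARTITION pmax  VALUES LESS THAN MAXVALUE
       )""",
]
conn = pymysql.connect(host="127.0.0.1", port=3306, user="data", password="...", database="staging")
with conn.cursor() as cur:
    for ddl in statements:
        cur.execute(ddl)
```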
# 20220818
My update:
- Managed to connect to DW
- Playing with indices and partitions, results soon
- Spoilers:
- DW is wicked fast compared to the replica (could server size be the explanation?)
- Indices seem to help when querying DW directly
- Indices seem to have no effect when querying through Trino
- Will sit today and probably tomorrow with Pinto to build a more exhaustive query set
- Want to develop the Python package a bit more because the complexity of the results is growing exponentially
**Meeting with Pinto**
Pinto shares the products query that can never be run for more than 1-2 months because it dies
She also shares another problematic one on similar tables
I also asked for a handful of representative, relevant queries business-wise, even if they don't have performance problems as of today.
Partitioning pains:
- Columns used in the partitioning key must be part of the PK
- The table can't have foreign keys
# 20220819
- [x] Update results in confluence
- [x] Update slide
- [x] Update readme.md in directory
- [x] Make new feature in package
- [x] Develop some kind of test for the package?
# 20220822
- upload to s3
- register with prefect
- every flow must be in a project
- I need to set up bash for windows
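A minimal Prefect 1.2.2 sketch of the notes above: the flow uses S3 storage ("upload to s3") and is registered into a project; bucket and project names are placeholders.

```python
# Hedged sketch: register a Prefect 1.x flow with S3 storage into a project.
from prefect import Flow, task
from prefect.storage import S3

@task
def say_hello():
    print("hello from prefect")

with Flow("smoke-test-flow") as flow:
    say_hello()

flow.storage = S3(bucket="some-prefect-flows-bucket")  # "upload to s3"
flow.register(project_name="data-team")                # "every flow must be in a project"
```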
# 20220823
- [x] Fill the slide comparing old LM vs new LM vs DW
- [x] Update the confluence page
- [x] Update the board
- [x] Update the readme
Stuff that is missing in the docs on how to set up Prefect
- [x] Including the MFA line in the credentials file
- [x] Editing the .bashrc file with the line at the end
- [x] That prefect 1.2.2 should be installed, not prefect 2
- [x] Set up the `backend.toml` and `prefect.toml`
- [x] missing botocore, boto3 package
- [x] Permissions in S3 for the bucket
1:1 João:
- Discuss Data Quality ideas with the team
- Look at dates for the Porto trip:
- Think about team building stuff:
- Personal Development:
Steps:
1. [x] Flow that runs
2. [x] Flow that runs and logs something
3. [x] Flow that runs and queries Trino
4. [x] Flow that runs and queries Trino and prints something
5. [ ] Flow that runs and queries DW
6. [ ] Flow that runs and queries DW and prints something
7. [ ] Flow that runs and makes an update in DW
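A sketch of steps 3-4 above with Prefect 1.2.2 and the Trino client; connection details are placeholders.

```python
# Hedged sketch: a Prefect 1.x flow that queries Trino and logs the result (steps 3-4 above).
import prefect
import trino
from prefect import Flow, task

@task
def count_delivered_carts() -> int:
    conn = trino.dbapi.connect(host="trino.mercadao.pt", port=443, user="prefect",
                               http_scheme="https", catalog="app_lm_mysql", schema="comprea")
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM cart WHERE status = 'delivered'")
    return cur.fetchone()[0]

@task
def report(n: int) -> None:
    prefect.context.get("logger").info("Delivered carts so far: %s", n)

with Flow("trino-smoke-test") as flow:
    report(count_delivered_carts())

if __name__ == "__main__":
    flow.run()
```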
# 20220825
- [x] Update Prefect set up documentation
- [x] Schedule training
- [ ] Porto trip
- [x] Agree on dates with Dani
- [ ]
- [x] Get Trino and Rancher user from Dani
- [x] Put picture in Google account
- [x] Schedule Data Quality session with the team
- [x] Prepare whiteboard for Great Expectations explanation
- [x] Think about ideas for the agenda of the Porto trip
# 20220829
- [x] Share personalities
- [x] Post DevOps offer in Python barcelona meetup
- [x] Prepare Great Expectations demo for the team
- [x] Prepare roadmap proposal for GE
# 20220830
- [x] Tasks in Jira
- [x] Share materials with the team
- [x] Write a long form in Confluence
- [x] Drop status point on Jira task
- [x] Write summary of performance task
## Re-structuring meeting with Gonçalo
- Lidl is ending its partnership with LolaMarket
- Effective 1/09
- Lidl has not been happy with the numbers
- Lidl invested 2 million in LolaMarket. With the share and ownership movements, in practical terms they have lost 1 million
- Bad image because of the Ley Rider (the Spanish rider law)
- 40% of Lola Market's business
- Spend of 200.00€ / month. The numbers don't add up, and will add up even less once the partnership ends.
- An inevitable restructuring of the business in Spain
- It is certain that the Spanish team will have to shrink
- Tech and Data hang from HQ and are immune to this
- Pilot with Carrefour in November to become their shadow operation for same-day delivery
- Gonçalo sees a potential market of €75M in Spain together with them.
- The opportunity to save Lola
- Next steps
- Close the restructuring plan with Glovo and share it ASAP
# 20220831
The next two:
- Spot new guest users from mercadao and insert them in dw
- Delete sensible data for users that have been flagged as right-to-be-forgotten
Use a checksum table to test equality of existing users?
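A minimal sketch of the checksum idea, assuming the DW side is plain MySQL and using hypothetical table names:
```sql
-- CHECKSUM TABLE returns one cheap fingerprint per table; equal checksums mean the two
-- copies are identical, different checksums mean a row-level diff is still needed.
CHECKSUM TABLE dw_xl.dim_user, staging.dim_user_snapshot;
```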
Could the "autoincrement for everything" policy be a slippery slope?
# 20220901
- [x] Map out user ETL
- [x] Spanish ETL
- [x] Portuguese ETL
- [x] Set up user etl development environment
- [x] Share meetup ideas with João
- [x] Create tasks for all flows
- [ ]
# 20220902
- [x] Finish mercadao new users ETL
# 20220903
- [x] Read https://ploomber.io/blog/ci-for-ds/
- [x] Read http://www.garysieling.com/blog/testing-etl-processes/
- [x] Discuss deployment (all together, or deploy asap?)
- [ ] Finish my write up on my development
# 20220905
- [x] Check this to handle SSH tunnel https://discourse.prefect.io/t/how-to-clean-up-resources-used-in-a-flow/84
# 20220906
- [x] Ask about health insurance -> Liliana
- [x] Make specific goals
- [x] Dani changes the user table in the DW
- [x] Update inserts flow
- [x] Work with Carlos Matias to create a new SSH key for ETLs
> - Get
- All the users where (the pendings)
- Not in app
- Yes in DB
- System is mercadao
- rtbf = 0
- 10 users where (the gones)
- Not in app
- Yes in DB
- System is mercadao
- rtbf = 1
> - Check that the gones
- [x] Do not appear in the sensitive table
- [ ] Have different created and update times <- NOPE
> - Check that the pendings
- [x] Appear in the sensitive table
> - [x] Backup the data from the pendings
> - [ ] Execute the pipeline
> - Check that the pendings
- [ ] Do not appear in the sensitive table
- [ ] Have different created and update times
# 20220908
- [x] Add retries to Trino
- [x] Finish deleted users ETL
- [x] Update SQL of deleted users staging tables
# 20220908
- [x] Modify create table to lengthen password field
- [x] Send to João the points to review
# 20220912
- [x] Sign-up in servicedesk
**Insert guests**
Below is the SQL to insert a guest user. My question now is: on the normal users flow, what is preventing these same users from going in?
Answer: guest users don't appear in the users table in MongoDB. They simply never show up in the normal flow's source.
```SQL
-- Insert guests users
truncate table data_dw.staging.p006_pt21_new_guests_users
insert into data_dw.staging.p006_pt21_new_guests_users (
email,
id_user_internal,
n_date,
n_row,
temp_id
)
select
a.email,
b.id_user_internal,
date_format(now(), '%y%m%d%H%i%s') as n_date,
row_number() over (partition by null order by a.email) n_row,
'guest' || cast(date_format(now(), '%y%m%d%H%i%s') as varchar) || cast(row_number() over (partition by null order by a.email) as varchar) temp_id
from
(select
distinct
lower(og.customeremail) as email
from
app_md_mysql.pdo.ordergroup og
where
og.customerid is null
and og.status is not null
) a
left join data_dw.dw_xl.dim_user_sensitive b on (a.email = lower(b.email) and b.id_data_source = 1)
where
b.id_user_internal is null
and a.email is not null
order by 1
```
Ok. Now we have all the guests whose email does not appear in the dw already.
Next up, get DW ids for them.
```SQL
insert into data_dw.staging.p006_c01_auto_increment (
id_data_source,
key_for_anything
)
select 1 as id_data_source, temp_id as id_app_user
from data_dw.staging.p006_pt21_new_guests_users
```
Finally, get them inside the user table.
```SQL
insert into data_dw.dw_xl.dim_user_sensitive (
)
select
from
data_dw.staging.p006_pt21_new_guests_users c
inner join data_dw.staging.p006_c01_auto_increment a on (c.temp_id = a.key_for_anything)
insert into data_dw.dw_xl.dim_user (
id_user_internal,
id_data_source,
id_app_user,
flag_is_guest_account,
flag_is_social_account,
flag_agreed_campaigns,
flag_agreed_history_usage,
flag_agreed_phonecoms,
flag_agreed_terms,
flag_email_verified,
flag_is_active,
user_role,
flag_rtbf_deleted,
name,
email,
created_at,
updated_at
)
select
a.id_user_internal,
1 as id_data_source,
c.temp_id as id_app_user,
1 as flag_is_guest_account,
0 as flag_is_social_account,
0 as flag_agreed_campaigns,
0 as flag_agreed_history_usage,
0 as flag_agreed_phonecoms,
0 as flag_agreed_terms,
0 as flag_email_verified,
1 as flag_is_active,
'ROLE_USER' as user_role,
0 as flag_rtbf_deleted,
'guest account' as name,
c.email as email,
CAST(NOW() AS TIMESTAMP) as created_at,
CAST(NOW() AS TIMESTAMP) as updated_at
from
data_dw.staging.p006_pt21_new_guests_users c
inner join data_dw.staging.p006_c01_auto_increment a on (c.temp_id = a.key_for_anything)
```
- [x] Modify inserts ETL to use SSH key from AWS
# 20220913
- [ ] Failure on new users flow
- [ ] Review trino retry policy
- [ ] Review why flow appears as successful
- [x] Work on the special guest flow
**Upgrade guests into normals**
- We look for users that:
- Appear in mongodb and do not appear in DW (new-ish)
- BUT there is an existing guest user (`flag_is_guest_account`) that has the same email.
- And then we update the records of the existing guest's internal DW id with the new data from the full-blown user we have seen in MongoDB
staging.p006_pt41_users_with_existing_guest
c.id_user_internal as id_user_internal_guest
- [x] Make meeting to demo and Great Expectations
## Team guidelines
- Each flow should have an owner. Delegating is always possible, but we should prevent the bystander effect.
- Support work is work
- Issues should be issues?
1 person 1 week
Make task in our board
Vacations - ask in our internal chat
## Slack alerts
- Go to api.slack.com
- Create new app
- Activate Incoming Webhooks
- Create a webhook for the channel
Codes for each channel
- Data team: https://hooks.slack.com/services/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
- Data team alerts: https://hooks.slack.com/services/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
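For reference, a minimal sketch of how a flow can post to one of these webhooks (the URL below is a placeholder; in practice use one of the channel URLs above, loaded from configuration rather than hardcoded):
```python
import requests

# Placeholder webhook URL.
WEBHOOK_URL = "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX"

def notify_slack(text: str) -> None:
    """Post a plain-text message to the channel behind the incoming webhook."""
    response = requests.post(WEBHOOK_URL, json={"text": text}, timeout=10)
    response.raise_for_status()

if __name__ == "__main__":
    notify_slack("Test alert from the data team flows.")
```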
# 20220914
- [x] Review Dani's flow
- Password as SHA
	- Hardcoding of the number 2 as Lola's data source id
	- `t_004`: the last line of the query. What's the point of it?
	- `t_004`: is the WHERE necessary? Or the new-user flag?
- [ ] Document slack thingy
- [x] General stuff
- [x] Link to example flow
- [x] As part of prefect, talk about file in `env` with the webhook URLs
- [x] Describe the usage of triggers
- [ ] Review weird errors in flow 1
- [ ] Check if the current S3 flow is buggy
- [ ] Learn how to avoid false Successes -> Flow reference tasks https://docs-v1.prefect.io/api/latest/core/flow.html#flow-2
- [ ] Apply slack notification to all flows
- [ ] Modify upgrade guest flow with João suggestion + review
# 20220915
- [x] Review weird errors in flow 1
- [x] Check if the current S3 flow is buggy
- [x] Learn how to avoid false Successes -> Flow reference tasks https://docs-v1.prefect.io/api/latest/core/flow.html#flow-2 (see the sketch after this list)
- [x] Modify update flow so that it checks all fields, even those that nowadays are fixed
- [x] Modify upgrade guest flow with João suggestion regarding `date_registered` +
- [x] review same concept on flow insert user
- [x] Put docstrings in functions in flows
- [x] Refactor all flows to not use the first staging table
- [x] Flow 01
- [x] Flow 02
- [x] Flow 03 ISSUE WITH THE SHA, THIS ONE STILL USES IT. BUT NO TROUBLE IF IT'S THE ONLY ONE
- [x] Flow 04
- [x] Flow 05
- [x] Apply slack notification to all flows
- [x] Flow 01
- [x] Flow 02
- [x] Flow 03
- [x] Flow 04
- [x] Flow 05
- [x] Schedule all flows
- [x] Flow 01
- [x] Flow 02
- [x] Flow 03
- [x] Flow 04
- [x] Flow 05
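A minimal sketch of the reference-tasks idea from the link above, assuming Prefect 1.x (flow and task names are made up):
```python
from prefect import Flow, task

@task
def extract():
    return [1, 2, 3]

@task
def load(rows):
    print(f"loaded {len(rows)} rows")

with Flow("reference-tasks-demo") as flow:
    rows = extract()
    loaded = load(rows)

# By default the flow's final state follows its terminal tasks; if a task we care about is not
# terminal (or a downstream trigger swallows the failure), the flow can still report Success.
# Pinning the reference tasks makes the flow state follow exactly the tasks we choose.
flow.set_reference_tasks([rows, loaded])

if __name__ == "__main__":
    flow.run()
```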
# 20220919
- [ ] Finish docs
- [x] General
- [ ] Flow
- [ ] 01 -> Review references
- [x] 02
- [x] 03
- [x] 04
- [x] 05
- [x] Table Scripts
- [x] staging.p006_c01_auto_increment
- [x] staging.p006_pt41_users_with_existing_guest
- [x] staging.p006_pt21_new_guests_users
- [x] staging.p006_pt01_current_users
- [x] staging.p006_pt11_modified_users
- [x] staging.p006_pt31_deleted_users
- [x] staging.p006_pt02_current_users_new
- [x] Share Datacamp account with Ana Martins
# 20220920
- [ ] Orders
- [ ] Basic MD flow
- [ ]
- [ ] Check how to force Prefect to return true in a flow regardless of anything
## 1:1 João
- Barcelona Meetup
- Berlana and the blog?
- I'll talk now with Liliana
- Feedback
- Any news from Poland?
## 1:1 Liliana
- PR for tech recruiting
- Blog
- Meetup
- Relationship with unis
- We can get creative here
- I have a regular flow of smart Management and Economics students in my agenda
- She comes from another start-up (Digital Therapy Software)
## Python meetup Veriff
- How to propose a talk?
- Veriff is originally from Estonia
### Fraud detection
- Rivo, Fraud Engineering Lead
- The founder got the idea from faking his own age. Lol.
- Do you implement a red team / blue team approach?
- Testing feels very difficult because your system is very stateful. How do you deal with it?
- In-memory data for fast queries with millions of records... how much memory were you using?
### QR Codes
Ismael Benito
QR codes are two-dimensional bar codes
How could we use QR codes in Lola?
I need to do cool QR codes for my classes at UPF
- How well supported is it? Is my phone stupid?
- Any applications of this besides having fun? -> The chromatic ink stuff
# 20220921
- I have made the following query:
```SQL
SELECT o.id as id_order,
og.id as id_order_group,
UPPER(og.customerid) as id_user,
u.id_user_internal,
u.id_app_user,
gu.id_user_internal,
gu.id_app_user
FROM app_md_mysql.pdo."order" o
LEFT JOIN app_md_mysql.pdo.ordergroup og
ON o.ordergroupid = og.id
LEFT OUTER JOIN data_dw.dw_xl.dim_user u
ON UPPER(og.customerid) = u.id_app_user
LEFT OUTER JOIN data_dw.dw_xl.dim_user gu
ON lower(og.customeremail) = gu.email
WHERE u.id_data_source = 1
AND gu.id_data_source = 1
AND u.flag_is_guest_account = 0
AND gu.flag_is_guest_account = 1
```
And it returns records like this:
![[Pasted image 20220921124220.png]]
This is strange. Each order should match with a user **OR** a guest, but not both.
I'm going to start looking into specific cases. The following one is the first
|id_order|id_order_group|id_user|id_user_internal|id_app_user|id_user_internal|id_app_user|
|--------|--------------|-------|----------------|-----------|----------------|-----------|
|1136592|1159257|6140A68762E0C1003FD59CD3|6694660|6140A68762E0C1003FD59CD3|7183239|guest22091216010722939|
**STEP**
I want to:
- Check all the orders made by `6329A8BF6ED53300400DBD5E`
- Check the entries of both `dw_user_id=8669726` and `dw_guest_id=7182737` to explore what they look like.
**RESULT**
Users that have changed their email and match with a guest user.
I must match orders with users, first by user, then by guest.
To achieve it by user:
- Match `og.customerid` with `dim_user.id_app_user`
- For those that don't match this way: match `og.customeremail` with guest accounts
-
## The Bug shared by Janu
These are the faulty users:
|id_user_internal|id_data_source|id_app_user|date_joined|date_registered|date_uninstall|flag_is_active|user_role|gender|id_app_source|ip|app_version|user_agent|locale|priority|substitution_preference|default_payment_method|flag_gdpr_accepted|flag_agreed_campaigns|flag_agreed_history_usage|flag_agreed_phonecoms|flag_agreed_terms|flag_email_verified|flag_rtbf_deleted|flag_is_guest_account|flag_is_social_account|email|login|name|password|phone_number|id_facebook|id_stripe|id_third_party|poupamais_card|created_at|updated_at|
|----------------|--------------|-----------|-----------|---------------|--------------|--------------|---------|------|-------------|--|-----------|----------|------|--------|-----------------------|----------------------|------------------|---------------------|-------------------------|---------------------|-----------------|-------------------|-----------------|---------------------|----------------------|-----|-----|----|--------|------------|-----------|---------|--------------|--------------|----------|----------|
|7177183|1|guest2209121601074003||||1|ROLE_USER|||||||||||0|0|0|0|0|0|1|0|brucacajope1@gmail.com||guest account|||||||2022-09-12 16:01:17.000|2022-09-12 16:01:17.000|
|6722529|1|5EFCBE721D257E004C026D07||||1|||||||pt|0|REPLACE|CREDIT_CARD||1|1|1|1|1|0|0|1|brucacajope1@gmail.com|10219829617349375@facebook|Miriam Vaz|87435096261dd1a22e25e55c91f045789a67ac85|963844518|10219829617349375|||2446061022274|2022-09-09 09:21:42.000|2022-09-09 09:21:42.000|
|7178877|1|guest22091216010721636||||1|ROLE_USER|||||||||||0|0|0|0|0|0|1|0|raquelsantosmota@gmail.com||guest account|||||||2022-09-12 16:01:17.000|2022-09-12 16:01:17.000|
|6708840|1|60244B928C1878004124EDBE|2021-02-10 21:09:38.000|||1|||||||pt|0|REPLACE|CREDIT_CARD||1|1|0|1|1|0|0|0|raquelsantosmota@gmail.com|raquelmota@agpico.edu.pt|Raquel Mota|fc46053c49bc85c2d919dbdde184254b6dd27156|965182320||||2446010860094|2022-09-09 09:21:42.000|2022-09-09 09:21:42.000|
|7174727|1|guest22091216010720657||||1|ROLE_USER|||||||||||0|0|0|0|0|0|1|0|pedro.carreira74@gmail.com||guest account|||||||2022-09-12 16:01:17.000|2022-09-12 16:01:17.000|
|6800726|1|62D1A2CAA8426C003F1975F2|2022-07-15 17:24:26.000|||1|||||||pt|0|REPLACE|||1|1|1|1|1|0|0|1|pedro.carreira74@gmail.com|10224176747459562@facebook|Pedro Carreira||939867084|10224176747459562||||2022-09-09 09:21:42.000|2022-09-09 09:21:42.000|
Things I observe:
- For each of them, there is one guest and one user version.
- The guest versions have been created *after* the user version was created.
**Hypothesis**
- The new guest flow is not prepared for cases where an existing user makes a purchase as a guest.
- It should check if the guest email exists in `dim_user`, but it doesn't, hence it creates the user again.
- Given that the user never appears as new, since it already exists in DW, the guest version will never be upgraded to user.
**Result**
The hypothesis is rejected.
The new guest flow does watch out for existing users and guests in `dim_user` properly. It does so by matching by email.
- Alicia buys as guest
- Alicia gets inserted as guest in DW
- Alicia registers as user
- Alicia's existing guest gets upgraded the user
- Alicia buys as guest
- Alicia gets inserted as guest in DW
# 20220922
- [x] Review Dani's flows
- [x] 02 Update users
- `ETL_PATH` with env, otherwise the team can't run it
- `p006_lm21_gest_users` <- typo
- Pass slack channel from parameter, see new orders ETL flow for inspiration
- `user_agent` gets capped to `VARCHAR(250)`. Is that length reasonable, or could it fall short?
- lowercase all emails so that comparisons are always proper (email is case insensitive)
- The flow seems to do steps 1-4 for both users and guests, but then task 5 only updates users? Where are guests updated then?
- [x] 01
- [x] lower email
- [x] slack noti + scheduled param
- [x] `__file__`
- [x] `date_joined` and `date_registered`
- [x] `flag_is_active`
- [x] 02
- [x] lower email
- [x] slack noti
- [x] + scheduled param
- [x] `__file__`
- [x] `date_joined` and `date_registered`
- [x] `flag_is_active`
# 20220927
- [x] Review the failed flows
- [x] 01
- [x] 02
- [x] 03 - was it changed?
- [x] Request a user for Francisco
- [ ] Follow the instructions to get access to GitHub
- [ ] Keep on with fixing user flows
- [ ] 03
- [ ] lower email <<<<<<<<<<<<<<<<<-------- same query returns different number of rows in Trino and Python???? I left a SQL script (script-15) where both queries are there. I can use it for debugging.
- [ ] Could the lower function be different in Trino and MySQL?
- [ ] It seems that `LOWER` in Trino works just fine.
- [ ] In MySQL, `LOWER` and `LCASE` are supposed to do the same thing. I wrote a query to test it and
- [x] slack noti
- [ ] + scheduled param
- [x] `__file__`
- [x] `date_joined` and `date_registered`
- [x] `flag_is_active`
- [ ] EXTRACT DATE
- [ ] UPDATE_AT in update
- [ ] 04
- [ ] lower email
- [ ] slack noti + scheduled param
- [ ] `__file__`
- [ ] `date_joined` and `date_registered`
- [ ] `flag_is_active`
- [ ] 05
- [ ] lower email
- [ ] slack noti + scheduled param
- [ ] `__file__`
- [ ] `date_joined` and `date_registered`
- [ ] `flag_is_active
- [ ] Work on orders
## Meeting with Gonçalo
- Focus on operations because Poland removes our front-end
- What freedom/opportunities/limitations do we have to market biedronka.pl
- 100€ free delivery... inflation
- Game theory: kidnap and squeeze
- White label is just a commodity
- Technology is shared with Instaleap.
- What does Instaleap want from us?
- How happy is Berlana with Instaleap? Quality issues?
## Meeting with Ricardo
- We review the roadmap
- Ricardo explains that they have discussed with Jerome Martins around catalogs per store and stockouts.
- We only have one catalog per retailer, even if we know that each store does not have exactly the same category
- JM argues that stockouts happen because our catalog is simplified, not because their operations are flawed
- To settle the discussion, the catalog for a couple of stores have been fixed manually to be hyper-accurate. With this, we experiment with the following hypothesis: Does having an accurate catalog reduce stockouts for that store significantly?
- [ ] Biedronka launch
- [ ] Single shopper vs picker-driver
- [ ] Order of biedronka expansion by geography
- [ ] Recurrence
- [ ] Best drivers
For next one:
- How can we improve AOV, CPO and Stockouts
- [ ] How much money are we losing to credit card fees in Spain?
# 20220929
- [ ] Document user flow 006
- [ ] Follow the instructions to get access to GitHub
- [ ] Keep on with fixing user flows
- [ ] 03
- [ ] lower email <<<<<<<<<<<<<<<<<-------- same query returns different number of rows in Trino and Python???? I left a SQL script (script-15) where both queries are there. I can use it for debugging.
- [ ] Could the lower function be different in Trino and MySQL?
- [ ] It seems that `LOWER` in Trino works just fine.
- [ ] In MySQL, `LOWER` and `LCASE` are supposed to do the same thing. I wrote a query to test it and
- [x] slack noti
- [ ] + scheduled param
- [x] `__file__`
- [x] `date_joined` and `date_registered`
- [x] `flag_is_active`
- [ ] EXTRACT DATE
- [ ] UPDATE_AT in update
- [ ] 04
- [ ] lower email
- [ ] slack noti + scheduled param
- [ ] `__file__`
- [ ] `date_joined` and `date_registered`
- [ ] `flag_is_active`
- [ ] 05
- [ ] lower email
- [ ] slack noti + scheduled param
- [ ] `__file__`
- [ ] `date_joined` and `date_registered`
- [ ] `flag_is_active
---
## Work on the order etl
Okay. Since I haven't worked on fixing the user flows for days, I actually kind of forgot what the issue was and what I was doing about it. I'll start again from the initial point: trying to join orders to customers in the orders ETL.
Customers can be linked to orders in two ways:
- Through Mercadão's user ID.
- Through the order email.
Theoretically, with an up-to-date `dim_user` table:
- All orders should be linked to a customer by one of the two methods.
- Orders should only be matched by one method (if we find a user id, we should not match again by email)
- I'm going to take a look at the query I was building to check if this is the case.
After running the query with a few adjustments, I come to the following observations:
- The query matches the following way:
![[Pasted image 20220929134219.png]]
- Most match by ID.
- Double matches don't worry me, since we can just prioritise using the app id match instead of the user email one.
- The not matched are the ones that worry me. I need to look into those.
**Hypothesis**
I'll explore the following way:
- First, confirm that all non-matching orders do not have a user app id. This will confirm that we can narrow down the issue to email matching.
- If the previous is true:
- I'll get one of the non-matching orders and save the email.
- I'll go into dim user and see if I can find the email with a wrong case.
- My expectation is that I can find it. If this is not the case... I need to get creative to think what comes next.
**Result**
- All customer ids are null. So, I confirm that all non-matching orders are guest orders that should be matched by email.
- I pick order `106889`
- The customer email is `conceicaorento@sapo.pt`
- The email appears in `dim_user`, associated with a registered user.
**Aha-moment**: when matching by email, I should NOT do that only for guest users. There are guest orders that should be matched by email to upgraded users. The order doesn't have a customer id because it was placed from a guest identity, but we later upgraded the guest to a user, so there is no longer a matching guest account, and matching through the app user id is not possible either because the order was placed back when they were still a guest.
So, matching should be like:
- If order has an user id, match that way
- If not:
- If order matches email with user account, use user account
- If order matches email with no user account, use guest account
- If no account, problem
**Hypothesis**
Applying the previous matching procedure should result in no orders unmatched.
**Result**
Unmatched orders are down from 79,174 to 1,989. We have reduced the problem, but issues still remain.
**Hypothesis**
I'll explore the following way:
- First, confirm that all non-matching orders do not have a user app id. This will confirm that we can narrow down the issue to email matching.
- If the previous is true:
- I'll get one of the non-matching orders and save the email.
- I'll go into dim user and see if I can find the email with a wrong case.
- My expectation is that I can find it. If this is not the case... I need to get creative to think what comes next.
**Result**
- Okay, first, for the orders that have a `customerid`:
- There are non-matched orders with customer ids, and their IDs don't appear in `dim_user`. My guess is that the following is happening:
- User registers
- User places orders
- User deletes itself from Mercadão
- ...time passes...
- We start feeding `dim_user`
- End result: we never had the chance to store the user in `dim_user` since it was deleted before we ran it.
- Options to move ahead:
- Drop the orders: ugly as hell, potentially acceptable since it's only 2,000 orders of deleted users.
- Make `id_user=-1` for these orders. Not that intuitive for analysts. And if we have another exception some day, we will have to do `id_user=-2` or something like that and things will start tangling up.
- Create a new user flow for Mercadão where we infer deleted users from orders and include them in `dim_user`. A bit more work, but I think it's clean and will pay off long term.
- For the customers that don't:
- I select order with id `66017`
- The customer email is `SONIASTC@SAPO.PT`
- And I look for it in `dim_user`... and it's there, with the same case. Why didn't it match??? Well, obviously, because `dim_user` email is in UPPER, and my query is lowercasing stuff.
- If I rerun again ensuring cases are fine:
- Does the same user appear? -> No
- How many non-matched by email users appear? -> only 22
- Out of these 22
- 6 are recent orders that have been done after the last Users ETL run, so that's fine
- The other 16 are super old orders and all of them come from users with email from Jerome Martins. When we insert guests into `dim_user`, we ignore order groups with status null. I'll do the same for this ETL.
Query for debugging in case I need it again in the future:
```sql
SELECT
o.id AS id_order,
og.id AS id_order_group,
UPPER(mdu.id_app_user) AS id_app_user,
mdug.email AS guest_user_email,
CASE
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NULL THEN mdu.id_user_internal
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NOT NULL THEN mdug.id_user_internal
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NOT NULL THEN mdu.id_user_internal
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NULL THEN -1
END AS dw_user_id,
CASE
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NULL THEN 'matched_by_id'
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NOT NULL THEN 'matched_by_email'
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NOT NULL THEN 'double_match'
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NULL THEN 'not_matched'
ELSE 'wut???'
END AS match_type,
o.*,
og.*
FROM
app_md_mysql.pdo."order" o
INNER JOIN app_md_mysql.pdo.ordergroup og
ON
o.ordergroupid = og.id
LEFT JOIN
(
SELECT
du.id_user_internal,
du.id_app_user
FROM
data_dw.dw_xl.dim_user AS du
WHERE
du.id_data_source = 1
AND du.flag_is_guest_account = 0
) AS mdu
ON
UPPER(og.customerid) = mdu.id_app_user
LEFT JOIN
(
SELECT
du.id_user_internal,
du.email
FROM
data_dw.dw_xl.dim_user AS du
WHERE
du.id_data_source = 1
) AS mdug
ON
LOWER(og.customeremail) = LOWER(mdug.email)
WHERE
o.status IS NOT NULL
AND og.status IS NOT NULL
AND CASE
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NULL THEN 'matched_by_id'
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NOT NULL THEN 'matched_by_email'
WHEN mdu.id_app_user IS NOT NULL
AND mdug.email IS NOT NULL THEN 'double_match'
WHEN mdu.id_app_user IS NULL
AND mdug.email IS NULL THEN 'not_matched'
ELSE 'wut???'
END = 'not_matched'
```
**USE PARTITIONING TO SOLVE DOUBLE MATCHES BY PRIORITISING THE USER**
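A rough sketch of that idea, reusing the tables from the debugging query above (not the final ETL query): rank every candidate match per order and keep the registered user over the guest.
```sql
SELECT id_order, dw_user_id
FROM (
    SELECT
        o.id AS id_order,
        du.id_user_internal AS dw_user_id,
        ROW_NUMBER() OVER (
            PARTITION BY o.id
            ORDER BY du.flag_is_guest_account ASC   -- 0 = registered user wins over 1 = guest
        ) AS match_rank
    FROM app_md_mysql.pdo."order" o
    INNER JOIN app_md_mysql.pdo.ordergroup og ON o.ordergroupid = og.id
    LEFT JOIN data_dw.dw_xl.dim_user du
        ON du.id_data_source = 1
       AND (UPPER(og.customerid) = du.id_app_user
            OR LOWER(og.customeremail) = LOWER(du.email))
) ranked
WHERE match_rank = 1
```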
Snippet to add deleted users to `dim_user`
https://drive.google.com/file/d/1Iz_NypqRx2WgtgxNxWit0E5XggS85K7P/view?usp=sharing
# 20220929
- [ ] Documenting and running deleted users thingy
- [ ] Follow the instructions to get access to GitHub
- [ ] Keep on with fixing user flows
- [ ] 03
- [ ] lower email <<<<<<<<<<<<<<<<<-------- same query returns different number of rows in Trino and Python???? I left a SQL script (script-15) where both queries are there. I can use it for debugging.
- [ ] Could the lower function be different in Trino and MySQL?
- [ ] It seems that `LOWER` in Trino works just fine.
- [ ] In MySQL, `LOWER` and `LCASE` are supposed to do the same thing. I wrote a query to test it and
- [x] slack noti
- [ ] + scheduled param
- [x] `__file__`
- [x] `date_joined` and `date_registered`
- [x] `flag_is_active`
- [ ] EXTRACT DATE
- [ ] UPDATE_AT in update
- [ ] 04
- [ ] lower email
- [ ] slack noti + scheduled param
- [ ] `__file__`
- [ ] `date_joined` and `date_registered`
- [ ] `flag_is_active`
- [ ] 05
- [ ] lower email
- [ ] slack noti + scheduled param
- [ ] `__file__`
- [ ] `date_joined` and `date_registered`
- [ ] `flag_is_active
# 20221003
- [x] Document user flow 006
- [ ] Follow the instructions to get access to GitHub
- [ ] Keep on with fixing user flows
- [x] 03 WATCH OUT: THE COMPARISON IS NOT CASE SENSITIVE
- [x] lower email <<<<<<<<<<<<<<<<<-------- same query returns different number of rows in Trino and Python???? I left a SQL script (script-15) where both queries are there. I can use it for debugging.
- [x] Could the lower function be different in Trino and MySQL?
- [x] It seems that `LOWER` in Trino works just fine.
- [x] In MySQL, `LOWER` and `LCASE` are supposed to do the same thing. I wrote a query to test it and
- [x] slack noti
- [x] + scheduled param
- [x] `__file__`
- [x] `date_joined` and `date_registered`
- [x] `flag_is_active`
- [x] EXTRACT DATE
- [x] UPDATE_AT in update
- [x] 04
- [x] lower email
- [x] slack noti + scheduled param
- [x] Set the `date_joined` of existing guests to the date of their first order. How?
- [x] `__file__`
- [x] `date_joined` and `date_registered`
- [x] `flag_is_active`
- [x] 05
- [x] lower email
- [x] slack noti + scheduled param
- [x] `__file__`
- [x] `date_joined` and `date_registered`
- [x] Current flow is leaving `date_registered` null (bad)
- [x] Current flow is "destroying" `date_joined` from the previous guest user entry. Should it be simply carried over?
- [x] `flag_is_active
- [x] Run an update on email column to lowercase the shit out of it
- [ ] Solve partitioning problem
|id_order|id_order_group|id_user|
|--------|--------------|-------|
|1084918|1113194|9166273|
|588939|654522|9246797|
|664680|728505|9021328|
Query to review if there are any uppercase emails in `dim_user`
```sql
SELECT dim_user_email_has_upper, COUNT(1)
FROM
(SELECT
a.email AS pdo_email,
b.email AS dim_user_email,
BINARY a.email <> BINARY LOWER(a.email) AS pdo_email_has_upper,
BINARY b.email <> BINARY LOWER(b.email) AS dim_user_email_has_upper,
a.email = b.email AS naive_equality,
lower(a.email) = lower(b.email) AS lowered_equality
from
staging.p006_pt01_current_users a
left join dw_xl.dim_user b on (UPPER(a.id_app_user) = UPPER(b.id_app_user) and a.id_data_source = b.id_data_source)
where
b.id_user_internal is not null
and b.id_app_user is not NULL
) email_stuff
GROUP BY dim_user_email_has_upper
```
# 20221006
- [x] Remove unnecessary OVER PARTITION in status and date
- [x] Make note of the parameter change to use `updatedat` in MD Order flow
> now on the orderwithuser part, a couple of questions:
> 1. you dont actually need to do the over partition and then the distinct. you can just do the max() and a group by by o.id. It should be faster than performing the over partition.
> 2. the `mdug` subquery, by bringing the email on the select and joining by the lower, we can actually get more than 1 rows on the select if the email gets to be lower cased and not in the table. I know this is probably handled on dim_users, and will be solved with the max on the main query. So no worries, just to be aware of this.
> not sure about this, but we probably should limit the updatedat to not be today
> otherwise you will get users that dont exist on dim_user yet
# 20221013
- [x] Index
- [x] Modify table definition to include index
- [x] Change code back to deleting by id + fetch by creation date
- [x] Discuss with João the issue with using `update_at` to limit the scope of the order flow
- [x] Drop large tables from query experiment
# 20221014
## Order columns batch 2
- `id_picking_location`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `id_retailer`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `id_team`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `date_created`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `date_updated`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `status_order`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `total_price_ordered`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `date_delivery_slot_start`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `date_delivery_slot_end`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `date_last_delivery_slot_start`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `date_last_delivery_slot_end`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `delivery_fee`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `delivery_fee_discount`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `delivery_fee_original`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `delivery_type`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
# 20221017
- [x] Order LM V1
## The stupid non-null case
**What I was doing**
- I was running an `INSERT INTO` to a table.
- This included inserting in column `X`. Column `X` is nullable. I was aware that some of the values I was inserting were nulls... but that shouldn't be a problem since the column `X` is nullable.
- The query was being run on Trino.
**What I expected to happen**
That the data just gets inserted properly.
**What actually happened**
- The query failed with an exception.
- The error message read: `trino.exceptions.TrinoUserError: TrinoUserError(type=USER_ERROR, name=CONSTRAINT_VIOLATION, message="NULL value not allowed for NOT NULL column: X", query_id=20221017_135640_00663_f6qyk)`
- I was puzzled since, again column `X` is nullable.
**Weird, let's research**
- After toying around a bit, I ran the same query without inserting into column `X`, but keeping many other columns in place.
- After running, I got the same error message I was getting for column `X`, but this time for column `Y`.
- This was also puzzling, because column `Y` is nullable as well.
- I kept taking nullable columns out of the `INSERT`, and the error message kept jumping to some other column every time.
**Finding the root cause and solution**
- Eventually, I was left with 4 columns, `A`, `B`, `C` and `D` that were NOT nullable.
- After running the `INSERT` once more, I got the same error for column `A`.
- I checked the data I was trying to insert and there were null values for column `A`. Hence, this time the error made sense: there were null values going into `A`, but `A` is not-nullable.
- I modified the schema definition to make `A` nullable and ran again the insert with just `A`, `B`, `C` and `D`. It inserted properly and threw no errors.
- I then re-executed my original query with all the columns I wanted to insert. Now it worked just fine!
**Take-away**
If Trino ever gives you an error like `trino.exceptions.TrinoUserError: TrinoUserError(type=USER_ERROR, name=CONSTRAINT_VIOLATION, message="NULL value not allowed for NOT NULL column: X", query_id=20221017_135640_00663_f6qyk)`, **completely ignore the column being mentioned**. Trino will just lie to your face repeatedly. Instead, restrict your operation to only the columns that can't be null and find what part of the data is breaking the constraint.
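A minimal sketch of that debugging move, with hypothetical table names and `a`-`d` standing in for the genuinely NOT NULL columns:
```sql
-- 1. Insert only the NOT NULL columns; now the CONSTRAINT_VIOLATION points at a real offender.
INSERT INTO dw_xl.some_target (a, b, c, d)
SELECT a, b, c, d FROM staging.some_source;

-- 2. Hunt down the rows breaking the constraint directly in the source data.
SELECT * FROM staging.some_source
WHERE a IS NULL OR b IS NULL OR c IS NULL OR d IS NULL;
```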
It looks like the LM order ETL is joining several users to one single order or tag. I need to look into it. Maybe Dani made some mistake in the user ETL.
# 20221026
- [x] Review João's review
- `VARCHAR(36)` agreement
- [x] Write convention on `VARCHAR(36)` for ids
- [x] Modify existing columns and table scripts for `fact_order`
- [x] The trip expenses
- [x] Fill in details of tasks created by João
- `id_picking_location`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `id_retailer`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `id_team`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `date_created`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `date_updated`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `status_order` (naive version)
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `total_price_ordered`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [ ] Approved
- `date_delivery_slot_start`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `date_delivery_slot_end`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `date_last_delivery_slot_start`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `date_last_delivery_slot_end`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- `delivery_fee`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [ ] Approved
- `delivery_fee_discount`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [ ] Approved
- `delivery_fee_original`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [ ] Approved
- `delivery_type`
- [x] Created col in `staging`
- [x] Created col in `dw_xl`
- [x] Included in first query
- [x] Included in second query
- [x] Approved
- [x] Status
- [x] Check cash order
- [x] Check with Bruna
- [x] Make design proposal
- Deal with the multicurrency bit
- [x] Local currency should come from the currency field, not be hardcoded
- [x] Apply the exchange rate. See https://bi.mercadao.pt/question/1344-lm-cart.
- [x] MAKE USER A `NOT NULL` FIELD AGAIN! (Now it's not because the LM ETL for `dim_user` is buggy)
- [x] Ask specifically for Dani to review with a focus on the delivery slots fields
# 20221103
## Shutting down
Lolamarket operations will die today.
- Is the company still up? No.
- Who will stick around? Data and tech.
- Completely shut down? Today?
- Branding. Are we also going to kill the brand? Brand continues
- Timeline and paperwork
- Timeline?
- Diff between a brand-new company and a subsidiary?
- As long as we do it calmly, all good
# 20221104
- [x] Ricardo's request: Add % lost sales to dashboard
- [x] Find out how we could do the "dirty read thingy"
- [x] Try to make an environment where I can work with the GE notebooks
Given a ready-to-insert/update subset of data *Mirror*,
and a table *Target*,
there is no expectation that applies to *Target* but not to *Mirror*.
Hence, the same expectation suite *Suite* that can fully validate the *Mirror* in an ETL flow
can also fully validate the *Target* in a monitoring flow.
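A minimal sketch of the idea, assuming the legacy great_expectations API we use elsewhere (the dataframes are stand-ins for the mirror and the target):
```python
import pandas as pd
import great_expectations as ge

# "Mirror": the ready-to-insert/update subset produced by an ETL flow.
mirror = ge.from_pandas(pd.DataFrame({
    "id_user_internal": [1, 2],
    "email": ["ana@example.com", "joao@example.com"],
}))
mirror.expect_column_values_to_not_be_null("id_user_internal")
mirror.expect_column_values_to_be_unique("email")
suite = mirror.get_expectation_suite()

# "Target": the table the mirror lands in. The exact same suite validates it in a monitoring flow.
target = ge.from_pandas(pd.DataFrame({
    "id_user_internal": [1, 2, 3],
    "email": ["ana@example.com", "joao@example.com", "rui@example.com"],
}))
print(target.validate(expectation_suite=suite))
```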
# 20221107
- [x] Clean up Sandbox
- [x] Review Pinto's comment
- ~~Re-run and check if duplicate persists~~
- The error persists.
- ~~If it persists, see what part of the monster JOIN is the one joining more than once and change what's necessary~~
- I just noticed it's from Mercadão, not Lolamarket. Change of plans
- ~~First, I'm gonna check how many orders in each system have duplications.~~
- It's only Mercadão.
- ~~I have noticed that the Mercadão flow is still doing the truncate with trino. This could be the reason behind the error. I'm going to check if the staging table has duplicates.~~
- I fixed it and it looks like that was the guilty bit. Re-executing returned no duplicates, so I'll consider it to be good for now unless Ana spots anything again.
# 20221108
## Q&A for Promotech SL death
![[Pasted image 20221108123815.png]]
- "All seniority will be taken into account."
- ~~What is a "Mercadão Entity"? Is it an SL company in Spain owned at 100% by Mercadão?~~ It's just Mercadão itself with legal presence in Spain.
- ~~Do we have a timeline? Not that I'm in a hurry, just out of curiosity ~~-> "Our goal is for your transition to be completed at most in mid december"
# 20221110
- [x] Review Janu's expectations
- Use unique
- Explain trick on query vs table to sum all flags together
- Explain REGEX to id_app_user
- Share this: https://legacy.docs.greatexpectations.io/en/latest/reference/glossary_of_expectations.html
# 20221115
## The Great Fixing of the Order ETLs
> Put the trim in the Lolamarket order status
- Solved
> Check the DELIVERING order having null status for payment
- Solved
> Another thing, if you look at the orders from October, some of the orders don't have the correct status... For example: `select * from data_dw.dw_xl.fact_order where id_order = 2041500;` In this case, the status on the operational table is `delivered` and in the `fact_order` is `LOYALTY_CARD_ERROR`
- TLDR: I can't reproduce the issue
- I checked and `fact_order.original_app_status = 'deleted'` and the `fact_order.id_operational_status = 4`, which correctly corresponds to `cancelled`
	- Query: `select op_status.*, o.* from dw_xl.fact_order o LEFT JOIN dw_xl.dim_operational_status op_status ON o.id_operational_status = op_status.id where id_order = 2041500`
>Why the date delivered of this order is `2022-09-30` instead of `2022-10-01`? `select * from data_dw.dw_xl.fact_order where id_order = 9715131;`
- Could you elaborate what's the rationale behind that `date_delivered_closed` should be `2022-10-01`?
> Why for order `2857238` we have differences in the fields `delivery_fee_original_eur` and `delivery_fee_discount_eur`? (I'm using the  `public.exchange_rate_eur` for the calculation)
- I can't find any problem. Could you explain how do you think the following numbers should be? (These are for order `2857238`)
|delivery_fee_eur|delivery_fee_original_eur|delivery_fee_discount_eur|delivery_fee_local|delivery_fee_original_local|delivery_fee_discount_local|
|----------------|-------------------------|-------------------------|------------------|---------------------------|---------------------------|
|0.3400|4.0000|3.6500|1.7200|19.9900|18.2700|
- [x] Check why it doesn't add up to 4
Solved, it was because the hardcoded values were not of type `DECIMAL` and Trino is an idiot.
- [x] Trim `ROLE_USER`
- [x] No orders from users with `ROLE_TEST`
- [x] Option A: we remove `ROLE_TEST` users from `dim_user`
- [x] Option B: we keep `ROLE_TEST` users in `dim_user`, add a filter in the `fact_order` ETL to remove those users
- [x] Add the timezone field to orders
- [x] LM
- [x] MD
- [x] Should we remove `cart = deleted`? YES
- [x] Find alternative `date_created` for Lolamarket orders where the current logic doesn't have a value:
- FROM_UNIXTIME(cart.date_deliveries_requested) AS date_created,
- [x] Do the change about the 100€ cart and other details of the delivery fee
> Check if dates in UTC
Checked. All is UTC atm.
> I don't know if I already say this, but for this old order (`3127`) at Mercadão, we should consider fulfill the `deliveryslotdate` using the date of the real delivery. In these cases, we shouldn't consider the `date_last_delivery_slot_end`. Should be null as the `date_last_delivery_slot_start`  because these values don't make sense: `select * from data_dw.dw_xl.fact_order where date_last_delivery_slot_start is null and id_data_source = 1;`
- What is the rule to obtain the "*date of the real delivery*"?
- "*Should be null as the `date_last_delivery_slot_start`  because ...*" <- Sorry, I'm not following. Order `3127` does not have a null value in `date_last_delivery_slot_start`
- [x] It's only this order: manually copy over the value from the `last` field to the original one.
> I'm not sure if we talked about these specific cases:
> `select * from data_dw.dw_xl.fact_order where date_last_delivery_slot_start is null and id_data_source = 1;`
> But I think that the `date_last_delivery_slot_end` should be null because if you check, doesn't make sense, because we don't have slots finishing to 00:30
> Maybe you could add this to the manual fixes
- [x] Manual fix
---
- Check why these 5 orders have no `date_delivered_closed` even though they are in status `delivered`
```
SELECT *
FROM data_dw.dw_xl.fact_order fo
WHERE fo.date_delivered_closed IS NULL
AND original_app_status = 'delivered'
```
# 20221123
- [x] User ETL
- ~~Refactor to only pass the config from the expectations and the query~~
- ~~Make new bucket~~
- ~~Make new schema~~
- Copy over in other flows
- ~~Update MD~~
- ~~Guest MD~~
- ~~Upgrade MD~~
- New lola
- Guest lola
- Update Lola
id_user_internal
- Usual
id_data_source
- 2
id_app_user
- Like id_user_internal
date_joined
- Usual
date_registered
- Usual
user_role
- {"ROLE_USER" ,"ROLE_WARNING" ,"ROLE_TEST" ,"ROLE_SUPER_WARNING" ,"ROLE_BANNED" ,"ROLE_INCIDENCES" ,"ROLE_ADMIN" ,"ROLE_SUPER_ADMIN" ,"ROLE_AMBASSADOR" ,"ROLE_SUPER_AMBASSADOR" ,"ROLE_B2B"}
- Not null
gender
- {"masculine", "femenine", "none"}
id_app_source
- {"1", "2", "3"}
- Not null
ip
- ip regex
locale
- "es"
priority
- int between 0 and 100
substitution_preference
- {"call", "nothing", "shopper"}
- not null
email
- regex
login
- regex
id_stripe
- the regex
poupamais_card
- null
created_at
- usual date stuff
updated_at
- usual date stuff
flag_is_active
flag_gdpr_accepted
flag_agreed_campaigns
flag_agreed_history_usage
flag_agreed_phonecoms
flag_agreed_terms
flag_email_verified
flag_rtbf_deleted
flag_is_guest_account
flag_is_social_account
phone_number
- regex
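A rough sketch of how a few of the rules above could translate into expectations (legacy great_expectations API, stand-in dataframe; the email regex is just an illustration):
```python
import pandas as pd
import great_expectations as ge

users = ge.from_pandas(pd.DataFrame({
    "user_role": ["ROLE_USER", "ROLE_ADMIN"],
    "id_app_source": ["1", "3"],
    "priority": [0, 50],
    "email": ["ana@example.com", "joao@example.com"],
}))
users.expect_column_values_to_be_in_set("user_role", [
    "ROLE_USER", "ROLE_WARNING", "ROLE_TEST", "ROLE_SUPER_WARNING", "ROLE_BANNED",
    "ROLE_INCIDENCES", "ROLE_ADMIN", "ROLE_SUPER_ADMIN", "ROLE_AMBASSADOR",
    "ROLE_SUPER_AMBASSADOR", "ROLE_B2B",
])
users.expect_column_values_to_not_be_null("user_role")
users.expect_column_values_to_be_in_set("id_app_source", ["1", "2", "3"])
users.expect_column_values_to_be_between("priority", min_value=0, max_value=100)
users.expect_column_values_to_match_regex("email", r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
```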
## Performance Review Guidelines
- 3 moments
- Write self-review
- Manager (João) provides feedback
- 1:1 with manager
- Timeline?
- Every year the same?
# 20221213
## Docker image research
### Chapter 1 - Pulling the image
If we want to be able to freely modify and update our prefect-flow images, first we must understand what's inside the current one, since we need to keep the current production flows up and running. And if we want to look into it, we first need to have it on a local machine.
My trouble began here: I tried installing docker in my Ubuntu WSL, but whenever I tried to run any docker command, I got the error message: `Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?`. Googling this was a pain in the ass. There are a thousand reasons for this error to happen and there are a million combinations of Windows + WSL + Docker setups which apparently have all sorts of different behaviours.
I finally managed to get something working by doing this:
- Get docker working
- I installed Docker desktop in Windows.
- I installed the Debian WSL from the Microsoft Store.
- In the Docker Desktop GUI, I went to Settings (the little gear, top right) -> Resources -> WSL Integration and enabled the integration with the Debian engine.
- Debian can now use docker (as long as this Docker Desktop GUI is up and running).
- Note: our current Ubuntu WSL is not usable for this because it is a WSL Version 1 (don't ask me what this means). On the other hand, the Debian I installed is a WSL Version 2, which is correctly identified by the Docker Desktop GUI.
- Get permission to pull the image from AWS ECR
	- Whoever wants to pull an image from our ECR must have the following policy in AWS: `AmazonEC2ContainerRegistryReadOnly` (note that there are higher-level policies such as `AmazonElasticContainerRegistryPublicPowerUser` or `AmazonElasticContainerRegistryPublicReadOnly` that would also work, but the read-only one is enough for this purpose).
- Get `awsume` set up in the Debian env (you can follow instructions here: https://awsu.me/general/quickstart.html)
- Get your credentials set up in `~/.aws/credentials` and use `awsume`
- Login into our AWS Docker Registry with this command: `aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin 373245262072.dkr.ecr.eu-central-1.amazonaws.com`
- You should be able to pull images from the repo like this: `docker pull 373245262072.dkr.ecr.eu-central-1.amazonaws.com/pdo-data-prefect:latest`
A good resource to understand the AWS login stuff that's going on under the hood, so that it doesn't feel like dark magic:
- Video explaining how to login to the AWS CLI with MFA: https://www.youtube.com/watch?v=EsSYFNcdDm8
### Chapter 2 - Understanding what does our image have
I researched a bit and apparently there are a couple of ways to reverse-engineer a Dockerfile out of an image (https://appfleet.com/blog/reverse-engineer-docker-images-into-dockerfiles-with-dedockify/, https://stackoverflow.com/questions/48716536/how-to-show-a-dockerfile-of-image-docker). Once the image was in my hands, the next part was understanding what was inside.
To do so, I ran the following command and got the following result:
```bash
$ docker history 223871741b8a
IMAGE CREATED CREATED BY SIZE COMMENT
223871741b8a 6 months ago /bin/sh -c #(nop) ENTRYPOINT ["tini" "-g" "… 0B
<missing> 6 months ago /bin/sh -c #(nop) COPY file:6068d9a0511b2a94… 795B
<missing> 6 months ago /bin/sh -c pip install trino 267kB
<missing> 7 months ago /bin/sh -c #(nop) ENV LANG=C.UTF-8 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV LC_ALL=C.UTF-8 0B
<missing> 7 months ago /bin/sh -c #(nop) ENTRYPOINT ["tini" "-g" "… 0B
<missing> 7 months ago /bin/sh -c #(nop) COPY file:e1bbbe4447dfaf1e… 795B
<missing> 7 months ago |4 BUILD_DATE=2022-04-27T21:09:36Z EXTRAS=al… 505MB
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.bu… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.vc… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.ve… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.ur… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.na… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL org.label-schema.sc… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL io.prefect.python-v… 0B
<missing> 7 months ago /bin/sh -c #(nop) LABEL maintainer=help@pre… 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV LANG=C.UTF-8 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV LC_ALL=C.UTF-8 0B
<missing> 7 months ago /bin/sh -c #(nop) ARG BUILD_DATE 0B
<missing> 7 months ago /bin/sh -c #(nop) ARG GIT_SHA 0B
<missing> 7 months ago /bin/sh -c #(nop) ARG EXTRAS 0B
<missing> 7 months ago /bin/sh -c #(nop) ARG PREFECT_VERSION 0B
<missing> 7 months ago /bin/sh -c #(nop) CMD ["python3"] 0B
<missing> 7 months ago /bin/sh -c set -eux; savedAptMark="$(apt-m… 11.4MB
<missing> 7 months ago /bin/sh -c #(nop) ENV PYTHON_GET_PIP_SHA256… 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV PYTHON_GET_PIP_URL=ht… 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV PYTHON_SETUPTOOLS_VER… 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV PYTHON_PIP_VERSION=22… 0B
<missing> 7 months ago /bin/sh -c set -eux; for src in idle3 pydoc… 32B
<missing> 7 months ago /bin/sh -c set -eux; savedAptMark="$(apt-m… 28.1MB
<missing> 7 months ago /bin/sh -c #(nop) ENV PYTHON_VERSION=3.7.13 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV GPG_KEY=0D96DF4D4110E… 0B
<missing> 7 months ago /bin/sh -c set -eux; apt-get update; apt-g… 3.11MB
<missing> 7 months ago /bin/sh -c #(nop) ENV LANG=C.UTF-8 0B
<missing> 7 months ago /bin/sh -c #(nop) ENV PATH=/usr/local/bin:/… 0B
<missing> 7 months ago /bin/sh -c #(nop) CMD ["bash"] 0B
<missing> 7 months ago /bin/sh -c #(nop) ADD file:8b1e79f91081eb527… 80.4MB
```
I quickly realised most of the image didn't look like something Ana might have modified manually, but rather standard layers. I read the Prefect docs on the topic (https://docs-v1.prefect.io/orchestration/flow_config/docker.html) and then everything clicked in my mind:
- Prefect provides a standard image to run the flows in.
- The only thing Ana had done was to extend that image by installing the Trino Python client in it.
This is great news because it means we don't need to do any black magic to make new images that can be used by the old flows. We can simply grab the standard prefect image, add our required packages, apply any other changes we need, and that's it.
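A minimal sketch of what that extension looks like as a Dockerfile (the exact base tag is an assumption based on the Prefect 1.2.2 / Python 3.7.13 versions visible in the history above):
```Dockerfile
# Standard Prefect 1.x base image (tag assumed from the versions seen in `docker history`).
FROM prefecthq/prefect:1.2.2-python3.7

# The only customisation visible in the current image: the Trino Python client.
RUN pip install trino

# Future packages (Great Expectations, our own packages, ...) would be added the same way.
```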
### Chapter 3 - From now on
- Try to make a new image and check if it works -> Yes!!!
- Design our image design and building pipeline
- Create a Git repository to document our images.
https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html
- [x] Do the performance review self-assessment
- Make new repository and organize image modification pipeline
- Manage to get a custom package inside our image
- Start packaging things
- Connections
- Slack alerts
- General bloat
- Great Expectations
- Common patterns
## Hand-off with Dani
- DMS (Data Migration Service)
- The thingy that moves data from LM Replica to Redshift
- DMS is configured to be real-time
# 20221214
## Payment process for Poland
- How will the shopper invoicing work in Poland?
- Which data should we look at in DB?
- Shopper Bill?
- New stuff in Flex schema?
- What is "the division between Delivery fees and CPO"
![[Pasted image 20221214123536.png]]
# 20230102
- [x] Fix the bug with the LM update users pipeline
- [x] Finish PR
- [x] Tag production image and clarify in docs that latest and production are synonymous
- [x] Finish the cute drawing
- [x] Review CVs
- [x] Make the hello-world package
- [x] Uploading to pip
- [x] Create new S3 bucket for Python packages
- [x] Write generic guide on how to make any package uploadable to the S3 bucket
- https://pdofonte.atlassian.net/wiki/spaces/DATA/pages/2440658945/Custom+Python+Packages+Distribution
- [x] Make specific guide on how to upload `bom-dia-mundo` to S3
- [x] Modify build process to include custom packages
- [x] Create config file with list of packages and versions
- [x] Create script to make temp copy of it
- [x] Manage to get it installed
- [x] Run bom-dia-mundo
- [x] Document and discuss with João
- [x] Prepare development plan with João
- [x] Review the data architecture summary with João
- [x] Document Metabase downtime, update, etc.
- [x] Store creds for AWS
- [x] Fix lm update users
- [x] Take a look at API thing
- [x] Document procedure to open connections in Looker for future reference
- [ ] Ideas for Dani's gift
- [ ] A haircomb and a hairdresser giftcard
- [ ]
https://community.looker.com/technical-tips-tricks-1021/mysql-howto-why-does-mysql-need-database-permissions-for-derived-tables-25386
![[Pasted image 20221216114751.png]]
## Interview methodology
- Technical interview
- Small E2E, case, common sense
- Checklist of tool questions
- Silly coding task, automated
# 20230103
## Looker connection to data_dw
- RDS: https://eu-central-1.console.aws.amazon.com/rds/home?region=eu-central-1#database:id=data-prod-mysql;is-cluster=false
- Security Group: https://eu-central-1.console.aws.amazon.com/ec2/v2/home?region=eu-central-1#SecurityGroup:groupId=sg-0275d4096c2890cea
- Subnets
- https://eu-central-1.console.aws.amazon.com/vpc/home?region=eu-central-1#SubnetDetails:subnetId=subnet-42945a29
- https://eu-central-1.console.aws.amazon.com/vpc/home?region=eu-central-1#SubnetDetails:subnetId=subnet-8a955be1
- https://eu-central-1.console.aws.amazon.com/vpc/home?region=eu-central-1#SubnetDetails:subnetId=subnet-0eaed173
- https://eu-central-1.console.aws.amazon.com/vpc/home?region=eu-central-1#SubnetDetails:subnetId=subnet-79b4b803
- VPC: https://eu-central-1.console.aws.amazon.com/vpc/home?region=eu-central-1#VpcDetails:VpcId=vpc-6a231802
# 20230109
- [x] Chase Carlos for the connection
- [x] Send code assignment to review
- [x] Fix the data_dw config myself
- [x] stop the instance
- ~~make a backup of the instance~~ -> https://eu-central-1.console.aws.amazon.com/rds/home?region=eu-central-1#db-snapshot:engine=mysql;id=snapshot-for-subnet-config-changes-20230103
- ~~Start the instance again (otherwise the networking stuff can't be modified)~~
- [x] move it to default-vpc-4012372b(pdo-uat) subnet group
- [x] move it back to pdo-prod-public subnet group
- ~~Add again the security group with the IP exceptions~~ (it was removed with the subnet group migrations)
- [ ] Delete the snapshot?
- [x] Send message to Liliana
- Upgrade of the replica to MySQL 8.0
- [x] Create chained replica of existing replica on MySQL 8.0
	- When creating this, I couldn't specify the MySQL version to be 8.0. It's either not allowed, or I missed it. Instead, I created the chained replica with 5.7 and afterwards upgraded it to 8.0
- [x] Upgrade the second replica to MySQL 8.0
- [x] Check that it replicates properly
- [ ] Modify Looker LM connection to point to second replica
- (Exploring, testing and debugging can start at this point, with some potential downtime while we do changes in the guts of MySQL)
- [ ] Perform a recoverable snapshot/backup/something of the old read replica
- [ ] Test recovering from the snapshot/backup
- CHECKPOINT only move forward if both of these are true
- [ ] We have confirmed that Looker + MySQL 8 connector satisfies our needs
- [ ] We feel comfortable rolling back the MySQL old replica from v8 to v5 if we screw up
- [ ] Execute upgrade of old replica to MySQL 8
- [ ] Switch Looker connection again to point to old replica
- [ ] Check (if something below breaks, grab logs and rollback)
- [ ] ETLs reading from old replica work fine
- [ ] Looker explore works fine
- [ ] Nothing else has broken
- Cleanup actions
- [ ] Remove second read replica
- [ ] After some prudent time (couple weeks, a month?), remove snapshots/backups from old replica in MySQL 5 version
- Related doc
- General info on read replicas https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_MySQL.Replication.ReadReplicas.html
- General info on upgrades https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_UpgradeDBInstance.MySQL.html
**The busy day**
- Validate old dependencies on second replica
~~- Point Trino
- Backup old connection config of `app_lm_mysql.properties`
  ```
  connector.name=mysql
  connection-url=jdbc:mysql://lolamarket-rr2.cnlivtclari7.eu-west-1.rds.amazonaws.com
  connection-user=trino-bi
  connection-password=WdRpC6aHS4n7UnG2K9z
  case-insensitive-name-matching=true
  ```
- Modify connection URL
- After this, Trino is still reading from the old replica (validated by throwing queries through Trino and observing the monitoring pages of both replicas)
- Reboot Trino somehow
- We used the redeploy button in Rancher on the worker workload
- It worked. After a couple of minutes, the redeploy was successful. Queries now hit the second replica instead of the old one.~~
~~- Point ETL
- It is done automatically after Trino points there. The ETLs never connect directly to the LM replica
- Just run the ETLs and check if it runs~~
- Point Redshift Migration
- ~~Backup current endpoint configuration so we can go back to it -> it's simply the RDS endpoint~~
- ~~Stop the database migration tasks (otherwise we can't modify the endpoint)~~
- ~~Modify endpoint to point to second replica instead of old one~~
- ~~Start again the task~~
- It breaks with an error
- `Last Error Failed to connect to database. Task error notification received from subtask 0, thread 0 [reptask/replicationtask.c:2800] [1020414] Error 2019 (Can't initialize character set unknown (path: compiled_in)) connecting to MySQL server 'lolamarket-rr-mysql8-test.cnlivtclari7.eu-west-1.rds.amazonaws.com'; Errors in MySQL server binary logging configuration. Follow all prerequisites for 'MySQL as a source in DMS' from https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html or'MySQL as a target in DMS' from https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.MySQL.html ; Failed while preparing stream component 'st_0_KURC747DBGGXBD44ND6JF6DFZ4'.; Cannot initialize subtask; Stream component 'st_0_KURC747DBGGXBD44ND6JF6DFZ4' terminated [reptask/replicationtask.c:2808] [1020414] Stop Reason RECOVERABLE_ERROR Error Level RECOVERABLE`
- We decide to:
- Make a new parameter group and set `binlog_format` to `ROW` (https://eu-west-1.console.aws.amazon.com/rds/home?region=eu-west-1#parameter-groups-detail:ids=lolamarket-replica-8-custom-params;type=instance)
- Assign this parameter group to the second replica
- Reboot the replica
- Restart the DMS task
- Pivot. We decide to leave DMS dead.
- Before closing on the DMS topic, repoint the DMS endpoint to the old replica, knowing that it will fail once it gets upgraded to v8
- Validate... how?
- By checking if new carts appear in Redshift
- ~~Check that metabase can connect to mysql 8 -> It can~~
- Point old dependencies back to old replica
- Point Trino to old replica
- ~~Change config back to how it was~~
- ~~Redeploy workers~~
- ~~Validate with the query game again~~
- ~~Upgrade old replica to MySQL 8~~
- ~~Take snapshot while in 5.7~~
- ~~Go to `Modify` menu and bump to latest version~~
- ~~Give it some time~~
- ~~Validate that everything is fine~~
- ~~Connect Looker to old replica on MySQL 8~~
- ~~Remove second replica~~
- ~~Make stories for things that we could deprecate
- Redshift and DMS
- Metabase contents reading from redshift~~
- [x] Read requirement documents and add my comments
- Capacity: Setting app
- I agree with Carlos Matias on the non-overlapping of the delivery areas. Implementing this in a web UI might be a nightmare. We might be better off using some GIS software such as QGIS and writing a little guide on how to create the polygons with it + implementing some kind of validator to prevent user mistakes.
- If there are areas within areas, a strong hierarchy control must be in place. Each child area should only have one parent.
- The data team should have a way to obtain the polygon files from the areas. Something programmatic would be ideal, since as we grow the number of areas might make manual downloads infeasible.
- It's key that the history of areas is kept. Removing or modifying an area should not delete the old data, but rather archive the old version. If we fail to do this, looking at historical data will be a mess and we won't be able to answer questions regarding operational areas properly.
- Orchestration: Dispatching orders
- What happens to orders that have been rejected by all shoppers?
- How are orders assigned to a fleet?
- Scenario: customer Alice places an order. Shopper Zac sees the order and accepts it. Some time afterwards, customer Alice cancels the order. How does shopper Zac learn about this? Is there any notification? Does shopper Zac have access to a list of cancelled-by-the-user orders? The rationale for my question is: as a shopper in the field, you are not proactively checking your entire order list constantly (you get a bit of tunnel vision and focus only on the order at hand and perhaps the next). I think if there is no notification to push the cancel "in the face" of the shopper, it could go unnoticed and the shopper might be planning his day assuming that order still needs to happen.
-
- [x] Draft technical interview case
- [x] Code
## Ygor Gomes
- From the north of Brazil
- 7 years as DevOps
- Knows Hive, Databricks, Hadoop
- Married, 2 kids
- He asks for docs. Boss
# 20230116
- Review https://pdofonte.atlassian.net/browse/DATA-932
## Interview with Afonso
- Information Systems
- Learn about ETLs and Dashboarding
- Data Structures and Algorithms
- Databases and relational models
- IESE Business school (Summer)
- AESE Business school (Summer)
- Shopper in Mercadão (#7)
- Suggested a change in ticketing systems
- Logistics startup
- Fill empty trucks of different providers
JAVA
Python
Prolog
Assembly
HTML/CSS
Database
TSQL
### My Impression
- Overall
- Afonso is a university student with a good profile. He has the gaps that one would expect from not having working experience, but he has the beginner skills and knowledge that are needed to start in our world and seems to be smart. Hiring him will mean investing time and effort in building his skillset. But I think he is smart enough to learn fast and eventually become a good engineer.
- A disclaimer: because of the previous points, I wouldn't advise hiring Afonso on a part-time basis. Afonso will face a non-trivial ramp-up period until he is productive within the team, so working part-time could translate into waiting months before being net positive for the team.
- Even though I genuinely think Afonso could work with us as a Junior member, I think we should try to assess slightly more experienced candidates who would be comfortable with our compensation. As much as Afonso shows great potential for a fresh graduate, the right profile with 1-2 years of experience could be orders of magnitude more useful to the team.
- On the case
- Was the candidate able to draw a complete solution from the data sources (Pingo Doce, Orders database) to the business users (Finance team)?
- Afonso was able to draw a rough but complete processing and monitoring scheme to go from raw data to the final data that would be needed.
- On the other hand, he struggled to find the right component to host data to be accessed from Looker, and didn't suggest how to turn email data into S3 files.
- Was the candidate curious about the business requirements? Did he make an effort to deeply understand what his business colleagues want to achieve so that he can propose the most suitable solution? Did he think out of the box to realize what was "the request behind the request"?
- Afonso asked good initial questions to better understand the situation, both data and architecture wise. He asked early if data was available to check and get an idea on what he needs to work on.
- On the other hand, Afonso didn't dive a lot into how the solution would change the finance team's way of working. I would have expected a bit more curiosity around finance's problem and brainstorming over what the solution would be from their point of view, regardless of how we built it in the backend.
- Were the tools and techniques chosen by the candidate the best ones for each component in the system?
- Afonso made good proposals on using lambda functions to host Python code and S3 buckets to handle the data as flow of files.
- On the other hand, he didn't cover how to fetch an email into S3, and more importantly, couldn't correctly identify the need for a SQL database to store the processed information for Looker to access it.
- Did the candidate challenge the fact that Pingo Doce sends the orders through email, and explored the idea of having an alternative ingestion method that is more powerful and consistent?
- No.
- Did the candidate realize that a very simple, version 1 solution could be delivered to the finance team to provide early value, while more advanced solutions could be worked on?
- No.
- Did the candidate ask to see examples of the raw data?
- Yes, early. And made good remarks, paying attention to what info was contained, what were the primary keys and how it should be turned into a more structured format than excel.
- Did the candidate show an interest for the size of the data being handled, in order to take scalability matters into account when designing the system?
- No.
- Was the candidate able to decompose the request into different, independent user stories? Such as: ingest and clean the data, automate the detection of unmatched or inconsistent orders, automate the suggestion of data correction.
- No.
- Additional points
- Surprisingly, Afonso didn't have a single question to ask at the end of the interview.
# 20230208
## Interview with Timóteo
He's from Oporto
Started as micro IT helpdesk, scaled to IT infrastructure
Went into big data support:
- Bash
- Ansible
- "I did mostly scripts and firmware upgrades"
- Azure, Cloudera
Now in Jumia:
- Building ETL pipelines in Airflow
- Python and SQL coding, first experiences "I'm doing basic pipelines"
- Doing Udemy course in Python
- "I consider myself a junior"
- Acts as a bit of an admin on rancher
- Apache Nifi
- AWS
- He's not sure about many things. Red flags
He will be fired in Jumia in February.
Tools
Checklist
- Python
- Git
- SQL
- Trino
- Github
- Jira
- Confluence
- Looker
- AWS
### My Impression
- Overall
- Timóteo has very little experience in Data Engineering. His knowledge is mostly on administering and maintaining infrastructure rather than designing and developing systems. He has trouble coming up with detailed designs for how to store, move and transform data. His approach to the challenge was unstructured, and most of the time he had trouble communicating his proposal in a clear and understandable manner.
- I would propose not moving forward with Timóteo. I think his experience adds little value towards his position and I don't think his soft skills are enough to compensate for it.
- On the case
- Was the candidate able to draw a complete solution from the data sources (Pingo Doce, Orders database) to the business users (Finance team)?
- Was the candidate curious about the business requirements? Did he make an effort to deeply understand what his business colleagues want to achieve so that he can propose the most suitable solution? Did he think out of the box to realize what was "the request behind the request"?
- Were the tools and techniques chosen by the candidate the best ones for each component in the system?
- Did the candidate challenge the fact that Pingo Doce sends the orders through email, and explored the idea of having an alternative ingestion method that is more powerful and consistent?
- Did the candidate realize that a very simple, version 1 solution could be delivered to the finance team to provide early value, while more advanced solutions could be worked on?
- Did the candidate ask to see examples of the raw data?
- Did the candidate show an interest for the size of the data being handled, in order to take scalability matters into account when designing the system?
- Was the candidate able to decompose the request into different, independent user stories? Such as: ingest and clean the data, automate the detection of unmatched or inconsistent orders, automate the suggestion of data correction.
- Additional points
---
# 20230214
## Meeting with Ygor
- Databases
- Lola backend
- Mercadão backend
- DW
- Trino
- Prefect
- Other stuff in AWS
- ECR
- S3 buckets
- for flows
- for great expectations
- for python packages
- Wishlist
- Improve Trino uptime
- UAT Prefect server
- Improve automations on CI (docker, python packages)
- Airbyte
# 20230217
## Interview with Ricardo
- Ricardo's profile
- Python
- Machine Learning
- Exact optimization, metaheuristics and simulation
- Professional experience and role at UPF
- Macroeconomics
- IB at UPF and Computational and Mathematical Engineering at UOC
- Banc Sabadell
- BI
- 
- About me
- About the course
Only afternoons, from 3 PM onwards
Seminars
## Interview with Rui
- From Porto, lives in Lisbon
- Started working at Caixa General as a Business Intelligence Data Engineer
- First project: built the infrastructure from scratch with Apache Nifi
- Groovy, Java
- Sonae, Azure cloud
# 20230221
## Silent failing flows for MD users
- Why are they failing?
- Because the connection object passed to the great expectations task doesn't have the `raw_user` attribute anymore (it used to).
- This was introduced in the last release of the dim user project (1.0.3), when the connections stopped being managed by the hardcoded tasks and were refactored to use lolafect, which does not implement the `raw_user` and `raw_password` trick.
- Because of this, connecting to run the GE test fails.
- Why are they not sending alerts?
- Because of how we named the transactions and basically built the flow, none of the `final_tasks` fails when `t_00X` fails.
- The slack messaging is configured to send an alert if any of the `final_tasks` fails.
- How to fix
- The simplest option is to migrate already the data test to `lolafect`.
- It's important that the `final_tasks` design mistake gets fixed so that, in the future, the slack messages get sent.
- I was about to do this, but I see you have already started work beyond the open release and are implementing this, so I'll just leave it be and let you carry on once you are back (or we can discuss if I should take over)
-
### Fix
Okay, we are going to work together. Here is some context for you to be aware of.
- We will be writing code for a Prefect 1 flow.
- We are using Great Expectations to evaluate some data on a MySQL database.
Review the user flows that fail because lolafect does not support the raw user and password trick, and investigate why the slack alerts do not happen
- Play around with transactions in the Python Trino client to understand them and how we can use them better in ETLs (see the sketch below)
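A rough sketch of what that could look like with the `trino` Python client; the connection details and the table name below are placeholders, not our real config.
```python
import trino
from trino.transaction import IsolationLevel

# Any isolation level other than AUTOCOMMIT makes the connection transactional,
# so the inserts below only become visible together when commit() runs.
conn = trino.dbapi.connect(
    host="trino.internal.example",   # placeholder
    port=8080,
    user="data-team",                # placeholder
    catalog="dw",                    # placeholder
    schema="sandbox",                # placeholder
    isolation_level=IsolationLevel.READ_COMMITTED,
)
cur = conn.cursor()
try:
    cur.execute("INSERT INTO tmp_orders VALUES (1, 'a')")
    cur.fetchall()  # drive the statement to completion before sending the next one
    cur.execute("INSERT INTO tmp_orders VALUES (2, 'b')")
    cur.fetchall()
    conn.commit()    # both rows appear atomically
except Exception:
    conn.rollback()  # neither row is applied
    raise
finally:
    conn.close()
```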
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
# 20230223
## Discussion on Afonso and Rui
- Liliana
- She thinks both are good culture-fit wise
- Afonso pro: operations knowledge
- Afonso pro: ambitious, eager
- Afonso dis: he might leave if he is not challenged enough
- Rui pro: knowledge and experience
- Rui cons: more expensive
- Me
- Let's go for Rui if we want someone who knows his stuff, let's go for Afonso if we want to truly have a junior
Rui questions (deadline for yesterday)
Afonso ambitious multinational
# 20230308
## Airbyte vs Fivetran
- Criteria
- Cost
- **Fivetran**: 5 digits per year. Susceptible to large bills due to developer mistakes. + DevOps cost for integration.
- **Airbyte**: infra cost (maybe 2K per year or something like that) + DevOps cost for maintenance.
- **Winner**: airbyte will easily be 10 times cheaper with current needs. As needs increase, the gap will widen.
- Security
- **Fivetran**: data travels through their infra. It does so encrypted, which theoretically means they see nothing. But we still need to keep an eye on the topic to ensure we are happy with however they are doing things. Credentials storage???
- **Airbyte**: everything stays at home. We do have the responsibility of security ourselves, but it shouldn't be much of a headache. Airbyte should only be accessible within our VPN and with password based access.
- **Winner**: airbyte makes this much simpler.
- **Connectors**:
- **Fivetran**:
- Has connectors to everything we need.
- The evolution will probably be slow (they only have so much software firepower)
- Developing custom connectors is off the table.
- **Airbyte**:
- Has connectors to everything we need. Some of them are in alpha and beta though. Regardless, it should probably mature fast.
- The evolution will probably be fast (open source, everyone and their mother will join)
- Developing custom connectors is on the table. It's designed for it.
- **Winner**: short term, it's a tie. Long term, Airbyte has the upper hand.
- **Transformations**: both work with dbt. I don't appreciate any significant differences on this point.
- **Other**
- By simple game theory, Airbyte will outpace Fivetran. It's a solid, production-ready, open source copycat of Fivetran. Adoption will eventually outpace Fivetran's and the network effect of open source will generate a positive spiral.
- No lock-in: we do not really depend on Airbyte (as a company) for anything. If some day they go stupid, we can still use the older versions. Also, a community fork would probably appear with the general interest in mind (see PrestoSQL and Trino as an example)
## Guillermina
Works at Kadre
Tether, previously Luma.
About me
- Education and experience
- Writing professional grade Python for many years
Offer
- I am familiar with the problem of grid balancing, and I feel attracted by it
- Position
- Team,
- who do I report to,
- who else is there
- Why is there an army of interns?
- (I see you are hiring for many positions)
- Selection process
- Next: meet the client with Martim or Luis
- Technical interview
- Where is the company at, maturity wise
- Business model
Founders: Luis and Martim
Connecting thousands of chargers in real time
Only two people + 5 interns
They are looking for two profiles: backend/devops
Direct contract with the company
Why Luma
- Interesting problem
Why not lolamarket
-
https://legend.lnbits.com/wallet?usr=f0f7d962a8f44b179529dfbbf8d51d13&wal=a286ac99b8504fea858c6537d921983d
4f0215f77a0b4e33b1ca802fc21fa6cf
# 20230320
## Townhall
- Ricardo and Gonçalo are leaving the company.
-
# 20230324
## Q&A
- Will XL be an independent division? -> Yes.
- Do we need to go to the office like Glovo guys -> No.
## Slack messages not sent
### Initial symptoms
The following flows failed due to a Trino outage, but the slack warnings were not sent:
- `006_md_03`
- `006_md_04`
- `006_md_05`
- `003_md_01`
- `003_lmpl_01`
- `004_md_01`
- `006_lmpl_01`
- `006_lmpl_02`
- `006_lmpl_03`
- `013_010`
- `007`
- `017_pt_01`
The following flows failed due to the same outage, but did send the slack message correctly:
- `011_md_01`
- `011_lm_01`
- `015_md_02`
- `001_20`
### First exploration
To explore this issue, I focused on the flow `013_010`, version `1.0.3` (the version that was scheduled in the prefect server).
- The logs of the failed run showed that the slack message task had been skipped:
```
Task 'SendSlackMessageTask': Finished task run for task with final state: 'Skipped'
```
- I decided to reproduce the error:
- First I ran the flow locally as-is: the execution was successful.
- I then executed the flow again locally while manually introducing an exception in the same task that failed in the scheduled run due to Trino. The Slack message was sent successfully.
- This was confusing. I was expecting the same behaviour as with the failed prefect server run
- I reviewed the logs of the prefect server run that failed and didn't send the slack message again.
- I noticed that, besides the slack message task, the ssh tunnel closure task also had been skipped:
- `Task 'close_ssh_tunnel': Finished task run for task with final state: 'Skipped'`
- This was reasonable, since the ssh tunnel is not used in the prefect server. Nevertheless, it made me wonder: is the usage of the ssh tunnel responsible for the differences between local runs and server runs? There was no a priori reason to think so, but it was the only difference I could find between my local environment and the prefect server one.
- I then observed the code once more and realized that the slack message task had as its upstream tasks all the final tasks.
```
final_tasks = [transaction_end, trino_closed, dw_closed, tunnel_closed, t_090]
[...]
send_message = send_warning_message_on_any_failure(
webhook_url=channel_webhook,
text_to_send=warning_message,
upstream_tasks=final_tasks
)
```
- And that the closure of the SSH tunnel was considered one of the final tasks.
- Theoretically, this shouldn't matter. The slack message task is configured with a trigger of `any_failed`, so whether the ssh tunnel closure task was being successfully done or skipped was irrelevant as long as some other final task was failing (which was the case).
- Nevertheless, I decided to try the following:
- Modify the code to remove `tunnel_closed` from the final tasks.
- Upload this version to the prefect server, along with a manually induced exception on the same task that failed previously due to Trino.
- Observe the logs.
- Result: it worked! This time, the slack message task triggered properly and sent the message.
In conclusion, it seems that having the tunnel closed task in the final tasks somehow forces the slack message task to skip.
### Open questions
- Why the hell does that happen?
- Is this the reason for all the silent failing flows we've seen? Or is this only specific to flow `013-010`?
- Are there other flows that do not have the `tunnel_closed` as part of the final tasks, and hence are sending the slack messages correctly when failing?
### Next actions
- Answer the open questions
- Apply the fix to all flows that have `tunnel_closed` in their final tasks.
### Answering open questions
- **Is this the reason for all the silent failing flows we've seen? + Are there other flows that do not have the `tunnel_closed` as part of the final tasks, and hence are sending the slack messages correctly when failing?**
- I pick a sample of the other flows to check this.
- Faulty flows:
- `006_md_03` -> Has `tunnel_closed` in `final_tasks`
- `003_md_01` -> Has `tunnel_closed` in `final_tasks`
- `006_lmpl_01` -> Has `tunnel_closed` in `final_tasks`
- `007` -> NA, extremely outdated, doesn't use `lolafect`
- Properly working flows
- `011_md_01` -> Has `tunnel_closed` in `final_tasks`
- `011_lm_01` -> Has `tunnel_closed` in `final_tasks`
- `015_md_02` -> Has `tunnel_closed` in `final_tasks`
- `001_20` -> Has `tunnel_closed` in `final_tasks`
- The hypothesis is not holding. `tunnel_closed` can be in `final_tasks` and everything will work fine.
- But, I have observed the following difference between the flows in both blocks: the faulty flows are using two `case` blocks to deal with connecting with or without an SSH tunnel. The working flows are using the pre-`lolafect` approach.
- **Why the hell does that happen?**
- Given the answer to the previous questions, it seems that the following pattern in flows is the one that causes the issue:
- The ssh tunnel opening happens within a `case` block that doesn't activate when running on the prefect server.
- The output of the ssh tunnel opening task gets referenced in the ssh tunnel closure task.
- The output of the ssh tunnel closure task is listed as a final task.
- The list of final tasks is passed to the slack warning task as the list of upstream tasks.
- The solution seems easy:
- Either put the ssh tunnel opening output outside of the `case` block
- or do not include the ssh tunnel closure output in the final tasks (a minimal sketch of this option follows below)
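A minimal sketch (Prefect 1) of the second option; task names, the webhook URL and the failure simulation are placeholders, not the real flow code.
```python
from prefect import Flow, task
from prefect.triggers import any_failed

@task
def refresh_data():
    raise RuntimeError("simulated Trino outage")  # stand-in for the ETL task that fails

@task
def close_ssh_tunnel(tunnel):
    pass  # in the real flows this sits inside a `case` block and ends up Skipped on the server

# skip_on_upstream_skip=False is extra insurance: by default Prefect 1 skips a task
# whenever any of its immediate upstream tasks is skipped, regardless of the trigger.
@task(trigger=any_failed, skip_on_upstream_skip=False)
def send_warning_message_on_any_failure(webhook_url, text_to_send):
    print(f"POST {webhook_url}: {text_to_send}")  # the real task posts to Slack

with Flow("example-flow") as flow:
    last_step = refresh_data()
    tunnel_closed = close_ssh_tunnel(None)
    final_tasks = [last_step]  # tunnel_closed deliberately left out of the final tasks
    send_warning_message_on_any_failure(
        "https://hooks.slack.com/services/placeholder",
        "flow failed",
        upstream_tasks=final_tasks,
    )

# flow.run()  # locally, the simulated failure should end with the warning task in a Success state
```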
### Dummy SSH tunnel task test
I decided to test the following hypothesis:
- If, instead of only creating the task output `ssh_tunnel` within the `case` block, we also create a dummy output with the same name earlier, outside of the `case`, does it fix the issue?
- No, it doesn't. This is puzzling me even harder.
https://pdofonte.atlassian.net/wiki/spaces/DATA/pages/2379284485/Notifying+flow+failures+through+Slack#The-evil-slack-error
- Fixed the [evil slack error](https://pdofonte.atlassian.net/wiki/spaces/DATA/pages/2379284485/Notifying+flow+failures+through+Slack#The-evil-slack-error) that prevented slack messages from being sent on failure.
### Fixing flows
- `006_md_03`: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- `006_md_04`: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- `006_md_05`: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- `003_md_01`: https://github.com/lolamarket/data-003-etl-xl-order/pull/9
- `003_lmpl_01`: https://github.com/lolamarket/data-003-etl-xl-order/pull/9
- `004_md_01`: done
- `006_lmpl_01`: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- `006_lmpl_02`: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- `006_lmpl_03`: https://github.com/lolamarket/data-006-etl-xl-user-dim/pull/18
- `013_010`: https://github.com/lolamarket/data-013-etl-xl-exchange-rate/pull/8
- `017_pt_01`
## Update users failed double flow
- The flow failed
- It's because the GE data test failed
- The GE data test failed on an expectation on uniqueness for the `id_app_user`
- The quarantine was not made because the code has a bug. The trigger is not the right one. This happens in (at least) all the MD user flows.
- In which PR was this introduced?
- It seems that it was in 1.1.2.
- It also affects LMPL 2 and 3.
- It affects all MD flows
- ~~Send EVO payment~~
- ~~User for Looker~~
- ~~Check data issue on MD03~~
- ~~Review PR for João https://github.com/lolamarket/data-003-etl-xl-order/pull/10~~
- Keep working on the evil slack bug
- Clean up quarantine flow
- Add onboarding steps for Afonso
## Martim
- My thousand questions
- Background
- What is the story of Tether? When did you start, what has happened since then, where are you today?
- What is your roadmap?
- How is the equity of the company structured?
- Product/Service
- Who pays who for what?
- Frequency Containment Reserves:
- TSO: how much power are you going to produce
- They act as a bidder on the power market
- CPO Charge Point Operator
- Owners get smart charging (charge when price is cheap)
- What is the potential for this market?
- How the hell are you going to control thousands of charging stations? Isn't it tremendously expensive to run such a field operation? What do the numbers look like?
- How are you going to make the experience for car owners not a complete pain in the ass?
- What is your geographical scope?
- How does the evolution of the electric vehicle fleet impact your own roadmap?
- Team and position
- What is the current team composition? What other positions are you hiring for?
- Why do I only see a bunch of interns?
- How would I fit in? What would my work look like during my first year?
- What is your personal work culture? What do you like and don't? Could you describe me people and projects you enjoyed and people and projects you despised?
-
- Martim
- Portuguese
- Bored about uni, wanted to go abroad
- Luis, met in the masters, worked in energy in the US
The Pitch podcast - Bodyguard of the grid
Beta version running on computers
Scheduling from partners
Test version by end of the year (30-40 vehicles at first, up to 8000 at some point)
Revenue at the beginning of the year
Hiring for a devops engineer
Cloud advisor, business advisor, ML advisor (might join the team)
- Optimization Engineer <- me
- Devops Engineer
- Managing next employees
## Onelabel databases
- Fer are going for a relational model, probably Aurora MySQL
- How does this play with events?
- Conventions for data types (money, dates)
What I would like:
- Clear documentation
- Representation layer decoupled from
- Something airbyte-able
# 20230517
## Glovo Data Platform meeting
### Briefing
We meet with:
- Kiran
- Simone
- Charly
They are somehow related to Glovo's event data platform. We meet to learn about how they work so we can get some inspiration. Opening questions:
- How does the data platform work and how is it structured
- What type of access do we need to drive events there and handle them
- What is it possible to do and not do
- What kind of dependencies there will be (PR approvals, team roadmap dependencies, etc.)
### Notes
Simone and Charly came around.
Charly -> Manages the central data engineering team.
Simone -> Manager in one of the central data teams (data platform creation team (ingesting and transforming data))
Streaming vs Batch
If we don't need to mix streaming with batch data, life will be better.
Simone:
- Keep it with streaming native technologies as far as you can
- When it comes to dumping events:
- We are on Kafka (Confluent managed). We used to be on Kinesis.
- We use Kafka Connect to dump data on S3. We can share with you the tool we use for dumping if you live under the same infrastructure as us.
Who should we talk to about the Kafka?
What would be the ideal stack for streaming?
- Enrichment and transformation engine (Flink, KSQL, Materialize, near-real-time materialization engines)
- After that, two options:
- Specialized database to dump (for example Druid)
- Trino/Starbust/Presto to throw queries
-
## Glovo Data Platform client team
### Briefing
We talk with this team, who is apparently using the data platform, to learn about their experience.
### Attendants
- Pablo Butron: data product manager
### Notes
His team is focused on consuming customer activity events on the front-end with the goal of reproducing the behaviour of users in the app.
He thinks we should talk with Engineering, not with the Data Platform. Engineering takes care of designing the backend. Data platform only puts the infra, but doesn't "fill it" with events in any way.
They are creating a Data Mesh. They had a monolith database that is being produced. Tier structure. Tier 0 is core stuff with strong SLAs (like orders).
They are not constrained or permissioned by Data Platform. They can implement their own databases and infrastructure. They share databases with other data teams (so intermediate states of )
Declarative data products. Build products in less than a couple of hours.
Data Platform provides importers that you need to access core data.
They use Amplitude and Looker for reporting to end-users.
Most of the products serve internal reporting or external partners (like McDonalds).
They don't write back into the event streams themselves. They only act as consumers of the event bus.
# 20230518
- ~~Pinto says:
- ~~I'm looking at the order items table, and for this order 2383028, the replacement products are missing. I think is because you add the info when the order was created, but the delivery was a few days after. So I think in the process, maybe you should only add data on orders that it was already delivery~~~~
# 20230601
## Onelabel infra with Marcos
- Event driven architecture
- Three gateways: one for retailers, another one for the backoffice apps, another one for the shopper apps
- Auth0 for authentication
- Assortment: reusing PHP service existing in Lolamarket. A re-implementation will be done.
- One database per service. AuroraMySQL 8.0
- Source of truth:
- For orders, order service
- For order planning, orchestrator service
- For fulfillment, order fulfillment
- For shopper data, shopper service
- For configuration of operations, operations config service. This includes capacity.
- Capacity is defined in operations config. Availability is computed in the order orchestrator.
- Redis for reservation expiration. Reservations only live in redis while they are not confirmed. Once they are confirmed, they get persisted.
- Glovo's Kafka will replicate all the events received by the Onelabel kafka
**Questions**
- Go through gateway, or have access to private network?
- We should be able to go inside the private network
- One instance per client, right?
- Not clear.
- This needs to become clearer over time.
- The most probable answer is no: multiple retailers
- Aurora CDC replicas are possible?
- It can make replicas.
-
- Timezones, multicurrency management, geographical data
- There are conventions and they are documented.
- For currencies: only local currency.
- For time: everything in UTC.
- For geo: the map projection is
- Is there data that only lives in the databases and doesn't get emitted as events because it is not relevant for other services?
- Yes.
- We might need to modify what gets published as events to prevent having to query to the services.
- Are you going to NEVER have services querying each other's databases?
- Rule of thumb is never.
- But exceptions might be done.
- Ledger mentality, mutability of records, CRUD?
- Mutable records
- Event schema registry, database models. Process views.
- They have this XL event catalog that contains everything.
- Unsure about API access.
- What strategy will you have for versioning events and databases.
- Versioning of event schemas
- Versioning of APIs
- Using events vs db vs apis?
- Skewed towards events.
Requests and next steps
- Access to the miro board
- Conventions
- GlovoXL EventCatalog
- Data contracts driven by data
# 20230608
## The duplication of product-catalogue
- Catalogue ID: `5a71ba59bbe9c6000f7fe360`
- SKU: `2147483647`
But that SKU does not appear in mongodb. Wtf?
Ok, the previous thing was because of the int field size.
Now I'm left with 3 bad combinations.
- #1
- `id_catalogue`: 6009b142f27478003e846b6a
- `sku`: 6
- #2
- `id_catalogue`: 6009b142f27478003e846b6a
- `sku`: 5
- #3
- `id_catalogue`: 5a1ed1b0f777bd000f7a2ef6
- `sku`: 4
I surrender. The ID is going to be a varchar and that's it.
If I run again, I shouldn't get duplicates.
Ok. Now my issue is that checking for uniqueness on the string combo of SKU + catalogueID takes AGES.
I'm hacking to see if we can somehow use hashes to speed up this uniqueness check, because dropping it could be very dangerous. This data comes from partner catalogues, so the chances of receiving shit data are sky high.
Okay. I did the MD5 trick. It works.
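Not the actual ETL code, just an illustration of the idea: hash each (sku, id_catalogue) combination once and check uniqueness on the fixed-length digests instead of the long string combos.
```python
import hashlib

def combo_hash(sku: str, id_catalogue: str) -> str:
    # The separator avoids accidental collisions between e.g. ("12", "3ab") and ("1", "23ab").
    return hashlib.md5(f"{sku}|{id_catalogue}".encode("utf-8")).hexdigest()

rows = [
    ("6", "6009b142f27478003e846b6a"),
    ("5", "6009b142f27478003e846b6a"),
    ("6", "6009b142f27478003e846b6a"),  # duplicate combination
]
hashes = [combo_hash(sku, cat) for sku, cat in rows]
print(len(hashes) != len(set(hashes)))  # True -> duplicates present
```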
Now I still have to deal with some duplicate ids. Review time it is again.
Ok. The agreed procedure is to pick some values at random.
Finally, none of this was necessary once I fixed type issues in the different steps of the ETL.
## The order with thousands quantity
`id_order`: 2409563
`sku`: 699569
## Performance review meeting with Liliana
- H1'23
- Will be done through factorial
- June 12th to 23rd
- I rate João
- I rate other colleagues
- 26-30 June
- Each manager judges their own team members
- Optionally: self-assessment
- 3-7 July
- Calibration: I don't understand this even after the explanation
- Finished in July 24th
- Results will be visible in Factorial
A lot of pep talk on the idea that we might develop ourselves outside of the company if necessary. Is the ship sinking?
Is this going to happen every six months?
# 20230613
## Meeting with Charly
He joined Glovo in February 2022. He lives in Sant Cugat.
Data Engineering Manager at HQ. Previously in Growth/Marketing.
Veterinarian by training, moved into bioinformatics, then into pure data.
He was at Mercado Libre as a senior data manager.
# 20230623
## Catalogue expansion of DW
Should we first:
- Add MD catalogue to DW (regardless of whether that's done by expanding `dim_product` or creating `dim_catalogue`)
- This is useful for name and retailer
- Or should we include LMPL in `fact_orderproduct` and `dim_product`?
- Important to keep the model generic enough
- Catalogue
- Check fields in Mercadão and LM
- Map them
- Share with them
- Settle for final list and model
- In MD, the only things that seem to be interesting enough about the catalogue are the name and the retailer id
- In LM, it seems the uniqueness of products comes from associating them with a specific store. I'm not finding the catalogue abstraction anywhere
We conclude, quick and dirty
# 20230627
## Metabase UAT upgrade
- Autoscaling group name: awseb-e-qguaiqvtw7-stack-AWSEBAutoScalingGroup-1E973Q3A4RYIR
https://www.youtube.com/watch?v=yUXV34RrVmQ
- Existing EC2 instance in UAT: i-07129f5b2695f6a4b
- Existing version in UAT: v0.43.3
- EC2 instance after reboot: i-0682d52f80e19e3bd
- Version in UAT after reboot: v0.46.5
# 20230801
## Trino in Metabase
https://github.com/starburstdata/metabase-driver
https://www.metabase.com/docs/latest/developers-guide/partner-and-community-drivers
https://docs.starburst.io/data-consumer/clients/metabase.html
### How to access the machine to drop files
- Obtain the UAT SSH key
- Set up putty to open an SSH tunnel to the Metabase machine through the jumphost
- Connect through the tunnel with Putty and/or Filezilla
### How to place the Starburst drivers inside the container
- First, download to your laptop the `JAR` file which is the driver itself. These are in the github repo from Starburst.
- Now, use Filezilla to drop the driver somewhere in the EC2 host. For example, if the driver file is called `starburst-3.0.1.metabase-driver.jar`, you can drop it in the EC2 so that it lives in `/home/ec2-user/starburst-3.0.1.metabase-driver.jar`
- Metabase is running as a docker container within the EC2 host.
- The path within the container where the driver file should live is `/plugins`. So, again, if the driver file is `starburst-3.0.1.metabase-driver.jar`, it should be placed like `/plugins/starburst-3.0.1.metabase-driver.jar`.
ISSUE: we need to restart
## Redshift and DMS deprecation
### Current state as of 20230609
In the Lolamarket AWS account (251404039695):
- Redshift
- There is an existing redshift cluster
- Name: lola-market-bi
- id: 36d07598-37ab-46a6-9a60-cbb0f231fa7d
- lola-market-bi.c3ircn2vj5i5.eu-west-1.redshift.amazonaws.com:5439/lolamarket
- It is up and running. I thought we had stopped it.
- It is receiving queries on a daily basis. Some of them seem to originate from metabase.
- DMS
- The RS cluster exists as a destination point in the DMS configuration.
- There are two active replication tasks
### Plan
- [x] Create Redshift cluster snapshot: https://eu-west-1.console.aws.amazon.com/redshiftv2/home?region=eu-west-1#snapshot-details?snapshot=backup-20230614
- [x] Test recovery
- [x] Raise new cluster from snapshot
- [x] Run a query in both the original cluster and the replica and check that values are identical
- [x] Delete recovery test
- [x] Stop Redshift cluster
- [x] Delete DMS tasks
- [x] Delete DMS sources and destinations
- [x] Delete DMS Replication Instance (https://eu-west-1.console.aws.amazon.com/dms/v2/home?region=eu-west-1#replicationInstanceDetails/rds-mysql-to-redshift-instance)
- [x] Wait some time
- [ ] Delete Redshift cluster
## User issues with Pinto
- What happens when a guest user has two possible registered users to match with? https://glovoapp.eu.looker.com/explore/XL-Biz/realtime_pt_orders?qid=bNR5i4o4UgvuTTQEes5PFX&toggle=fil
- I'm going to research in the code to understand how this gets handled.
- Flow `006-md-05-upgrade-guest-users.py` is responsible for spotting:
- Registered users that exist already in DW
- That have the same email as some guest user that already exists in DW
- Then, DW updates the existing guest user record in DW and assigns the
- Users that disappear with first created order (red ones in Ana's excel)
- Create story
- Store data in sandbox
- Set up time next week with Ana to review
### Sunday runs issues
**Current state**
Currently, the order flow is scheduled with a lookback strategy. That means that, whenever you start a flow run, you must specify how many days into the past the ETL should go. For instance, running with `lookback_days=7` means you will fetch orders from today up to 7 days in the past.
More specifically, this gets used in the following way in a Trino query to get orders from the source:
```SQL
[... some more SQL code]
from
app_md_mysql.pdo."order" o
INNER JOIN app_md_mysql.pdo.ordergroup og ON o.ordergroupid = og.id
WHERE
og.status IS NOT NULL
AND o.status IS NOT NULL
AND o.updatedat > DATE '2023-07-01' -- <---- This would be the result of a lookback of 7 days.
AND o.updatedat < DATE '2023-07-07' -- <---- This would be today
),
[... some more SQL code]
```
As you can see in the previous code, what counts is the `updatedat` field in the Mercadão database.
**Issues with this implementation**
There is one caveat to how this implementation works and one corner case which could potentially explain the issues we have seen with orders created on Monday not appearing in DW on Monday.
The caveat is that, with the current implementation, orders updated on the day on which the ETL runs are NOT included in the ETL. This can be seen here:
```SQL
AND o.updatedat < DATE '2023-07-07' -- <---- This would be today
```
Simple. If we run on the 7th of July of 2023, orders updated on the 7th of July of 2023 are not included. The most recent ones will be the ones updated on the 6th of July of 2023.
This could be an issue for the following reasons:
- If there is any update to the cart in the night between Sunday and Monday that happens already on Monday hours, then the cart wouldn't be included.
- I am not fully aware of what timezone the Prefect container that runs the flow is living in, and this is very relevant. It could be that we are scheduling the flow at a time where in Portugal it's already Monday, but inside the Prefect machine it's still Sunday. That would leave out all the orders created on Sunday.
**Solutions**
- The first and obvious solution is to push the schedule deeper into Monday so that there is no doubt that it's running on Monday, even if there might be some hour differences due to timezones. Before, the flow was scheduled at 2AM UTC. Now it's scheduled at 6AM UTC.
- The second solution is to push the filter from today to tomorrow. That ensures that, irrespective of weird hour and timezone games, orders updated today are included in the ETL. This simply requires a small code change that is WIP.
I am not 100% sure that these issues and caveats are what caused the past problems, since this topic is very hard to debug due to the stateful nature of both the origin sources and DW itself. Nevertheless, I'll apply both solutions and we will observe next week if the issue is still taking place.
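A minimal sketch of the second solution, assuming the flow derives its date filters from the `lookback_days` parameter; function and variable names are illustrative, not the real flow code, and the exact off-by-one convention on the lower bound doesn't matter here, the point is the upper bound moving to tomorrow.
```python
from datetime import date, timedelta
from typing import Optional, Tuple

def build_update_window(lookback_days: int, today: Optional[date] = None) -> Tuple[date, date]:
    """Return (lower_bound, upper_bound) for the o.updatedat filter."""
    today = today or date.today()
    lower_bound = today - timedelta(days=lookback_days)
    upper_bound = today + timedelta(days=1)  # "tomorrow": orders updated today stay inside the window
    return lower_bound, upper_bound

lower, upper = build_update_window(lookback_days=7, today=date(2023, 7, 7))
print(lower, upper)  # 2023-06-30 2023-07-08
```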
### Guest orders get assigned to different registered users
## Status log for João
### 20230627
- Asked Marcos for:
- Access to schema catalog -> Not deployed. Only got pointed to https://github.com/lolamarket/xl-bus-events/tree/main/src/events
- Data model/docs for the order's service database -> Not existing
## Call with Víctor
- CBRE
- Business
- What has changed since I left? Are transactions still the core?
- More diversified
- New strategic consulting line
- There is no board anymore. Just all the directors together.
- How have you handled the last few years?
- How is research doing
- Is Carlos Casado still the big boss above D&T?
- D&T
- Let's draw the map
- The position
- What do you call it?
- Which skills are the most important
- Profiles of the people in charge and where they are
- Clients of the position
- What data products are built on top of this DWH -> Same as before. Nothing critical, dashboards and the like, but everything very fluffy.
- If the cable got pulled on a Monday morning, who would call me that week and why
- Items in the roadmap for this position
- Things in my profile that do/don't fit for you
- Conditions
- Money
- Working hours
- Things that bother me
- Building cool stuff that doesn't get used
- Office flexibility and having to put on a suit
- Last-minute crap because someone at C-level has lost their mind
- Wasting time on meetings and lack of clarity in the roadmap. Have a plan, stick to it, keep delivering.
- Stack
- Airflow -> Prefect
- Data docs -> Amundsen
- AWS -> Okay
- Query engine?
- Visualization -> Tableau? Metabase?
- Data Quality -> Great expectations
- PostgreSQL -> Lucky bastards, I'm tired of MySQL
- ELT -> Airbyte
- Other
- How is BDC 3D going?
- Who from the original gang is still around?
- How is your wife's business going?
- I had to leave for you guys to finally put together a proper stack, eh
# 20230805
## DP V1 Training
- V1 vs V2:
- V1 depends on the old monolithic Redshift
-
![[Pasted image 20230809113546.png]]
- Kinesis is gone, now it's Confluent Kafka.
- IDP (staging) contains info as raw as possible
- ODP (data marts) contains transformed stuff
![[Pasted image 20230809115157.png]]
The actual map of how we will do things.
Two environments, dev and prod. Dev gets deployed automatically,
We need a monster laptop to be able to deploy DP1 locally.
- How do we coordinate with Data Analysts?
- Looker explores are not part of the DP
- They shouldn't care about DPs for now
- If I make a monster pipeline and smash run it every minute, where are the monster $ bills gonna come from?
- Hello-world weather forecast
## DDP Training
![[Pasted image 20230810114231.png]]
- Hardcode table names and columns
- One DAG node or multiple -> One DAG
- Arbitrary python code, like read from a public API and write into a SQL
- Git version
## Data Platform Onboarding steps
### Github
**Onelabel DP checklist**
- Be included in glovo's github:
- Follow this guide: https://glovoapp.atlassian.net/wiki/spaces/ITS/pages/3097985253/How+to+Get+Access+to+Github
- In case the glovo people get picky, reference one of these tickets:
- https://glovoapp.atlassian.net/servicedesk/customer/portal/23/ITS-100484
- ITS-55167
- Get added to the "All" team (IT support from Glovo)
- Get added to onelabel data group: [https://github.com/orgs/Glovo/teams/onelabel-data](https://github.com/orgs/Glovo/teams/onelabel-data)
- Get added to this repository (either by Joan or João): https://github.com/Glovo/onelabel-data-mesh
### AWS VPN
- General instructions: https://glovoapp.atlassian.net/wiki/spaces/TECH/pages/3468329771/AWS+Client+VPN+Access+Set-up
- If you don't have permission to get the VPN config file listed there, you can get it in our team Google Drive: Data Drive > 90 Useful > 20 vpn config files
- Once you set it up, you can try one of these links to check if things are working. Note that the important thing is that you reach the page, even if you get an access denied message (each of these services has their own permission system, independent of the VPN).
- https://notebooks.g8s-data-platform-prod.glovoint.com/
- https://starburst.g8s-data-platform-prod.glovoint.com/ui/insights/login
- https://datamesh-workflow.g8s-data-platform-dev.glovoint.com/
- https://datamesh-workflow.g8s-data-platform-prod.glovoint.com/
- https://datahub.g8s-data-platform-prod.glovoint.com/
- Known issues from my side:
- If, when you try to start the VPN, you get redirected to a Glovo OneLogin page that says access denied, you are probably missing a role in the Onelogin system. Open an IT ticket to request: "AWS VPN - All"
## Call with Joan Heredia
- What does DP offer?
- Expose events and make them consumable?
- Environments
- Permissions
- Contacts
-
He's on Alena's team.
He can point to trainings and accesses.
He
# 20230908
Details about our new office.
![[New Office - Guidelines.pdf]]
## Onelabel
Waiting for Onelabel team to modify events thingies.
We will only have events in prod. Everything will be deployed there.
Team was able to do a hello world in production.
# 20230912
## Bitwarden migration
I should check with you if it's fine for me to simply follow the steps described here: https://pdofonte.atlassian.net/wiki/spaces/DM/pages/2569469953/Account+Migration+to+Glovo+-+2023+08
Tip from David Clemente to avoid having to create the account:
> I've logged in with the old account, an error showed saying that I needed to leave all organisations in order to accept the glovo invite. I left `mercadao` organisation (the only one I had in my account). Then logged in again through the invitation email, and a message saying "Invite successfully accepted, an admin will need to accept your account" (something like that) Some minutes later I had the glovo organisation in my account. My personal vault was always there throughout the process
According to what I've read in Slack, there should be an email somewhere in my inbox inviting me to the right Glovo-Bitwarden orgs.
And I should add this info provided to João somewhere in Confluence for the next time an employee joins the team.
> For future new user access, and requests for new collections, an automated servicedesk was created, please [check how to do it here](https://pdofonte.atlassian.net/wiki/spaces/DM/pages/2569502731/Requesting+an+Account).
> In the future, if you have any issue with the Bitwarden access or any other Bitwarden issue, please create a [ticket for Security](https://pdofonte.atlassian.net/wiki/spaces/DM/pages/2576121857/Issues+with+Bitwarden).
# 20230922
## Airbyte
### Header
- [ ] Deploy Airbyte locally with Terraform
- [ ] Deploy Airbyte on AWS manually
- [ ] Source an EC2 instance on AWS with Terraform
- [ ] Deploy Airbyte on the UAT EC2 instance with Terraform
### Deploying Airbyte in UAT manually
I'm going to use a private subnet in UAT.
- VPC: # pdo-uat - vpc-4012372b
- Subnet: pdo-uat-private-1a - subnet-37858a5c
- Instance type: `t2.medium`
- Keypair: `pdo-uat`
The instance I created: https://eu-central-1.console.aws.amazon.com/ec2/home?region=eu-central-1#InstanceDetails:instanceId=i-01be407713b011383
Prepare SSH tunnels and jump into the new instance.
Install docker and docker compose. The docker compose install must be with the modern plugin (as in, `docker compose`, not `docker-compose`). See useful link.
Follow the airbyte instructions.
Prepare SSH tunnel for web access.
Modify `.env` to set credentials.
Define webhooks for alerts.
### Useful links
- Airbyte deployment docs: https://docs.airbyte.com/category/deploy-airbyte
- How to update Airbyte: https://docs.airbyte.com/operator-guides/upgrading-airbyte#upgrading-on-docker
- Terraform tutorial: https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli
- Fixing the issue with the local docker example in the Terraform tutorial: https://github.com/kreuzwerker/terraform-provider-docker/issues/44
- Running a shell script on an EC2 machine: https://brad-simonin.medium.com/learning-how-to-execute-a-bash-script-from-terraform-for-aws-b7fe513b6406
- Install Docker and Docker Compose on AMI 2023: https://medium.com/@fredmanre/how-to-configure-docker-docker-compose-in-aws-ec2-amazon-linux-2023-ami-ab4d10b2bcdc
- How to install Docker Compose Plugin (as in, `docker compose`, not `docker-compose`): https://stackoverflow.com/a/73680537/8776339
- On how to resize a Linux partition after an EBS volume has been enlarged: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html
### Missing stuff
- [x] Finish confluence docs
- [x] Add docs on ssh tunnels
- [x] Schedule session with the team
- [x] Send pre-session material to the team
- [x] Store videos
- [x] Ensure all credentials are in Bitwarden
# 20230927
## Call with Vini
- What the hell do you do?
Integrations team.
They are centralizing a lot of solutions and making them Infra as code.
Move data from streaming (Kafka connector) to an S3 bucket in an hourly fashion (SLA is 90 minutes).
Files are compacted in parquet. Partitioned by day.
Are there duplications?
New events won't be added automatically. We will have to ask them to ingest it and make it available for us.
Datadog to monitor ingestion. Datahub for schemas and metadata.
- How will we manage breaking changes?
- What flows of info should there be between us?
- Any documentation you can share?
# 20231018
## Hello world in DDP
- [ ] I can't access the FAQ on DDP.
- [ ] Is it possible to install custom packages in our Notebook environment? If so, how?
- Data discoverability
- Intermediate tables (our)
- Look at Airflow logs
- Datahub
- Sources
- Not really very helpful
- Git, versioning, rolling back
- Things have changed
- We can use github
- Multiple tables at the end?
- Yes
- What is one DP, what are multiple DP, philosophy
- Very flexible
- Can a DP only generate intermediate tables?
- Yes
- Can the output of a DP be the input of another DP?
- Yes
- We can basically combine them in any way we want
-
- Ski off-piste in DDP
## Making the hello world
- [x] Check the demo table on Starburst
- [ ] Make a new product
- [ ] Simply make a select star and copy it with full refresh mode
- [ ] Run
- [ ] Check Airflow
- [ ] Check resulting table in Starburst
- [ ] Experiment with all the refresh modes
- [ ] Understand full refresh
- [ ] Understand merge
- [ ] Understand XXX
- [ ] Intermediate data products
- [ ] Build a data product that only generates intermediate tables
- [ ] Read it from a different data product
- [ ] Make a session with Pinto to try to read things from one of the data products
- [ ] Delete everything to keep things clean
## Counting delivered orders
### Goal
Make a table in the public schema in `delta` with the following schema:
| Column | Description | Type | PK |
| ---------------------- | ------------------------------------------- | ------- | --- |
| date_utc | The date | date | X |
| orders_delivered_count | The number of orders delivered in that date | integer | |
The table should contain all the history of Onelabel.
The table should be refreshed on an hourly basis.
### Design
Sources
- Table `"hive"."desert"."orders_v0__com_glovoxl_events_orders_v0_orderdelivered"`
Final tables
- Table `"XXX"."YYY"."whatever_something_that_thing__delivered_orders_per_day"`
Schedule
`30 * * * *`
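To make the intended logic concrete, here is a rough sketch of the aggregation using the `trino` Python client; the real product will be defined through the DDP itself, and the host, user and event timestamp column below are assumptions (only the source table quoted above is real, auth is omitted).
```python
import trino

# Placeholder connection details; the column used to derive the delivery date
# (assumed here to be a `createdat` ISO-8601 timestamp on the event) must be confirmed with Marcos.
conn = trino.dbapi.connect(
    host="starburst.g8s-data-platform-prod.glovoint.com",  # placeholder host
    port=443,
    http_scheme="https",
    user="data-team",
)
cur = conn.cursor()
cur.execute("""
    select
        date(from_iso8601_timestamp(createdat)) as date_utc,
        count(*) as orders_delivered_count
    from "hive"."desert"."orders_v0__com_glovoxl_events_orders_v0_orderdelivered"
    group by 1
    order by 1
""")
for date_utc, orders_delivered_count in cur.fetchall():
    print(date_utc, orders_delivered_count)
```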
We had an issue with permissions. The DDP user doesn't have permission to read from the `desert` schema. Instructions from Joan to fix this:
you can find the DDP user in the Airflow rendered template, in the query_to_table operator, if you search for b_dp_
it's something like this: b_dp_10a8aaa3-252e-4295-a9eb-26c3f3475e06
Insert the user in `b_desert_migration_tmp`
This is the user I found: b_dp_9e1363ee-4444-4ad2-8c87-15ed05a76e71
## Event questions with Marcos
- What is the `id_company` field?
- Do the events have IDs?
- Not really
- Picking Location
- IDs that exist in `changed` events but not in `added`
- Orders
- Two timestamps: `createdAt` and kafka timestamp. What to use?
- Doubt.
- Timezones
- Kafka and createdAt are always UTC
- Currency
- What currency codes are you following?
- https://en.wikipedia.org/wiki/ISO_4217
We get the VPN and more permissions in confluent cloud to read events
- Documentation of lifecycle of orderings and picking
- Not there
- Change management flow for versioning, especially breaking changes
- How do we do?
- How will we know when the different versions stop/start?
- Dirty data in prod. Will you remove it? If not, how do we tell it apart?
- Fresh start at some point
- Will events be forever?
- Initially yes, let's wait for us to run out of space to consider a change
# 20231019
## 003-etl-xl-order - 003-md-01-refresh-orders - Uncontrolled enum change
**TLDR**: the Mercadão backend is breaching our unwritten data contract by having the value `DELIVERED` in the `delivery_type` field of orders.
### Context
The flow `003-md-01-refresh-orders` has an expectation for the field `delivery_type` that restricts the valid values to the set `{"DELIVERY", "CLICK_AND_COLLECT"}`
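A minimal sketch of that kind of expectation with Great Expectations (illustrative only; the real suite lives in the flow's GE config):
```python
import great_expectations as ge
import pandas as pd

# Tiny in-memory sample with one offending value, just to show the expectation's behaviour.
orders = ge.from_pandas(
    pd.DataFrame({"delivery_type": ["DELIVERY", "CLICK_AND_COLLECT", "DELIVERED"]})
)
result = orders.expect_column_values_to_be_in_set(
    "delivery_type", ["DELIVERY", "CLICK_AND_COLLECT"]
)
print(result.success)  # False: the stray "DELIVERED" value violates the expectation
```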
### Issue
In this flow run (https://prefecthq-ui.mercadao.pt/mercadao/flow-run/786b017b-1ccf-49dc-886c-4c25d4b0a137), two records contained the value `DELIVERED` for the field `delivery_type`.
The batch of data containing the invalid values is stored in the quarantine table QUARANTINETABLE HERE. The specific records can be spotted with the following query:
QUERY
This is a risk for downstream code in Looker that might have hardcodes that rely on `delivery_type` only having `{"DELIVERY", "CLICK_AND_COLLECT"}` as valid values.
### Possible courses of action
1. Talk with engineering team to understand what happened and make them change the values.
- Ideally, they go back to only using `{"DELIVERY", "CLICK_AND_COLLECT"}`.
- If not, we must adapt on our side***.
2. Ignore the tech team altogether and simply adapt on our side***.
***Options to adapt on our side:
- Play with the ETL to turn values `DELIVERED` into `DELIVERY`.
- Add `DELIVERED` as a valid value for the field `delivery_type` and adapt in Looker and other reports.
### Aftermath
Someone had messed around with the faulty orders manually in the backoffice. The values were fixed manually again in the backoffice and the ETL ran just fine afterwards.
# 20231025
## Data Quality incident on 020_ETL_XL_Products_020_md_01_refresh_product on 20231024
### Context
The flow has failed due to not passing the data test.
Only one expectation was violated: the uniqueness of the combination of id_product and id_catalogue.
The quarantine data reveals that all the values have been duplicated several times. Each combination appears four times. This is very, very weird.
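A quick sketch to confirm the duplication, using the quarantine table listed in the references below and the two columns from the violated expectation (assuming the quarantined batch keeps those columns as-is):
```sql
-- Count how many times each (id_product, id_catalogue) pair appears in the
-- quarantined batch; anything above 1 violates the uniqueness expectation.
select
    id_product,
    id_catalogue,
    count(*) as occurrences
from quarantine."020_md_01_refresh_product_20231024_061838"
group by id_product, id_catalogue
having count(*) > 1
order by occurrences desc
```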
### Diagnostic
As a first step, I try to simply rerun the flow without any change to see if the duplication persists.
The rerun generates the same problem without the slightest change.
I have checked the repository and it seems João made some changes to the pipeline last week with release 0.4.2.
I'm going to run the suspicious query for versions 0.4.2 and 0.4.1 and compare the output.
After running the transformation step query for both 0.4.2 and 0.4.1, I can conclude that the issue appeared in 0.4.2. The query from 0.4.1 on the same data does not generate duplicate values.
Furthermore, I can see that version 0.4.2 introduced two new joins in the query. These new joins are most probably the source of the duplications.
**Conclusion**: changes from 0.4.2 are responsible for duplicating values in the pipeline.
### Courses of action
We can:
- Roll back to 0.4.1 ASAP to keep the product table refreshed, even if the new `category_XXX` columns that we introduced in release 0.4.2 will remain empty.
- Simultaneously, we should apply a fix and make a new release 0.4.3 that includes the `category_XXX` columns but does not mistakenly introduce duplicate values.
### Links and references
- First flow failure: https://prefecthq-ui.mercadao.pt/mercadao/flow-run/feb5ab1e-cba6-4064-8e4c-bc16e3659e16
- Great expectations validation output with failed expectations: https://s3.console.aws.amazon.com/s3/object/pdo-prod-great-expectations?region=eu-central-1&prefix=validations/020_ETL_XL_Products_020_md_01_refresh_product_suite/020_ETL_XL_Products_020_md_01_refresh_product_suite_checkpoint/20231024T061732.662931Z/3c859ea3c1fa0158eeceadd14d3d5950.json
- Quarantine table: quarantine.`020_md_01_refresh_product_20231024_061838`
### Work log
- [x] Make story about coding fix
- [x] Rollback deployed flow to 0.4.1
- [x] Re-run with flow 0.4.1
# Ongoing
## DATA-1182
https://pdofonte.atlassian.net/jira/software/c/projects/DATA/boards/6?selectedIssue=DATA-1182
### First look
Can I derive all the fields required in the task from the query that João already composed?
|Task Field|Query peer|Comments|
|---|---|---|
|Order ID|id_order||
|Status|status||
|Shopper ID|id_shopper||
|Team / Fleet|NA|We must obtain it through a very messy process from the shopper. I would drop it for now.|
|Date Created|date_order_received||
|Date Delivered|date_order_delivered||
|Local Currency|local_currency||
|Total Amount Ordered|charged_amount||
|Total Amount Delivered|NA|It's unclear how to obtain this. Must it be derived by adding up all the "Article picked" events and subtracting all the removed articles? The events around articles are also confusing (difference between stockout and not available? difference between unpicked and removed?)|
|Delivery Time Slot Start|delivery_time_slot_start||
|Delivery Time Slot End|delivery_time_slot_end||
|Picking Location ID|id_picking_location||
A couple of them are ambiguous. I'll drop them for now and get ahold of Marcos to clarify how to obtain them.
## Design
We are going to make a pipeline that generates the final table in a single step.
This is the query:
```sql
select
rec.orderref as id_order,
-- the coalesce order walks the lifecycle backwards: the most advanced event that exists for the order wins
coalesce(can.status, del.status, oit.status, cod.status, pic.status, allo.status, con.status, rej.status, rec.status) as status,
rec.kafka_record_timestamp as date_order_received,
del.createdat as date_order_delivered,
con.pickinglocation.id as id_picking_location,
allo.shopperid as id_shopper,
coalesce(con.totals.amountcharged.amount, rec.totals.amountcharged.amount) as charged_amount,
coalesce(con.totals.amountcharged.currency, rec.totals.amountcharged.currency) as local_currency,
rec.servicedate.fromdate as delivery_time_slot_start,
rec.servicedate.todate as delivery_time_slot_end
from
delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderreceived rec
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderconfirmed con on (rec.id = con.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderrejected rej on (rec.id = rej.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderallocated allo on (rec.id = allo.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderinpicking pic on (rec.id = pic.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_ordercheckoutdone cod on (rec.id = cod.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderintransit oit on (rec.id = oit.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_orderdelivered del on (rec.id = del.id)
left join delta.desert.orders_v0__com_glovoxl_events_orders_v0_ordercanceled can on (rec.id = can.id)
```
Which means that the
## Onelabel ways of working
- Alerts
- FX data
- Team being able to modify pipeline
- Data Quality Expectations
## Departure Management
### Open fronts list
- lolafect
- On the usage of `lolafect`, I think the team is up to date.
- On the internals of `lolafect`, that's a different story. I think I'm the only one fully familiar with the internals of the package. Perhaps it would be a good idea to hold a session on it, and perhaps to try to go for a couple of silly-but-real stories to implement something in the package just so that you and/or Afonso go through the entire lifecycle of adding something new to `lolafect` and making a new release with it.
- DDP
- I generally think we are quite aligned.
- I will document all the knowledge I have and the ongoing work with the orders pipeline.
- I would suggest planning some transfer sessions on my last week for whatever WIP I still have.
- AWS
- I think we are in sync and I don't have any knowledge that the team doesn't. If you think there's any foggy area that I should transfer or document, let me know.
- Prefect
- I think we are in sync and I don't have any knowledge that the team doesn't. If you think there's any foggy area that I should transfer or document, let me know.
- Airbyte
- I think we are in sync and I don't have any knowledge that the team doesn't. If you think there's any foggy area that I should transfer or document, let me know.
- Trino
- I think we are in sync and I don't have any knowledge that the team doesn't. If you think there's any foggy area that I should transfer or document, let me know.
## Call with Joan to improve ways of working
- Overcoming notebooks
- Option 1: share notebooks and just survive
- Option 2: repository
- Centralizes development in a repository with a Github Action that publishes to production
- is it a notebook or how do you structure code?
- one repo for one ddp? one repo for all ddps?
- Option 3:
- Same as option 2, but with an individual repository for onelabel
- data quality checks
- Use the data quality checks from Glovo, not GE, not dbt-expectations
- Docs for that?
- We will read the tutorial notebooks and come back with questions
- Control flow?
- No
- Alerts on Airflow failures
https://glovoapp.atlassian.net/servicedesk/customer/portal/26/PHC-11171?created=true
Artifactory instructions setup: https://glovoapp.atlassian.net/wiki/spaces/TECH/pages/1124565144/Configure+access+to+Artifactory+repository
cmVmdGtuOjAxOjE3MzA4OTcyNjA6NGtQaTJRbDk3QTFObmxHa25QNWY2NHc2R2x5
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
- Browser hack (Charly)
- DQ for each ODP output
- Add SLO to add_sql_transformation
# OIDC Token Refresh bug
- Package version: `glovo-data-platform-meshub-client==0.1.68`
- Summary: the meshub client fails to correctly store a valid access token when using a refresh token to obtain it. The new access token is obtained correctly and stored in memory, but it never reaches the `credentials` file. The issue is caused by an exception, `jwt.exceptions.ImmatureSignatureError: The token is not yet valid (iat)`, that gets triggered in the following line of code: https://github.com/Glovo/data-platform-meshub/blob/34eb43822fde83312b4259478dae93bbeab426e1/meshub-backend/service/glovo_data_platform/meshub_client/authentication/oidc_tokens.py#L37.
- Current workaround: copy the new access token from an in-memory variable while debugging with PyCharm and paste it into the `credentials` file, along with some faked out `access_issued_at` and `access_expires_at`. Obviously, we won't get very far this way.
Tag: Joan, @dp-transformations-greenflag.
This is what my credentials file looks like right now:
```bash
[oidc meshub]
refresh_token = eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6Ii1FY0U2TlE0MFJtaUJRQUt6Zm4wYnFheXhZY2JXOXdaaURmZ28wQm52a3cifQ.eyJqdGkiOiJWWTZlZFFUZV9pTU9mRWhzM1d0cDciLCJzdWIiOiI5MjQ0MjU4NSIsImlzcyI6Imh0dHBzOi8vZ2xvdm9hcHAub25lbG9naW4uY29tL29pZGMvMiIsImlhdCI6MTY5OTQ1NDQ0MiwiZXhwIjoxNzAyMDQ2NDQyLCJzY29wZSI6Im9wZW5pZCIsImF1ZCI6ImNjOTc5NmIwLTc3ZDgtMDEzYi1jOThmLTA2NThkMDRkMjM2NjM3ODE1In0.hpp8bKfSSBpivMVl3zwwPXeDtGzOrPETAI-HRsy-hsgVqG13eahdw8MAHgDKNUdXQ-l01uqGG90RiYXn3CCU8b5Bx3QEh90FMQvrzAOJXWZufSVhR9WNKwvmh7lr568Xxg__3Ux6JVau8Qo7PH7KCcPQTNbrf9aV2v3rSSczkNMgKKUO5GN8w9UYFs1vN6DX8olIE8voVbDhWEuidMRhl8EZWDJG2rRiY3EvLlAl3QFbQZZdGTbxd6o7tyH_DEPDyIQ0Mhk5CK3qGDEx7w5ySSwoVC_uxI_BcC1cAtha2klL0Dz4OT06d_5DIRLCHLqrGjGuM75yXBc6rOaiLUus_g
refresh_issued_at = 2023-11-08T15:40:42
refresh_expires_at = 2023-12-08T15:40:42
```
This is what the glovo code has read when running:
# Docs
- Clone repos
- Clone personal notes
- Send messages to
- ~~Slack general~~
- ~~Maria~~
- ~~Dani~~
- Team
- ~~Charly~~
# Future
There is no future anymore.