
Yuka Saudi Nov 2025
Attempted to build a Saudi version of Yuka, a food product scanning app.
Realized that OpenFoodFacts coverage of Saudi grocery products was poor in both quantity and quality: only 12,000 items were listed, and more than half were missing nutritional or ingredient information.
Development
Scraped 30,000+ product pages across the most popular grocery retailers in the country: Ninja, Lulu, and Carrefour.
Reverse-engineered private APIs for Lulu and Carrefour, and implemented headless browser scraping for Ninja with Playwright. Circumvented geo and location restrictions by manipulating request headers. Ran all scraping locally and in parallel to maximize throughput.
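A minimal sketch of how the parallel scraping could be organized with a fixed worker pool. The `scrape_all` name, the `fetch` callable, and the category-to-URL mapping are hypothetical and not taken from the actual scraper; only the worker count of 6 comes from the Ninja summary log.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def scrape_all(
    urls_by_category: Dict[str, Iterable[str]],
    fetch: Callable[[str], dict],
    workers: int = 6,
) -> Tuple[Dict[str, dict], Dict[str, Dict[str, int]]]:
    """Fetch every URL with a thread pool and tally per-category outcomes.

    `fetch` is a hypothetical callable that returns parsed product data
    for one URL and raises an exception on failure.
    """
    results: Dict[str, dict] = {}
    stats: Dict[str, Dict[str, int]] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {}
        for category, urls in urls_by_category.items():
            stats[category] = {"total": 0, "success": 0, "failed": 0}
            for url in urls:
                stats[category]["total"] += 1
                futures[pool.submit(fetch, url)] = (category, url)
        # Collect results as they finish so one slow page does not block the rest.
        for future in as_completed(futures):
            category, url = futures[future]
            try:
                results[url] = future.result()
                stats[category]["success"] += 1
            except Exception:
                stats[category]["failed"] += 1
    return results, stats
```

The per-category counters mirror the Total, Success, and Failed columns of the summary logs. Threads suit this workload because it is dominated by network waits rather than CPU time.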
Downloaded all product images and ran a barcode scanner CNN locally for products with missing barcode data. For products with no barcode in images, embedded product titles and performed semantic textual similarity (STS) against all other titles to infer matches, accepting only matches with cosine similarity > 0.97.
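The title-matching step can be sketched roughly as follows. This assumes the title embeddings have already been computed by some text embedding model (the model is not named here, so plain float vectors stand in for it); the `best_match` helper and the candidate identifiers are hypothetical. Only the 0.97 threshold comes from the description above.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    threshold: float = 0.97,
) -> Optional[Tuple[str, float]]:
    """Return (candidate_id, score) for the closest title embedding,
    or None when no candidate scores strictly above the threshold."""
    best_id, best_score = None, -1.0
    for candidate_id, vector in candidates.items():
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_id, best_score = candidate_id, score
    if best_id is not None and best_score > threshold:
        return best_id, best_score
    return None
```

A high threshold such as 0.97 trades recall for precision: near-identical titles match, while different sizes or flavors of the same product are more likely to be rejected.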
Ran an open-source nutrition table YOLO model alongside SAM 3 for nutrition label object detection, and used agreement between the two models to significantly reduce false positives.
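One way to express agreement between two detectors is to keep a box only when the other model produced an overlapping box. The sketch below assumes both models' outputs have been reduced to axis-aligned boxes and uses an intersection-over-union (IoU) threshold of 0.5; the write-up does not state how agreement was actually measured, so the criterion, the threshold, and the function names are all assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def agreed_boxes(
    boxes_a: List[Box], boxes_b: List[Box], min_iou: float = 0.5
) -> List[Box]:
    """Keep boxes from the first model that overlap at least one box
    from the second model with IoU >= min_iou."""
    return [a for a in boxes_a if any(iou(a, b) >= min_iou for b in boxes_b)]
```

A detection that only one model reports is dropped, which is why requiring agreement tends to lower false positives at the cost of some missed labels.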
Finally, normalized and matched all products using barcodes. The tables below summarize the scraping results across all sources.
================================================================
NINJA SCRAPING SUMMARY
================================================================
Date: 2025-11-14 11:26:40
Duration: 5:37:01.816874
Workers: 6
Category                        Total  Success   Failed     Rate
----------------------------------------------------------------
dairy-eggs                        388      388        0   100.0%
milk                              323      323        0   100.0%
..
healthy-snacks                    576      576        0   100.0%
healthy-drinks                    171      171        0   100.0%
----------------------------------------------------------------
TOTAL                           11769    11769        0   100.0%
================================================================
================================================================
LULU SCRAPING SUMMARY
================================================================
Category                        Total   Both    EN    AR  Failed
----------------------------------------------------------------
fresh-food-dairy-eggs-cheese      617    617     0     0       0
fresh-food-bakery                 306    306     0     0       0
...
grocery-food-cupboard            5908   5908     0     0       0
grocery-speciality-food           354    354     0     0       0
----------------------------------------------------------------
TOTAL                            8082   8082     0     0       0
================================================================
================================================================
CARREFOUR PROCESSING SUMMARY
================================================================
Category                        Total   Both    EN    AR  Failed
----------------------------------------------------------------
fresh-food                       1558   1558     0     0       0
beverages                        1726   1722     4     0       0
...
frozen-food                      1006   1006     0     0       0
food-cupboard                    6455   6429    24     2       0
----------------------------------------------------------------
TOTAL                           12562  12526    34     2       0
================================================================
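Matching the three sources on barcodes might look roughly like the sketch below. It assumes barcodes are EAN-13 (or 12-digit UPC-A, padded with a leading zero) and validates them with the standard check digit; the record fields (`barcode`, `title`, `nutrition`) and both function names are hypothetical, and the actual normalization rules are not described in this write-up.

```python
from typing import Dict, Iterable, Optional


def normalize_barcode(raw: str) -> Optional[str]:
    """Return a 13-digit EAN string, or None if the code is not valid.

    Assumption: only EAN-13 and UPC-A are handled; EAN-8 is rejected.
    """
    digits = "".join(ch for ch in raw if "0" <= ch <= "9")
    if len(digits) == 12:  # UPC-A is EAN-13 with an implicit leading zero
        digits = "0" + digits
    if len(digits) != 13:
        return None
    # EAN-13 check digit: weights 1,3,1,3,... over the first 12 digits.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    check = (10 - total % 10) % 10
    return digits if check == int(digits[12]) else None


def merge_by_barcode(sources: Dict[str, Iterable[dict]]) -> Dict[str, dict]:
    """Merge product records from several retailers, keyed by barcode.

    Earlier sources win on conflicting fields; later sources only fill
    in fields that are still missing.
    """
    merged: Dict[str, dict] = {}
    for source_name, products in sources.items():
        for product in products:
            code = normalize_barcode(str(product.get("barcode", "")))
            if code is None:
                continue  # skip records without a usable barcode
            entry = merged.setdefault(code, {"barcode": code, "sources": []})
            entry["sources"].append(source_name)
            for key, value in product.items():
                if key != "barcode" and value and key not in entry:
                    entry[key] = value
    return merged
```

Because the same product is often listed by more than one retailer, merging on a validated barcode is what collapses the 30,000+ scraped pages into a smaller set of unique items.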
Outcome
Produced a clean dataset of ~16,000 unique grocery items, with 8,000+ products containing nutrition labels.
If this dataset is of interest to you, feel free to reach out.