
Yuka Saudi Nov 2025

Attempted to build a Saudi version of Yuka, the food product scanning app.

Found that OpenFoodFacts coverage of Saudi grocery products was poor in both quantity and quality: only 12,000 items were listed, and more than half were missing nutritional or ingredient info.

Development

Scraped 30,000+ product pages across the most popular grocery retailers in the country, i.e. Ninja, Lulu, and Carrefour.

Reverse-engineered private APIs for Lulu and Carrefour, and implemented headless browser scraping for Ninja with Playwright. Circumvented geo and location restrictions by manipulating request headers. Ran all scraping locally and in parallel to maximize throughput.
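A minimal sketch of the parallel scraping loop, not the actual scraper. It assumes a thread pool is an acceptable way to parallelize, and `fetch` is a hypothetical callable standing in for whatever request or browser call a given retailer needs. The default of 6 workers mirrors the Ninja run logged below; the per-category counters roughly mirror the summary tables.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(tasks, fetch, workers=6):
    """Run fetch(url) for every (category, url) task in a thread pool.

    `fetch` is a hypothetical callable supplied by the caller; any exception
    it raises is counted as a failure for that category.
    Returns {category: {"total": n, "success": n, "failed": n}}.
    """
    summary = defaultdict(lambda: {"total": 0, "success": 0, "failed": 0})
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its category so results can be tallied.
        futures = {pool.submit(fetch, url): category for category, url in tasks}
        for future in as_completed(futures):
            row = summary[futures[future]]
            row["total"] += 1
            try:
                future.result()
                row["success"] += 1
            except Exception:
                row["failed"] += 1
    return dict(summary)
```

Passing `fetch` in as a parameter keeps the pool logic identical across retailers, whether the underlying call is a plain HTTP request or a headless browser page load.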

Downloaded all product images and ran a barcode scanner CNN locally for products with missing barcode data. For products with no barcode visible in any image, embedded product titles and computed semantic textual similarity (STS) against all other titles to infer matches, accepting only matches with cosine similarity > 0.97.
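The title-matching rule above can be sketched as follows. This assumes the title embeddings already exist as plain lists of floats; the embedding model itself is not shown, and `best_match` is a hypothetical helper name. Only the 0.97 threshold comes from the pipeline described here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query, candidates, threshold=0.97):
    """Return (key, score) for the most similar candidate embedding.

    Returns None when no candidate scores strictly above `threshold`.
    `candidates` maps a product key to its (precomputed) title embedding.
    """
    best_key, best_score = None, threshold
    for key, vec in candidates.items():
        score = cosine_similarity(query, vec)
        if score > best_score:
            best_key, best_score = key, score
    return (best_key, best_score) if best_key is not None else None
```

A strict threshold this high trades recall for precision: near-identical titles match, while loosely related products are left unmatched rather than merged incorrectly.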

Ran an open-source nutrition table YOLO model alongside SAM 3 for nutrition label object detection, and used agreement between the two models to significantly reduce false positives.
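One plausible way to express agreement between two detectors is box overlap. The sketch below keeps a detection from one model only if the other model produced a sufficiently overlapping box. Using intersection over union (IoU) as the criterion, the 0.5 cutoff, and the `(x1, y1, x2, y2)` box format are all assumptions for illustration, not details of the pipeline above.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def agreed_boxes(boxes_a, boxes_b, min_iou=0.5):
    """Keep boxes from detector A that overlap at least one box from detector B.

    min_iou=0.5 is an assumed cutoff, not a value taken from the project.
    """
    return [a for a in boxes_a if any(iou(a, b) >= min_iou for b in boxes_b)]
```

A box that only one model reports is dropped, which is how agreement suppresses false positives at the cost of occasionally discarding a true label that one model missed.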

Finally, normalized and matched all products using barcodes. The tables below summarize the scraping results across all sources.
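For the barcode matching step, a rough sketch under stated assumptions: barcodes arrive as strings, only UPC-A (12 digits) and EAN-13 (13 digits) are handled, and invalid codes are dropped rather than repaired. `normalize_barcode` and `group_by_barcode` are hypothetical names; the check-digit rule is the standard EAN-13 one.

```python
import re
from collections import defaultdict


def normalize_barcode(raw):
    """Return a 13-digit code, or None if `raw` is not a valid UPC-A/EAN-13.

    Assumption: other formats (e.g. EAN-8) are ignored in this sketch.
    """
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 12:
        # A UPC-A code is an EAN-13 code with a leading zero.
        digits = "0" + digits
    if len(digits) != 13:
        return None
    # EAN-13 check digit: weights 1, 3, 1, 3, ... over the first 12 digits.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    check = (10 - total % 10) % 10
    return digits if check == int(digits[12]) else None


def group_by_barcode(products):
    """Group product dicts from all sources by their normalized barcode."""
    groups = defaultdict(list)
    for product in products:
        code = normalize_barcode(product.get("barcode"))
        if code:
            groups[code].append(product)
    return dict(groups)
```

Each resulting group holds the listings of one physical product across retailers, so the number of groups gives the count of unique items.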

================================================================
NINJA SCRAPING SUMMARY
================================================================
Date: 2025-11-14 11:26:40
Duration: 5:37:01.816874
Workers: 6

Category                    Total     Success   Failed    Rate      
----------------------------------------------------------------
dairy-eggs                  388       388       0         100.0%
milk                        323       323       0         100.0%
..
healthy-snacks              576       576       0         100.0%
healthy-drinks              171       171       0         100.0%
----------------------------------------------------------------
TOTAL                     11769     11769       0         100.0%
================================================================
================================================================
LULU SCRAPING SUMMARY
================================================================
Category                       Total   Both    EN    AR   Failed
----------------------------------------------------------------
fresh-food-dairy-eggs-cheese     617    617     0     0        0
fresh-food-bakery                306    306     0     0        0
...
grocery-food-cupboard           5908   5908     0     0        0
grocery-speciality-food          354    354     0     0        0
----------------------------------------------------------------
TOTAL                           8082   8082     0     0        0
================================================================
================================================================
CARREFOUR PROCESSING SUMMARY
================================================================
Category                       Total   Both    EN    AR   Failed
----------------------------------------------------------------
fresh-food                     1558    1558     0     0        0
beverages                      1726    1722     4     0        0
...
frozen-food                    1006    1006     0     0        0
food-cupboard                  6455    6429    24     2        0
----------------------------------------------------------------
TOTAL                         12562   12526    34     2        0
================================================================

Outcome

Produced a clean dataset of ~16,000 unique grocery items, with 8,000+ products containing nutrition labels.

If this dataset is of interest to you, feel free to reach out.