
Swap the Search-Based product download to the Dump based product download #5787

teolemon opened this issue Nov 5, 2024 · 16 comments

@teolemon
Member

teolemon commented Nov 5, 2024

What

@teolemon
Member Author

teolemon commented Nov 8, 2024

@raphael0202

Please note it's not the final URL; the canonical URL will be under the static.openfoodfacts.org domain (even though we will set up a redirect, it's not done yet).

@monsieurtanuki
Contributor

I'm not sure what we can do with that kind of file:

  • 75 MB zipped - I'm not sure we can download that much in Flutter; it had better be fast
  • I haven't seen how big it is unzipped
  • we currently get tons of product fields (everything except knowledge panels) for a given country and language - can you provide that?

@teolemon
Member Author

teolemon commented Nov 8, 2024

  • It's 200 MB unzipped, and it provides only the existence of the product in Open Food Facts plus the 3 scores.
  • We could eventually have an add-on option for people who want complete products, but that's not technically doable outside of the legacy search API, and it might not be for a while (attributes and knowledge panels are generated on the fly).
  • This extract allows basic offline scan functionality on the scan card, and unlocks offline product addition without a major overwrite risk. This is available today.

@monsieurtanuki
Contributor

For the record, we're talking about 3.5 million products, with this kind of information:

code    product_name    quantity        brands  nutrition_grade_fr      nova_group      ecoscore_grade
0010374162716   Kicked Up Key Lime Pie          The Father's Table Llc  d       4       unknown
0010374162747   Chocolate lover's cheesecake            The Father's Table Llc  e       4       unknown
0010374162754   The Father's Table, Winter Variety Cheesecake           The Father's Table Llc  d       4       unknown
0010374162761   The Father's Table, Cheesecake, Raspberry Lemonade              The Father's Table Llc  e       4       unknown

To be compared with the top 10K products for a given country/language with relevant data.
Anyway, downloading/unzipping/loading that .tsv.gz into the local database would be a challenge, and I have doubts about the added value.
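For scale, the download/unzip/load pipeline discussed above can be sketched with a streaming reader, so the ~200 MB decompressed dump never has to sit in memory at once. This is an illustrative sketch only (the file name and demo rows are made up, reusing the column layout of the sample above), not the app's actual implementation:

```python
import csv
import gzip
import os
import tempfile

def stream_dump(path):
    """Yield one dict per product from a gzipped TSV dump, without ever
    holding the full decompressed file (~200 MB) in memory."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")

# Tiny self-contained demo reusing the column layout from the sample above.
path = os.path.join(tempfile.mkdtemp(), "demo.tsv.gz")
with gzip.open(path, "wt", encoding="utf-8") as out:
    out.write(
        "code\tproduct_name\tquantity\tbrands\tnutrition_grade_fr\tnova_group\tecoscore_grade\n"
        "0010374162716\tKicked Up Key Lime Pie\t\tThe Father's Table Llc\td\t4\tunknown\n"
    )

products = list(stream_dump(path))
print(products[0]["code"], products[0]["nutrition_grade_fr"])  # 0010374162716 d
```

Even with streaming, inserting millions of rows into the local database remains the expensive part, which is the concern raised above.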

@teolemon
Member Author

teolemon commented Nov 9, 2024

With the new backend, we'll be able to split by country, but France will still be a whopping 1M. As a reference, other apps are storing the top 150k products.

@raphael0202

Using Parquet files instead of TSV would bring some benefits in my opinion. First, generating country-specific dumps is almost instant on the backend.

I generated Parquet files with the same data as the TSV; here are the file sizes:

-rw-rw-r--  1 raphael raphael 122M nov.   9 13:10 products-min.parquet
-rw-rw-r--  1 raphael raphael 9,7M nov.   9 13:17 products-min-de.parquet
-rw-rw-r--  1 raphael raphael 9,4M nov.   9 13:16 products-min-es.parquet
-rw-rw-r--  1 raphael raphael  34M nov.   9 13:14 products-min-fr.parquet
-rw-rw-r--  1 raphael raphael 4,0M nov.   9 13:17 products-min-uk.parquet

products-min.parquet being the Parquet file for all countries.

Second, these Parquet files can be queried directly using duckdb, and the query is really fast thanks to Parquet's columnar storage and statistics:

D SELECT * FROM '/home/raphael/datasets/products-min-de.parquet' WHERE code = '4388040200632';
┌───────────────┬──────────────────┬──────────┬─────────────┬────────────────────┬────────────┬────────────────┐
│     code      │   product_name   │ quantity │ brands_tags │ nutrition_grade_fr │ nova_group │ ecoscore_grade │
│    varchar    │     varchar      │ varchar  │  varchar[]  │      varchar       │   int64    │    varchar     │
├───────────────┼──────────────────┼──────────┼─────────────┼────────────────────┼────────────┼────────────────┤
│ 4388040200632 │ Thunfisch chunks │ 185g     │ [economy]   │ unknown            │            │ unknown        │
└───────────────┴──────────────────┴──────────┴─────────────┴────────────────────┴────────────┴────────────────┘
Run Time (s): real 0.014 user 0.031547 sys 0.001925

It looks like there is a duckdb integration for Flutter; it could be an option: https://pub.dev/packages/dart_duckdb

@monsieurtanuki
Contributor

With the new backend, we'll be able to split by country

What about the language, especially for multilanguage countries?

but France will still be a whopping 1M. As a reference, other apps are storing the top 150k products.

What about tranches of 100k products?

I'm concerned about:

  • heterogeneous data (e.g. monolingual data from the dump and multilingual data from the search - potential side effects when a user changes the labels)
  • one-shot actions - I don't think our background tasks, or even Flutter on a smartphone, are compatible with long tasks that use a lot of memory
  • the size of the database - I can try to load the database with 1M rows of crap data and see how the size evolves
  • I can downgrade the set of fields retrieved for offline data to barcode, names, ns, es, brands, nova for performance tests

@raphael0202 I'll have a look at your duckdb suggestion. That would mean another database (after shared preferences, hive and sqflite), and a sort of meta product coming either from Parquet or sqflite.

@raphael0202

What about the language, especially for multilanguage countries?

We're currently working on adding language fields (ex: ingredients_text_{LANG}) on the Parquet file in a scalable manner. Currently, only the ingredients_text_{LANG} fields are added, but in bulk at the root of the schema (see https://huggingface.co/datasets/openfoodfacts/product-database).

What about tranches of 100k products?

That's possible as well!

What are the fields you currently fetch from the Open Food Facts search API to get the offline data? We can match these fields in the dump so that you don't need to query the search API.

@raphael0202 I'll have a look at your duckdb. That would mean another database (after shared preferences, hive and sqflite), and a sort of meta product coming either from Parquet or SQFLite.

duckdb can be used as a real SQL DB, but it can also query Parquet files directly (as shown above).

@monsieurtanuki
Contributor

monsieurtanuki commented Nov 12, 2024

We're currently working on adding language fields (ex: ingredients_text_{LANG}) on the Parquet file in a scalable manner. Currently, only the ingredients_text_{LANG} fields are added, but in bulk at the root of the schema (see https://huggingface.co/datasets/openfoodfacts/product-database).

How would you do that for multilingual products with a .tsv file?

What about tranches of 100k products?

That's possible as well!

Cool!

What are the fields you currently fetch from the Open Food Facts search API to get the offline data? We can match these fields in the dump so that you don't need to query the search API.

That would be madness (sorry about the poor presentation: my keyboard is acting up and prevents me from using the Alt Gr key, which is getting very problematic)
[Edited]

  static List<ProductField> get fields => const <ProductField>[
        ProductField.NAME,
        ProductField.NAME_ALL_LANGUAGES,
        ProductField.BRANDS,
        ProductField.BARCODE,
        ProductField.NUTRISCORE,
        ProductField.FRONT_IMAGE,
        ProductField.IMAGE_FRONT_URL,
        ProductField.IMAGE_INGREDIENTS_URL,
        ProductField.IMAGE_NUTRITION_URL,
        ProductField.IMAGE_PACKAGING_URL,
        ProductField.IMAGES,
        ProductField.SELECTED_IMAGE,
        ProductField.QUANTITY,
        ProductField.SERVING_SIZE,
        ProductField.STORES,
        ProductField.PACKAGING_QUANTITY,
        ProductField.PACKAGING,
        ProductField.PACKAGINGS,
        ProductField.PACKAGINGS_COMPLETE,
        ProductField.PACKAGING_TAGS,
        ProductField.PACKAGING_TEXT_ALL_LANGUAGES,
        ProductField.NO_NUTRITION_DATA,
        ProductField.NUTRIMENT_DATA_PER,
        ProductField.NUTRITION_DATA,
        ProductField.NUTRIMENTS,
        ProductField.NUTRIENT_LEVELS,
        ProductField.NUTRIMENT_ENERGY_UNIT,
        ProductField.ADDITIVES,
        ProductField.INGREDIENTS_ANALYSIS_TAGS,
        ProductField.INGREDIENTS_TEXT,
        ProductField.INGREDIENTS_TEXT_ALL_LANGUAGES,
        ProductField.LABELS_TAGS,
        ProductField.LABELS_TAGS_IN_LANGUAGES,
        ProductField.COMPARED_TO_CATEGORY,
        ProductField.CATEGORIES_TAGS,
        ProductField.CATEGORIES_TAGS_IN_LANGUAGES,
        ProductField.LANGUAGE,
        ProductField.ATTRIBUTE_GROUPS,
        ProductField.STATES_TAGS,
        ProductField.ECOSCORE_DATA,
        ProductField.ECOSCORE_GRADE,
        ProductField.ECOSCORE_SCORE,
        // ProductField.KNOWLEDGE_PANELS, // that's the only one we don't use for offline download
        ProductField.COUNTRIES,
        ProductField.COUNTRIES_TAGS,
        ProductField.COUNTRIES_TAGS_IN_LANGUAGES,
        ProductField.EMB_CODES,
        ProductField.ORIGINS,
        ProductField.WEBSITE,
        ProductField.OBSOLETE,
        ProductField.OWNER_FIELDS,
      ];

duckdb can be used as a real SQL DB, but it can also query Parquet files directly (as shown above).

I'm still not 100% confident. For instance, sqflite starts immediately and returns results almost immediately, while shared preferences and hive (most of the time) need to load the data into memory at startup. How would duckdb behave at startup? To be tested.

@raphael0202

raphael0202 commented Nov 14, 2024

How would you do that for multilingual products with a .tsv file?

It would be a Parquet file and not a TSV. I'm currently working on the implementation.

Thank you for the list! I will investigate to see if we can add the missing ones.

I'm still not 100% confident. For instance, sqflite starts immediately and returns results almost immediately, while shared preferences and hive (most of the time) need to load the data into memory at startup. How would duckdb behave at startup? To be tested.

There is no startup penalty. duckdb does not need to load all the data into memory before the query, as it takes advantage of the row group statistics of the Parquet file: it only reads from disk the data that matches the WHERE clause of the query, and only fetches the requested columns, thanks to Parquet's column-oriented storage.

@monsieurtanuki
Contributor

@raphael0202 Before you start implementing anything, please roughly compute the amount of data involved.
In addition to that, I don't know how you would split a Parquet file into 10m or 100m product tranches.
Good luck!

@raphael0202

raphael0202 commented Nov 14, 2024

Do you have the original field names (i.e. the names of the fields requested from the Open Food Facts API)?

It would help me fetch the right data.

In addition to that, I don't know how you would split a Parquet file into 10m or 100m product tranches.

Tranches of 10 000 or 100 000 products?
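Whatever the tranche size ends up being, the splitting itself is simple on the client side; server-side, a Parquet writer could equally page through the data with LIMIT/OFFSET. A minimal sketch of the chunking (the sizes here are illustrative):

```python
from itertools import islice

def tranches(products, size):
    """Yield successive lists of at most `size` products from any iterable,
    so a large dump can be processed one tranche at a time."""
    iterator = iter(products)
    while chunk := list(islice(iterator, size)):
        yield chunk

# 3.5M products in tranches of 100 000 -> 35 full tranches.
parts = list(tranches(range(3_500_000), 100_000))
print(len(parts), len(parts[0]))  # 35 100000
```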

@monsieurtanuki
Contributor

Quickly said, you can find the field names as "offTag" in https://github.com/openfoodfacts/openfoodfacts-dart/blob/master/lib/src/utils/product_fields.dart

TBH I don't believe in 100k products: probably too big.
I guess once we have stats for 10k we can compute stats for 100k. And vice versa.

@raphael0202

I added most of the missing fields to the Parquet data dump. A few remarks below:

FRONT_IMAGE IMAGE_FRONT_URL IMAGE_INGREDIENTS_URL IMAGE_NUTRITION_URL IMAGE_PACKAGING_URL SELECTED_IMAGE

-> these fields can easily be computed from the images field. To keep the dump lightweight, I would advocate letting the front-end compute them from the images field.

NUTRIMENT_ENERGY_UNIT is supposed to map to nutriment_energy_unit in the API, but I couldn't find this value in the data dump (?).

LABELS_TAGS_IN_LANGUAGES
CATEGORIES_TAGS_IN_LANGUAGES
COUNTRIES_TAGS_IN_LANGUAGES

That's the tricky part: storing this would significantly increase the dump size. I think we should explore having the mobile app download language-specific translation files for all taxonomies, so that we don't have to include those fields.

COUNTRIES: we have countries_tags, which is more robust, as the countries field can be in any language.

ATTRIBUTE_GROUPS: it's not possible to store it in the dump for now, as it's computed on the fly by the server.
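The front-end image computation suggested above could look roughly like this. The base URL and the 3/3/3/rest folder layout reflect how the Open Food Facts static image server organizes long barcodes, but treat the helper names and the exact URL template as illustrative, not as the final scheme:

```python
import re

# Assumed base URL of the static image server (illustrative).
IMAGE_BASE = "https://images.openfoodfacts.org/images/products"

def image_folder(code: str) -> str:
    """Map a barcode to its folder: long codes are split 3/3/3/rest,
    short codes are used as-is."""
    if len(code) <= 8:
        return code
    match = re.fullmatch(r"(...)(...)(...)(.*)", code)
    return "/".join(match.groups())

def selected_image_url(code: str, key: str, rev: int, size: int = 400) -> str:
    """URL of a selected image (e.g. key='front_de'), given the revision
    number found in the product's `images` field."""
    return f"{IMAGE_BASE}/{image_folder(code)}/{key}.{rev}.{size}.jpg"

print(image_folder("4388040200632"))                      # 438/804/020/0632
print(selected_image_url("4388040200632", "front_de", 4))
```

This is the trade-off being proposed: the dump ships only the raw images metadata, and the app derives the handful of *_URL fields locally.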

@monsieurtanuki
Contributor

@raphael0202 I don't know how problematic it would be not to include ATTRIBUTE_GROUPS, but if I remember correctly this is where we say "vegetarian" or "may contain nuts", right?
How big would a dump of 100k products with untranslated tags be?

Projects
Status: 💬 To discuss and validate