
Swap the Search-Based product download to the Dump based product download #5787

teolemon opened this issue Nov 5, 2024 · 16 comments

@teolemon
Member

teolemon commented Nov 5, 2024

What

@teolemon
Member Author

teolemon commented Nov 8, 2024

@raphael0202

Please note it's not the final URL; the canonical URL will be under the static.openfoodfacts.org domain (even though we will set up a redirect, it's not done yet).

@monsieurtanuki
Contributor

I'm not sure what we can do with that kind of file:

  • 75 MB zipped - I'm not sure we can download that much in Flutter; it had better be fast
  • I haven't seen how big it is unzipped
  • we currently get tons of product fields (everything except knowledge panels) for a given country and language - can you provide that?

@teolemon
Member Author

teolemon commented Nov 8, 2024

  • It's 200 MB unzipped, and it provides only the existence of the product in Open Food Facts plus the 3 scores.
  • We could eventually have an add-on option for people who want complete products, but that's not technically doable outside of the legacy search API, and it might not be for a while (attributes and knowledge panels are generated on the fly).
  • This extract allows basic offline scan functionality on the scan card, and unlocks offline product addition without a major overwrite risk. This is available today.

@monsieurtanuki
Contributor

For the record, we're talking about 3.5 million products, with this kind of information:

code    product_name    quantity        brands  nutrition_grade_fr      nova_group      ecoscore_grade
0010374162716   Kicked Up Key Lime Pie          The Father's Table Llc  d       4       unknown
0010374162747   Chocolate lover's cheesecake            The Father's Table Llc  e       4       unknown
0010374162754   The Father's Table, Winter Variety Cheesecake           The Father's Table Llc  d       4       unknown
0010374162761   The Father's Table, Cheesecake, Raspberry Lemonade              The Father's Table Llc  e       4       unknown

To be compared with the top 10K products for a given country/language with relevant data.
Anyway, downloading/unzipping/loading that .tsv.gz into the local database would be a challenge, and I have doubts about the added value.
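For scale, the download/unzip/load pipeline discussed above can be sketched with a streaming reader, so the ~200 MB decompressed dump never has to sit in memory at once. This is an illustrative sketch only (the file name and demo rows are made up, reusing the column layout of the sample above), not the app's actual implementation:

```python
import csv
import gzip
import os
import tempfile

def stream_dump(path):
    """Yield one dict per product from a gzipped TSV dump, without ever
    holding the full decompressed file (~200 MB) in memory."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")

# Tiny self-contained demo reusing the column layout from the sample above.
path = os.path.join(tempfile.mkdtemp(), "demo.tsv.gz")
with gzip.open(path, "wt", encoding="utf-8") as out:
    out.write(
        "code\tproduct_name\tquantity\tbrands\tnutrition_grade_fr\tnova_group\tecoscore_grade\n"
        "0010374162716\tKicked Up Key Lime Pie\t\tThe Father's Table Llc\td\t4\tunknown\n"
    )

products = list(stream_dump(path))
print(products[0]["code"], products[0]["nutrition_grade_fr"])  # 0010374162716 d
```

Even with streaming, inserting millions of rows into the local database remains the expensive part, which is the concern raised above.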

@teolemon
Member Author

teolemon commented Nov 9, 2024

With the new backend, we'll be able to split by country, but France will still be a whopping 1M. As a reference, other apps are storing the top 150k products.

@raphael0202

Using Parquet files instead of TSV would bring some benefits in my opinion. First, generating country-specific dumps is almost instant on the backend.

I generated Parquet files with the same data as the TSV; here are the file sizes:

-rw-rw-r--  1 raphael raphael 122M nov.   9 13:10 products-min.parquet
-rw-rw-r--  1 raphael raphael 9,7M nov.   9 13:17 products-min-de.parquet
-rw-rw-r--  1 raphael raphael 9,4M nov.   9 13:16 products-min-es.parquet
-rw-rw-r--  1 raphael raphael  34M nov.   9 13:14 products-min-fr.parquet
-rw-rw-r--  1 raphael raphael 4,0M nov.   9 13:17 products-min-uk.parquet

products-min.parquet being the Parquet file for all countries.

Second, these Parquet files can be queried directly using duckdb, and the query is really fast thanks to Parquet's columnar storage and statistics:

D SELECT * FROM '/home/raphael/datasets/products-min-de.parquet' WHERE code = '4388040200632';
┌───────────────┬──────────────────┬──────────┬─────────────┬────────────────────┬────────────┬────────────────┐
│     code      │   product_name   │ quantity │ brands_tags │ nutrition_grade_fr │ nova_group │ ecoscore_grade │
│    varchar    │     varchar      │ varchar  │  varchar[]  │      varchar       │   int64    │    varchar     │
├───────────────┼──────────────────┼──────────┼─────────────┼────────────────────┼────────────┼────────────────┤
│ 4388040200632 │ Thunfisch chunks │ 185g     │ [economy]   │ unknown            │            │ unknown        │
└───────────────┴──────────────────┴──────────┴─────────────┴────────────────────┴────────────┴────────────────┘
Run Time (s): real 0.014 user 0.031547 sys 0.001925

It looks like there is a duckdb integration for Flutter; it could be an option: https://pub.dev/packages/dart_duckdb

@monsieurtanuki
Contributor

With the new backend, we'll be able to split by country

What about the language, especially for multilanguage countries?

but France will still be a whopping 1M. As a reference, other apps are storing the top 150k products.

What about tranches of 100k products?

I'm concerned about:

  • heterogeneous data (e.g. monolingual data from the dump and multilingual data from the search - potential side effects when a user changes the labels)
  • one-shot actions - I don't think our background tasks, or even Flutter on a smartphone, are compatible with long tasks that use a lot of memory
  • the size of the database - I can try to load the database with 1M rows of crap data and see how the size evolves
  • I can downgrade the set of fields retrieved for offline data to barcode, names, ns, es, brands, nova for performance tests

@raphael0202 I'll have a look at your duckdb suggestion. That would mean another database (after shared preferences, hive and sqflite), and a sort of meta product coming either from Parquet or sqflite.

@raphael0202

What about the language, especially for multilanguage countries?

We're currently working on adding language fields (ex: ingredients_text_{LANG}) on the Parquet file in a scalable manner. Currently, only the ingredients_text_{LANG} fields are added, but in bulk at the root of the schema (see https://huggingface.co/datasets/openfoodfacts/product-database).

What about tranches of 100k products?

That's possible as well!

What are the fields you currently fetch from the Open Food Facts search API to get the offline data? We can match these fields in the dump so that you don't need to query the search API.

@raphael0202 I'll have a look at your duckdb. That would mean another database (after shared preferences, hive and sqflite), and a sort of meta product coming either from Parquet or SQFLite.

duckdb can be used as a real SQL DB, but it can also query Parquet files directly (as shown above).

@monsieurtanuki
Contributor

monsieurtanuki commented Nov 12, 2024

We're currently working on adding language fields (ex: ingredients_text_{LANG}) on the Parquet file in a scalable manner. Currently, only the ingredients_text_{LANG} fields are added, but in bulk at the root of the schema (see https://huggingface.co/datasets/openfoodfacts/product-database).

How would you do that for multilingual products with a .tsv file?

What about tranches of 100k products?

That's possible as well!

Cool!

What are the fields you currently fetch from the Open Food Facts search API to get the offline data? We can match these fields in the dump so that you don't need to query the search API.

That would be madness (sorry about the poor presentation: my keyboard is acting up and prevents me from using the Alt Gr key, which is getting very problematic)
[Edited]

  static List<ProductField> get fields => const <ProductField>[
        ProductField.NAME,
        ProductField.NAME_ALL_LANGUAGES,
        ProductField.BRANDS,
        ProductField.BARCODE,
        ProductField.NUTRISCORE,
        ProductField.FRONT_IMAGE,
        ProductField.IMAGE_FRONT_URL,
        ProductField.IMAGE_INGREDIENTS_URL,
        ProductField.IMAGE_NUTRITION_URL,
        ProductField.IMAGE_PACKAGING_URL,
        ProductField.IMAGES,
        ProductField.SELECTED_IMAGE,
        ProductField.QUANTITY,
        ProductField.SERVING_SIZE,
        ProductField.STORES,
        ProductField.PACKAGING_QUANTITY,
        ProductField.PACKAGING,
        ProductField.PACKAGINGS,
        ProductField.PACKAGINGS_COMPLETE,
        ProductField.PACKAGING_TAGS,
        ProductField.PACKAGING_TEXT_ALL_LANGUAGES,
        ProductField.NO_NUTRITION_DATA,
        ProductField.NUTRIMENT_DATA_PER,
        ProductField.NUTRITION_DATA,
        ProductField.NUTRIMENTS,
        ProductField.NUTRIENT_LEVELS,
        ProductField.NUTRIMENT_ENERGY_UNIT,
        ProductField.ADDITIVES,
        ProductField.INGREDIENTS_ANALYSIS_TAGS,
        ProductField.INGREDIENTS_TEXT,
        ProductField.INGREDIENTS_TEXT_ALL_LANGUAGES,
        ProductField.LABELS_TAGS,
        ProductField.LABELS_TAGS_IN_LANGUAGES,
        ProductField.COMPARED_TO_CATEGORY,
        ProductField.CATEGORIES_TAGS,
        ProductField.CATEGORIES_TAGS_IN_LANGUAGES,
        ProductField.LANGUAGE,
        ProductField.ATTRIBUTE_GROUPS,
        ProductField.STATES_TAGS,
        ProductField.ECOSCORE_DATA,
        ProductField.ECOSCORE_GRADE,
        ProductField.ECOSCORE_SCORE,
        // ProductField.KNOWLEDGE_PANELS, // that's the only one we don't use for offline download
        ProductField.COUNTRIES,
        ProductField.COUNTRIES_TAGS,
        ProductField.COUNTRIES_TAGS_IN_LANGUAGES,
        ProductField.EMB_CODES,
        ProductField.ORIGINS,
        ProductField.WEBSITE,
        ProductField.OBSOLETE,
        ProductField.OWNER_FIELDS,
      ];

duckdb can be used as a real SQL DB, but it can also query Parquet files directly (as shown above).

I'm still not 100% confident. For instance, sqflite starts immediately and returns results almost immediately, while shared preferences and hive (most of the time) need to load the data into memory at startup. How would duckdb behave at startup? To be tested.

@raphael0202

raphael0202 commented Nov 14, 2024

How would you do that for multilingual products with a .tsv file?

It would be a Parquet file and not a TSV. I'm currently working on the implementation.

Thank you for the list! I will investigate to see if we can add the missing ones.

I'm still not 100% confident. For instance, sqflite starts immediately and returns results almost immediately, while shared preferences and hive (most of the time) need to load the data into memory at startup. How would duckdb behave at startup? To be tested.

There is no startup penalty. duckdb does not need to load all the data into memory before the query, as it takes advantage of the row group statistics of the Parquet file: it only reads from disk the data that matches the WHERE clause of the query, and only fetches the requested columns, thanks to Parquet's column-oriented storage.

@monsieurtanuki
Contributor

@raphael0202 Before you start implementing anything, please roughly compute the amount of data involved.
In addition to that, I don't know how you would split a Parquet file into 10m or 100m product tranches.
Good luck!

@raphael0202

raphael0202 commented Nov 14, 2024

Do you have the original field names (i.e. the names of the fields requested from the Open Food Facts API)?

It would help me fetch the right data.

In addition to that, I don't know how you would split a Parquet file into 10m or 100m product tranches.

Tranches of 10 000 or 100 000 products?
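Whatever the tranche size ends up being, the splitting itself is simple on the client side; server-side, a Parquet writer could equally page through the data with LIMIT/OFFSET. A minimal sketch of the chunking (the sizes here are illustrative):

```python
from itertools import islice

def tranches(products, size):
    """Yield successive lists of at most `size` products from any iterable,
    so a large dump can be processed one tranche at a time."""
    iterator = iter(products)
    while chunk := list(islice(iterator, size)):
        yield chunk

# 3.5M products in tranches of 100 000 -> 35 full tranches.
parts = list(tranches(range(3_500_000), 100_000))
print(len(parts), len(parts[0]))  # 35 100000
```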

@monsieurtanuki
Contributor

Quickly said, you can find the field names as "offTag" in https://github.com/openfoodfacts/openfoodfacts-dart/blob/master/lib/src/utils/product_fields.dart

TBH I don't believe in 100k products: probably too big.
I guess once we have stats for 10k we can compute stats for 100k. And vice versa.

@raphael0202

I added most of the missing fields to the Parquet data dump. A few remarks below:

FRONT_IMAGE IMAGE_FRONT_URL IMAGE_INGREDIENTS_URL IMAGE_NUTRITION_URL IMAGE_PACKAGING_URL SELECTED_IMAGE

-> these fields can easily be computed from the images field. To keep the dump lightweight, I would advocate letting the front-end compute them from the images field.

NUTRIMENT_ENERGY_UNIT is supposed to map to nutriment_energy_unit in the API, but I couldn't find this value in the data dump (?).

LABELS_TAGS_IN_LANGUAGES
CATEGORIES_TAGS_IN_LANGUAGES
COUNTRIES_TAGS_IN_LANGUAGES

That's the tricky part: storing this would significantly increase the dump size. I think we should explore having the mobile app download language-specific translation files for all taxonomies, so that we don't have to include those fields.

COUNTRIES: we have countries_tags, which is more robust, as the countries field can be in any language.

ATTRIBUTE_GROUPS: it's not possible to store it in the dump for now, as it's computed on the fly by the server.
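The front-end image computation suggested above could look roughly like this. The base URL and the 3/3/3/rest folder layout reflect how the Open Food Facts static image server organizes long barcodes, but treat the helper names and the exact URL template as illustrative, not as the final scheme:

```python
import re

# Assumed base URL of the static image server (illustrative).
IMAGE_BASE = "https://images.openfoodfacts.org/images/products"

def image_folder(code: str) -> str:
    """Map a barcode to its folder: long codes are split 3/3/3/rest,
    short codes are used as-is."""
    if len(code) <= 8:
        return code
    match = re.fullmatch(r"(...)(...)(...)(.*)", code)
    return "/".join(match.groups())

def selected_image_url(code: str, key: str, rev: int, size: int = 400) -> str:
    """URL of a selected image (e.g. key='front_de'), given the revision
    number found in the product's `images` field."""
    return f"{IMAGE_BASE}/{image_folder(code)}/{key}.{rev}.{size}.jpg"

print(image_folder("4388040200632"))                      # 438/804/020/0632
print(selected_image_url("4388040200632", "front_de", 4))
```

This is the trade-off being proposed: the dump ships only the raw images metadata, and the app derives the handful of *_URL fields locally.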

@monsieurtanuki
Contributor

@raphael0202 I don't know how problematic it would be not to include ATTRIBUTE_GROUPS, but if I remember correctly this is where we say "vegetarian" or "may contain nuts", right?
How big would a dump of 100k products with untranslated tags be?

Projects
Status: 💬 To discuss and validate