Swap the Search-Based product download to the Dump based product download #5787
@monsieurtanuki We have a new dump for OFF, courtesy of @raphael0202
Please note it's not the final URL; the canonical URL will be under the static.openfoodfacts.org domain (we will set up a redirect, but it's not done yet).
I'm not sure what we can do with that kind of file.
For the record, we're talking about 3.5 million products, with that kind of information.
To be compared with the top 10K products for a given country/language with relevant data.
With the new backend we'll be able to split by country, but France will still be a whopping 1M. As a reference, other apps store the top 150k products.
Using parquet files instead of TSV would bring some benefits, in my opinion. First, generating country-specific dumps is almost instant on the backend. I generated Parquet files with the same data as the TSV; here are the file sizes.
Second, these parquet files can be queried directly using duckdb, and the query is really fast thanks to Parquet's columnar storage and statistics:

D SELECT * FROM '/home/raphael/datasets/products-min-de.parquet' WHERE code = '4388040200632';
┌───────────────┬──────────────────┬──────────┬─────────────┬────────────────────┬────────────┬────────────────┐
│ code │ product_name │ quantity │ brands_tags │ nutrition_grade_fr │ nova_group │ ecoscore_grade │
│ varchar │ varchar │ varchar │ varchar[] │ varchar │ int64 │ varchar │
├───────────────┼──────────────────┼──────────┼─────────────┼────────────────────┼────────────┼────────────────┤
│ 4388040200632 │ Thunfisch chunks │ 185g │ [economy] │ unknown │ │ unknown │
└───────────────┴──────────────────┴──────────┴─────────────┴────────────────────┴────────────┴────────────────┘
Run Time (s): real 0.014 user 0.031547 sys 0.001925

It looks like there is a duckdb integration for Flutter; it could be an option: https://pub.dev/packages/dart_duckdb
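The near-instant country-specific splitting mentioned above boils down to grouping dump rows by their country tags. A minimal pure-Python sketch, assuming a TSV-style dump with a comma-separated `countries_tags` column (the real backend pipeline may differ):

```python
import csv
import io
from collections import defaultdict

def split_dump_by_country(tsv_text):
    """Group rows of a product dump by country tag.

    Returns a dict mapping country tag -> list of product rows,
    so each group can be written out as its own per-country file.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    by_country = defaultdict(list)
    for row in reader:
        # countries_tags is comma-separated in the full dump;
        # a multi-country product lands in every matching file.
        for tag in row["countries_tags"].split(","):
            by_country[tag.strip()].append(row)
    return by_country

dump = (
    "code\tproduct_name\tcountries_tags\n"
    "123\tTofu\ten:france,en:germany\n"
    "456\tJam\ten:france\n"
)
groups = split_dump_by_country(dump)
print(sorted(groups))            # ['en:france', 'en:germany']
print(len(groups["en:france"]))  # 2
```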
What about the language, especially for multilanguage countries?
What about tranches of 100k products? I'm concerned about:
@raphael0202 I'll have a look at your duckdb suggestion. That would mean another database (after shared preferences, hive and sqflite), and a sort of meta product coming either from Parquet or SQFlite.
We're currently working on adding language fields (e.g.
That's possible as well! What are the fields you currently fetch from the Open Food Facts search API to get the offline data? We can match these fields in the dump so that you don't need to query the search API.
duckdb can be used as a real SQL DB, but it can also query Parquet files directly (as shown above).
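To illustrate the embedded "real SQL DB" usage pattern being discussed, here is a sketch using Python's stdlib sqlite3 as a stand-in for duckdb: the in-process open-connection/execute shape is the same, but this is not the actual duckdb code.

```python
import sqlite3

# In-process, single-file embedded database: the same usage pattern
# duckdb offers, shown with the stdlib sqlite3 module as a stand-in.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (code TEXT PRIMARY KEY, product_name TEXT)")
con.execute("INSERT INTO products VALUES ('4388040200632', 'Thunfisch chunks')")

# Point lookup by barcode, as in the duckdb example above.
row = con.execute(
    "SELECT product_name FROM products WHERE code = ?",
    ("4388040200632",),
).fetchone()
print(row[0])  # Thunfisch chunks
```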
How would you do that for multilingual products with a .tsv file?
Cool!
That would be madness (sorry about the poor presentation; my keyboard is acting up and prevents me from using the Alt Gr key, which is getting very problematic):
static List<ProductField> get fields => const <ProductField>[
ProductField.NAME,
ProductField.NAME_ALL_LANGUAGES,
ProductField.BRANDS,
ProductField.BARCODE,
ProductField.NUTRISCORE,
ProductField.FRONT_IMAGE,
ProductField.IMAGE_FRONT_URL,
ProductField.IMAGE_INGREDIENTS_URL,
ProductField.IMAGE_NUTRITION_URL,
ProductField.IMAGE_PACKAGING_URL,
ProductField.IMAGES,
ProductField.SELECTED_IMAGE,
ProductField.QUANTITY,
ProductField.SERVING_SIZE,
ProductField.STORES,
ProductField.PACKAGING_QUANTITY,
ProductField.PACKAGING,
ProductField.PACKAGINGS,
ProductField.PACKAGINGS_COMPLETE,
ProductField.PACKAGING_TAGS,
ProductField.PACKAGING_TEXT_ALL_LANGUAGES,
ProductField.NO_NUTRITION_DATA,
ProductField.NUTRIMENT_DATA_PER,
ProductField.NUTRITION_DATA,
ProductField.NUTRIMENTS,
ProductField.NUTRIENT_LEVELS,
ProductField.NUTRIMENT_ENERGY_UNIT,
ProductField.ADDITIVES,
ProductField.INGREDIENTS_ANALYSIS_TAGS,
ProductField.INGREDIENTS_TEXT,
ProductField.INGREDIENTS_TEXT_ALL_LANGUAGES,
ProductField.LABELS_TAGS,
ProductField.LABELS_TAGS_IN_LANGUAGES,
ProductField.COMPARED_TO_CATEGORY,
ProductField.CATEGORIES_TAGS,
ProductField.CATEGORIES_TAGS_IN_LANGUAGES,
ProductField.LANGUAGE,
ProductField.ATTRIBUTE_GROUPS,
ProductField.STATES_TAGS,
ProductField.ECOSCORE_DATA,
ProductField.ECOSCORE_GRADE,
ProductField.ECOSCORE_SCORE,
// ProductField.KNOWLEDGE_PANELS, // that's the only one we don't use for offline download
ProductField.COUNTRIES,
ProductField.COUNTRIES_TAGS,
ProductField.COUNTRIES_TAGS_IN_LANGUAGES,
ProductField.EMB_CODES,
ProductField.ORIGINS,
ProductField.WEBSITE,
ProductField.OBSOLETE,
ProductField.OWNER_FIELDS,
];
I'm still not 100% confident. For instance, SQFlite starts immediately and returns results almost immediately, while shared preferences and hive (most of the time) need to load the data into memory at startup. How would duckdb behave at startup? To be tested.
It would be a Parquet file, not a TSV. I'm currently working on the implementation. Thank you for the list! I will investigate whether we can add the missing ones.
There is no startup penalty. duckdb does not need to load all the data into memory before the query: it takes advantage of the row-group statistics of the Parquet file, so it only reads from disk the data that matches the WHERE clause of the query, and fetches only the requested columns, thanks to Parquet's column-oriented storage.
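The row-group pruning described here can be illustrated with a toy model: each Parquet row group carries per-column min/max statistics, and a point lookup only reads the groups whose range can contain the value. A sketch (the barcodes and group boundaries are made up):

```python
# Toy model of Parquet row-group pruning: a reader checks each
# group's min/max statistics for the `code` column and skips any
# group that cannot contain the looked-up barcode.
row_groups = [
    {"min": "0000000000000", "max": "2999999999999", "rows": 100_000},
    {"min": "3000000000000", "max": "4999999999999", "rows": 100_000},
    {"min": "5000000000000", "max": "9999999999999", "rows": 100_000},
]

def groups_to_read(code):
    # Only groups whose [min, max] range covers the value are read.
    return [i for i, g in enumerate(row_groups)
            if g["min"] <= code <= g["max"]]

# Only one of three groups needs to be touched for this barcode,
# which is why there is no "load everything at startup" phase.
print(groups_to_read("4388040200632"))  # [1]
```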
@raphael0202 Before you start implementing anything, please roughly compute the amount of data involved.
Do you have the original field names (i.e. the names of the fields requested from the Open Food Facts API)? It would help me fetch the right data.
Tranches of 10 000 or 100 000 products?
In short, you can find the field names as "offTag" in https://github.com/openfoodfacts/openfoodfacts-dart/blob/master/lib/src/utils/product_fields.dart. TBH I don't believe in 100k products: probably too big.
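The enum-name-to-offTag mapping can be sketched as a simple lookup table. The values below are assumptions for a few well-known fields only; the authoritative mapping lives in product_fields.dart:

```python
# Illustrative subset only: the real enum-name -> offTag mapping is
# defined in lib/src/utils/product_fields.dart of openfoodfacts-dart.
OFF_TAGS = {
    "NAME": "product_name",
    "BARCODE": "code",
    "BRANDS": "brands",
    "QUANTITY": "quantity",
    "SERVING_SIZE": "serving_size",
    "INGREDIENTS_TEXT": "ingredients_text",
    "CATEGORIES_TAGS": "categories_tags",
    "LABELS_TAGS": "labels_tags",
    "ECOSCORE_GRADE": "ecoscore_grade",
}

def api_fields(enum_names):
    """Translate ProductField enum names to API field names,
    skipping any we don't have a mapping for."""
    return [OFF_TAGS[n] for n in enum_names if n in OFF_TAGS]

print(api_fields(["NAME", "BARCODE", "KNOWLEDGE_PANELS"]))
# ['product_name', 'code']
```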
I added most missing fields in the Parquet data dump. A few remarks below:
-> these fields can be computed from the
That's the tricky part: storing these would significantly increase the dump size. I think we should explore downloading, in the mobile app, language-specific translation files for all taxonomies, so that we don't have to ship those fields in the dump.
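The idea of shipping only canonical tags and translating on-device could look like the sketch below. The translation-file name and contents are hypothetical; only the lookup-with-fallback pattern is the point:

```python
# Sketch: keep only canonical tags (e.g. "en:organic") in the dump,
# and resolve display names on-device from a per-language taxonomy
# translation file downloaded separately.

labels_de = {  # hypothetical content of a downloaded labels.de file
    "en:organic": "Bio",
    "en:fair-trade": "Fairer Handel",
}

def localize_tags(tags, translations):
    # Fall back to the raw canonical tag when no translation exists.
    return [translations.get(t, t) for t in tags]

print(localize_tags(["en:organic", "en:vegan"], labels_de))
# ['Bio', 'en:vegan']
```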
@raphael0202 I don't know how problematic it'd be not to include ATTRIBUTE_GROUPS, but if I remember correctly this is where we say "vegetarian" or "may contain nuts", right?