Develop a notebook that creates a pipeline (recipe) for running new GneissWeb transforms in sequence on some data of your choosing. #983
Comments
@Swanand-Kadhe @Hajar-Emami #965 has now been merged into the dev branch. |
@BishwaBhatta The GneissWeb fasttext classifier transform has been tested only with facebook/fasttext-language-identification model.bin |
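For reference, a minimal sketch of fetching and loading that model, assuming the `huggingface_hub` and `fasttext` packages are installed and network access is available. The repo id and `model.bin` filename come from the comment above; the sample sentence is illustrative only:

```python
# Sketch only: requires `pip install huggingface_hub fasttext` and network access.
from huggingface_hub import hf_hub_download
import fasttext

# Download the only model the classifier transform has been tested with.
model_path = hf_hub_download(
    repo_id="facebook/fasttext-language-identification",
    filename="model.bin",
)
model = fasttext.load_model(model_path)
labels, scores = model.predict("This is an English sentence.")
print(labels, scores)
```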
@Swanand-Kadhe @Hajar-Emami An update: #974 has also been merged now, but we found a last-minute bug with #953 that Shalisha should be able to fix tomorrow morning. After that is merged too, Maroun will make the PyPI release that has all 4 transforms, and you can pip install it in your notebook. A note about downloading parquet files to your notebook from HF: #1000 |
Hi, everyone. Just a quick update: all 4 new transforms have been tested and merged now, and Maroun is preparing the PyPI release that you will pip install at the top of your notebook. It should be ready any time. In the meantime, the … to compile |
@Hajar-Emami I suggest that, after adding a step 0 to your current notebook here: https://github.com/Hajar-Emami/data-prep-kit/blob/dev/examples/notebooks/GneissWeb/GneissWeb.ipynb (a step that downloads a parquet file from HF, using the HF download API discussed in #1000), you submit a PR in "draft" mode to the DPK repo, so Maroun and I can track your progress and help you as you add APIs for all the other transforms you need for your recipe. A summary of all the steps you need, from the directory where your notebook is:
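As a side note, a dependency-free sketch of such a "step 0" download cell, using only the standard library rather than the `huggingface_hub` API. The repo id and filename below are placeholders, not the actual dataset used in the recipe:

```python
import urllib.request

def hf_resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct-download URL for a file in a Hugging Face dataset repo."""
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{filename}"

# Placeholder repo and file names -- substitute your actual dataset:
url = hf_resolve_url("some-org/some-dataset", "train-00000-of-00001.parquet")
# urllib.request.urlretrieve(url, "input/data.parquet")  # uncomment to download
```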
|
@shahrokhDaijavad
```
ModuleNotFoundError                       Traceback (most recent call last)
ModuleNotFoundError: No module named 'huggingface_hub'
```
|
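That traceback means the package is simply missing from the notebook's environment; installing it into the same venv the notebook kernel uses should resolve it:

```shell
pip install huggingface_hub
```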
@Hajar-Emami As we discussed, if you want to keep testing with the "toy" data, in your first step of the notebook, you can download the parquet and arrow files from the repo itself:
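One possible shape for that first step, fetching the toy files straight from GitHub's raw-content endpoint. The paths below are placeholders, not the real locations of the test files in the repo:

```shell
# Placeholder paths -- substitute the actual test-data locations in the repo.
curl -L -o test1.parquet \
  https://raw.githubusercontent.com/IBM/data-prep-kit/dev/path/to/test-data/test1.parquet
curl -L -o test1.arrow \
  https://raw.githubusercontent.com/IBM/data-prep-kit/dev/path/to/test-data/test1.arrow
```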
|
Based on the above, here are the steps for running rep_removal next in the notebook:
For the next transform:
|
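The commands themselves were lost in the scrape above. Purely as an illustration, a hypothetical invocation might look like the following; the import path, class name, and parameter names are assumptions, not confirmed DPK API (the rep_removal example notebook in the repo shows the real invocation). Only the `tmp/files-repremoval` output folder name is taken from a later comment in this thread:

```python
# HYPOTHETICAL sketch -- import path, class name, and parameters are guesses,
# not the confirmed DPK API; see the rep_removal example notebook instead.
from dpk_rep_removal.runtime import RepRemoval  # assumed module path

RepRemoval(
    input_folder="tmp/files-input",       # assumed folder layout
    output_folder="tmp/files-repremoval"  # folder name from a later comment
).transform()
```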
It seems we also need to install pandas
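Presumably for reading the parquet files inside the notebook; pandas typically also needs pyarrow as its parquet engine, so installing both at once is the safer bet:

```shell
pip install pandas pyarrow
```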
|
Many thanks, @shahrokhDaijavad, for the commands. After creating the venv and running the commands below:
I am getting the following error:
|
Please see my comment that I added in PR #1010 |
Many thanks @shahrokhDaijavad and @touma-I. I'm getting the below error when trying to run
|
@Hajar-Emami It looks like you don't have clang++ (the C/C++ compiler) installed on your machine, and the fasttext package needs it to build. |
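On a Mac, the usual way to check for and install the compiler is through the Xcode command-line tools; the commands below are the standard macOS steps, not something specific to DPK:

```shell
# Check whether the compiler is visible to pip's build step:
clang++ --version
# If it is missing, install the Xcode command-line tools:
xcode-select --install
```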
@shahrokhDaijavad It seems like it's already installed on my Mac:
But I'm getting the below error when trying to run `pip install "data-prep-toolkit-transforms[all]==1.0.1.dev1"`
|
|
I was able to install
@Hajar-Emami For now, please do these two steps from the command line before starting Jupyter and not inside your notebook:
|
Thanks @shahrokhDaijavad. I did what you mentioned, but I still cannot find the output parquet; I can only see metadata.json. |
@Hajar-Emami After this change, you should be able to see the output parquet file in tmp/files-repremoval directory. |
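A quick check cell along these lines can confirm the output landed where expected (the folder name is taken from the comment above; metadata.json is written alongside the parquet outputs):

```python
from pathlib import Path

def parquet_files(folder):
    """Names of the parquet files a transform wrote to `folder`."""
    return sorted(p.name for p in Path(folder).glob("*.parquet"))

# Folder name from the comment above:
print(parquet_files("tmp/files-repremoval"))
```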
Thanks @shahrokhDaijavad. It works: https://github.com/Hajar-Emami/data-prep-kit/blob/rep_removal/examples/notebooks/GneissWeb/GneissWeb.ipynb. However, we do not have its corresponding arrow file to annotate with the extreme_tokenized transform. |
I am getting the following error when trying to use DCLM fasttext to annotate the data: https://github.com/IBM/data-prep-kit/blob/f5bb38e38f3b714e8b417b1228e9d1868f1f9910/examples/notebooks/GneissWeb/GneissWeb.ipynb
|
@shahrokhDaijavad @touma-I I notice the |
I've created this issue for it: #1024 |
I am getting the following error for the DCLM fasttext: https://github.com/Hajar-Emami/data-prep-kit/blob/rep_removal/examples/notebooks/GneissWeb/GneissWeb.ipynb
|
Getting the following error for the
|
Component: Other

Feature:
We now have the 4 new transforms that were used in creating GneissWeb as 3 PRs in the repo (one PR implements 2 of these transforms). These PRs are currently being tested, but they are all in reasonable shape, and all have notebook examples that will be helpful in creating ONE notebook that sequentially runs the transforms (plus existing transforms such as Filter). The sequence is:
Read Data -> Repetition Removal -> Readability/Fasttext/DCLM etc. Annotation -> Extreme Token Annotation -> Filter
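The sequence above can be sketched as a chain of input/output folders, which is how the DPK transform notebooks pass data between stages (each step reads the previous step's output folder). The step labels and `tmp/files-*` layout below are illustrative; only the `tmp/files-repremoval` name appears later in this thread:

```python
import os

# Illustrative step labels, not DPK identifiers:
STEPS = ["rep_removal", "readability_fasttext_dclm", "extreme_tokenized", "filter"]

def chain_folders(steps, root="tmp"):
    """Pair each step with its input folder (the previous step's output) and output folder."""
    pairs, src = [], os.path.join(root, "input")
    for step in steps:
        dst = os.path.join(root, f"files-{step}")
        pairs.append((src, dst))
        src = dst
    return pairs

for src, dst in chain_folders(STEPS):
    print(f"{src} -> {dst}")
```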
The 3 PRs are:
#953 (rep removal), #965 (extreme tokenization and readability), and #974 (fasttext classification).
The relevant notebooks are all in the main transform directory, for example: https://github.com/swith005/data-prep-kit-outer/blob/rep_removal/transforms/universal/rep_removal/rep_removal.ipynb for rep removal.
@touma-I @yousafshah @Swanand-Kadhe @Hajar-Emami
Are you willing to submit a PR?