
Develop a notebook that creates a pipeline (recipe) for running new GneissWeb transforms in sequence on some data of your choosing. #983

Open
1 of 2 tasks
shahrokhDaijavad opened this issue Jan 27, 2025 · 25 comments

@shahrokhDaijavad
Member

shahrokhDaijavad commented Jan 27, 2025

Search before asking

  • I searched the issues and found no similar issues.

Component

Other

Feature

We now have the 4 new transforms that were used in creating GneissWeb as 3 PRs in the repo (one PR implements 2 of these transforms). These PRs are currently being tested, but they are all in reasonable shape, and each has a notebook example that will be helpful in creating ONE notebook that runs the transforms sequentially (plus existing transforms such as Filter). The sequence is:
Read Data -> Repetition Removal -> Readability/FastText/DCLM etc. Annotation -> Extreme Token Annotation -> Filter
The 3 PRs are:
#953 (rep removal), #965 (extreme tokenization and readability), and #974 (fasttext classification).
The relevant notebooks are all in the main transform directory, for example: https://github.com/swith005/data-prep-kit-outer/blob/rep_removal/transforms/universal/rep_removal/rep_removal.ipynb for rep removal.
@touma-I @yousafshah @Swanand-Kadhe @Hajar-Emami

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@shahrokhDaijavad
Member Author

@Swanand-Kadhe @Hajar-Emami #965 has now been merged into the dev branch.

@shahrokhDaijavad
Member Author

@BishwaBhatta The GneissWeb fasttext classifier transform has been tested only with the facebook/fasttext-language-identification model.bin.
Whenever the models that were used in creating the GneissWeb fasttext classification are uploaded to HF or a similar public place, please let us know.
Also, I assume we need to download the small number of input parquet/arrow files that @Swanand-Kadhe and @Hajar-Emami need for their notebook from a public place, correct?

@shahrokhDaijavad
Member Author

@Swanand-Kadhe @Hajar-Emami An update: #974 has also been merged now, but we found a last-minute bug with #953 that Shalisha should be able to fix tomorrow morning. After that is merged too, Maroun will make the pypi release that has all 4 transforms and you can pip install in your notebook.

A note about downloading parquet files to your notebook from HF: #1000
These few lines of Python could be useful for downloading real files, not just toy files.
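As a hedged sketch of such a download step (the helper name here is mine; the repo id and file path follow the FineWeb example quoted later in this thread, and `huggingface_hub` must be installed):

```python
import os

def fetch_fineweb_file(filename: str, local_dir: str = "tmp/input") -> str:
    """Download one real parquet file from the FineWeb dataset repo on the HF Hub.

    Assumes `huggingface_hub` is installed (pip install huggingface_hub).
    Returns the local path of the downloaded file.
    """
    from huggingface_hub import hf_hub_download
    os.makedirs(local_dir, exist_ok=True)
    return hf_hub_download(
        repo_id="HuggingFaceFW/fineweb",
        filename=filename,
        repo_type="dataset",
        local_dir=local_dir,
    )

# Example (this fetches a large real file, so run it deliberately):
# path = fetch_fineweb_file("data/CC-MAIN-2013-20/000_00000.parquet")
```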

@shahrokhDaijavad
Member Author

Hi, everyone. Just a quick update: all 4 new transforms have been tested and merged now, and Maroun is preparing the PyPI release that you will pip install at the top of your notebook. It should be ready anytime. In the meantime, note that the rep_removal transform is a little more complex than the others (because of the Google Rust code). Please read https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/rep_removal/README.md (especially the "Running on the M1 Mac" section) and let me know if it is clear. Basically, you have to install Rust on the machine, and then in your notebook, after pip installing the latest data-prep-toolkit-transforms[ray,all], add the lines:

PACKAGE_LOCATION=$(pip show data_prep_toolkit_transforms | grep Location | awk '{print $2}')
cargo install --path $PACKAGE_LOCATION/dpk_rep_removal/rust

to compile the dedup_dataset binary that the transform needs.
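As a quick sanity check (assuming `cargo install` placed its binaries in ~/.cargo/bin and that directory is on your PATH), you can verify from the notebook that the binary was built:

```python
import shutil

def find_dedup_dataset():
    """Return the full path to the dedup_dataset binary if it is on PATH, else None."""
    return shutil.which("dedup_dataset")

print("dedup_dataset binary:", find_dedup_dataset() or "NOT FOUND - check cargo install")
```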

@shahrokhDaijavad
Member Author

@Hajar-Emami I suggest that you add a step 0 to your current notebook here: https://github.com/Hajar-Emami/data-prep-kit/blob/dev/examples/notebooks/GneissWeb/GneissWeb.ipynb, which will download a parquet file from HF (use the HF download API discussed in #1000), and then submit a PR in "draft" mode to the DPK repo, so Maroun and I can track your progress and help you as you add APIs for all the other transforms you need for your recipe.

A summary of all the steps you need, run from the directory where your notebook is:

python -m venv venv
source venv/bin/activate
pip install "data-prep-toolkit-transforms[all]==1.0.1.dev1"
PACKAGE_LOCATION=$(pip show data_prep_toolkit_transforms | grep Location | awk '{print $2}')
cargo install --path $PACKAGE_LOCATION/dpk_rep_removal/rust
pip install jupyterlab
jupyter lab 

@Hajar-Emami

@shahrokhDaijavad
I am getting the following error : #1008


ModuleNotFoundError Traceback (most recent call last)
Cell In[1], line 1
----> 1 from huggingface_hub import hf_hub_download
2 import pandas as pd
4 REPO_ID = "HuggingFaceFW/fineweb"

ModuleNotFoundError: No module named 'huggingface_hub'

@shahrokhDaijavad
Member Author

shahrokhDaijavad commented Feb 3, 2025

@Hajar-Emami !pip install --upgrade huggingface_hub will fix the error above.

As we discussed, if you want to keep testing with the "toy" data, in your first step of the notebook, you can download the parquet and arrow files from the repo itself:

import urllib.request
import shutil
shutil.os.makedirs("tmp/input", exist_ok=True)
urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/language/extreme_tokenized/test-data/input/test1.parquet", "tmp/input/test1.parquet")
urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/language/extreme_tokenized/test-data/input/arrow/test1.arrow", "tmp/input/test1.arrow")

@shahrokhDaijavad
Member Author

Based on the above, here are the steps for running rep_removal next in the notebook:

from dpk_rep_removal.runtime import RepRemoval

RepRemoval(input_folder="tmp/input",
           output_folder="tmp/files-repremoval",
           rep_removal_contents_column_name="text",
           rep_removal_num_threads="1",
           ).transform()

For the next transform, tmp/files-repremoval becomes the input_folder, and we define a new folder for the output of that transform, and so on.
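The folder-chaining pattern can be sketched with a small helper (a hedged illustration only; the stage and folder names below are placeholders, not the exact transform APIs):

```python
# Each stage reads the previous stage's output folder; only the first stage
# reads the original input. Stage/folder names here are illustrative.
STAGES = [
    ("rep_removal", "tmp/files-repremoval"),
    ("annotation", "tmp/files-annotation"),
    ("extreme_tokenized", "tmp/files-extreme-tokenized"),
    ("filter", "tmp/files-filter"),
]

def folder_pairs(initial_input, stages):
    """Yield (input_folder, output_folder) for each pipeline stage."""
    current = initial_input
    for _name, out_folder in stages:
        yield current, out_folder
        current = out_folder

pairs = list(folder_pairs("tmp/input", STAGES))
# pairs[0] == ("tmp/input", "tmp/files-repremoval")
# pairs[1] == ("tmp/files-repremoval", "tmp/files-annotation")
```

Each (input_folder, output_folder) pair would then be passed to the corresponding transform's constructor before calling .transform().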

@Hajar-Emami

It seems we also need to install pandas:

/Users/hajaremami/Desktop/DPK_notebook/data-prep-kit/examples/notebooks/GneissWeb/venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[4], line 2
      1 from huggingface_hub import hf_hub_download
----> 2 import pandas as pd
      4 REPO_ID = "HuggingFaceFW/fineweb"
      5 FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"

ModuleNotFoundError: No module named 'pandas'

@Hajar-Emami

Hajar-Emami commented Feb 4, 2025

Many thanks, @shahrokhDaijavad, for the commands. After creating the venv and running the commands below:

pip install "data-prep-toolkit-transforms==1.0.1.dev1"
PACKAGE_LOCATION=$(pip show data_prep_toolkit_transforms | grep Location | awk '{print $2}')
cargo install --path $PACKAGE_LOCATION/dpk_rep_removal/rust
pip install jupyterlab
jupyter lab 

I am getting the following error:
#1010

ModuleNotFoundError                       Traceback (most recent call last)
Cell In[3], line 1
----> 1 from dpk_rep_removal.runtime import RepRemoval

ModuleNotFoundError: No module named 'dpk_rep_removal'

@shahrokhDaijavad
Member Author

Please see my comment that I added in PR #1010

@Hajar-Emami

Hajar-Emami commented Feb 4, 2025

Many thanks @shahrokhDaijavad and @touma-I. I'm getting the below error when trying to run pip install "data-prep-toolkit-transforms[all]==1.0.1.dev1":

      In file included from python/fasttext_module/fasttext/pybind/fasttext_pybind.cc:9:
      src/args.h:11:10: fatal error: 'istream' file not found
         11 | #include <istream>
            |          ^~~~~~~~~
      1 error generated.
      error: command '/usr/bin/clang++' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for fasttext
Failed to build fasttext
ERROR: Failed to build installable wheels for some pyproject.toml based projects (fasttext)

@shahrokhDaijavad
Member Author

@Hajar-Emami It looks like you don't have clang++ (the C/C++ compiler) installed on your machine, and the fasttext package needs it.
I think the easiest way to get clang++ is to install the Xcode CLI tools from here: https://mac.install.guide/commandlinetools/
@touma-I Any other suggestion?

@Hajar-Emami

Hajar-Emami commented Feb 5, 2025

@shahrokhDaijavad It seems like it’s already installed on my Mac:

wecm-9-67-124-36:~ hajaremami$ xcode-select -p
/Library/Developer/CommandLineTools

But I'm getting the below error when trying to run pip install "data-prep-toolkit-transforms[all]==1.0.1.dev1"

      building 'fasttext_pybind' extension
      creating build/temp.macosx-14-arm64-cpython-310/python/fasttext_module/fasttext/pybind
      creating build/temp.macosx-14-arm64-cpython-310/src
      clang++ -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk -I/private/var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/pip-build-env-raf8rn8a/overlay/lib/python3.10/site-packages/pybind11/include -I/private/var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/pip-build-env-raf8rn8a/overlay/lib/python3.10/site-packages/pybind11/include -Isrc -I/Users/hajaremami/Desktop/DPK_notebook/data-prep-kit/examples/notebooks/GneissWeb/venv/include -I/opt/homebrew/opt/[email protected]/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c python/fasttext_module/fasttext/pybind/fasttext_pybind.cc -o build/temp.macosx-14-arm64-cpython-310/python/fasttext_module/fasttext/pybind/fasttext_pybind.o -stdlib=libc++ -DVERSION_INFO=\"0.9.3\" -std=c++17 -fvisibility=hidden
      In file included from python/fasttext_module/fasttext/pybind/fasttext_pybind.cc:9:
      src/args.h:11:10: fatal error: 'istream' file not found
         11 | #include <istream>
            |          ^~~~~~~~~
      1 error generated.
      error: command '/usr/bin/clang++' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for fasttext
Failed to build fasttext
ERROR: Could not build wheels for fasttext, which is required to install pyproject.toml-based projects

@Hajar-Emami

Building wheels for collected packages: fasttext
  Building wheel for fasttext (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for fasttext (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [43 lines of output]
      /private/var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/pip-build-env-afivwa8r/overlay/lib/python3.10/site-packages/setuptools/dist.py:493: SetuptoolsDeprecationWarning: Invalid dash-separated options
      !!
      
              ********************************************************************************
              Usage of dash-separated 'description-file' will not be supported in future
              versions. Please use the underscore name 'description_file' instead.
      
              By 2025-Mar-03, you need to update your project and remove deprecated calls
              or your builds will no longer be supported.
      
              See https://setuptools.pypa.io/en/latest/userguide/declarative_config.html for details.
              ********************************************************************************
      
      !!
        opt = self.warn_dash_deprecation(opt, section)
      running bdist_wheel
      running build
      running build_py

@Hajar-Emami

I was able to install rep_removal and run it, but I cannot see the output parquet file under tmp/files-repremoval:
https://github.com/IBM/data-prep-kit/blob/712dfa7ce29754d9ad727e8ae2a94090bdf7525c/examples/notebooks/GneissWeb/GneissWeb.ipynb

@shahrokhDaijavad
Member Author

@Hajar-Emami For now, please do these two steps from the command line before starting Jupyter and not inside your notebook:

pip install "data-prep-toolkit-transforms[rep_removal]==1.0.1.dev1"
PACKAGE_LOCATION=$(pip show data_prep_toolkit_transforms | grep Location | awk '{print $2}')
cargo install --path $PACKAGE_LOCATION/dpk_rep_removal/rust

@Hajar-Emami

Thanks @shahrokhDaijavad. I did what you mentioned, but I still cannot find the output parquet file; I can only see metadata.json.

https://github.com/Hajar-Emami/data-prep-kit/blob/rep_removal/examples/notebooks/GneissWeb/GneissWeb.ipynb

@shahrokhDaijavad
Member Author

@Hajar-Emami
I looked at your notebook, and I see that you are downloading the wrong test1.parquet file as input. What I showed above was just an example of how to download from the repo, and it used the extreme_tokenized test file.
For rep_removal, please use:
urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/universal/rep_removal/test-data/input/test1.parquet", "tmp/input/test1.parquet")
and comment out the line that downloads the arrow file (it is not needed here).

After this change, you should be able to see the output parquet file in the tmp/files-repremoval directory.

@Hajar-Emami

Hajar-Emami commented Feb 6, 2025

Thanks @shahrokhDaijavad. It works: https://github.com/Hajar-Emami/data-prep-kit/blob/rep_removal/examples/notebooks/GneissWeb/GneissWeb.ipynb

However, we do not have its corresponding arrow file to annotate with the extreme_tokenized transform.

@Hajar-Emami

Hajar-Emami commented Feb 6, 2025

I am getting the following error when trying to use DCLM fasttext to annotate the data: https://github.com/IBM/data-prep-kit/blob/f5bb38e38f3b714e8b417b1228e9d1868f1f9910/examples/notebooks/GneissWeb/GneissWeb.ipynb

    raise Exception(f"column to store label ({self.output_label_column_name}) already exist")
Exception: column to store label (lang) already exist

@Hajar-Emami

@shahrokhDaijavad @touma-I I noticed that the Classification transform creates columns with the same name for different models, which is the reason for the above error.
This transform should have two parameters so the user can specify the column name for each model, similar to the attached file.

[Image attachment]
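To illustrate the proposal, here is a toy stand-in for the annotation step (the annotate function and column names are mine, not the transform's API): two models writing to the same label column reproduce the error above, while per-model column names avoid it:

```python
# Minimal stand-in for the annotation step; mirrors the exception in the thread.
def annotate(table, labels, label_column):
    if label_column in table:
        raise Exception(f"column to store label ({label_column}) already exist")
    table[label_column] = labels
    return table

table = {"text": ["doc one", "doc two"]}
annotate(table, ["hq", "lq"], "dclm_fasttext_label")
# A second model with a distinct, user-chosen column name does not collide:
annotate(table, ["en", "en"], "quality_fasttext_label")
print(sorted(table))  # ['dclm_fasttext_label', 'quality_fasttext_label', 'text']
```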

@Hajar-Emami

Hajar-Emami commented Feb 7, 2025

I've created this issue for it: #1024

Closed

@Hajar-Emami

I am getting the following error for the DCLM fasttext: https://github.com/Hajar-Emami/data-prep-kit/blob/rep_removal/examples/notebooks/GneissWeb/GneissWeb.ipynb

raise Exception(f"column to store label ({self.output_label_column_name}) already exist")
Exception: column to store label (dclm_fasttext_label) already exist

WARNING:data_processing.runtime.transform_file_processor:Exception processing file /Users/hajaremami/Desktop/DPK_notebook/data-prep-kit/examples/notebooks/GneissWeb/tmp/fasttext/quality/test1.parquet: Traceback (most recent call last):
  File "/Users/hajaremami/Desktop/DPK_notebook/data-prep-kit/examples/notebooks/GneissWeb/venv/lib/python3.10/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
    out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
  File "/Users/hajaremami/Desktop/DPK_notebook/data-prep-kit/examples/notebooks/GneissWeb/venv/lib/python3.10/site-packages/data_processing/transform/table_transform.py", line 59, in transform_binary
    out_tables, stats = self.transform(table=table, file_name=file_name)
    out_tables, stats = self.transform(table=table, file_name=file_name)
  File "/Users/hajaremami/Desktop/DPK_notebook/data-prep-kit/examples/notebooks/GneissWeb/venv/lib/python3.10/site-packages/dpk_gneissweb_classification/transform.py", line 95, in transform
    raise Exception(f"column to store label ({self.output_label_column_name}) already exist")
Exception: column to store label (dclm_fasttext_label) already exist

@Hajar-Emami

I am getting the following error for the readability transform. Should the textstat package be installed as part of the readability transform?

ModuleNotFoundError: No module named 'textstat'
