
Develop a notebook that creates a pipeline (recipe) for running new GneissWeb transforms in sequence on some data of your choosing. #983

Open
1 of 2 tasks
shahrokhDaijavad opened this issue Jan 27, 2025 · 25 comments

@shahrokhDaijavad
Member

shahrokhDaijavad commented Jan 27, 2025

Search before asking

  • I searched the issues and found no similar issues.

Component

Other

Feature

We now have the 4 new transforms that were used in creating GneissWeb as 3 PRs in the repo (one PR implements 2 of these transforms). These PRs are currently being tested, but they are all in reasonable shape, and each has a notebook example that will be helpful in creating ONE notebook that runs the transforms sequentially (plus existing transforms such as Filter). The sequence is:
Read Data -> Repetition Removal -> Readability/FastText/DCLM etc. Annotation -> Extreme Token Annotation -> Filter
The 3 PRs are:
#953 (rep removal), #965 (extreme tokenization and readability), and #974 (fasttext classification).
The relevant notebooks are all in the main transform directory, for example: https://github.com/swith005/data-prep-kit-outer/blob/rep_removal/transforms/universal/rep_removal/rep_removal.ipynb for rep removal.
@touma-I @yousafshah @Swanand-Kadhe @Hajar-Emami

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@shahrokhDaijavad
Member Author

@Swanand-Kadhe @Hajar-Emami #965 has now been merged into the dev branch.

@shahrokhDaijavad
Member Author

@BishwaBhatta The GneissWeb fasttext classifier transform has been tested only with the facebook/fasttext-language-identification model.bin.
Whenever the models that were used in creating the GneissWeb fasttext classification are uploaded to HF or a similar public place, please let us know.
Also, I assume we need to download the small number of input parquet/arrow files that @Swanand-Kadhe and @Hajar-Emami need for their notebook from a public place, correct?

@shahrokhDaijavad
Member Author

@Swanand-Kadhe @Hajar-Emami An update: #974 has also been merged now, but we found a last-minute bug with #953 that Shalisha should be able to fix tomorrow morning. After that is merged too, Maroun will make the pypi release that has all 4 transforms and you can pip install in your notebook.

A note about downloading parquet files to your notebook from HF: #1000
These few lines of Python could be useful for downloading real files, not just toy files.
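As a hedged sketch of such a download step (the helper name here is mine; the repo id and file path follow the FineWeb example quoted later in this thread, and `huggingface_hub` must be installed):

```python
import os

def fetch_fineweb_file(filename: str, local_dir: str = "tmp/input") -> str:
    """Download one real parquet file from the FineWeb dataset repo on the HF Hub.

    Assumes `huggingface_hub` is installed (pip install huggingface_hub).
    Returns the local path of the downloaded file.
    """
    from huggingface_hub import hf_hub_download
    os.makedirs(local_dir, exist_ok=True)
    return hf_hub_download(
        repo_id="HuggingFaceFW/fineweb",
        filename=filename,
        repo_type="dataset",
        local_dir=local_dir,
    )

# Example (this fetches a large real file, so run it deliberately):
# path = fetch_fineweb_file("data/CC-MAIN-2013-20/000_00000.parquet")
```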

@shahrokhDaijavad
Member Author

Hi, everyone. Just a quick update: all 4 new transforms have been tested and merged now, and Maroun is preparing the PyPI release that you will pip install at the top of your notebook. It should be ready anytime. In the meantime, note that the rep_removal transform is a little more complex than the others (because of the Google Rust code). Please read https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/rep_removal/README.md (especially the "Running on the M1 Mac" section) and let me know if it is clear. Basically, you have to install Rust on the machine, and then in your notebook, after pip installing the latest data-prep-toolkit-transforms[ray,all], add the lines:

PACKAGE_LOCATION=$(pip show data_prep_toolkit_transforms | grep Location | awk '{print $2}')
cargo install --path $PACKAGE_LOCATION/dpk_rep_removal/rust

to compile the dedup_dataset binary that the transform needs.
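As a quick sanity check (assuming `cargo install` placed its binaries in ~/.cargo/bin and that directory is on your PATH), you can verify from the notebook that the binary was built:

```python
import shutil

def find_dedup_dataset():
    """Return the full path to the dedup_dataset binary if it is on PATH, else None."""
    return shutil.which("dedup_dataset")

print("dedup_dataset binary:", find_dedup_dataset() or "NOT FOUND - check cargo install")
```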

@shahrokhDaijavad
Member Author

@Hajar-Emami I suggest that you add a step 0 to your current notebook here: https://github.com/Hajar-Emami/data-prep-kit/blob/dev/examples/notebooks/GneissWeb/GneissWeb.ipynb, which will download a parquet file from HF (use the HF download API discussed in #1000), and then submit a PR in "draft" mode to the DPK repo, so Maroun and I can track your progress and help you as you add APIs for all the other transforms you need for your recipe.

A summary of all the steps you need, run from the directory where your notebook is:

python -m venv venv
source venv/bin/activate
pip install "data-prep-toolkit-transforms[all]==1.0.1.dev1"
PACKAGE_LOCATION=$(pip show data_prep_toolkit_transforms | grep Location | awk '{print $2}')
cargo install --path $PACKAGE_LOCATION/dpk_rep_removal/rust
pip install jupyterlab
jupyter lab 

@Hajar-Emami

@shahrokhDaijavad
I am getting the following error : #1008


ModuleNotFoundError Traceback (most recent call last)
Cell In[1], line 1
----> 1 from huggingface_hub import hf_hub_download
2 import pandas as pd
4 REPO_ID = "HuggingFaceFW/fineweb"

ModuleNotFoundError: No module named 'huggingface_hub'

@shahrokhDaijavad
Member Author

shahrokhDaijavad commented Feb 3, 2025

@Hajar-Emami !pip install --upgrade huggingface_hub will fix the error above.

As we discussed, if you want to keep testing with the "toy" data, in your first step of the notebook, you can download the parquet and arrow files from the repo itself:

import urllib.request
import shutil
shutil.os.makedirs("tmp/input", exist_ok=True)
urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/language/extreme_tokenized/test-data/input/test1.parquet", "tmp/input/test1.parquet")
urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/language/extreme_tokenized/test-data/input/arrow/test1.arrow", "tmp/input/test1.arrow")

@shahrokhDaijavad
Member Author

Based on the above, here are the steps for running rep_removal next in the notebook:

from dpk_rep_removal.runtime import RepRemoval

RepRemoval(input_folder="tmp/input",
           output_folder="tmp/files-repremoval",
           rep_removal_contents_column_name="text",
           rep_removal_num_threads="1",
           ).transform()

For the next transform, tmp/files-repremoval becomes the input_folder, and we define a new folder for the output of that transform, and so on.
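The folder-chaining pattern can be sketched with a small helper (a hedged illustration only; the stage and folder names below are placeholders, not the exact transform APIs):

```python
# Each stage reads the previous stage's output folder; only the first stage
# reads the original input. Stage/folder names here are illustrative.
STAGES = [
    ("rep_removal", "tmp/files-repremoval"),
    ("annotation", "tmp/files-annotation"),
    ("extreme_tokenized", "tmp/files-extreme-tokenized"),
    ("filter", "tmp/files-filter"),
]

def folder_pairs(initial_input, stages):
    """Yield (input_folder, output_folder) for each pipeline stage."""
    current = initial_input
    for _name, out_folder in stages:
        yield current, out_folder
        current = out_folder

pairs = list(folder_pairs("tmp/input", STAGES))
# pairs[0] == ("tmp/input", "tmp/files-repremoval")
# pairs[1] == ("tmp/files-repremoval", "tmp/files-annotation")
```

Each (input_folder, output_folder) pair would then be passed to the corresponding transform's constructor before calling .transform().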

@Hajar-Emami

It seems we also need to install pandas:

/Users/hajaremami/Desktop/DPK_notebook/data-prep-kit/examples/notebooks/GneissWeb/venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[4], line 2
      1 from huggingface_hub import hf_hub_download
----> 2 import pandas as pd
      4 REPO_ID = "HuggingFaceFW/fineweb"
      5 FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"

ModuleNotFoundError: No module named 'pandas'

@Hajar-Emami

Hajar-Emami commented Feb 4, 2025

Many thanks, @shahrokhDaijavad, for the commands. After creating the venv and running the commands below:

pip install "data-prep-toolkit-transforms==1.0.1.dev1"
PACKAGE_LOCATION=$(pip show data_prep_toolkit_transforms | grep Location | awk '{print $2}')
cargo install --path $PACKAGE_LOCATION/dpk_rep_removal/rust
pip install jupyterlab
jupyter lab 

I am getting the following error:
#1010

ModuleNotFoundError                       Traceback (most recent call last)
Cell In[3], line 1
----> 1 from dpk_rep_removal.runtime import RepRemoval

ModuleNotFoundError: No module named 'dpk_rep_removal'

@shahrokhDaijavad
Member Author

Please see my comment that I added in PR #1010

@Hajar-Emami

Hajar-Emami commented Feb 4, 2025

Many thanks @shahrokhDaijavad and @touma-I. I'm getting the below error when trying to run pip install "data-prep-toolkit-transforms[all]==1.0.1.dev1":

      In file included from python/fasttext_module/fasttext/pybind/fasttext_pybind.cc:9:
      src/args.h:11:10: fatal error: 'istream' file not found
         11 | #include <istream>
            |          ^~~~~~~~~
      1 error generated.
      error: command '/usr/bin/clang++' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for fasttext
Failed to build fasttext
ERROR: Failed to build installable wheels for some pyproject.toml based projects (fasttext)

@shahrokhDaijavad
Member Author

@Hajar-Emami It looks like you don't have clang++ (the C/C++ compiler) installed on your machine, and the fasttext package needs it.
I think the easiest way to get clang++ is to install the Xcode CLI tools from here: https://mac.install.guide/commandlinetools/
@touma-I Any other suggestion?

@Hajar-Emami

Hajar-Emami commented Feb 5, 2025

@shahrokhDaijavad It seems like it’s already installed on my Mac:

wecm-9-67-124-36:~ hajaremami$ xcode-select -p
/Library/Developer/CommandLineTools

But I'm getting the below error when trying to run pip install "data-prep-toolkit-transforms[all]==1.0.1.dev1"

      building 'fasttext_pybind' extension
      creating build/temp.macosx-14-arm64-cpython-310/python/fasttext_module/fasttext/pybind
      creating build/temp.macosx-14-arm64-cpython-310/src
      clang++ -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk -I/private/var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/pip-build-env-raf8rn8a/overlay/lib/python3.10/site-packages/pybind11/include -I/private/var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/pip-build-env-raf8rn8a/overlay/lib/python3.10/site-packages/pybind11/include -Isrc -I/Users/hajaremami/Desktop/DPK_notebook/data-prep-kit/examples/notebooks/GneissWeb/venv/include -I/opt/homebrew/opt/[email protected]/Frameworks/Python.framework/Versions/3.10/include/python3.10 -c python/fasttext_module/fasttext/pybind/fasttext_pybind.cc -o build/temp.macosx-14-arm64-cpython-310/python/fasttext_module/fasttext/pybind/fasttext_pybind.o -stdlib=libc++ -DVERSION_INFO=\"0.9.3\" -std=c++17 -fvisibility=hidden
      In file included from python/fasttext_module/fasttext/pybind/fasttext_pybind.cc:9:
      src/args.h:11:10: fatal error: 'istream' file not found
         11 | #include <istream>
            |          ^~~~~~~~~
      1 error generated.
      error: command '/usr/bin/clang++' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for fasttext
Failed to build fasttext
ERROR: Could not build wheels for fasttext, which is required to install pyproject.toml-based projects

@Hajar-Emami

Building wheels for collected packages: fasttext
  Building wheel for fasttext (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for fasttext (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [43 lines of output]
      /private/var/folders/f3/5zmfvg4j539bhmnsxzqmbc2h0000gn/T/pip-build-env-afivwa8r/overlay/lib/python3.10/site-packages/setuptools/dist.py:493: SetuptoolsDeprecationWarning: Invalid dash-separated options
      !!
      
              ********************************************************************************
              Usage of dash-separated 'description-file' will not be supported in future
              versions. Please use the underscore name 'description_file' instead.
      
              By 2025-Mar-03, you need to update your project and remove deprecated calls
              or your builds will no longer be supported.
      
              See https://setuptools.pypa.io/en/latest/userguide/declarative_config.html for details.
              ********************************************************************************
      
      !!
        opt = self.warn_dash_deprecation(opt, section)
      running bdist_wheel
      running build
      running build_py

@Hajar-Emami

I was able to install rep_removal and run it, but I cannot see the output parquet file under tmp/files-repremoval:
https://github.com/IBM/data-prep-kit/blob/712dfa7ce29754d9ad727e8ae2a94090bdf7525c/examples/notebooks/GneissWeb/GneissWeb.ipynb

@shahrokhDaijavad
Member Author

@Hajar-Emami For now, please do these two steps from the command line before starting Jupyter and not inside your notebook:

pip install "data-prep-toolkit-transforms[rep_removal]==1.0.1.dev1"
PACKAGE_LOCATION=$(pip show data_prep_toolkit_transforms | grep Location | awk '{print $2}')
cargo install --path $PACKAGE_LOCATION/dpk_rep_removal/rust

@Hajar-Emami

Thanks @shahrokhDaijavad. I did what you mentioned, but I still cannot find the output parquet file; I can only see metadata.json.

https://github.com/Hajar-Emami/data-prep-kit/blob/rep_removal/examples/notebooks/GneissWeb/GneissWeb.ipynb

@shahrokhDaijavad
Member Author

@Hajar-Emami
I looked at your notebook, and I see that you are downloading the wrong test1.parquet file as input. What I showed above was just an example of how to download from the repo, and it used the extreme_tokenized test file.
For rep_removal, please use:
urllib.request.urlretrieve("https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/universal/rep_removal/test-data/input/test1.parquet", "tmp/input/test1.parquet")
and comment out the line that downloads the arrow file (it is not needed here).

After this change, you should be able to see the output parquet file in the tmp/files-repremoval directory.

@Hajar-Emami

Hajar-Emami commented Feb 6, 2025

Thanks @shahrokhDaijavad. It works: https://github.com/Hajar-Emami/data-prep-kit/blob/rep_removal/examples/notebooks/GneissWeb/GneissWeb.ipynb

However, we do not have its corresponding arrow file to annotate with the extreme_tokenized transform.

@Hajar-Emami

Hajar-Emami commented Feb 6, 2025

I am getting the following error when trying to use DCLM fasttext to annotate the data: https://github.com/IBM/data-prep-kit/blob/f5bb38e38f3b714e8b417b1228e9d1868f1f9910/examples/notebooks/GneissWeb/GneissWeb.ipynb

    raise Exception(f"column to store label ({self.output_label_column_name}) already exist")
Exception: column to store label (lang) already exist

@Hajar-Emami

@shahrokhDaijavad @touma-I I noticed that the Classification transform creates columns with the same name for different models, which is the reason for the above error.
This transform should have two parameters so the user can specify the column name for each model, similar to the attached file.

[Image attachment]
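To illustrate the proposal, here is a toy stand-in for the annotation step (the annotate function and column names are mine, not the transform's API): two models writing to the same label column reproduce the error above, while per-model column names avoid it:

```python
# Minimal stand-in for the annotation step; mirrors the exception in the thread.
def annotate(table, labels, label_column):
    if label_column in table:
        raise Exception(f"column to store label ({label_column}) already exist")
    table[label_column] = labels
    return table

table = {"text": ["doc one", "doc two"]}
annotate(table, ["hq", "lq"], "dclm_fasttext_label")
# A second model with a distinct, user-chosen column name does not collide:
annotate(table, ["en", "en"], "quality_fasttext_label")
print(sorted(table))  # ['dclm_fasttext_label', 'quality_fasttext_label', 'text']
```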

@Hajar-Emami

Hajar-Emami commented Feb 7, 2025

I've created this issue for it: #1024

Closed

@Hajar-Emami

I am getting the following error for the DCLM fasttext: https://github.com/Hajar-Emami/data-prep-kit/blob/rep_removal/examples/notebooks/GneissWeb/GneissWeb.ipynb

raise Exception(f"column to store label ({self.output_label_column_name}) already exist")
Exception: column to store label (dclm_fasttext_label) already exist

WARNING:data_processing.runtime.transform_file_processor:Exception processing file /Users/hajaremami/Desktop/DPK_notebook/data-prep-kit/examples/notebooks/GneissWeb/tmp/fasttext/quality/test1.parquet: Traceback (most recent call last):
  File "/Users/hajaremami/Desktop/DPK_notebook/data-prep-kit/examples/notebooks/GneissWeb/venv/lib/python3.10/site-packages/data_processing/runtime/transform_file_processor.py", line 79, in process_file
    out_files, stats = self.transform.transform_binary(file_name=f_name, byte_array=filedata)
  File "/Users/hajaremami/Desktop/DPK_notebook/data-prep-kit/examples/notebooks/GneissWeb/venv/lib/python3.10/site-packages/data_processing/transform/table_transform.py", line 59, in transform_binary
    out_tables, stats = self.transform(table=table, file_name=file_name)
    out_tables, stats = self.transform(table=table, file_name=file_name)
  File "/Users/hajaremami/Desktop/DPK_notebook/data-prep-kit/examples/notebooks/GneissWeb/venv/lib/python3.10/site-packages/dpk_gneissweb_classification/transform.py", line 95, in transform
    raise Exception(f"column to store label ({self.output_label_column_name}) already exist")
Exception: column to store label (dclm_fasttext_label) already exist

@Hajar-Emami

I am getting the following error for the readability transform. Should the textstat package be installed as part of the readability transform?

ModuleNotFoundError: No module named 'textstat'
