Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rep removal #953

Merged
merged 27 commits into from
Jan 31, 2025
Merged

Rep removal #953

merged 27 commits into from
Jan 31, 2025

Conversation

swith005
Copy link
Contributor

Why are these changes needed?

This PR adds a transform for repetition removal in documents.

@touma-I touma-I requested a review from cmadam January 21, 2025 16:34
@swith005 swith005 marked this pull request as ready for review January 21, 2025 22:45
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
Copy link
Collaborator

@touma-I touma-I left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial review. Will spend more time tomorrow...

transforms/universal/rep_removal/Makefile Outdated Show resolved Hide resolved
@@ -0,0 +1,109 @@
# Repetition Removal Transform
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its it possible to add a diagram that shows how you are implementing a wrapper around the existing code ? Also, to confirm the cargo.lock, cargo.toml,rust/main.rs and rust/table.rs are "copy as is" from the original git repo. right ? What if that code changes and someone wants to upgrade the transform to work with newer code? Is that something we want to allow ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please also add a test file for the pure python invocation of the program? (test_rep_removal_python.py). Its content would be almost identical to the test_rep_removal_ray.py, except that it would use the python transform launcher and config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@roytman roytman requested a review from revit13 January 23, 2025 06:45
transforms/pyproject.toml Outdated Show resolved Hide resolved
Copy link
Member

@shahrokhDaijavad shahrokhDaijavad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @swith005. Great job!

@touma-I touma-I self-requested a review January 30, 2025 00:21
@shahrokhDaijavad
Copy link
Member

Hi, @swith005 Looks like you have made a change in the directory for rust that results in an error when I run the notebook.
In the README, it says:

Compile the dedup_dataset binary from the dpk_rep_removal package dir:

cd dpk_rep_removal
cargo install 

it should be cd dpk_rep_removal/rust

Then, when I run the notebook, I get the error:
ERROR:root:Error occurred: /bin/sh: /Users/shahrokhdaijavad/Documents/data-prep-kit-outer/transforms/universal/rep_removal/dpk_rep_removal/target/release/dedup_dataset: cannot execute binary file

Because it is trying to find datasets in dpk_rep_removal/target/release/dedup_dataset, but the datasets are in:
dpk_rep_removal/rust/target/release/dedup_dataset

Please look into this. Thanks.

@shahrokhDaijavad shahrokhDaijavad self-requested a review January 30, 2025 15:02
#"*" = ["*.txt"]
dpk_rep_removal = ["universal/rep_removal/dpk_rep_removal/rust", "universal/rep_removal/dpk_rep_removal/gpt2"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unable to produce a proper wheel using line 160. I had to use the lines below. Can you please confirm that this also works for you:

[tool.setuptools.package-data]
"dpk_rep_removal.rust" = ["**"] 
"dpk_rep_removal.gpt2" = ["**"] 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@touma-I I don't understand what you mean "build a proper wheeling using line 160". i thought the command was make build-pkg-dist which should include the package data mentioned above for rep_removal.

can you elaborate how you're building the wheel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK tested this works :)


B) Compile the dedup_dataset binary from the **dpk_rep_removal** package dir:
```shell
cd dpk_rep_removal/rust
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if the user is installing the package via pip install ? would they need to navigate to the folder in venv/lib/python3.11/site-packages/dpk_rep_removal/rust ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@touma-I if installing the package via pip install, they would still need to build the pkg first with make build-pkg-dist correct? in that case, they should compile the dedup_dataset binary as instructed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@swith005 No. We only run build-pkg-dist once before we push the image to pypi.Once the image is in pypi, users will do pip install data-prep-kit-transforms[all]. At this point, their packages are installed in their environment. So the cd dpk_rep_removal/rust will not work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated README to reflect install from git repo or pip install

@shahrokhDaijavad
Copy link
Member

@swith005 Please see @touma-I 's comments above.

@touma-I touma-I requested review from touma-I, revit13 and cmadam January 31, 2025 18:50
Copy link
Collaborator

@touma-I touma-I left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@swith005 Per our discussion earlier, you need to create a new PR to further simplify how users interact with this transform. Ideally, you will run the cargo command within your own code during initiailization. The only requirements on the users would be to install rust. cc:@shahrokh

@touma-I touma-I merged commit f9a3cc4 into IBM:dev Jan 31, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants