-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rep removal #953
Rep removal #953
Conversation
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partial review. Will spend more time tomorrow...
@@ -0,0 +1,109 @@ | |||
# Repetition Removal Transform |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its it possible to add a diagram that shows how you are implementing a wrapper around the existing code ? Also, to confirm the cargo.lock, cargo.toml,rust/main.rs and rust/table.rs are "copy as is" from the original git repo. right ? What if that code changes and someone wants to upgrade the transform to work with newer code? Is that something we want to allow ?
transforms/universal/rep_removal/dpk_rep_removal/dedup_Rust_scripts.py
Outdated
Show resolved
Hide resolved
transforms/universal/rep_removal/dpk_rep_removal/dedup_Rust_scripts.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please also add a test file for the pure python invocation of the program? (test_rep_removal_python.py
). Its content would be almost identical to the test_rep_removal_ray.py
, except that it would use the python transform launcher and config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @swith005. Great job!
Hi, @swith005 Looks like you have made a change in the directory for rust that results in an error when I run the notebook. Compile the dedup_dataset binary from the dpk_rep_removal package dir:
it should be Then, when I run the notebook, I get the error: Because it is trying to find datasets in dpk_rep_removal/target/release/dedup_dataset, but the datasets are in: Please look into this. Thanks. |
#"*" = ["*.txt"] | ||
dpk_rep_removal = ["universal/rep_removal/dpk_rep_removal/rust", "universal/rep_removal/dpk_rep_removal/gpt2"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was unable to produce a proper wheel using line 160. I had to use the lines below. Can you please confirm that this also works for you:
[tool.setuptools.package-data]
"dpk_rep_removal.rust" = ["**"]
"dpk_rep_removal.gpt2" = ["**"]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@touma-I I don't understand what you mean "build a proper wheeling using line 160". i thought the command was make build-pkg-dist
which should include the package data mentioned above for rep_removal.
can you elaborate how you're building the wheel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK tested this works :)
|
||
B) Compile the dedup_dataset binary from the **dpk_rep_removal** package dir: | ||
```shell | ||
cd dpk_rep_removal/rust |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if the user is installing the package via pip install ? would they need to navigate to the folder in venv/lib/python3.11/site-packages/dpk_rep_removal/rust ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@touma-I if installing the package via pip install, they would still need to build the pkg first with make build-pkg-dist
correct? in that case, they should compile the dedup_dataset binary as instructed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@swith005 No. We only run build-pkg-dist once before we push the image to pypi.Once the image is in pypi, users will do pip install data-prep-kit-transforms[all]. At this point, their packages are installed in their environment. So the cd dpk_rep_removal/rust will not work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated README to reflect install from git repo or pip install
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these changes needed?
This PR adds a transform for repetition removal in documents.