Rep removal #953
Merged
27 commits
91f7b7d
added initial base for rep_removal
30bb7f4
tested python and ray runs and changed imports to avoid pythonpath
be776de
added target folder for dedup dataset
a5f58c4
added folder for dedup_dataset
ef775d7
removed cli and config param keys
a1954bb
added/tested dockerfiles and moded rust script to locate cargo.toml i…
bf4c2ff
added initial unittest for ray and changed contents col default
466e5e6
updated README and base notebook
03ad897
updated readme and removed minimum version in requirements
1b23a7b
enable workflow for new transform
touma-I 88c62de
updated Makefile template and typo in readme
925fe5f
Merge branch 'rep_removal' of https://github.com/swith005/data-prep-k…
a2044fc
tested with m1 mac and updated readme and notebook for requirements
0c7096b
Fixed a few typos in the README
shahrokhDaijavad 77aad8d
moved all rust content into rust folder
91b4e1d
updated path in Cargo.toml
db1df53
moved exception cases in calling scripts
ade6286
changed boolean values to string to work with class
73856ba
added pytest and handling no duplicates
c9fb841
updated .toml file to include dpk_rep_removal package data
57eae1b
removed unnecessary line in readme
9e60ae3
Merge branch 'dev' into rep_removal
swith005 9a9bf26
updated path to install cargo in repo
f7bda88
updated paths for new rust scripts
64ff4e8
updated .toml to remove old target/release path
63f83ef
added absolute path to python pytest
8c9fcdd
added cargo install locations for git repo and pip install
@@ -0,0 +1,133 @@
#
# DO NOT EDIT THIS FILE: it is generated from test-transform.template, Edit there and run make to change these files
#
name: Test - transforms/universal/rep_removal

on:
    workflow_dispatch:
    push:
        branches:
            - "dev"
            - "releases/**"
        tags:
            - "*"
        paths:
            - ".make.*"
            - "transforms/.make.transforms"
            - "transforms/universal/rep_removal/**"
            - "data-processing-lib/**"
            - "!transforms/universal/rep_removal/**/kfp_ray/**" # This is/will be tested in separate workflow
            - "!data-processing-lib/**/test/**"
            - "!data-processing-lib/**/test-data/**"
            - "!**.md"
            - "!**/doc/**"
            - "!**/images/**"
            - "!**.gitignore"
    pull_request:
        branches:
            - "dev"
            - "releases/**"
        paths:
            - ".make.*"
            - "transforms/.make.transforms"
            - "transforms/universal/rep_removal/**"
            - "data-processing-lib/**"
            - "!transforms/universal/rep_removal/**/kfp_ray/**" # This is/will be tested in separate workflow
            - "!data-processing-lib/**/test/**"
            - "!data-processing-lib/**/test-data/**"
            - "!**.md"
            - "!**/doc/**"
            - "!**/images/**"
            - "!**.gitignore"

# Taken from https://stackoverflow.com/questions/66335225/how-to-cancel-previous-runs-in-the-pr-when-you-push-new-commitsupdate-the-curre
concurrency:
    group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
    cancel-in-progress: true

jobs:
    check_if_push_image:
        # check whether the Docker images should be pushed to the remote repository
        # The images are pushed if it is a merge to dev branch or a new tag is created.
        # The latter being part of the release process.
        # The images tag is derived from the value of the DOCKER_IMAGE_VERSION variable set in the .make.versions file.
        runs-on: ubuntu-22.04
        outputs:
            publish_images: ${{ steps.version.outputs.publish_images }}
        steps:
            - id: version
              run: |
                  publish_images='false'
                  if [[ ${GITHUB_REF} == refs/heads/dev && ${GITHUB_EVENT_NAME} != 'pull_request' && ${GITHUB_REPOSITORY} == IBM/data-prep-kit ]] ;
                  then
                      publish_images='true'
                  fi
                  if [[ ${GITHUB_REF} == refs/tags/* && ${GITHUB_REPOSITORY} == IBM/data-prep-kit ]] ;
                  then
                      publish_images='true'
                  fi
                  echo "publish_images=$publish_images" >> "$GITHUB_OUTPUT"
    test-src:
        runs-on: ubuntu-22.04
        steps:
            - name: Checkout
              uses: actions/checkout@v4
            - name: Free up space in github runner
              # Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
              run: |
                  df -h
                  sudo rm -rf "/usr/local/share/boost"
                  sudo rm -rf "$AGENT_TOOLSDIRECTORY"
                  sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/local/.ghcup
                  sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
                  df -h
            - name: Test transform source in transforms/universal/rep_removal
              run: |
                  if [ -e "transforms/universal/rep_removal/Makefile" ]; then
                      make -C transforms/universal/rep_removal DOCKER=docker test-src
                  else
                      echo "transforms/universal/rep_removal/Makefile not found - source testing disabled for this transform."
                  fi
    test-image:
        needs: [check_if_push_image]
        runs-on: ubuntu-22.04
        timeout-minutes: 120
        env:
            DOCKER_REGISTRY_USER: ${{ secrets.DOCKER_REGISTRY_USER }}
            DOCKER_REGISTRY_KEY: ${{ secrets.DOCKER_REGISTRY_KEY }}
        steps:
            - name: Checkout
              uses: actions/checkout@v4
            - name: Free up space in github runner
              # Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
              run: |
                  df -h
                  sudo rm -rf /opt/ghc
                  sudo rm -rf "/usr/local/share/boost"
                  sudo rm -rf "$AGENT_TOOLSDIRECTORY"
                  sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/lib/jvm /usr/local/.ghcup
                  sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
                  df -h
            - name: Test transform image in transforms/universal/rep_removal
              run: |
                  if [ -e "transforms/universal/rep_removal/Makefile" ]; then
                      if [ -d "transforms/universal/rep_removal/spark" ]; then
                          make -C data-processing-lib/spark DOCKER=docker image
                      fi
                      make -C transforms/universal/rep_removal DOCKER=docker test-image
                  else
                      echo "transforms/universal/rep_removal/Makefile not found - testing disabled for this transform."
                  fi
            - name: Print space
              # Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
              run: |
                  df -h
                  docker images
            - name: Publish images
              if: needs.check_if_push_image.outputs.publish_images == 'true'
              run: |
                  if [ -e "transforms/universal/rep_removal/Makefile" ]; then
                      make -C transforms/universal/rep_removal publish
                  else
                      echo "transforms/universal/rep_removal/Makefile not found - publishing disabled for this transform."
                  fi
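The publish decision made by the `check_if_push_image` job can be exercised locally. The following is a minimal sketch, not part of the PR: `should_publish` is a hypothetical helper name, the ref, event name, and repository are passed as arguments instead of being read from `GITHUB_REF`, `GITHUB_EVENT_NAME`, and `GITHUB_REPOSITORY`, and the tag `v1.0.0` and repository `someuser/data-prep-kit` are made-up sample values. It uses POSIX `case` in place of the workflow's bash `[[ ... ]]` tests, which should give the same result for these inputs.

```shell
# Hypothetical helper mirroring the workflow's publish decision.
# Arguments: <git ref> <event name> <owner/repo>
should_publish() {
  ref="$1"
  event="$2"
  repo="$3"
  publish_images='false'
  # Only the upstream repository ever publishes images.
  if [ "$repo" = "IBM/data-prep-kit" ]; then
    case "$ref" in
      refs/heads/dev)
        # A push (merge) to dev publishes; a pull_request event does not.
        if [ "$event" != "pull_request" ]; then
          publish_images='true'
        fi
        ;;
      refs/tags/*)
        # Any new tag publishes, as part of the release process.
        publish_images='true'
        ;;
    esac
  fi
  echo "$publish_images"
}

should_publish refs/heads/dev push IBM/data-prep-kit              # prints "true"
should_publish refs/pull/953/merge pull_request IBM/data-prep-kit # prints "false"
should_publish refs/tags/v1.0.0 push someuser/data-prep-kit       # prints "false"
```

Under these assumptions, a fork never publishes, even when it pushes a tag, because the repository check guards both branches of the decision.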
@@ -0,0 +1,39 @@
FROM docker.io/python:3.10.14-slim-bullseye

RUN pip install --upgrade --no-cache-dir pip
RUN apt update && apt install curl -y && apt install gcc -y

# install pytest
RUN pip install --no-cache-dir pytest

# Create a user and use it to run the transform
RUN useradd -ms /bin/bash dpk
USER dpk
WORKDIR /home/dpk
ENV HOME="/home/dpk"
ARG DPK_WHEEL_FILE_NAME
ARG TRANSFORM_NAME

# install rust and set path
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="$PATH:$HOME/.cargo/bin"

# Copy and install data processing libraries
# These are expected to be placed in the docker context before this is run (see the make image).
COPY --chown=dpk:users data-processing-dist data-processing-dist
RUN pip install data-processing-dist/${DPK_WHEEL_FILE_NAME}

# END OF STEPS destined for a data-prep-kit base image

COPY --chown=dpk:users dpk_${TRANSFORM_NAME}/ dpk_${TRANSFORM_NAME}/
COPY --chown=dpk:users requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Set environment
ENV PYTHONPATH="/home/dpk"

# Put these at the end since they seem to upset the docker cache.
ARG BUILD_DATE
ARG GIT_COMMIT
LABEL build-date=$BUILD_DATE
LABEL git-commit=$GIT_COMMIT
@@ -0,0 +1,40 @@
ARG BASE_IMAGE=docker.io/rayproject/ray:2.24.0-py310
FROM ${BASE_IMAGE}

# see https://docs.openshift.com/container-platform/4.17/openshift_images/create-images.html#use-uid_create-images
USER root
RUN chown ray:root /home/ray && chmod 775 /home/ray

RUN pip install --upgrade --no-cache-dir pip
RUN apt update && apt install curl -y && apt install gcc -y

USER ray

# install pytest
RUN pip install --no-cache-dir pytest
ARG DPK_WHEEL_FILE_NAME
ARG TRANSFORM_NAME

ENV HOME="/home/ray"
# install rust and set path
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="$PATH:$HOME/.cargo/bin"

# Copy and install data processing libraries
# These are expected to be placed in the docker context before this is run (see the make image).
COPY --chmod=775 --chown=ray:root data-processing-dist data-processing-dist
RUN pip install data-processing-dist/${DPK_WHEEL_FILE_NAME}[ray]


COPY --chmod=775 --chown=ray:root dpk_${TRANSFORM_NAME}/ dpk_${TRANSFORM_NAME}/
COPY --chmod=775 --chown=ray:root requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Set environment
ENV PYTHONPATH="/home/ray"

# Put these at the end since they seem to upset the docker cache.
ARG BUILD_DATE
ARG GIT_COMMIT
LABEL build-date=$BUILD_DATE
LABEL git-commit=$GIT_COMMIT
@@ -0,0 +1,24 @@
REPOROOT=../../..
# Use make help, to see the available rules
include $(REPOROOT)/transforms/.make.cicd.targets

#
# This is intended to be included across the Makefiles provided within
# a given transform's directory tree, so must use compatible syntax.
#
################################################################################
# This defines the name of the transform and is used to match against
# expected files and is used to define the transform's image name.
TRANSFORM_NAME=$(shell basename `pwd`)

################################################################################
TRANSFORM_PYTHON_SRC="-m dpk_$(TRANSFORM_NAME).runtime"
TRANSFORM_RAY_SRC="-m dpk_$(TRANSFORM_NAME).ray.runtime"

run-cli-sample:
	make venv
	source venv/bin/activate && \
	$(PYTHON) -m dpk_$(TRANSFORM_NAME).runtime \
		--data_local_config "{ 'input_folder' : 'test-data/input', 'output_folder' : 'output'}" \
		--rep_removal_contents_column_name 'text' \
		--rep_removal_num_threads '1'
I was unable to produce a proper wheel using line 160. I had to use the lines below. Can you please confirm that this also works for you:
@touma-I I don't understand what you mean by "produce a proper wheel using line 160". I thought the command was
make build-pkg-dist
which should include the package data mentioned above for rep_removal. Can you elaborate on how you're building the wheel?
OK, tested; this works :)