Commit
Add shell script to deduplicate in one go
This commit will make it easier to deduplicate datasets:
* Added JAR file to Java tokenizer for easier usage.
* Added deduplicate.py for duplicate removal after they have been found
* Added script to run all the tasks for the user in one place
Showing 8 changed files with 162 additions and 4 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
import json | ||
import os | ||
import random | ||
from argparse import ArgumentParser | ||
|
||
''' | ||
This script removes from a project the duplicate files found by the duplicate code detector. | ||
''' | ||
if __name__ == '__main__': | ||
    parser = ArgumentParser() | ||
    parser.add_argument("--project", dest="project_path", | ||
                        help="path to the project from which duplicates should be removed", required=True) | ||
    parser.add_argument("--duplicates_data", dest="duplicates_data_path", | ||
                        help="data from DuplicateCodeDetector", required=True) | ||
    args = parser.parse_args() | ||
|
||
    project_path = args.project_path | ||
    duplicates_data_path = args.duplicates_data_path | ||
|
||
    # Read the detector output from the path given by --duplicates_data | ||
    with open(duplicates_data_path) as f: | ||
        duplicates = json.load(f) | ||
|
||
    for duplicate_group in duplicates:  # type: list | ||
        # Keep one randomly chosen file from each duplicate group in the dataset | ||
        duplicate_group.remove(random.choice(duplicate_group)) | ||
        for duplicate_path in duplicate_group:  # type: str | ||
            os.remove(os.path.join(project_path, duplicate_path)) |
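A note on the removal step: the loop above keeps one randomly chosen file per duplicate group and deletes the rest. The sketch below separates choosing the files from deleting them, so the selection can be inspected without touching the disk. It assumes, as the script does, that the detector's JSON is a list of groups, each a list of paths relative to the project root. `plan_removals` and its `seed` parameter are hypothetical names for illustration and are not part of this commit.

```python
import random
from typing import Iterable, List, Optional


def plan_removals(duplicate_groups: Iterable[List[str]],
                  seed: Optional[int] = None) -> List[str]:
    """Return the paths to delete, keeping one random file per group.

    Assumption: each group is a list of paths relative to the project root.
    """
    rng = random.Random(seed)  # a fixed seed makes the choice reproducible
    to_remove: List[str] = []
    for group in duplicate_groups:
        if not group:
            continue  # an empty group has nothing to keep or remove
        rest = list(group)  # copy so the caller's data is not mutated
        rest.remove(rng.choice(rest))  # drop the one file that is kept
        to_remove.extend(rest)
    return to_remove


if __name__ == "__main__":
    groups = [["a/Foo.java", "b/Foo.java"], ["x/Bar.java", "y/Bar.java", "z/Bar.java"]]
    for path in plan_removals(groups, seed=0):
        print("would remove:", path)
```

Passing the result to `os.remove(os.path.join(project_path, path))` would reproduce the behaviour of the script, with the option of a dry run first.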
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,70 @@ | ||
#!/usr/bin/env bash | ||
############################################################################### | ||
# Bash script that takes a dataset, finds near-duplicate code and creates | ||
# a copy of the dataset without the near duplicates | ||
# | ||
# Usage: sh deduplicate.sh target/project/path output/folder/path | ||
# if target/project/path is not specified, script falls | ||
# back to DEFAULT_TARGET_PROJECT_PATH. Same for output. | ||
############################################################################### | ||
# Change the following values to preprocess a new dataset. | ||
# PATH_TO_TOKENIZER - Path to the tokenizer JAR | ||
# DEFAULT_TARGET_PROJECT_PATH - Path to target project if | ||
# not specified by parameters | ||
# TOKENIZER_OUTPUT_PATH - Output for the tokenizer | ||
# IDENTIFIER_ONLY - Boolean to specify if tokenizing only | ||
# identifiers or all possible text in code | ||
# | ||
# DUPLICATE_DETECTOR_PROJECT_PATH - Path to DuplicateCodeDetector project | ||
# DUPLICATE_DETECTOR_PATH - Path to the DuplicateCodeDetector c# entry file | ||
# | ||
# DEDUPLICATE_PROJECT_PATH - Path for the resulting deduplicated project | ||
# DEDUPLICATION_DATA - Path to temporarily save deduplication data in JSON form | ||
# | ||
# JAVA - java 1.8 alias | ||
# DOTNET - dotnet alias | ||
# PYTHON - python3 interpreter alias. | ||
############################################################################### | ||
# Changing DEFAULT_TARGET_PROJECT_PATH or specifying it | ||
# as a program argument is enough for most users. | ||
DEFAULT_TARGET_PROJECT_PATH="path/to/project/if/not/specified/in/parameters" | ||
# ${n:-value} expands to positional argument n, or to "value" if that argument is unset or empty | ||
TARGET_PROJECT_PATH=${1:-${DEFAULT_TARGET_PROJECT_PATH}} | ||
############################################################################### | ||
PATH_TO_TOKENIZER="tokenizers/java/target/javatokenizer-1.0-SNAPSHOT.jar" | ||
TOKENIZER_OUTPUT_PATH="output/" | ||
IDENTIFIER_ONLY="true" | ||
|
||
DUPLICATE_DETECTOR_PROJECT_PATH="DuplicateCodeDetector" | ||
DUPLICATE_DETECTOR_PATH="${DUPLICATE_DETECTOR_PROJECT_PATH}/DuplicateCodeDetector.csproj" | ||
|
||
DEFAULT_DEDUPLICATE_PROJECT_PATH="deduplicated_results" | ||
DEDUPLICATE_PROJECT_PATH=${2:-${DEFAULT_DEDUPLICATE_PROJECT_PATH}} | ||
DEDUPLICATION_DATA="${DUPLICATE_DETECTOR_PROJECT_PATH}/DuplicateCodeDetector.csproj.json" | ||
|
||
JAVA=java | ||
DOTNET=dotnet | ||
PYTHON=python | ||
|
||
rm -rf "${DEDUPLICATE_PROJECT_PATH}" | ||
|
||
echo "Running tokenizer..." | ||
${JAVA} -jar "${PATH_TO_TOKENIZER}" "${TARGET_PROJECT_PATH}" "${TOKENIZER_OUTPUT_PATH}" "${IDENTIFIER_ONLY}" | ||
echo "Tokenizer finished." | ||
|
||
echo "Running near duplicate code detection..." | ||
${DOTNET} run "${DUPLICATE_DETECTOR_PATH}" --project="${DUPLICATE_DETECTOR_PROJECT_PATH}" --dir="${TOKENIZER_OUTPUT_PATH}" | ||
echo "Near duplicate code detection finished." | ||
|
||
echo "Copying project to ${DEDUPLICATE_PROJECT_PATH}" | ||
cp -r "${TARGET_PROJECT_PATH}/." "${DEDUPLICATE_PROJECT_PATH}" | ||
echo "Copying finished" | ||
|
||
echo "Removing duplicates from the copy" | ||
${PYTHON} deduplicate.py --project "${DEDUPLICATE_PROJECT_PATH}" --duplicates_data "${DEDUPLICATION_DATA}" | ||
echo "Finished removing near duplicates" | ||
echo "Untouched project location: ${TARGET_PROJECT_PATH}" | ||
echo "Resulting project with duplicates removed: ${DEDUPLICATE_PROJECT_PATH}" | ||
|
||
# If all went well, tokenizer output is not needed anymore | ||
rm -r "${TOKENIZER_OUTPUT_PATH}" |
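For reviewers who want to see the four steps of the script at a glance, here is a rough Python sketch that only builds the command lines and never runs them. The paths, flags and defaults are copied from the script above. `build_pipeline` and the step names are hypothetical and exist only for this illustration.

```python
import sys
from typing import Dict, List, Sequence

# Defaults copied from deduplicate.sh
DEFAULT_TARGET = "path/to/project/if/not/specified/in/parameters"
DEFAULT_OUTPUT = "deduplicated_results"


def build_pipeline(argv: Sequence[str]) -> Dict[str, List[str]]:
    """Build the commands of each step as argument lists (hypothetical helper).

    Mirrors ${1:-default} and ${2:-default}: an argument is used only if it
    is present and non-empty, otherwise the default applies.
    """
    target = argv[0] if len(argv) > 0 and argv[0] else DEFAULT_TARGET
    output = argv[1] if len(argv) > 1 and argv[1] else DEFAULT_OUTPUT
    detector_dir = "DuplicateCodeDetector"
    tokens_dir = "output/"
    return {
        "tokenize": ["java", "-jar",
                     "tokenizers/java/target/javatokenizer-1.0-SNAPSHOT.jar",
                     target, tokens_dir, "true"],
        "detect": ["dotnet", "run",
                   f"{detector_dir}/DuplicateCodeDetector.csproj",
                   f"--project={detector_dir}", f"--dir={tokens_dir}"],
        "copy": ["cp", "-r", f"{target}/.", output],
        "remove": ["python", "deduplicate.py", "--project", output,
                   "--duplicates_data",
                   f"{detector_dir}/DuplicateCodeDetector.csproj.json"],
    }


if __name__ == "__main__":
    # Dry run: print each step and its command without executing anything
    for step, command in build_pipeline(sys.argv[1:]).items():
        print(step, command)
```

Keeping each command as a list of arguments means a project path containing spaces stays a single argument, which is the same concern that quoting addresses in the shell version.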
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
target |
Binary file not shown.