Add models and interfaces #33

Merged
merged 55 commits into from
Nov 26, 2024
42e0fb9
Add models and interfaces
OlivieFranklova Sep 18, 2024
b88e01d
Reformat with black
OlivieFranklova Sep 18, 2024
b271786
#23 add file system connector
OlivieFranklova Sep 23, 2024
832080a
#23 Add tests and improve filesystem connector
OlivieFranklova Sep 23, 2024
5ea367d
#26 add basic runner
OlivieFranklova Sep 30, 2024
fe6c08c
Fix clean up tokenization error
OlivieFranklova Oct 1, 2024
0b680f7
Switch warnings to logging in Comparators
OlivieFranklova Oct 2, 2024
6241574
#25 Add json formater
OlivieFranklova Oct 2, 2024
8526550
Add runner
OlivieFranklova Oct 2, 2024
6ea0607
Add functionsRunner test to workflow
OlivieFranklova Oct 2, 2024
d087b50
Format files with black
OlivieFranklova Oct 2, 2024
ed9f215
Add tests
OlivieFranklova Oct 2, 2024
b4469b7
omit some similarityRunner files
OlivieFranklova Oct 2, 2024
a933b80
Rename functionRunner to function_runner
OlivieFranklova Oct 3, 2024
8e79fb4
Test ipynb
OlivieFranklova Oct 3, 2024
3788269
Update workflow
OlivieFranklova Oct 3, 2024
bc1c95a
Add comments
OlivieFranklova Oct 3, 2024
2d49e80
Format with black
OlivieFranklova Oct 3, 2024
dfd22c7
Add and updates READMEs
OlivieFranklova Oct 4, 2024
628fa79
Start refactoring structure
OlivieFranklova Nov 16, 2024
8a1a054
Add function runner
OlivieFranklova Nov 16, 2024
453436d
Refactoring: before similarity folder delete
OlivieFranklova Nov 16, 2024
795b4de
Refactoring: after similarity folder delete
OlivieFranklova Nov 16, 2024
7a23ed1
More refactoring
OlivieFranklova Nov 16, 2024
c15b7e8
renamed runner
OlivieFranklova Nov 16, 2024
4c09f7c
Refactoring: working framework tests
OlivieFranklova Nov 17, 2024
9eaa88e
Refactoring: working column2vec tests
OlivieFranklova Nov 17, 2024
602d29d
Refactoring: WIP - similarity_runner rework
OlivieFranklova Nov 18, 2024
4b08f9d
Refactoring: similarity_runner rework
OlivieFranklova Nov 18, 2024
f49e07e
Update ui
OlivieFranklova Nov 20, 2024
7f0e2cb
Update ui
OlivieFranklova Nov 20, 2024
97dce23
Add CLI for connectors, without validation
OlivieFranklova Nov 21, 2024
f137ad4
Add CLI implementation using argsparse and pydantic_settings
OlivieFranklova Nov 22, 2024
aa4daa6
Fix pipeline
OlivieFranklova Nov 22, 2024
a7c4d85
Add analyses settings into handlers, fix some tests
OlivieFranklova Nov 23, 2024
813c89a
Create files for measurments
OlivieFranklova Nov 23, 2024
ec95f44
Fix of the tests
OlivieFranklova Nov 23, 2024
1fd8f79
Change order od calling metadata creators
OlivieFranklova Nov 23, 2024
aa71eb7
Add centralized logging solution - this needs to be fixed later
OlivieFranklova Nov 23, 2024
5988282
Few small updates
OlivieFranklova Nov 24, 2024
ed29f56
Add mesuremnets output
OlivieFranklova Nov 25, 2024
9987a60
Format with black
OlivieFranklova Nov 25, 2024
15db60e
Remove stremlit
OlivieFranklova Nov 25, 2024
933196b
Update requirements
OlivieFranklova Nov 25, 2024
d28c655
Update CI
OlivieFranklova Nov 25, 2024
aa5dc8b
Update requirements
OlivieFranklova Nov 25, 2024
6fce081
Update requirements
OlivieFranklova Nov 25, 2024
0ad9fea
Update tests
OlivieFranklova Nov 25, 2024
4a61724
comment test
OlivieFranklova Nov 25, 2024
c8b99de
update tests
OlivieFranklova Nov 26, 2024
ba41d70
update tests
OlivieFranklova Nov 26, 2024
2b9aa88
add mesurements res graphs
OlivieFranklova Nov 26, 2024
a6ef2cf
Remove reaserch from git
OlivieFranklova Nov 26, 2024
bdcf6e2
Remove measurements from git
OlivieFranklova Nov 26, 2024
e09c45d
add suggestions from PR
OlivieFranklova Nov 26, 2024
15 changes: 8 additions & 7 deletions .github/workflows/py_test.yml
@@ -21,6 +21,7 @@ jobs:

- name: Install dependencies
run: |
pip install -r requirements.txt
pip install pylint

- name: Analysing the code with pylint
@@ -50,7 +51,7 @@ jobs:

python-tests:
env:
TEST_FILES: test/test_types.py test/test_metadata.py test/test_comparator.py test/test_column2VecCache.py
TEST_FILES: tests/similarity_framework/test_similarity* tests/column2vec/test_column2vec_cache.py
name: Run Python Tests
runs-on: ubuntu-latest
steps:
@@ -66,26 +67,26 @@ jobs:
- name: Install dependencies
run: |
pip install -r requirements.txt
pip install coverage pytest

- name: Run tests
run: coverage run --source='similarity,column2Vec' -m pytest $TEST_FILES
run: |
coverage run -m pytest $TEST_FILES

- name: Show coverage
run: coverage report -m --omit=".*.ipynb"
run: coverage report -m --omit=".*.ipynb,similarity_runner/*"

- name: Create coverage file
if: github.event_name == 'pull_request'
run: coverage xml
run: coverage xml --omit=".*.ipynb,similarity_runner/*"

- name: Get Cover
if: github.event_name == 'pull_request'
uses: orgoro/coverage@v3.1
uses: orgoro/coverage@v3.2
with:
coverageFile: coverage.xml
token: ${{ secrets.GITHUB_TOKEN }}
thresholdAll: 0.7
thresholdNew: 0.9
thresholdNew: 0.7

- uses: actions/upload-artifact@v4
if: github.event_name == 'pull_request'
5 changes: 4 additions & 1 deletion .gitignore
@@ -122,7 +122,7 @@ celerybeat.pid
*.sage.py

# Environments
.env
.config
.venv
env/
venv/
@@ -165,3 +165,6 @@ cython_debug/
# Custom for this project
fingerprints/
**/.DS_Store

column2vec/research
measurement
2 changes: 1 addition & 1 deletion .pylintrc
@@ -68,7 +68,7 @@ ignored-modules=

# Python code to execute, usually for sys.path manipulation such as
# pygtk.require().
#init-hook=
init-hook='import sys; sys.path.append("./similarity"); sys.path.append("./similarityRunner")'

# Use multiple processes to speed up Pylint. Specifying 0 will auto-detect the
# number of processors available to use, and will cap the count on Windows to
26 changes: 18 additions & 8 deletions README.md
@@ -27,7 +27,7 @@ the main set (training) on which the program is
tuned, and a validation set for validating the results.

#### Definition of table similarity:
![img_1.png](images/similarity_def.png)
![img_1.png](docs/similarity_def.png)
>Parameter **important columns** is user input.
>
>Parameter **k** is also user input.
@@ -44,8 +44,8 @@ input for Comparator.
Comparator compares metadata and it computes distance.
We should test which one is better.

1. ![img_2.png](images/pipeline1.png)
2. ![img_3.png](images/pipeline2.png)
1. ![img_2.png](similarity_framework/docs/pipeline1.png)
2. ![img_3.png](similarity_framework/docs/pipeline2.png)
#### Metadata creator
MetadataCreator has:
- **constructor** that fills fields:
@@ -102,7 +102,7 @@ of these two tables.
### Column2Vec
Column2Vec is a module in which we implement word2Vec based functionality for columns.
It will compute embeddings for columns, so we can compare them.
More about this module can be found [here](column2Vec/README.md).
More about this module can be found [here](column2vec/README.md).
### Types and Kinds
We have decided to split columns by type. We can compute types or kinds for each column.
Types define the real type of column. Some you may know from programming languages (int, float, string)
@@ -118,22 +118,22 @@ Explaining some types:
- phrase: string with more than one word
- multiple: string that represents not atomic data or structured data
- article: string with more than one sentence
3. ![img.png](images/types.png)
3. ![img.png](docs/types.png)
Kind has only four "types" plus undefined. You can see all kinds in picture 4.
Explaining kinds:
- A column that contains only unique values is marked as **Id**
- A column that contains only two unique values is marked as **Bool**
- A column that contains only one unique value is marked as **Constant**
- A column that contains categories is marked as **Categorical**. The number of unique values is less than threshold % of the total number of rows. The threshold is different for small and big datasets.
4. ![img.png](images/kind.png)
4. ![img.png](docs/kind.png)
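
The kind rules listed above can be sketched in a few lines of plain Python. This is only an illustration of the described rules, not the project's implementation: the function name `guess_kind`, the threshold values, the row limit separating small from big datasets, and the order in which the rules are checked are all assumptions made for this example.

```python
from typing import Hashable, Sequence


def guess_kind(
    values: Sequence[Hashable],
    small_dataset_rows: int = 100,  # assumed limit between "small" and "big"
    small_threshold: float = 0.5,   # assumed threshold for small datasets
    big_threshold: float = 0.05,    # assumed threshold for big datasets
) -> str:
    """Return a kind label for one column, following the README rules."""
    rows = len(values)
    if rows == 0:
        return "undefined"

    unique = len(set(values))

    # The check order is an assumption; the README does not define it.
    if unique == 1:
        return "constant"
    if unique == 2:
        return "bool"
    if unique == rows:
        return "id"

    # Categorical: fewer unique values than a fraction of the row count.
    threshold = small_threshold if rows <= small_dataset_rows else big_threshold
    if unique < threshold * rows:
        return "categorical"
    return "undefined"


if __name__ == "__main__":
    print(guess_kind([7, 7, 7]))
    print(guess_kind(["y", "n", "y", "n"]))
    print(guess_kind([1, 2, 3, 4]))
```

Because one column can satisfy several rules at once (a two-row column with two distinct values is both Bool-like and Id-like), any real implementation has to pick a precedence; the one used here is just one possible choice.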
### Applicability
- merging teams
- merging companies
- finding out which data are duplicated
- finding similar or different data
## Structure
- **Source code** is in folder [similarity](similarity). More about similarity folder structure in [README.md](similarity/README.md)
- **Source code for column2Vec** is in folder [column2Vec](column2Vec).
- **Source code for column2Vec** is in folder [column2Vec](column2vec).
- **Tests** are in folder [test](test)
- **Data** are stored in folders [**data**](data) and [**data_validation**](data_validation).
- **Main folder** contains: folder .github, files .gitignore, CONTRIBUTING.MD, LICENSE, README.md, requirements.txt, constants.py and main.py
@@ -142,7 +142,7 @@ Explaining kinds:
---

**column2Vec** folder contains all files for [column2Vec](#column2Vec) feature.
More about the structure of this folder can be found [here](column2Vec/README.md/#structure).
More about the structure of this folder can be found [here](column2vec/README.md/#structure).

**Datasets** for testing are stored in [**data**](data) and [**data_validation**](data_validation)
Corresponding link, name and eventual description for each dataset is
@@ -213,5 +213,15 @@ black similarity/metadata_creator.py
```
You can change black settings in [pyproject.toml](pyproject.toml) file.


#### Coverage
You can run it by using this command:
```bash
PYTHONPATH="./similarity:./similarityRunner:$PYTHONPATH" \
coverage run --source='similarity,column2Vec,similarityRunner' -m \
pytest test/test_similarity* test/test_runner* test/test_column2VecCache.py

```

## How to contribute
Please see our [**Contribution Guidelines**](CONTRIBUTING.md).
14 changes: 0 additions & 14 deletions column2Vec/generated/Average column2vec vectors.html

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/Average_column2vec_vectors.html

This file was deleted.

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/Cleaned_sentence_column2vec.html

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/Cleaned_sentence_column2vec_vectors.html

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/Clusters_Average_column2vec_clusters.html

This file was deleted.

This file was deleted.

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/Clusters_Sentence_column2vec_clusters.html

This file was deleted.

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/My_Clusters.html

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/Sentence column2vec vectors.html

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/Sentence_column2vec_vectors.html

This file was deleted.

14 changes: 0 additions & 14 deletions column2Vec/generated/Weighted_average_column2vec_vectors.html

This file was deleted.

2 changes: 0 additions & 2 deletions column2Vec/generated/cache.txt

This file was deleted.
