
Add models and interfaces #33

Merged
merged 55 commits on Nov 26, 2024
Changes from 52 commits
42e0fb9
Add models and interfaces
OlivieFranklova Sep 18, 2024
b88e01d
Reformat with black
OlivieFranklova Sep 18, 2024
b271786
#23 add file system connector
OlivieFranklova Sep 23, 2024
832080a
#23 Add tests and improve filesystem connector
OlivieFranklova Sep 23, 2024
5ea367d
#26 add basic runner
OlivieFranklova Sep 30, 2024
fe6c08c
Fix clean up tokenization error
OlivieFranklova Oct 1, 2024
0b680f7
Switch warnings to logging in Comparators
OlivieFranklova Oct 2, 2024
6241574
#25 Add json formater
OlivieFranklova Oct 2, 2024
8526550
Add runner
OlivieFranklova Oct 2, 2024
6ea0607
Add functionsRunner test to workflow
OlivieFranklova Oct 2, 2024
d087b50
Format files with black
OlivieFranklova Oct 2, 2024
ed9f215
Add tests
OlivieFranklova Oct 2, 2024
b4469b7
omit some similarityRunner files
OlivieFranklova Oct 2, 2024
a933b80
Rename functionRunner to function_runner
OlivieFranklova Oct 3, 2024
8e79fb4
Test ipynb
OlivieFranklova Oct 3, 2024
3788269
Update workflow
OlivieFranklova Oct 3, 2024
bc1c95a
Add comments
OlivieFranklova Oct 3, 2024
2d49e80
Format with black
OlivieFranklova Oct 3, 2024
dfd22c7
Add and updates READMEs
OlivieFranklova Oct 4, 2024
628fa79
Start refactoring structure
OlivieFranklova Nov 16, 2024
8a1a054
Add function runner
OlivieFranklova Nov 16, 2024
453436d
Refactoring: before similarity folder delete
OlivieFranklova Nov 16, 2024
795b4de
Refactoring: after similarity folder delete
OlivieFranklova Nov 16, 2024
7a23ed1
More refactoring
OlivieFranklova Nov 16, 2024
c15b7e8
renamed runner
OlivieFranklova Nov 16, 2024
4c09f7c
Refactoring: working framework tests
OlivieFranklova Nov 17, 2024
9eaa88e
Refactoring: working column2vec tests
OlivieFranklova Nov 17, 2024
602d29d
Refactoring: WIP - similarity_runner rework
OlivieFranklova Nov 18, 2024
4b08f9d
Refactoring: similarity_runner rework
OlivieFranklova Nov 18, 2024
f49e07e
Update ui
OlivieFranklova Nov 20, 2024
7f0e2cb
Update ui
OlivieFranklova Nov 20, 2024
97dce23
Add CLI for connectors, without validation
OlivieFranklova Nov 21, 2024
f137ad4
Add CLI implementation using argsparse and pydantic_settings
OlivieFranklova Nov 22, 2024
aa4daa6
Fix pipeline
OlivieFranklova Nov 22, 2024
a7c4d85
Add analyses settings into handlers, fix some tests
OlivieFranklova Nov 23, 2024
813c89a
Create files for measurments
OlivieFranklova Nov 23, 2024
ec95f44
Fix of the tests
OlivieFranklova Nov 23, 2024
1fd8f79
Change order od calling metadata creators
OlivieFranklova Nov 23, 2024
aa71eb7
Add centralized logging solution - this needs to be fixed later
OlivieFranklova Nov 23, 2024
5988282
Few small updates
OlivieFranklova Nov 24, 2024
ed29f56
Add mesuremnets output
OlivieFranklova Nov 25, 2024
9987a60
Format with black
OlivieFranklova Nov 25, 2024
15db60e
Remove stremlit
OlivieFranklova Nov 25, 2024
933196b
Update requirements
OlivieFranklova Nov 25, 2024
d28c655
Update CI
OlivieFranklova Nov 25, 2024
aa5dc8b
Update requirements
OlivieFranklova Nov 25, 2024
6fce081
Update requirements
OlivieFranklova Nov 25, 2024
0ad9fea
Update tests
OlivieFranklova Nov 25, 2024
4a61724
comment test
OlivieFranklova Nov 25, 2024
c8b99de
update tests
OlivieFranklova Nov 26, 2024
ba41d70
update tests
OlivieFranklova Nov 26, 2024
2b9aa88
add mesurements res graphs
OlivieFranklova Nov 26, 2024
a6ef2cf
Remove reaserch from git
OlivieFranklova Nov 26, 2024
bdcf6e2
Remove measurements from git
OlivieFranklova Nov 26, 2024
e09c45d
add suggestions from PR
OlivieFranklova Nov 26, 2024
15 changes: 8 additions & 7 deletions .github/workflows/py_test.yml
@@ -21,6 +21,7 @@ jobs:

- name: Install dependencies
run: |
pip install -r requirements.txt
pip install pylint

- name: Analysing the code with pylint
@@ -50,7 +51,7 @@ jobs:

python-tests:
env:
TEST_FILES: test/test_types.py test/test_metadata.py test/test_comparator.py test/test_column2VecCache.py
TEST_FILES: tests/similarity_framework/test_similarity* tests/column2vec/test_column2vec_cache.py
name: Run Python Tests
runs-on: ubuntu-latest
steps:
@@ -66,26 +67,26 @@
- name: Install dependencies
run: |
pip install -r requirements.txt
pip install coverage pytest

- name: Run tests
run: coverage run --source='similarity,column2Vec' -m pytest $TEST_FILES
run: |
coverage run -m pytest $TEST_FILES

- name: Show coverage
run: coverage report -m --omit=".*.ipynb"
run: coverage report -m --omit=".*.ipynb,similarity_runner/*"

- name: Create coverage file
if: github.event_name == 'pull_request'
run: coverage xml
run: coverage xml --omit=".*.ipynb,similarity_runner/*"

- name: Get Cover
if: github.event_name == 'pull_request'
uses: orgoro/coverage@v3.1
uses: orgoro/coverage@v3.2
with:
coverageFile: coverage.xml
token: ${{ secrets.GITHUB_TOKEN }}
thresholdAll: 0.7
thresholdNew: 0.9
thresholdNew: 0.7

- uses: actions/upload-artifact@v4
if: github.event_name == 'pull_request'
2 changes: 1 addition & 1 deletion .gitignore
@@ -122,7 +122,7 @@ celerybeat.pid
*.sage.py

# Environments
.env
.config
.venv
env/
venv/
2 changes: 1 addition & 1 deletion .pylintrc
@@ -68,7 +68,7 @@ ignored-modules=

# Python code to execute, usually for sys.path manipulation such as
# pygtk.require().
#init-hook=
init-hook='import sys; sys.path.append("./similarity"); sys.path.append("./similarityRunner")'

# Use multiple processes to speed up Pylint. Specifying 0 will auto-detect the
# number of processors available to use, and will cap the count on Windows to
26 changes: 18 additions & 8 deletions README.md
@@ -27,7 +27,7 @@ the main set (training) on which the program is
tuned, and a validation set for validating the results.

#### Definition of table similarity:
![img_1.png](images/similarity_def.png)
![img_1.png](docs/similarity_def.png)
>Parameter **important columns** is user input.
>
>Parameter **k** is also user input.
@@ -44,8 +44,8 @@ input for Comparator.
Comparator compares metadata and it computes distance.
We should test which one is better.

1. ![img_2.png](images/pipeline1.png)
2. ![img_3.png](images/pipeline2.png)
1. ![img_2.png](similarity_framework/docs/pipeline1.png)
2. ![img_3.png](similarity_framework/docs/pipeline2.png)
#### Metadata creator
MetadataCreator has:
- **constructor** that fills fields:
@@ -102,7 +102,7 @@ of these two tables.
### Column2Vec
Column2Vec is a module in which we implement word2Vec based functionality for columns.
It will compute embeddings for columns, so we can compare them.
More about this module can be found [here](column2Vec/README.md).
More about this module can be found [here](column2vec/README.md).
### Types and Kinds
We have decided to split columns by type. We can compute types or kinds for each column.
Types define the real type of column. Some you may know from programming languages (int, float, string)
@@ -118,22 +118,22 @@ Explaining some types:
- phrase: string with more than one word
- multiple: string that represents not atomic data or structured data
- article: string with more than one sentence
3. ![img.png](images/types.png)
3. ![img.png](docs/types.png)
Kind has only for "types" plus undefined. You can see all types on the picture 4.
Explaining kinds:
- As **Id** will be marked column that contains only uniq values
- As **Bool** will be marked column that contains only two unique values
- As **Constant** will be marked column that contains only one unique value
- As **Categorical** will be marked column that contains categories. Number of uniq values is less than threshold % of the total number of rows. Threshold is different for small and big dataset.
4. ![img.png](images/kind.png)
4. ![img.png](docs/kind.png)
### Applicability
- merging teams
- fuze of companies
- found out which data are duplicated
- finding similar or different data
## Structure
- **Source code** is in folder [similarity](similarity). More about similarity folder structure in [README.md](similarity/README.md)
- **Source code for column2Vec** is in folder [column2Vec](column2Vec).
- **Source code for column2Vec** is in folder [column2Vec](column2vec).
- **Tests** are in folder [test](test)
- **Data** are stored in folders [**data**](data) and [**data_validation**](data_validation).
- **Main folder** contains: folder .github, files .gitignore, CONTRIBUTING.MD, LICENSE, README.md, requirements.txt, constants.py and main.py
@@ -142,7 +142,7 @@ Explaining kinds:
---

**column2Vec** folder contains all files for [column2Vec](#column2Vec) feature.
More about the structure of this folder can be found [here](column2Vec/README.md/#structure).
More about the structure of this folder can be found [here](column2vec/README.md/#structure).

**Datasets** for testing are stored in [**data**](data) and [**data_validation**](data_validation)
Corresponding link, name and eventual description for each dataset is
@@ -213,5 +213,15 @@ black similarity/metadata_creator.py
```
You can change black settings in [pyproject.toml](pyproject.toml) file.


#### Coverage
You can run it by using this command:
```bash
PYTHONPATH="./similarity:./similarityRunner:$PYTHONPATH" \
coverage run --source='similarity,column2Vec,similarityRunner' -m \
pytest test/test_similarity* test/test_runner* test/test_column2VecCache.py

```

## How to contribute
Please see our [**Contribution Guidelines**](CONTRIBUTING.md).
13 changes: 0 additions & 13 deletions column2Vec/research/streamlit/pages/1_map.py

This file was deleted.

35 changes: 0 additions & 35 deletions column2Vec/research/streamlit/streamlit_app.py

This file was deleted.

14 changes: 7 additions & 7 deletions column2Vec/README.md → column2vec/README.md
@@ -5,18 +5,18 @@ We have implemented seven different approaches.

## Structure

Folder [**generated**](generated) contains all generated files.
Folder [**generated**](similarity_framework/src/column2vec2vec/research/generated) contains all generated files.
Mostly html files representing
2D clusters, created by clustering vectors.
It also contains cache file where could be stored created embeddings.
Cashing is possible to switch of or switch on.

Folder [**research**](research) is folder created for testing column2Vec functions.
It contains folder [files](research/files) with all generated and md files with results from test.
[column2Vec_re.py](research/column2Vec_re.py) file with statistics computation for functions.
[generate_report.py](research/generate_report.py) file that contains functions to generate files stored in [files](research/files).
Folder [**research**](similarity_framework/src/column2vec2vec/research) is folder created for testing column2Vec functions.
It contains folder [files](similarity_framework/src/column2vec2vec/research/files) with all generated and md files with results from test.
[column2Vec_re.py](similarity_framework/src/column2vec2vec/research/column2vec_re.py) file with statistics computation for functions.
[generate_report.py](similarity_framework/src/column2vec2vec/research/generate_report.py) file that contains functions to generate files stored in [files](similarity_framework/src/column2vec2vec/research/files).

File [**Column2Vec.py**](impl/Column2Vec.py) in folder [impl](impl) contains 7 different implementations of column2Vec.
File [**Column2Vec.py**](similarity_framework/src/column2vec2vec/src/column2vec.py) in folder [impl](src) contains 7 different implementations of column2Vec.
All implementations could use cache.
There is implemented run time cache and persistent cache stored in cache.txt in folder generated.
To store cache persistently it is necessary to call:
@@ -43,7 +43,7 @@ cache.clear_persistent_cache() # clears the cache file

> Inspired by "Column2Vec: Structural Understanding via Distributed Representations of Database Schemas" by Michael J. Mior, Alexander G. Ororbia [1903.08621](https://arxiv.org/pdf/1903.08621)

File [**functions.py**](impl/functions.py) in folder [impl](impl) contains functions for using column2Vec.
File [**functions.py**](similarity_framework/src/column2vec2vec/src/functions.py) in folder [impl](src) contains functions for using column2Vec.
It contains functions:
- get_nonnumerical_data (returns string columns)
- get_vectors (creates embeddings)
@@ -13,10 +13,10 @@
SentenceTransformer,
)

from column2Vec.impl.functions import (
from column2vec.src.functions import (
get_nonnumerical_data,
)
from column2Vec.impl.Column2Vec import (
from column2vec.src.column2vec import (
column2vec_as_sentence,
column2vec_as_sentence_clean,
column2vec_as_sentence_clean_uniq,
@@ -27,15 +27,12 @@
cache,
)

from column2Vec.research.generate_report import (
from column2vec.research.generate_report import (
generate_time_report,
generate_sim_report,
generate_stability_report,
generate_partial_column_report,
)
from config import configure
from constants import warning_enable
from similarity.Comparator import cosine_sim

from similarity_framework.config import configure
from similarity_framework.src.impl.comparator.comparator_by_type import cosine_sim

FUNCTIONS = [
column2vec_as_sentence,
@@ -48,7 +45,7 @@
]
MODEL = "paraphrase-multilingual-mpnet-base-v2" # 'bert-base-nli-mean-tokens'
THIS_DIR = os.path.dirname(os.path.abspath(__file__))
model = SentenceTransformer(MODEL)
model = SentenceTransformer(MODEL, tokenizer_kwargs={"clean_up_tokenization_spaces": True})


def count_embedding(column1: pd.Series, function, key: str) -> pd.Series:
@@ -255,6 +252,4 @@ def run_fun():


configure()
warning_enable.change_status(True)

run_fun()
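The diff above swaps the `cosine_sim` import from `similarity.Comparator` to `similarity_framework.src.impl.comparator.comparator_by_type`. The helper's body is not shown in this PR, so the following is only a generic stdlib sketch of cosine similarity between two equal-length embedding vectors, not the project's actual implementation:

```python
import math

def cosine_sim(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_sim([1.0, 0.0], [1.0, 0.0]))            # identical vectors -> 1.0
print(round(cosine_sim([1.0, 0.0], [0.0, 1.0]), 6))  # orthogonal vectors -> 0.0
```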
File renamed without changes.
10 changes: 5 additions & 5 deletions column2Vec/impl/Column2Vec.py → column2vec/src/column2vec.py
@@ -16,7 +16,7 @@

from torch import Tensor

logger = logging.getLogger(__name__)
from logging_ import logger


class Cache:
@@ -37,11 +37,11 @@ def __read(self):
try:
self.__cache = pd.io.parsers.read_csv(self.__file, index_col=0)
except FileNotFoundError:
logger.warning("CACHE: File not found.")
logger.debug("CACHE: File not found.")
except pd.errors.EmptyDataError:
logger.warning("CACHE: No data")
logger.debug("CACHE: No data")
except pd.errors.ParserError:
logger.warning("CACHE: Parser error")
logger.debug("CACHE: Parser error")

def get_cache(self, key: str, function: str) -> list | None:
"""
@@ -60,7 +60,7 @@ def get_cache(self, key: str, function: str) -> list | None:
tmp = self.__cache.loc[function, key]
if (tmp != "nan" and tmp is not int) or (tmp is int and not math.isnan(tmp)):
return json.loads(tmp) # json is faster than ast
print(f"NO CACHE key: {key}, function: {function}")
# print(f"NO CACHE key: {key}, function: {function}")
return None

def save(
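The `Cache.__read` hunk above shows the pattern used in `column2vec`: load a persistent cache tolerantly (treating a missing or unparsable file as an empty cache) and fall back to `None` on a miss. Below is a minimal self-contained sketch of that pattern — a hypothetical JSON-backed stand-in, not the project's CSV/pandas-backed `Cache` class; the method names `save_persistently` and `clear_persistent_cache` mirror the README's description:

```python
import json
import logging
import os

logger = logging.getLogger(__name__)

class SimpleCache:
    """Minimal sketch: run-time dict cache with optional file persistence."""

    def __init__(self, path: str = "cache.json"):
        self._path = path
        self._cache: dict = {}
        self._read()

    def _read(self) -> None:
        # A missing or corrupt file is not an error, just an empty cache.
        try:
            with open(self._path, encoding="utf-8") as f:
                self._cache = json.load(f)
        except FileNotFoundError:
            logger.debug("CACHE: file not found")
        except json.JSONDecodeError:
            logger.debug("CACHE: parse error")

    def get(self, function: str, key: str):
        return self._cache.get(function, {}).get(key)

    def save(self, function: str, key: str, value) -> None:
        self._cache.setdefault(function, {})[key] = value

    def save_persistently(self) -> None:
        with open(self._path, "w", encoding="utf-8") as f:
            json.dump(self._cache, f)

    def clear_persistent_cache(self) -> None:
        if os.path.exists(self._path):
            os.remove(self._path)

cache = SimpleCache("demo_cache.json")
cache.save("column2vec_avg", "city", [0.1, 0.2])
print(cache.get("column2vec_avg", "city"))  # [0.1, 0.2]
cache.save_persistently()
cache.clear_persistent_cache()
```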
13 changes: 6 additions & 7 deletions column2Vec/impl/functions.py → column2vec/src/functions.py
@@ -16,11 +16,10 @@
from sklearn.manifold import TSNE

from constants import trained_model
from similarity.Comparator import cosine_sim
from similarity.DataFrameMetadataCreator import (
DataFrameMetadataCreator,
)
from similarity.Types import NONNUMERICAL
from similarity_framework.src.impl.comparator.utils import cosine_sim
from similarity_framework.src.impl.metadata.type_metadata_creator import TypeMetadataCreator
from similarity_framework.src.models.metadata import MetadataCreatorInput
from similarity_framework.src.models.types_ import NONNUMERICAL


def get_nonnumerical_data(
@@ -38,8 +37,8 @@ def get_nonnumerical_data(
for i in files:
index += 1
data = pd.read_csv(i)
metadata_creator = DataFrameMetadataCreator(data).compute_advanced_structural_types().compute_column_kind()
metadata1 = metadata_creator.get_metadata()
metadata_creator = TypeMetadataCreator().compute_advanced_structural_types().compute_column_kind()
metadata1 = metadata_creator.get_metadata(MetadataCreatorInput(dataframe=data))
column_names = metadata1.get_column_names_by_type(NONNUMERICAL)
for name in column_names:
print(f" {i} : {name}")
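The refactored `get_nonnumerical_data` above selects string columns through the framework's metadata classes. For orientation only, the same selection can be sketched with plain pandas dtype filtering — a hypothetical simplification that bypasses the project's `TypeMetadataCreator` type inference entirely:

```python
import pandas as pd

def get_nonnumerical_columns(df: pd.DataFrame) -> dict[str, pd.Series]:
    """Return the non-numeric columns of a DataFrame, keyed by column name."""
    return {name: df[name] for name in df.select_dtypes(exclude="number").columns}

df = pd.DataFrame({"id": [1, 2], "city": ["Prague", "Brno"]})
print(sorted(get_nonnumerical_columns(df)))  # ['city']
```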
File renamed without changes.
13 changes: 7 additions & 6 deletions test/test_column2Vec.py → column2vec/tests/test_column2Vec.py
@@ -5,11 +5,11 @@
import pandas as pd
from sentence_transformers import SentenceTransformer

from column2Vec.impl.Column2Vec import (column2vec_as_sentence, column2vec_as_sentence_clean,
column2vec_as_sentence_clean_uniq, column2vec_avg,
column2vec_weighted_avg, column2vec_sum,
column2vec_weighted_sum, cache)
from column2Vec.impl.functions import get_nonnumerical_data, get_clusters, compute_distances
from src.column2vec import (column2vec_as_sentence, column2vec_as_sentence_clean,
column2vec_as_sentence_clean_uniq, column2vec_avg,
column2vec_weighted_avg, column2vec_sum,
column2vec_weighted_sum, cache)
from src.functions import get_nonnumerical_data, get_clusters, compute_distances
from similarity.DataFrameMetadataCreator import DataFrameMetadataCreator
from similarity.Types import NONNUMERICAL

@@ -33,7 +33,8 @@ def get_vectors(function, data):
count = 1
for key in data:
# print("Processing column: " + key + " " + str(round((count / len(data)) * 100, 2)) + "%")
result[key] = function(data[key], SentenceTransformer(MODEL), key)
result[key] = function(data[key], SentenceTransformer(MODEL, tokenizer_kwargs={
'clean_up_tokenization_spaces': True}), key)
count += 1
end = time.time()
print(f"ELAPSED TIME :{end - start}")
@@ -4,15 +4,15 @@

from sentence_transformers import SentenceTransformer

from column2Vec.impl.Column2Vec import (cache,
column2vec_as_sentence,
column2vec_as_sentence_clean,
column2vec_as_sentence_clean_uniq,
column2vec_avg,
column2vec_weighted_sum,
column2vec_sum,
column2vec_weighted_avg)
from column2Vec.impl.functions import get_nonnumerical_data
from src.column2vec import (cache,
column2vec_as_sentence,
column2vec_as_sentence_clean,
column2vec_as_sentence_clean_uniq,
column2vec_avg,
column2vec_weighted_sum,
column2vec_sum,
column2vec_weighted_avg)
from src.functions import get_nonnumerical_data

MODEL = 'bert-base-nli-mean-tokens'
THIS_DIR = os.path.dirname(os.path.abspath(__file__))