
Add models and interfaces #33

Merged
merged 55 commits on Nov 26, 2024
Changes from 52 commits
42e0fb9
Add models and interfaces
OlivieFranklova Sep 18, 2024
b88e01d
Reformat with black
OlivieFranklova Sep 18, 2024
b271786
#23 add file system connector
OlivieFranklova Sep 23, 2024
832080a
#23 Add tests and improve filesystem connector
OlivieFranklova Sep 23, 2024
5ea367d
#26 add basic runner
OlivieFranklova Sep 30, 2024
fe6c08c
Fix clean up tokenization error
OlivieFranklova Oct 1, 2024
0b680f7
Switch warnings to logging in Comparators
OlivieFranklova Oct 2, 2024
6241574
#25 Add json formater
OlivieFranklova Oct 2, 2024
8526550
Add runner
OlivieFranklova Oct 2, 2024
6ea0607
Add functionsRunner test to workflow
OlivieFranklova Oct 2, 2024
d087b50
Format files with black
OlivieFranklova Oct 2, 2024
ed9f215
Add tests
OlivieFranklova Oct 2, 2024
b4469b7
omit some similarityRunner files
OlivieFranklova Oct 2, 2024
a933b80
Rename functionRunner to function_runner
OlivieFranklova Oct 3, 2024
8e79fb4
Test ipynb
OlivieFranklova Oct 3, 2024
3788269
Update workflow
OlivieFranklova Oct 3, 2024
bc1c95a
Add comments
OlivieFranklova Oct 3, 2024
2d49e80
Format with black
OlivieFranklova Oct 3, 2024
dfd22c7
Add and updates READMEs
OlivieFranklova Oct 4, 2024
628fa79
Start refactoring structure
OlivieFranklova Nov 16, 2024
8a1a054
Add function runner
OlivieFranklova Nov 16, 2024
453436d
Refactoring: before similarity folder delete
OlivieFranklova Nov 16, 2024
795b4de
Refactoring: after similarity folder delete
OlivieFranklova Nov 16, 2024
7a23ed1
More refactoring
OlivieFranklova Nov 16, 2024
c15b7e8
renamed runner
OlivieFranklova Nov 16, 2024
4c09f7c
Refactoring: working framework tests
OlivieFranklova Nov 17, 2024
9eaa88e
Refactoring: working column2vec tests
OlivieFranklova Nov 17, 2024
602d29d
Refactoring: WIP - similarity_runner rework
OlivieFranklova Nov 18, 2024
4b08f9d
Refactoring: similarity_runner rework
OlivieFranklova Nov 18, 2024
f49e07e
Update ui
OlivieFranklova Nov 20, 2024
7f0e2cb
Update ui
OlivieFranklova Nov 20, 2024
97dce23
Add CLI for connectors, without validation
OlivieFranklova Nov 21, 2024
f137ad4
Add CLI implementation using argsparse and pydantic_settings
OlivieFranklova Nov 22, 2024
aa4daa6
Fix pipeline
OlivieFranklova Nov 22, 2024
a7c4d85
Add analyses settings into handlers, fix some tests
OlivieFranklova Nov 23, 2024
813c89a
Create files for measurments
OlivieFranklova Nov 23, 2024
ec95f44
Fix of the tests
OlivieFranklova Nov 23, 2024
1fd8f79
Change order od calling metadata creators
OlivieFranklova Nov 23, 2024
aa71eb7
Add centralized logging solution - this needs to be fixed later
OlivieFranklova Nov 23, 2024
5988282
Few small updates
OlivieFranklova Nov 24, 2024
ed29f56
Add mesuremnets output
OlivieFranklova Nov 25, 2024
9987a60
Format with black
OlivieFranklova Nov 25, 2024
15db60e
Remove stremlit
OlivieFranklova Nov 25, 2024
933196b
Update requirements
OlivieFranklova Nov 25, 2024
d28c655
Update CI
OlivieFranklova Nov 25, 2024
aa5dc8b
Update requirements
OlivieFranklova Nov 25, 2024
6fce081
Update requirements
OlivieFranklova Nov 25, 2024
0ad9fea
Update tests
OlivieFranklova Nov 25, 2024
4a61724
comment test
OlivieFranklova Nov 25, 2024
c8b99de
update tests
OlivieFranklova Nov 26, 2024
ba41d70
update tests
OlivieFranklova Nov 26, 2024
2b9aa88
add mesurements res graphs
OlivieFranklova Nov 26, 2024
a6ef2cf
Remove reaserch from git
OlivieFranklova Nov 26, 2024
bdcf6e2
Remove measurements from git
OlivieFranklova Nov 26, 2024
e09c45d
add suggestions from PR
OlivieFranklova Nov 26, 2024
15 changes: 8 additions & 7 deletions .github/workflows/py_test.yml
@@ -21,6 +21,7 @@ jobs:

- name: Install dependencies
run: |
pip install -r requirements.txt
pip install pylint

- name: Analysing the code with pylint
@@ -50,7 +51,7 @@ jobs:

python-tests:
env:
TEST_FILES: test/test_types.py test/test_metadata.py test/test_comparator.py test/test_column2VecCache.py
TEST_FILES: tests/similarity_framework/test_similarity* tests/column2vec/test_column2vec_cache.py
name: Run Python Tests
runs-on: ubuntu-latest
steps:
@@ -66,26 +67,26 @@
- name: Install dependencies
run: |
pip install -r requirements.txt
pip install coverage pytest

- name: Run tests
run: coverage run --source='similarity,column2Vec' -m pytest $TEST_FILES
run: |
coverage run -m pytest $TEST_FILES

- name: Show coverage
run: coverage report -m --omit=".*.ipynb"
run: coverage report -m --omit=".*.ipynb,similarity_runner/*"

- name: Create coverage file
if: github.event_name == 'pull_request'
run: coverage xml
run: coverage xml --omit=".*.ipynb,similarity_runner/*"

- name: Get Cover
if: github.event_name == 'pull_request'
uses: orgoro/coverage@v3.1
uses: orgoro/coverage@v3.2
with:
coverageFile: coverage.xml
token: ${{ secrets.GITHUB_TOKEN }}
thresholdAll: 0.7
thresholdNew: 0.9
thresholdNew: 0.7

- uses: actions/upload-artifact@v4
if: github.event_name == 'pull_request'
2 changes: 1 addition & 1 deletion .gitignore
@@ -122,7 +122,7 @@ celerybeat.pid
*.sage.py

# Environments
.env
.config
.venv
env/
venv/
2 changes: 1 addition & 1 deletion .pylintrc
@@ -68,7 +68,7 @@ ignored-modules=

# Python code to execute, usually for sys.path manipulation such as
# pygtk.require().
#init-hook=
init-hook='import sys; sys.path.append("./similarity"); sys.path.append("./similarityRunner")'

# Use multiple processes to speed up Pylint. Specifying 0 will auto-detect the
# number of processors available to use, and will cap the count on Windows to
26 changes: 18 additions & 8 deletions README.md
@@ -27,7 +27,7 @@ the main set (training) on which the program is
tuned, and a validation set for validating the results.

#### Definition of table similarity:
![img_1.png](images/similarity_def.png)
![img_1.png](docs/similarity_def.png)
>Parameter **important columns** is user input.
>
>Parameter **k** is also user input.
@@ -44,8 +44,8 @@ input for Comparator.
Comparator compares metadata and it computes distance.
We should test which one is better.

1. ![img_2.png](images/pipeline1.png)
2. ![img_3.png](images/pipeline2.png)
1. ![img_2.png](similarity_framework/docs/pipeline1.png)
2. ![img_3.png](similarity_framework/docs/pipeline2.png)
#### Metadata creator
MetadataCreator has:
- **constructor** that fills fields:
@@ -102,7 +102,7 @@ of these two tables.
### Column2Vec
Column2Vec is a module in which we implement word2Vec based functionality for columns.
It will compute embeddings for columns, so we can compare them.
More about this module can be found [here](column2Vec/README.md).
More about this module can be found [here](column2vec/README.md).
### Types and Kinds
We have decided to split columns by type. We can compute types or kinds for each column.
Types define the real type of column. Some you may know from programming languages (int, float, string)
@@ -118,22 +118,22 @@ Explaining some types:
- phrase: string with more than one word
- multiple: string that represents not atomic data or structured data
- article: string with more than one sentence
3. ![img.png](images/types.png)
3. ![img.png](docs/types.png)
Kind has only for "types" plus undefined. You can see all types on the picture 4.
Explaining kinds:
- As **Id** will be marked column that contains only uniq values
- As **Bool** will be marked column that contains only two unique values
- As **Constant** will be marked column that contains only one unique value
- As **Categorical** will be marked column that contains categories. Number of uniq values is less than threshold % of the total number of rows. Threshold is different for small and big dataset.
4. ![img.png](images/kind.png)
4. ![img.png](docs/kind.png)
### Applicability
- merging teams
- fuze of companies
- found out which data are duplicated
- finding similar or different data
## Structure
- **Source code** is in folder [similarity](similarity). More about similarity folder structure in [README.md](similarity/README.md)
- **Source code for column2Vec** is in folder [column2Vec](column2Vec).
- **Source code for column2Vec** is in folder [column2Vec](column2vec).
- **Tests** are in folder [test](test)
- **Data** are stored in folders [**data**](data) and [**data_validation**](data_validation).
- **Main folder** contains: folder .github, files .gitignore, CONTRIBUTING.MD, LICENSE, README.md, requirements.txt, constants.py and main.py
@@ -142,7 +142,7 @@ Explaining kinds:
---

**column2Vec** folder contains all files for [column2Vec](#column2Vec) feature.
More about the structure of this folder can be found [here](column2Vec/README.md/#structure).
More about the structure of this folder can be found [here](column2vec/README.md/#structure).

**Datasets** for testing are stored in [**data**](data) and [**data_validation**](data_validation)
Corresponding link, name and eventual description for each dataset is
@@ -213,5 +213,15 @@ black similarity/metadata_creator.py
```
You can change black settings in [pyproject.toml](pyproject.toml) file.


#### Coverage
You can run it by using this command:
```bash
PYTHONPATH="./similarity:./similarityRunner:$PYTHONPATH" \
coverage run --source='similarity,column2Vec,similarityRunner' -m \
pytest test/test_similarity* test/test_runner* test/test_column2VecCache.py

```

## How to contribute
Please see our [**Contribution Guidelines**](CONTRIBUTING.md).
13 changes: 0 additions & 13 deletions column2Vec/research/streamlit/pages/1_map.py

This file was deleted.

35 changes: 0 additions & 35 deletions column2Vec/research/streamlit/streamlit_app.py

This file was deleted.

14 changes: 7 additions & 7 deletions column2Vec/README.md → column2vec/README.md
@@ -5,18 +5,18 @@ We have implemented seven different approaches.

## Structure

Folder [**generated**](generated) contains all generated files.
Folder [**generated**](similarity_framework/src/column2vec2vec/research/generated) contains all generated files.
Mostly html files representing
2D clusters, created by clustering vectors.
It also contains cache file where could be stored created embeddings.
Cashing is possible to switch of or switch on.

Folder [**research**](research) is folder created for testing column2Vec functions.
It contains folder [files](research/files) with all generated and md files with results from test.
[column2Vec_re.py](research/column2Vec_re.py) file with statistics computation for functions.
[generate_report.py](research/generate_report.py) file that contains functions to generate files stored in [files](research/files).
Folder [**research**](similarity_framework/src/column2vec2vec/research) is folder created for testing column2Vec functions.
It contains folder [files](similarity_framework/src/column2vec2vec/research/files) with all generated and md files with results from test.
[column2Vec_re.py](similarity_framework/src/column2vec2vec/research/column2vec_re.py) file with statistics computation for functions.
[generate_report.py](similarity_framework/src/column2vec2vec/research/generate_report.py) file that contains functions to generate files stored in [files](similarity_framework/src/column2vec2vec/research/files).

File [**Column2Vec.py**](impl/Column2Vec.py) in folder [impl](impl) contains 7 different implementations of column2Vec.
File [**Column2Vec.py**](similarity_framework/src/column2vec2vec/src/column2vec.py) in folder [impl](src) contains 7 different implementations of column2Vec.
All implementations could use cache.
There is implemented run time cache and persistent cache stored in cache.txt in folder generated.
To store cache persistently it is necessary to call:
@@ -43,7 +43,7 @@ cache.clear_persistent_cache() # clears the cache file

> Inspired by "Column2Vec: Structural Understanding via Distributed Representations of Database Schemas" by Michael J. Mior, Alexander G. Ororbia [1903.08621](https://arxiv.org/pdf/1903.08621)

File [**functions.py**](impl/functions.py) in folder [impl](impl) contains functions for using column2Vec.
File [**functions.py**](similarity_framework/src/column2vec2vec/src/functions.py) in folder [impl](src) contains functions for using column2Vec.
It contains functions:
- get_nonnumerical_data (returns string columns)
- get_vectors (creates embeddings)
@@ -13,10 +13,10 @@
SentenceTransformer,
)

from column2Vec.impl.functions import (
from column2vec.src.functions import (
get_nonnumerical_data,
)
from column2Vec.impl.Column2Vec import (
from column2vec.src.column2vec import (
column2vec_as_sentence,
column2vec_as_sentence_clean,
column2vec_as_sentence_clean_uniq,
@@ -27,15 +27,12 @@
cache,
)

from column2Vec.research.generate_report import (
from column2vec.research.generate_report import (
generate_time_report,
generate_sim_report,
generate_stability_report,
generate_partial_column_report,
)
from config import configure
from constants import warning_enable
from similarity.Comparator import cosine_sim

from similarity_framework.config import configure
from similarity_framework.src.impl.comparator.comparator_by_type import cosine_sim

FUNCTIONS = [
column2vec_as_sentence,
@@ -48,7 +45,7 @@
]
MODEL = "paraphrase-multilingual-mpnet-base-v2" # 'bert-base-nli-mean-tokens'
THIS_DIR = os.path.dirname(os.path.abspath(__file__))
model = SentenceTransformer(MODEL)
model = SentenceTransformer(MODEL, tokenizer_kwargs={"clean_up_tokenization_spaces": True})


def count_embedding(column1: pd.Series, function, key: str) -> pd.Series:
@@ -255,6 +252,4 @@ def run_fun():


configure()
warning_enable.change_status(True)

run_fun()
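The diff above swaps the `cosine_sim` import from `similarity.Comparator` to `similarity_framework.src.impl.comparator.comparator_by_type`. The helper's body is not shown in this PR, so the following is only a generic stdlib sketch of cosine similarity between two equal-length embedding vectors, not the project's actual implementation:

```python
import math

def cosine_sim(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_sim([1.0, 0.0], [1.0, 0.0]))            # identical vectors -> 1.0
print(round(cosine_sim([1.0, 0.0], [0.0, 1.0]), 6))  # orthogonal vectors -> 0.0
```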
File renamed without changes.
10 changes: 5 additions & 5 deletions column2Vec/impl/Column2Vec.py → column2vec/src/column2vec.py
@@ -16,7 +16,7 @@

from torch import Tensor

logger = logging.getLogger(__name__)
from logging_ import logger


class Cache:
@@ -37,11 +37,11 @@ def __read(self):
try:
self.__cache = pd.io.parsers.read_csv(self.__file, index_col=0)
except FileNotFoundError:
logger.warning("CACHE: File not found.")
logger.debug("CACHE: File not found.")
except pd.errors.EmptyDataError:
logger.warning("CACHE: No data")
logger.debug("CACHE: No data")
except pd.errors.ParserError:
logger.warning("CACHE: Parser error")
logger.debug("CACHE: Parser error")

def get_cache(self, key: str, function: str) -> list | None:
"""
@@ -60,7 +60,7 @@ def get_cache(self, key: str, function: str) -> list | None:
tmp = self.__cache.loc[function, key]
if (tmp != "nan" and tmp is not int) or (tmp is int and not math.isnan(tmp)):
return json.loads(tmp) # json is faster than ast
print(f"NO CACHE key: {key}, function: {function}")
# print(f"NO CACHE key: {key}, function: {function}")
return None

def save(
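The `Cache.__read` hunk above shows the pattern used in `column2vec`: load a persistent cache tolerantly (treating a missing or unparsable file as an empty cache) and fall back to `None` on a miss. Below is a minimal self-contained sketch of that pattern — a hypothetical JSON-backed stand-in, not the project's CSV/pandas-backed `Cache` class; the method names `save_persistently` and `clear_persistent_cache` mirror the README's description:

```python
import json
import logging
import os

logger = logging.getLogger(__name__)

class SimpleCache:
    """Minimal sketch: run-time dict cache with optional file persistence."""

    def __init__(self, path: str = "cache.json"):
        self._path = path
        self._cache: dict = {}
        self._read()

    def _read(self) -> None:
        # A missing or corrupt file is not an error, just an empty cache.
        try:
            with open(self._path, encoding="utf-8") as f:
                self._cache = json.load(f)
        except FileNotFoundError:
            logger.debug("CACHE: file not found")
        except json.JSONDecodeError:
            logger.debug("CACHE: parse error")

    def get(self, function: str, key: str):
        return self._cache.get(function, {}).get(key)

    def save(self, function: str, key: str, value) -> None:
        self._cache.setdefault(function, {})[key] = value

    def save_persistently(self) -> None:
        with open(self._path, "w", encoding="utf-8") as f:
            json.dump(self._cache, f)

    def clear_persistent_cache(self) -> None:
        if os.path.exists(self._path):
            os.remove(self._path)

cache = SimpleCache("demo_cache.json")
cache.save("column2vec_avg", "city", [0.1, 0.2])
print(cache.get("column2vec_avg", "city"))  # [0.1, 0.2]
cache.save_persistently()
cache.clear_persistent_cache()
```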
13 changes: 6 additions & 7 deletions column2Vec/impl/functions.py → column2vec/src/functions.py
@@ -16,11 +16,10 @@
from sklearn.manifold import TSNE

from constants import trained_model
from similarity.Comparator import cosine_sim
from similarity.DataFrameMetadataCreator import (
DataFrameMetadataCreator,
)
from similarity.Types import NONNUMERICAL
from similarity_framework.src.impl.comparator.utils import cosine_sim
from similarity_framework.src.impl.metadata.type_metadata_creator import TypeMetadataCreator
from similarity_framework.src.models.metadata import MetadataCreatorInput
from similarity_framework.src.models.types_ import NONNUMERICAL


def get_nonnumerical_data(
@@ -38,8 +37,8 @@ def get_nonnumerical_data(
for i in files:
index += 1
data = pd.read_csv(i)
metadata_creator = DataFrameMetadataCreator(data).compute_advanced_structural_types().compute_column_kind()
metadata1 = metadata_creator.get_metadata()
metadata_creator = TypeMetadataCreator().compute_advanced_structural_types().compute_column_kind()
metadata1 = metadata_creator.get_metadata(MetadataCreatorInput(dataframe=data))
column_names = metadata1.get_column_names_by_type(NONNUMERICAL)
for name in column_names:
print(f" {i} : {name}")
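The refactored `get_nonnumerical_data` above selects string columns through the framework's metadata classes. For orientation only, the same selection can be sketched with plain pandas dtype filtering — a hypothetical simplification that bypasses the project's `TypeMetadataCreator` type inference entirely:

```python
import pandas as pd

def get_nonnumerical_columns(df: pd.DataFrame) -> dict[str, pd.Series]:
    """Return the non-numeric columns of a DataFrame, keyed by column name."""
    return {name: df[name] for name in df.select_dtypes(exclude="number").columns}

df = pd.DataFrame({"id": [1, 2], "city": ["Prague", "Brno"]})
print(sorted(get_nonnumerical_columns(df)))  # ['city']
```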
File renamed without changes.
13 changes: 7 additions & 6 deletions test/test_column2Vec.py → column2vec/tests/test_column2Vec.py
@@ -5,11 +5,11 @@
import pandas as pd
from sentence_transformers import SentenceTransformer

from column2Vec.impl.Column2Vec import (column2vec_as_sentence, column2vec_as_sentence_clean,
column2vec_as_sentence_clean_uniq, column2vec_avg,
column2vec_weighted_avg, column2vec_sum,
column2vec_weighted_sum, cache)
from column2Vec.impl.functions import get_nonnumerical_data, get_clusters, compute_distances
from src.column2vec import (column2vec_as_sentence, column2vec_as_sentence_clean,
column2vec_as_sentence_clean_uniq, column2vec_avg,
column2vec_weighted_avg, column2vec_sum,
column2vec_weighted_sum, cache)
from src.functions import get_nonnumerical_data, get_clusters, compute_distances
from similarity.DataFrameMetadataCreator import DataFrameMetadataCreator
from similarity.Types import NONNUMERICAL

@@ -33,7 +33,8 @@ def get_vectors(function, data):
count = 1
for key in data:
# print("Processing column: " + key + " " + str(round((count / len(data)) * 100, 2)) + "%")
result[key] = function(data[key], SentenceTransformer(MODEL), key)
result[key] = function(data[key], SentenceTransformer(MODEL, tokenizer_kwargs={
'clean_up_tokenization_spaces': True}), key)
count += 1
end = time.time()
print(f"ELAPSED TIME :{end - start}")
@@ -4,15 +4,15 @@

from sentence_transformers import SentenceTransformer

from column2Vec.impl.Column2Vec import (cache,
column2vec_as_sentence,
column2vec_as_sentence_clean,
column2vec_as_sentence_clean_uniq,
column2vec_avg,
column2vec_weighted_sum,
column2vec_sum,
column2vec_weighted_avg)
from column2Vec.impl.functions import get_nonnumerical_data
from src.column2vec import (cache,
column2vec_as_sentence,
column2vec_as_sentence_clean,
column2vec_as_sentence_clean_uniq,
column2vec_avg,
column2vec_weighted_sum,
column2vec_sum,
column2vec_weighted_avg)
from src.functions import get_nonnumerical_data

MODEL = 'bert-base-nli-mean-tokens'
THIS_DIR = os.path.dirname(os.path.abspath(__file__))