Add models and interfaces #33

Merged 55 commits on Nov 26, 2024
Showing changes from 1 commit
Commits
42e0fb9
Add models and interfaces
OlivieFranklova Sep 18, 2024
b88e01d
Reformat with black
OlivieFranklova Sep 18, 2024
b271786
#23 add file system connector
OlivieFranklova Sep 23, 2024
832080a
#23 Add tests and improve filesystem connector
OlivieFranklova Sep 23, 2024
5ea367d
#26 add basic runner
OlivieFranklova Sep 30, 2024
fe6c08c
Fix clean up tokenization error
OlivieFranklova Oct 1, 2024
0b680f7
Switch warnings to logging in Comparators
OlivieFranklova Oct 2, 2024
6241574
#25 Add json formater
OlivieFranklova Oct 2, 2024
8526550
Add runner
OlivieFranklova Oct 2, 2024
6ea0607
Add functionsRunner test to workflow
OlivieFranklova Oct 2, 2024
d087b50
Format files with black
OlivieFranklova Oct 2, 2024
ed9f215
Add tests
OlivieFranklova Oct 2, 2024
b4469b7
omit some similarityRunner files
OlivieFranklova Oct 2, 2024
a933b80
Rename functionRunner to function_runner
OlivieFranklova Oct 3, 2024
8e79fb4
Test ipynb
OlivieFranklova Oct 3, 2024
3788269
Update workflow
OlivieFranklova Oct 3, 2024
bc1c95a
Add comments
OlivieFranklova Oct 3, 2024
2d49e80
Format with black
OlivieFranklova Oct 3, 2024
dfd22c7
Add and updates READMEs
OlivieFranklova Oct 4, 2024
628fa79
Start refactoring structure
OlivieFranklova Nov 16, 2024
8a1a054
Add function runner
OlivieFranklova Nov 16, 2024
453436d
Refactoring: before similarity folder delete
OlivieFranklova Nov 16, 2024
795b4de
Refactoring: after similarity folder delete
OlivieFranklova Nov 16, 2024
7a23ed1
More refactoring
OlivieFranklova Nov 16, 2024
c15b7e8
renamed runner
OlivieFranklova Nov 16, 2024
4c09f7c
Refactoring: working framework tests
OlivieFranklova Nov 17, 2024
9eaa88e
Refactoring: working column2vec tests
OlivieFranklova Nov 17, 2024
602d29d
Refactoring: WIP - similarity_runner rework
OlivieFranklova Nov 18, 2024
4b08f9d
Refactoring: similarity_runner rework
OlivieFranklova Nov 18, 2024
f49e07e
Update ui
OlivieFranklova Nov 20, 2024
7f0e2cb
Update ui
OlivieFranklova Nov 20, 2024
97dce23
Add CLI for connectors, without validation
OlivieFranklova Nov 21, 2024
f137ad4
Add CLI implementation using argsparse and pydantic_settings
OlivieFranklova Nov 22, 2024
aa4daa6
Fix pipeline
OlivieFranklova Nov 22, 2024
a7c4d85
Add analyses settings into handlers, fix some tests
OlivieFranklova Nov 23, 2024
813c89a
Create files for measurments
OlivieFranklova Nov 23, 2024
ec95f44
Fix of the tests
OlivieFranklova Nov 23, 2024
1fd8f79
Change order od calling metadata creators
OlivieFranklova Nov 23, 2024
aa71eb7
Add centralized logging solution - this needs to be fixed later
OlivieFranklova Nov 23, 2024
5988282
Few small updates
OlivieFranklova Nov 24, 2024
ed29f56
Add mesuremnets output
OlivieFranklova Nov 25, 2024
9987a60
Format with black
OlivieFranklova Nov 25, 2024
15db60e
Remove stremlit
OlivieFranklova Nov 25, 2024
933196b
Update requirements
OlivieFranklova Nov 25, 2024
d28c655
Update CI
OlivieFranklova Nov 25, 2024
aa5dc8b
Update requirements
OlivieFranklova Nov 25, 2024
6fce081
Update requirements
OlivieFranklova Nov 25, 2024
0ad9fea
Update tests
OlivieFranklova Nov 25, 2024
4a61724
comment test
OlivieFranklova Nov 25, 2024
c8b99de
update tests
OlivieFranklova Nov 26, 2024
ba41d70
update tests
OlivieFranklova Nov 26, 2024
2b9aa88
add mesurements res graphs
OlivieFranklova Nov 26, 2024
a6ef2cf
Remove reaserch from git
OlivieFranklova Nov 26, 2024
bdcf6e2
Remove measurements from git
OlivieFranklova Nov 26, 2024
e09c45d
add suggestions from PR
OlivieFranklova Nov 26, 2024
Add analyses settings into handlers, fix some tests
OlivieFranklova committed Nov 23, 2024

Verified

This commit was signed with the committer’s verified signature.
commit a7c4d8566affe4e9055b5bdfd4a70f2ab7124510
3 changes: 0 additions & 3 deletions similarity_framework/config.py
Original file line number Diff line number Diff line change
@@ -12,6 +12,3 @@ def configure():
os.environ["REQUESTS_CA_BUNDLE"] = certifi.where()
os.environ["CURL_CA_BUNDLE"] = certifi.where()
print(f"Environment configured. {certifi.where()}")


class Configuration: ...
Original file line number Diff line number Diff line change
@@ -10,7 +10,7 @@
from similarity_framework.src.impl.logging import warning_enable
from similarity_framework.src.models.metadata import Metadata, KindMetadata, CategoricalMetadata
from similarity_framework.src.models.similarity import SimilarityOutput
from similarity_framework.src.models.types_ import DataKind
from similarity_framework.src.models.types_ import DataKind, Type
from similarity_framework.src.models.settings import AnalysisSettings


@@ -356,7 +356,63 @@ def _inner_compare(self, metadata1: Metadata, metadata2: Metadata, index1: str,
return np.nan


## TODO Type
class ColumnTypeHandler(SpecificColumnHandler):

def __numerical_compare1(self, metadata1: Metadata, metadata2: Metadata, index1: str, index2: str,
column1_type: type[Type], column2_type: type[Type]) -> float:
num_met1 = metadata1.numerical_metadata[index1]
num_met2 = metadata2.numerical_metadata[index2]
score = 3 if column1_type == column2_type else 0
if num_met1.same_value_length == num_met2.same_value_length:
score += 2
if num_met1.min_value == num_met2.min_value:
score += 1
elif num_met1.min_value == num_met2.min_value + num_met1.range_size/100 \
or num_met1.min_value == num_met2.min_value - num_met1.range_size/100:
score += 0.5
if num_met1.max_value == num_met2.max_value:
score += 1
elif num_met1.max_value == num_met2.max_value - num_met1.range_size/100 \
or num_met1.max_value == num_met2.max_value + num_met1.range_size/100:
score += 0.5
if num_met1.range_size == num_met2.range_size:
score += 2
return 1 - score / 9

def __nonnumerical_compare1(self, metadata1: Metadata, metadata2: Metadata, index1: str, index2: str,
column1_type: type[Type], column2_type: type[Type]) -> float:
num_met1 = metadata1.nonnumerical_metadata[index1]
num_met2 = metadata2.nonnumerical_metadata[index2]
score = 3 if column1_type == column2_type else 0
if num_met1.longest == num_met2.longest:
score += 2
if num_met1.shortest == num_met2.shortest:
score += 2
if num_met1.avg_length == num_met2.avg_length:
score += 2
elif num_met1.avg_length == num_met2.avg_length + num_met1.avg_length/100 \
or num_met1.avg_length == num_met2.avg_length - num_met1.avg_length/100:
score += 1
return 1 - score / 9


def _inner_compare(self, metadata1: Metadata, metadata2: Metadata, index1: str, index2: str) -> float:
"""
Compare if two columns have the same type.
:param index2: name of column in metadata2
:param index1: name of column in metadata1
:param metadata1: first dataframe metadata
:param metadata2: second dataframe metadata
:return: float number between 0 and 1 (distance)
"""
column1_type = metadata1.get_column_type(index1)
column2_type = metadata2.get_column_type(index2)
if index1 in metadata1.numerical_metadata and index2 in metadata2.numerical_metadata:
return self.__numerical_compare1(metadata1, metadata2, index1, index2, column1_type, column2_type)

if index1 in metadata1.nonnumerical_metadata and index2 in metadata2.nonnumerical_metadata:
return self.__nonnumerical_compare1(metadata1, metadata2, index1, index2, column1_type, column2_type)
return 1
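
Review note: the numerical branch of ColumnTypeHandler awards 3 points for matching types, 2 for matching same_value_length, 1 each for matching min and max (0.5 for a near match), and 2 for matching range_size, then returns 1 - score / 9. The near-match checks in the diff use exact equality against a value offset by range_size/100, which floating-point data will rarely satisfy. The following is a minimal standalone sketch of the same weighting, assuming a tolerance band of 1% of the first column's range was intended. `NumStats`, `bound_score` and `numerical_distance` are hypothetical names and not part of the framework.

```python
from dataclasses import dataclass


@dataclass
class NumStats:
    """Hypothetical stand-in for the numerical metadata fields used in the diff."""

    same_value_length: bool
    min_value: float
    max_value: float
    range_size: float


def bound_score(a: float, b: float, tolerance: float) -> float:
    """Return 1 for an exact match, 0.5 if within the tolerance, else 0."""
    if a == b:
        return 1.0
    if abs(a - b) <= tolerance:
        return 0.5
    return 0.0


def numerical_distance(s1: NumStats, s2: NumStats, same_type: bool) -> float:
    """Distance in [0, 1]; 0 means every check matched (maximum score of 9)."""
    score = 3.0 if same_type else 0.0
    if s1.same_value_length == s2.same_value_length:
        score += 2
    # Assumption: "near" means within 1% of the first column's range.
    tolerance = s1.range_size / 100
    score += bound_score(s1.min_value, s2.min_value, tolerance)
    score += bound_score(s1.max_value, s2.max_value, tolerance)
    if s1.range_size == s2.range_size:
        score += 2
    return 1 - score / 9
```

The tolerance band replaces the exact-equality checks on purpose; if exact offsets were the intent, `bound_score` would need to change accordingly.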


class ComparatorByColumn(Comparator):
@@ -412,6 +468,7 @@ def weightwed_avg(self, distances: list[tuple[float, int]]) -> float:
def __pre_compare_individual(self):
for i in self.table_comparators:
i.settings = self.settings
i.analysis_settings = self.analysis_settings

def _compare(self, metadata1: Metadata, metadata2: Metadata) -> SimilarityOutput:
"""
Original file line number Diff line number Diff line change
@@ -12,12 +12,13 @@
class HandlerType(ABC):
"""Abstract class for comparators"""

def __init__(self, weight: int = 1):
def __init__(self, weight: int = 1, analysis_settings: AnalysisSettings = None):
"""
Constructor for ComparatorType
:param weight: weight of the comparator
"""
self.weight: int = weight
self.analysis_settings: AnalysisSettings = analysis_settings

@abstractmethod
def compare(self, metadata1: Metadata, metadata2: Metadata, **kwargs) -> pd.DataFrame | float:
@@ -33,6 +34,7 @@ def __init__(self):
self.settings: set[Settings] = set()
self.distance_function = HausdorffDistanceMin()
self.comparator_type: list[HandlerType] = []
self.analysis_settings: AnalysisSettings = None

def set_distance_function(self, distance_function: DistanceFunction) -> "Comparator":
"""
@@ -55,13 +57,15 @@ def add_settings(self, setting: Settings) -> "Comparator":
self.settings.add(setting)
return self

def compare(self, metadata1: Metadata, metadata2: Metadata) -> SimilarityOutput:
def compare(self, metadata1: Metadata, metadata2: Metadata, analysis_settings: AnalysisSettings = AnalysisSettings()) -> SimilarityOutput:
self.analysis_settings = analysis_settings
self.__pre_compare()
return self._compare(metadata1, metadata2)

def __pre_compare(self):
for i in self.comparator_type:
i.settings = self.settings
i.analysis_settings = self.analysis_settings
self.__pre_compare_individual()

def __pre_compare_individual(self, **kwargs):
18 changes: 18 additions & 0 deletions similarity_framework/src/models/metadata.py
Original file line number Diff line number Diff line change
@@ -229,6 +229,24 @@ def get_numerical_columns_names(self):
"""
return self.get_column_names_by_type(NUMERICAL, FLOAT, INT, HUMAN_GENERATED, COMPUTER_GENERATED)

def get_column_type(self, name: str) -> type[Type] | None:
"""
Get column type by column name
"""
for column_type, columns in self.column_type.items():
if name in columns:
return column_type
return None

def get_column_kind(self, name: str) -> DataKind | None:
"""
Get column kind by column name
"""
for column_kind, columns in self.column_kind.items():
if name in columns:
return column_kind
return None


@dataclass
class MetadataCreatorInput:
2 changes: 1 addition & 1 deletion similarity_runner/src/interfaces/ui.py
Original file line number Diff line number Diff line change
@@ -36,6 +36,6 @@ def run(self):
result = []
for first in metadata:
for second in metadata:
result.append(comparator.compare(first, second))
result.append(comparator.compare(first, second, analysis_settings))
# TODO: based on analysis settings get specified metadata objects
self.show(result, analysis_settings)
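
Review note: the nested loop in `run` compares every ordered pair of metadata objects, including each object with itself, so n inputs produce n * n comparisons. If the comparison is symmetric (an assumption; the diff does not state it), roughly half of those calls are redundant. A minimal sketch of both enumerations, using a hypothetical symmetric `compare` function on strings:

```python
from itertools import combinations_with_replacement, product


def compare(a: str, b: str) -> float:
    """Hypothetical symmetric distance: absolute difference in length."""
    return float(abs(len(a) - len(b)))


items = ["a", "bb", "ccc"]

# Every ordered pair, as in the nested loop: n * n comparisons.
all_pairs = [compare(x, y) for x, y in product(items, repeat=2)]

# Each unordered pair once, self-pairs included: n * (n + 1) / 2 comparisons.
unique_pairs = [compare(x, y) for x, y in combinations_with_replacement(items, 2)]
```

Keeping the full product is still the right choice if the output is meant to be a complete matrix or if the comparator is not symmetric.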
41 changes: 14 additions & 27 deletions tests/similarity_framework/test_similarity_comparator.py
Original file line number Diff line number Diff line change
@@ -60,21 +60,16 @@ def setUp(self):
self.data_second_half.index = self.data_second_half.index - int(len(self.data) / 2)
self.data_diff_type = self.data.copy() # todo fill

self.metadata_creator = TypeMetadataCreator().compute_advanced_structural_types().compute_column_kind()
self.metadata_creator = (TypeMetadataCreator()
.compute_advanced_structural_types()
.compute_column_kind()
.compute_advanced_structural_types()
.compute_column_kind()
.compute_incomplete_column())
self.metadata1 = self.metadata_creator.get_metadata(MetadataCreatorInput(dataframe=self.data))

metadata_creator = (TypeMetadataCreator().
compute_advanced_structural_types().
compute_column_kind())
self.metadata_diff_column_names = metadata_creator.get_metadata(MetadataCreatorInput(dataframe=self.data_diff_column_names))
metadata_creator = (TypeMetadataCreator().
compute_advanced_structural_types().
compute_column_kind())
self.metadata_first_half = metadata_creator.get_metadata(MetadataCreatorInput(dataframe=self.data_first_half))
metadata_creator = (TypeMetadataCreator().
compute_advanced_structural_types().
compute_column_kind())
self.metadata_second_half = metadata_creator.get_metadata(MetadataCreatorInput(dataframe=self.data_second_half))
self.metadata_diff_column_names = self.metadata_creator.get_metadata(MetadataCreatorInput(dataframe=self.data_diff_column_names))
self.metadata_first_half = self.metadata_creator.get_metadata(MetadataCreatorInput(dataframe=self.data_first_half))
self.metadata_second_half = self.metadata_creator.get_metadata(MetadataCreatorInput(dataframe=self.data_second_half))

def test_size_compare(self):
self.compartor.add_comparator_type(SizeHandler())
@@ -170,23 +165,15 @@ def setUp(self):

self.metadata_creator = (TypeMetadataCreator().
compute_advanced_structural_types().
compute_column_kind())
compute_column_kind()
.compute_incomplete_column())
self.metadata1 = self.metadata_creator.get_metadata(MetadataCreatorInput(dataframe=self.data))

metadata_creator = (TypeMetadataCreator().
compute_advanced_structural_types().
compute_column_kind())
self.metadata_diff_column_names = metadata_creator.get_metadata(MetadataCreatorInput(dataframe=self.data_diff_column_names))
self.metadata_diff_column_names = self.metadata_creator.get_metadata(MetadataCreatorInput(dataframe=self.data_diff_column_names))

metadata_creator = (TypeMetadataCreator().
compute_advanced_structural_types().
compute_column_kind())
self.metadata_first_half = metadata_creator.get_metadata(MetadataCreatorInput(dataframe=self.data_first_half))
self.metadata_first_half = self.metadata_creator.get_metadata(MetadataCreatorInput(dataframe=self.data_first_half))

metadata_creator = (TypeMetadataCreator().
compute_advanced_structural_types().
compute_column_kind())
self.metadata_second_half = metadata_creator.get_metadata(MetadataCreatorInput(dataframe=self.data_second_half))
self.metadata_second_half = self.metadata_creator.get_metadata(MetadataCreatorInput(dataframe=self.data_second_half))

def test_size_compare(self):
self.compartor.add_comparator_type(SizeHandlerByColumn())