Bengali Sentence Similarity Measurement

BenSim is a Python package for measuring the semantic similarity between sentences in the Bengali language. It takes a reference sentence and either a single target sentence or a list of target sentences as input, plus two optional parameters: the similarity method (default: cosine similarity) and the maximum sequence length (default: 512, which is also the current upper limit). Sequence length is counted in tokens produced by the WordPiece tokenizer.

BenSim normalizes the input texts and extracts contextual embeddings of the reference and target sentences with a pre-trained BERT model. It then compares the reference sentence with each target sentence using either cosine similarity or Euclidean distance, depending on the chosen method, and returns the scores, one per target sentence. With cosine similarity, a higher score means higher similarity; with Euclidean distance, a lower score means higher similarity.
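
To make the direction of the two scores concrete, here is a minimal sketch of both measures on toy vectors. It is not BenSim's internal code: the function names `cosine_similarity` and `euclidean_distance` and the 3-dimensional vectors are assumptions for illustration only, standing in for the much larger BERT embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors (0.0 = identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy stand-ins for sentence embeddings (hypothetical values).
reference = [1.0, 2.0, 3.0]
close_target = [1.0, 2.0, 2.5]
far_target = [-3.0, 0.5, 0.0]

cos_close = cosine_similarity(reference, close_target)
cos_far = cosine_similarity(reference, far_target)
dist_close = euclidean_distance(reference, close_target)
dist_far = euclidean_distance(reference, far_target)

# The closer vector scores higher on cosine and lower on Euclidean distance.
print(cos_close > cos_far, dist_close < dist_far)
```

The same pair of vectors therefore ranks as "more similar" under both measures, but the scores move in opposite directions.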

Setup from clone

Install PyTorch in your virtual environment.
Clone this repo.
Run the following commands to install:

$ python setup.py bdist_wheel # to build
$ pip install -e .

Default Usage

On first use, downloading the pre-trained BERT model to your system may take a while.

Sample input (single target sentence):

from bensim import similarity
score = similarity.score('তোমার সাথে দেখা হয়ে ভালো লাগলো।', 'আপনার সাথে দেখা হয়ে ভালো লাগলো।')
score

Sample output:

array([0.83910286], dtype=float32)

Sample input (list of target sentences):

from bensim import similarity
scores = similarity.score('তোমার সাথে দেখা হয়ে ভালো লাগলো।', ['আপনার সাথে দেখা হয়ে ভালো লাগলো।', 'আপনার সাথে দেখা হয়ে ভালো লাগলো।'])
scores

Sample output:

array([0.83910286, 0.83910286], dtype=float32)

Limiting max sequence length

To limit the maximum sequence length, set max_seq to a value up to 512. If it is not specified, the default is 512.

Sample input:

from bensim import similarity
score = similarity.score('তোমার সাথে দেখা হয়ে ভালো লাগলো।', 'আপনার সাথে দেখা হয়ে ভালো লাগলো।', max_seq=10)
score

Sample output:

array([0.763596], dtype=float32)
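
A lower max_seq changes the score because less of each sentence reaches the model. The sketch below shows one common way a token limit is applied with BERT-style inputs. It is an assumption, not BenSim's confirmed behavior: the helper `truncate_tokens`, the reservation of two slots for the `[CLS]` and `[SEP]` markers, and the placeholder tokens are all hypothetical.

```python
def truncate_tokens(tokens, max_seq):
    """Keep at most max_seq tokens in total, including [CLS] and [SEP].

    Assumes the usual BERT convention of wrapping the sequence in two
    special tokens; tokens beyond the limit are dropped from the end.
    """
    if max_seq < 2:
        raise ValueError("max_seq must leave room for [CLS] and [SEP]")
    body = tokens[: max_seq - 2]
    return ["[CLS]"] + body + ["[SEP]"]

# Placeholder tokens standing in for WordPiece output (hypothetical).
tokens = ["tok%d" % i for i in range(12)]

truncated = truncate_tokens(tokens, max_seq=10)
print(len(truncated))      # total length never exceeds max_seq
print(truncated[1:-1])     # only the first 8 content tokens survive
```

Under this assumption, any content past the limit no longer contributes to the embedding, which is why two sentences can score differently at max_seq = 10 than at the default.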

Computing the Euclidean distance

To measure similarity by Euclidean distance (lower means more similar), set similarity_method to 'euclidean'. The default is cosine similarity (higher means more similar).

Sample input:

from bensim import similarity
score = similarity.score('তোমার সাথে দেখা হয়ে ভালো লাগলো।', 'আপনার সাথে দেখা হয়ে ভালো লাগলো।', similarity_method='euclidean')
score

Sample output:

array([8.81357], dtype=float32)
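
Because the two methods order results in opposite directions, code that ranks target sentences needs to know which method produced the scores. The following is a small sketch of such a ranking step. The helper `rank_targets` is hypothetical and not part of BenSim, and the score values are made up for illustration rather than taken from the model.

```python
def rank_targets(targets, scores, similarity_method="cosine"):
    """Return (target, score) pairs ordered from most to least similar.

    Cosine scores are sorted in descending order (higher = more similar);
    Euclidean distances are sorted in ascending order (lower = more similar).
    """
    if similarity_method not in ("cosine", "euclidean"):
        raise ValueError("similarity_method must be 'cosine' or 'euclidean'")
    pairs = [(t, float(s)) for t, s in zip(targets, scores)]
    descending = similarity_method == "cosine"
    return sorted(pairs, key=lambda pair: pair[1], reverse=descending)

# Made-up scores for three placeholder target sentences.
targets = ["target A", "target B", "target C"]
cosine_scores = [0.62, 0.84, 0.71]
euclidean_scores = [12.4, 8.8, 10.1]

by_cosine = rank_targets(targets, cosine_scores, "cosine")
by_euclidean = rank_targets(targets, euclidean_scores, "euclidean")
print([t for t, _ in by_cosine])
print([t for t, _ in by_euclidean])
```

The `float(s)` conversion lets the helper accept plain Python numbers as well as the float32 array elements shown in the sample outputs.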
