A Python package for measuring the semantic similarity between sentences in the Bengali language. The user provides a reference sentence and a list of target sentences and, optionally, the similarity method (default: cosine similarity) and the maximum sequence length (default: 512, which is also the current upper limit). Length is counted in tokens produced by the WordPiece tokenizer.

BenSim normalizes the input texts and extracts contextual embeddings for the reference and target sentences with a pre-trained BERT model. It then scores each reference-target pair with either cosine similarity or Euclidean distance, depending on the input parameter, and returns a list of scores, one per target sentence. With 'cosine', a higher score means higher similarity; with 'euclidean', a lower score means higher similarity.
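The difference between the two scoring methods can be seen on toy vectors. The sketch below is not BenSim's code: plain Python lists stand in for BERT embeddings, and the function names `cosine_similarity` and `euclidean_distance` are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (higher = more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between a and b (lower = more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy stand-ins for sentence embeddings (assumption: real ones come from BERT).
reference = [1.0, 2.0, 3.0]
near = [1.1, 2.1, 2.9]
far = [-3.0, 0.5, 1.0]

# The nearby vector gets the higher cosine score and the lower distance.
print(cosine_similarity(reference, near), cosine_similarity(reference, far))
print(euclidean_distance(reference, near), euclidean_distance(reference, far))
```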
- Install PyTorch in your virtual environment.
- Clone this repo.
- Run the following commands to install:
$ python setup.py bdist_wheel # to build
$ pip install -e .
On first use, downloading the pre-trained BERT model to your system may take a while.
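Before the first run, it can help to confirm that both packages are visible from the active environment. The sketch below uses only the standard library; `check_installed` is a hypothetical helper, and the module names `torch` and `bensim` are taken from the installation steps and the import lines in this README.

```python
import importlib.util


def check_installed(module_names):
    """Map each top-level module name to True if it can be found, else False."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


# Both values should be True once the installation steps have succeeded.
print(check_installed(["torch", "bensim"]))
```

A `False` entry usually means the install ran in a different environment from the one running the check.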
- Sample input (single target sentence):
from bensim import similarity
score = similarity.score('তোমার সাথে দেখা হয়ে ভালো লাগলো।', 'আপনার সাথে দেখা হয়ে ভালো লাগলো।')
score
- Sample output:
array([0.83910286], dtype=float32)
- Sample input (list of target sentences):
from bensim import similarity
scores = similarity.scores('তোমার সাথে দেখা হয়ে ভালো লাগলো।', ['আপনার সাথে দেখা হয়ে ভালো লাগলো।', 'আপনার সাথে দেখা হয়ে ভালো লাগলো।'])
scores
- Sample output:
array([0.83910286, 0.83910286], dtype=float32)
To limit the maximum sequence length, set max_seq to any value up to 512. If it is not given, the default is 512.
- Sample input:
from bensim import similarity
score = similarity.score('তোমার সাথে দেখা হয়ে ভালো লাগলো।', 'আপনার সাথে দেখা হয়ে ভালো লাগলো।', max_seq=10)
score
- Sample output:
array([0.763596], dtype=float32)
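Lowering max_seq limits how many tokens of each sentence reach the model. A minimal sketch of that kind of truncation, assuming tokens beyond the limit are simply dropped: `truncate_tokens` and `MAX_SUPPORTED_LEN` are hypothetical names, the whitespace split is only a stand-in for WordPiece, and raising an error for out-of-range limits is an assumption rather than BenSim's documented behaviour.

```python
MAX_SUPPORTED_LEN = 512  # upper limit stated in this README


def truncate_tokens(tokens, max_seq=MAX_SUPPORTED_LEN):
    """Keep at most max_seq tokens; reject limits outside 1..512 (assumption)."""
    if not 1 <= max_seq <= MAX_SUPPORTED_LEN:
        raise ValueError(f"max_seq must be between 1 and {MAX_SUPPORTED_LEN}")
    return tokens[:max_seq]


# Whitespace split is only a stand-in for a real WordPiece tokenizer.
tokens = "one two three four five six".split()
print(truncate_tokens(tokens, max_seq=4))  # ['one', 'two', 'three', 'four']
print(len(truncate_tokens(tokens)))        # 6, already under the limit
```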
To score by Euclidean distance (lower means higher similarity), set similarity_method to 'euclidean'. The default is 'cosine' (higher means higher similarity).
- Sample input:
from bensim import similarity
score = similarity.score('তোমার সাথে দেখা হয়ে ভালো লাগলো।', 'আপনার সাথে দেখা হয়ে ভালো লাগলো।', similarity_method='euclidean')
score
- Sample output:
array([8.81357], dtype=float32)
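Because the two methods order results in opposite directions, code that picks the closest target has to know which method produced the scores. A small sketch, assuming the scores arrive as a sequence of numbers aligned with the target list; `rank_targets` is a hypothetical helper and not part of BenSim, and the scores in the example are made up.

```python
def rank_targets(targets, scores, similarity_method="cosine"):
    """Return (target, score) pairs ordered from most to least similar."""
    if similarity_method not in ("cosine", "euclidean"):
        raise ValueError("similarity_method must be 'cosine' or 'euclidean'")
    # Cosine: larger is closer, so sort descending.
    # Euclidean: smaller is closer, so sort ascending.
    descending = similarity_method == "cosine"
    pairs = zip(targets, (float(s) for s in scores))
    return sorted(pairs, key=lambda pair: pair[1], reverse=descending)


# Made-up scores for illustration only.
targets = ["sentence A", "sentence B", "sentence C"]
print(rank_targets(targets, [0.84, 0.42, 0.91], "cosine"))
print(rank_targets(targets, [8.8, 15.2, 3.1], "euclidean"))
```

In both calls the first pair returned is the most similar target, so downstream code can take index 0 regardless of the method.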