dialect-ai/BenSim


Bengali Sentence Similarity Measurement

A Python package for measuring semantic similarity between sentences in the Bengali language. The user provides a reference sentence and one or more target sentences, and optionally the similarity method (default: cosine similarity) and the maximum sequence length (default: 512, which is also the current upper limit). Length is measured in tokens produced by the WordPiece tokenizer. BenSim normalizes the input texts and extracts contextual embeddings of the reference and target sentences from a pre-trained BERT model. It then compares the reference sentence with each target sentence using either cosine similarity or Euclidean distance, depending on the chosen method, and returns the list of scores. With cosine similarity, a higher score means the sentences are more similar; with Euclidean distance, a lower score means they are more similar.
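To make the two scoring methods concrete, here is a minimal sketch of how one reference vector can be compared against several target vectors. It is not BenSim's implementation: the function names `cosine_scores` and `euclidean_scores` are hypothetical, and the short toy vectors stand in for the BERT sentence embeddings the package would produce.

```python
import math

def cosine_scores(reference, targets):
    """Cosine similarity between the reference vector and each target vector."""
    ref_norm = math.sqrt(sum(x * x for x in reference))
    scores = []
    for target in targets:
        dot = sum(r * t for r, t in zip(reference, target))
        target_norm = math.sqrt(sum(x * x for x in target))
        scores.append(dot / (ref_norm * target_norm))
    return scores

def euclidean_scores(reference, targets):
    """Euclidean distance between the reference vector and each target vector."""
    return [
        math.sqrt(sum((r - t) ** 2 for r, t in zip(reference, target)))
        for target in targets
    ]

# Toy 2-dimensional "embeddings" (assumed values, for illustration only).
reference = [1.0, 0.0]
targets = [[2.0, 0.0], [0.0, 3.0]]

cos = cosine_scores(reference, targets)      # same direction, then orthogonal
euc = euclidean_scores(reference, targets)   # distances from the reference
print(cos, euc)
```

The first target points in the same direction as the reference, so its cosine similarity is the maximum even though its Euclidean distance is not zero. This is why the two methods can rank the same targets differently.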

Setup from clone

  1. Install PyTorch in your virtual environment.
  2. Clone this repo.
  3. Run the following commands to install:
$ python setup.py bdist_wheel # to build
$ pip install -e .

Default Usage

On first use, downloading the pre-trained BERT model to your system may take a while.

  • Sample input (single target sentence):
from bensim import similarity
score = similarity.score('তোমার সাথে দেখা হয়ে ভালো লাগলো।', 'আপনার সাথে দেখা হয়ে ভালো লাগলো।')
score
  • Sample output:
array([0.83910286], dtype=float32)
  • Sample input (list of target sentences):
from bensim import similarity
score = similarity.scores('তোমার সাথে দেখা হয়ে ভালো লাগলো।', ['আপনার সাথে দেখা হয়ে ভালো লাগলো।', 'আপনার সাথে দেখা হয়ে ভালো লাগলো।'])
score
  • Sample output:
array([0.83910286, 0.83910286], dtype=float32)

Limiting max sequence length

To limit the maximum sequence length, set max_seq to any value up to 512. If omitted, it defaults to 512.

  • Sample input:
from bensim import similarity
score = similarity.score('তোমার সাথে দেখা হয়ে ভালো লাগলো।', 'আপনার সাথে দেখা হয়ে ভালো লাগলো।', max_seq=10)
score
  • Sample output:
array([0.763596], dtype=float32)
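One plausible reading of the limit is sketched below: the token list is cut to at most max_seq entries before it reaches the model. This is an assumption, not BenSim's code. The helper `truncate_tokens` and the placeholder tokens are hypothetical, and the source does not say whether special tokens count toward the limit.

```python
MAX_SUPPORTED = 512  # upper limit stated in this README

def truncate_tokens(tokens, max_seq=MAX_SUPPORTED):
    """Keep at most max_seq tokens; reject limits outside 1..512."""
    if not 1 <= max_seq <= MAX_SUPPORTED:
        raise ValueError(f"max_seq must be between 1 and {MAX_SUPPORTED}")
    return tokens[:max_seq]

# Placeholder tokens standing in for WordPiece output (assumed, not real tokens).
tokens = [f"tok{i}" for i in range(15)]

short = truncate_tokens(tokens, max_seq=10)  # cut to the first 10 tokens
full = truncate_tokens(tokens)               # default limit leaves 15 tokens as-is
print(len(short), len(full))
```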

Computing the Euclidean distance

To measure similarity with Euclidean distance (lower means more similar), set similarity_method to 'euclidean'. The default is cosine (higher means more similar).

  • Sample input:
from bensim import similarity
score = similarity.score('তোমার সাথে দেখা হয়ে ভালো লাগলো।', 'আপনার সাথে দেখা হয়ে ভালো লাগলো।', similarity_method='euclidean')
score
  • Sample output:
array([8.81357], dtype=float32)
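Because the two methods point in opposite directions, code that ranks targets has to sort differently for each. The sketch below shows one way to do that. The helper `rank_targets` is hypothetical and not part of BenSim, and the scores are made-up numbers rather than real package output.

```python
def rank_targets(targets, scores, similarity_method="cosine"):
    """Return (target, score) pairs ordered from most to least similar."""
    if similarity_method not in ("cosine", "euclidean"):
        raise ValueError("similarity_method must be 'cosine' or 'euclidean'")
    # Cosine: larger is closer, so sort descending.
    # Euclidean: smaller is closer, so sort ascending.
    descending = similarity_method == "cosine"
    pairs = zip(targets, (float(s) for s in scores))
    return sorted(pairs, key=lambda pair: pair[1], reverse=descending)

# Illustrative inputs only (assumed values).
targets = ["sentence A", "sentence B", "sentence C"]
cosine_like = [0.84, 0.41, 0.67]
distance_like = [8.8, 15.2, 11.0]

by_cosine = rank_targets(targets, cosine_like, "cosine")
by_distance = rank_targets(targets, distance_like, "euclidean")
print(by_cosine)
print(by_distance)
```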
