nlp-proj-LPD

Hello. The goal of this project was to identify possible LPD errors (words whose letters have been transposed) in Hebrew sentences.

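For example: "היה לי לוחם ממש טוב" ("I had a really good warrior") <=> "היה לי חלום ממש טוב" ("I had a really good dream"), and "רוצה לבוא לאכול גדילה" ("want to come eat growth") <=> "רוצה לבוא לאכול גלידה" ("want to come eat ice cream").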

This project has several parts:

Preparing the data:

I downloaded the Wikipedia 2013 corpus from the MILA site (http://www.mila.cs.technion.ac.il/resources_corpora_wikipedia_2013.html) and extracted as many sentences as I could (more than 192000000 sentences). This was not simple; see prepareTheCorpus.py.

Creating the vocabulary, the bigram model, and the LPD sentences:

The next step was to create a vocabulary, a bigram model, and a set of test sentences containing LPD errors. The test sentences were not included in the training corpus.
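As a rough illustration of this step, the sketch below counts words and adjacent word pairs in a list of sentences. It is not the project's code: the function name `build_counts`, the `<s>` and `</s>` boundary markers, and whitespace tokenization are assumptions made for the example, and the two-sentence corpus is just the pair of correct sentences quoted above.

```python
from collections import Counter

def build_counts(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences.

    Hypothetical helper for illustration; "<s>" and "</s>" are assumed
    sentence-boundary markers.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        # Each bigram is a (previous word, current word) pair.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

# Toy corpus: the two correct example sentences from this README.
corpus = ["היה לי חלום ממש טוב", "רוצה לבוא לאכול גלידה"]
unigrams, bigrams = build_counts(corpus)

# The vocabulary is simply the set of observed words (without the markers).
vocabulary = set(unigrams) - {"<s>", "</s>"}
print(len(vocabulary), "words,", len(bigrams), "distinct bigrams")
```

On the real corpus the same two counters would be far larger, so storing them on disk between runs is likely worthwhile.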

Tests:

Baseline:

The baseline is simple: each word is replaced by a random rearrangement of its letters that exists in the dictionary, chosen from all such rearrangements. The results were really bad: 0%. Almost every word in the sentence was changed, and no sentence came out as the desired sentence.
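The baseline described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper names are hypothetical, and grouping dictionary words by their sorted letters is one assumed way of finding every rearrangement of a word that exists in the dictionary.

```python
import random
from collections import defaultdict

def build_anagram_index(vocabulary):
    """Group dictionary words by their sorted letters (hypothetical helper)."""
    index = defaultdict(list)
    for word in vocabulary:
        index["".join(sorted(word))].append(word)
    return index

def candidates(word, index):
    """All dictionary words with the same letters; the word itself if none."""
    return index.get("".join(sorted(word)), [word])

def random_baseline(sentence, index, rng=random):
    """Replace every word with a randomly chosen candidate."""
    return " ".join(rng.choice(candidates(w, index)) for w in sentence.split())

# Toy dictionary built from the example words in this README.
index = build_anagram_index(["חלום", "לוחם", "גלידה", "גדילה", "טוב"])
print(random_baseline("היה לי לוחם ממש טוב", index))
```

Because the choice ignores both word frequency and context, a word with several valid rearrangements is as likely to be made wrong as to be fixed.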

Unigram (in other words, the word that appeared most often):

For each word I chose the candidate that appeared most often. The results were a bit nicer: 64%.
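A minimal sketch of the unigram approach is shown below. The function names are hypothetical and the counts are invented toy numbers, not figures from the corpus; candidates are found with a simple linear scan over the vocabulary, which is an assumption chosen for brevity rather than speed.

```python
from collections import Counter

def most_frequent_candidate(word, counts):
    """Return the most frequent dictionary word with the same letters."""
    key = sorted(word)
    options = [w for w in counts if sorted(w) == key]
    # Fall back to the word itself when the dictionary has no candidate.
    return max(options, key=counts.__getitem__, default=word)

def unigram_correct(sentence, counts):
    """Correct each word independently, ignoring its neighbours."""
    return " ".join(most_frequent_candidate(w, counts) for w in sentence.split())

# Invented toy counts, for illustration only.
counts = Counter({"חלום": 40, "לוחם": 10, "גלידה": 25, "גדילה": 5})
print(unigram_correct("היה לי לוחם ממש טוב", counts))
```

Since context is ignored, the less frequent member of each pair is always replaced, even in sentences where it was the intended word, which is one plausible reason this approach falls short.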

Bigram:

For each word I chose the candidate whose probability, given the previous word and the next word, is the highest. The results in this case were really impressive: 97%.
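One way to realize this idea is sketched below. Several details are assumptions rather than the project's actual method: each candidate c is scored by P(c | previous) * P(next | c), probabilities use add-one smoothing, the sentence is processed left to right so that the previous word is the already corrected one, and the four training sentences are a toy corpus made up for the example.

```python
from collections import Counter

def train(sentences):
    """Count unigrams and bigrams; "<s>" and "</s>" mark sentence boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev) (assumed smoothing)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))

def bigram_correct(sentence, unigrams, bigrams):
    """Pick, for each word, the same-letter candidate that best fits its neighbours."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for i in range(1, len(tokens) - 1):
        key = sorted(tokens[i])
        options = [w for w in unigrams if sorted(w) == key] or [tokens[i]]
        prev_word, next_word = tokens[i - 1], tokens[i + 1]
        tokens[i] = max(
            options,
            key=lambda c: prob(prev_word, c, unigrams, bigrams)
            * prob(c, next_word, unigrams, bigrams),
        )
    return " ".join(tokens[1:-1])

# Toy training corpus, invented for illustration.
training = [
    "היה לי חלום ממש טוב",
    "הוא לוחם אמיץ",
    "רוצה לבוא לאכול גלידה",
    "גדילה מהירה",
]
unigrams, bigrams = train(training)
print(bigram_correct("היה לי לוחם ממש טוב", unigrams, bigrams))
```

Unlike the unigram approach, both members of a pair can survive here: with this toy corpus, "הוא לוחם אמיץ" is left unchanged because the neighbouring words support "לוחם".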

Try it yourself:

To try it yourself, run the bigram test model.

You can enter any sentence you want and test it.
You can choose a sentence from the test corpus.
You can select a sentence from the test corpus that has an error. (This helped me check the errors and improve the results.)

Thanks

