Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

span_info on combined unicode characters #82

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

kkaiser
Copy link

@kkaiser kkaiser commented May 28, 2019

This fixes issue: #81

Lowering a sentence with combined unicode chars changes the length of a sentence.

s = 'İ love Big Apple and Bay Area.'
len(s)  # 30
len(s.lower())  # 31

Lowering keywords and search sentence now works on a per char basis to return the correct span_info

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('Big Apple', 'New York')
keyword_processor.add_keyword('Bay Area')
keyword_processor.add_keyword('İ love')
s = 'İ love Big Apple and Bay Area.'
keywords_found = keyword_processor.extract_keywords(s, span_info=True)
keywords_found
# old: [('İ love', 0, 7), ('New York', 8, 17), ('Bay Area', 22, 30)]
# new: [('İ love', 0, 6), ('New York', 7, 16), ('Bay Area', 21, 29)]
for k in keywords_found:
    print(s[k[1]:k[2]])
# new: İ love
# old: İ love
# new: Big Apple
# old: ig Apple
# new: Bay Area
# old: ay Area.

…a per char basis to return correct span_info
@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 99.327% when pulling 40633c9 on kkaiser:master into 50c45f1 on vi3k6i5:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 99.327% when pulling 40633c9 on kkaiser:master into 50c45f1 on vi3k6i5:master.

@kkaiser kkaiser mentioned this pull request Jan 31, 2020
@vi3k6i5
Copy link
Owner

vi3k6i5 commented May 3, 2020

Moving the check inside the loop ads a little of execution overhead, please share a test case for the change. Thanks

@vi3k6i5
Copy link
Owner

vi3k6i5 commented May 3, 2020

Can you please resolve the conflict.

@kkaiser
Copy link
Author

kkaiser commented May 15, 2020

Ready for review

@laurenegerton
Copy link

laurenegerton commented Nov 16, 2020

This is still happening with flashtext-2.7
Looks like the fix was never merged with master...

@spencertollefson
Copy link

@vi3k6i5 - Hi there! Is this ready for and can you complete merging this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants