Using a dictionary to generate blacklists #43
Is this something we should integrate into the script, or could we use it externally to generate lists, as we do with the external word-usage script?
It will be very language-dependent (dependent on the resources available for each language and on word tokenization), so let's keep it external.
Great. Are these dictionaries more helpful for generating blacklists or whitelists? How would that work? If it's for blacklisting, we should then improve our docs to explain how to use them, when available, in combination with @dabinat's cvtools (or integrate them into those tools?)
I guess a blacklist is the lesser evil. Blacklist creation should be an independent script.
@dabinat, is this something that could potentially be integrated into your tool?
@nukeador There's no harm in adding the feature, but with more than 1 million sentences, maintaining a blacklist seems like a lot of work.
What do you propose? I don't think we need to maintain an updated one ourselves, but we should enable the tools so communities can generate their own. It has proven key to improving the quality of the final extraction.
I remember you looked at filtering little-used words before, but it filtered out too many. Perhaps combine that with a letter-sequence check and/or a word-length check so it only filters out the least pronounceable ones.
Filtering less-used words proved to be OK-ish for German and Spanish; it's a matter of deciding where to put the cut-off. But yes, a combination with other methods will definitely allow more quality sentences.
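As a rough illustration, the combined approach suggested above (a frequency cut-off plus letter-sequence and word-length checks) could be sketched like this. All thresholds, the whitespace tokenization, and the consonant-run pattern are illustrative assumptions, not project defaults:

```python
import re
from collections import Counter

# Illustrative heuristic: a run of 5+ consonants is likely unpronounceable.
ODD_SEQUENCE = re.compile(r"[bcdfghjklmnpqrstvwxz]{5,}")

def build_blacklist(sentences, min_count=5, max_len=20):
    """Collect candidate blacklist words: rare words, overly long words,
    or words containing unusual consonant runs. Whitespace splitting is a
    placeholder for whatever tokenizer the pipeline actually uses."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    blacklist = set()
    for word, count in counts.items():
        if count < min_count or len(word) > max_len or ODD_SEQUENCE.search(word):
            blacklist.add(word)
    return blacklist
```

Where to put the frequency cut-off (`min_count`) would still have to be tuned per language, as discussed above.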
There are multiple ways to come up with a blacklist. I'd like to keep things here simple and keep the concern of actually generating the blacklist out of this repo. However, I'm happy to take a PR that adds another method to the existing ones in the README.
To generate a good blacklist, a dictionary or spell-checker can be used.
We can add to the blacklist all the misspelled words (or unusual proper nouns) we find on Wikipedia.
The main problem with this approach is word tokenization. Tokenization must be applied consistently at every step of the process (see #41), including the dictionary lookup. The tokenization used to build the dictionary must be known in advance, and if it differs, some adjustment may be necessary.
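A minimal sketch of the dictionary-lookup idea, making the tokenization concern explicit: the same `tokenize` function is used for both building the word list and checking sentences. The regex tokenizer and the helper names here are assumptions for illustration, not part of the repo:

```python
import re

def tokenize(sentence):
    # The same tokenizer must be used when building the dictionary and
    # when checking sentences, or lookups will silently miss matches.
    return re.findall(r"\w+", sentence.lower())

def unknown_words(sentences, dictionary):
    """Return words not found in the dictionary: candidates for the
    blacklist (misspellings, unusual proper nouns)."""
    unknown = set()
    for sentence in sentences:
        for token in tokenize(sentence):
            if token not in dictionary:
                unknown.add(token)
    return unknown
```

In practice the `dictionary` set would come from a spell-checker word list (e.g. a hunspell dictionary expanded to plain word forms), tokenized the same way.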
I'm very skeptical that you will be able to achieve this kind of language expertise for more than a few languages. Using existing grammar and spelling checkers is a better solution.