rules & disallowed words for Catalan #42
Results will be better if issue #41 is fixed.
Great, thanks for this.
Thanks!
I get:
Using LanguageTool, I get 566,792 sentences properly checked, with a very low error rate.
Thanks for all the feedback!
Do not make wrong assumptions. Of the 5 proper nouns you mention (William, Gregor, Google, Burger King, Jefferson), 4 are already in the dictionary I'm using. So we are not talking about this kind of (usual) foreign names. We are talking about unusual and frequently unpronounceable proper nouns.
Don't merge this PR. I will wait.
Please, merge the pull-request now. The rules and the blacklist are OK. It is still necessary to remove duplicates, or even better to avoid them during the selection of sentences. We would like you to run the script and to upload the resulting sentences as soon as possible. New sentences are very much needed for the people making recordings.
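For reference, the duplicate removal mentioned above could be sketched as follows. This is a minimal illustration, not the extraction script's actual behavior: the function name `unique_sentences` and the normalization (collapsing whitespace and ignoring case) are assumptions made for the example.

```python
def unique_sentences(sentences):
    """Yield each sentence once, keeping the first occurrence and the input order.

    Assumption: two sentences count as duplicates when they are equal after
    collapsing runs of whitespace and ignoring case.
    """
    seen = set()
    for sentence in sentences:
        key = " ".join(sentence.split()).casefold()
        if key and key not in seen:
            seen.add(key)
            yield sentence.strip()


if __name__ == "__main__":
    for line in unique_sentences(["Bon dia.", "bon  dia.", "Adéu."]):
        print(line)
```

Filtering while iterating keeps memory proportional to the number of distinct sentences, which is closer to "avoiding duplicates during selection" than sorting the output afterwards.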
OK. For the sake of the process, since we want to expand this to a lot of languages and we need to ensure peer-review, can I ask you to point me to a place where a few samples of the output have been discussed with other native speakers and the estimated error ratio? Once we automate this process we want to have a step where we make sure enough native speakers agree with the samples. Thanks!
The sentences are here:
I know, but as we discussed, in order to support everyone and comply with legal constraints we need to run the tools ourselves; that's why we are asking for support to improve the existing one. Let me know about the feedback; we are still trying to figure out the best way to verify feedback from other communities in the future.
I read around a couple of hundred sentences browsing through the file (in no particular order).
I also looked into the sentences, trying to do a more qualitative than a quantitative analysis:
But in general, it looks good enough to import them.
I would say both colons and scientific terms are OK. Most complex terms are usually caught by the blacklist. Where did you put the limit for the word frequency to generate the list? For example, in Spanish we used 80 repetitions or less.
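To make the frequency limit concrete, here is a rough sketch of how a list of rare words could be derived from a corpus. It is only an illustration: `rare_words`, `WORD_RE`, and the tokenization are hypothetical and are not taken from the extractor; only the threshold of 80 comes from the comment above.

```python
import re
from collections import Counter

# Assumption: a word is a run of letters, optionally joined by a hyphen or by
# the Catalan middle dot (as in "col·legi"). Apostrophes split words.
WORD_RE = re.compile(r"[^\W\d_]+(?:[·-][^\W\d_]+)*")


def rare_words(sentences, max_count=80):
    """Return the words that occur max_count times or fewer in the corpus."""
    counts = Counter(
        word.casefold()
        for sentence in sentences
        for word in WORD_RE.findall(sentence)
    )
    return sorted(word for word, count in counts.items() if count <= max_count)
```

Words returned by such a function would be candidates for the disallowed list; the right threshold depends on corpus size, so 80 should be read as the Spanish value, not a universal one.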
I read around sentences and their quality is really good.
@MichaelKohler @Gregoor Once we merge this PR, can one of you own the extraction and the PR to voice-web? I'm going on PTO for a couple of weeks and I don't want to be a blocker.
I can't promise anything. Pretty busy next week due to me moving, but maybe the weekend of August 17th might be possible.
I didn't use this procedure. I built the blacklist using a dictionary. I'm now trying dictionary + word frequencies.
I should be able to find time for this 👍
OK @jaumeortola, try the word frequency approach we describe in the readme to see if results improve (to generate a few random samples of 500 sentences, you don't need the script to finish the whole process). Once you and a few other native speakers are satisfied, please comment here with the estimated error rate, how many sentences you are getting, and how you generated the blacklist (for reference). @Gregoor please merge when the previous is done, run the scraping based on these rules, and move it to voice-web. Thanks so much! :-)
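As an illustration of the review step described above, a random sample of 500 sentences and the resulting error rate could be computed along these lines. The helper names `sample_sentences` and `error_rate` are hypothetical, and the counts in the usage lines are made up for the example, not results from this thread.

```python
import random


def sample_sentences(sentences, k=500, seed=None):
    """Return up to k sentences chosen at random without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    pool = list(sentences)
    return rng.sample(pool, min(k, len(pool)))


def error_rate(num_bad, num_reviewed):
    """Fraction of reviewed sentences that reviewers marked as wrong."""
    if num_reviewed <= 0:
        raise ValueError("num_reviewed must be positive")
    return num_bad / num_reviewed


if __name__ == "__main__":
    corpus = [f"sentence {i}" for i in range(2000)]  # placeholder corpus
    sample = sample_sentences(corpus, k=500, seed=42)
    # Illustrative numbers only: suppose reviewers flag 5 of the 500.
    print(len(sample), error_rate(5, len(sample)))
```

Sharing the seed along with the sample lets several native speakers review exactly the same sentences and compare their counts.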
We are satisfied with the results now. So you can go ahead, @Gregoor. We get 449,163 sentences without duplicates. The blacklist was generated using a language dictionary. A minimum frequency of 5 was set in order to disallow some proper nouns that are present in the dictionary but very rare.
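One possible reading of the dictionary-plus-frequency procedure described above, sketched under stated assumptions: a word is disallowed if it is missing from the dictionary, or if it is a dictionary proper noun seen fewer than 5 times in the corpus. The function `build_blacklist` and the proper-noun heuristic (capitalized entry whose lowercase form is not in the dictionary) are hypothetical; the actual procedure used for Catalan is not shown in this thread.

```python
from collections import Counter


def build_blacklist(corpus_words, dictionary, min_frequency=5):
    """Return the sorted list of words to disallow.

    corpus_words: iterable of word tokens taken from the candidate sentences.
    dictionary:   set of accepted word forms (proper nouns capitalized).
    """
    counts = Counter(corpus_words)
    blacklist = set()
    for word, count in counts.items():
        in_dictionary = word in dictionary or word.lower() in dictionary
        # Assumed heuristic: a proper noun is a capitalized dictionary entry
        # with no lowercase counterpart in the dictionary.
        is_proper_noun = (
            word[:1].isupper()
            and word in dictionary
            and word.lower() not in dictionary
        )
        if not in_dictionary or (is_proper_noun and count < min_frequency):
            blacklist.add(word)
    return sorted(blacklist)
```

With this reading, common names stay allowed because they are frequent, while rare dictionary names and anything outside the dictionary are excluded.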
Just an update: It's still running on my old laptop, we really should find out why this is so slow 🙈
Thanks to everyone involved here, I'm happy to see that Catalan now has the wikisentences merged. This is also really helpful for driving this effort in other languages :D
The list of disallowed words has been created with the help of a dictionary.