Feature request: bilingual dictionaries #973

Closed
chopinesque opened this issue Jun 27, 2021 · 21 comments

@chopinesque
Contributor

chopinesque commented Jun 27, 2021

Would it be possible to create subdictionaries based on EN wiktionary for other languages?
For example, German-English (here is a German word: https://en.wiktionary.org/wiki/Nacht)

@lasconic
Collaborator

Changing this line could work: https://github.com/BoboTiG/ebook-reader-dict/blob/master/wikidict/lang/en/__init__.py#L15
But I'm not sure why you would want to do so. For an EN/DE Kobo dictionary, you might want to check http://download.wikdict.com/dictionaries/kobo/
If that's not what you are looking for, please explain in more detail.

@chopinesque
Contributor Author

The German was just an example. The idea is to produce bilingual dictionaries based on the EN one for example (and not from the Translations section of the English words). For example, when it comes to Ancient Greek, there is much larger coverage in main entries rather than entries in the Translations section.

Would changing the line you mention suffice? I read the "add new local" section, but I am a little confused about how exactly to run it on a local Wiktionary dump.

@lasconic
Collaborator

I replaced the line in question with

head_sections = ("==German==", "german")

And ran (sorry, my German is very, very limited)

python -m wikidict en --gen-dict=Nacht,Kartoffel,schwarz --output=Nacht

And I got the attached file in the Nacht directory. You can try it on your Kobo and see whether the 3 words can be found and look good.
dicthtml-en.zip
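
To illustrate why swapping the heading in head_sections is enough to target another language on EN Wiktionary, here is a minimal sketch. It assumes each language on a page sits under its own level-2 heading such as ==German==. The function extract_language_section and the SAMPLE text are hypothetical and are not taken from the project's code.

```python
import re

# Stand-in for the tuple edited in wikidict/lang/en/__init__.py (value from the thread).
head_sections = ("==German==", "german")

# Made-up page text with three language sections (illustration only).
SAMPLE = """==English==
===Noun===
# An English sense.

==German==
===Noun===
# night

==Dutch==
===Noun===
# A Dutch sense.
"""


def extract_language_section(wikitext: str, heading: str) -> str:
    """Return the text under the given level-2 heading, up to the next level-2 heading."""
    collected = []
    inside = False
    for line in wikitext.splitlines():
        stripped = line.strip()
        # A level-2 heading starts with exactly two '=' signs (so ===Noun=== is skipped).
        if re.fullmatch(r"==[^=].*?==", stripped):
            inside = stripped == heading
            continue
        if inside:
            collected.append(line)
    return "\n".join(collected).strip()


section = extract_language_section(SAMPLE, head_sections[0])
print(section)
```

With the heading set to ==German==, only the lines between that heading and the next level-2 heading are kept, so the English and Dutch senses are ignored.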

@chopinesque
Contributor Author

Thank you! Sadly, I use TSV or StarDict (no Kobo).

@lasconic
Collaborator

Which language would you be most interested in?

@chopinesque
Contributor Author

Greek and Ancient Greek.

I can see the part-of-speech templates have an "el" prefix (el-adj, el-verb...) for Greek and a "grc" prefix for Ancient Greek. Does the script figure out the templates by itself, or does one need to add/fine-tune them?

@lasconic
Collaborator

I believe parts of speech are not extracted at all right now. @BoboTiG can confirm. We just use them to choose which definitions to keep.

@chopinesque
Contributor Author

Yes, that is what I meant: these templates are needed to decide which parts should be extracted and which should not :)

@lasconic
Collaborator

It seems to work without fine-tuning then.
I changed the line to:

head_sections = ("==Ancient Greek==", "ancientgreek")

and ran

python -m wikidict en --get-word="Γραῖα"

I got the following, compare with https://en.wiktionary.org/wiki/%CE%93%CF%81%CE%B1%E1%BF%96%CE%B1

Γραῖα   

A name meaning "grey", from Proto-Indo-European *ǵerh₂- (“to grow old”).


  1. Graea, Boeotia; Greece

@BoboTiG
Owner

BoboTiG commented Jun 29, 2021

Indeed, we only use the parts that matter to the language: the project was not designed for cross-language stuff.

You could play with it and see how it works. Make a copy of the lang/en folder as lang/en_grc or something like that, and tune the template handling and section names.

@chopinesque
Contributor Author

chopinesque commented Jun 29, 2021

Well, cross-language could be another possibility then, but thank you so much for all the work so far -:)
Having checked the relevant page, I am a bit at a loss as to how to run the script on a Wiktionary dump.

The Γραῖα example appears to keep the Etymology; I guess this is not normally included.

@lasconic
Collaborator

lasconic commented Jun 29, 2021

Etymology is always included in the other languages.

To run it on a dump, check out the code, install the requirements, change the line for the language, and run

python -m wikidict en 

After some time, you will get a directory with a .df file. You can convert it to StarDict with pyglossary:

pyglossary --no-progress-bar --no-color data/en/dict-en.df dict-data.ifo
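
For repeated runs, the conversion step above could be wrapped in a small script. This is only a sketch: it assumes the pyglossary executable is on PATH and reuses the flags and file names from the command above. The helper names build_pyglossary_command and convert_to_stardict are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_pyglossary_command(df_file: str, ifo_file: str) -> List[str]:
    """Build the argument list for the pyglossary call quoted in this thread."""
    return ["pyglossary", "--no-progress-bar", "--no-color", df_file, ifo_file]


def convert_to_stardict(df_file: str, ifo_file: str) -> None:
    """Run the conversion; assumes the pyglossary executable is installed and on PATH."""
    if not Path(df_file).is_file():
        raise FileNotFoundError(f"No .df file at {df_file}; run wikidict first.")
    # check=True raises CalledProcessError if pyglossary exits with a non-zero status.
    subprocess.run(build_pyglossary_command(df_file, ifo_file), check=True)


# Show the command that would be executed for the paths used in this thread.
command = build_pyglossary_command("data/en/dict-en.df", "dict-data.ifo")
print(" ".join(command))
```

Calling convert_to_stardict("data/en/dict-en.df", "dict-data.ifo") after the wikidict run would then perform the actual conversion.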

@chopinesque
Contributor Author

chopinesque commented Jun 29, 2021

But how is the path of the dump defined?
(Yes, I found this project via pyglossary -:) )

@lasconic
Collaborator

The dump will be downloaded to data/en

@chopinesque
Contributor Author

So the script downloads the dump automatically?

@lasconic
Collaborator

Yes

@lasconic
Collaborator

I just ran the first steps, and there are only 16,431 entries in Ancient Greek.

@lasconic
Collaborator

grc.zip

@chopinesque
Contributor Author

Wow, looks quite good after a quick look. Many thanks.
Some issues:

3.1 is a quotations drop-down, which is converted to <i>Q</i> <b>Od.</b>
https://en.wiktionary.org/wiki/%CE%BB%CE%B1%CE%BC%CE%B2%CE%AC%CE%BD%CF%89

@lasconic
Collaborator

The quotation block is supposed to be removed entirely.

@chopinesque
Contributor Author

I guess then there is some difference in syntax, so the current regex for that block does not match it.
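
A sketch of how such a mismatch can arise, using hypothetical patterns that are not the project's actual regexes. It assumes, purely for illustration, that some quotations use a template whose name starts with "quote-" while others use a short template named Q. Both sample strings are made up.

```python
import re

# Hypothetical narrow pattern: strips only templates whose name starts with "quote-".
QUOTE_ONLY = re.compile(r"\{\{quote-[^{}]*\}\}")

# Hypothetical broader pattern: also strips a template named exactly "Q".
QUOTE_OR_Q = re.compile(r"\{\{(?:quote-[^{}|]*|Q)\|[^{}]*\}\}")

# Made-up wikitext fragments (illustration only).
samples = [
    "{{quote-book|en|title=Example}}",
    "{{Q|grc|Hom.|Od.|1|1}}",
]

narrow = [QUOTE_ONLY.sub("", s) for s in samples]
broad = [QUOTE_OR_Q.sub("", s) for s in samples]

print(narrow)  # the Q template survives the narrow pattern
print(broad)   # both templates are removed by the broader pattern
```

If the surviving template is later rendered by a generic fallback, output resembling the fragment reported above would be one plausible result.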

BoboTiG changed the title from "[EN] Include other languages?" to "Feature request: bilingual dictionaries" on Mar 26, 2023
polar-sh bot added the Fund label on Jul 23, 2024
BoboTiG pinned this issue on Oct 25, 2024
BoboTiG closed this as not planned on Oct 25, 2024
Repository owner locked as resolved and limited conversation to collaborators Nov 19, 2024