Best input format to create new dictionary #356

Closed
thehijacker opened this issue Jan 24, 2022 · 10 comments

@thehijacker

Hello,

I wish to create a new dictionary for my Kobo reader. I can get around 70,000 dictionary entries in plain text or in RTF format, each entry as a separate file, or I can join them together. I would prefer RTF, as it includes formatting (bold, italic, bullets, etc.).

If needed, I could convert the RTF to another format and preserve the formatting, if that format supports it.

I kindly ask for advice on which input format would be best, so that I can join all the dictionary entries, with formatting, into one input file that pyglossary can process to create a custom dictionary for Kobo ebook readers.

Thank you very much for the help and a great library.

THJ

@ilius ilius added the Q&A label Jan 24, 2022
@ilius
Owner

ilius commented Jan 24, 2022

"Best" is somewhat subjective and depends on your needs and taste.

But you can try Dictfile:
https://pgaskin.net/dictutil/dictgen/#dictfile-format
https://github.com/ilius/pyglossary/blob/master/doc/p/kobo_dictfile.md

Though if you want to add images, you will have to embed them in your text file (as base64) which may not be convenient.

@thehijacker
Author

The Dictfile format from the documentation URL you posted looks easy enough and supports bold, italic, bullets, and even matching with other words. I have all of this, but currently in RTF, so I need to figure out how to parse it out of the RTF. I have some ideas about where to start.

As far as I can see, pyglossary can convert from "Kobo E-Reader Dictfile (.df)" to "Kobo E-Reader Dictionary (.zip)", which I can put on my Kobo reader.

@ilius
Owner

ilius commented Jan 24, 2022

There are several tools and websites that can convert RTF to HTML:
https://github.com/search?q=rtf+to+html
https://convertio.co/rtf-html/

@thehijacker
Author

Hello @ilius. Thank you very much for the tips.

Now I have all the RTF files, each dictionary entry in its own file. I think I had better write an RTF parser and parse out all the bold, italic, and bullets. That should be enough. Converting to HTML adds far too many extra tags.
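A minimal sketch of such a parser, assuming the entries are simple RTF fragments that use only `\b`, `\i`, `\par`, and brace groups. The function name `rtf_fragment_to_html` is hypothetical. It does not handle RTF headers (font tables, color tables), bullets, or `\'xx` hex escapes, all of which a real RTF file is likely to contain, so treat it as a starting point rather than a complete converter.

```python
import html
import re

# Tokens: group braces, control words (with optional numeric parameter and
# one optional trailing space), escaped characters, and runs of plain text.
TOKEN = re.compile(r"([{}])|\\([a-z]+)(-?\d+)? ?|\\(.)|([^\\{}]+)", re.S)


def rtf_fragment_to_html(rtf: str) -> str:
    """Convert a simple RTF fragment to HTML, keeping only bold and italic."""
    out = []
    state = {"b": False, "i": False}
    stack = []  # saved formatting states, one per open group
    for brace, word, param, escaped, text in TOKEN.findall(rtf):
        if brace == "{":
            stack.append(dict(state))
        elif brace == "}":
            if stack:
                state = stack.pop()
        elif word in ("b", "i"):
            # "\b" turns bold on, "\b0" turns it off (same for "\i").
            state[word] = param != "0"
        elif word == "par":
            out.append("<br/>")
        elif word:
            continue  # every other control word is ignored in this sketch
        else:
            # Raw line breaks carry no meaning in RTF, so drop them.
            piece = (escaped or text).replace("\r", "").replace("\n", "")
            if not piece:
                continue
            chunk = html.escape(piece)
            # Wrap each text run on its own so the tags always nest properly.
            if state["i"]:
                chunk = f"<i>{chunk}</i>"
            if state["b"]:
                chunk = f"<b>{chunk}</b>"
            out.append(chunk)
    return "".join(out)


print(rtf_fragment_to_html(r"{\b word} is {\i noun}\par plain"))
# prints: <b>word</b> is <i>noun</i><br/>plain
```

Because every text run is wrapped separately, the output only ever contains the tags you chose to keep, which avoids the extra markup that generic RTF-to-HTML converters produce.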

I need to find an example of how, say, the default English dictionary on Kobo is built. I could see that it can also show multiple entries for a single word on the same screen when you look up a word. For example:

word1
word2

Is there a sample Kobo df file from an actual dictionary somewhere that I could look at? The examples from the page no longer work.

https://pgaskin.net/dictutil/examples/webster1913-convert.html

Maybe someone has them and could share?

This tool converts Project Gutenberg’s Webster’s Unabridged Dictionary into a dictfile for conversion into a Kobo dictzip.

Thank you,

@ilius
Owner

ilius commented Jan 24, 2022

> Now I have all the RTF files, each dictionary entry in its own file. I think I had better write an RTF parser and parse out all the bold, italic, and bullets. That should be enough. Converting to HTML adds far too many extra tags.

You can try removing the extra tags with PyGlossary on the command line, for example by passing --remove-html=tag1,tag2,tag3.

> Is there a sample Kobo df file from an actual dictionary somewhere that I could look at? The examples from the page no longer work.

Here are two examples from the website:
https://mega.nz/folder/ks4nBQ7K#h1idvSzCv_mLriOW9EDDPg

You can also check this repo (they convert .df to StarDict with PyGlossary):
https://github.com/BoboTiG/ebook-reader-dict

@thehijacker
Author

After days of coding, I have finally managed to create a Kobo df file from all the input dictionary entries. An example of one dictionary entry in the df file:

@ entry_name
: \pronunciation\
& alias1
& alias2
<html><i>noun</i> here comes the description

I also used aliases (&) and HTML code, so that I can make specific words italic.
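A small sketch of how such an entry could be generated from parsed fields. The function name `format_df_entry` and its parameters are hypothetical, and the layout simply mirrors the example entry above; the dictgen documentation remains the reference for what the format actually allows.

```python
def format_df_entry(headword, html_body, info="", variants=()):
    """Build one dictfile entry as text, following the layout shown above."""
    lines = [f"@ {headword}"]
    if info:
        # Optional line after the headword, used above for the pronunciation.
        lines.append(f": {info}")
    # One "&" line per alias.
    lines.extend(f"& {variant}" for variant in variants)
    # The "<html>" prefix marks the definition body as HTML.
    lines.append(f"<html>{html_body}")
    return "\n".join(lines) + "\n"


entry = format_df_entry(
    "entry_name",
    "<i>noun</i> here comes the description",
    info="\\pronunciation\\",
    variants=["alias1", "alias2"],
)
print(entry)
```

Writing the entries one after another, separated by a blank line, gives a single df file like the one described in this comment.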

Now for the final step: I need to test this on my Kobo. I converted from "Kobo E-Reader Dictfile (.df)" to "Kobo E-Reader Dictionary (.zip)" and the output file looks fine. I will let you know after I do some more tests.

A few more questions about the df format. Do you think this is valid?

Two entries with the same name but different meanings, one with the number 1 and one with the number 2 in its name. How does Kobo process this?

@ name1
: \aaaa\
<html><i>noun</i> description

@ name2
: \aaaa\
<html><i>adjective</i> different description

Or this: an optional character inside parentheses. For example:

@ backward(s)
: \baekwəd(z)\
<html><i>adverb</i> description...
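One possible way to handle such headwords, sketched under the assumption that both spellings should be findable: expand the optional part into two plain forms, then use the first as the headword and the second as an alias (&). The helper name `expand_optional` is hypothetical, and nothing in this thread confirms how Kobo itself treats parentheses in headwords.

```python
import re

# One parenthesized run of letters, e.g. the "(s)" in "backward(s)".
# Digits are excluded on purpose, so "test (1)" is left untouched.
OPTIONAL = re.compile(r"(.*?)\(([^\W\d_]+)\)(.*)")


def expand_optional(headword):
    """Return the forms without and with the optional letters."""
    match = OPTIONAL.fullmatch(headword)
    if not match:
        return [headword]
    before, optional, after = match.groups()
    return [before + after, before + optional + after]


print(expand_optional("backward(s)"))
# prints: ['backward', 'backwards']
```

This sketch only expands the first parenthesized group; headwords with several optional parts would need a loop or recursion.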

Thank you!

@ilius
Owner

ilius commented Jan 27, 2022

From PyGlossary's point of view, you can even use the same headword (without adding 1 or 2) multiple times.
Though I have seen some glossaries use headwords like "test (1)" or "test (2)".

But I don't know how Kobo will process it.
You may either test it, or ask the author of dictutil.

If Kobo can render html lists, it would be my preferred form (have all definitions in the same entry).
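A rough sketch of that merging step, assuming the entries are available as (headword, HTML definition) pairs with the trailing number already stripped from the headword. The function name `merge_definitions` is hypothetical, and whether Kobo renders `<ol>` lists is exactly the open question above, so this needs to be tested on the device.

```python
def merge_definitions(entries):
    """Group definitions by headword; join repeated ones into an HTML list."""
    grouped = {}
    for headword, definition in entries:
        grouped.setdefault(headword, []).append(definition)

    merged = {}
    for headword, definitions in grouped.items():
        if len(definitions) == 1:
            merged[headword] = definitions[0]
        else:
            items = "".join(f"<li>{d}</li>" for d in definitions)
            merged[headword] = f"<ol>{items}</ol>"
    return merged


merged = merge_definitions([
    ("name", "<i>noun</i> description"),
    ("name", "<i>adjective</i> different description"),
])
print(merged["name"])
# prints: <ol><li><i>noun</i> description</li><li><i>adjective</i> different description</li></ol>
```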

@thehijacker
Author

I will also ask the author of dictutil. I could merge them somehow; I just need a suggestion for the proper way to do it :).

Thank you.

@ilius
Owner

ilius commented Jan 27, 2022

If you want to target other e-book reader users as well, you may also want to test KOReader with the StarDict format.

Some dictionary apps have a "prefix search" feature (showing "test (1)" when you look up "test") and some don't.

I was just having an interesting discussion here:
BoboTiG/ebook-reader-dict#1161

@thehijacker
Author

Hello @ilius. Since I only have a Kobo reader (my first ebook reader ever) and I love it so much, I wish to make this dictionary only for it. The df format is perfect. I have now used dictgen to convert it to a Kobo zip, as it gave me more warnings about my df structure that I needed to fix, and it is working fine on the Kobo reader. I am in touch with the author to figure out the df file structure that would work best on Kobo. I fixed the words with (x) inside the name, but I am still looking for the best way to manage word1, word2, word3, ... headwords.

I think we can close this ticket and conclude that, in my opinion, the best format for making custom dictionaries for the Kobo reader is Kobo df. It is easy to produce, as it is text-based and has a well-defined structure. You can convert it to a Kobo zip (to put in the custom-dict path) with PyGlossary or dictgen.
