Best input format to create new dictionary #356

thehijacker · 2022-01-24T07:03:44Z

Hello,

I wish to create a new dictionary for my Kobo reader. I can get around 70,000 dictionary entries in plain text or in RTF format, each entry as a separate file, or I can join them together. I would prefer RTF, as it includes formatting (bold, italic, bullets, etc.).

If needed, I could convert RTF to another format and preserve the formatting, if that format supports it.

I kindly ask for advice on which input format would be best, so that I can join all the dictionary entries, with their formatting, into one input file that PyGlossary can process to create a custom dictionary for Kobo e-book readers.

Thank you very much for the help and a great library.

THJ

ilius · 2022-01-24T10:20:16Z

"Best" is somewhat subjective and depends on your needs or taste.

But you can try Dictfile:
https://pgaskin.net/dictutil/dictgen/#dictfile-format
https://github.com/ilius/pyglossary/blob/master/doc/p/kobo_dictfile.md

Though if you want to add images, you will have to embed them in your text file (as base64), which may not be convenient.

thehijacker · 2022-01-24T12:25:10Z

The Dictfile format from the documentation URL you posted looks easy enough, and it supports bold, italic, bullets and even matching with other words. I have all of this, but currently in RTF. I need to figure out how to parse it all from RTF; I have some ideas on where to start.

As far as I can see, PyGlossary can convert from "Kobo E-Reader Dictfile (.df)" to "Kobo E-Reader Dictionary (.zip)", which I can put on my Kobo reader.

ilius · 2022-01-24T15:01:53Z

There are several tools and websites that can convert RTF to HTML:
https://github.com/search?q=rtf+to+html
https://convertio.co/rtf-html/

thehijacker · 2022-01-24T21:15:55Z

Hello @ilius. Thank you very much for the tips.

Now I have all the RTF files, each dictionary entry in its own file. I think I had better write an RTF parser that handles bold, italic and bullets; that should be enough. Converting to HTML adds far too many extra tags.
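For illustration only, here is a minimal sketch of what such a parser could look like in Python. It is an assumption about one possible approach, not the code that was actually written for this dictionary. It handles only `\b`, `\i`, `\plain`, `\par`, `\'hh` escapes and a few header groups; bullets, `\uN` Unicode escapes and other code pages are left out, and the function name `rtf_to_html` and the cp1252 decoding are hypothetical choices.

```python
import html
import re

# One RTF token: control word, \'hh byte, control symbol, brace, or plain text.
TOKEN = re.compile(
    r"\\(?P<word>[a-zA-Z]+)(?P<param>-?\d+)? ?"  # e.g. \b, \b0, \par
    r"|\\'(?P<hex>[0-9a-fA-F]{2})"               # e.g. \'e9
    r"|\\(?P<symbol>[^a-zA-Z])"                  # e.g. \{ or \*
    r"|(?P<brace>[{}])"
    r"|(?P<text>[^\\{}]+)"
)
# Header groups whose text is not part of the entry body (assumed list).
SKIPPED_GROUPS = {"fonttbl", "colortbl", "stylesheet", "info"}


def rtf_to_html(rtf):
    """Convert a small subset of RTF to HTML using only <b>, <i>, <br>."""
    out = []
    stack = []                       # saved (bold, italic, skip) per group
    bold = italic = skip = False
    open_bold = open_italic = False  # tags currently open in the output

    def emit(fragment):
        # Close and reopen tags only when the formatting state has changed.
        nonlocal open_bold, open_italic
        if (open_bold, open_italic) != (bold, italic):
            if open_italic:
                out.append("</i>")
            if open_bold:
                out.append("</b>")
            if bold:
                out.append("<b>")
            if italic:
                out.append("<i>")
            open_bold, open_italic = bold, italic
        if fragment:
            out.append(fragment)

    for m in TOKEN.finditer(rtf):
        if m["brace"] == "{":
            stack.append((bold, italic, skip))
        elif m["brace"] == "}":
            if stack:
                bold, italic, skip = stack.pop()
        elif m["word"]:
            word, param = m["word"], m["param"]
            if word in SKIPPED_GROUPS:
                skip = True
            elif word == "b":
                bold = param != "0"
            elif word == "i":
                italic = param != "0"
            elif word == "plain":
                bold = italic = False
            elif word == "par" and not skip:
                out.append("<br>")
        elif m["hex"]:
            if not skip:
                char = bytes.fromhex(m["hex"]).decode("cp1252", errors="replace")
                emit(html.escape(char, quote=False))
        elif m["symbol"]:
            if m["symbol"] == "*":
                skip = True          # {\*\destination ...}: ignore the group
            elif m["symbol"] in "\\{}" and not skip:
                emit(m["symbol"])
        elif m["text"] and not skip:
            text = m["text"].replace("\r", "").replace("\n", "")
            if text:
                emit(html.escape(text, quote=False))

    bold = italic = False
    emit("")                         # close any tags that are still open
    return "".join(out)
```

Because only `<b>`, `<i>` and `<br>` are ever emitted, the output stays free of the extra tags that generic RTF-to-HTML converters tend to add.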

I need to find an example of how, for instance, the default English dictionary on Kobo is built. I could see that it can also show multiple entries for a single word on the same screen when you look up a word. For example:

word1
word2

Is there a sample Kobo df file of an actual dictionary somewhere that I could look at? The examples from this page are no longer working:

https://pgaskin.net/dictutil/examples/webster1913-convert.html

Maybe someone has them and could share?

This tool converts Project Gutenberg’s Webster’s Unabridged Dictionary into a dictfile for conversion into a Kobo dictzip.

Thank you,

ilius · 2022-01-24T23:38:08Z

> Now I have all the RTF files, each dictionary entry in its own file. I think I had better write an RTF parser that handles bold, italic and bullets; that should be enough. Converting to HTML adds far too many extra tags.

You can try removing extra tags with PyGlossary on the command line, for example by passing --remove-html=tag1,tag2,tag3.

> Is there a sample Kobo df file of an actual dictionary somewhere that I could look at? The examples from the page are no longer working.

Here are 2 examples from the website:
https://mega.nz/folder/ks4nBQ7K#h1idvSzCv_mLriOW9EDDPg

You can also check this repo (they convert .df to StarDict with PyGlossary):
https://github.com/BoboTiG/ebook-reader-dict

thehijacker · 2022-01-27T12:42:40Z

After days of coding I have finally managed to create a Kobo df file from all the input dictionary entries. An example of one dictionary entry in the df file:

@ entry_name
: \pronunciation\
& alias1
& alias2
<html><i>noun</i> here comes the description

I also used aliases (&) and HTML code, so I can make specific words italic.
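As a rough sketch, an entry with this layout could be generated with a small Python helper like the one below. The function name `format_df_entry` and its parameters are hypothetical, and it covers only the four line types shown in the example above (`@`, `:`, `&` and an `<html>` body), not the whole dictfile format.

```python
def format_df_entry(headword, body_html, pronunciation=None, aliases=()):
    """Build one dictfile entry in the layout used in this thread."""
    lines = [f"@ {headword}"]
    if pronunciation:
        # The thread's example wraps the pronunciation in backslashes.
        lines.append(f": \\{pronunciation}\\")
    lines.extend(f"& {alias}" for alias in aliases)
    # Collapse whitespace so the body stays on one line and can never
    # start a new line with "@ ", which would begin another entry.
    lines.append("<html>" + " ".join(body_html.split()))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_df_entry(
        "entry_name",
        "<i>noun</i> here comes the description",
        pronunciation="pronunciation",
        aliases=["alias1", "alias2"],
    ))
```

Writing the returned strings one after another, separated by a blank line, gives a file in the same shape as the example entry.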

Now for the final step: I need to test this on my Kobo. I converted from "Kobo E-Reader Dictfile (.df)" to "Kobo E-Reader Dictionary (.zip)" and the output file looks fine. I will let you know after I do some more tests.

A few more questions on the df format. Do you think this is valid?

Two entries with the same name but different meanings, one with the number 1 and one with the number 2 in the name. How does Kobo process this?

@ name1
: \aaaa\
<html><i>noun</i> description

@ name2
: \aaaa\
<html><i>adjective</i> different description

Or this: an optional character inside parentheses. For example:

@ backward(s)
: \baekwəd(z)\
<html><i>adverb</i> description...
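One way to sidestep the question, sketched here purely as an assumption (the thread does not say how it was eventually solved), is to expand the optional part into two plain spellings, using the short one as the headword and the long one as an alias. The helper name `expand_optional` is hypothetical, and it handles a single parenthetical containing letters only.

```python
import re

# A parenthetical that directly follows a word character and contains only
# letters, e.g. "backward(s)". Headwords such as "test (1)" do not match.
OPTIONAL_PART = re.compile(r"(.*\w)\(([^\W\d_]+)\)(.*)")


def expand_optional(headword):
    """Return the spellings covered by a headword like 'backward(s)'."""
    match = OPTIONAL_PART.fullmatch(headword)
    if not match:
        return [headword]
    before, optional, after = match.groups()
    return [before + after, before + optional + after]


if __name__ == "__main__":
    headword, *aliases = expand_optional("backward(s)")
    print("@", headword)
    for alias in aliases:
        print("&", alias)
```

With this, `backward(s)` becomes the headword `backward` plus the alias `backwards`.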

Thank you!

ilius · 2022-01-27T13:14:26Z

From PyGlossary's point of view, you can even use the same headword (without adding 1 or 2) multiple times,
though I have seen some glossaries use headwords like "test (1)" or "test (2)".

But I don't know how Kobo will process it.
You may either test it, or ask the author of dictutil.

If Kobo can render HTML lists, that would be my preferred form (having all definitions in the same entry).
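A minimal sketch of that merging step in Python, assuming the entries are available as (headword, HTML definition) pairs and that numbered homographs are written as name1, name2, and so on. The function name `merge_homographs` is hypothetical, and whether Kobo renders `<ol>` lists still has to be tested on the device.

```python
import re


def merge_homographs(entries):
    """Merge numbered headwords (name1, name2, ...) into one entry each.

    `entries` is an iterable of (headword, html_definition) pairs. A trailing
    number is stripped only when at least two entries share the same base,
    so a lone headword such as "mp3" is left untouched.
    """
    groups = {}
    for headword, definition in entries:
        base = re.sub(r"\d+$", "", headword) or headword
        groups.setdefault(base, []).append((headword, definition))

    merged = []
    for base, members in groups.items():
        if len(members) == 1:
            merged.append(members[0])
        else:
            items = "".join(f"<li>{definition}</li>" for _, definition in members)
            merged.append((base, f"<ol>{items}</ol>"))
    return merged


if __name__ == "__main__":
    sample = [
        ("name1", "<i>noun</i> description"),
        ("name2", "<i>adjective</i> different description"),
        ("mp3", "an audio format"),
    ]
    for headword, body in merge_homographs(sample):
        print(headword, "->", body)
```

Stripping digits this way is a heuristic: a word that genuinely ends in a number and shares its base with another headword would be merged by mistake, so the result is worth reviewing by hand.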

thehijacker · 2022-01-27T13:23:12Z

I will also ask the author of dictutil. I could merge them somehow; I just need a suggestion on what the proper way is :).

Thank you.

ilius · 2022-01-27T21:53:19Z

If you want to target other e-book reader users as well, you may also want to test KOReader with the StarDict format.

Some dictionary apps have a "prefix search" feature (showing "test (1)" when you look up "test") and some don't.

I was just having an interesting discussion here:
BoboTiG/ebook-reader-dict#1161

thehijacker · 2022-01-30T08:05:17Z

Hello @ilius. Since I only have a Kobo reader (my first e-book reader ever) and I love it so much, I wish to make this dictionary only for it. The df format is perfect. I have now used dictgen to convert it to a Kobo zip, as it gave me more warnings about my df structure that I needed to fix, and it is working fine on the Kobo reader. I am in touch with the author to figure out the proper df file structure that would work best on Kobo. I fixed the words with (x) inside the name, but I am still looking for the best way to manage word1, word2, word3, ... headwords.

I think we can close this ticket and conclude that, in my opinion, the best format for making custom dictionaries for the Kobo reader is Kobo df. It is easy to produce, as it is text-based and has a well-defined structure. You can convert it to a Kobo zip (to put in the custom-dict path) with PyGlossary or dictgen.

ilius added the Q&A label Jan 24, 2022

thehijacker mentioned this issue Jan 27, 2022

Mutiple headwords as seperated entries? pgaskin/dictutil#19

Closed

ilius closed this as completed Feb 5, 2022

sricochet mentioned this issue Jul 9, 2024

possibility for special formatting in dictionaries? #573

Closed
