
added a first run of RU locale, added some POS in the definitions output #1271

Merged
merged 24 commits into from
May 11, 2022

Conversation

victornove
Contributor

- added RU locale, though only a few templates, as I had memory issues with find-templates
- modified some DE (flattened definitions with ordered lists) - can remove if needed?
- modified definitions to include POS (checked for EN, FR, DE, RU) - had to modify DE to pull POS from a different section into the definition - can also remove if needed?

@BoboTiG
Owner

BoboTiG commented Mar 31, 2022

That's awesome!

I would be more comfortable with several PRs targeting specific changes. When you modify existing code, it will need tests :).
Let's tackle only the new locale here.

Also, I am not sure about the other changes, mostly because I do not see a before/after visual; it would help a lot here. Let's open new PRs and discuss them there.

👍

@victornove
Contributor Author

Hi again,

yes, I kind of expected to have to split the problems :D

I removed everything linked to POS, DE, and the subsection "flattening".

I'm entirely clueless about the testing part (I can read through some docs if needed?). Otherwise, if I manage to get the list of templates (find-templates gets killed by my machine or throws memory errors) and start working through them, I can add more to the template handler.

Also, I was wondering if using the expanded templates via `&templates=expand` isn't sometimes easier? For example, I can't find a way to get the pronunciation in Russian without it.

PS: for the visuals, you mean visualising the output difference? Something like this?

Before:

```
salut \sa.ly\ m.

Du latin salus, salūtem (« santé, salut »).

  1. Fait d’être sauvé de la mort, d’un danger, d’échapper à une situation désagréable.
    ...
  2. (Informel) (Familier) Bonjour, bonsoir, à quelqu’un que l’on tutoie.
```

After:

```
salut \sa.ly\ m.

Du latin salus, salūtem (« santé, salut »).

  1. nom: Fait d’être sauvé de la mort, d’un danger, d’échapper à une situation désagréable.
    ...
  2. interjection: (Informel) (Familier) Bonjour, bonsoir, à quelqu’un que l’on tutoie.
```

@BoboTiG
Owner

BoboTiG commented Mar 31, 2022

> I'm entirely clueless about the testing part (I can read through some docs if needed?). Otherwise, if I manage to get the list of templates (find-templates gets killed by my machine or throws memory errors) and start working through them, I can add more to the template handler.

Testing is easy when you've done it once ;)

Let's say you want to test a word: pick that word on the Wiktionary, and from the "edit-mode" tab, copy the Wikicode. Then, in the folder tests/data/ru/, create a file named <word>.wiki and paste the Wikicode inside it.
Now that you have the Wiktionary word file, let's add the real test. You could copy any other test file; you seem to work in France, so maybe you are fluent in French, and it will be easier to copy-paste tests/test_fr.py as tests/test_ru.py. Anyway, here is the minimal skeleton for such a test file:

```python
import pytest

from wikidict.render import parse_word
from wikidict.utils import process_templates


@pytest.mark.parametrize(
    "word, pronunciations, gender, etymology, definitions",
    [
        # Each and every word must contain all those details; a detail can be
        # an empty list or string when the information is not available
        (
            "<word>",  # must match (case-sensitively) "tests/data/ru/<word>.wiki"
            ["pronunciation1", ...],
            "gender",
            ["etymology"],
            ["definition1", ...],
        ),
    ],
)
def test_parse_word(word, pronunciations, gender, etymology, definitions, page):
    """Test the sections finder and definitions getter."""
    code = page(word, "ru")
    details = parse_word(word, code, "ru", force=True)
    assert pronunciations == details.pronunciations
    assert gender == details.gender
    assert definitions == details.definitions
    assert etymology == details.etymology
```

This is a complete test for a given word. You can add as many words as files you create in tests/data/ru/*.wiki. Those tests are important to validate the parsing of Russian words. You should add words that cover all parsed sections; if you find one word covering all sections, that's great too!

Then, maybe not now, but when you work on adding templates, you will have to create lighter tests that only cover the template transformation. Here is the skeleton:

```python
@pytest.mark.parametrize(
    "wikicode, expected",
    [
        # ("minimal wikicode template", "expected parsed result")
        ("{{1|Descendant}}", "Descendant"),
        ("{{1er}}", "1<sup>er</sup>"),
        ("{{1er|mai}}", "1<sup>er</sup>&nbsp;mai"),
    ],
)
def test_process_templates(wikicode, expected):
    """Test templates handling."""
    assert process_templates("foo", wikicode, "ru") == expected
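To give a feel for what a template transformation does under the hood, here is a standalone, hypothetical sketch; the `TEMPLATES` table and `render_template` helper below are invented for illustration and are not wikidict's actual API (the real handlers live under wikidict/lang/ and have a different shape):

```python
# Hypothetical, simplified template renderer -- NOT wikidict's actual machinery.
TEMPLATES = {
    # {{1er}} -> "1<sup>er</sup>", {{1er|mai}} -> "1<sup>er</sup>&nbsp;mai"
    "1er": lambda args: "1<sup>er</sup>" + (f"&nbsp;{args[0]}" if args else ""),
}


def render_template(wikicode: str) -> str:
    """Render a single {{template|arg|...}} string (illustrative only)."""
    inner = wikicode.strip().removeprefix("{{").removesuffix("}}")
    name, *args = inner.split("|")
    handler = TEMPLATES.get(name)
    if handler:
        return handler(args)
    # Fallback: keep the last argument, as many simple templates do
    return args[-1] if args else name
```

With such a table, `render_template("{{1er|mai}}")` yields `1<sup>er</sup>&nbsp;mai` and unknown templates like `{{1|Descendant}}` fall back to their last argument, which mirrors the test cases in the skeleton above.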

@BoboTiG
Owner

BoboTiG commented Mar 31, 2022

Could you open an issue about the find-templates error (with the exact command and the error) so that we can investigate?

> PS: for the visuals, you mean visualising the output difference? Something like this?

Yes, exactly :)

> Also, I was wondering if using the expanded templates via `&templates=expand` isn't sometimes easier? For example, I can't find a way to get the pronunciation in Russian without it.

@lasconic do you have an opinion?

@lasconic
Collaborator

> Also, I was wondering if using the expanded templates via `&templates=expand` isn't sometimes easier? For example, I can't find a way to get the pronunciation in Russian without it.
>
> @lasconic do you have an opinion?

If I understand correctly, this is not an option, since it would work only with get-word and not with the whole dump, where templates are not "expanded". Right?
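For reference, "expanding templates" here means asking the MediaWiki API to render templates server-side via its `expandtemplates` action. A minimal sketch of building such a request URL (the `{{placeholder|…}}` template name is invented for the example, and wikidict's actual fetching code may use different parameters):

```python
from urllib.parse import urlencode


def expandtemplates_url(domain: str, wikicode: str) -> str:
    """Build a MediaWiki `expandtemplates` API URL (sketch only).

    Note: this only helps for single-word lookups (get-word); the
    offline dump ships raw, unexpanded wikicode.
    """
    params = {
        "action": "expandtemplates",
        "format": "json",
        "prop": "wikitext",
        "text": wikicode,
    }
    return f"https://{domain}/w/api.php?{urlencode(params)}"


url = expandtemplates_url("ru.wiktionary.org", "{{placeholder|пример}}")
```

Calling this URL per word would mean one HTTP request per entry, which is exactly why it cannot replace local template handling for the full dump.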

@lasconic
Collaborator

Regarding POS, let's discuss it somewhere else. Maybe here : #1149

@victornove
Contributor Author

The problem with find-templates was simply my Linux machine running out of memory. I split the JSON file and worked through it piece by piece... not sure it's worth opening a ticket for that? Because of this, it turns out there are 18,000 templates to work through, though!

Also, I'm kind of struggling with the details of getting my tests working, but I think (pray) this last correction will work! Once I get the first test working, I can add some more.

Owner

@BoboTiG BoboTiG left a comment


To fix quality code issues, just run ./check.sh locally, and push changes.

@victornove
Contributor Author

Ok, I finally installed pytest & its dependencies... it makes iterations much quicker!
(Is there a way to run unit tests selectively? Like only for the Russian locale/changed files?)

I guess before releasing the locale we would still need to solve the etymology/pronunciation templates issue, and ideally add a few more tests and template handlers? Anything else?

@BoboTiG
Owner

BoboTiG commented Apr 12, 2022

> (Is there a way to run unit tests selectively? Like only for the Russian locale/changed files?)

```shell
python -m pytest tests/test_ru.py
```

@sourcery-ai
Contributor

sourcery-ai bot commented May 11, 2022

Sourcery Code Quality Report

❌  Merging this PR will decrease code quality in the affected files by 0.73%.

| Quality metrics | Before | After | Change |
| --- | --- | --- | --- |
| Complexity | 30.13 😞 | 30.89 😞 | 0.76 👎 |
| Method Length | 94.40 🙂 | 96.15 🙂 | 1.75 👎 |
| Working memory | 11.26 😞 | 11.36 😞 | 0.10 👎 |
| Quality | 44.85% 😞 | 44.12% 😞 | -0.73% 👎 |

| Other metrics | Before | After | Change |
| --- | --- | --- | --- |
| Lines | 612 | 622 | 10 |

| Changed files | Quality Before | Quality After | Quality Change |
| --- | --- | --- | --- |
| wikidict/render.py | 40.40% 😞 | 39.71% 😞 | -0.69% 👎 |
| wikidict/lang/__init__.py | 81.64% ⭐ | 81.25% ⭐ | -0.39% 👎 |

Here are some functions in these files that still need a tune-up:

| File | Function | Complexity | Length | Working Memory | Quality | Recommendation |
| --- | --- | --- | --- | --- | --- | --- |
| wikidict/render.py | find_etymology | 49 ⛔ | 339 ⛔ | 16 ⛔ | 13.15% ⛔ | Refactor to reduce nesting. Try splitting into smaller methods. Extract out complex expressions |
| wikidict/render.py | parse_word | 57 ⛔ | 324 ⛔ | 15 😞 | 13.40% ⛔ | Refactor to reduce nesting. Try splitting into smaller methods. Extract out complex expressions |
| wikidict/render.py | find_section_definitions | 70 ⛔ | 248 ⛔ | 14 😞 | 16.42% ⛔ | Refactor to reduce nesting. Try splitting into smaller methods. Extract out complex expressions |
| wikidict/render.py | find_all_sections | 16 🙂 | 169 😞 | 12 😞 | 42.52% 😞 | Try splitting into smaller methods. Extract out complex expressions |

Legend and Explanation

The emojis denote the absolute quality of the code:

  • ⭐ excellent
  • 🙂 good
  • 😞 poor
  • ⛔ very poor

The 👍 and 👎 indicate whether the quality has improved or gotten worse with this pull request.


Please see our documentation here for details on how these metrics are calculated.

We are actively working on this report - lots more documentation and extra metrics to come!

Help us improve this quality report!

@BoboTiG BoboTiG merged commit da2b5d1 into BoboTiG:master May 11, 2022
@BoboTiG
Owner

BoboTiG commented May 11, 2022

@all-contributors please add @victornove for code

@BoboTiG
Owner

BoboTiG commented May 11, 2022

Thanks a lot @victornove 🍾

@allcontributors
Contributor

@BoboTiG

I've put up a pull request to add @victornove! 🎉

@BoboTiG BoboTiG mentioned this pull request May 11, 2022