Merge pull request #128 from goodmami/v0.8.0

V0.8.0
goodmami · Jul 6, 2021 · 3411b03
2 parents f587a33 + ec97661
Showing 22 changed files with 1,753 additions and 287 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,27 @@
 
 ## [Unreleased]
 
+## [v0.8.0]
+
+**Release date: 2021-07-07**
+
+### Added
+
+* `wn.ic` module ([#40])
+* `wn.taxonomy` module ([#125])
+* `wn.similarity.res` Resnik similarity ([#122])
+* `wn.similarity.jcn` Jiang-Conrath similarity ([#123])
+* `wn.similarity.lin` Lin similarity ([#124])
+* `wn.util.synset_id_formatter` ([#119])
+
+### Changed
+
+* Taxonomy methods on `wn.Synset` are moved to `wn.taxonomy`, but
+  shortcut methods remain for compatibility ([#125]).
+* Similarity metrics in `wn.similarity` now raise an error when
+  synsets come from different parts of speech.
+
+
 ## [v0.7.0]
 
 **Release date: 2021-06-09**
@@ -62,6 +83,10 @@
 
 **Release date: 2021-03-04**
 
+**Notice:** This release introduces backwards-incompatible changes to
+the schema that require users upgrading from previous versions to
+rebuild their database.
+
 ### Added
 
 * For WN-LMF 1.0 support ([#65])
@@ -347,6 +372,7 @@ the https://github.com/nltk/wordnet/ code which had been effectively
 abandoned, but this is an entirely new codebase.
 
 
+[v0.8.0]: ../../releases/tag/v0.8.0
 [v0.7.0]: ../../releases/tag/v0.7.0
 [v0.6.2]: ../../releases/tag/v0.6.2
 [v0.6.1]: ../../releases/tag/v0.6.1
@@ -367,6 +393,7 @@ abandoned, but this is an entirely new codebase.
 [#17]: https://github.com/goodmami/wn/issues/17
 [#19]: https://github.com/goodmami/wn/issues/19
 [#23]: https://github.com/goodmami/wn/issues/23
+[#40]: https://github.com/goodmami/wn/issues/40
 [#46]: https://github.com/goodmami/wn/issues/46
 [#47]: https://github.com/goodmami/wn/issues/47
 [#58]: https://github.com/goodmami/wn/issues/58
@@ -406,3 +433,8 @@ abandoned, but this is an entirely new codebase.
 [#115]: https://github.com/goodmami/wn/issues/115
 [#116]: https://github.com/goodmami/wn/issues/116
 [#117]: https://github.com/goodmami/wn/issues/117
+[#119]: https://github.com/goodmami/wn/issues/119
+[#122]: https://github.com/goodmami/wn/issues/122
+[#123]: https://github.com/goodmami/wn/issues/123
+[#124]: https://github.com/goodmami/wn/issues/124
+[#125]: https://github.com/goodmami/wn/issues/125
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -5,6 +5,7 @@ Thanks for helping to make Wn better!
 **Quick Links:**
 
 - [Report a bug or request a feature](https://github.com/goodmami/wn/issues/new)
+- [Ask a question](https://github.com/goodmami/wn/discussions)
 - [View documentation](https://wn.readthedocs.io/)
 
 **Developer Information:**
@@ -14,28 +15,43 @@ Thanks for helping to make Wn better!
 - Changelog: [keep a changelog](https://keepachangelog.com/en/1.0.0/)
 - Documentation framework: [Sphinx](https://www.sphinx-doc.org/)
 - Docstring style: [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings) (via [sphinx.ext.napoleon](https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html))
-- Testing framework: [pytest](https://pytest.org/)
-- Packaging framework: [flit](https://flit.readthedocs.io/en/latest/)
-- Coding style: [PEP-8](https://www.python.org/dev/peps/pep-0008/)
+- Testing automation: [nox](https://nox.thea.codes)
+- Unit/regression testing: [pytest](https://pytest.org/)
+- Packaging framework: [Flit](https://flit.readthedocs.io/en/latest/)
+- Coding style: [PEP-8](https://www.python.org/dev/peps/pep-0008/) (via [Flake8](https://flake8.pycqa.org/))
 - Type checking: [Mypy](http://mypy-lang.org/)
 
 
 ## Get Help
 
-Confused about wordnets? See the [Global Wordnet Association
-Documentation](https://globalwordnet.github.io/gwadoc/)
+Confused about wordnets in general? See the [Global Wordnet
+Association Documentation](https://globalwordnet.github.io/gwadoc/)
 
-Having trouble with using Wn? [Raise an
+Confused about using Wn or wish to share some tips? [Start a
+discussion](https://github.com/goodmami/wn/discussions)
+
+Encountering a problem with Wn or wish to propose a new feature? [Raise an
 issue](https://github.com/goodmami/wn/issues/new)
 
+
 ## Report a Bug
 
 When reporting a bug, please provide enough information for someone to
 reproduce the problem. This might include the version of Python you're
 running, the version of Wn you have installed, the wordnet lexicons
 you have installed, and possibly the platform (Linux, Windows, macOS)
 you're on. Please give a minimal working example that illustrates the
-problem.
+problem. For example:
+
+> I'm using Wn 0.7.0 with Python 3.8 on Linux and [description of
+> problem...]. Here's what I have tried:
+>
+> ```pycon
+> >>> import wn
+> >>> # some code
+> ... # some result or error
+> ```
+
 
 ## Request a Feature
 
@@ -47,4 +63,18 @@ would address.
 
 See the "developer information" above for a brief description of
 guidelines and conventions used in Wn. If you have a fix, please
-submit a pull request to the `main` branch.
+submit a pull request to the `main` branch. In general, every pull
+request should have an associated issue.
+
+Developers should install Wn locally from source using
+[Flit](https://flit.readthedocs.io/en/latest/). Flit may be installed
+system-wide or within a virtual environment:
+
+```bash
+$ pip install flit
+$ flit install -s
+```
+
+The `-s` option tells Flit to use symbolic links to install Wn,
+similar to pip's `-e` editable installs. This allows one to edit source
+files and use the changes without having to reinstall Wn each time.
diff --git a/README.md b/README.md
@@ -19,13 +19,6 @@
 
 ---
 
-**Notice for users upgrading to v0.6:** Version v0.6.0 introduced
-changes to the database schema that require the user to rebuild their
-database. Please [raise an
-issue](https://github.com/goodmami/wn/issues/new) if you need help.
-
----
-
 Wn is a Python library for exploring information in wordnets. Install
 it from PyPI:
 

diff --git a/docs/api/wn.ic.rst b/docs/api/wn.ic.rst
@@ -0,0 +1,163 @@
+
+wn.ic
+=====
+
+.. automodule:: wn.ic
+
+The mathematical formulae for information content are defined in
+`Formal Description`_, and the corresponding Python API functions are
+described in `Calculating Information Content`_. These functions
+require information content weights obtained either by `computing them
+from a corpus <Computing Corpus Weights_>`_, or by `loading
+pre-computed weights from a file <Reading Pre-computed Information
+Content Files_>`_.
+
+.. note::
+
+   The term *information content* can be ambiguous. It often, and most
+   accurately, refers to the result of the :func:`information_content`
+   function (:math:`\text{IC}(c)` in the mathematical notation), but
+   is also sometimes used to refer to the corpus frequencies/weights
+   (:math:`\text{freq}(c)` in the mathematical notation) returned by
+   :func:`load` or :func:`compute`, as these weights are the basis of
+   the value computed by :func:`information_content`. The Wn
+   documentation tries to consistently refer to the former as the
+   *information content value*, or just *information content*, and the
+   latter as *information content weights*, or *weights*.
+
+
+Formal Description
+------------------
+
+The Information Content (IC) of a concept (synset) is a measure of its
+specificity computed from the wordnet's taxonomy structure and corpus
+frequencies. It is defined by Resnik 1995 ([RES95]_), following
+information theory, as the negative log-probability of a concept:
+
+.. math::
+
+   \text{IC}(c) = -\log{p(c)}
+
+A concept's probability is the empirical probability over a corpus:
+
+.. math::
+
+   p(c) = \frac{\text{freq}(c)}{N}
+
+Here, :math:`N` is the total count of words of the same category as
+concept :math:`c` ([RES95]_ only considered nouns) where each word has
+some representation in the wordnet, and :math:`\text{freq}` is defined
+as the sum of corpus counts of words in :math:`\text{words}(c)`, which
+is the set of words subsumed by concept :math:`c`:
+
+.. math::
+
+   \text{freq}(c) = \sum_{w \in \text{words}(c)}{\text{count}(w)}
+
+It is common for :math:`\text{freq}` to not contain actual frequencies
+but instead weights distributed evenly among the synsets for a
+word. These weights are calculated as the word frequency divided by
+the number of synsets for the word:
+
+.. math::
+
+   \text{freq}_{\text{distributed}}(c)
+   = \sum_{w \in \text{words}(c)}{\frac{\text{count}(w)}{|\text{synsets}(w)|}}
+
+.. [RES95] Resnik, Philip. "Using information content to evaluate
+   semantic similarity." In Proceedings of the 14th International
+   Joint Conference on Artificial Intelligence (IJCAI-95), Montreal,
+   Canada, pp. 448-453. 1995.
+
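The formulas above can be sketched in plain Python. This is only an illustration of the arithmetic, not Wn's implementation: the helper names and all counts below are hypothetical, and the natural logarithm is assumed.

```python
import math


def distributed_weight(count: float, n_synsets: int) -> float:
    """Split a word's corpus count evenly among its synsets."""
    return count / n_synsets


def information_content(freq: float, total: float) -> float:
    """Return IC(c) = -log p(c), where p(c) = freq(c) / N."""
    return -math.log(freq / total)


# Hypothetical numbers, for illustration only.
N = 1000.0            # total weight of all nouns in the corpus
freq_fruit = 200.0    # freq(c) for a general concept
freq_almond = 10.0    # freq(c) for a more specific concept

ic_fruit = information_content(freq_fruit, N)
ic_almond = information_content(freq_almond, N)

# A word seen 8 times with 2 synsets contributes 4.0 to each synset.
almond_share = distributed_weight(8, 2)
```

The rarer, more specific concept receives the higher information content, which matches the intuition that IC measures specificity.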
+
+Example
+-------
+
+In the Princeton WordNet, the frequency of a concept like **stone
+fruit** is not just the number of occurrences of *stone fruit*, but also
+includes the counts of the words for its hyponyms (*almond*, *olive*,
+etc.) and other taxonomic descendants (*Jordan almond*, *green olive*,
+etc.). The word *almond* has two synsets: one for the fruit or nut,
+another for the plant. Thus, if the word *almond* is encountered
+:math:`n` times in a corpus, then the weight (either the frequency
+:math:`n` or distributed weight :math:`\frac{n}{2}`) is added to the
+total weights for both synsets and to those of their ancestors, but
+not to those of descendant synsets, such as **Jordan almond**. The fruit/nut
+synset of almond has two hypernym paths which converge on **fruit**:
+
+1. **almond** ⊃ **stone fruit** ⊃ **fruit**
+2. **almond** ⊃ **nut** ⊃ **seed** ⊃ **fruit**
+
+The weight is added to each ancestor (**stone fruit**, **nut**,
+**seed**, **fruit**, ...) once. That is, the weight is added to the
+convergent ancestor **fruit** only once, not twice.
+
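The propagation described in this example can be sketched with a toy taxonomy. Everything here is hypothetical (the synset names, the `HYPERNYMS` mapping, and the `accumulate` helper), and the sketch follows the prose above rather than Wn's actual code.

```python
from collections import defaultdict

# Hypothetical toy taxonomy: each synset maps to its direct hypernyms.
HYPERNYMS = {
    "almond.nut": ["stone fruit", "nut"],
    "almond.tree": ["tree"],
    "Jordan almond": ["almond.nut"],
    "stone fruit": ["fruit"],
    "nut": ["seed"],
    "seed": ["fruit"],
    "fruit": [],
    "tree": [],
}
# Hypothetical word-to-synsets index.
SYNSETS_FOR_WORD = {"almond": ["almond.nut", "almond.tree"]}


def ancestors(synset):
    """Collect every ancestor once, even if several paths reach it."""
    seen = set()
    agenda = list(HYPERNYMS[synset])
    while agenda:
        node = agenda.pop()
        if node not in seen:
            seen.add(node)
            agenda.extend(HYPERNYMS[node])
    return seen


def accumulate(word_counts, distribute=True):
    """Add each word's weight to its synsets and to their ancestors."""
    freq = defaultdict(float)
    for word, count in word_counts.items():
        synsets = SYNSETS_FOR_WORD.get(word, [])
        if not synsets:
            continue  # words missing from the wordnet are ignored
        weight = count / len(synsets) if distribute else float(count)
        for synset in synsets:
            freq[synset] += weight
            for ancestor in ancestors(synset):
                freq[ancestor] += weight
    return dict(freq)


freq = accumulate({"almond": 8})
raw_freq = accumulate({"almond": 8}, distribute=False)
```

Because `ancestors` returns a set, **fruit** receives the weight once even though two hypernym paths converge on it, and **Jordan almond** receives nothing because it is a descendant.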
+
+Calculating Information Content
+-------------------------------
+
+.. autofunction:: information_content
+.. autofunction:: synset_probability
+
+
+Computing Corpus Weights
+------------------------
+
+If pre-computed weights are not available for a wordnet or for some
+domain, they can be computed given a corpus and a wordnet.
+
+The corpus is an iterable of words. For large corpora it may help to
+use a generator for this iterable, but the entire vocabulary (i.e.,
+unique words and counts) will be held at once in memory. Multi-word
+expressions are also possible if they exist in the wordnet. For
+instance, the Princeton WordNet has *stone fruit*, with a single space
+delimiting the words, as an entry.
+
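As a rough sketch of the point about generators, a corpus can be streamed word by word while only the vocabulary counts stay in memory. The `iter_words` helper and its whitespace tokenization are assumptions for illustration; they do not handle multi-word expressions and are not part of Wn.

```python
from collections import Counter
from typing import Iterable, Iterator


def iter_words(lines: Iterable[str]) -> Iterator[str]:
    """Yield lowercased, whitespace-delimited tokens one at a time."""
    for line in lines:
        for token in line.lower().split():
            yield token


# Any iterable of lines works here, including an open file object.
lines = ["The almond tree", "an almond and an olive"]

# Only the unique words and their counts are kept, not the whole corpus.
vocabulary = Counter(iter_words(lines))
```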
+The :class:`wn.Wordnet` object must be instantiated with a single
+lexicon, although it may have expand-lexicons for relation
+traversal. For best results, the wordnet should use a lemmatizer to
+help it deal with inflected wordforms from running text.
+
+.. autofunction:: compute
+
+
+Reading Pre-computed Information Content Files
+----------------------------------------------
+
+The :func:`load` function reads pre-computed information content
+weights files as used by the `WordNet::Similarity
+<http://wn-similarity.sourceforge.net/>`_ Perl module or the `NLTK
+<http://www.nltk.org/>`_ Python package. These files are computed for
+a specific version of a wordnet using the synset offsets from the
+`WNDB <https://wordnet.princeton.edu/documentation/wndb5wn>`_ format,
+which Wn does not use. These offsets therefore must be converted into
+an identifier that matches those used by the wordnet. By default,
+:func:`load` uses the lexicon identifier from its *wordnet* argument
+with synset offsets (padded with 0s to make 8 digits) and
+parts-of-speech from the weights file to format an identifier, such as
+``pwn-00001174-n``. For wordnets that use a different identifier
+scheme, the *get_synset_id* parameter of :func:`load` can be given a
+callable created with :func:`wn.util.synset_id_formatter`. It can also
+be given another callable with the same signature as shown below:
+
+.. code-block:: python
+
+   get_synset_id(*, offset: int, pos: str) -> str
+
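A hand-written callable matching this signature might look like the following sketch. The `pwn` prefix is hard-coded here only to reproduce the identifier shown above; a real wordnet may need a different prefix or layout.

```python
def get_synset_id(*, offset: int, pos: str) -> str:
    """Format an identifier from a WNDB offset and a part of speech.

    The offset is zero-padded to 8 digits; the "pwn" prefix is an
    assumption chosen to match the example identifier in the text.
    """
    return f"pwn-{offset:08}-{pos}"


example_id = get_synset_id(offset=1174, pos="n")
```

Both parameters are keyword-only, so a caller must pass them by name, as in `get_synset_id(offset=1174, pos="n")`.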
+.. warning::
+
+   The weights files are only valid for the version of wordnet for
+   which they were created. Files created for the Princeton WordNet
+   3.0 do not work for the Princeton WordNet 3.1 because the offsets
+   used in its identifiers are different, although the *get_synset_id*
+   parameter of :func:`load` could be given a function that performs a
+   suitable mapping. Some `Open Multilingual Wordnet
+   <https://github.com/globalwordnet/OMW>`_ wordnets use the Princeton
+   WordNet 3.0 offsets in their identifiers and can therefore
+   technically use the weights, but this usage is discouraged because
+   the distributional properties of text in another language and the
+   structure of the other wordnet will not be compatible with that of
+   the Princeton WordNet. For these cases, it is recommended to
+   compute new weights using :func:`compute`.
+
+.. autofunction:: load
diff --git a/docs/api/wn.rst b/docs/api/wn.rst
@@ -168,6 +168,7 @@ The Sense Class
    .. automethod:: frames
    .. automethod:: counts
    .. automethod:: metadata
+   .. automethod:: relations
    .. automethod:: get_related
    .. automethod:: get_related_synsets
    .. automethod:: closure
@@ -218,17 +219,38 @@ The Synset Class
    .. automethod:: hyponyms
    .. automethod:: holonyms
    .. automethod:: meronyms
-   .. automethod:: hypernym_paths
-   .. automethod:: min_depth
-   .. automethod:: max_depth
-   .. automethod:: shortest_path
-   .. automethod:: common_hypernyms
-   .. automethod:: lowest_common_hypernyms
+   .. automethod:: relations
    .. automethod:: get_related
    .. automethod:: closure
    .. automethod:: relation_paths
    .. automethod:: translate
 
+   .. The taxonomy methods below have been moved to wn.taxonomy
+
+   .. method:: hypernym_paths(simulate_root=False)
+
+      Shortcut for :func:`wn.taxonomy.hypernym_paths`.
+
+   .. method:: min_depth(simulate_root=False)
+
+      Shortcut for :func:`wn.taxonomy.min_depth`.
+
+   .. method:: max_depth(simulate_root=False)
+
+      Shortcut for :func:`wn.taxonomy.max_depth`.
+
+   .. method:: shortest_path(other, simulate_root=False)
+
+      Shortcut for :func:`wn.taxonomy.shortest_path`.
+
+   .. method:: common_hypernyms(other, simulate_root=False)
+
+      Shortcut for :func:`wn.taxonomy.common_hypernyms`.
+
+   .. method:: lowest_common_hypernyms(other, simulate_root=False)
+
+      Shortcut for :func:`wn.taxonomy.lowest_common_hypernyms`.
+
 
 The ILI Class
 -------------