Dispersion and reverse dispersion probability estimators #96

Merged: 15 commits from newapi/dispersion_estimators into main on Sep 26, 2022

Conversation

@kahaaga (Member) commented Sep 19, 2022

What is this PR?

This PR does three things:

  • Introduces Dispersion as a probability estimator (fixes DispersionEntropy should be a ProbabilitiesEstimator #82), removing the dedicated entropy estimator and replacing it with renyi_entropy(x, Dispersion(); kwargs...). See the usage sketch after this list.
  • Introduces the complexity measure reverse_dispersion (entropy). It is not included as a separate probability estimator, because the probability estimation step is identical to Dispersion, and the final quantities from which the reverse dispersion entropy is computed are a combination of distance information and probabilities, so we can't use renyi_entropy directly.
  • Moves the tests for these two estimators to a separate file, following the convention agreed upon in Refactor api #89.
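
For concreteness, here is a minimal usage sketch of the API change described above. The names Dispersion, probabilities, and reverse_dispersion follow this PR; the spelling of the Rényi entropy entry point varies in this thread (renyi_entropy vs. entropy_renyi), and defaults may differ in the merged code.

```julia
using Entropies

x = rand(1000)

# Dispersion is now a probabilities estimator, so dispersion-pattern
# probabilities are computed with the generic `probabilities` function ...
est = Dispersion()
p = probabilities(x, est)

# ... and dispersion entropy is obtained by passing the estimator to the
# generic Rényi entropy function, as described in the first bullet point.
h = renyi_entropy(x, est)

# Reverse dispersion entropy is a separate complexity measure,
# not a probabilities/entropy estimator.
h_rde = reverse_dispersion(x)
```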

References

Li, Y., Gao, X., & Wang, L. (2019). Reverse dispersion entropy: a new complexity measure for sensor signal. Sensors, 19(23), 5203.

test/dispersion.jl — outdated review thread (resolved)
codecov bot commented Sep 19, 2022

Codecov Report

Merging #96 (aea3a03) into main (2aa89a9) will increase coverage by 0.48%.
The diff coverage is 81.39%.

❗ Current head aea3a03 differs from the pull request's most recent head 9391e9f. Consider uploading reports for commit 9391e9f to get more accurate results.

@@            Coverage Diff             @@
##             main      #96      +/-   ##
==========================================
+ Coverage   71.46%   71.95%   +0.48%     
==========================================
  Files          30       31       +1     
  Lines         729      756      +27     
==========================================
+ Hits          521      544      +23     
- Misses        208      212       +4     
Impacted Files                                           Coverage Δ
src/entropies/convenience_definitions.jl                   0.00% <0.00%> (ø)
.../probabilities_estimators/dispersion/dispersion.jl     80.76% <80.76%> (ø)
.../complexity_measures/reverse_dispersion_entropy.jl    100.00% <100.00%> (ø)
src/symbolization/GaussianSymbolization.jl               100.00% <100.00%> (ø)


@kahaaga kahaaga marked this pull request as ready for review September 19, 2022 16:01
@kahaaga (Member Author) commented Sep 19, 2022

@Datseris I think this is good to go from my side. We can deal with the test cases in a separate PR (we also have an open issue, #85, to improve tests).

Note that I made some changes regarding the reverse dispersion entropy (see initial comment for the PR). It is a complexity estimator, not an entropy estimator.

@Datseris (Member)

If we want to follow good development practices, we should add tests in this PR, not in another one. A new feature should ideally be accompanied by both documentation and tests. Is there any time pressure to merge this immediately for some reason? If not, we should keep it open until we can get around to adding tests.

Especially if we want to write a paper about this, a big selling point we can have is rigorous tests, which, at least from what I've seen so far, all competing libraries are completely devoid of. So the best practice moving forward is to demand rigorous tests for all new features. From a scientific perspective this is actually even more important.

@kahaaga (Member Author) commented Sep 19, 2022

Sure, we can wait!

This PR is not extremely urgent, but I have an estimator waiting for TransferEntropy.jl that relies on this one (and which I want in the CausalityTools paper). However, why not take the opportunity to write robust, analytical tests for both? It'll probably take a few weeks or so before I get to addressing this, but it'll be worth the effort.

@kahaaga (Member Author) commented Sep 23, 2022

Hey, @Datseris

I actually had analytical tests based on the original papers in an uncommitted PR that I had completely forgotten about. I've updated this PR with these test cases (note: the example in the Rostaghi & Azami paper is not correct, but the tests here are adjusted to reflect this).

In addition to adding analytical tests, I've also:

  • Added an example to Entropy.jl examples comparing the dispersion entropy and reverse dispersion entropy for a noisy signal with a single strong impulse.
  • Updated and improved the docstring for GaussianSymbolization, so that the documentation for this symbolization procedure doesn't need to be repeated elsewhere.
  • Changed the keyword argument for GaussianSymbolization from n_categories to c, to conform with the notation in the Rostaghi & Azami and Li et al. papers (from which this symbolization approach came).
  • Outlined briefly the algorithms in the Dispersion and reverse_dispersion docstrings.
  • Clarified confusing notation in the reverse_dispersion docstring.
  • Added reproducible examples to docstrings, using jldoctest. This way, we can test the examples while developing by running doctest(Entropies) (see the snippet after this list).
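
For reference, a minimal sketch of how to run those doctests locally, assuming Documenter.jl is available in the development environment:

```julia
using Documenter, Entropies

# Runs all jldoctest blocks found in the package's docstrings.
doctest(Entropies)
```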

The doc preview is here

@Datseris Datseris self-requested a review September 23, 2022 11:53
@Datseris (Member) left a comment

I've left several comments, so I'm stopping here!

The most important comment is about the generalized interface for normalizing entropies!

Comment on lines 18 to 19
reverse_dispersion(x::AbstractVector{T}, s = GaussianSymbolization(c = 5), m = 2, τ = 1,
normalize = true)
Member:

Having so many optional arguments is inconvenient. If I want to change just normalize, I need to input all four preceding arguments. I'd suggest making everything besides x a keyword argument. I'd also recommend not using s, but rather something descriptive like symbolization.

Member Author:

These should indeed be keyword arguments. I missed the semi-colon.
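
To illustrate, a sketch of what the call site could look like with keyword arguments. The keyword name symbolization follows the review suggestion above and is not necessarily the final API:

```julia
using Entropies

x = rand(1000)

# Changing just `normalize` no longer requires spelling out the preceding arguments:
reverse_dispersion(x; normalize = false)

# Several keywords can still be overridden at once:
reverse_dispersion(x; symbolization = GaussianSymbolization(c = 4), m = 3, normalize = false)
```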

src/complexity_measures/reverse_dispersion_entropy.jl — outdated review thread (resolved)
Comment on lines 46 to 55
## Example

```jldoctest reverse_dispersion_example; setup = :(using Entropies)
julia> x = repeat([0.5, 0.7, 0.1, -1.0, 1.11, 2.22, 4.4, 0.2, 0.2, 0.1], 10);

julia> c, m = 3, 5;

julia> reverse_dispersion(x, s = GaussianSymbolization(c = c), m = m, normalize = false)
0.11372331532921814
```
Member:

I am not sure about this. How much does it really help? It just shows the function call and that it outputs a number, which is anyway clear from the first sentence of the docstring. For Julia Base functions, whose output can be deduced, these tiny snippets actually make sure the user gets what's going on; here, by contrast, a user could not deduce the 0.11... on their own. Furthermore, there is already an example in the docs.

More importantly, what are the principles that decide which methods have examples in their docstrings and which do not? Are we perhaps confusing the user? One of the principles of good documentation is that information should be where it is expected to be. So, are we creating the expectation that examples will be found in the docpage "Examples", or at the end of the docstrings?

My vote is to stick with one means of showing examples, and my vote goes to the dedicated examples page.


P.S.: The argument for using this as a means of testing is totally unconvincing to me; that's why we write tests in the test folder anyway. If this tests what is already in the test folder, why test it twice? And if it tests something that isn't in the test folder, why wasn't it in the test folder in the first place?

Member Author:

From the Documenter.jl docs:

It is recommended that as many of a package's examples as possible be runnable by Documenter's doctest. This can help to avoid documentation examples from becoming outdated, incorrect, or misleading.

One can discuss whether this advice is good or not, given precisely what you said:

that's why we write tests in the test folder anyway. If this tests what is already in the test folder, why test it twice? And if it tests something that isn't in the test folder, why wasn't it in the test folder in the first place?

There's a balance to be struck between 1) the expectation a user has of where examples are found and 2) the ease of use and/or learning curve.

I'm fine with just having the extended examples in the online documentation pages, but it is also slightly annoying (for me) to have to open the browser and read through potentially many documentation pages to find a simple runnable example. I very much like the SciPy approach, for example, where most if not all methods have a simple runnable example AND some of these methods are explored in more detail elsewhere in the docs. On the opposite end of the spectrum, you have packages like StatsBase.jl, which have very few examples except in some methods (where, funnily, for the example I looked at, they don't even use jldoctest, just a regular julia code block).

I readily agree that for functions as simple as reverse_dispersion, the runnable example is a bit redundant. The example in the doc page should be informative enough.

Comment on lines +57 to +58
entropy_dispersion(x; m = 2, τ = 1, s = GaussianSymbolization(3),
base = MathConstants.e)
Member:

In reverse_dispersion the arguments are positional; here they are keyword arguments. I'd say we stick with keyword arguments.

Member Author:

In reverse_dispersion the arguments are positional; here they are keyword arguments. I'd say we stick with keyword arguments.

Yes, definitely. As I commented above, I unintentionally missed the semi-colon.

categories for the Gaussian symbol mapping. See below for an in-depth description of
dispersion pattern construction and usage.

# Algorithm
Member:

Suggested change
# Algorithm
## Description

Comment on lines 76 to 79
Hrde_maximal = Entropies.distance_to_whitenoise(single_element_dist, n_classes, m,
normalize = false)
Hrde_maximal_norm = Entropies.distance_to_whitenoise(single_element_dist, n_classes, m,
normalize = true)
Member:

Here internal functions are also used. Perhaps this function should be documented and exported, and put into the Utils page?

Comment on lines 17 to 18
categories for the Gaussian symbol mapping. See below for an in-depth description of
dispersion pattern construction and usage.
Member:

Suggested change
categories for the Gaussian symbol mapping. See below for an in-depth description of
dispersion pattern construction and usage.
categories for the Gaussian symbol mapping.

The description follows immediately, so this sentence isn't necessary :)

Comment on lines 73 to 75
If `normalize == true`, then when used in combination with [`entropy_renyi`](@ref),
the dispersion entropy is normalized to `[0, 1]`. Normalization is only
defined when `q == 1`.
Member:

Hmmmm, here is something we can discuss extensively: the normalization process. In general, most entropies with a known total number of symbols can be normalized against the entropy of a uniform probability for each symbol. E.g., for ordinal patterns this is also known a priori.

Isn't it better for us to devise a general API for normalizing entropies in this regard, instead of doing it individually for each estimator?

Here is what I propose: define a function alphabet_length(::ProbabilitiesEstimator), or word_length, or whatever information theory people call it. It returns the total number of possible words, not just the ones that actually occur in the data. If it is not known a priori, it returns nothing instead. Then we define a function entropy_normalized that returns the Rényi entropy normalized, i.e., divided by log(alphabet_length), which normalizes against the uniform distribution.

Member Author:

Isn't it better for us to devise a general API for normalizing entropies in this regard, instead of doing it individually for each estimator?
Here is what I propose: define a function alphabet_length(::ProbabilitiesEstimator), or word_length, or whatever information theory people call it. It returns the total number of possible words, not just the ones that actually occur in the data. If it is not known a priori, it returns nothing instead. Then we define a function entropy_normalized that returns the Rényi entropy normalized, i.e., divided by log(alphabet_length), which normalizes against the uniform distribution.

I like that idea a lot. I'll implement that in a PR with all my responses to your comments here.
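
For illustration, a minimal sketch of what the proposed normalization API could look like. The names alphabet_length and entropy_normalized come from the comment above; the field accesses est.s.c and est.m and the call to entropy_renyi are assumptions for the sketch, not the final implementation:

```julia
using Entropies

# Total number of possible symbols ("words") for an estimator,
# or `nothing` if it is not known a priori.
alphabet_length(est) = nothing
alphabet_length(est::Dispersion) = est.s.c^est.m  # c^m possible dispersion patterns (assumed field names)

# Rényi entropy normalized by the entropy of the uniform distribution over
# the full alphabet, i.e. divided by log(alphabet_length).
function entropy_normalized(x, est; q = 1.0, base = MathConstants.e)
    N = alphabet_length(est)
    N === nothing && error("alphabet length is not known a priori for this estimator")
    return entropy_renyi(x, est; q, base) / log(base, N)
end
```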

``\\sum_i^{c^m} p_i = 1``, can then be estimated by counting and sum-normalising
the distribution of dispersion patterns among the embedding vectors.

```jldoctest dispersion_example; setup = :(using Entropies)
Member:

Same concern with the doctests as in the other docstring.

Comment on lines 57 to 77
!!! note
#### A clarification on notation

With ambiguous notation, Li et al. claim that

``H_{rde} = \\sum_{i = 1}^{c^m} \\left(p_i - \\dfrac{1}{{c^m}} \\right)^2 = \\sum_{i = 1}^{c^m} p_i^2 - \\frac{1}{c^m}.``

But on the right-hand side of the equality, does the constant term appear within or
outside the sum? Using that ``P`` is a probability distribution by
construction (in step 4) , we see that the constant must appear *outside* the sum:

```math
\\begin{aligned}
H_{rde} &= \\sum_{i = 1}^{c^m} \\left(p_i - \\dfrac{1}{{c^m}} \\right)^2
= \\sum_{i=1}^{c^m} p_i^2 - \\frac{2p_i}{c^m} + \\frac{1}{c^{2m}} \\\\
&= \\left( \\sum_{i=1}^{c^m} p_i^2 \\right) - \\left(\\sum_i^{c^m} \\frac{2p_i}{c^m}\\right) + \\left( \\sum_{i=1}^{c^m} \\dfrac{1}{{c^{2m}}} \\right) \\\\
&= \\left( \\sum_{i=1}^{c^m} p_i^2 \\right) - \\left(\\frac{2}{c^m} \\sum_{i=1}^{c^m} p_i \\right) + \\dfrac{c^m}{c^{2m}} \\\\
&= \\left( \\sum_{i=1}^{c^m} p_i^2 \\right) - \\frac{2}{c^m} (1) + \\dfrac{1}{c^{m}} \\\\
&= \\left( \\sum_{i=1}^{c^m} p_i^2 \\right) - \\dfrac{1}{c^{m}}.
\\end{aligned}
```
Member:

I'm not sure if there is any point in writing this huge block of text. Why don't we just type the correct formula in the Description section and be done with it? Furthermore, do we even need to discuss this at all? I don't think it matters for the software that this equation holds, and here we are documenting the software. But in any case, I'd say just add the correct formula after a second equality in the equation in the Description if you want to keep it.

@kahaaga (Member Author) commented Sep 23, 2022

I don't think it matters for the software that this equation holds, and here we are documenting the software. But in any case, I'd say just add the correct formula after a second equality in the equation in the Description if you want to keep it.

I think it does matter for the software, because the ambiguous notation caused me to spend a lot of extra time during development and testing (without even realizing this was a potential issue) figuring out whether there was a typo in their definition or whether they had made a computational mistake. The same could matter to anyone who wants to use the formula to verify something by hand.

But in any case, I'd say just add the correct formula after a second equality in the equation in the Description if you want to keep it.

That's a good compromise.
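
To make the algebra concrete, here is a quick numerical check of the identity derived in the block quoted above, in plain Julia (no package code involved):

```julia
c, m = 3, 2
p = rand(c^m); p ./= sum(p)  # a random probability distribution over the c^m dispersion patterns

lhs = sum((pᵢ - 1/c^m)^2 for pᵢ in p)  # Σᵢ (pᵢ - 1/cᵐ)²
rhs = sum(pᵢ^2 for pᵢ in p) - 1/c^m    # Σᵢ pᵢ² - 1/cᵐ, with the constant outside the sum

lhs ≈ rhs  # true
```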

* Address review comments

* Throw error

* Remove non-used field

* Fix tests

* Merge docs and move individual methods to their respective files

* Tsallis reduces to Shannon entropy for q -> 1.

* Normalized entropy API, including utility functions

* Analytical test cases for Tsallis

* Include `k` in `maxentropy_tsallis`
@Datseris Datseris merged commit 78f555c into main Sep 26, 2022
@Datseris Datseris deleted the newapi/dispersion_estimators branch September 26, 2022 17:04

Successfully merging this pull request may close these issues.

DispersionEntropy should be a ProbabilitiesEstimator