
Current status of API is unclear #184

Closed
Datseris opened this issue Dec 16, 2022 · 13 comments · Fixed by #187
Labels: discussion-design (Discussion or design matters), high priority

Datseris (Member) commented Dec 16, 2022

@kahaaga I've merged main (which is #168 ) into my branch that defines the encoding API and also implements it for histograms. Things remain unclear to me, mainly because discussions were spread across many different issues and PRs and I no longer remember where the information is supposed to be.

Is it correct that all main API functions start with the estimator types? This means that we have:

  • entropy(e::Entropy, p::ProbEst, x, ...)
  • probabilities(p::ProbEst, x, ...)
  • probabilities_and_outcomes(p::ProbEst, x, ...)

But as far as I can tell, these are the only functions that follow this approach of putting the type first. The following functions take the opposite approach: input x first, then the probabilities estimator type:

  • outcome_space
  • total_outcomes
  • missing_outcomes
  • outcomes

Can you please explain the overall plan? Should we make all functions follow the first or the second argument order?
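For concreteness, here is a sketch of the two competing orderings side by side. The estimator name `ValueHistogram` is purely illustrative, not necessarily the final API:

```julia
# Hypothetical sketch of the two argument orders under discussion.
# `ValueHistogram` is an illustrative estimator name.

# Order 1: estimator type first, then the input data.
probabilities(est::ValueHistogram, x)
entropy(e::Shannon, est::ValueHistogram, x)

# Order 2: input data first, then the estimator type.
outcome_space(x, est::ValueHistogram)
total_outcomes(x, est::ValueHistogram)
```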

Datseris added the high priority and discussion-design labels on Dec 16, 2022
kahaaga (Member) commented Dec 16, 2022

My opinion: we make all functions follow the first order. That is: type first, then the argument.

Datseris (Member, Author):

OK, I'll do this in a new PR.

Datseris (Member, Author):

@kahaaga and what do we do about the encode and decode functions? They were encode(χ, e::Encoding) before. Should the order be changed?

I have to admit I find it more natural to have encode(element, encoding), because it reads "encode this value using that encoding." The other way around doesn't read naturally: "encode, using this encoding, this value."

Which also makes me think: why did we change the probabilities interface to have the type first? For the entropies I got your argument: it made extending entropies to joint entropies natural. But does this argument carry over to probabilities? Notice that the Julia style guide says "put the type first", but this is a bit misleading: it applies to cases where the type decides the return type, such as in rand(Int, x). In our case the estimator doesn't decide the return type, only the return value.

On the other hand, it would be good to have an idea of how the broader Julia community generally does this. Does the "algorithm-deciding argument" go before or after the input data? In JuliaDynamics overall it is input data first and then the algorithm-deciding argument; e.g., that's how it is done in RecurrenceAnalysis.jl or TimeseriesSurrogates.jl: surrogate(x, method). I could be wrong though, and everywhere else it might be done the other way around. Let's try to collect at least a couple of examples from other repos?
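To make the two readings concrete, a purely illustrative sketch of the two possible `encode` signatures being debated:

```julia
# Illustrative only: the two candidate argument orders for `encode`.
encode(χ, e::Encoding)   # reads: "encode this value using that encoding"
encode(e::Encoding, χ)   # reads: "using this encoding, encode this value"
```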


kahaaga (Member) commented Dec 17, 2022

Good idea on consulting with the community!

I have to admit I find it more natural to have encoding(element, encoding). Because it reads "encode this value using that encoding." The other way around it doesn't read naturally: encode, using this encoding, this value.

Well, it depends on how you phrase it. Take encode(x, method::A). There's "encode x using a method of type A", and there's "A-encode x". Both feel pretty natural to me.

I don't feel strongly about either option beyond the varargs argument, but that argument weighs pretty strongly in my opinion. I've gone through several rounds of design proposals for the CausalityTools methods without coming up with anything reasonably developer-friendly, because without it, we're back to writing a bunch of specialized methods for all input argument combinations. That is painful, error-prone, hard to document, and hard to debug.

A third option is to have the types as the first argument only for methods that will never accept varargs as the last argument. But how do we know a method won't be extended with varargs later? In some cases it is obvious that it never will, but in other cases it is not so clear.

Which also makes me think, why did we change the probabilities interface to have the type first?

I think the reasoning was just "follow the same style everywhere". But since there are no established guidelines on the topic, we could also just decide on a style guide ourselves.

Anyways, we should see what the community thinks too.

I'll try to find some examples from well-established packages in the meanwhile.

kahaaga (Member) commented Dec 17, 2022

Some examples from other packages with functionality for estimating some statistic from data (I'll fill this out as I investigate more):

StatsBase.jl

Implements different algorithms as different functions, with parameters as keyword arguments. They have knuths_sample and fisher_yates_sample, where we'd do sample!(Knuths(), args...) and sample(FisherYates(), args...).

Distributions.jl

Uses a similar type-based approach, where each distribution is its own type. They dispatch type-first, then input. For example, fit_mle(dist::D, x), where D <: Distribution. However, here the return type is also of type D, so this is not really comparable to our situation.

Distances.jl

Perhaps the best example of the same approach as ours: distance type first, then input data, and the return type is not the same as the distance type. For example, pairwise(d::Distance, args...) returns a matrix of distances.
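For reference, a minimal usage sketch of the Distances.jl convention (the data here is random and purely illustrative):

```julia
using Distances

X = rand(3, 5)  # 5 points in 3 dimensions, stored as columns

# Metric type first, then the data. The return value is a 5×5 distance
# matrix, i.e. not something of the metric's own type.
D = pairwise(Euclidean(), X; dims = 2)
```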

kahaaga (Member) commented Dec 17, 2022

Ok, a radically different suggestion, regardless of argument order: use functors instead, like we do in TimeseriesSurrogates. For motivation, see the last comment in https://discourse.julialang.org/t/how-are-functors-useful/24110/28.

What if, instead of having entropy(e::Shannon, args...), we defined EntropyShannon() and used it as a functor? In general, it would look something like:

struct EntropyShannon; params...; end
struct EntropyRenyi; params...; end
struct EntropyTsallis; params...; end
...


(e::EntropyShannon)(est, x...)
(e::EntropyRenyi)(est, x...)
(e::EntropyTsallis)(est, x...)

This would also simplify stuff in CausalityTools, where on the v2 branch one would do

struct MIShannon <: MutualInformation; params...; end
struct MITsallis <: MutualInformation; params...; end
....

mutualinfo(def::MITsallisTypeA,  measure::MITsallis, est::SomeEstimator, x, y)
mutualinfo(def::MITsallisTypeB,  measure::MITsallis, est::SomeEstimator, x, y)
....

With a functor approach, this would be the much less verbose

(mi::MITsallis)(def::MITsallisTypeA, est, x, y, ...) # seamlessly compute multi-information using varargs
(mi::MITsallis)(def::MITsallisTypeB, est, x, y, ...)

I actually like this a lot more than the current approach. Moreover, implementing generator-type structs, as desired in #70, would come very naturally.

One would have to think a bit about how to implement the multiscale interface, but I think that should be doable...
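To make the functor idea concrete, here is a minimal self-contained sketch of the pattern; the struct name, its field, and the formula are hypothetical stand-ins, not part of any package:

```julia
# Minimal sketch of the functor pattern. `EntropyShannon` and its `base`
# field are hypothetical illustrations.
struct EntropyShannon
    base::Float64
end

# Defining a call method on instances makes the struct a functor.
function (e::EntropyShannon)(probs)
    return -sum(p * log(e.base, p) for p in probs if p > 0)
end

h = EntropyShannon(2.0)
h([0.5, 0.5])  # the instance is called like a function; evaluates to 1.0
```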

Datseris (Member, Author):

It seems that the community generally leans towards type first, input afterwards. So let's do this everywhere.


I don't see any immediate benefit of functors for entropies. At the moment it appears that you have to create exactly the same number of structs and define the same number of new functions. And a user would have to learn exactly the same number of new things plus one more, the concept of functors, which I don't think is nearly as basic a concept as a function.

So why use something conceptually more complicated than functions?


mutualinfo(def::MITsallisTypeA,  measure::MITsallis, est::SomeEstimator, x, y)

Off-topic, but I think that this is too much for a user. I wouldn't define MIType at all and would just use different functions altogether. Having three different types to decide the "algorithm" used is probably a good hint that the function name is where the specialization should occur.

Datseris (Member, Author):

OK, I've now read the Discourse post you linked, which argues in favor of functors for establishing a one-argument-function interface. Hence, this would only be valid if we could achieve that. We can't: we will always have a probabilities estimator and a data input. The alternative of passing the probability estimator as a "parameter" to the entropy functor goes against what we want to achieve here on an educational level: that the entropy and the probability extraction are two genuinely different things.

kahaaga (Member) commented Dec 18, 2022

And a user would have to learn exactly the same number of new things plus one more, the concept of functors, which I don't think is nearly as basic a concept as a function.

Good point.

So why use something conceptually more complicated than functions?

The advantage would be pre-allocation for speed-ups during repeated computations (e.g., surrogate testing). But this is a deeper issue that we should address later, when all related packages have stable releases. That is, there are probably advantages to using a similar approach for entropy, probabilities, mutualinfo, transferentropy, etc. for pre-allocation and repeated computations.

It seems that the community generally leans towards type first, input afterwards. So let's do this everywhere.

Ok, fine with me! Let's do that.

OK, I've now read the Discourse post you linked, which argues in favor of functors for establishing a one-argument-function interface. Hence, this would only be valid if we could achieve that. We can't: we will always have a probabilities estimator and a data input. The alternative of passing the probability estimator as a "parameter" to the entropy functor goes against what we want to achieve here on an educational level: that the entropy and the probability extraction are two genuinely different things.

Yes, let's stick to our current approach.

kahaaga (Member) commented Dec 18, 2022

Off-topic, but I think that this is too much for a user. I wouldn't define MIType at all and would just use different functions altogether. Having three different types to decide the "algorithm" used is probably a good hint that the function name is where the specialization should occur.

I opened an issue for this in CausalityTools. See the linked issue above.

Datseris (Member, Author):

The advantage would be pre-allocation for speed ups during repeated computations (i.e. surrogate testing).

This is possible just as well with a struct. The pre-allocation happens once you initialize e = Shannon(). It doesn't matter whether you then call e as e(est, x) or entropy(e, est, x). In our surrogate package it could have worked the same way: with r = RandomPhase(); surrogenerator(r, x), r would hold the pre-initialized FT plan.
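To illustrate the point with a toy example (all names here are hypothetical, and a plain buffer stands in for something like an FFT plan):

```julia
# Toy illustration: pre-allocated state lives in the struct, regardless of
# whether it is later invoked as a functor or via a plain function.
struct RandomPhaseLike
    buffer::Vector{ComplexF64}  # stands in for a precomputed FT plan
end
RandomPhaseLike(n::Int) = RandomPhaseLike(Vector{ComplexF64}(undef, n))

# Plain-function style: the pre-allocated state is reused on every call.
function apply!(r::RandomPhaseLike, x)
    copyto!(r.buffer, x)  # work happens in the pre-allocated buffer
    return r.buffer
end

r = RandomPhaseLike(8)  # allocation happens once, here
apply!(r, rand(8))      # repeated calls reuse r.buffer
```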


Okay, I'll finish #187 tomorrow then, with the final decided-upon API: types are always the first argument, but no functors.

kahaaga (Member) commented Dec 18, 2022

This is possible just as well with a struct. The pre-allocation happens once you initialize e = Shannon(). It doesn't matter whether you then call e as e(est, x) or entropy(e, est, x). In our surrogate package it could have worked the same way: with r = RandomPhase(); surrogenerator(r, x), r would hold the pre-initialized FT plan.

Ah, yes. Let's keep this in mind for later.

Okay, I'll finish #187 tomorrow then, with the final decided-upon API: types are always the first argument, but no functors.

Sounds good.
