[ENH] Histogram distribution #335

Closed
Changes from 6 commits
Commits (31)
7af0ff3
[ENH] Histogram distribution
ShreeshaM07 May 16, 2024
e676a1e
bin_width when list
ShreeshaM07 May 16, 2024
b401a8f
parameterization different
ShreeshaM07 May 17, 2024
cdd0e64
pdf using nnp.where
ShreeshaM07 May 19, 2024
181e5b9
cdf implemented
ShreeshaM07 May 19, 2024
1e70a02
mean and var implemented
ShreeshaM07 May 19, 2024
cc8c030
ppf implementation
ShreeshaM07 May 20, 2024
3f205e6
Tuple input for bins
ShreeshaM07 May 21, 2024
79c4a6c
params2 modified
ShreeshaM07 May 21, 2024
e266710
Rectified mean and variance using E[X] and E[(X-mu)^2]
ShreeshaM07 May 21, 2024
9ccbcac
energy when x is outside the possible X
ShreeshaM07 May 21, 2024
46af6c8
energy_x outside corrected
ShreeshaM07 May 24, 2024
52dd77b
energy_x for when x is insede the bins range
ShreeshaM07 May 24, 2024
6dce11c
Primitive array distribution init rework
ShreeshaM07 Jun 1, 2024
796bf8e
solved test_pdf and test_ppf cases next `_shuffle_distr` and `subsett…
ShreeshaM07 Jun 4, 2024
5ae4754
introduced single arr distr along with pre-existing 2D arr dist
ShreeshaM07 Jun 6, 2024
bc4dec9
plot() made to work and resolved test_sample and some other failing CIs
ShreeshaM07 Jun 7, 2024
a75dcf8
mean and var for single arr distr
ShreeshaM07 Jun 7, 2024
30ff70f
ppf corrected when P is in 1st Bin
ShreeshaM07 Jun 7, 2024
8357ac5
BaseArrayDistribution inherits BaseDistribution
ShreeshaM07 Jun 7, 2024
bbbfd5f
0 values in bin_mass caught
ShreeshaM07 Jun 7, 2024
87af806
0 val
ShreeshaM07 Jun 7, 2024
ad83f2e
solved subsetting issue now shuffle_distr and loc again for true shuffle
ShreeshaM07 Jun 9, 2024
0101fdc
test_ppf when shuffled modified & shuffle distr made to work for arra…
ShreeshaM07 Jun 10, 2024
21ab3f4
plot() works now
ShreeshaM07 Jun 10, 2024
86bf070
energy_x implemented
ShreeshaM07 Jun 10, 2024
41a02f1
np.floating
ShreeshaM07 Jun 10, 2024
3589f2f
energy_self
ShreeshaM07 Jun 10, 2024
17f9836
merged skpro changed files
ShreeshaM07 Jun 10, 2024
e665322
removed distributions.rst and init.py
ShreeshaM07 Jun 10, 2024
a92722c
Revert "merged skpro changed files"
ShreeshaM07 Jun 10, 2024
145 changes: 145 additions & 0 deletions skpro/distributions/histogram.py
@@ -0,0 +1,145 @@
# copyright: skpro developers, BSD-3-Clause License (see LICENSE file)
"""Histogram distribution."""

__author__ = ["ShreeshaM07"]

import numpy as np

from skpro.distributions.base import BaseDistribution


class Histogram(BaseDistribution):
    """Histogram Probability Distribution.

    The histogram probability distribution is parameterized
    by the bins and bin densities.

    Parameters
    ----------
    bins : float or array of float 1D
        array has the bin boundaries with 1st element the first bin's
        starting point and rest are the bin ending points of all bins
    bin_mass: array of float 1D
        Mass of the bins or Area of the bins.
        Sum of all the bin_mass must be 1.
    index : pd.Index, optional, default = RangeIndex
    columns : pd.Index, optional, default = RangeIndex
    """

    def __init__(self, bins, bin_mass, index=None, columns=None):
        self.bins = bins
        self.bin_mass = bin_mass

        super().__init__(index=index, columns=columns)

    def _mean(self):
        """Return expected value of the distribution.

        Returns
        -------
        float, sum(bin_mass)/range(bins)
            expected value of distribution (entry-wise)
        """
        bins = self.bins
        # 1 is the cumulative sum of all bin_mass
        return 1 / (max(bins) - min(bins))
Collaborator

That's not correct? Also, you need to be careful about the different cases of `bins` being an int or an iterable.

Contributor Author

I will take care of the `bins` cases. But in the case where `bins` holds the bin edges, shouldn't this be the mean? I used mean = sum(bin_width * bin_height) / sum(bin_width): the numerator is the area under the histogram, which is 1, and the sum of the bin widths is the range of the bin values, max(bins) - min(bins). Is that incorrect?

Contributor Author

ShreeshaM07 May 20, 2024

Could you please review whether what I have used for the mean and var is correct, or do I have to use $E[X] = \mu = \int_{-\infty}^{\infty} x \cdot \mathrm{pdf}(x)\,dx$ across all the different pdfs for the different bins?

Collaborator

fkiraly May 21, 2024

I think your formula is simply incorrect.
The correct formula for mean is:

Let $b_0, \dots, b_n$ be the bin boundaries, and $m_i, i = 1,\dots, n$ the mass in the bin $[b_{i-1}, b_i]$.

The mean of the histogram distribution is then

$$\mu =\frac{1}{2} \sum_{i=1}^n (b_i + b_{i-1})\cdot m_i$$

which you can obtain by applying np.dot and a shifted sum.

(this is obtained if you substitute pdf(x) into your formula and carry out the integration correctly)
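
A minimal sketch of the formula above, not code from this PR. The function name `histogram_mean` is hypothetical, and `bins` is assumed to be a 1D sequence of sorted bin edges with `bin_mass` summing to 1. It also shows that `1 / (max(bins) - min(bins))` gives a different value on the example from the commented-out test script.

```python
import numpy as np


def histogram_mean(bins, bin_mass):
    """Mean of a piecewise-uniform (histogram) distribution.

    Hypothetical helper: ``bins`` are the n + 1 sorted bin edges,
    ``bin_mass`` are the n bin masses (assumed to sum to 1).
    """
    bins = np.asarray(bins, dtype=float)
    bin_mass = np.asarray(bin_mass, dtype=float)
    # shifted sum: (b_{i-1} + b_i) / 2 is the midpoint of bin i
    midpoints = 0.5 * (bins[1:] + bins[:-1])
    # weighted mean of the bin midpoints, weights are the bin masses
    return float(np.dot(midpoints, bin_mass))


bins = [0.5, 2, 7]
bin_mass = [0.3, 0.7]

mu = histogram_mean(bins, bin_mass)
# value returned by the formula currently in the diff, for comparison
mu_diff = 1 / (max(bins) - min(bins))
print(mu, mu_diff)
```

Here the midpoints are 1.25 and 4.5, so the weighted mean is 0.3 * 1.25 + 0.7 * 4.5 = 3.525, whereas the expression in the diff evaluates to 1 / 6.5.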

Collaborator

fkiraly May 21, 2024

for "easy" computation of the mean and variance, you can use that the histogram distribution is the same as the two-step conditional where you first sample which bin you are in, with probabilities $m_i$, and then from the uniform within the bin.

Use the conditional formulae for mean and variance on this idea - this also shows why the mean has the above form, as the weighted mean of means of uniform distributions on the bin intervals.
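
One way to spell out the conditional idea, as a sketch rather than the PR's implementation. With bin widths $w_i = b_i - b_{i-1}$ and midpoints $c_i = (b_{i-1} + b_i)/2$, the law of total variance gives

$$\operatorname{Var}(X) = \sum_{i=1}^n m_i \frac{w_i^2}{12} + \sum_{i=1}^n m_i (c_i - \mu)^2,$$

where the first term is the expected within-bin (uniform) variance and the second is the variance of the bin means. The function name `histogram_var` is hypothetical; `bins` is assumed to be a 1D sequence of sorted edges and `bin_mass` to sum to 1.

```python
import numpy as np


def histogram_var(bins, bin_mass):
    """Variance of a histogram distribution via the law of total variance.

    Hypothetical helper, assuming sorted 1D ``bins`` and ``bin_mass``
    summing to 1.
    """
    bins = np.asarray(bins, dtype=float)
    bin_mass = np.asarray(bin_mass, dtype=float)
    widths = np.diff(bins)
    midpoints = 0.5 * (bins[1:] + bins[:-1])
    mean = np.dot(midpoints, bin_mass)
    # E[Var(X | bin)]: a uniform on a bin of width w has variance w^2 / 12
    within = np.dot(bin_mass, widths**2 / 12.0)
    # Var(E[X | bin]): spread of the bin midpoints around the overall mean
    between = np.dot(bin_mass, (midpoints - mean) ** 2)
    return float(within + between)


# sanity check: two equal-mass bins on [0, 2] form a uniform on [0, 2]
var_uniform = histogram_var([0, 1, 2], [0.5, 0.5])
print(var_uniform)
```

The sanity check uses the fact that a uniform distribution on $[0, 2]$ has variance $2^2 / 12 = 1/3$.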

Contributor Author

Oh yes, that is correct. I made a mistake and will correct it now. Thanks for the help.


    def _var(self):
        r"""Return element/entry-wise variance of the distribution.

        Returns
        -------
        2D np.ndarray, same shape as ``self``
            variance of the distribution (entry-wise)
        """
        bins = self.bins
        bin_mass = self.bin_mass
        bin_width = np.diff(bins)
        mean = self._mean()
        var = np.sum((bin_mass / bin_width - mean) * bin_width) / (
            max(bins) - min(bins)
        )
        return var

    def _pdf(self, x):
        """Probability density function.

        Parameters
        ----------
        x : 1D np.ndarray, same shape as ``self``
            values to evaluate the pdf at

        Returns
        -------
        1D np.ndarray, same shape as ``self``
            pdf values at the given points
        """
        bin_mass = np.array(self.bin_mass.copy())
        bins = self.bins
        pdf = []
        if isinstance(bins, list):
Collaborator

Main comment: this looks correct, but it is quite inefficient due to the use of loops.

I would strongly advise using numpy methods for everything.

For instance, for bin widths, use `np.diff`.

To find the bin in which the x-value falls, you could use cumsum and np.where with >.
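
A loop-free sketch of the lookup, under stated assumptions. It swaps in `np.searchsorted` for the bin search instead of the `np.where` comparison suggested above; both locate the bin, and `np.searchsorted` is simply one convenient choice. The function name `histogram_pdf` is hypothetical, and `bins` is assumed to be a sorted 1D sequence of edges.

```python
import numpy as np


def histogram_pdf(x, bins, bin_mass):
    """Vectorized pdf of a histogram distribution (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    bins = np.asarray(bins, dtype=float)
    bin_mass = np.asarray(bin_mass, dtype=float)
    # density (bar height) of each bin
    density = bin_mass / np.diff(bins)
    # index i such that bins[i] <= x < bins[i + 1];
    # -1 below the first edge, len(density) at or above the last edge
    idx = np.searchsorted(bins, x, side="right") - 1
    inside = (idx >= 0) & (idx < len(density))
    # clip only to keep the fancy index valid; outside points get 0
    safe_idx = np.clip(idx, 0, len(density) - 1)
    return np.where(inside, density[safe_idx], 0.0)


x = np.array([100, 1, 0.75, 1.8, 2.5, 3, 5, 6, 6.5, 0])
pdf = histogram_pdf(x, bins=[0.5, 2, 7], bin_mass=[0.3, 0.7])
print(pdf)
```

The half-open convention (left edge included, right edge excluded) matches the loop below, which treats a point as inside only if it is `>=` some edge and `<` some edge.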

            bin_width = np.diff(bins)
            pdf_arr = bin_mass / bin_width
            for X in x:
                if len(np.where(X < bins)[0]) and len(np.where(X >= bins)[0]):
                    pdf.append(pdf_arr[min(np.where(X < bins)[0]) - 1])
                else:
                    pdf.append(0)
            pdf = np.array(pdf)
        return pdf

    def _cdf(self, x):
        """Cumulative distribution function.

        Parameters
        ----------
        x : 1D np.ndarray, same shape as ``self``
            values to evaluate the cdf at

        Returns
        -------
        1D np.ndarray, same shape as ``self``
            cdf values at the given points
        """
        bins = self.bins
        bin_mass = self.bin_mass
        cdf = []
        pdf = self._pdf(x)
        if isinstance(bins, list):
            for X in x:
                # cum_bin_index is an array of all indices
                # of the bins or bin edges that are less than X.
                cum_bin_index = np.where(X >= bins)[0]
                X_index_in_x = np.where(X == x)
                if len(cum_bin_index) == len(bins):
                    cdf.append(1)
                elif len(cum_bin_index) > 1:
                    cdf.append(
                        np.cumsum(bin_mass)[-2]
                        + pdf[X_index_in_x][0] * (X - bins[cum_bin_index[-1]])
                    )
                elif len(cum_bin_index) == 0:
                    cdf.append(0)
                elif len(cum_bin_index) == 1:
                    cdf.append(pdf[X_index_in_x][0] * (X - bins[cum_bin_index[-1]]))
            cdf = np.array(cdf)
        return cdf


# import pandas as pd

# x = np.array([100, 1, 0.75, 1.8, 2.5, 3, 5, 6, 6.5, 0])
# hist = Histogram(
# bins=[0.5, 2, 7],
# bin_mass=[0.3, 0.7],
# index=pd.Index(np.arange(3)),
# columns=pd.Index(np.arange(2)),
# )
# pdf = hist._pdf(x)
# print(pdf)
# cdf = hist._cdf(x)
# print(cdf)
# mean = hist._mean()
# print(mean)
# var = hist._var()
# print(var)
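
The loop-based `_cdf` in this diff can be vectorized in the same spirit as the review suggests for `_pdf`. The sketch below is an illustration under stated assumptions, not the PR's code: the name `histogram_cdf` is hypothetical and `bins` is assumed to be a sorted 1D sequence of edges. Because the cdf of a histogram distribution is piecewise linear between bin edges, `np.interp` over the cumulative masses reproduces it, and its default clamping gives 0 below the first edge and the total mass above the last. This form also does not depend on `np.cumsum(bin_mass)[-2]`, which appears to pick the right offset only when there are two bins.

```python
import numpy as np


def histogram_cdf(x, bins, bin_mass):
    """Vectorized cdf of a histogram distribution (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    bins = np.asarray(bins, dtype=float)
    # cdf value at each bin edge: 0 at the first edge, then running mass
    cum_mass = np.concatenate(([0.0], np.cumsum(bin_mass)))
    # linear interpolation between edges; clamps to the end values outside
    return np.interp(x, bins, cum_mass)


x = np.array([100, 1, 0.75, 1.8, 2.5, 3, 5, 6, 6.5, 0])
cdf = histogram_cdf(x, bins=[0.5, 2, 7], bin_mass=[0.3, 0.7])
print(cdf)
```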