[ENH] Histogram distribution #335
Closed
Changes from 6 commits (31 commits in total)
7af0ff3 [ENH] Histogram distribution (ShreeshaM07)
e676a1e bin_width when list (ShreeshaM07)
b401a8f parameterization different (ShreeshaM07)
cdd0e64 pdf using nnp.where (ShreeshaM07)
181e5b9 cdf implemented (ShreeshaM07)
1e70a02 mean and var implemented (ShreeshaM07)
cc8c030 ppf implementation (ShreeshaM07)
3f205e6 Tuple input for bins (ShreeshaM07)
79c4a6c params2 modified (ShreeshaM07)
e266710 Rectified mean and variance using E[X] and E[(X-mu)^2] (ShreeshaM07)
9ccbcac energy when x is outside the possible X (ShreeshaM07)
46af6c8 energy_x outside corrected (ShreeshaM07)
52dd77b energy_x for when x is insede the bins range (ShreeshaM07)
6dce11c Primitive array distribution init rework (ShreeshaM07)
796bf8e solved test_pdf and test_ppf cases next `_shuffle_distr` and `subsett… (ShreeshaM07)
5ae4754 introduced single arr distr along with pre-existing 2D arr dist (ShreeshaM07)
bc4dec9 plot() made to work and resolved test_sample and some other failing CIs (ShreeshaM07)
a75dcf8 mean and var for single arr distr (ShreeshaM07)
30ff70f ppf corrected when P is in 1st Bin (ShreeshaM07)
8357ac5 BaseArrayDistribution inherits BaseDistribution (ShreeshaM07)
bbbfd5f 0 values in bin_mass caught (ShreeshaM07)
87af806 0 val (ShreeshaM07)
ad83f2e solved subsetting issue now shuffle_distr and loc again for true shuffle (ShreeshaM07)
0101fdc test_ppf when shuffled modified & shuffle distr made to work for arra… (ShreeshaM07)
21ab3f4 plot() works now (ShreeshaM07)
86bf070 energy_x implemented (ShreeshaM07)
41a02f1 np.floating (ShreeshaM07)
3589f2f energy_self (ShreeshaM07)
17f9836 merged skpro changed files (ShreeshaM07)
e665322 removed distributions.rst and init.py (ShreeshaM07)
a92722c Revert "merged skpro changed files" (ShreeshaM07)
@@ -0,0 +1,145 @@
# copyright: skpro developers, BSD-3-Clause License (see LICENSE file)
"""Histogram distribution."""

__author__ = ["ShreeshaM07"]

import numpy as np

from skpro.distributions.base import BaseDistribution


class Histogram(BaseDistribution):
    """Histogram Probability Distribution.

    The histogram probability distribution is parameterized
    by the bins and bin densities.

    Parameters
    ----------
    bins : float or array of float 1D
        array has the bin boundaries with 1st element the first bin's
        starting point and rest are the bin ending points of all bins
    bin_mass: array of float 1D
        Mass of the bins or Area of the bins.
        Sum of all the bin_mass must be 1.
    index : pd.Index, optional, default = RangeIndex
    columns : pd.Index, optional, default = RangeIndex
    """

    def __init__(self, bins, bin_mass, index=None, columns=None):
        self.bins = bins
        self.bin_mass = bin_mass

        super().__init__(index=index, columns=columns)

    def _mean(self):
        """Return expected value of the distribution.

        Returns
        -------
        float, sum(bin_mass)/range(bins)
            expected value of distribution (entry-wise)
        """
        bins = self.bins
        # 1 is the cumulative sum of all bin_mass
        return 1 / (max(bins) - min(bins))

    def _var(self):
        r"""Return element/entry-wise variance of the distribution.

        Returns
        -------
        2D np.ndarray, same shape as ``self``
            variance of the distribution (entry-wise)
        """
        bins = self.bins
        bin_mass = self.bin_mass
        bin_width = np.diff(bins)
        mean = self._mean()
        var = np.sum((bin_mass / bin_width - mean) * bin_width) / (
            max(bins) - min(bins)
        )
        return var

    def _pdf(self, x):
        """Probability density function.

        Parameters
        ----------
        x : 1D np.ndarray, same shape as ``self``
            values to evaluate the pdf at

        Returns
        -------
        1D np.ndarray, same shape as ``self``
            pdf values at the given points
        """
        bin_mass = np.array(self.bin_mass.copy())
        bins = self.bins
        pdf = []
        if isinstance(bins, list):
main comment, this looks correct, but it is quite inefficient due to the use of loops. I would strongly advise to use […]. For instance, for bin widths, use […]. To find the bin in which the x-value falls, you could use […].
            bin_width = np.diff(bins)
            pdf_arr = bin_mass / bin_width
            for X in x:
                if len(np.where(X < bins)[0]) and len(np.where(X >= bins)[0]):
                    pdf.append(pdf_arr[min(np.where(X < bins)[0]) - 1])
                else:
                    pdf.append(0)
            pdf = np.array(pdf)
        return pdf

    def _cdf(self, x):
        """Cumulative distribution function.

        Parameters
        ----------
        x : 1D np.ndarray, same shape as ``self``
            values to evaluate the cdf at

        Returns
        -------
        1D np.ndarray, same shape as ``self``
            cdf values at the given points
        """
        bins = self.bins
        bin_mass = self.bin_mass
        cdf = []
        pdf = self._pdf(x)
        if isinstance(bins, list):
            for X in x:
                # cum_bin_index is an array of all indices
                # of the bins or bin edges that are less than X.
                cum_bin_index = np.where(X >= bins)[0]
                X_index_in_x = np.where(X == x)
                if len(cum_bin_index) == len(bins):
                    cdf.append(1)
                elif len(cum_bin_index) > 1:
                    cdf.append(
                        np.cumsum(bin_mass)[-2]
                        + pdf[X_index_in_x][0] * (X - bins[cum_bin_index[-1]])
                    )
                elif len(cum_bin_index) == 0:
                    cdf.append(0)
                elif len(cum_bin_index) == 1:
                    cdf.append(pdf[X_index_in_x][0] * (X - bins[cum_bin_index[-1]]))
            cdf = np.array(cdf)
        return cdf


# import pandas as pd

# x = np.array([100, 1, 0.75, 1.8, 2.5, 3, 5, 6, 6.5, 0])
# hist = Histogram(
#     bins=[0.5, 2, 7],
#     bin_mass=[0.3, 0.7],
#     index=pd.Index(np.arange(3)),
#     columns=pd.Index(np.arange(2)),
# )
# pdf = hist._pdf(x)
# print(pdf)
# cdf = hist._cdf(x)
# print(cdf)
# mean = hist._mean()
# print(mean)
# var = hist._var()
# print(var)
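The inline review comment on `_pdf` asks for the Python loops to be replaced. Below is a minimal sketch of one loop-free way to do this, assuming `bins` is a strictly increasing 1D sequence of edges and `bin_mass` has one entry per bin. The helper names `hist_pdf` and `hist_cdf` are hypothetical and not part of skpro, and the choice of `np.searchsorted` and `np.interp` is an assumption, not necessarily what the reviewer had in mind.

```python
import numpy as np


def hist_pdf(x, bins, bin_mass):
    """Piecewise-constant histogram density, evaluated without Python loops."""
    x = np.asarray(x, dtype=float)
    bins = np.asarray(bins, dtype=float)
    bin_mass = np.asarray(bin_mass, dtype=float)
    density = bin_mass / np.diff(bins)  # height of each bin
    # side="right" returns idx such that bins[idx - 1] <= x < bins[idx]
    idx = np.searchsorted(bins, x, side="right")
    inside = (idx >= 1) & (idx <= len(bin_mass))
    # clip only keeps the lookup in range; points outside the bins are zeroed
    safe = np.clip(idx - 1, 0, len(bin_mass) - 1)
    return np.where(inside, density[safe], 0.0)


def hist_cdf(x, bins, bin_mass):
    """Piecewise-linear histogram cdf, evaluated without Python loops."""
    x = np.asarray(x, dtype=float)
    bins = np.asarray(bins, dtype=float)
    bin_mass = np.asarray(bin_mass, dtype=float)
    # cdf values at the bin edges: 0 at bins[0], sum(bin_mass) at bins[-1]
    cdf_at_edges = np.concatenate(([0.0], np.cumsum(bin_mass)))
    # the cdf is linear inside each bin; np.interp clamps outside the edges
    return np.interp(x, bins, cdf_at_edges)


x = np.array([100, 1, 0.75, 1.8, 2.5, 3, 5, 6, 6.5, 0])
print(hist_pdf(x, [0.5, 2, 7], [0.3, 0.7]))
print(hist_cdf(x, [0.5, 2, 7], [0.3, 0.7]))
```

The cdf sketch accumulates the mass of every bin to the left of `x`, whereas the loop version in the diff always adds `np.cumsum(bin_mass)[-2]`, which only coincides for the two-bin example.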
that's not correct? Also, you need to be careful about the different cases of `bins` being int or iterable.
I will take care of the `bins` cases. But in the case where `bins` has the bin edges, shouldn't the mean be `sum(bin_width*bin_height)/sum(bin_width)`? The numerator is basically the area under the histogram, which is `1`, and the sum of bin_width would be the range of the `bins` values, thus `max(bins) - min(bins)`. Is that incorrect?
Could you please review whether what I have considered for the `mean` and `var` is correct, or do I have to use $E[X] = \mu = \int_{-\infty}^{\infty} x \cdot \mathrm{pdf}(x)\,dx$ across all the different pdfs for the different bins?
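The question can be checked numerically. The sketch below takes the example values from the commented-out snippet at the end of the diff, evaluates the proposed `1 / (max(bins) - min(bins))`, and compares it with a midpoint-rule approximation of the integral of `x * pdf(x)`. The grid size is an arbitrary choice for illustration.

```python
import numpy as np

# Example values from the commented-out snippet at the end of the diff.
bins = np.array([0.5, 2.0, 7.0])
bin_mass = np.array([0.3, 0.7])

# Formula proposed in the PR: total mass (1) divided by the range of the bins.
proposed = 1.0 / (bins.max() - bins.min())

# E[X] as a midpoint Riemann sum of x * pdf(x), one bin at a time.
density = bin_mass / np.diff(bins)
numeric_mean = 0.0
for lo, hi, d in zip(bins[:-1], bins[1:], density):
    edges = np.linspace(lo, hi, 10_001)
    mids = 0.5 * (edges[:-1] + edges[1:])
    numeric_mean += np.sum(mids * d * np.diff(edges))

print(proposed)      # about 0.154, which is below min(bins) = 0.5
print(numeric_mean)  # about 3.525
```

The mean of a distribution supported on `[0.5, 7]` has to lie inside that interval, so the proposed value cannot be the mean. `sum(bin_width*bin_height)/sum(bin_width)` is the average density height over the bins, not the expected value of `X`.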
I think your formula is simply incorrect.
The correct formula for the mean is:
Let $b_0, \dots, b_n$ be the bin boundaries, and $m_i, i = 1, \dots, n$ the mass in the bin $[b_{i-1}, b_i]$.
The mean of the histogram distribution is then
$$\mu = \sum_{i=1}^{n} m_i \cdot \frac{b_{i-1} + b_i}{2},$$
which you can obtain by applying `np.dot` and a shifted sum.
(This is obtained if you substitute pdf(x) into your formula and carry out the integration correctly.)
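A minimal sketch of the `np.dot` and shifted-sum idea, assuming `bins` holds the $n + 1$ edges and `bin_mass` the $n$ masses: the mean is the mass-weighted sum of the bin midpoints. The function name `hist_mean` is hypothetical and not part of the PR.

```python
import numpy as np


def hist_mean(bins, bin_mass):
    """Mean of a histogram distribution as a mass-weighted sum of bin midpoints."""
    bins = np.asarray(bins, dtype=float)
    bin_mass = np.asarray(bin_mass, dtype=float)
    # shifted sum: left edges plus right edges, halved, gives the midpoints
    midpoints = 0.5 * (bins[:-1] + bins[1:])
    return np.dot(bin_mass, midpoints)


print(hist_mean([0.5, 2, 7], [0.3, 0.7]))  # about 3.525
```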
For "easy" computation of the mean and variance, you can use that the histogram distribution is the same as the two-step conditional where you first sample which bin you are in, with probabilities $m_i$, and then sample from the uniform distribution within that bin.
Use the conditional formulae for mean and variance on this idea. This also shows why the mean has the above form, as the weighted mean of the means of uniform distributions on the bin intervals.
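The two-step view can be sketched as follows, assuming the conditional formulae meant here are the law of total expectation, $E[X] = E[E[X \mid \text{bin}]]$, and the law of total variance, $\mathrm{Var}[X] = E[\mathrm{Var}[X \mid \text{bin}]] + \mathrm{Var}[E[X \mid \text{bin}]]$. A uniform distribution on $[a, b]$ has mean $(a + b)/2$ and variance $(b - a)^2/12$. The function name `hist_mean_var` is hypothetical.

```python
import numpy as np


def hist_mean_var(bins, bin_mass):
    """Mean and variance of a histogram distribution via conditioning on the bin."""
    bins = np.asarray(bins, dtype=float)
    bin_mass = np.asarray(bin_mass, dtype=float)
    width = np.diff(bins)
    bin_means = 0.5 * (bins[:-1] + bins[1:])  # mean of the uniform on each bin
    bin_vars = width**2 / 12.0                # variance of the uniform on each bin
    mean = np.dot(bin_mass, bin_means)        # E[E[X | bin]]
    within = np.dot(bin_mass, bin_vars)       # E[Var[X | bin]]
    between = np.dot(bin_mass, (bin_means - mean) ** 2)  # Var[E[X | bin]]
    return mean, within + between


# Two equal-mass bins of equal width form a uniform distribution on [0, 2].
print(hist_mean_var([0, 1, 2], [0.5, 0.5]))  # mean 1, variance about 1/3
```

For the example in the diff (`bins=[0.5, 2, 7]`, `bin_mass=[0.3, 0.7]`) this gives a mean of about 3.525 and a variance of about 3.733.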
Oh yes that is correct, I've made a mistake I will correct it now. Thanks for the help.