-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature Request: Replace scipy.stats.gaussian_kde
with KDEpy.FFTKDE
in plotly.figure_factory.create_distplot
#4985
Comments
Thanks for the suggestion @HistoryBeginsAtSumer - I'll add it to our backlog, but we are very reluctant to introduce more dependencies, so I doubt we will make the change any time soon. Thanks - @gvwilson |
@gvwilson Thank you for your response. If introducing new dependencies is worrisome in any capacity, I propose that from scipy.stats import gaussian_kde
from numpy.typing import ArrayLike
import numpy as np
def gaussian_kde_patched(data: ArrayLike) -> 'gaussian_kde':
'''
Monkey patches `scipy.stats`'s `gaussian_kde` class so the proper version
of Silverman's rule for bandwidth estimation is applied.
Parameters
----------
data : ArrayLike
Sample points to estimate the density from.
Returns
-------
'gaussian_kde'
An instance of `scipy.stats.gaussian_kde` initialized with the proper
Silverman bandwidth estimation method.
Raises
------
ValueError
- If data is not numeric
- If data has length 0
- If data is not 1D
'''
# Cast data as Numpy array
data = np.asarray(data)
# Input validation
if not np.issubdtype(data.dtype, np.number):
raise ValueError(
"`data` must be a numerical array, got type {data.dtype}."
)
if data.size == 0:
raise ValueError("`data` must contain at least one data point.")
if data.ndim != 1:
raise ValueError(f"`data` must be a 1D array, got {data.ndim}.")
# Evaluate base factor
sample_size = len(data)
base_factor = (sample_size * 0.75) ** (-0.2)
# Compute classic standard deviation estimate
standard_deviation = np.std(data, ddof=1)
# Compute robust standard deviation estimate
q1, q3 = np.quantile(data, q=[0.25, 0.75])
robust_standard_deviation = (q3 - q1) / 1.34
# Compute scale estimate
scale_estimate = min(standard_deviation, robust_standard_deviation)
# Update base factor
base_factor *= scale_estimate / standard_deviation
# Return KDE object
return gaussian_kde(dataset=data, bw_method=base_factor, weights=None)
if __name__ == '__main__':
import plotly.graph_objects as go
from scipy.stats import norm
# Fixed seed for reproducibility
random_state = 42
rng = np.random.default_rng(seed=random_state)
# Generate mock data
data = np.concatenate([
rng.normal(loc=-5.0, scale=0.2, size=1000),
rng.normal(loc=+0.0, scale=0.2, size=8000),
rng.normal(loc=+5.0, scale=0.2, size=1000)
])
# Instantiate both KDEs
kde = gaussian_kde(dataset=data, bw_method='silverman')
kde_patched = gaussian_kde_patched(data=data)
# Evaluate ground truth
x = np.linspace(data.min(), data.max(), 1000)
y_true = (
norm(loc=-5.0, scale=0.1).pdf(x=x) * 0.1 +
norm(loc=+0.0, scale=0.1).pdf(x=x) * 0.8 +
norm(loc=+5.0, scale=0.1).pdf(x=x) * 0.1
)
# Plot final result
go.Figure().add_trace(
trace=go.Scatter(
x=x, y=y_true, mode='lines',
line=dict(color='black', dash='dash', width=1.0),
name='ground_truth'
)
).add_trace(
trace=go.Scatter(
x=x, y=kde.evaluate(points=x), mode='lines',
line=dict(color='red', width=1.0),
name='scipy.stats.gaussian_kde'
)
).add_trace(
trace=go.Scatter(
x=x, y=kde_patched.evaluate(points=x), mode='lines',
line=dict(color='blue', width=1.0),
name='monkey_patch'
)
).update_layout(
showlegend=True, title='<b>BANDWITH SELECTION COMPARATIVE STUDY</b>',
legend=dict(x=0.99, y=0.99, xanchor='right', yanchor='top')
).show() I'd be more than willing to create a feature branch and open a pull request, in case that helps. |
Summary
I would like to kindly request that
Plotly
consider replacing the use ofscipy.stats.gaussian_kde
withKDEpy.FFTKDE
in the implementation ofplotly.figure_factory.create_distplot
. This change would enhance the accuracy and performance of the kernel density estimation (KDE) and provide more robust handling of bandwidth selection for 1D data.Reason for the Request
The current implementation leverages
scipy.stats.gaussian_kde
, which applies a suboptimal version of Silverman's rule of thumb for bandwidth selection. Specifically, it uses the sample standard deviation to estimate the scale of the data, rather than the minimum of the sample standard deviation and the interquartile range (IQR) divided by 1.34. This approach can lead to overestimation of the bandwidth in the presence of outliers or non-normal data distributions. Moreover, it is misleading to user who otherwise think "silverman" has a definite meaning, and might therefore believe some densities to be materially different than what they really are, and incorrectly take action on such basis.In contrast,
KDEpy.FFTKDE
is:Benefits of the Change
Proposed Implementation
To integrate
KDEpy.FFTKDE
intoplotly.figure_factory.create_distplot
(more specifically here: _displot.py, around line 361), the following adjustments could be made:scipy.stats.gaussian_kde
withKDEpy.FFTKDE
and argumentbw='silverman'
.References
Conclusion
Thank you for considering this feature request. Replacing
scipy.stats.gaussian_kde
withKDEpy.FFTKDE
would significantly enhance the robustness, accuracy, and performance of kernel density estimation inplotly.figure_factory.create_distplot
. I would be happy to assist with further details, testing, or any other support needed to implement this change.Thank you for your time and for maintaining such a high-quality library.
The text was updated successfully, but these errors were encountered: