
Feature Request: Replace scipy.stats.gaussian_kde with KDEpy.FFTKDE in plotly.figure_factory.create_distplot #4985

Open
HistoryBeginsAtSumer opened this issue Jan 27, 2025 · 2 comments

@HistoryBeginsAtSumer

Summary

I would like to kindly request that Plotly consider replacing the use of scipy.stats.gaussian_kde with KDEpy.FFTKDE in the implementation of plotly.figure_factory.create_distplot. This change would enhance the accuracy and performance of the kernel density estimation (KDE) and provide more robust handling of bandwidth selection for 1D data.

Reason for the Request

The current implementation relies on scipy.stats.gaussian_kde, which applies a suboptimal version of Silverman's rule of thumb for bandwidth selection. Specifically, it uses the sample standard deviation to estimate the scale of the data, rather than the minimum of the sample standard deviation and the interquartile range (IQR) divided by 1.34. This can lead to an overestimated bandwidth in the presence of outliers or non-normal data distributions. Moreover, it is misleading to users who assume that "silverman" has a definite meaning; they might believe some densities are materially different from what they really are and incorrectly act on that basis.
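
To make the overestimation concrete, here is a small illustrative comparison (the contaminated sample is made up for this example, and the factor below mirrors the one scipy applies for 1D data):

import numpy as np

rng = np.random.default_rng(0)
# 990 points from N(0, 1) contaminated with 10 far-away outliers
data = np.concatenate([rng.normal(0.0, 1.0, 990), rng.normal(0.0, 50.0, 10)])

n = data.size
factor = (n * 3 / 4) ** (-1 / 5)           # Silverman factor for 1D data
std = np.std(data, ddof=1)                 # around 5, inflated by the outliers
q1, q3 = np.quantile(data, [0.25, 0.75])
robust_scale = min(std, (q3 - q1) / 1.34)  # around 1, close to the bulk of the data

print("std-based bandwidth:   ", factor * std)           # roughly five times wider
print("robust-scale bandwidth:", factor * robust_scale)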

In contrast, KDEpy.FFTKDE is:

  1. More Robust: It allows better control over bandwidth selection and supports Silverman's rule of thumb with the more robust scale estimate (see the usage sketch after this list).
  2. Faster: By leveraging FFT, it is computationally more efficient, particularly for large datasets.
  3. Customizable: It provides greater flexibility in kernel and bandwidth selection, enabling users to achieve more accurate density estimations.
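
A minimal usage sketch of the KDEpy-based call (the grid size of 2 ** 10 is arbitrary; option names should be double-checked against the KDEpy documentation):

import numpy as np
from KDEpy import FFTKDE

rng = np.random.default_rng(0)
data = rng.normal(size=1_000)

# Gaussian kernel, Silverman bandwidth, FFT-based evaluation on an
# automatically chosen equidistant grid of 2 ** 10 points.
x, y = FFTKDE(kernel='gaussian', bw='silverman').fit(data).evaluate(2 ** 10)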

Benefits of the Change

  • Improved Accuracy: Robust scale estimation would result in more reliable KDEs, especially for datasets with outliers or skewed distributions.
  • Performance Gains: Faster computation for large datasets using FFT.
  • Flexibility: Enhanced control over bandwidth selection and kernel options.

Proposed Implementation

To integrate KDEpy.FFTKDE into plotly.figure_factory.create_distplot (more specifically in _distplot.py, around line 361), the following adjustments could be made (a rough sketch follows the list):

  1. Replace the call to scipy.stats.gaussian_kde with KDEpy.FFTKDE, passing bw='silverman'.
  2. Make sure both the "x" evaluation grid and the corresponding "y" density values are handled correctly (FFTKDE evaluates on an equidistant grid rather than at arbitrary points).
  3. Ensure compatibility with existing parameters and provide clear documentation for users.
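
A rough, hypothetical sketch of the replacement (the helper name _kde_curve, the padding, and the grid size are illustrative and not the actual code in _distplot.py):

import numpy as np
from KDEpy import FFTKDE


def _kde_curve(hist_data, n_points=500):
    # Hypothetical helper: evaluate a Silverman-bandwidth FFT KDE on an
    # equidistant grid. FFTKDE requires the evaluation grid to be equidistant
    # and to contain all sample points, hence the small padding.
    data = np.asarray(hist_data, dtype=float)
    pad = 0.5 * data.std(ddof=1)
    curve_x = np.linspace(data.min() - pad, data.max() + pad, n_points)
    curve_y = FFTKDE(kernel='gaussian', bw='silverman').fit(data).evaluate(curve_x)
    return curve_x, curve_y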

Conclusion

Thank you for considering this feature request. Replacing scipy.stats.gaussian_kde with KDEpy.FFTKDE would significantly enhance the robustness, accuracy, and performance of kernel density estimation in plotly.figure_factory.create_distplot. I would be happy to assist with further details, testing, or any other support needed to implement this change.

Thank you for your time and for maintaining such a high-quality library.

@gvwilson gvwilson added feature something new P3 backlog labels Feb 3, 2025
@gvwilson
Contributor

gvwilson commented Feb 3, 2025

Thanks for the suggestion @HistoryBeginsAtSumer - I'll add it to our backlog, but we are very reluctant to introduce more dependencies, so I doubt we will make the change any time soon. Thanks - @gvwilson

@HistoryBeginsAtSumer
Author

@gvwilson Thank you for your response. If introducing new dependencies is a concern, I propose instead wrapping scipy.stats's gaussian_kde (within create_distplot) so that the proper bandwidth (computed with the robust version of Silverman's formula for 1D data) is passed to the bw_method argument. This is easy to achieve. Here's an example:

from scipy.stats import gaussian_kde
from numpy.typing import ArrayLike
import numpy as np


def gaussian_kde_patched(data: ArrayLike) -> gaussian_kde:
    '''
    Wraps `scipy.stats`'s `gaussian_kde` class so that the robust version
    of Silverman's rule for bandwidth estimation is applied.

    Parameters
    ----------
    data : ArrayLike
        Sample points to estimate the density from.

    Returns
    -------
    gaussian_kde
        An instance of `scipy.stats.gaussian_kde` initialized with the robust
        Silverman bandwidth.

    Raises
    ------
    ValueError
        - If data is not numeric
        - If data has length 0
        - If data is not 1D
    '''
    # Cast data as Numpy array
    data = np.asarray(data)

    # Input validation
    if not np.issubdtype(data.dtype, np.number):
        raise ValueError(
            f"`data` must be a numerical array, got type {data.dtype}."
        )

    if data.size == 0:
        raise ValueError("`data` must contain at least one data point.")

    if data.ndim != 1:
        raise ValueError(f"`data` must be a 1D array, got a {data.ndim}D array.")

    # Evaluate base factor
    sample_size = len(data)
    base_factor = (sample_size * 0.75) ** (-0.2)

    # Compute classic standard deviation estimate
    standard_deviation = np.std(data, ddof=1)

    # Compute robust standard deviation estimate
    q1, q3 = np.quantile(data, q=[0.25, 0.75])
    robust_standard_deviation = (q3 - q1) / 1.34

    # Compute scale estimate
    scale_estimate = min(standard_deviation, robust_standard_deviation)

    # Update base factor
    base_factor *= scale_estimate / standard_deviation

    # Return KDE object
    return gaussian_kde(dataset=data, bw_method=base_factor, weights=None)


if __name__ == '__main__':
    import plotly.graph_objects as go
    from scipy.stats import norm

    # Fixed seed for reproducibility
    random_state = 42
    rng = np.random.default_rng(seed=random_state)

    # Generate mock data
    data = np.concatenate([
        rng.normal(loc=-5.0, scale=0.2, size=1000),
        rng.normal(loc=+0.0, scale=0.2, size=8000),
        rng.normal(loc=+5.0, scale=0.2, size=1000)
    ])

    # Instantiate both KDEs
    kde = gaussian_kde(dataset=data, bw_method='silverman')
    kde_patched = gaussian_kde_patched(data=data)

    # Evaluate ground truth
    x = np.linspace(data.min(), data.max(), 1000)

    y_true = (
        norm(loc=-5.0, scale=0.2).pdf(x=x) * 0.1 +
        norm(loc=+0.0, scale=0.2).pdf(x=x) * 0.8 +
        norm(loc=+5.0, scale=0.2).pdf(x=x) * 0.1
    )

    # Plot final result
    go.Figure().add_trace(
        trace=go.Scatter(
            x=x, y=y_true, mode='lines',
            line=dict(color='black', dash='dash', width=1.0),
            name='ground_truth'
        )
    ).add_trace(
        trace=go.Scatter(
            x=x, y=kde.evaluate(points=x), mode='lines',
            line=dict(color='red', width=1.0),
            name='scipy.stats.gaussian_kde'
        )
    ).add_trace(
        trace=go.Scatter(
            x=x, y=kde_patched.evaluate(points=x), mode='lines',
            line=dict(color='blue', width=1.0),
            name='monkey_patch'
        )
    ).update_layout(
        showlegend=True, title='<b>BANDWIDTH SELECTION COMPARATIVE STUDY</b>',
        legend=dict(x=0.99, y=0.99, xanchor='right', yanchor='top')
    ).show()

I'd be more than willing to create a feature branch and open a pull request, in case that helps.
