Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding min-max linkage to Agglomerative Clustering #19698

Open
NimaSarajpoor opened this issue Mar 17, 2021 · 9 comments
Open

Adding min-max linkage to Agglomerative Clustering #19698

NimaSarajpoor opened this issue Mar 17, 2021 · 9 comments

Comments

@NimaSarajpoor
Copy link
Contributor

Hi,

I checked out the "the scikit-learn inclusion criterion:" and the algorithm I am about to suggest satisfies all except one! And, that's the number of citations, which is 116 according to google scholar (not 200+).

However, I think many people (who are on the application side of ML) use scikit-learn for their projects and are not aware of many algorithms that are available. Thus, including a new algorithm in the scikit-learn can let them know such method exists. Please let me know if it is a good idea to add this method can be added to the Agglemorative Clustering. If so, I can work on it (as I already implemented it myself) and add it to the package.

Challenge: The linkages in agglemorative clustering that can work with the distance matrix are complete, linkage, and average. However, the resulted clusters don't have a clear centroid. So, a method that can work with a distance matrix and provide a centroid-based structured clusters can provide a lot of flexibility. It can be used for any distance measure and since it utilizes the distance matrix only, it can be performed very fast for large-size data. It also provides centroids which can be used as the representatives of the clusters.

Solution: A min-max linkage proposed by Jacob Bien & Robert Tibshirani (2011) is a good solution. It merges two clusters if a ball that is required to cover the points of the two clusters together is the smallest compare to others. A ball for each cluster can be identified as follows: First, identify the furthest neighbor of each point and calculate its distance. Then, choose the point (M) that has the shortest distance to its furthest neighbor. That point M is the centroid and that shortest distance is the radius of the ball. I recently used it for my paper where I had to use DTW distance, but I also need a fast algorithm that provides a reasonable centroid rather than computing DBA at each step of the process.

Alternative solution: N/A

Additional Info:
Paper:
Bien, Jacob, and Robert Tibshirani. "Hierarchical clustering with prototypes via minimax linkage." Journal of the American Statistical Association 106.495 (2011): 1075-1084.

@deepthi1107
Copy link

hloo,i would like to contribute to this issue, can u please assign me this issue

@NimaSarajpoor
Copy link
Contributor Author

@deepthi1107 : To be honest, I don't know if I am allowed to assign you for this as I am waiting to see if someone confirm this new feature, so we can work on it.

@deepthi1107
Copy link

deepthi1107 commented Mar 20, 2021 via email

@jnothman
Copy link
Member

jnothman commented Mar 21, 2021 via email

@NimaSarajpoor
Copy link
Contributor Author

One of the reason for the scikit-learn-extra library is to create increased visibility for algorithms that are not sufficiently popular to maintain in the core library. (Of course, the visibility of scikit-learn-extra is also reduced relative to this library, and maybe we should consider linking out from the core document out more often.) Have you considered making this contribution there?

@jnothman: No. I used "KMedoid" from extra library a while ago, and I thought that library was just a small project created to add some new algorithms. I will mention it there. Thanks for your input.

@NimaSarajpoor
Copy link
Contributor Author

@deepthi1107 :
Hi, I just mentioned it on the scikit-learn-extra page. If they approve, we can start to work on this.

@deepthi1107
Copy link

deepthi1107 commented Mar 22, 2021 via email

@NimaSarajpoor
Copy link
Contributor Author

@cmarmo @jnothman

UPDATE:
I opened an issue of this on scikit-learn-extra about three months ago; however, there has been no reaction. Should I go ahead and implement it and submit a PR? Or should I close the issue?

@deepthi1107
Copy link

deepthi1107 commented Jun 27, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants