Issues with CluStream clustering #1633
PhilJahn
started this conversation in
Show and tell
Replies: 1 comment 2 replies
-
Good stuff, @PhilJahn! I am pinging @hoanganhngo610 on this one :)
-
Hello, my colleague and I were working with CluStream yesterday and noticed that its clustering was inconsistent with the behavior of kMeans. An example on complex-9 is shown below. Grey dots with orange rims are the micro-cluster centers (taken from micro_clusters[index].center), and colored dots with red rims are the corresponding kMeans centroids. The most noticeable inconsistencies are the out-of-place light green data points in the top-left and the turquoise ones in the bottom-center-left.
While investigating this, I noticed a mismatch between the micro-cluster centers used for finding the nearest micro-cluster (self.micro_clusters[index].center) and the ones used for the cluster assignment (self._mc_centers[index]). The latter are only updated when kMeans clustering is performed, whereas the former are kept up-to-date. This causes problems when a micro-cluster is deleted or merged: the index of the nearest micro-cluster may then point, in self._mc_centers, to the center of the deleted/merged micro-cluster, which can lie at a very different position and thus belong to a different kMeans cluster. The same applies, to a lesser degree, to movements of a micro-cluster between kMeans steps.
This can be fixed by using self.micro_clusters[index].center for the cluster assignment.
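To make the mismatch concrete, here is a minimal toy sketch in plain Python. It does not use river; `live_centers` stands in for `self.micro_clusters[index].center`, `cached_centers` for `self._mc_centers`, and all other names and values are made up for illustration.

```python
import math


def nearest(point, centers):
    """Return the key of the center closest to `point`."""
    return min(centers, key=lambda k: math.dist(point, centers[k]))


# Stand-in for self.micro_clusters[i].center (always up to date).
live_centers = {0: (0.0, 0.0), 1: (10.0, 10.0)}
# Macro centroids produced by the last kMeans run (illustrative values).
macro_centroids = {0: (0.0, 0.0), 1: (10.0, 10.0)}
# Stand-in for self._mc_centers: a snapshot taken at the last kMeans run.
cached_centers = dict(live_centers)

# Assumed scenario: micro-cluster 1 is deleted and its index is reused
# for a new micro-cluster near the origin. The snapshot is not refreshed.
live_centers[1] = (1.0, 1.0)


def predict_stale(x):
    # Nearest micro-cluster from live centers, assignment from the stale snapshot.
    idx = nearest(x, live_centers)
    return nearest(cached_centers[idx], macro_centroids)


def predict_live(x):
    # Both steps use the live micro-cluster center.
    idx = nearest(x, live_centers)
    return nearest(live_centers[idx], macro_centroids)


x = (1.2, 0.9)
print(predict_stale(x), predict_live(x))
```

In this toy setup the stale lookup sends a point near the origin to the far-away macro cluster, while the live lookup keeps it in the nearby one.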
I also noticed a second, presumably unintended behavior: the kMeans step can be skipped if the data point that arrives at the timestamp satisfying "% self.time_gap == self.time_gap - 1" is added to an existing micro-cluster. As a result, the clustering can be outdated or, in the worst case, never be performed at all.
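The effect can be reproduced with a small simulation. The control flow below is an assumption about how such a skip can arise (an early return when a point is absorbed), not a copy of river's code; `ToyStream`, `early_return`, and `recluster_times` are hypothetical names.

```python
class ToyStream:
    """Hypothetical stand-in for the online phase; only tracks when kMeans would run."""

    def __init__(self, time_gap, early_return):
        self.time_gap = time_gap
        self.early_return = early_return  # True mimics the assumed faulty flow
        self.timestamp = -1
        self.recluster_times = []

    def learn_one(self, absorbed):
        self.timestamp += 1
        if absorbed and self.early_return:
            # Point was added to an existing micro-cluster; returning here
            # means the periodic check below is never reached.
            return
        # (Creating or merging micro-clusters would happen here.)
        if self.timestamp % self.time_gap == self.time_gap - 1:
            self.recluster_times.append(self.timestamp)


def run(early_return, n_points=20, time_gap=5):
    stream = ToyStream(time_gap, early_return)
    for t in range(n_points):
        # Assume every point except those at t=0 and t=7 is absorbed.
        stream.learn_one(absorbed=t not in (0, 7))
    return stream.recluster_times


print(run(early_return=True))
print(run(early_return=False))
```

With the early return, none of the timestamps 4, 9, 14, 19 triggers a kMeans step because each of those points happens to be absorbed; without it, all four do.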
I have addressed both in my fork of the repository: https://github.com/PhilJahn/PhilJahnLMU-river-fork. I have opened up a pull request for this here: #1634
Related to this, I would also like to suggest an option to perform the kMeans clustering as part of the prediction phase, as this seems closer to the idea of Online-Offline stream clustering. It would also resolve the first issue above (at least when the option is used), since _mc_centers[index] would then be up-to-date at prediction time. I have implemented this in a branch of my fork of the river repository (https://github.com/PhilJahn/PhilJahnLMU-river-fork/tree/Offline-CluStream), but have not yet written documentation for it, as I wanted to wait for feedback on the suggestion first.
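A rough sketch of the idea, assuming a simple "stale" flag that defers the offline step until a prediction is requested. This is not the implementation in the branch above; `LazyOfflineClusterer`, `update_micro_clusters`, and the tiny `kmeans` helper are hypothetical and written in plain Python.

```python
import math


def kmeans(points, k, n_iter=10):
    """Tiny Lloyd's algorithm, seeded with the first k points for determinism."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[j].append(p)
        # Empty groups keep their previous centroid.
        centroids = [
            tuple(sum(v) / len(g) for v in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return centroids


class LazyOfflineClusterer:
    """Hypothetical wrapper: micro-cluster centers are passed in directly."""

    def __init__(self, k):
        self.k = k
        self.mc_centers = []
        self.centroids = []
        self._stale = True
        self.offline_runs = 0

    def update_micro_clusters(self, centers):
        # Online phase: only record the new state and mark the macro clusters stale.
        self.mc_centers = list(centers)
        self._stale = True

    def predict_one(self, x):
        # Offline phase runs on demand, so centers and centroids always match.
        if self._stale:
            self.centroids = kmeans(self.mc_centers, self.k)
            self._stale = False
            self.offline_runs += 1
        mc = min(self.mc_centers, key=lambda c: math.dist(x, c))
        return min(range(self.k), key=lambda j: math.dist(mc, self.centroids[j]))


model = LazyOfflineClusterer(k=2)
model.update_micro_clusters([(0, 0), (0, 1), (10, 10), (10, 11)])
print(model.predict_one((0.2, 0.3)), model.predict_one((9, 9)), model.offline_runs)
```

The trade-off in this sketch is that the first prediction after each update pays the cost of the kMeans run, while repeated predictions on an unchanged state reuse the cached centroids.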