Issues with CluStream clustering #1633

PhilJahn · 2024-11-15T15:10:33Z


Hello, my colleague and I were working with CluStream yesterday and noticed that its clustering was inconsistent with the behavior of kMeans. An example on complex-9 is shown below. Grey dots with orange rims are the micro-cluster centers (based on micro_clusters[index].center), and colored dots with red rims are the respective kMeans centroids. The most noticeable inconsistencies are the out-of-place light green data points in the top-left and the turquoise ones in the bottom-center-left.

While investigating this, I noticed a mismatch between the micro-cluster centers used for computing the nearest micro-cluster (self.micro_clusters[index].center) and the ones used for the cluster assignment (self._mc_centers[index]). The latter are only updated when kMeans clustering is performed, whereas the former are kept up to date. This causes problems when a micro-cluster is deleted or merged: the index of the nearest micro-cluster may then refer to the stored center of the deleted/merged micro-cluster, which can lie at a very different position and thus belong to a different kMeans cluster. The same applies, to a lesser degree, when a micro-cluster moves between kMeans steps.
This can be fixed by using self.micro_clusters[index].center for the cluster assignment.
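
To make the mismatch concrete, here is a minimal toy sketch in plain Python. It does not use river's actual classes: live_centers stands in for micro_clusters[index].center, cached_centers stands in for _mc_centers, and the idea that a deleted micro-cluster's index is taken over by a new one is an assumption made for illustration.

```python
import math


def nearest(point, centers):
    """Return the key of the center closest to `point`."""
    return min(centers, key=lambda k: math.dist(point, centers[k]))


# Live micro-cluster centers, keyed by micro-cluster index (always up to date).
live_centers = {0: (0.0, 0.0), 1: (10.0, 10.0)}

# Snapshot taken at the last kMeans step (only refreshed when kMeans runs).
cached_centers = dict(live_centers)

# kMeans centroids from that step: cluster 0 near the origin, cluster 1 far away.
macro_centroids = {0: (0.0, 0.0), 1: (10.0, 10.0)}

# Assumption: micro-cluster 1 is deleted and its index is taken over by a new
# micro-cluster near the origin. The snapshot is not refreshed.
live_centers[1] = (1.0, 1.0)

x = (1.2, 0.9)
idx = nearest(x, live_centers)  # nearest live micro-cluster

# Assignment via the stale snapshot looks up the old position of index 1.
stale_label = nearest(cached_centers[idx], macro_centroids)

# Assignment via the live center uses the position the point was matched to.
fixed_label = nearest(live_centers[idx], macro_centroids)

print(idx, stale_label, fixed_label)
```

In this toy setup the point near the origin is matched to micro-cluster 1, but the stale snapshot places that index at (10, 10), so the two lookups disagree on the kMeans cluster.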

I also noticed a second, presumably unintended behavior: the kMeans step can be skipped if the data point that arrives at the timestamp satisfying "% self.time_gap == self.time_gap - 1" is added to an existing micro-cluster. As a result, the clustering can be outdated or, in the worst case, never performed at all.
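
The skipped step can be illustrated with a small simulation. This is a sketch under the assumption that the time-gap check sits after an early exit taken when a point is absorbed into an existing micro-cluster; the function run and its parameters are hypothetical and not part of river.

```python
def run(absorbed_flags, time_gap, check_before_exit):
    """Return the timestamps at which the kMeans step would fire.

    absorbed_flags[t] is True when the point arriving at timestamp t is
    added to an existing micro-cluster (the early-exit path).
    """
    fired = []
    for t, absorbed in enumerate(absorbed_flags):
        due = t % time_gap == time_gap - 1
        if check_before_exit and due:
            fired.append(t)  # check runs regardless of how the point is handled
        if absorbed:
            continue  # early exit: nothing below this line runs for this point
        if not check_before_exit and due:
            fired.append(t)  # check is only reached for non-absorbed points
    return fired


# Points at timestamps 2 and 5 are absorbed, and both are "due" for time_gap=3.
flags = [False, False, True, False, False, True]

skipped = run(flags, time_gap=3, check_before_exit=False)
always = run(flags, time_gap=3, check_before_exit=True)
print(skipped, always)
```

With the check placed after the early exit, neither due timestamp triggers the step in this example; with the check placed first, both do.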

I have addressed both in my fork of the repository: https://github.com/PhilJahn/PhilJahnLMU-river-fork. I have opened up a pull request for this here: #1634

Related to this, I would also like to suggest an option to perform the kMeans clustering as part of the prediction phase, as this seems closer to the idea of online-offline stream clustering. When the option is used, this would also resolve the first issue above, since _mc_centers[index] would be up to date whenever a prediction is made. I have implemented this in a branch of my fork of the river repository (https://github.com/PhilJahn/PhilJahnLMU-river-fork/tree/Offline-CluStream), but have not yet written documentation for it, as I wanted to wait for feedback on the suggestion first.
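
As a rough illustration of clustering at prediction time, the sketch below recomputes the offline step lazily, only when a label is requested after the centers have changed. It is not the implementation in the branch above: the class LazyMacroClusterer, its methods, and the recluster callable (a stand-in for kMeans) are all hypothetical.

```python
class LazyMacroClusterer:
    """Toy wrapper that runs the offline step on demand at prediction time."""

    def __init__(self, recluster):
        # Hypothetical callable: maps {index: center} to {index: macro label}.
        self._recluster = recluster
        self._centers = {}
        self._labels = {}
        self._dirty = False
        self.n_reclusterings = 0

    def update_center(self, index, center):
        """Online phase: record a new or moved micro-cluster center."""
        self._centers[index] = center
        self._dirty = True

    def label_of(self, index):
        """Prediction phase: refresh macro labels only if centers changed."""
        if self._dirty:
            self._labels = self._recluster(dict(self._centers))
            self._dirty = False
            self.n_reclusterings += 1
        return self._labels[index]


def threshold_recluster(centers):
    """Stand-in for kMeans: split micro-clusters on their first coordinate."""
    return {i: int(c[0] > 5.0) for i, c in centers.items()}


clusterer = LazyMacroClusterer(threshold_recluster)
clusterer.update_center(0, (0.0, 0.0))
clusterer.update_center(1, (10.0, 10.0))
first = (clusterer.label_of(0), clusterer.label_of(1))

clusterer.update_center(1, (1.0, 1.0))  # index 1 now sits near the origin
second = clusterer.label_of(1)
print(first, second, clusterer.n_reclusterings)
```

The trade-off in this design is that predictions always see current centers, at the cost of running the offline step whenever a prediction follows an update.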

smastelini · 2024-11-16T12:00:41Z

Maintainer

Good stuff, @PhilJahn! I am pinging @hoanganhngo610 on this one :)

2 replies

smastelini Nov 16, 2024
Maintainer

Let's take care of #1634 and later we can discuss the implications of clustering at prediction time and how to address having this option. Thank you for your contribution!

PhilJahn Nov 18, 2024
Author

Great, thank you
