DBSTREAM incorrect labelling of noisy micro-clusters #1730

@th3sh3ph3rd

Versions

river version: 0.23.0
Python version: 3.12.8
Operating system: Ubuntu 24.04.3 LTS

Describe the bug

The original DBSTREAM paper defines noisy micro-clusters as follows:
[Image: definition of noisy micro-clusters from the DBSTREAM paper]
However, the current River implementation of DBSTREAM does not handle noisy micro-clusters correctly. Noisy micro-clusters that have not yet been removed by a cleanup are included in the list of clusters as is, which is not in line with the behaviour outlined in the original paper.

I propose the following fix:
Add

if self._micro_clusters[index].weight < self.minimum_weight:
    continue

after line 332 in the DBSTREAM implementation. This ensures that noisy micro-clusters are not labelled as clusters.
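
To illustrate the intended effect of this guard, here is a minimal, self-contained sketch. It does not use River's internals: `MicroCluster` and `collect_clusters` are hypothetical stand-ins I made up for illustration, and only the `weight < minimum_weight` check mirrors the proposed change.

```python
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Hypothetical stand-in for a DBSTREAM micro-cluster (not River's class)."""
    center: dict
    weight: float


def collect_clusters(micro_clusters, minimum_weight):
    """Return the micro-clusters whose weight reaches the threshold.

    Assumption: mirrors the proposed guard only, not River's actual
    re-clustering logic.
    """
    kept = {}
    for index, mc in micro_clusters.items():
        if mc.weight < minimum_weight:
            continue  # noisy micro-cluster: do not label it as a cluster
        kept[index] = mc
    return kept


# Two noisy micro-clusters (weight 1) and one that is heavy enough (weight 6).
micro_clusters = {
    0: MicroCluster({0: 0, 1: 0}, 1),
    1: MicroCluster({0: 50, 1: 50}, 1),
    2: MicroCluster({0: 100, 1: 100}, 6),
}

for index, mc in collect_clusters(micro_clusters, minimum_weight=5).items():
    print(f"index: {index}, center: {mc.center}, weight: {mc.weight}")
```

With the guard in place, only the micro-cluster at index 2 would be reported; if all three had weight 1, the result would be empty.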

Steps/code to reproduce

The following code outputs 3 clusters, even though their respective weights are clearly below the minimum_weight threshold, qualifying them as noisy micro-clusters.

from river import cluster
from river import stream

X = [
    [0, 0], [50, 50], [100, 100]
]

dbstream = cluster.DBSTREAM(
    clustering_threshold=1.0,
    fading_factor=0.001,
    cleanup_interval=10,
    intersection_factor=0.3,
    minimum_weight=5
)

for x, _ in stream.iter_array(X):
    dbstream.learn_one(x)

for _, c in dbstream.clusters.items():
    print(f"center: {c.center}, weight: {c.weight}")

Output:

center: {0: 0, 1: 0}, weight: 1
center: {0: 50, 1: 50}, weight: 1
center: {0: 100, 1: 100}, weight: 1
