HDBSCAN isn't always the answer

Every article on clustering recommends HDBSCAN. But it only creates clusters when they clearly exist. In business, force-creating clusters can surface actionable segments (even if they're overlapping and unstable). K-Means is more apt here.

Sep 25, 2023
2-minute read

Every time I run HDBSCAN on a real-life dataset, this is what my results typically look like:

Here's proof from real-life examples. See how large the "Others" cluster is:

Dataset                    Others   #1     #2     #3     Count
Marvel characters' powers  52.2%    5.2%   2.0%   1.9%   96
Singapore flat prices      40.9%    8.7%   7.9%   6.9%   50
American community survey  77.7%    2.4%   2.0%   1.2%   40

Though the US county demographics ended up with 3 distinct clusters:

... in every case, HDBSCAN bucketed over 40% of the data into an "Others" category that can't be clustered.

That's because HDBSCAN looks for clearly differentiated clusters. If there ARE no clusters, it clubs all the data into a single cluster. (See the last row under DBSCAN.)

Clustering algorithm comparison

But sometimes, we aren't interested in clearly defined clusters. We want segments we can act on.

For example, how can we think about Marvel characters? With K-Means, we can force-segment them into 6 segments:

This gives us a feel for how we might segment this group. Once we identify a segment profile, we can assign each person to the nearest profile.
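That assignment step is just a nearest-profile lookup. A sketch with NumPy, where the profiles, features, and values are all made up for illustration:

```python
import numpy as np

# Hypothetical segment profiles over 2 features (strength, intelligence)
profiles = np.array([[0.9, 0.2],    # bruisers
                     [0.2, 0.9],    # thinkers
                     [0.5, 0.5]])   # all-rounders

# Hypothetical characters as feature vectors
people = np.array([[0.80, 0.30],
                   [0.10, 0.95],
                   [0.55, 0.50]])

# Distance from each person to each profile, then pick the nearest
dists = np.linalg.norm(people[:, None, :] - profiles[None, :, :], axis=2)
segment = dists.argmin(axis=1)
print(segment.tolist())  # → [0, 1, 2]
```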

This is almost exactly what K-Means does. It randomly picks "profiles" or seeds. Then it assigns each person to the nearest profile. And while there are smarter ways of selecting profiles, random is a good option to ideate with.
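In scikit-learn terms, that pick-random-profiles-then-assign loop is KMeans with init="random". A hedged sketch on stand-in data (the 96×5 matrix is a placeholder for the Marvel features, not the real dataset):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(96, 5))  # stand-in for 96 characters x 5 power features

# init="random" seeds the "profiles" from random points;
# the default init="k-means++" is the smarter seeding mentioned above
km = KMeans(n_clusters=6, init="random", n_init=10, random_state=42).fit(X)

# Unlike HDBSCAN, every point gets a segment -- no "Others" bucket
print(np.bincount(km.labels_))  # size of each of the 6 segments
```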

REMEMBER: People want to act on data. So create actionable segments. If the "right" clustering doesn't lead to meaningful segments, it doesn't help. K-Means seems more apt for this than HDBSCAN.