Key Takeaways
Clustering differs from most ML tasks in one uncomfortable way: you can't check whether you got the right answer. In supervised learning, you compare predictions to labels. In clustering, there are no labels. The algorithm found groups, but are they meaningful? Are they stable? Would a different algorithm on the same data produce better groups? Clustering metrics exist to answer these questions, but each metric has blind spots that can validate bad clusterings.
At Boundev, our ML engineering teams build production clustering systems for customer segmentation, fraud detection, and recommendation engines. We've learned that the metric you choose determines the clusters you get — and choosing wrong means building business decisions on mathematical artifacts. This guide covers the three most important internal clustering metrics, their failure modes, and how to combine them for reliable evaluation.
The Clustering Evaluation Challenge
Why evaluating unsupervised models requires multiple metrics and domain validation.
The Three Core Clustering Metrics
Internal clustering metrics evaluate cluster quality without external labels by measuring two properties: how tightly packed each cluster is (cohesion) and how well-separated clusters are from each other (separation). Each metric weighs these properties differently, which is why they sometimes disagree.
Silhouette Score: Deep Dive
The Silhouette Score is the most granular of the three metrics because it calculates a score for each individual data point rather than just an aggregate value. For each sample, it computes two distances: a (mean distance to all other points in the same cluster) and b (mean distance to all points in the nearest neighboring cluster). The silhouette coefficient is (b - a) / max(a, b).
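The per-sample and aggregate versions of this calculation are both available in scikit-learn. A minimal sketch on synthetic data (the dataset and parameters are illustrative, not from any specific project):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Three well-separated blobs: the favorable case for Silhouette.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

per_point = silhouette_samples(X, labels)  # one (b - a) / max(a, b) per sample
mean_score = silhouette_score(X, labels)   # average over all samples

# The aggregate score is just the mean of the per-sample coefficients.
assert np.isclose(mean_score, per_point.mean())
```

Inspecting `per_point` rather than only `mean_score` is what makes Silhouette useful for diagnosing individual misassigned points.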
When Silhouette Score Works Well
The Silhouette Score excels at evaluating well-separated, roughly spherical clusters of similar size. It's the best metric for determining the optimal number of clusters (k) by comparing average scores across different k values.
When Silhouette Score Misleads
Silhouette Score systematically undervalues density-based and non-convex clusters. A crescent-shaped cluster perfectly captured by DBSCAN will score poorly because points on the outer edge are far from points on the inner edge, inflating the intra-cluster distance a.
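This failure mode is easy to reproduce: even the ground-truth assignment on a two-crescent dataset earns only a mediocre Silhouette, despite being exactly the clustering a density-based algorithm should find. A small demonstration (thresholds in the comment are rough expectations, not guarantees):

```python
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

# Two interleaved crescents with a little noise. y_true is the correct
# crescent membership for every point.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

# Scoring the *correct* labels: points on opposite ends of the same
# crescent are far apart, which inflates the intra-cluster distance a,
# so the score lands well below the ~0.7 of compact spherical clusters.
score = silhouette_score(X, y_true)
```

The same geometry that makes the crescents obvious to the eye makes them look bad to a centroid- and distance-averaging metric.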
Davies-Bouldin Index: Deep Dive
The Davies-Bouldin Index measures the average worst-case similarity between each cluster and its most similar neighbor. For each cluster, it finds the neighboring cluster that looks most like it (based on the ratio of within-cluster spread to between-cluster centroid distance), then averages these worst-case ratios across all clusters.
Strengths: Davies-Bouldin is computed from cluster centroids and within-cluster spreads, so it is far cheaper than Silhouette on large datasets, and interpretation is simple: lower is better, with 0 as the theoretical optimum.
Weaknesses: Its reliance on centroid distances gives it the same convex-cluster bias as the other metrics, so it penalizes non-convex and density-based clusterings, and outliers that inflate within-cluster spread can distort the ratio.
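In practice, Davies-Bouldin is typically used to compare candidate k values on the same dataset, taking the k with the lowest score. A minimal sketch with scikit-learn (synthetic data, illustrative parameters):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.7, random_state=1)

# Lower DBI is better; only comparisons on the same data are meaningful.
scores = {
    k: davies_bouldin_score(
        X, KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    )
    for k in range(2, 7)
}
best_k = min(scores, key=scores.get)  # candidate with the lowest DBI
```

Because each evaluation only needs centroids and spreads, this sweep stays cheap even when a full Silhouette sweep would not.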
Building Production ML Clustering Systems?
Boundev places senior ML engineers and data scientists who build clustering pipelines with rigorous evaluation frameworks. Our teams implement multi-metric validation, build automated cluster quality monitoring, and ensure clustering outputs drive meaningful business decisions. Embed an ML specialist in your team in 7-14 days through staff augmentation.
Calinski-Harabasz Index: Deep Dive
The Calinski-Harabasz Index (also called the Variance Ratio Criterion) is the fastest of the three metrics and uses the ratio of between-cluster dispersion to within-cluster dispersion. Higher values indicate clusters that are both dense internally and well-separated from each other.
1. Fast Computation
Uses sum-of-squares calculations rather than pairwise distances, making it practical for large datasets where Silhouette Score is too slow. Ideal as a first-pass metric during hyperparameter search.
2. Scale-Dependent
Absolute values are meaningless across different datasets — only useful for comparing different k values or algorithms on the same data. A score of 500 isn't inherently better than 200 unless they're from the same dataset.
3. Strongest Convex Bias
Of the three metrics, Calinski-Harabasz most strongly favors spherical, convex clusters. It will consistently prefer k-means output over DBSCAN output even when DBSCAN captures the true data structure better.
4. Best Use Case
Use Calinski-Harabasz as a fast screening metric when tuning k-means or Gaussian Mixture Models. Filter candidate solutions quickly, then validate top candidates with Silhouette Score and domain expertise.
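A sketch of that screening pattern: sweep a wide range of k with Calinski-Harabasz, then keep only a shortlist for slower metrics (dataset and cutoffs are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=2000, centers=5, cluster_std=1.0, random_state=7)

# CH uses sum-of-squares rather than pairwise distances, so a wide
# k sweep stays cheap; higher scores are better.
ch = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    ch[k] = calinski_harabasz_score(X, labels)

# Shortlist the top candidates for more expensive validation.
shortlist = sorted(ch, key=ch.get, reverse=True)[:3]
```

Only the shortlist then needs Silhouette, which is where the real cost is.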
Choosing the Right Metric for Your Use Case
The right metric depends on your algorithm, data characteristics, and computational budget. Here's our decision framework from hundreds of clustering projects delivered by our ML teams.
Production Rule: Never ship a clustering model evaluated by a single metric. At minimum, run all three internal metrics and check for agreement. If they disagree, investigate why — the disagreement itself reveals important information about cluster shape, density distribution, or potential overfitting to a specific k value. Then validate against domain-specific criteria: do the clusters represent groups that a business stakeholder would recognize and find actionable?
Multi-Metric Evaluation Workflow
Here's the evaluation workflow we use across production clustering projects. It balances computational efficiency with evaluation rigor.
1. Broad Search with CH: sweep k values using Calinski-Harabasz. It's fast enough to test k=2 through k=50 on large datasets in minutes.
2. Narrow with DBI: take the top 5 candidate k values from CH and evaluate with Davies-Bouldin. Look for low DBI that agrees with high CH.
3. Validate with Silhouette: run Silhouette on the top 2-3 candidates. Examine per-sample scores to identify misassigned points and overlapping boundaries.
4. Domain Validation: present final candidates to domain experts. Do the clusters represent recognizable, actionable groups? Mathematical optimality without business meaning is worthless.
5. Stability Testing: bootstrap the dataset and re-run clustering. Stable clusters should persist across subsamples. Fragile clusters that change with minor data perturbations aren't production-ready.
6. Monitoring in Production: track metrics over time as new data arrives. Metric drift signals that the data distribution has shifted and clusters need retraining.
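The automatable parts of this workflow can be sketched end to end. This is a simplified illustration on synthetic data, not a production implementation; the bootstrap stability check uses adjusted Rand index (ARI) to compare labelings, and the domain-review and monitoring steps are noted in comments because they happen outside the code:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=1000, centers=4, cluster_std=0.8, random_state=0)

def fit(k, data):
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)

# 1. Broad search with CH (fast, higher is better).
ch = {k: calinski_harabasz_score(X, fit(k, X)) for k in range(2, 11)}
top5 = sorted(ch, key=ch.get, reverse=True)[:5]

# 2. Narrow with DBI (lower is better).
dbi = {k: davies_bouldin_score(X, fit(k, X)) for k in top5}
top2 = sorted(dbi, key=dbi.get)[:2]

# 3. Validate with Silhouette (slowest, most granular).
sil = {k: silhouette_score(X, fit(k, X)) for k in top2}
best_k = max(sil, key=sil.get)

# 4. Domain validation happens with stakeholders, outside this script.

# 5. Stability: refit on a bootstrap resample and compare labelings on
#    the shared points; ARI is permutation-invariant, so relabeled but
#    identical clusters still score near 1.
idx = rng.choice(len(X), size=len(X), replace=True)
ari = adjusted_rand_score(fit(best_k, X)[idx], fit(best_k, X[idx]))

# 6. In production, ch/dbi/sil would be recomputed on fresh data and
#    tracked over time to detect drift.
```

The ordering matters: each stage is more expensive than the last, so cheap metrics prune the search space before expensive ones run.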
FAQ
What is the best clustering evaluation metric?
There is no single best clustering metric. The Silhouette Score provides the most granular per-sample analysis but is computationally expensive. The Calinski-Harabasz Index is the fastest to compute. The Davies-Bouldin Index is a balanced middle option. All three favor convex, spherical clusters and can mislead when evaluating density-based or non-convex clusterings. The best approach is using multiple metrics together and validating against domain-specific criteria to confirm clusters are both statistically valid and business-meaningful.
What does a Silhouette Score of 0.5 mean?
A Silhouette Score of 0.5 indicates moderate cluster structure. Points are somewhat closer to their assigned cluster than to neighboring clusters, but the separation isn't strong. Scores above 0.7 indicate strong clustering, 0.5-0.7 suggests reasonable but potentially improvable clusters, 0.25-0.5 indicates weak or overlapping boundaries, and below 0.25 suggests no meaningful cluster structure. The score ranges from -1 (points are in the wrong cluster) to 1 (points are perfectly matched to their cluster and far from neighbors).
How do I evaluate DBSCAN clustering quality?
Standard metrics like Silhouette Score, Davies-Bouldin, and Calinski-Harabasz systematically penalize DBSCAN results because they favor convex, spherical clusters. For density-based algorithms, use the Density-Based Cluster Validation (DBCV) metric which evaluates cluster quality based on density rather than centroid distance. If using traditional metrics, exclude noise points (label -1) from the calculation since they artificially depress scores. Visual inspection of clusters in reduced dimensions (t-SNE, UMAP) remains essential for validating density-based results.
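The noise-exclusion step is a simple mask over the label array. A minimal sketch (the dataset, eps, and min_samples are illustrative; `-1` is scikit-learn's DBSCAN noise label):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Fixed, well-separated centers so the example is deterministic.
X, _ = make_blobs(n_samples=500, centers=[[-5, 0], [0, 5], [5, 0]],
                  cluster_std=0.5, random_state=3)
labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(X)

# Exclude noise points (label -1) so they don't artificially depress
# the score; guard against fewer than two clusters remaining.
mask = labels != -1
kept = set(labels[mask])
score = silhouette_score(X[mask], labels[mask]) if len(kept) > 1 else None
```

The guard matters in practice: an overly tight `eps` can leave a single cluster (or none), and `silhouette_score` requires at least two labels.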
How does Boundev approach ML clustering projects?
Boundev places senior ML engineers who build production clustering systems with multi-metric evaluation frameworks. Our teams implement the full pipeline: data preprocessing, algorithm selection, hyperparameter tuning with Calinski-Harabasz for broad search and Silhouette Score for final validation, domain expert review, stability testing through bootstrap analysis, and continuous monitoring in production. We embed ML specialists through staff augmentation in 7-14 days for customer segmentation, anomaly detection, recommendation systems, and other clustering use cases.
