Motivation
Classical clustering methods K-means and K-medoids find cluster centers by initially assigning cluster center(s) and iteratively update it(them). Because every data point is assigned to the nearest cluster center(using distance), K-means and K-medoids can not detect nonspherical clusters. If we want to detect clusters with an arbitrary shape, we should perform density-based approaches. Although density-based spatial clustering of applications with noise(DBSCAN) can detect arbitrary shapes, we need to manually setting the density threshold, which is non-trivial. As for mean-shift clustering method, a cluster is defined as local maxima in the density of data points, which allows the findings of nonspherical clusters but requires embedding the data.
In the paper, the authors propose a new simple and efficient algorithm, which combines the advantages of the above cluster methods, but also overcomes their shortcomings.
Main algorithms
The algorithm is based on two principles, which reaches on the definition of a cluster.
- Cluster centers have higher local density, which means they are surrounded by neighbors with lower local density;
- Cluster centers are at a relatively large distance from points with a higher local density.
Computing local density ρ_i
In the paper, the authors use two different methods to calculate the local density. In most of the datasets, using cutoff distance:
where χ(x) = 1 if x<0 and χ(x) = 0 otherwise, and d_c is a cutoff distance. This algorithm is sensitively only to the relative magnitude of ρi, but robust to the choice of d_c.
When applying approach on Olivetti Face Database, the authors use a Guassian kernal to calculate distance:
And we can see that the first one is discrete value, the second one is continuous value, which means the former one is more easy to conflict.
Calculating distance δ_i
δ_i is measured by the minimum distance between point i and other points with higher local density.
For the point with highest density, take δ_i=maxj(dij). Here we may jump in such dilemma, maybe point i and point j has the same density, and they happen to be in the same cluster. If we using the following cluster procedure, we may choose both two point as cluster centers, and takes one cluster apart.
When I search some relevant information online, I find a better understanding of δ_i, which defines as followed:
Suppose {q_i }_(i=1)^N represents the descending index order of {ρ_i }_(i=1)^N, which mean it satisfies:
Cluster procedure
In order to choose the cluster center, the authors find such observation: the points of high δ and relatively high ρ are the cluster centers. In order to better visualize the result, they create a decision graph, which shown as followed:
From the figure, we can clearly see that point 1 and 10 are clusters. In the paper, the isolated points are 26,27,28. (But in the paper, the authors did not clearly propose an exact way to find out the outliers, which I think needed consideration.) But one thing we should mention is that we define the cluster center in a qualitative analysis but not a quantitative analysis, which would cause puzzles.
When we see figure A, the number of cluster centers is not clear and some may think different points as cluster centers. So as to determine the number of cluster centers, the authors gave a metric value:
In figure B, γ is pointed out. We can see that the higher γ is morely like the cluster center. And there is a clear different between cluster center and other points: the latter is relatively smooth, and there is a clear jump between cluster center and other points. In this way, we can see there are nine center clusters in the above figure.
Contribution
This paper is the first one which combines local density and distance together. Besides, the algorithm it propose is easy and feasible, which I believe has inspired a lot of people.
Although from my perspective, it has some small problems. But this bran-new idea deserves our respect.
My Idea
There are three points that I feel confused:
First, how to find the outliers? In this part the authors did not mention about it, maybe because this is not their main point.
Second, in the paper, the authors did not clearly how they elaborate the distance between two points, which means there is not specific distance formula.
Third, maybe the last metric value γ, we can normalize ρ and δ.
Summary
In the paper, the authors propose a simple and robust algorithm of the choice of cluster centers. It is easy for us to understand and know how to realize. But the paper still has some questions to be solved.
First, the authors declare two variables we need to calculate, then tell us how to use them to find the cluster center. In the experimental part, their algorithm shows great results on real datasets.