Unsupervised machine learning algorithms are widely used, and clustering is one of them. Generally, the purpose of clustering is to discover labels for unlabeled cases. Today, I'm going to demonstrate two clustering algorithms and how to evaluate their performance.
To do so, I use the seeds data set from the UCI Machine Learning Repository as an example.
In order to cluster wheat according to measurements of the geometrical properties of its kernels, we present two unsupervised learning techniques, k-means clustering and hierarchical clustering, to see which one is the better fit in this case.
How it works:
K-means clustering basically iterates over two steps: computing the distance from every point to each cluster center and assigning each point to its nearest center, then recalculating the cluster centers, until no data point changes group. Thus, how we define the distance between points becomes important; different distance functions may produce different results.
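The loop described above can be sketched in plain NumPy. This is a minimal illustration on synthetic two-blob data, not the seeds set, and the initialization and stopping rule are simplified assumptions:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: assign each point to its nearest center, recompute centers."""
    rng = np.random.default_rng(seed)
    # Simplified initialization: pick k distinct data points as starting centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: Euclidean distance from every point to every center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # no point changed group: stop
            break
        centers = new_centers
    return labels, centers

# Two well-separated synthetic blobs; k-means should split them cleanly.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centers = kmeans(X, k=2)
```

Production implementations add smarter seeding (e.g. k-means++) and handle empty clusters, but the assign/update cycle is the whole algorithm.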
Once we set a distance function, we have to consider data normalization. Generally, we want to apply a distance function to data whose attributes carry even weight. To illustrate, suppose we calculate the distance between two data points whose axes are annual income in US dollars and age. Income commonly ranges from 15,000 to 150,000, while age ranges from 0 to 100. When we aggregate the distances along each axis, the total will be dominated by income, since income is always numerically much larger than age. Hence, for a fair result, we can normalize all data attributes to a comparable scale before calculating distances.
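A quick numeric sketch of this domination effect, using made-up (income, age) pairs and min-max normalization:

```python
import numpy as np

# Hypothetical (annual income, age) data points.
X = np.array([[40_000, 25],
              [41_000, 70],
              [120_000, 30]], dtype=float)

# Min-max normalization: rescale every attribute into [0, 1].
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The raw distance between the first two people is dominated by the
# $1,000 income gap; the 45-year age gap barely registers.
raw = np.linalg.norm(X[0] - X[1])               # ~1001
scaled = np.linalg.norm(X_norm[0] - X_norm[1])  # ~1.0, now driven by age
```

After normalization, the 45-year age difference finally outweighs a relatively tiny income difference, which matches our intuition about which pair of people is more similar.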
Also, the number of clusters (k) is another thing we should consider. Once we have set the distance function and normalized the data, we want to guarantee that the derived clusters somehow reflect their meaning in the real world. To acquire the best k, we can either rely on life experience or business sense telling us which k has real meaning, or monitor how the clustering changes as k grows. The method we use here is to track the sum of errors (SE) to each corresponding cluster center and find the point (k) where the SE stops decreasing steeply. The idea is that since the SE will ultimately reach 0 as k grows (every data point in its own cluster), we want a k that captures only the big structures of the data, without fitting its noise.
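This knee-finding procedure can be sketched with `scipy.cluster.vq.kmeans` on synthetic three-blob data (an illustrative assumption, not the seeds set; SciPy reports the mean distance to the nearest centroid rather than a sum, but the knee shows up the same way):

```python
import numpy as np
from scipy.cluster.vq import kmeans

# Synthetic stand-in data: three well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, (70, 2)) for c in (0.0, 3.0, 6.0)])

# Track the error for k = 1..6; SciPy's kmeans returns the mean
# distance from each point to its nearest centroid as the second value.
errors = {}
for k in range(1, 7):
    _, distortion = kmeans(X, k, seed=0)
    errors[k] = distortion

# The knee: the drop from k=2 to k=3 is large; after k=3 the curve flattens.
```

Plotting `errors` against k reproduces the "Total Error vs. Number of Clusters" curve; the knee sits where adding one more cluster stops buying a steep drop.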
In this case, we choose Euclidean distance, the most common one in practice.
Find the knee:
In the “Total Error vs. Number of Clusters” graph, we can see the slope becomes less steep when the number of clusters reaches 4 (where the knee is). It is not that obvious in the graph I created, but we can verify it by observing a similar approach with a different criterion. These plots show that as we increase the number of clusters up to 4, we get a substantial decrease in error; after that point, we are only fitting noise. Therefore, based on the “Total Error vs. Number of Clusters” graph, we choose k = 4.
To normalize or not to normalize:
Normalization of the attributes does influence the clustering result, and we can observe this in two ways. First, we can visualize the clustering along the first two principal components (PCA). The plot shows that clustering on normalized data does a better job than clustering on non-normalized data. Second, from intrinsic evaluation, clustering on normalized data has higher accuracy, which indicates a better result.
The reason for the difference between clustering on normalized and non-normalized data is that the k-means algorithm depends mostly on distance. If the attributes are not normalized into a similar range, the attributes with larger numbers can dominate the calculated distance and bias the result. To illustrate, height and weight are on different scales; measured in cm and kg, height is generally numerically larger than weight, so any arithmetic on the pair leans mostly toward the larger number, height in this case. To solve the problem, we normalize all attributes, forcing them into the same range such as [0, 1], so that we get a fair result.
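To see the effect end to end, here is a sketch that clusters the same synthetic data with and without min-max normalization (the data and the purity measure are illustrative assumptions, not the seeds set):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Two synthetic groups: the first feature is large-scale noise, while
# the real separation lives entirely in the small second feature.
rng = np.random.default_rng(0)
group_a = np.column_stack([rng.normal(50_000, 5_000, 100), rng.normal(2.0, 0.2, 100)])
group_b = np.column_stack([rng.normal(50_000, 5_000, 100), rng.normal(8.0, 0.2, 100)])
X = np.vstack([group_a, group_b])

def minmax(data):
    # Rescale every attribute into [0, 1].
    return (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

_, raw_labels = kmeans2(X, 2, minit='++', seed=0)
_, norm_labels = kmeans2(minmax(X), 2, minit='++', seed=0)

# Purity of the first group under each clustering: raw distances are
# dominated by the noisy first feature, so the true split is lost,
# while the normalized run recovers it almost perfectly.
raw_acc = max(np.mean(raw_labels[:100] == 0), np.mean(raw_labels[:100] == 1))
norm_acc = max(np.mean(norm_labels[:100] == 0), np.mean(norm_labels[:100] == 1))
```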
How it works:
Hierarchical clustering basically iterates, pairing up the two closest clusters (every point is its own cluster in the beginning), until only one mass cluster remains.
How we define “closest” depends on which linkage method we use. There are many linkage methods; in this example, we use two of them, single linkage and complete linkage, which measure the distance between the two nearest points and between the two farthest points of two different clusters, respectively.
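Both linkage criteria are available in SciPy; a minimal sketch on synthetic two-blob data (an assumption, not the seeds set):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two compact synthetic blobs; we cut the tree to recover 2 flat clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(4, 0.3, (30, 2))])

# Agglomerative clustering under each linkage criterion.
single_tree = linkage(X, method='single')      # distance of the two nearest points
complete_tree = linkage(X, method='complete')  # distance of the two farthest points

# Cut each tree into exactly 2 clusters.
single_labels = fcluster(single_tree, t=2, criterion='maxclust')
complete_labels = fcluster(complete_tree, t=2, criterion='maxclust')
```

On blobs this clean, both linkages succeed; the differences described next only emerge on messier shapes.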
We can easily notice that, in this case, complete linkage performs much better than single linkage. That is because, by definition, single linkage looks for chains while complete linkage looks for compact, spherical units.
Furthermore, when we have to choose the number of clusters, we can do so by examining the dendrogram to see the hierarchical structure.
In the results, based on the performances and dendrograms, we found that single linkage had difficulty dividing the whole data set into individual clusters, but complete linkage seems to do a pretty good job here: it successfully divides the data points into 3 major clusters.
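Reading the number of clusters off the tree can also be done programmatically; here is a sketch using SciPy's dendrogram utilities on synthetic three-blob data (an illustrative assumption, not the seeds set):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Three synthetic blobs standing in for the three wheat varieties.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (20, 2)) for c in (0.0, 4.0, 8.0)])

Z = linkage(X, method='complete')

# Inspect the tree without plotting: the third column of Z holds the
# height of each merge, and a few towering final merges signal the
# number of well-separated clusters (here, the last two merges sit far
# above all the within-blob merges, suggesting 3 clusters).
tree = dendrogram(Z, no_plot=True)
merge_heights = Z[:, 2]
```

Calling `dendrogram(Z)` without `no_plot=True` draws the familiar tree with matplotlib; the big vertical gaps in that picture are exactly the jumps in `merge_heights`.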
Time complexity is always a good criterion for assessing an algorithm. As we know, when n grows large, k-means clustering runs in roughly linear time, whereas hierarchical clustering runs in quadratic time. In this case, however, it is a tie, because n is only 210.
On the other hand, thanks to the intrinsic evaluation (based on performance metrics), we found that for both k=3 and k=4, k-means clustering has higher accuracy than hierarchical clustering.
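One simple way to compute such an accuracy score, when true labels are available, is to map each cluster to its majority label and count the matches. A sketch on synthetic labeled data (the data and the majority-vote mapping are assumptions for illustration, not necessarily the exact metric used above):

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for the seeds data: 3 groups with known true labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (70, 2)) for c in (0.0, 3.0, 6.0)])
true = np.repeat([0, 1, 2], 70)

def clustering_accuracy(pred, truth, n_classes):
    """Map each cluster to its majority true label, then score the match."""
    correct = 0
    for c in np.unique(pred):
        members = truth[pred == c]
        correct += np.bincount(members, minlength=n_classes).max()
    return correct / len(truth)

_, km_labels = kmeans2(X, 3, minit='++', seed=0)
hc_labels = fcluster(linkage(X, method='complete'), t=3, criterion='maxclust')

km_acc = clustering_accuracy(km_labels, true, 3)
hc_acc = clustering_accuracy(hc_labels, true, 3)
```

Because cluster IDs are arbitrary (cluster 0 might correspond to variety 2), some such relabeling step is needed before any accuracy comparison between methods is meaningful.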
These analyses show that different clustering methodologies, and whether or not we normalize the data, can lead to different results. We need to be aware of this and carefully choose an appropriate clustering approach according to the properties of the data set. Generally, the k-means algorithm is better with continuous variables, capable of dealing with large data sets, and produces a flat, global partition of the data into clusters. In contrast, the hierarchical algorithm is better at finding hierarchical patterns in the data, building up locally optimized clusters.
In this case, I prefer k-means to hierarchical clustering, since we did not see any hierarchical pattern in the data set and k-means clustering had the better performance.