Introduction to Clustering
With clustering, we can segment groups based on similarity in behaviour in areas such as marketing, customer retention, disease diagnosis, online shopping, music listening, social media activity, and document classification.
Clustering is the task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups (clusters). Clustering deals with grouping and structuring unlabelled data, where similar data points are placed in the same cluster. The notion of similarity or matching between data points therefore plays an important role in clustering.
Four types of Clustering algorithms:
1. Partitioning Methods
The partitioning methods usually operate using a central tendency to represent a structure and use a distance-based approach to create the clusters. The most well-known algorithms are:
1.1 K-means Algorithm
1.2 K-median Algorithm
2. Hierarchical Methods
The hierarchical algorithms are so called because they build tree-like structures to form clusters, and they also use a distance-based approach. The most well-known algorithms are:
2.1 Agglomerative Hierarchical clustering
2.2 Divisive Hierarchical clustering
3. Density-Based Methods
These methods identify dense regions and low-density regions and create clusters accordingly, which allows arbitrarily shaped clusters. The most well-known algorithm is:
3.1 DBSCAN Algorithm
4. Expectation Maximisation (EM) Clustering
In EM clustering, each cluster is described by a probability density function (typically a Gaussian distribution), so a cluster is represented by the mean and covariance matrix of the points in it. Every point is softly assigned to all clusters using a probability distribution. The most well-known algorithm is:
4.1 Gaussian Mixture Model (GMM) Clustering
1.1 K-Means Clustering
K-Means is probably the most popular clustering algorithm because of its simplicity and its ability to scale. The user decides the number of resulting clusters (denoted K). K points are randomly assigned to be the cluster centers. From there, the algorithm assigns every other point in the dataset to the cluster whose center has the minimal Euclidean distance to that point. The cluster centers are then recalculated as the average of the coordinates of each cluster's points. The algorithm reassigns every point to its closest center and repeats the process until the clusters converge and no longer change.
Note that because of the random initialisation, results may depend on which points are randomly selected to initialise the clusters. Most implementations therefore provide the ability to run the algorithm multiple times with different “random starts” and select the clustering that minimises the sum of squared errors (the inertia) between the points and their cluster centers.
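As a sketch of how this looks in practice, here is a minimal scikit-learn example (the dataset and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative toy dataset: 300 points drawn around 4 centers.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# n_init controls the number of random starts; scikit-learn keeps the
# run with the lowest inertia (sum of squared distances to centers).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.shape)  # one center per cluster
print(kmeans.inertia_)                # SSE of the best run
```

`fit_predict` both fits the centers and returns a cluster label for each point.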
How to Choose the Right K Number
Choosing the right number of clusters is one of the key points of the K-Means algorithm. There are several ways to find this number: field knowledge, business decisions, and the elbow method.
For smaller datasets, k can be selected as k ≈ (N/2)^(1/2), where N is the size of the dataset. However, the elbow method is usually preferred, as it backs the decision with an analysis of the data.
The elbow method works on the sum of squared errors (SSE) principle. We run the algorithm for a range of cluster counts and record the SSE obtained for each. Plotting SSE against the number of clusters produces an elbow-shaped curve.
If the elbow appears at 4, then 4 is the optimal number of clusters. Beyond the elbow the SSE keeps decreasing, but only at the cost of creating many more clusters, which is not desired. There is therefore a trade-off between the decreasing SSE and the number of clusters created.
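The elbow computation can be sketched with scikit-learn as follows (dataset and range of k are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative dataset with 4 underlying blobs.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Record the SSE (inertia) for a range of candidate cluster counts.
sse = {}
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_

# Plotting sse against k (e.g. with matplotlib) reveals the elbow;
# past it, extra clusters reduce the SSE only marginally.
for k, s in sse.items():
    print(k, round(s, 1))
```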
1. Decide on K clusters (elbow point)
2. Measure the distance from each point to each centroid
3. Group points based on minimum distance
4. Reposition the centroids
5. Repeat steps 2–4 until convergence is reached.
Advantage of using K-means
- Use cases where we have even cluster sizes, flat geometry and not too many clusters
- Easily adaptable; the algorithm scales well
- Works well on small or large datasets, though we need to choose the number of clusters
Disadvantage of using K-Means
- K-Means tends to cluster around categorical variables (because of their relatively high variance in a normalised dataset)
- Outliers are not identified and can skew the cluster centers
- Suffers when there is noise in the data
- Even though it reduces intra-cluster variance, it can get stuck in local minima
- Difficult to predict the best K value
1.2 K-Medians Clustering
As mentioned above, the presence of numerous outliers may significantly hinder the algorithm's effectiveness. K-Means uses the mean of the points in a cluster to calculate that cluster's center. However, the mean is not a robust metric, so the presence of outliers will skew the centers toward them.
K-Medians is based on the same algorithm as K-Means, with the difference that instead of calculating the mean of the coordinates of all the points in a given cluster, it uses the median. As a result, the clusters become denser and more robust to outliers.
Advantage of using K-Medians
- Creating tight/dense clusters that are robust to outliers
Disadvantage of using K-Medians
- Requires the use of PyClustering or custom code to implement
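Since scikit-learn does not ship a K-Medians estimator, a minimal custom sketch can serve as a starting point. This version assumes Manhattan distance, which pairs naturally with the median; the toy data and seed are illustrative:

```python
import numpy as np

def k_medians(X, k, n_iter=100, seed=0):
    """Minimal K-Medians sketch: identical in structure to K-Means,
    except centers are per-coordinate medians, which resist outliers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each point to its nearest center (Manhattan distance).
        dists = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update each center to the coordinate-wise median of its points.
        new_centers = np.array([np.median(X[labels == j], axis=0)
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Hypothetical toy data: two small square-shaped blobs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
centers, labels = k_medians(X, k=2)
print(sorted(centers.tolist()))  # one median center per blob
```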
2.1 Agglomerative Hierarchical clustering
This method does not need the number of clusters to be specified beforehand; instead, it chooses its clusters using dendrograms. A dendrogram is a tree plot that records the order and distance at which clusters are merged. Essentially, at the beginning of the process, each data point is in its own cluster. Using a dissimilarity function, the algorithm finds the two points in the dataset that are most similar and clusters them together.
To stop a single all-encompassing cluster from being formed, a dendrogram criterion is generally used: find the longest vertical edge in the dendrogram that no extended horizontal merge line crosses, and draw a horizontal cut through it as the minimum distance criterion. Each branch that crosses this line becomes a cluster in the final model.
Agglomerative Hierarchical Clustering Steps
Step 1: Make each data point a single-point cluster → That forms N clusters.
Step 2: Take the two closest data points and make them one cluster → That forms N-1 clusters
Step 3: Take the two closest clusters and make them one cluster → That forms N-2 clusters
Step 4: Repeat Step 3 until there is only one cluster.
To merge clusters, we also need a distance measure between clusters, a.k.a. the linkage criterion. There are five main linkage measures:
- Single-Linkage: the distance between the closest objects in two clusters.
- Complete-Linkage: the distance between the farthest objects in two clusters.
- Average-Linkage: an average of all pairwise distances between objects in two clusters.
- Centroid-Linkage: the distance between the centroids of two clusters.
- Ward’s distance: the increase in total within-cluster variance that results from merging two clusters.
Just like choosing a distance metric, choosing linkage criteria should depend on the domain knowledge of the application.
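A short scikit-learn sketch illustrates how the linkage criterion is selected (the dataset and cluster count are illustrative):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Illustrative dataset with 3 underlying groups.
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# linkage may be "ward", "complete", "average" or "single";
# different criteria can yield very different clusterings.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(len(set(labels)))
```

For visualising the merge tree itself, `scipy.cluster.hierarchy.dendrogram` can plot the dendrogram from a linkage matrix.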
Advantage of using Hierarchical Clustering
- Use cases where we have categorical data
- Especially potent when the dataset contains real hierarchical relationships
- Flexibility because of the different dissimilarity functions available (i.e. complete linkage, single linkage, average linkage, minimum variance, etc.) that each give very different results
Disadvantage of using Hierarchical Clustering
- Sensitive to noise and outliers
- Computationally intensive. As the amount of data grows, it becomes very time- and resource-intensive, since it processes each data point iteratively, looping through the entire dataset every time.
3.1 DBSCAN Clustering
Density-based clustering methods group points according to the density of the regions they occupy. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) uses this density-based approach rather than the distance-based one of K-Means and hierarchical clustering.
DBSCAN isolates clusters of high density from regions of low density. To separate clusters based on their densities, DBSCAN starts by considering the data in its n dimensions. For each point, the algorithm forms a shape (neighbourhood) around the point and counts the number of other observations falling into it. DBSCAN iteratively expands a cluster until no more points lie within a certain distance, noted epsilon (a parameter specified to the model).
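A hedged scikit-learn sketch; the two-moons dataset, epsilon value, and minimum-points value are illustrative choices:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape K-Means cannot separate well.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# eps is the neighbourhood radius (epsilon); min_samples is the number
# of points required inside it for a point to count as a core point.
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

# Points labelled -1 are treated as noise/outliers.
print(set(db.labels_.tolist()))
```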
Advantage of using DBSCAN
- We don’t need to specify the number of clusters; DBSCAN finds them based on how densely packed the points are and the other inputs we pass to it.
- Flexibility in the shapes and sizes of clusters: DBSCAN is not limited to a certain shape and can trace out densely packed points into practically any shape, as long as the density is continuous.
- Able to deal with noise and to isolate outliers.
Disadvantage of using DBSCAN
- Border points reachable from two clusters are assigned to whichever cluster finds them first; since points are visited in an arbitrary order, DBSCAN is not guaranteed to return the same clustering in this case.
- Faces difficulty when finding clusters with varying densities.
- Struggles with high-dimensional data.
- Highly sensitive to its two parameters: eps and the minimum number of points.
4.1 Gaussian Mixture Model (GMM) Clustering
Gaussian Mixture Models are probabilistic models that assume all samples are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. GMM belongs to the group of soft clustering algorithms, in which every data point belongs to every cluster in the dataset, but with a different level of membership to each. This membership is expressed as the probability of belonging to a given cluster, ranging from 0 to 1.
GMM is one of the most advanced clustering methods that we will study in this series. It assumes that each cluster follows a probability distribution, typically a Gaussian (normal) distribution. It is a generalisation of K-Means clustering that incorporates the covariance structure of the data as well as the centers of the latent Gaussians.
Gaussian Mixture Model (GMM) Clustering Steps
- Initialize K Gaussian distributions, with µ (mean) and σ (standard deviation) values either taken from the dataset (naive method) or obtained by applying K-Means.
- Soft-cluster the data: this is the ‘Expectation’ phase, in which every data point is assigned to every cluster with its respective level of membership.
- Re-estimate the Gaussians: this is the ‘Maximization’ phase, in which the expectations are used to calculate new parameters for the Gaussians: new µ and σ.
- Evaluate the log-likelihood of the data to check for convergence. The higher the log-likelihood, the better the mixture model fits the dataset, so this is the function to maximize.
- Repeat from step 2 until convergence.
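The steps above are what scikit-learn's `GaussianMixture` runs internally; a minimal sketch with illustrative data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Illustrative dataset with 3 underlying groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# GaussianMixture runs EM: the E-step soft-assigns points, the M-step
# re-estimates means and covariances, until the log-likelihood converges.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft membership: each row sums to 1 across the 3 components.
probs = gmm.predict_proba(X)
print(probs.shape)
print(gmm.converged_)
```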
Advantage of using GMM
- It is a soft-clustering method, which assigns sample memberships to multiple clusters; this makes it one of the fastest algorithms for learning mixture models
- There is high flexibility in the number and shape of the clusters
Disadvantage of using GMM
- Sensitive to initialisation values
- Possible to converge to a local optimum
- Slow convergence rate.
- Handling large amounts of data (doesn’t scale well)
To conclude, there exists an incredible number of clustering algorithms, even beyond those rapidly covered in this article, such as Affinity Propagation (AP) and Mean-Shift. The vast majority of them are readily accessible via open-source Python libraries (e.g. Scikit-learn’s diverse options). The choice between them is not always evident, but each technique has specific pros and cons and should be leveraged accordingly. The nature of the data (size, underlying structure, etc.) decides which clustering algorithm is best, and it is the user who should supply the clustering criterion, in such a way that the result of the clustering suits their needs.