Added: Desha Toomer - Date: 15.11.2021 03:46 - Views: 27067 - Clicks: 7989
Clustering is one of the most common exploratory data analysis techniques used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different.
In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean-based distance or correlation-based distance. The decision of which similarity measure to use is application-specific. Clustering analysis can be done on the basis of features, where we try to find subgroups of samples based on features, or on the basis of samples, where we try to find subgroups of features based on samples. We only want to investigate the structure of the data by grouping the data points into distinct subgroups.
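To make the difference between the two similarity measures concrete, here is a minimal sketch. The helper names `euclidean_distance` and `correlation_distance` and the toy vectors are assumptions made for illustration and are not part of the original post.

```python
import numpy as np


def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return float(np.linalg.norm(a - b))


def correlation_distance(a, b):
    """1 - Pearson correlation: small when two profiles move together."""
    return float(1.0 - np.corrcoef(a, b)[0, 1])


x = np.array([1.0, 2.0, 3.0, 4.0])
y = 10.0 * x        # same trend as x, much larger magnitude
z = x[::-1].copy()  # same magnitude as x, opposite trend

# x and y are far apart in Euclidean terms but perfectly correlated.
print("x vs y:", euclidean_distance(x, y), correlation_distance(x, y))
# x and z are close in Euclidean terms but perfectly anti-correlated.
print("x vs z:", euclidean_distance(x, z), correlation_distance(x, z))
```

Which of the two pairs counts as "similar" depends entirely on the measure, which is why the choice is application-specific.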
In this post, we will cover only Kmeans, which is considered one of the most used clustering algorithms due to its simplicity. The Kmeans algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group.
It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far) as possible. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
The way the kmeans algorithm works is as follows. The approach kmeans follows to solve the problem is called Expectation-Maximization. The E-step is assigning the data points to the closest cluster. The M-step is computing the centroid of each cluster. A mathematical breakdown of how we can solve it follows (feel free to skip it).
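Before the math, the two steps can be made concrete with a minimal NumPy sketch. This is an illustration under stated assumptions, not the author's implementation: the function name `kmeans`, the random initialization from the data points, the stopping rule and the synthetic two-blob data are all choices made here.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    """Plain kmeans: alternate the E-step and the M-step until centroids stop moving."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize centroids with k distinct data points picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # E-step: assign every point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and sum of squared distances against the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    sse = float(((X - centroids[labels]) ** 2).sum())
    return centroids, labels, sse


# Synthetic data: two well-separated blobs (an assumption, not the post's dataset).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(100, 2)),
])
centroids, labels, sse = kmeans(X, k=2)
print(centroids, sse)
```

The loop body mirrors the description above: the assignment line is the E-step and the mean computation is the M-step.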
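The objective function is:

$$J = \sum_{i=1}^{m} \sum_{k=1}^{K} w_{ik} \lVert x^{i} - \mu_k \rVert^2$$

where $m$ is the number of data points, $w_{ik} = 1$ if data point $x^{i}$ belongs to cluster $k$ and $w_{ik} = 0$ otherwise, and $\mu_k$ is the centroid of cluster $k$. We first minimize J w.r.t. $w_{ik}$ with $\mu_k$ fixed. Then we minimize J w.r.t. $\mu_k$ with $w_{ik}$ fixed. Technically speaking, we differentiate J w.r.t. $w_{ik}$ first and update the cluster assignments (E-step). Then we differentiate J w.r.t. $\mu_k$ and recompute the centroids (M-step). Therefore, the E-step is:

$$w_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_{j} \lVert x^{i} - \mu_j \rVert^2 \\ 0 & \text{otherwise} \end{cases}$$

And the M-step is:

$$\frac{\partial J}{\partial \mu_k} = -2 \sum_{i=1}^{m} w_{ik} \left( x^{i} - \mu_k \right) = 0 \quad \Rightarrow \quad \mu_k = \frac{\sum_{i=1}^{m} w_{ik} x^{i}}{\sum_{i=1}^{m} w_{ik}}$$

Which translates to recomputing the centroid of each cluster to reflect the new assignments. A few things to note here:
Then we will use the sklearn implementation, which is more efficient and takes care of many things for us.
The goal usually when we undergo a cluster analysis is either: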
The dataset has observations and 2 features. We will try to find K subgroups within the data points and group them accordingly. Below is the description of the features:
The above graph shows the scatter plot of the data colored by the cluster they belong to. We can think of those 2 clusters as the geyser having different kinds of behaviors under different scenarios.
The title of each plot will be the sum of squared distance of each initialization.
As a side note, this dataset is considered very easy and converges in less than 10 iterations. Therefore, to see the effect of random initialization on convergence, I am going to go with 3 iterations to illustrate the concept. However, in real world applications, datasets are not at all that clean and nice!
As the graph above shows, we only ended up with two different clusterings based on different initializations.
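The effect of random initialization can be sketched as follows. This uses scikit-learn's `KMeans` rather than the author's own implementation, and a synthetic `make_blobs` dataset rather than the dataset discussed above; the number of runs, the number of clusters and the seeds are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption): four Gaussian blobs.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

runs = []
for seed in range(9):
    # One random initialization per run, stopped after only 3 iterations.
    km = KMeans(n_clusters=4, init="random", n_init=1, max_iter=3, random_state=seed)
    km.fit(X)
    runs.append((float(km.inertia_), seed))
    print(f"seed={seed}  SSE={km.inertia_:.1f}")

# Keep the initialization that ended with the lowest sum of squared distances.
best_sse, best_seed = min(runs)
print("best seed:", best_seed, "SSE:", round(best_sse, 1))
```

In practice, passing `n_init` greater than 1 to `KMeans` performs this restart-and-keep-the-best loop internally.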
We would pick the one with the lowest sum of squared distance.
In the image compression application, for each pixel location we would have three 8-bit integers that specify the red, green, and blue intensity values. Our goal is to reduce the number of colors to 30 and represent (compress) the photo using those 30 colors only. Doing so will allow us to represent the image using the 30 centroids for each pixel and would significantly reduce the size of the image by a factor of 6.
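A minimal sketch of this color quantization idea is shown below. It assumes a small randomly generated RGB array in place of the photo (the original image is not available here), and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the photo (assumption): a 32x32 RGB image of 8-bit values.
rng = np.random.default_rng(0)
height, width = 32, 32
img = rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)

# Treat every pixel as one data point with 3 features (R, G, B).
pixels = img.reshape(-1, 3).astype(np.float64)
km = KMeans(n_clusters=30, n_init=10, random_state=0).fit(pixels)

# The 30 centroids act as the color palette; each pixel stores only its cluster index.
palette = km.cluster_centers_.round().astype(np.uint8)
compressed = palette[km.labels_].reshape(height, width, 3)

n_colors = len(np.unique(compressed.reshape(-1, 3), axis=0))
print("distinct colors after compression:", n_colors)
```

With a fixed-length encoding, an index into a 30-color palette needs 5 bits per pixel instead of the original 24, plus the small cost of storing the palette itself.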
From now on we will be using the sklearn implementation of kmeans. A few things to note here:
We can see the comparison between the original image and the compressed one. With a smaller number of clusters we would have a higher compression rate at the expense of image quality.
When choosing the number of clusters k, sometimes domain knowledge and intuition may help, but usually that is not the case.
In the cluster-predict methodology, we can evaluate how well the models are performing based on different K clusters since clusters are used in the downstream modeling.
We pick k at the spot where the SSE starts to flatten out, forming an elbow.
Silhouette analysis can be used to determine the degree of separation between clusters. For each sample:
The coefficient can take values in the interval [-1, 1].
Therefore, we want the coefficients to be as big as possible and close to 1 to have good clusters. Also, the thickness of the silhouette plot gives an indication of how big each cluster is. The plot shows that cluster 1 has almost double the samples of cluster 2. Moreover, the thickness of the silhouette plot started showing wide fluctuations.
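Both evaluation ideas, the SSE curve and the silhouette coefficient, can be computed with a few lines of scikit-learn. The sketch below assumes a synthetic `make_blobs` dataset and a candidate range of k from 2 to 6; neither comes from the original post. `silhouette_score` returns the mean coefficient over all samples.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption): three compact blobs.
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.6, random_state=1)

sse = {}
sil = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = float(km.inertia_)                     # sum of squared distances to centroids
    sil[k] = float(silhouette_score(X, km.labels_)) # mean silhouette coefficient
    print(f"k={k}  SSE={sse[k]:.1f}  silhouette={sil[k]:.3f}")

# Elbow: look for the k after which the SSE curve flattens out.
# Silhouette: prefer the k with the highest average coefficient.
best_k = max(sil, key=sil.get)
print("k with the highest average silhouette:", best_k)
```

The two criteria do not always agree, so it is reasonable to inspect both before settling on k.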
The Kmeans algorithm is good at capturing the structure of the data if clusters have a spherical-like shape. It always tries to construct a nice spherical shape around the centroid. That means, the minute the clusters have complicated geometric shapes, kmeans does a poor job of clustering the data. Below is an example of data points on two different horizontal lines that illustrates how kmeans tries to group half of the data points of each horizontal line together.
One group will have a lot more data points than the other two combined. To make the comparison easier, I am going to plot first the data colored based on the distribution it came from. Then I will plot the same data but now colored based on the clusters they have been assigned to. Since kmeans tries to minimize the within-cluster variation, it gives more weight to bigger clusters than smaller ones. In other words, data points in smaller clusters may be left away from the centroid in order to focus more on the larger cluster.
However, we can help kmeans perfectly cluster these kinds of datasets if we use kernel methods. The idea is that we transform the data to a higher-dimensional representation that makes the data linearly separable (the same idea that we use in SVMs).
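The weakness on non-spherical clusters can be checked numerically. The sketch below assumes concentric circles from `make_circles` as the test data and compares kmeans with scikit-learn's `SpectralClustering` using a nearest-neighbors affinity; the parameter values are illustrative choices, and scikit-learn may warn that the neighbor graph is not fully connected.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Two concentric rings: clusters that are clearly not spherical blobs.
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", n_neighbors=10, random_state=0
).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true rings,
# values near 0 mean the grouping is no better than chance.
km_ari = adjusted_rand_score(y, km_labels)
sc_ari = adjusted_rand_score(y, sc_labels)
print(f"kmeans ARI:   {km_ari:.3f}")
print(f"spectral ARI: {sc_ari:.3f}")
```

Kmeans splits the plane with a straight boundary, so it cannot separate an inner ring from an outer one, while a graph-based method can follow the shape of each ring.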
Different kinds of algorithms work very well in such scenarios, such as SpectralClustering.
Kmeans clustering is one of the most popular clustering algorithms and usually the first thing practitioners apply when solving clustering tasks to get an idea of the structure of the dataset. The goal of kmeans is to group data points into distinct non-overlapping subgroups. It does a very good job when the clusters have roughly spherical shapes. However, it suffers as the geometric shapes of the clusters deviate from spherical shapes. This will help you decide when to use each method and under what circumstances. In this post, we covered the strengths, weaknesses, and some evaluation methods related to kmeans.
Below are the main takeaways:
The notebook that created this post can be found here. Originally published at imaddabbura.
Imad Dabbura.