Implementation of K-Means Clustering in Python
Introduction
In this blog, we will discuss the implementation of K-Means clustering in Python. K-Means clustering is a type of unsupervised learning, used when you have unlabeled data (i.e., data without defined categories or groups). The goal of the algorithm is to find groups in the data such that the points within each group are similar to one another and distinct from the points in other groups. The number of groups to look for is set by the value of K. The algorithm works iteratively: it assigns each data point to the cluster whose centre is closest to it, then recomputes each cluster centre as the mean of the points assigned to it, and repeats until the assignments stop changing.
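To make those two alternating steps concrete, here is a minimal from-scratch sketch of the main loop. It is only an illustration of the idea; the scikit-learn implementation used throughout this post additionally handles smart initialization, convergence checks, and edge cases such as empty clusters.

import numpy as np

def kmeans_sketch(X, k, n_iters=10, seed=0):
    # Pick k distinct samples at random as the initial centroids
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with the index of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it
        # (empty clusters are not handled here, for brevity)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids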
One of the benefits of K-Means clustering is that it is relatively simple to understand and implement. It also works on data with a large number of dimensions, which is not true of every clustering method. On the downside, K-Means can be sensitive to outliers, and the value of K (the number of clusters) must be specified before running the algorithm, which is not always obvious. Overall, K-Means clustering is a powerful tool for finding groups in data, as long as you keep these limitations in mind.
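As for choosing K, one common heuristic is the elbow method: fit K-Means for several values of K and look for the point where the inertia (the within-cluster sum of squared distances) stops dropping sharply. A rough sketch, assuming your samples are already in a NumPy array X like the ones used below:

from sklearn.cluster import KMeans

# Fit K-Means for K = 1..6 and record the inertia of each run
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Look for the "elbow": the K after which inertia improves only marginally
for k, inertia in zip(range(1, 7), inertias):
    print(k, round(inertia, 2))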
K-Means Clustering in Python
Let us start with a small custom dataset.
from sklearn.cluster import KMeans
import numpy as np

# Samples
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [5, 3], [7, 2]])

# Defining and fitting K-Means with two clusters
kmeans = KMeans(n_clusters=2).fit(X)

# Print the cluster allocation of each sample
print(kmeans.labels_)

# Predict clusters for unseen data
print(kmeans.predict([[0, 0], [12, 3]]))

# Print the cluster centres
print(kmeans.cluster_centers_)
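Note that K-Means starts from randomly chosen centroids, so the numbering of the clusters (and, on less well-separated data, the clusters themselves) can differ between runs. If you want reproducible output, you can fix the random seed, for example:

# Fixing the seed makes the centroid initialization, and hence the labels, repeatable
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)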
Visualization
import matplotlib.pyplot as plt

# Plot the two clusters in different colours and mark the centroids
fig, ax = plt.subplots(figsize=(6, 6))
plt.scatter(X[kmeans.labels_ == 0, 0], X[kmeans.labels_ == 0, 1],
            c='green', label='cluster 1')
plt.scatter(X[kmeans.labels_ == 1, 0], X[kmeans.labels_ == 1, 1],
            c='blue', label='cluster 2')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='*', s=300, c='r', label='centroid')
plt.legend()
K-Means on a Simple Dataset
You can download the dataset here: Dataset
First, upload the dataset.
# Upload file to Colab from your system
from google.colab import files

uploaded = files.upload()
# Converting to DataFrame
import pandas as pd
import io

df = pd.read_csv(io.BytesIO(uploaded['faithful.csv']))
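If you are not working in Colab, the upload step is unnecessary; you can read the CSV straight from disk instead (the file path below is just an example for a local setup) and peek at the first few rows:

import pandas as pd

# Read the file directly from a local path (adjust the path to your system)
df = pd.read_csv('faithful.csv')

# Inspect the first few rows to confirm the two columns used for clustering
print(df.head())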
Visualizing the Input Data
# Visualizing input data
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1])
plt.xlabel('Eruption time in mins')
plt.ylabel('Waiting time to next eruption')
plt.title('Visualization of raw data');
Using the built-in KMeans model from the scikit-learn library
# K-Means clustering
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize the data
X_std = StandardScaler().fit_transform(df)

# Fit scikit-learn's KMeans on the standardized data
km = KMeans(n_clusters=2)
km.fit(X_std)
centroids = km.cluster_centers_
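Because the model is fitted on standardized data, the centroids above are in standardized (z-score) units. If you would rather see them in the original units, one option is to keep a reference to the fitted scaler and invert the transform; this is a slight variation on the snippet above, not part of the original code:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Keep the fitted scaler so the standardization can be undone later
scaler = StandardScaler()
X_std = scaler.fit_transform(df)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_std)

# Centroids expressed in the original units (minutes) instead of z-scores
print(scaler.inverse_transform(km.cluster_centers_))

# Within-cluster sum of squared distances (lower means tighter clusters)
print(km.inertia_)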
Visualizing the Clustered Data
# Plot the clustered data
fig, ax = plt.subplots(figsize=(6, 6))
plt.scatter(X_std[km.labels_ == 0, 0], X_std[km.labels_ == 0, 1],
            c='green', label='cluster 1')
plt.scatter(X_std[km.labels_ == 1, 0], X_std[km.labels_ == 1, 1],
            c='blue', label='cluster 2')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            marker='*', s=300, c='r', label='centroid')
plt.legend()
plt.xlim([-2, 2])
plt.ylim([-2, 2])
plt.xlabel('Eruption time in mins')
plt.ylabel('Waiting time to next eruption')
plt.title('Visualization of clustered data', fontweight='bold')
ax.set_aspect('equal');
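As a quick sanity check on the choice of K = 2, you could also compute the silhouette score on the standardized data; values closer to 1 indicate better-separated clusters. This is an optional extra, not part of the original walkthrough:

from sklearn.metrics import silhouette_score

# Average silhouette over all samples; closer to 1 means better-separated clusters
score = silhouette_score(X_std, km.labels_)
print(round(score, 3))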
Also, read – What is KMeans Clustering and its Working