Implementation of K-Means Clustering in Python

Introduction

In this blog, we discuss the implementation of K-Means clustering in Python. K-Means clustering is a type of unsupervised learning, used when you have unlabeled data (i.e., data without defined categories or groups). The goal of the algorithm is to find groups in the data such that the points within a group are more similar to each other than to points in other groups. You choose the number of clusters, K, in advance. The algorithm then alternates between two steps: it assigns each data point to the cluster whose center (centroid) is closest to it, and it moves each centroid to the mean of the points assigned to it. These steps repeat until the assignments stop changing.
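To make the assign-and-update idea concrete, here is a minimal NumPy sketch of the procedure. It is only an illustration: the function name kmeans_sketch, its parameters, and the random initialization are assumptions made for this example, and it is not how scikit-learn implements K-Means internally.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two small, well-separated groups of points
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centroids, labels = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random starting points, so only the grouping itself is meaningful.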

 

One of the benefits of K-Means clustering is that it is relatively simple to understand and implement. It also scales reasonably well to large datasets, including data with many dimensions, which is not always the case with other clustering methods. One downside is that it is sensitive to outliers: each centroid is a mean, so a single extreme point can pull it away from the rest of its cluster. Additionally, the value of K (the number of clusters) needs to be specified before running the algorithm, and the right value is not always obvious. Overall, K-Means clustering is a useful tool for finding groups in data, as long as you keep these limitations in mind.
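One common heuristic for choosing K is the "elbow" method: fit the model for several values of K and look for the point where the inertia (the sum of squared distances from each point to its nearest centroid) stops dropping sharply. The sketch below uses synthetic data made up for this illustration, so the numbers are not from the dataset used later in this post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs (made up for this illustration)
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

# Fit K-Means for several values of K and record the inertia
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

for k, inertia in zip(range(1, 7), inertias):
    print(k, round(inertia, 1))
```

With three clear blobs, the inertia should fall steeply up to K=3 and only slightly after that. On real data the elbow is often less pronounced, so treat it as a guide rather than a rule.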

 

 

 

K-Means Clustering in Python

Let us start with a small custom dataset.

 

from sklearn.cluster import KMeans
import numpy as np

# Samples: eight 2-D points
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0], [5, 3], [7, 2]])

# Define and fit K-Means with two clusters
# (n_init and random_state are set so that repeated runs give the same result)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Print the cluster assigned to each sample
print(kmeans.labels_)

# Predict the cluster for unseen data
print(kmeans.predict([[0, 0], [12, 3]]))

# Print the cluster centres
print(kmeans.cluster_centers_)
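To see what predict is doing, the following sketch recomputes the assignment by hand: it measures the distance from each new point to each fitted centroid and picks the closest one. The names X_demo, model, and new_points are chosen for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

X_demo = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0], [5, 3], [7, 2]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

new_points = np.array([[0, 0], [12, 3]])

# Distance from every new point to every centroid, shape (2 points, 2 centroids)
distances = np.linalg.norm(
    new_points[:, None, :] - model.cluster_centers_[None, :, :], axis=2
)
nearest = distances.argmin(axis=1)

print(nearest)                     # nearest-centroid index computed by hand
print(model.predict(new_points))   # should match the line above
```

The two printed arrays should agree, since predict returns the index of the closest cluster centre for each sample.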

 

Visualization

 

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 6))
# Plot the points of each cluster in a different colour
ax.scatter(X[kmeans.labels_ == 0, 0], X[kmeans.labels_ == 0, 1], c='green', label='cluster 1')
ax.scatter(X[kmeans.labels_ == 1, 0], X[kmeans.labels_ == 1, 1], c='blue', label='cluster 2')
# Mark the cluster centres with red stars
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='*', s=300, c='r', label='centroid')
ax.legend()
plt.show()

 

 

 

K-Means on a Simple Dataset

 

You can download the dataset (faithful.csv) from the Dataset link.

 

First, upload the dataset.

# Upload the file to Colab from your system (this works only inside Google Colab)
from google.colab import files
uploaded = files.upload()

 

# Convert the uploaded bytes to a DataFrame
# (the key must match the name of the uploaded file)
import pandas as pd
import io
df = pd.read_csv(io.BytesIO(uploaded['faithful.csv']))

 

Visualizing the Input Data

 

#Visualizing Input Data

import matplotlib.pyplot as plt
plt.figure(figsize=(6, 6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1])
plt.xlabel('Eruption time in mins')
plt.ylabel('Waiting time to next eruption')
plt.title('Visualization of raw data');

 

 

 

Using the built-in KMeans model from the scikit-learn library

 

#K-Means Clustering
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize the data
X_std = StandardScaler().fit_transform(df)

# Fit scikit-learn's KMeans with two clusters
km = KMeans(n_clusters=2)
km.fit(X_std)
centroids = km.cluster_centers_
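Because the model is fitted on standardized data, the centroids above are in standardized units rather than minutes. If you keep a reference to the fitted scaler, you can map them back with inverse_transform. The sketch below uses made-up two-column data standing in for eruption and waiting times, and the names scaler, km_demo, and centroids_original are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up two-column data standing in for (eruption time, waiting time)
rng = np.random.default_rng(0)
short = rng.normal(loc=[2.0, 55.0], scale=[0.3, 6.0], size=(50, 2))
long = rng.normal(loc=[4.3, 80.0], scale=[0.4, 6.0], size=(50, 2))
data = np.vstack([short, long])

# Keep a reference to the fitted scaler so it can be reused later
scaler = StandardScaler()
data_std = scaler.fit_transform(data)

km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data_std)

# Map the centroids from standardized units back to the original units
centroids_original = scaler.inverse_transform(km_demo.cluster_centers_)
print(centroids_original)
```

The same pattern should apply to the code above if StandardScaler() is first assigned to a variable instead of being used inline.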

 

Visualizing the Clustered Data

 

# Plot the clustered data
fig, ax = plt.subplots(figsize=(6, 6))
plt.scatter(X_std[km.labels_ == 0, 0], X_std[km.labels_ == 0, 1],c='green', label='cluster 1')
plt.scatter(X_std[km.labels_ == 1, 0], X_std[km.labels_ == 1, 1],c='blue', label='cluster 2')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], marker='*', s=300, c='r', label='centroid')
plt.legend()
plt.xlim([-2, 2])
plt.ylim([-2, 2])
plt.xlabel('Eruption time (standardized)')
plt.ylabel('Waiting time to next eruption (standardized)')
plt.title('Visualization of clustered data', fontweight='bold')
ax.set_aspect('equal');

 

 

 


 
