Create a K-Means Clustering Algorithm from Scratch in Python (2024)

k-means clustering is an unsupervised machine learning algorithm that seeks to segment a dataset into groups based on the similarity of datapoints. An unsupervised model has independent variables and no dependent variables.

Suppose you have a dataset of 2-dimensional scalar attributes:

Create a K-Means Clustering Algorithm from Scratch in Python (3)

If the points in this dataset belong to distinct groups with attributes significantly varying between groups but not within, the points should form clusters when plotted.

Create a K-Means Clustering Algorithm from Scratch in Python (4)

Figure 1: A dataset of points with groups of distinct attributes.

This dataset clearly displays 3 distinct classes of data. If we seek to assign a new data point to one of these three groups, it can be done by finding the midpoint of each group (centroid) and selecting the nearest centroid as the group of the unassigned data point.

Create a K-Means Clustering Algorithm from Scratch in Python (5)

Figure 2: The data points are segmented into groups denoted with differing colors.

For a given dataset, k is specified to be the number of distinct groups the points belong to. These k centroids are first randomly initialized, then iterations are performed to optimize the locations of these k centroids as follows:

  1. The distance from each point to each centroid is calculated.
  2. Points are assigned to their nearest centroid.
  3. Centroids are shifted to be the average value of the points belonging to it. If the centroids did not move, the algorithm is finished, else repeat.

To evaluate our algorithm, we’ll first generate a dataset of groups in 2-dimensional space. The sklearn.datasets function make_blobs creates groupings of 2-dimensional normal distributions, and assigns a label corresponding to the group said point belongs to.

import seaborn as sns
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
centers = 5
X_train, true_labels = make_blobs(n_samples=100, centers=centers, random_state=42)
X_train = StandardScaler().fit_transform(X_train)
sns.scatterplot(x=[X[0] for X in X_train],
y=[X[1] for X in X_train],
hue=true_labels,
palette="deep",
legend=None
)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Create a K-Means Clustering Algorithm from Scratch in Python (6)

Figure 3: The dataset we will use to evaluate our k means clustering model.

This dataset provides a unique demonstration of the k-means algorithm. Observe the orange point uncharacteristically far from its center, and directly in the cluster of purple data points. This point cannot be accurately classified as belonging to the right group, thus even if our algorithm works well it should incorrectly characterize it as a member of the purple group.

We’ll need to calculate the distances between a point and a dataset of points multiple times in this algorithm. To do so, lets define a function that calculates Euclidean distances.

def euclidean(point, data):
"""
Euclidean distance between point & data.
Point has dimensions (m,), data has dimensions (n,m), and output will be of size (n,).
"""
return np.sqrt(np.sum((point - data)**2, axis=1))

First, the k-means clustering algorithm is initialized with a value for k and a maximum number of iterations for finding the optimal centroid locations. If a maximum number of iterations is not considered when optimizing centroid locations, there is a risk of running an infinite loop.

class KMeans: def __init__(self, n_clusters=8, max_iter=300):
self.n_clusters = n_clusters
self.max_iter = max_iter

Now, the bulk of the algorithm is performed when fitting the model to a training dataset.

First we’ll initialize the centroids randomly in the domain of the test dataset, with a uniform distribution.

# Randomly select centroid start points, uniformly distributed across the domain of the dataset
min_, max_ = np.min(X_train, axis=0), np.max(X_train, axis=0)
self.centroids = [uniform(min_, max_) for _ in range(self.n_clusters)]

Next, we perform the iterative process of optimizing the centroid locations.

The optimization process is to readjust the centroid locations to be the means of the points belonging to it. This process is to repeat until the centroids stop moving, or the maximum number of iterations is passed. We’ll use a while loop to account for the fact that this process does not have a fixed number of iterations. Additionally, you could also use a for loop that repeats max_iter times and breaks when the centroids stop changing.

Before beginning the while loop, we’ll initialize the variables used in the exit conditions.

iteration = 0
prev_centroids = None

Now, we begin the loop. We’ll iterate through the data points in the training set, assigning them to an initialized empty list of lists. The sorted_points list contains one empty list for each centroid, where data points are appended once they’ve been assigned.

while np.not_equal(self.centroids, prev_centroids).any() and iteration < self.max_iter:
# Sort each data point, assigning to nearest centroid
sorted_points = [[] for _ in range(self.n_clusters)]
for x in X_train:
dists = euclidean(x, self.centroids)
centroid_idx = np.argmin(dists)
sorted_points[centroid_idx].append(x)

Now that we’ve assigned the whole training dataset to their closest centroids, we can update the location of the centroids and finish the iteration.

# Push current centroids to previous, reassign centroids as mean of the points belonging to them
prev_centroids = self.centroids
self.centroids = [np.mean(cluster, axis=0) for cluster in sorted_points]
for i, centroid in enumerate(self.centroids):
if np.isnan(centroid).any(): # Catch any np.nans, resulting from a centroid having no points
self.centroids[i] = prev_centroids[i]
iteration += 1

After the completion of the iteration, the while conditions are checked again, and the algorithm will continue until the centroids are optimized or the max iterations are passed. The full fit method is included below.

class KMeans: def __init__(self, n_clusters=8, max_iter=300):
self.n_clusters = n_clusters
self.max_iter = max_iter
def fit(self, X_train): # Randomly select centroid start points, uniformly distributed across the domain of the dataset
min_, max_ = np.min(X_train, axis=0), np.max(X_train, axis=0)
self.centroids = [uniform(min_, max_) for _ in range(self.n_clusters)]
# Iterate, adjusting centroids until converged or until passed max_iter
iteration = 0
prev_centroids = None
while np.not_equal(self.centroids, prev_centroids).any() and iteration < self.max_iter:
# Sort each datapoint, assigning to nearest centroid
sorted_points = [[] for _ in range(self.n_clusters)]
for x in X_train:
dists = euclidean(x, self.centroids)
centroid_idx = np.argmin(dists)
sorted_points[centroid_idx].append(x)
# Push current centroids to previous, reassign centroids as mean of the points belonging to them
prev_centroids = self.centroids
self.centroids = [np.mean(cluster, axis=0) for cluster in sorted_points]
for i, centroid in enumerate(self.centroids):
if np.isnan(centroid).any(): # Catch any np.nans, resulting from a centroid having no points
self.centroids[i] = prev_centroids[i]
iteration += 1

Lastly, lets make a method to evaluate a set of points to the centroids we’ve optimized to our training set. This method returns the centroid and the index of said centroid for each point.

def evaluate(self, X):
centroids = []
centroid_idxs = []
for x in X:
dists = euclidean(x, self.centroids)
centroid_idx = np.argmin(dists)
centroids.append(self.centroids[centroid_idx])
centroid_idxs.append(centroid_idx)
return centroids, centroid_idx

Now we can finally deploy our model. Lets train and test it on our original dataset and see the results. We’ll keep our original method of plotting our data, by separating the true labels by color, but now we’ll additionally separate the predicted labels by marker style, to see how the model performs.

kmeans = KMeans(n_clusters=centers)
kmeans.fit(X_train)
# View results
class_centers, classification = kmeans.evaluate(X_train)
sns.scatterplot(x=[X[0] for X in X_train],
y=[X[1] for X in X_train],
hue=true_labels,
style=classification,
palette="deep",
legend=None
)
plt.plot([x for x, _ in kmeans.centroids],
[y for _, y in kmeans.centroids],
'+',
markersize=10,
)
plt.show()
Create a K-Means Clustering Algorithm from Scratch in Python (7)

Figure 4: A failed example where one centroid has no points, and one contains two clusters.

Create a K-Means Clustering Algorithm from Scratch in Python (8)

Figure 5: A failed example where one centroid has no points, two contains two clusters, and two split one cluster.

Create a K-Means Clustering Algorithm from Scratch in Python (9)

Figure 6: A failed example where two centroids contain one and a half clusters, and two centroids split a cluster.

Looks like our model isn’t performing very well. We can infer two primary problems from these three failed examples.

  1. If a centroid is initialized far from any groups, it is unlikely to move. (Example: the bottom right centroid in Figure 4.)
  2. If centroids are initialized too close, they’re unlikely to diverge from one another. (Example: the two centroids in the green group in Figure 6.)

We’ll begin to remedy these problems with a new process of initializing the centroid locations. This new method is referred to as the k-means++ algorithm.

  1. Initialize the first centroid as a random selection of one of the data points.
  2. Calculate the sum of the distances between each data point and all the centroids.
  3. Select the next centroid randomly, with a probability proportional to the total distance to the centroids.
  4. Return to step 2. Repeat until all centroids have been initialized.

This code is included below.

# Initialize the centroids, using the "k-means++" method, where a random datapoint is selected as the first,
# then the rest are initialized w/ probabilities proportional to their distances to the first
# Pick a random point from train data for first centroid
self.centroids = [random.choice(X_train)]
for _ in range(self.n_clusters-1):
# Calculate distances from points to the centroids
dists = np.sum([euclidean(centroid, X_train) for centroid in self.centroids], axis=0)
# Normalize the distances
dists /= np.sum(dists)
# Choose remaining points based on their distances
new_centroid_idx, = np.random.choice(range(len(X_train)), size=1, p=dists)
self.centroids += [X_train[new_centroid_idx]]

If we run this new model a few times we’ll see it performs much better, but still not always perfect.

Create a K-Means Clustering Algorithm from Scratch in Python (10)

Figure 7: An ideal convergence, after implementing the k-means++ initialization method.

And with that, we’re finished. We learned a simple, yet elegant implementation of an unsupervised machine learning model. The complete project code is included below.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from numpy.random import uniform
from sklearn.datasets import make_blobs
import seaborn as sns
import random
def euclidean(point, data):
"""
Euclidean distance between point & data.
Point has dimensions (m,), data has dimensions (n,m), and output will be of size (n,).
"""
return np.sqrt(np.sum((point - data)**2, axis=1))
class KMeans: def __init__(self, n_clusters=8, max_iter=300):
self.n_clusters = n_clusters
self.max_iter = max_iter
def fit(self, X_train): # Initialize the centroids, using the "k-means++" method, where a random datapoint is selected as the first,
# then the rest are initialized w/ probabilities proportional to their distances to the first
# Pick a random point from train data for first centroid
self.centroids = [random.choice(X_train)]
for _ in range(self.n_clusters-1):
# Calculate distances from points to the centroids
dists = np.sum([euclidean(centroid, X_train) for centroid in self.centroids], axis=0)
# Normalize the distances
dists /= np.sum(dists)
# Choose remaining points based on their distances
new_centroid_idx, = np.random.choice(range(len(X_train)), size=1, p=dists)
self.centroids += [X_train[new_centroid_idx]]
# This initial method of randomly selecting centroid starts is less effective
# min_, max_ = np.min(X_train, axis=0), np.max(X_train, axis=0)
# self.centroids = [uniform(min_, max_) for _ in range(self.n_clusters)]
# Iterate, adjusting centroids until converged or until passed max_iter
iteration = 0
prev_centroids = None
while np.not_equal(self.centroids, prev_centroids).any() and iteration < self.max_iter:
# Sort each datapoint, assigning to nearest centroid
sorted_points = [[] for _ in range(self.n_clusters)]
for x in X_train:
dists = euclidean(x, self.centroids)
centroid_idx = np.argmin(dists)
sorted_points[centroid_idx].append(x)
# Push current centroids to previous, reassign centroids as mean of the points belonging to them
prev_centroids = self.centroids
self.centroids = [np.mean(cluster, axis=0) for cluster in sorted_points]
for i, centroid in enumerate(self.centroids):
if np.isnan(centroid).any(): # Catch any np.nans, resulting from a centroid having no points
self.centroids[i] = prev_centroids[i]
iteration += 1
def evaluate(self, X):
centroids = []
centroid_idxs = []
for x in X:
dists = euclidean(x, self.centroids)
centroid_idx = np.argmin(dists)
centroids.append(self.centroids[centroid_idx])
centroid_idxs.append(centroid_idx)
return centroids, centroid_idxs
# Create a dataset of 2D distributions
centers = 5
X_train, true_labels = make_blobs(n_samples=100, centers=centers, random_state=42)
X_train = StandardScaler().fit_transform(X_train)
# Fit centroids to dataset
kmeans = KMeans(n_clusters=centers)
kmeans.fit(X_train)
# View results
class_centers, classification = kmeans.evaluate(X_train)
sns.scatterplot(x=[X[0] for X in X_train],
y=[X[1] for X in X_train],
hue=true_labels,
style=classification,
palette="deep",
legend=None
)
plt.plot([x for x, _ in kmeans.centroids],
[y for _, y in kmeans.centroids],
'k+',
markersize=10,
)
plt.show()

Thanks for reading!
Connect with me on LinkedIn
See this project in GitHub

Create a K-Means Clustering Algorithm from Scratch in Python (2024)

FAQs

How to program the KMeans algorithm in Python from scratch? ›

Thus, the Kmeans algorithm consists of the following steps:
  1. We initialize k centroids randomly.
  2. Calculate the sum of squared deviations.
  3. Assign a centroid to each of the observations.
  4. Calculate the sum of total errors and compare it with the sum in the previous iteration.

How do you create a clustering algorithm in Python? ›

An introduction to popular clustering algorithms in Python
  1. Randomly pick k centroids from the examples as initial cluster centers.
  2. Assign each example to the nearest centroid 𝜇(i),j ∊ [1,…,k]
  3. Move the centroids to the center of the examples that were assigned to it.
Sep 25, 2023

What is K clustering algorithm in Python? ›

The k-means clustering method is an unsupervised machine learning technique used to identify clusters of data objects in a dataset. There are many different types of clustering methods, but k-means is one of the oldest and most approachable.

How do you implement K-Means algorithm for clustering? ›

Step-1: Select the number K to decide the number of clusters. Step-2: Select random K points or centroids. (It can be other from the input dataset). Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.

What is k-means clustering in machine learning from scratch? ›

The K-Means algorithm (also known as Lloyd's Algorithm) consists of 3 main steps : Place the K centroids at random locations (here K=3) Assign all data points to the closest centroid (using Euclidean distance) Compute the new centroids as the mean of all points in the cluster.

How do you create a clustering algorithm? ›

How to Apply K-Means Clustering Algorithm?
  1. Choose the number of clusters k. The first step in k-means is to pick the number of clusters, k.
  2. Select k random points from the data as centroids. ...
  3. Assign all the points to the closest cluster centroid. ...
  4. Recompute the centroids of newly formed clusters. ...
  5. Repeat steps 3 and 4.

How to create an algorithm in Python? ›

Steps to Write an Algorithm in Python:
  1. Step 1: Understand the Problem. Problem: Given a list of numbers, find the maximum element. ...
  2. Step 2: Plan the Approach. We can iterate through the list and keep track of the maximum element encountered so far.
  3. Step 3: Break it Down. ...
  4. Step 4: Write Pseudocode. ...
  5. Step 5: Implement in Python.

What is an example of K clustering? ›

Use K means clustering to generate groups comprised of observations with similar characteristics. For example, if you have customer data, you might want to create sets of similar customers and then target each group with different types of marketing.

What is the best clustering model in Python? ›

The best tool to use depends on the problem at hand and the type of data available. There are three widely used techniques for how to form clusters in Python: K-means clustering, Gaussian mixture models and spectral clustering.

How to interpret K-Means clustering results in Python? ›

Interpreting the meaning of k-means clusters boils down to characterizing the clusters. A Parallel Coordinates Plot allows us to see how individual data points sit across all variables. By looking at how the values for each variable compare across clusters, we can get a sense of what each cluster represents.

How to improve K-Means clustering in Python? ›

  1. 1 Choose a smart initialization. One of the main factors that affect the performance of K-means clustering is how the initial cluster centers are chosen. ...
  2. 2 Scale your data. ...
  3. 3 Use a faster algorithm. ...
  4. 4 Evaluate your clusters. ...
  5. 5 Experiment with different parameters.
Mar 9, 2023

What is K clustering for dummies? ›

K-Means clustering is an unsupervised learning algorithm. There is no labeled data for this clustering, unlike in supervised learning. K-Means performs the division of objects into clusters that share similarities and are dissimilar to the objects belonging to another cluster.

What is the clustering algorithm for categorical data in Python? ›

In Python, the Kmodes function is part of the kmodes library, which implements the K-modes clustering algorithm. This function is used to perform clustering on categorical data, grouping similar data points into clusters based on their categorical attributes.

What is KMeans clustering in Python towards data science? ›

In K-Means clustering, the Euclidean distance is the most common metric for measuring similarity. In the figure below, we can clearly see 3 different groups. Hence, we could determine the centers of each group and each point would be associated with the closest center.

How to make a clustering model? ›

The five steps to building a clustering model
  1. STEP 1: Create a dataset. Compile an aggregated dataset ready to use by your model.
  2. STEP 2: Create a model. Create a clustering model instance in your platform of choice.
  3. STEP 3: Explore your dataset. ...
  4. STEP 4: Configure and train your model. ...
  5. STEP 5: Predict using your model.
Jan 11, 2023

What is KMeans clustering in Python NLP? ›

K-Means Clustering is an iterative algorithm that groups data points into K clusters. The algorithm starts by selecting K centroids randomly and updates them by computing the mean of the data points closest to each centroid. This process is repeated until the centroids stop moving or reach a stopping criterion.

What is the K mode algorithm for clustering? ›

The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function.

Top Articles
Best Soap2day alternatives: Free movie streaming sites - gHacks Tech News
Top Websites Like Soap2Day for Streaming Movies and TV Shows
Forozdz
#ridwork guides | fountainpenguin
The 10 Best Restaurants In Freiburg Germany
Boomerang Media Group: Quality Media Solutions
Teenbeautyfitness
Www.craigslist Augusta Ga
O'reilly's In Monroe Georgia
Lenscrafters Westchester Mall
Chuckwagon racing 101: why it's OK to ask what a wheeler is | CBC News
South Ms Farm Trader
Detroit Lions 50 50
10 Great Things You Might Know Troy McClure From | Topless Robot
The Rise of Breckie Hill: How She Became a Social Media Star | Entertainment
Oppenheimer Showtimes Near Cinemark Denton
Superhot Unblocked Games
6th gen chevy camaro forumCamaro ZL1 Z28 SS LT Camaro forums, news, blog, reviews, wallpapers, pricing – Camaro5.com
Conan Exiles Colored Crystal
How To Cut Eelgrass Grounded
Conscious Cloud Dispensary Photos
Nurse Logic 2.0 Testing And Remediation Advanced Test
Why Should We Hire You? - Professional Answers for 2024
Tu Pulga Online Utah
Urban Dictionary Fov
Dmv In Anoka
Ticket To Paradise Showtimes Near Cinemark Mall Del Norte
In hunt for cartel hitmen, Texas Ranger's biggest obstacle may be the border itself (2024)
Osrs Important Letter
Bridgestone Tire Dealer Near Me
Chicago Pd Rotten Tomatoes
Gr86 Forums
Japanese Pokémon Cards vs English Pokémon Cards
Reli Stocktwits
Cruise Ships Archives
Skip The Games Ventura
Best Weapons For Psyker Darktide
Why Gas Prices Are So High (Published 2022)
Ursula Creed Datasheet
Directions To The Closest Auto Parts Store
Bustednewspaper.com Rockbridge County Va
Does Target Have Slime Lickers
Random Animal Hybrid Generator Wheel
Ferhnvi
Sea Guini Dress Code
Doe mee met ons loyaliteitsprogramma | Victoria Club
1990 cold case: Who killed Cheryl Henry and Andy Atkinson on Lovers Lane in west Houston?
Parks And Rec Fantasy Football Names
Hcs Smartfind
Secondary Math 2 Module 3 Answers
Bunbrat
Latest Posts
Article information

Author: Otha Schamberger

Last Updated:

Views: 6029

Rating: 4.4 / 5 (75 voted)

Reviews: 82% of readers found this page helpful

Author information

Name: Otha Schamberger

Birthday: 1999-08-15

Address: Suite 490 606 Hammes Ferry, Carterhaven, IL 62290

Phone: +8557035444877

Job: Forward IT Agent

Hobby: Fishing, Flying, Jewelry making, Digital arts, Sand art, Parkour, tabletop games

Introduction: My name is Otha Schamberger, I am a vast, good, healthy, cheerful, energetic, gorgeous, magnificent person who loves writing and wants to share my knowledge and understanding with you.