CLUSTER ANALYSIS FOR DATA SCIENTISTS

Vijay Gadre
3 min read · Jul 6, 2023

Cluster Analysis is a multivariate statistical technique that groups observations on the basis of some of the features that describe them.

Goal:- To maximize the similarity of observations within a cluster and maximize the dissimilarity between clusters.

EUCLIDEAN DISTANCE

Figure: Euclidean Distance

The most intuitive way to measure the distance between two points is the length of the straight line drawn from one point (x1, y1) to the other (x2, y2). This is known as the Euclidean Distance.

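  1. 2D Space Euclidean Distance: for points (x1, y1) and (x2, y2),

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

  2. 3D Space Euclidean Distance: for points (x1, y1, z1) and (x2, y2, z2),

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$$

  3. N-Dim Space Euclidean Distance: if the coordinates of A are (a1, a2, …, an) and of B are (b1, b2, …, bn),

$$d(A, B) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \dots + (a_n - b_n)^2}$$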
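
As a rough illustration, the N-dimensional distance (the square root of the sum of squared coordinate differences) can be sketched in Python. The function name `euclidean_distance` is a hypothetical helper chosen for this example. The same function covers the 2D and 3D cases.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


print(euclidean_distance((0, 0), (3, 4)))        # 2D example
print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # 3D example
```

Python 3.8 and later also provide `math.dist`, which computes the same quantity.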

K-MEANS CLUSTERING STEPS

Figure: Block Schematic of Steps involved in K-Means Clustering

  1. Choose the number of clusters, k: This is chosen by the person performing the clustering.
  2. Specify the cluster seeds: A seed is a starting centroid. It can be chosen at random, by an algorithm, or according to some prior knowledge.
  3. Assign each point to a centroid: Each point goes to its nearest centroid, with proximity measured by Euclidean Distance.
  4. Adjust the centroids: Move each centroid to the mean of the points assigned to it, then repeat Steps 3 and 4 until you can no longer find a better clustering solution, i.e. the assignments stop changing.

Figure: K-Means Clustering with k=3 clusters
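
The K-Means steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `kmeans` and its parameters (`init`, `max_iters`, `seed`) are hypothetical, seeds are drawn at random from the observations unless `init` is supplied, and an empty cluster simply keeps its previous centroid.

```python
import math
import random


def kmeans(points, k, init=None, max_iters=100, seed=0):
    """Cluster a list of equal-length numeric tuples into k groups.

    Returns (centroids, labels), where labels[i] is the cluster index of points[i].
    """
    # Seeds: either supplied from prior knowledge (init) or sampled at random.
    if init is not None:
        centroids = [tuple(c) for c in init]
    else:
        centroids = random.Random(seed).sample(points, k)

    labels = None
    for _ in range(max_iters):
        # Assign each point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Adjust each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Two well-separated groups of made-up points, with seeds given explicitly.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(data, k=2, init=[(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because the result depends on the seeds, a different random start can lead to a different solution on less clearly separated data.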

CLASSIFICATION VS CLUSTERING

Figure: Classification vs Clustering

TYPES OF CLUSTERING

There are 2 types of clustering:-

  1. Hierarchical: Historically, hierarchical clustering was developed first. An example of hierarchical clustering would be the taxonomy of the animal kingdom. Its advantage over flat clustering is that it explores all the solutions, from a single cluster down to one cluster per observation.
  2. Flat: There is no hierarchy; instead, the number of clusters is chosen prior to clustering. Flat methods were developed because hierarchical clustering is much slower and more computationally expensive. K-Means is a flat method.

Hierarchical clustering is further divided into two types:-

  1. Divisive (Top-Down): We start from a situation where all the observations are in the same cluster, e.g. the dinosaurs in the taxonomy example. We then split this big cluster into 2 smaller ones, and continue with 3, 4, 5, and so on, until each observation is its own separate cluster. To find the best split, we must explore all possibilities at each step.
  2. Agglomerative (Bottom-Up): We start from individual observations, e.g. different dog and cat breeds, cluster them into dogs and cats respectively, and then continue pairing up species until we reach the single animal cluster. To find the best pair of clusters to merge, we must explore all possibilities at each step.
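
The bottom-up procedure can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, and single linkage (the distance between the closest pair of points in two clusters) is an assumption, since the text does not say how the distance between clusters is measured.

```python
import math
from itertools import combinations


def single_linkage(c1, c2):
    """Distance between two clusters: their closest pair of points (assumed choice)."""
    return min(math.dist(p, q) for p in c1 for q in c2)


def agglomerative(points, n_clusters=1):
    """Merge the two closest clusters until n_clusters remain.

    Returns (clusters, history), where history lists each merged pair in order.
    """
    clusters = [[p] for p in points]  # start: every observation is its own cluster
    history = []
    while len(clusters) > n_clusters:
        # Explore every pair of clusters and pick the closest one.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # combinations() always yields i < j, so remove j first to keep i valid.
        clusters.pop(j)
        clusters.pop(i)
        clusters.append(merged)
    return clusters, history


# Made-up points: two tight pairs and one outlier.
data = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, history = agglomerative(data, n_clusters=3)
print(clusters)
```

Checking every pair of clusters at every step is what makes this approach slow on large datasets, which matches the cost noted for hierarchical clustering above.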

Written by Vijay Gadre

Data Scientist | Machine Learning Engineer | Artificial Intelligence Engineer
