CLUSTER ANALYSIS FOR DATA SCIENTISTS
Cluster Analysis is a multivariate statistical technique that groups observations on the basis of the features that describe them.
Goal:- To maximize the similarity of observations within a cluster and maximize the dissimilarity between clusters.
EUCLIDEAN DISTANCE
The most intuitive way to measure the distance between two points is by drawing a straight line from one point (x1, y1) to another (x2, y2). That’s known as Euclidean Distance.
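1. 2D Space Euclidean Distance: for A = (x1, y1) and B = (x2, y2), d(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2)
2. 3D Space Euclidean Distance: for A = (x1, y1, z1) and B = (x2, y2, z2), d(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)
3. N-Dim Space Euclidean Distance: if the coordinates of A are (a1, a2, …, an) and of B are (b1, b2, …, bn), then d(A, B) = sqrt((a1 - b1)^2 + (a2 - b2)^2 + … + (an - bn)^2)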
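As a minimal sketch, the same distance can be computed in Python for any number of dimensions. The function name `euclidean_distance` and the sample points are illustrative assumptions, not part of the original text.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

d2 = euclidean_distance((0, 0), (3, 4))              # 2D example
d3 = euclidean_distance((1, 2, 3), (4, 6, 3))        # 3D example
dn = euclidean_distance((1, 1, 1, 1), (2, 2, 2, 2))  # 4D example
print(d2, d3, dn)
```

The one function covers the 2D, 3D and N-dimensional cases because the formula only differs in how many squared differences are summed.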
K-MEANS CLUSTERING STEPS
1. Choose the number of clusters, k: This is chosen by the person performing the clustering.
2. Specify the cluster seeds: A seed is a starting centroid. It can be chosen at random, by an algorithm, or according to some prior knowledge.
3. Assign each point to a centroid: Each point goes to its nearest centroid, with proximity measured by Euclidean Distance.
4. Adjust the centroids: Move each centroid to the mean of the points assigned to it. Repeat Steps 3 and 4 until you can no longer find a better clustering solution.
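The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not a reference implementation: the function name `k_means`, the `max_iter` and `seed` parameters, and the sample points are all hypothetical. Seeds are drawn at random from the observations, and "no better solution" is taken to mean that the assignments stop changing.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Return (labels, centroids) for a list of equal-length coordinate tuples."""
    rng = random.Random(seed)
    # Seed step: pick k observations at random as starting centroids (assumed strategy).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids would not move either
        labels = new_labels
        # Adjustment step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels, centroids)
```

Which group ends up labelled 0 or 1 depends on the random seeds, so compare groupings rather than label values.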
CLASSIFICATION VS CLUSTERING
TYPES OF CLUSTERING
There are 2 types of clustering:-
- Hierarchical: Historically, hierarchical clustering was developed first. An example of hierarchical clustering would be the taxonomy of the animal kingdom. Its advantage over flat clustering is that it explores all the solutions, producing a clustering for every possible number of clusters.
- Flat: There is no hierarchy; instead, the number of clusters is chosen prior to clustering. Flat methods were developed because hierarchical clustering is much slower and more computationally expensive.
Hierarchical clustering is further divided into two types:-
- Divisive (Top-Down): We start with all the observations in the same cluster, e.g., the dinosaurs. We split this big cluster into 2 smaller ones, then continue with 3, 4, 5, and so on, until each observation is its own separate cluster. To find the best split, we must explore all possibilities at each step.
- Agglomerative (Bottom-Up): We start with each observation as its own cluster, e.g., different dog and cat breeds. We cluster them into dogs and cats respectively, and then continue pairing up species until we reach the animal cluster. To find the best pair of clusters to combine, we must explore all possibilities at each step.
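A rough sketch of the bottom-up approach follows. The text does not say how the distance between two clusters is measured, so this example assumes single linkage (the distance between the closest pair of members). The function names and sample points are hypothetical.

```python
import math
from itertools import combinations

def single_linkage(c1, c2):
    """Distance between two clusters: their closest pair of members (assumed rule)."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points):
    """Repeatedly merge the two closest clusters; return the clustering at every level."""
    clusters = [[p] for p in points]  # start: every observation is its own cluster
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Explore all pairs of clusters and pick the closest pair.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        levels.append([list(c) for c in clusters])
    return levels

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
levels = agglomerative(points)
two_clusters = levels[-2]  # the level just before everything merges into one cluster
print(two_clusters)
```

Because every level is kept, the result contains a solution for each number of clusters from n down to 1. Comparing all pairs at each step is also why this approach is slower than flat clustering.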