Hierarchical Clustering in Data Mining
Hierarchy is more informative structure rather than the unstructured set of clusters returned by non hierarchical clustering.
Basic algorithm:- Compute the proximity (similarity) matrix.
- Let each data point be cluster.
- Merge the two closest clusters.
- Update the proximity matrix until only one cluster remains.
Different approaches to define the cluster between the clusters.
1. Single linkage
In this algorithm, the pair of clusters having shortest distance is considered, if there exists the similarity between two clusters.
2. Complete linkage
In this method, the distance between one cluster and another cluster should be equal to the greatest distance from any member of one cluster to any member of the other cluster.
3. Average linkage
In this method, the distance between one cluster and another cluster should be equal to average distance from any member of one cluster to any member of the other cluster.
Dendrogram
It is a tree structure diagram which illustrates hierarchical clustering techniques. Each level shows clusters for that level.
Agglomerative hierarchical clustering
- It is a bottom-up approach, in which clusters have sub-clusters.
- The process is explained in the following flowchart.
Example: For given distance matrix, draw single link, complete link and average link dendrogram.
| A | B | C | D | E |
A | 0 | | | | |
B | 2 | 0 | | | |
C | 6 | 3 | 0 | | |
D | 11 | 9 | 7 | 0 | |
E | 9 | 8 | 5 | 4 | 0 |
Solution: Single link: From given distance matrix.
Step 1:
| A, B | C | D | E |
A, B | 0 | | | |
C | 3 | 0 | | |
D | 9 | 7 | 0 | |
E | 8 | 5 | 4 | 0 |
Step 2:
| A, B, C | D | E |
A, B, C | 0 | 0 | |
D | 7 | 4 | |
E | 5 | 0 | 0 |
Step 3:
| A, B, C | D, E |
A, B, C | 0 | |
D, E | 5 | 0 |
2. Complete link: Find maximum distance to draw complete link.
Step 1:
| A, B | C | D | E |
A, B | 0 | | | |
C | 6 | 0 | | |
D | 11 | 7 | 0 | |
E | 9 | 5 | 4 | 0 |
Step 2:
| A, B | C | D, E |
A, B | 0 | | |
C | 6 | 0 | |
D, E | 11 | 7 | 0 |
Step 3:
| A, B, C | D, E |
A, B, C | 0 | |
D, E | 11 | 0 |
3. Average link: To draw average link, compute the average link.
Step 1:
I) 6+3/2 = 4.5
ii) 11+9/2 = 10
iii) 9+8/2 = 8.5
| A, B | C | D | E |
A, B | 0 | | | |
C | 4.5 | | | |
D | 10 | 7 | 0 | |
E | 8.5 | 5 | 4 | 0 |
Step 2:
I) 11+9+9+8/2 = 9.25
II) 7+5/ 2 = 6
| A, B | C | D, E |
A, B | 0 | | |
C | 4.5 | 0 | |
D, E | 9.25 | 6 | 0 |
Step 3:
11+9.25+9.25+8+7+5/6 = 8.20
| A, B, C | D, E |
A, B, C | 0 | |
D, E | 8.20 | 0 |
Comparison between Single link, Complete link and Average link based on distance formula.
Single link | Complete link | Average link |
---|
Handles non-elliptical shapes.
It is sensitive to noise and outliers. | Less sensitive to noise and outliers. | It strikes a balance between single and complete links. |
It strikes a balance between single and complete links.