Cluster Analysis


Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure.

Types:

Data clustering algorithms can be hierarchical, partitional or hybrid. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

Two-way clustering, co-clustering or bi-clustering are names for clusterings in which not only the objects but also the features of the objects are clustered, i.e., if the data are represented in a data matrix, the rows and columns are clustered simultaneously.
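
As a concrete illustration, the sketch below uses scikit-learn's SpectralCoclustering, an assumed choice since the text above does not prescribe a particular bi-clustering algorithm, to cluster the rows and columns of a toy data matrix simultaneously.

```python
# Bi-clustering sketch: rows (objects) and columns (features) of the data
# matrix are clustered at the same time.
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
X = rng.random((20, 12))          # toy data matrix: 20 objects x 12 features

model = SpectralCoclustering(n_clusters=3, random_state=0)
model.fit(X)

print(model.row_labels_)          # cluster label for each row (object)
print(model.column_labels_)       # cluster label for each column (feature)
```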

Hierarchical

Given a distance measure, elements can be combined. Hierarchical clustering builds up (agglomerative) or breaks down (divisive) a hierarchy of clusters. The traditional representation of this hierarchy is a tree data structure (called a dendrogram), with individual elements at one end and a single cluster containing every element at the other. Agglomerative algorithms begin at the individual-element end of the tree, whereas divisive algorithms begin at the single-cluster end.

Cutting the tree at a given height yields a clustering at the corresponding level of granularity.

The agglomerative method builds the hierarchy from the individual elements by progressively merging clusters. The first step is to decide which elements to merge into a cluster: usually the two closest ones, so a distance matrix between all elements is a prerequisite.
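
A minimal sketch of this procedure using SciPy, assuming Euclidean distances and average linkage (both illustrative choices): the distance matrix is computed first, the hierarchy is built by successive merges, and the tree is then cut at a chosen height.

```python
# Agglomerative hierarchical clustering: build the dendrogram from a
# distance matrix, then cut it at a chosen height to get flat clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((10, 2))                 # 10 points in 2 dimensions

dists = pdist(X, metric='euclidean')    # condensed distance matrix (the prerequisite)
Z = linkage(dists, method='average')    # repeatedly merge the two closest clusters

labels = fcluster(Z, t=0.5, criterion='distance')  # cut the tree at height 0.5
print(labels)
```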

Partitional

According to Pollard [1, p. 212], the main methods for partitional clustering are partitioning around medoids (PAM) [3], self-organising maps (SOM), and k-means. All of these map a collection of elements into k >= 2 disjoint clusters by aiming to optimise a particular criterion.

The K-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.
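The loop below is a bare-bones NumPy sketch of exactly these two steps, assigning points to the nearest centre and recomputing each centre as the per-dimension mean; the random initialisation, fixed iteration count, and lack of empty-cluster handling are simplifications.

```python
# Minimal k-means loop: assign each point to its nearest centre, then
# recompute each centre as the mean of its assigned points, per dimension.
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centres
    for _ in range(n_iter):
        # distance of every point to every centre, then nearest-centre assignment
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # each centre becomes the arithmetic mean of its points (no empty-cluster guard)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

X = np.random.default_rng(1).random((50, 2))
labels, centers = kmeans(X, k=3)
```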

PAM first computes k representative objects, called medoids. A medoid is the object of a cluster whose average dissimilarity to all the other objects in the cluster is minimal; in the classification literature such representative objects are called centrotypes. After the set of medoids has been found, each object of the data set is assigned to the nearest medoid, i.e. object i is put into the cluster whose medoid is nearer to i than any other medoid.
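
A simplified sketch of these two phases, working directly from a dissimilarity matrix: medoids are improved by greedy swaps, and each object is then assigned to its nearest medoid. This is a naive stand-in for the BUILD and SWAP phases of Kaufman and Rousseeuw's full algorithm [3], for illustration only.

```python
# Naive PAM-style clustering from a dissimilarity matrix D.
import numpy as np

def pam(D, k, seed=0):
    n = len(D)
    medoids = list(np.random.default_rng(seed).choice(n, size=k, replace=False))
    cost = D[:, medoids].min(axis=1).sum()     # total dissimilarity to nearest medoid
    improved = True
    while improved:
        improved = False
        for i in range(k):                     # try swapping each medoid...
            for c in range(n):                 # ...with each non-medoid object
                if c in medoids:
                    continue
                trial = medoids[:i] + [c] + medoids[i + 1:]
                trial_cost = D[:, trial].min(axis=1).sum()
                if trial_cost < cost:          # keep the swap if it lowers the cost
                    medoids, cost, improved = trial, trial_cost, True
    labels = D[:, medoids].argmin(axis=1)      # assign each object to nearest medoid
    return medoids, labels

X = np.random.default_rng(2).random((30, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise dissimilarities
medoids, labels = pam(D, k=3)
```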

SOMs were originally invented by Teuvo Kohonen. Though computationally more expensive than other unsupervised methods, they have proved extremely useful when the input data are of high dimensionality and complexity. Past applications include voice and handwriting recognition, and recently there have been many applications in deciphering gene expression patterns. Not only do SOMs give an idea of the clustering (i.e. similar expression patterns), they are also spatially sensitive: one can tell not only which cluster a gene belongs to, but also (with some confidence) how related it is to that cluster or cluster centre. What's more, all this information can be visualised in two dimensions using colours, which is convenient for publishing in journals or on the web.

SOM is a learning algorithm (a neural network). The basic idea, sketched in code after the list, is to:

  • Represent high-dimensional data in a low-dimensional form without losing any of the 'essence' of the data.
  • Organise data on the basis of similarity by putting entities geometrically close to each other.
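
The following bare-bones NumPy sketch shows SOM training: for each input, the best-matching unit on a 2-dimensional grid is found, and that unit plus its grid neighbours are pulled toward the input with a Gaussian neighbourhood. The constant learning rate and neighbourhood width are simplifying assumptions; practical SOMs decay both over time.

```python
# Minimal SOM training loop on an 8x8 grid of weight vectors.
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 8, 8, 5
W = rng.random((grid_h, grid_w, dim))              # one weight vector per grid node
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                              indexing='ij'), axis=-1)   # (row, col) of each node

data = rng.random((200, dim))                      # toy high-dimensional inputs
lr, sigma = 0.1, 1.5                               # constant rate/width (simplification)

for epoch in range(10):
    for x in data:
        d = np.linalg.norm(W - x, axis=2)          # distance of x to every node
        bmu = np.unravel_index(d.argmin(), d.shape)  # best-matching unit on the grid
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
        h = np.exp(-grid_dist**2 / (2 * sigma**2)) # Gaussian neighbourhood weights
        W += lr * h[..., None] * (x - W)           # pull the neighbourhood toward x
```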

Hybrid

The main example of this type of clustering is the ‘hierarchical ordered partitioning and collapsing hybrid’ (HOPACH) algorithm [4]. It builds a tree of clusters in which the clusters at each level are ordered by the pairwise dissimilarities between cluster medoids. According to Pollard [1], the algorithm starts at the root node and aims to find the right number of children for each node by alternating partitioning (divisive) steps with collapsing (agglomerative) steps.
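
The sketch below is only a structural skeleton of this idea, not the published HOPACH algorithm: it recursively partitions each node (here with k-means, an arbitrary stand-in) and then collapses child clusters whose medoids fall within an ad-hoc threshold, whereas HOPACH chooses the number of children and the collapses with a median split silhouette criterion [4].

```python
# Skeleton of the alternation between divisive (partitioning) and
# agglomerative (collapsing) steps; choices of k, depth and threshold
# are arbitrary illustrations, not HOPACH's criteria.
import numpy as np
from sklearn.cluster import KMeans

def medoid(X):
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    return X[D.sum(axis=1).argmin()]       # element minimising total dissimilarity

def hopach_like(X, depth=0, max_depth=2, k=3, collapse_tol=0.3):
    if depth == max_depth or len(X) <= k:
        return [X]                         # leaf: keep as one cluster
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    children = [X[labels == j] for j in range(k)]        # partitioning step
    merged = [children[0]]
    for child in children[1:]:             # collapsing step: merge close children
        if np.linalg.norm(medoid(merged[-1]) - medoid(child)) < collapse_tol:
            merged[-1] = np.vstack([merged[-1], child])
        else:
            merged.append(child)
    out = []
    for child in merged:                   # recurse into each surviving child
        out.extend(hopach_like(child, depth + 1, max_depth, k, collapse_tol))
    return out

X = np.random.default_rng(3).random((60, 2))
clusters = hopach_like(X)
print([len(c) for c in clusters])
```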

References

  1. Gentleman, R., Carey, V., Huber, W., Irizarry, R., & Dudoit, S. (Eds.) (2005). Bioinformatics and Computational Biology Solutions Using R and Bioconductor. New York: Springer, pp. 232-233.

  2. Ashburner, M., et al. (2000). Gene Ontology: tool for the unification of biology. Nature Genetics, 25, 25-29.

  3. Kaufman, L., & Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons.

  4. van der Laan, M. J., & Pollard, K. S. (2003). A new algorithm for hybrid clustering with visualization and the bootstrap. Journal of Statistical Planning and Inference, 117, 275-303.