(also called an object) and the samples represent a number of measurements made on that case. Various multivariate statistical techniques can be applied to data organized in this way and one of the most common is some form of hierarchical clustering. The logic of clustering is evident from the fact that groups of genes will have similar expression patterns over samples, because they are induced by the same environmental conditions or regulated by the same transcription factors. The most common clustering algorithm to apply to microarray data comes from Eisen et al. (1998).

Clustering starts by constructing a matrix of pairwise distances between the genes. There are different ways to calculate distances, one of the most straightforward being Euclidean distance. Suppose we are considering genes A and B, and we have observations on gene expression of a_i for gene A and b_i for gene B in sample i; then the Euclidean distance DEucl between the genes is:

$$D_{\mathrm{Eucl}} = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

where n is the number of samples. This distance measure is calculated for each pair of genes, resulting in a distance matrix, which is input to the clustering algorithm. To illustrate the calculation, Table 2.5 provides a hypothetical example of a very simple gene-expression matrix and the calculation of Euclidean distance between the three genes. This example suggests that the distance between genes B and C is smaller than between either A and B or A and C.
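The calculation can be sketched in a few lines of Python. The expression values below are hypothetical and are not those of Table 2.5; the function names `euclidean_distance` and `distance_matrix` are likewise chosen here for illustration only.

```python
import math
from itertools import combinations


def euclidean_distance(a, b):
    """Euclidean distance between two expression profiles of equal length."""
    if len(a) != len(b):
        raise ValueError("both profiles must cover the same samples")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def distance_matrix(profiles):
    """Distance for every unordered pair of genes, keyed by (gene, gene)."""
    return {
        (g, h): euclidean_distance(profiles[g], profiles[h])
        for g, h in combinations(sorted(profiles), 2)
    }


# Hypothetical expression values for three genes over three samples
# (assumed for illustration; not the values of Table 2.5).
expression = {
    "A": [1.0, 2.0, 3.0],
    "B": [2.0, 4.0, 1.0],
    "C": [2.0, 3.0, 1.0],
}

distances = distance_matrix(expression)
for (g, h), d in distances.items():
    print(f"{g}-{h}: {d:.2f}")
# A-B: 3.00
# A-C: 2.45
# B-C: 1.00
```

With these assumed values the smallest distance is again the one between genes B and C, so that pair would be the first to be joined in a clustering.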

Table 2.5 Hypothetical gene-expression matrix, illustrating the calculation of Euclidean distances (DEucl) between genes (see also Tables 2.6 and 2.7)

The Euclidean distance is not the only way of defining distances between genes. Other measures are Minkowski distance, Manhattan distance, and Hamming distance. In addition, the clustering may be based on a similarity measure, such as Pearson correlation, rather than distance. The reader is referred to Causton et al. (2003) and textbooks of multivariate statistical analysis for more information.
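For orientation, the alternative measures named above can be written out as follows. This is a minimal sketch using the standard textbook definitions; the function names are hypothetical, and the use of Hamming distance on discretized expression calls (such as "up" and "down") is an assumption made here for the example.

```python
import math


def minkowski_distance(a, b, p):
    """Minkowski distance of order p; p=1 gives Manhattan, p=2 Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def manhattan_distance(a, b):
    """Sum of absolute differences over samples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming_distance(a, b):
    """Number of samples in which two profiles differ.

    Assumes the profiles hold discrete calls rather than continuous values.
    """
    return sum(1 for x, y in zip(a, b) if x != y)


def pearson_correlation(a, b):
    """Pearson correlation, a similarity measure ranging from -1 to 1.

    Undefined (division by zero) if either profile is constant.
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical profiles, assumed for illustration only.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 1.0]

print(manhattan_distance(gene_a, gene_b))     # 5.0
print(minkowski_distance(gene_a, gene_b, 2))  # equals the Euclidean distance
print(hamming_distance(["up", "down", "up"], ["up", "up", "up"]))  # 1
print(pearson_correlation(gene_a, gene_b))
```

Note that a correlation measures similarity, so a clustering procedure that expects distances needs it converted first; one common choice is to use one minus the correlation.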

The object of clustering analysis is to develop a dendrogram that groups together genes with similar expression patterns. There are several principles that can be applied to achieve clustering. In an influential paper on gene-expression data analysis, Eisen et al. (1998) applied the so-called average linkage method. In this method a computer algorithm screens the matrix of pairwise distances for the smallest value (in the case of the genes sampled in Table 2.5, this would be 2.53, between genes B and C; see Table 2.6). Then a node is defined between these genes and gene-expression values are calculated for the node by averaging over the two genes involved. The distance matrix is then updated and a new smallest distance is identified. The procedure is repeated until g-1 nodes have been made, where g is the number of genes. Software packages such as that developed by Eisen et al. (1998) provide not only a computational procedure but also a pictorial presentation of the clustered gene-expression pattern; each gene is marked by a colour code, where red is used for upregulated expression and green for downregulated expression.
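The joining procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of Eisen et al. (1998): the expression values are hypothetical, the function name `cluster_by_averaging` is invented for the example, distances are recomputed in every round instead of updating a stored matrix, and node profiles are averaged with weights equal to the number of genes each node contains.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_by_averaging(profiles):
    """Join the closest pair repeatedly until a single node remains.

    Returns the merges in order as (left, right, distance) tuples;
    for g genes there are g - 1 merges.
    """
    # Each active node maps to (expression profile, number of genes in it).
    nodes = {name: (list(values), 1) for name, values in profiles.items()}
    merges = []
    while len(nodes) > 1:
        # Screen all pairs of active nodes for the smallest distance.
        left, right = min(
            combinations(sorted(nodes), 2),
            key=lambda pair: euclidean(nodes[pair[0]][0], nodes[pair[1]][0]),
        )
        p_left, n_left = nodes.pop(left)
        p_right, n_right = nodes.pop(right)
        total = n_left + n_right
        # Node profile: average of the joined profiles, weighted by the
        # number of genes in each (an assumption of this sketch).
        merged = [
            (n_left * x + n_right * y) / total for x, y in zip(p_left, p_right)
        ]
        merges.append((left, right, euclidean(p_left, p_right)))
        nodes[f"({left},{right})"] = (merged, total)
    return merges


# Hypothetical expression values (not the values of Table 2.5).
expression = {
    "A": [1.0, 2.0, 3.0],
    "B": [2.0, 4.0, 1.0],
    "C": [2.0, 3.0, 1.0],
}

merges = cluster_by_averaging(expression)
for left, right, distance in merges:
    print(f"join {left} and {right} at distance {distance:.2f}")
# join B and C at distance 1.00
# join (B,C) and A at distance 2.69
```

The order of the merges and the distances at which they occur are what a dendrogram displays. Library implementations of average linkage typically average the pairwise distances between the members of two clusters instead of averaging expression profiles, so their results can differ from this sketch.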

Cluster analysis is usually done in conjunction with other multivariate statistical techniques, such as principal component analysis (PCA; closely related to singular-value decomposition). The aim of PCA is to find combinations of genes that jointly contribute most to the variability in the data. Technically speaking, one aims to find axes in the

Table 2.6 Euclidean distance (DEucl) between the three genes in Table 2.5 over the three samples


