Results

Evaluation of data set quality by tissue-wise hierarchical clustering

Prior to the analysis of chromosomal expression domains, we checked whether the quality of our complete array expression data set (> 44 k genes) allows discrepancies between tumor samples and normal epithelial tissues to be extracted. Purely unsupervised hierarchical clustering of tissue samples based on gene expression vectors can provide such information. Using the full set of 44 k genes for clustering is not desirable because of the low signal-to-noise ratio and computational considerations. We therefore pre-selected potentially informative genes for hierarchical clustering. We retained only genes that had reliable information about genomic localization and whose probe sets exceeded a minimum expression threshold in at least 20% of the experiments. To enrich for genes that are informative for tissue distinction, we additionally required a minimum standard deviation across all 50 samples. This pre-selection resulted in 514 probe sets. Note that we deliberately avoided pre-selecting genes based on differential expression between tumor and normal tissue, so that the clustering remains unsupervised. We applied three rounds of normalization to genes and arrays. Finally, we applied standard centroid hierarchical clustering (Pearson correlation) to this data set. Two large clusters were revealed (Figure 1). Eighteen of the 25 normal tissues formed a single cluster. The remaining normal tissues mainly clustered together with matching tumor samples from the same patients. This suggests that the co-clustering of tumor and normal samples from the same patient could reflect patient-specific gene expression characteristics. As the majority of normal samples could be clearly separated from tumors, we concluded that our data set is well suited to explore differences in gene expression between normal and tumor cells of colorectal origin.
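The pre-selection and clustering steps described above can be illustrated with a minimal Python sketch on synthetic data. It is not the pipeline used in this study. The function names, the thresholds (`min_expr`, `min_sd`), the use of a log-scale matrix, and the choice of median centering for the normalization rounds are all assumptions made for illustration. In addition, the sketch uses average linkage on the distance 1 - r (Pearson correlation) rather than centroid linkage, because SciPy's centroid method is only well defined for Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def preselect_probes(expr, min_expr, min_frac=0.2, min_sd=1.0):
    """Boolean mask over probe sets (rows of expr).

    A probe set is kept if it exceeds min_expr in at least min_frac of
    the samples and its standard deviation across samples is >= min_sd.
    All threshold values are hypothetical.
    """
    expressed = (expr > min_expr).mean(axis=1) >= min_frac
    variable = expr.std(axis=1, ddof=1) >= min_sd
    return expressed & variable


def normalize(expr, rounds=3):
    """Alternately median-centre genes (rows) and arrays (columns).

    Assumption: the text does not specify the normalization that was used.
    """
    out = expr.astype(float).copy()
    for _ in range(rounds):
        out -= np.median(out, axis=1, keepdims=True)
        out -= np.median(out, axis=0, keepdims=True)
    return out


def cluster_samples(expr):
    """Hierarchical clustering of samples (columns) on 1 - Pearson r."""
    dist = pdist(expr.T, metric="correlation")
    return linkage(dist, method="average")


# Synthetic example: 200 probe sets, 5 "normal" then 5 "tumor" samples.
rng = np.random.default_rng(0)
expr = rng.normal(0.0, 0.3, size=(200, 10))
expr[:100] += 8.0        # expressed probe sets
expr[100:] += 2.0        # probe sets below the expression threshold
expr[:20, 5:] += 3.0     # higher in "tumor" samples
expr[20:40, 5:] -= 3.0   # lower in "tumor" samples

mask = preselect_probes(expr, min_expr=5.0)
tree = cluster_samples(normalize(expr[mask]))
labels = fcluster(tree, t=2, criterion="maxclust")
print(mask.sum(), labels)
```

Because the filter uses only expression level and variance, and never the tumor or normal label of a sample, any separation seen in the resulting tree is not an artifact of the gene selection.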

Hierarchical clustering of samples from colorectal tumors and normal colon epithelia. The right side shows the chromosomal localization of each gene and its official HUGO symbol or the respective Affymetrix cluster ID. The top shows the binary tree of tissue samples based on gene expression. Tissue labels contain either TR for tumor or E for epithelium, plus a code identifying the patient. In the center, the normalized expression values are color-coded: light blue indicates high expression, black indicates low (or no) expression. Only a representative fraction of the 514 genes is shown (white bars replace portions of the original heat map). The right cluster contains only samples from normal colon epithelia; the left cluster is composed primarily of tumors with some interspersed normal epithelial samples. Misplaced normal tissue (E) samples often cluster with the matching tumor (TR) samples from the same patient.

