S. Omatu et al. (Eds.): IWANN 2009, Part II, LNCS 5518, pp. 971–977, 2009. 
© Springer-Verlag Berlin Heidelberg 2009 
An Evolutionary Approach for Sample-Based Clustering on 
Microarray Data 
Daniel Glez-Peña1, Fernando Díaz2, José R. Méndez1, Juan M. Corchado3, 
and Florentino Fdez-Riverola1 
1
 ESEI: Escuela Superior de Ingeniería Informática, University of Vigo, Edificio Politécnico, 
Campus Universitario As Lagoas s/n, 32004, Ourense, Spain 
{dgpena, moncho.mendez, riverola}@uvigo.es 
2
 Dept. Informática, University of Valladolid, Escuela Universitaria de Informática, Plaza Santa 
Eulalia, 9-11, 40005, Segovia, Spain 
fdiaz@infor.uva.es 
3
 Dept. Informática y Automática, University of Salamanca, Plaza de la Merced s/n, 37008, 
Salamanca, Spain 
corchado@usal.es 
Abstract. Sample-based clustering is one of the most common methods for dis-
covering disease subtypes as well as unknown taxonomies. By revealing hidden 
structures in microarray data, cluster analysis can potentially lead to more tai-
lored therapies for patients as well as better diagnostic procedures. In this work, 
we present a novel method for automatically discovering clusters of samples 
which are coherent from a genetic point of view. Each possible cluster is char-
acterized by a fuzzy pattern which maintains a fuzzy discretization of relevant 
gene expression values. Noise genes are identified and removed from the fuzzy 
pattern based on their probability of appearance. Possible clusters are randomly 
constructed and iteratively refined by following a probabilistic search and an 
optimization schema. Experimental results over publicly available microarray 
data show the effectiveness of the proposed method. 
Keywords: simulated annealing, sample-based clustering, discriminant fuzzy 
pattern, microarray data. 
1   Introduction and Motivation 
Within a gene expression matrix, there are usually several particular macroscopic 
phenotypes of samples related to some diseases or drug effects, such as diseased sam-
ples, normal samples or drug treated samples. The goal of sample-based clustering is 
to find the phenotype structures or sub-structure of these samples. Many conventional 
clustering algorithms have been adapted or directly applied to gene expression data 
where the signal-to-noise ratio may seriously degrade the quality and reliability of 
clustering results. This has the effect of obscuring clustering in samples that may be 
evident only when looking at a subset of genes. 
In this context, existing sample-based clustering methods can be (i) directly applied 
to cluster samples using all the genes as features (i.e., classical techniques as  
972 D. Glez-Peña et al. 
K-means, SOM, HC, etc.) or (ii) executed after a set of informative genes are identi-
fied. The problem with the first approach is the signal-to-noise ratio, which is known 
to seriously reduce the accuracy of clustering results due to the existence of noise and 
outliers of the samples [1]. To overcome such difficulty, particular methods can be 
applied to indentify informative genes and reduce gene dimensionality prior to clus-
tering samples in order to detect their phenotypes. In this context, both supervised and 
unsupervised informative gene selection techniques have been developed. 
While supervised informative gene selection techniques often obtain high cluster-
ing accuracy rates, unsupervised informative gene selection methods are more com-
plex because they assume no phenotype information being assigned to any sample [2]. 
In such a situation, two general strategies have been adopted to address the lack of 
prior knowledge: (i) unsupervised gene selection, that aims to reduce the number of 
genes before clustering samples by using some statistical models [3-5] and (ii) interre-
lated clustering, that takes benefits of utilizing the relationship between the genes and 
samples to perform gene selection and sample clustering simultaneously in an  
iterative paradigm [6-10].  
In this contribution we propose a simulated annealing-based algorithm for iterative 
class discovery that uses a novel fuzzy logic method for informative gene selection. 
The interrelated clustering process carried out is based on an iterative approach where 
possible clusters are randomly constructed and evaluated by following a probabilistic 
search and an optimization schema. The rest of the paper is structured as follows: 
Section 2 introduces the details of our proposed technique discussing relevant aspects 
of the whole algorithm. Section 3 presents the experimental setup carried out and the 
results obtained from a publicly available microarray data set. Finally, Section 4 
summarizes the main conclusions extracted from this work. 
2   Iterative Class Discovery Algorithm 
In this section we introduced the proposed method for automatically discovering clus-
ters of samples which are coherent from a genetic point of view. Each possible cluster 
is characterized by a fuzzy pattern which maintains a fuzzy discretization of relevant 
gene expression values. Noise genes are identified and removed from the fuzzy pattern 
based on their probability of appearance. Possible clusters are randomly constructed 
and iteratively refined by following a probabilistic search and an optimization schema. 
2.1   Selecting Relevant Genes 
In order to identify potential valuable genes, we use part of a previous successful gene 
selection technique called DFP (Discriminant Fuzzy Pattern) [11]. Our whole DFP 
algorithm comprises of three main steps. First, we represent each gene value in terms 
of one from the following linguistic labels: Low, Medium, High and their intersec-
tions LowMedium and MediumHigh. The output is a fuzzy microarray descriptor 
(FMD) for each existing sample (microarray). The second phase aims to find all genes 
that best explain each class, constructing a supervised fuzzy pattern (FP) for each 
pathology. Starting from the previous obtained FPs, our technique discriminates those 
genes that can provide a substantial discernibility between existing classes, generating 
a unique discriminant fuzzy pattern (DFP). In our present work, we only use steps one 
and two of the DFP algorithm. 
 An Evolutionary Approach for Sample-Based Clustering on Microarray Data 973 
2.2   Filtering Noisy Genes 
In order to discard those genes that belong to a given cluster of samples due only to 
pure chance, we deal with the concept of ‘noisy genes’. As the uncertainty decreases 
(there are predominance of one expression level over the other ones for all the genes 
in the available set of microarrays) the number of noise genes trends upward (the 
amount of information encoded by the data also decreases and then, there are more 
irrelevant genes). When uncertainty increases, the amount of information also grows 
and more genes are necessary to distinguish arrays in absence of other information. 
2.3   Assessing the Value of a Cluster 
Our cost function for evaluating each cluster combines two factors: (i) the number of 
genes in the fuzzy pattern associated to each cluster of the partition and (ii) the size of 
such cluster. The first factor in the cost function models the genetic coherence of a 
cluster. Assuming this hypothesis, it is expected that for clusters with equal sizes, the 
number of genes in a fuzzy pattern will be greater if the genetic coherence of the 
cluster is higher. The second factor is relevant since it has been experimentally ob-
served that meaningful genes in great clusters (after noisy genes have been filtered) 
are several orders of magnitude inferior to meaningful genes computed in small clus-
ters. This fact is reasonable because it will be more probable when the number of 
possibilities is also more reduced. Therefore, the size factor in the cost function is 
needed for doing comparable clusters of different size. 
2.4   Algorithm 
The application of our simulated annealing approach to cluster microarrays is as fol-
lows. First of all, we consider a pool which contains the set of m microarrays that 
must be clusterized into k different and unknown groups. In the final solution, some 
microarrays can stay in the pool without being associated to any cluster. Initially, a 
first solution to the problem (a partition of microarrays) is constructed randomly. All 
the microarrays of the pool are distributed randomly among k classes, where k is the 
desired number of clusters of the partition (the whole microarrays are spread propor-
tionally among the k clusters and the pool). Figure 1 shows the pseudo-code of the 
general algorithm. 
On every step a neighbour solution is determined by choosing one from the follow-
ing alternatives: (i) either moving a randomly chosen microarray from the pool to a 
cluster (perhaps empty), (ii) or by moving a randomly chosen microarray from a clus-
ter to the pool, (iii) or by exchanging randomly chosen microarrays among clusters, 
(iv) or by exchanging randomly chosen microarrays among a cluster and the pool, and 
(v) or by moving a randomly chosen microarray from one cluster to another cluster. 
The neighbour solutions of lower cost obtained in this way are always accepted, 
whereas the solutions with a higher cost are accepted with a given probability 
The algorithm stops if equilibrium is encountered. We define that equilibrium is 
reached if after 50 stages of temperature reduction the best solution can not be im-
proved. Opposed to the classical approach in which a solution to the problem is taken 
 
974 D. Glez-Peña et al. 
01 Create an initial, old_solution as an initial  partition 
in k clusters of the microarrays in pool 
02 best_solution Å old_solution; 
03 equilibrium_conter Å 0; 
04 T Å cost(old_solution); 
05 repeat 
06  old_solution Å best_solution; 
07  for iteration_counterÅ 1 to n do 
08   annealing_step(old_solution, best_solution); 
09  end for; 
10  T Å T · α; 
11  equilibrium_counter Å equilibrium_counter + 1; 
12 until equilibrium_counter > 50; 
Fig. 1. General pseudo-code of the simulated annealing-based clustering algorithm 
as the last solution obtained in the annealing process, we memorize the best solution 
found during the whole annealing process. Moreover, at the beginning of each tem-
perature epoch, the search is restarted from the best solution reached for the moment 
(Cf. line 6 of the main procedure presented in Figure1).  
Summing up, the annealing algorithm performs the local search by sampling the 
neighbourhood randomly. It attempts to avoid becoming prematurely trapped in a 
local optimum by sometimes accepting a low-grade solution. The acceptance level 
depends on the magnitude of the increment of the solution cost and on the spent 
search time. 
3   Experimental Setup and Results 
Dealing with unsupervised classification, it is very difficult to test the ability of 
a method to perform the clustering since there is no supervision of the process. 
In [12] the authors proposed that lymphoblastic leukemias with MLL transloca-
tions (mixed-lineage leukemia) constitute a distinct disease, denoted as MLL, 
and show that the differences in gene expression are robust enough to classify 
leukemias correctly as MLL, acute lymphoblastic leukemia (ALL) or acute mye-
loid leukemia (AML). The public dataset of this work has been used to test our 
proposal. The complete group of samples consists of 24 patients with B-
Precursor ALL (ALL), 20 patients with MLL rearranged B-precursor ALL 
(MLL) and 28 patients with acute myeloid leukemia (AML). All the samples 
were analyzed using the Affymetrix GeneChip U95a which contains 12600 
known genes. 
In this sense, the classification into different groups proposed by [12] is assumed to 
be the reference partition of samples in our work. The results of the proposed cluster-
ing algorithm working with this dataset are shown in Table 1. Table 1 presents the 
percentage of the times that each available microarray has been grouped together with 
other microarrays belonging to the reference groups (ALL, AML and MLL) in the 10 
executions of the algorithm. 
 An Evolutionary Approach for Sample-Based Clustering on Microarray Data 975 
Table 1. Clustering carried out by the proposed algorithm using the dataset presented in Arm-
strong et al. [12] 
Id . Array  ALL-mll all-mll AML-mll MLL-{all|aml} 
ALL-03 
 0.17 0.50 0.33 
ALL-61 
 0.25  0.75 
ALL-06 1.00    
ALL-08 1.00    
ALL-60 1.00    
ALL-11 0.90 0.10   
ALL-19 0.90 0.10   
ALL-07 0.89 0.11   
ALL-58 0.89   0.11 
ALL-59 0.89 0.11   
ALL-05 0.88   0.13 
ALL-13 0.86   0.14 
ALL-02 0.80 0.10  0.10 
ALL-20 0.78   0.22 
ALL-16 0.75 0.25   
ALL-10 0.70 0.10  0.20 
ALL-14 0.70 0.10  0.20 
ALL-09 0.67 0.22  0.11 
ALL-15 0.57 0.14  0.29 
ALL-01 0.50 0.25  0.25 
ALL-17 0.50 0.10 0.10 0.30 
ALL-18 0.50 0.50   
ALL-12 0.38 0.50  0.13 
ALL-04 0.25 0.63  0.13 
AML-38 
  1.00  
AML-39 
  1.00  
AML-41 
  1.00  
AML-42 
  1.00  
AML-43 
  1.00  
AML-44 
  1.00  
AML-46 
  1.00  
AML-49 
  1.00  
AML-50 
  1.00  
AML-51 
  1.00  
AML-52 
  1.00  
AML-53 
  1.00  
AML-54 
  1.00  
AML-57 
  1.00  
AML-66 
  1.00  
AML-68 
  1.00  
AML-69 
  1.00  
AML-70 
  1.00  
AML-71 
  1.00  
AML-72 
  1.00  
AML-40 
  0.90 0.10 
AML-56 
  0.90 0.10 
AML-67 
  0.90 0.10 
AML-65 
 0.11 0.89  
AML-55 0.11  0.78 0.11 
AML-47 
  0.70 0.30 
AML-48 
  0.57 0.43 
AML-45 
 0.83 0.17  
976 D. Glez-Peña et al. 
Table 1. (continued) 
Id . Array  ALL-mll all-mll AML-mll MLL-{all|aml} 
MLL-33 0.11  0.11 0.78 
MLL-29 
 0.25  0.75 
MLL-31 
 0.25  0.75 
MLL-26 
 0.29  0.71 
MLL-23 0.33   0.67 
MLL-36 0.11  0.22 0.67 
MLL-22 0.13 0.25  0.63 
MLL-64 0.30 0.10  0.60 
MLL-35 0.11  0.33 0.56 
MLL-21 0.25 0.25  0.50 
MLL-27 0.33 0.17  0.50 
MLL-30 0.40 0.10  0.50 
MLL-63 0.20 0.10 0.20 0.50 
MLL-32 0.56   0.44 
MLL-24 0.29 0.14 0.14 0.43 
MLL-37 0.10 0.10 0.40 0.40 
MLL-62 0.60 0.10  0.30 
MLL-28 0.25 0.38 0.13 0.25 
MLL-25 0.14 0.71  0.14 
MLL-34 0.25 0.63  0.13 
From Table 1 it can be viewed that the AML samples form a group whose samples 
are clearly distinguished from the rest (only sample AML-45 is mixed with other 
samples of ALL or MLL clusters, and sample ALL-03 is grouped in a 50% of the 
executions with other samples of the AML cluster). The confusion is greater between 
groups ALL and MLL since several samples of type MLL are grouped majorly with 
samples of ALL group (for example, samples MLL-32 and MLL-62), others are also 
grouped in a balanced way with samples of ALL/MLL group (MLL-25, MLL-28, 
MLL-34, ALL-04, ALL-12, and ALL-18), and the sample ALL-61 is grouped ma-
jorly with samples of MLL group. These results are reasonable since AML (Acute 
Myeloid Leukemia) are a different family from the Lymphoblastic Leukemias (ALL 
and MLL), and the set of MLL samples is speculated to be a potential subtype of the 
class of ALL. 
4   Conclusion 
The iterative class discovery method takes advantage of the properties of fuzzy logic 
and the theory of fuzzy sets for dealing with gene expression unsharp boundaries in 
which membership is a matter of degree. This method can be used to discover parti-
tions in which biological significance is guaranteed by the similitude between the 
fuzzy labels assigned to the samples belonging to the cluster. The clustering algorithm 
can be easily extended to applications different from clustering microarray data. 
Acknowledgments. This work is supported in part by the projects Research on 
Translational Bioinformatics (08VIB6) from University of Vigo and Development of 
computational tools for the classification and clustering of gene expression data in 
order to discover meaningful biological information in cancer diagnosis (ref. 
 An Evolutionary Approach for Sample-Based Clustering on Microarray Data 977 
VA100A08) from JCyL (Spain). The work of D. Glez-Peña is supported by a María 
Barbeito contract from Xunta de Galicia. 
References 
1. Xing, E.P., Karp, R.M.: Cliff: Clustering of high-dimensional microarray data via iterative 
feature filtering using normalized cuts. Bioinformatics 17(1), 306–315 (2001) 
2. Jiang, D., Tang, C., Zhang, A.: Cluster Analysis for Gene Expression Data: A Survey. 
IEEE Transactions on Knowledge and Data Engineering 16(11), 1370–1386 (2004) 
3. Alter, O., Brown, P.O., Bostein, D.: Singular value decomposition for genome-wide ex-
pression data processing and modeling. Proceedings of the National Academy of Sciences 
of the United States of America 97(18), 10101–10106 (2000) 
4. Ding, C.: Analysis of gene expression profiles: class discovery and leaf ordering. In: Pro-
ceedings of the Six Annual International Conference on Computational Molecular Biol-
ogy, pp. 127–136 (2002) 
5. Yeung, K.Y., Ruzzo, W.L.: Principal component analysis for clustering gene expression 
data. Oxford Bioinformatics 17(9), 763–774 (2000) 
6. Ben-Dor, A., Friedman, N., Yakhini, Z.: Class discovery in gene expression data. In: Pro-
ceedings of the fifth Annual International Conference on Computational Biology, pp. 31–
38 (2001) 
7. Xing, E.P., Karp, R.M.: Cliff: Clustering of high-dimensional microarray data via iterative 
feature filtering using normalized cuts. Oxford Bioinformatics 17(1), 306–315 (2001) 
8. von Heydebreck, A., Huber, W., Poustka, A., Vingron, M.: Identifying splits with clear 
separation: a new class discovery method for gene expression data. Oxford Bioinformat-
ics 17, 107–114 (2001) 
9. Tang, C., Zhang, A., Ramanathan, M.: ESPD: a pattern detection model underlying gene 
expression profiles. Oxford Bioinformatics 20(6), 829–838 (2004) 
10. Varma, S., Simon, R.: Iterative class discovery and feature selection using Minimal Span-
ning Trees. BMC Bioinformatics 5, 126 (2004) 
11. Glez-Peña, D., Álvarez, R., Díaz, F., Fdez-Riverola, F.: DFP: A Bioconductor package for 
fuzzy profile identification and gene reduction of microarray data. BMC Bioinformat-
ics 10, 37 (2009) 
12. Armstrong, S.A., Stauton, J.E., Silverman, L.B., Pieters, R., den Boer, M.L., Minden, 
M.D., Sallan, S.E., Lander, E.S., Golub, T.R., Korsmeyer, S.J.: MLL translocations spec-
ify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genet-
ics 20, 41–47 (2002)