E. Corchado et al. (Eds.): IDEAL 2006, LNCS 4224, pp. 1095 – 1102, 2006. 
© Springer-Verlag Berlin Heidelberg 2006 
Applying GCS Networks to Fuzzy Discretized 
Microarray Data for Tumour Diagnosis 
Fernando Díaz1, Florentino Fdez-Riverola2, Daniel Glez-Peña2, and J.M. Corchado3 
1
 Dept. Informática, University of Valladolid, Escuela Universitaria de Informática, 
Plaza Santa Eulalia, 9-11, 40005, Segovia, Spain 
fdiaz@infor.uva.es 
2
 Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, 
Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain 
{riverola, dgpena}@uvigo.es 
3
 Dept. Informática y Automática, University of Salamanca,  
Plaza de la Merced s/n, 37008, Salamanca, Spain 
corchado@usal.es 
Abstract. Gene expression profiles belonging to DNA microarrays are com-
posed of thousands of genes at the same time, representing the complex rela-
tionships between them. In this context, the ability of designing methods capa-
ble of overcoming current limitations is crucial to reduce the generalization er-
ror of state-of-the-art algorithms. This paper presents the application of a self-
organised growing cell structures network in an attempt to cluster biological 
homogeneous patients. This technique makes use of a previous successful su-
pervised fuzzy pattern algorithm capable of performing DNA microarray data 
reduction. The proposed model has been tested with microarray data belonging 
to bone marrow samples from 43 adult patients with cancer plus a group of six 
cases corresponding to healthy persons. The results of this work demonstrate 
that classical artificial intelligence techniques can be effectively used for tu-
mour diagnosis working with high-dimensional microarray data. 
1   Introduction and Motivation 
In recent years, machine learning and data mining fields have found a successful 
application area in the field of DNA microarray technology. Microarrays are one of 
the latest high-throughput technologies in experimental molecular biology, which 
allow monitoring of gene expression for tens of thousands of genes in parallel and are 
already producing high amounts of valuable data. One of the major uses of DNA 
microarray experiments is to attempt to infer meaningful relationships among genes, 
but the analysis and handling of such data is becoming one of the major bottlenecks in 
the utilization of this technology [1]. Since the number of examined genes in an ex-
periment is measured in terms of thousands, different data mining techniques have 
been intensively used to analyse and discover knowledge from gene expression data 
[2]. However, having so many fields relative to so few samples creates a high likeli-
hood of finding false positives. 
Recent studies in human cancer have demonstrated that microarrays can be used to 
develop a new taxonomy of cancer, including major insights into the genesis, progres-
sion, prognosis, and response to therapy on the basis of gene expression profiles [3]. 
1096 F. Díaz et al. 
However, there continues to be a need to develop new approaches to (i) diagnose 
cancer early in its clinical course, (ii) more effectively treat advanced stage disease, 
(iii) better predict a tumour’s response to therapy prior to the actual treatment, and (iv) 
ultimately prevent disease from arising through chemopreventive strategies. Given the 
fact that systematic classification of types of tumours is crucial to achieve advances in 
cancer treatment, several research works have been developed in this direction [4, 5]. 
Following a novel approach, in this paper we explore the capabilities of a growing 
cell structures (GCS) neural network to discover relevant knowledge for clustering 
patients suffering for acute myeloid leukemia (AML). This knowledge can be easily 
conveyed to and understood by humans via available visualization techniques. A key 
advantage of the proposed method is that it allows incorporating biological meaning-
ful information to the network operation in the form of a gene-based distance metric. 
Our GCS network makes use of a previous successful fuzzy discretization method for 
data reduction on microarray data domain. 
The rest of the paper is organized as follows: Section 2 summarizes our previous 
successful fuzzy discretization algorithm for data dimensionality reduction, Section 3 
presents the main issues about the proposed model based on a modified GCS network, 
Section 4 presents the experiments carried out and discusses the obtained results, 
finally, Section 5 gives out the concluding remarks. 
2   Discovering Relevant Genes Using a Discriminant Fuzzy 
Microarray Descriptor 
Input space reduction is often the key phase in the building of an accurate classifier 
[6]. Based on a novel fuzzy discretization method working as the retrieval stage in the 
GENECBR system [7], it is possible to represent any microarray by means of its gen-
eralized fuzzy microarray descriptor (FMD). This descriptor is a comprehensible 
representation for each gene expression level in terms of one from the following lin-
guistic labels: LOW, MEDIUM and HIGH. Moreover, from a set of FMDs the method is 
also able to construct a prototype, known as a fuzzy pattern (FP), which characterizes 
and summarizes the most relevant values of gene expression levels within a given set 
of microarrays. A FP is a higher concept constructed from a set of FMDs. A fuzzy 
pattern can be viewed as a prototype of the set of FMDs from which it is constructed. 
Therefore, the fuzzy pattern can capture relevant and common information about the 
gene expression levels of these FMDs. The final goal of the proposed method is to 
select a reduced number of relevant and representative genes allowing other artificial 
intelligence techniques being able to tackle with this high-dimensional domain. 
As aforementioned, the proposed method employs a fuzzy codification for the gene 
expression levels of each microarray, based on the discretization of real gene expres-
sion data into a small number of fuzzy membership functions. The whole algorithm 
comprises of three main steps: (i) first, we discretize the gene expression levels into 
binary variables according to a supervised learning process generating several FMDs; 
then, (ii) a unique FP is generated from the patients belonging to each specific pathol-
ogy; finally, (iii) we discriminate between those genes belonging to the existing FPs 
and we select a subset of relevant genes in order to construct the final discriminant 
fuzzy pattern (DFP). When the algorithm finishes each microarray can be represented 
Applying GCS Networks to Fuzzy Discretized Microarray Data for Tumour Diagnosis 1097 
by a simplified fuzzy vector of common genes that we call FMDDFP. The details of the 
whole process can be found in [8]. 
In this work we plan to employ this new representation of the available samples 
(FMDDFP) as input data for a GCS network. The target goal of the GCS network is to 
group those patients that are most similar to a new one but only taking into account 
the genetic information provided by the previously selected genes (DFP vector de-
scriptor). 
3   GCS Networks for Clustering Biologically Homogeneous 
Patients 
GCS neural networks [9] constitute an extension to Kohonen’s self-organising maps 
[10], and are only one member in the family of self-organising incremental models. 
GCS networks have the advantage of being able to automatically construct the net-
work topology, and to support easy visualisation of semantic similarity in high-
dimensional data. More importantly, the extracted knowledge that is relevant to clus-
tering can provide meaningful explanations for the clustering process and useful in-
sight into the underlying domain. 
To illustrate the working model of the GCS network used in our experimentation, a 
two-dimensional space is used, where the cells (neurons) are connected and organized 
into triangles [11]. Each cell in the network is associated with a weight vector, w, of 
the same dimension as the number of relevant genes selected in the previous step (size 
of the DFP vector). At the beginning of the learning process, the weight vector of 
each cell is initialized with random values [11]. The basic learning process in a GCS 
network consists of topology modification and weight vector adaptations carried out 
in three steps.  
In the first step of each learning cycle, the cell c, with the smallest distance be-
tween its weight vector, wc, and the actual FMDDFP is chosen as the winner cell or 
best-match cell. The selection process is succinctly defined by using the Euclidean 
distance measure as indicated in Expression (1) where O denotes the set of cells 
within the structure at a given point in time. 
: ;DFP c DFP ic FMD w FMD w i O− ≤ − ∀ ∈  (1) 
The second step of the learning process consists of the adaptation of the weight 
vector, wc, of the winning cell, and the weight vectors, wn, of its directly connected 
neighbouring cells, Nc, by means of Equations (2) and (3). 
( 1) ( ) ( )
c c c DFP cw t w t FMD wε+ = + −  (2) 
( 1) ( ) ( );
n n n DFP n cw t w t FMD w n Nε+ = + − ∀ ∈  (3) 
where εc and εn represent the learning rates for the winner and its neighbours respec-
tively, belonging to the [0, 1] interval, and Nc stands for the set of direct neighbour 
cells of the winning cell, c. 
In the third step, a signal counter, τ, is assigned to each cell, which reflects how of-
ten a cell has been chosen as winner. Equations (4) and (5) define how the signal 
1098 F. Díaz et al. 
counter is updated with parameter α acting as a constant rate of counter reduction for 
the rest of the cells at the current learning cycle, t. 
( 1) ( ) 1
c c
t tτ τ+ = +  (4) 
( 1) ( ) ( );i i it t t i cτ τ α τ+ = − ≠  (5) 
Growing cell structures also modify the overall network structure by inserting new 
cells into those regions that represent large portions of the input data (genetically 
similar patients), or removing cells that do not contribute to the input data representa-
tion. The cell deletion policy has not been used in our work due to the lack of great 
amounts of data. The adaptation process is then performed after a fixed number of 
learning cycles of input presentations (epochs). Therefore, the overall structure of a 
GCS network is modified through the learning process by performing only cell inser-
tion. Equations (6), (7) and (8) define the rules that govern the insertion behaviour of 
the network. 
/ ; ,i i jjh i j Oτ τ= ∀ ∈∑  (6) 
: ;q iq h h i O≥ ∀ ∈  (7) 
: ;r q p q qr w w w w p N− ≥ − ∀ ∈  (8) 
Insertion starts with selecting the cell, which served the most often as the winner, 
on the basis of the signal counter, τ. The cell, q, with the highest relative counter 
value, h, is selected. The neighbouring cell, r, of q with the most dissimilar weight 
vector is determined using Expression (8). In this expression, Nq denotes the set of 
neighbouring cells of q. A new cell, s, is inserted between the cells q and r, and the 
initial weight vector, ws, of this new cell is set to the mean of the two existing weight 
vectors, wq and wr. Finally, the signal counters, τ, in the neighbourhood, Ns, of the 
newly inserted cell, s, are adjusted. The new signal counter values represent an ap-
proximation to a hypothetical situation where s would have been existing since the 
beginning of the process. 
An important issue in the network operation is the distance calculation between 
two cells or one cell and the actual FMDDFP. Every time the network needs to compute 
a distance between two nodes, Expression (1) is used. In this work, we propose to 
hybridise our GCS network by using biological knowledge for the distance computa-
tion. Given that each cell in the network have a weight vector, wc, representing their 
location in the input space (FMDDFP space) and that each position in wc stands for a 
gene expression value, we can use the similarity between linguistic labels (repre-
sented by fuzzy sets) as a measure of the relation between each point belonging to wc 
and the corresponding value in the FMDDFP vector. 
In order to explain how we calculate this correspondence we need to previously de-
fine the similarity between linguistic labels (represented by fuzzy sets). In this case, it 
has been considered that the fuzzy intersection of two fuzzy sets A and B (represented 
by its membership functions, μA and μB, respectively) is given by the application of 
the min operator to the two membership functions, namely, μA ∩ B = min {μA, μB,}. On 
the other hand, the cardinality operator can be replaced by the integral operator (see 
Applying GCS Networks to Fuzzy Discretized Microarray Data for Tumour Diagnosis 1099 
[8] for details). In this way, the metric Sim(A, B) varies between the values 0 (total 
dissimilarity) and 1 (total similarity). 
It is important to highlight that the final goal of our GCS network is to cluster all 
patients that are genetically similar given a selected group of genes (DFP vector de-
scriptor) and without taking into account their previous assigned classes. Our pro-
posed method aims to find new relations between the patients even now unknown. 
Therefore, it is possible and not contradictory to group together patients suffering 
different (but genetically related) diseases. Such a topology has the added advantage 
that inter-cluster distances can be precisely quantified. Since such networks contain 
explicit distance information, they can be used effectively to (i) represent an indexing 
structure which indexes sets of related patients and to (ii) serve as a similarity meas-
urement between individual patients. 
Every time a new microarray needs to be classified a new FMDDFP is constructed 
and presented to the trained CGS network. A sorted vector of pairs, S, holding the 
similarity of each selected patient with the new microarray is generated. In order to 
produce a new classification, a proportional weighted voting schema is proposed. For 
this purpose, we need to ponder the vote of each patient contained in vector S. In this 
case, a weight αj for each retrieved patient, kj, is calculated based on the position (pos) 
that it occupies in the vector S and the level of similarity with the target case, Simj. 
For this task, Expression (9) is used. 
| | 1
| | 1
2
(2 1)2
S
j j S posSimα
−
−
=
−
 (9) 
Therefore, the classification made by the GCS network when a target patient is pre-
sented to the system depends on both the number of selected patients (those geneti-
cally similar taking into account the genes belonging to the DPF vector descriptor) 
and the level of similarity with the target patient. The solution proposed by the system 
is the class corresponding to the disease with the highest score. 
As we can surmise, it is easy to introduce a rejection mechanism in the voting 
model. We simply use a threshold T to indicate whether the score received by the best 
matching class is sufficiently strong (passing quota). In the event that the score re-
ceived by the matching class is less than T, the target patient remains unclassified. 
4   Evaluation 
The goal of this section is to evaluate the performance of the GCS network in con-
junction with the dimensionality reduction technique based on the notion of FMDs. 
The GCS is trained and tested over an available set of 49 microarrays from the Hae-
matology Service of the University Hospital of Salamanca (Spain). 
Acute myeloid leukemia is a heterogeneous group of hematological cancers with 
marked differences in their response to chemotherapy. As in many other human can-
cers, the diagnosis and classification of AML have been based on morphological, 
cytochemical and immunophenotipic features. More recently, genetic features have 
helped to define biologically homogeneous entities within AML [12]. Karyotype is 
the most important independent prognostic factor and therefore the most useful pa-
rameter for stratifying patients into risk groups. Thus, the favorable outcome group is 
1100 F. Díaz et al. 
composed of well-defined subtypes in terms of cytogenetics: t(15;17), inv(16) and 
t(8;21) [13, 14]. In contrast, the correlation between morphologic characteristics, 
genetic abnormalities and prognostic features is more inconsistent in the remaining 
AML. Analysis of gene expression profiles of tumors using microarray technology 
has become a powerful tool for classifying hematopoietic neoplasms [15]. 
Bone marrow samples from 43 adult patients with newly de novo diagnosed AML 
were analyzed. All samples contained more than 80% blast cells. The median age was 
36 years (range 14-70 years). Patients were classified according to the WHO classifi-
cation into 4 subgroups: (i) 10 APL with t(15;17) confirmed by FISH studies with LSI 
PML/RARA probe (Vysis, Stuttgart, Germany), (ii) 4 AML with inv(16) confirmed 
by FISH analysis with LSI CBFB probe (Vysis); (iii) 7 acute monocytic leukemias 
and (iv) 22 non-monocytic AML without recurrent cytogenetic translocations. In 
addition to this data, a set of 6 samples from healthy persons are also available and 
they constitute the group of control. Each case (microarray experiment) stores 22,283 
expressed sequence tags (ESTs) corresponding to the expression level of thousands of 
genes. The data consisted of 1,091,867 scanned intensities. 
The employed methodology splits the available data within a test set and a training 
set with 1/3 and 2/3 of the whole observations, respectively. In order to asses the 
maximum number of nodes of the GCS network an strategy of cross-validation is 
used, concretely a three fold stratified cross validation. In each round, each fold of the 
original training set is used to estimate the predictive accuracy of the GCS network 
which has been trained from the rest of folds. The mean error (and its standard error) 
for each number of nodes of the GCS are depicted in Figure 1, which shows the train-
ing and evaluation errors. As it can be seen, the configuration of 6 nodes as maximum 
number of cells of the GCS networks involves the minimal value of the error in the 
evaluation sets, so this value was used to train a GCS network from the training set. 
 
Fig. 1. CV#3 error of GCS network v.s. maximum number of cells at the parameter estimation 
Before the training of the GCS network, a reduced number of features (genes) are 
selected using the FMD representation of the available microarrays. Specifically, the 
original number of 22,283 ESTs per microarray was reduced to only 165 meaningful 
ESTs. After the training of the GCS network (with a maximum number of cells equal 
to 6) over the training set (with 34 observations) the classification error of the GCS 
network over training was a 8.82 % and a 6.67% over the test set (with 15 observa-
tions). 
Applying GCS Networks to Fuzzy Discretized Microarray Data for Tumour Diagnosis 1101 
Finally, the following considerations can be made from the clustering performed 
by the GCS network. The control group (samples from healthy persons) can be ade-
quately differentiated from patients with any kind of AML (see Figure 2). The APL 
and AML-mono groups are also well differentiated from the rest of AML groups, and 
there is some kind of overlapping among the AML-inv and the AML-other groups. It 
must be remembered that the AML-other group is the uncertain area at the current 
knowledge in the field of AML. The APL group is a kind of AML well characterized 
(morphologically, cytogenetically and genetically), whereas the AML-mono or AML-
inv are possible kinds of AML which are partially characterized. Finally, the AML-
other is no characterized in any way. Therefore, at the present state-of-the-art, the 
given classification can present mistakes and it is possible that some samples from the 
AML-other group belongs to the AML-mono, AML-inv or new subtypes of AML. 
 
Fig. 2. Final mapping between patients and cells in the GCS network space 
5   Conclusions 
This work explores the capabilities of a growing cell structure neural network to  
discover relevant knowledge for clustering patients suffering for acute myeloid leu-
kemia. A key advantage of the proposed method is that it allows incorporating bio-
logical meaningful information to the network operation in the form of a gene-based 
distance metric. Moreover, the GCS network makes use of a previous successful 
fuzzy discretization method for data reduction on microarray data domain. 
Using self-organising GCS networks to meaningfully cluster filtered microarray 
data has a number of appealing features over other approaches. For example, incre-
mental self-construction, and easy visualisation of biological relationships among the 
input data. The explanations of the clustering process carried out by the network can 
be addressed by means of our DFP vector. The most relevant knowledge for each 
cluster can be highlighted, and provide meaningful explanations about the clustering 
process and useful insight into the underlying problem and data. The experimental 
results show that with only a small subset of the genes belonging to a microarray, the 
performance of the network in terms of the clustering accuracy rate raises to 100%. 
1102 F. Díaz et al. 
References 
1. Piatetsky-Shapiro, G., Tamayo, P.: Microarray data mining: facing the challenges. ACM 
SIGKDD Explorations Newsletter, Vol. 5 (2). (2003) 1–5 
2. Cho, S.B., Won, H.H.: Machine learning in DNA microarray analysis for cancer classifi-
cation. Proc. of the First Asia-Pacific Bioinformatics Conference, (2003) 189–198 
3. Ochs, M.F., Godwin A.K.: Microarrays in Cancer: Research and Applications. BioTech-
niques, Vol. 34. (2003) s4–s15 
4. Xiang, Z.Y., Yang, Y., Ma, X., Ding, W.: Microarray expression profiling: Analysis and ap-
plications. Current Opinion in Drug Discovery & Development, Vol. 6 (3). (2003) 384–395 
5. Golub, T.: Genome-Wide Views of Cancer. The New England Journal of Medicine, Vol. 
344. (2001) 601–602 
6. Cakmakov, D., Bennani, Y.: Feature selection for pattern recognition. Informa Press, 
(2003) 
7. Díaz, F., Fdez-Riverola, F., Corchado, J. M.: GENE-CBR: a Case-Based Reasoning Tool 
for Cancer Diagnosis using Microarray Datasets. Computational Intelligence. ISSN 0824-
7935. In Press. 
8. Fdez-Riverola, F., Díaz, F., Borrajo, M.L., Yáñez, J.C., Corchado, J.M.: Improving Gene 
Selection in Microarray Data Analysis using fuzzy Patterns inside a CBR System. Proc. of 
the 6th International Conference on Case-Based Reasoning, (2005) 191–205 
9. Fritzke, B.: Growing Self-organising Networks – Why?. Proc. of the European Sympo-
sium on Artificial Neural Networks, (1993) 61–72 
10. Kohonen, T.: Self-Organising Maps. Springer-Verlag, (1995) 
11. Fritzke, B.: Growing Cell Structures - A Self-organizing Network for Unsupervised and 
Supervised Learning. Technical Report, International Computer Science Institute, Berke-
ley, (1993) 
12. Vardiman, W., Harris, N.L., Brunning, R.D.: The World Health Organization (WHO) 
classification of the myeloid neoplasms. Blood, Vol. 100. (2002). 2292–2302 
13. Grimwade, D.,  Walker, H.,  Oliver, F., Wheatley, K., Harrison, C.,  Harrison, G.,  Rees, 
J., Hann, I.,  Stevens, R., Burnett, A.,  Goldstone, A.:  The importance of diagnostic cyto-
genetics on outcome in AML: analysis of 1,612 patients entered into the MRC AML 10 
trial. Blood, Vol. 92. (1998) 2322–2333 
14. Slovak, M.L., Kopecky, K.J., Cassileth, P.A., Harrington, D.H., Theil, K.S., Mohamed, 
A., Paietta, E., Willman, C.L., Head, D.R., Rowe, J.M., Forman, S.J., Appelbaum, F.R.: 
Karyotypic analysis predicts outcome of preremission and postremission therapy in adult 
acute myeloid leukemia: a Southwest Oncology Group/Eastern Cooperative Oncology 
Group Study. Blood, Vol. 96. (2000). 4075–4083 
15. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C.,  Gaasenbeek, M., Mesirov, J.P., Coller, 
H., Loh, M.L.,  Downing, J.R.,  Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.:  Molecu-
lar classification of cancer: class discovery and class prediction by gene expression moni-
toring. Science, Vol. 286. (1999) 531–537