CBR System for Diagnosis of Patients 
 
 
Juan F. De Paz, Sara Rodríguez, Javier Bajo and Juan M. Corchado 
Departamento de Informática y Automática, Universidad de Salamanca 
Plaza de la Merced s/n, 37008, Salamanca, España 
{fcofds, srg, jbajope, corchado}@usal.es 
 
 
Abstract 
 
Microarray technology allows to measure the 
expression levels of thousands of genes in an 
experiment. The use of computational methods is 
fundamental in cancer research. One of the 
possibilities is the use of Artificial Intelligence 
techniques. Several of these techniques have been used 
to analyze expression arrays. This paper presents a 
Case-based reasoning (CBR) system for automatic 
classification of leukemia patients from microarray 
Data. The system incorporates novel algorithms for 
data mining that allow to filter and classify as well as 
extraction of knowledge. The system has been tested 
and the results obtained are presented in this paper. 
 
1. Introduction 
Microarray has become an essential tool in genomic 
research, making it possible to investigate global gene 
expression in all aspects of human disease [1]. 
Microarray technology is based on a database of gene 
fragments called expressed sequence tags (ESTs), 
which are used to measure target abundance using the 
scanned intensities of fluorescence from tagged 
molecules hybridized to ESTs [2]. Specifically, the HG 
U133 plus 2.0 are chips used for this kind of analysis. 
These chips analyze the expression level of over 
47.000 transcripts and variants, including 38.500 well-
characterized human genes. It is comprised of more 
than 54.000 probe sets and 1.300.000 distinct 
oligonucleotide feature. The HG U133 plus 2.0 
provides multiple, independent measurements for each 
transcript. Multiple probes mean you get a complete 
data set with accurate, reliable, reproducible results 
from every experiment. 
The process of studying a microarray is called 
expression analysis and consists of a series of phases: 
data collection, data pre-processing, statistical analysis, 
and biological interpretation. These phases analysis 
consists basically of three stages: normalization and 
filtering; clustering and classification. These stages can 
be automated and included in a CBR [3] system. The 
first step is critical to achieve both a good 
normalization of data and an initial filtering to reduce 
the dimensionality of the data set with which to work 
[4]. Since the problem at hand is working with high-
dimensional arrays, it is important to have a good pre-
processing technique that facilitates automatic 
decision-making about the variables that will be vital 
for the classification process. In light of these decisions 
it will be possible to reduce the original dataset. 
Moreover, the choice of a clustering technique allows 
data to be grouped according to certain variables that 
dominate the behaviour of the group. After organizing 
into groups it is possible to extract of knowledge and 
classify patients within the group which presents the 
most similarities. 
This technology has been adopted by the research 
community for the study of a wide range of biologic 
processes allowing carry out diagnosis. Currently, it is 
being very used [1] for diagnosing of cancer such as 
Leukemias. Leukemia, or blood cancer, is a disease 
that has a significant potential for cure if detected early 
[5]. The relationship between the chromosomal 
alterations and prognosis of leukemia and lymphomas 
is well established. Recently, conventional array-based 
expression profiling has demonstrated that 
chromosomal alterations are associated with distinctive 
patterns of expression. Leukemia is a blood cancer 
form, originating in a malfunctioning bone marrow that 
tends to produce abnormal red and white cells at an 
increased rate. The system proposed in the context of 
this work focuses on the detection of carcinogenic 
patterns in the data from microarrays for patients, and 
is constructed from a CBR system that provides a 
classification technique based on previous experiences. 
For some time now, we have been working on the 
identification of techniques to automate the reasoning 
cycle of several CBR systems applied to complex 
domains [3]. The objective of this work is to develop a 
CBR system that allows the identification of patients 
with various types of cancer. The model aims to 
improve the cancer classification based on microarray 
data. The system proposed in this paper presents a new 
synthesis that brings several artificial intelligence 
subfields together (filter techniques, clustering, 
artificial neural networks and extraction of knowledge). 
The retrieval, reuse, revision and learning stages of the 
CBR system use these techniques to facilitate the CBR 
adaptation to the domain of biological discovery with 
microarray datasets. Specifically, the system presented 
in this paper uses a model which takes advantage of 
two novel methods for analyzing microarray data: a 
technique for filtering data, and a technique ESOINN 
[25] for clustering. The first method combines various 
filtering techniques to dramatically reduce the 
dimensionality of the data. The second one allows 
clustering by incorporating both the distribution 
process of the entire surface of classification, and the 
separation between groups with low density among 
them.  
The paper is structured as follows: The next section 
presents the problem that motivates this research, i.e., 
the classification of leukemia patients from samples 
obtained through microarrays. Section 2 describe the 
proposed CBR model and how it is adapted to the 
problem under consideration. Finally, Section 3 
presents the results and conclusions obtained after 
testing the model. 
2. CBR System for Classifying Microarray 
Data 
The CBR developed tool receives data from the 
analysis of chips and is responsible for classifying of 
individuals based on evidence and existing data. The 
purpose of case-based reasoning (CBR) is to solve new 
problems by adapting solutions that have been used to 
solve similar problems in the past [6]. The primary 
concept when working with CBRs is the concept of 
case. A case can be defined as a past experience, and is 
composed of three elements: A problem description 
which describes the initial problem, a solution which 
provides the sequence of actions carried out in order to 
solve the problem, and the final state which describes 
the state achieved once the solution was applied. A 
CBR manages cases (past experiences) to solve new 
problems. The way cases are managed is known as the 
CBR cycle, and consists of four sequential steps which 
are recalled every time that a problem needs to be 
solved: retrieve, reuse, revise and retain. Each of the 
steps of the CBR life cycle requires a model or method 
in order to perform its mission. The algorithms selected 
for the retrieval of cases should be able to search the 
case base and to select from it the most similar 
problems, together with their solutions, to the new 
problem. In our case study, it conducted a filtering of 
variables, recovering important variables of the cases 
to determine the most influential in the conduct 
classification. Once the most important variables have 
been retrieved, the reuse phase begins, adapting the 
solutions for the retrieved cases to obtain the 
clustering. Once this grouping is accomplished, the 
next step is to extract the knowledge. The revise phase 
consists of an expert revision for the solution proposed, 
and finally, the retain phase allows the system to learn 
from the experiences obtained in the three previous 
phases, consequently updating the cases memory. 
2.1. Retrieve 
Traditionally, only the similar cases to the current 
problem are recovered, often because of performance, 
and then adapted. In the case study, the number of 
cases is not the problem, rather the number of 
variables. For this reason variables are retrieved at this 
stage and then, depending on the identified variables, 
the other stages of the CBR are carried out. This phase 
will be broken down into 6 stages which are described 
below:  
 
2.1.1. RMA: The RMA (Robust Multi-array Average) 
[7] algorithm is frequently used for pre-processing 
Affymetrix microarray data. RMA consists of three 
steps: (i) Background Correction; (ii) Quantile 
Normalization (the goal of which is to make the 
distribution of probe intensities the same for arrays); 
and (iii) Expression Calculation: performed separately 
for each probe set n.  
 
2.1.2. Control: During this phase, all probes used for 
testing hybridization are eliminated. These probes have 
no relevance at the time when individuals are 
classified, as there are no more than a few control 
points which should contain the same values for all 
individuals. If they have different values, the case 
should be discarded. Therefore, the probes control will 
not be useful in grouping individuals. 
 
2.1.3. Errors: On occasion, some of the measures 
made during hybridization may be erroneous; not so 
with the control variables. In this case, the erroneous 
probes that were marked during the implementation of 
the RMA must be eliminated. 
. 
2.1.4. Variability: Once both the control and the 
erroneous probes have been eliminated, the filtering 
begins. The first stage is to remove the probes that have 
low variability. This work is carried out according to 
the following steps: 
1. Calculate the standard deviation for each of the 
probes j 
   


N
j
ijjj x
N 1
2
··
1
  (1)   
Where N is the number of items total, j·  is the 
average population for the variable j, ijx is the 
value of the probe j for the individual i. 
2. Standardize the above values 
  

 

j
iz
·
 (2)   
where 


N
j
j
N 1
·
1
  and 
 


N
j
ijjj x
N 1
2
··
1
  where 
)1,0(Nzi   
3. Discard of probes for which the value of z meet 
the following condition: 0.1z . This will 
effect the removal of about 16% of the probes if 
the variable follows a normal distribution. 
 
2.1.5. Uniform distribution: Finally, all remaining 
variables that follow a uniform distribution are 
eliminated. The variables that follow a uniform 
distribution will not allow the separation of individuals. 
Therefore, the variables that do not follow this 
distribution will be really useful variables in the 
classification of the cases. The contrast of assumptions 
followed is explained below, using the Kolmogorov-
Smirnov [8] test as an example. H0: The data follow a 
uniform distribution; H1: The analyzed data do not 
follow a uniform distribution. Statistical contrast: 
    DDD ,max  (3)   
where 








 )(max 0
1
i
ni
xF
n
i
D





 



n
i
xFD i
ni
1
)(max 0
1
 with i as the pattern of 
entry, n the number of items and )(0 ixF  the 
probability of observing values less than i with H0 
being true. The value of statistical contrast is compared 
to the next value: 
  
)(nk
C
D    (4)   
in the special case of uniform distribution 
n
nnk
11.0
12.0)(  and a level of significance 
05.0  358.1C . 
 
2.1.6. Correlations: At the last stage of the filtering 
process, correlated variables are eliminated so that only 
the independent variables remain. To this end, the 
linear correlation index of Pearson is calculated and the 
probes meeting the following condition are eliminated. 
  
ji yx
r
··
 (5)   
being: 95.0  
ji
ji
ji
xx
xx
yxr
··
.·
·· 

 , 
  


N
s
sjjsiijxx xux
Ni 1
..
1
··
  Where 
ji xx .·
  is 
the covariance between probes i and j.  
2.2. Reuse 
Once filtered and standardized, the probes produce 
a set of values   with i = 1 ... N, j = 1 ... s where N is 
the total number of cases, s the number of end probes. 
The next step is to perform the clustering of individuals 
based on their proximity according to their probes. 
Since the problem on which this study is based 
contained no prior classification with which training 
could take place, a technique of unsupervised 
classification was used. There is a wide range of 
possibilities. Some of these techniques are artificial 
neural networks such as SOM [9] (self-organizing 
map), GNG [10] (Growing neural Gas) resulting from 
the union of techniques CHL [11] (competitive 
Hebbian Learning) and NG [12] (neural gas), GCS [11] 
(Growing Cell Structure), Growing Grid or the SOINN 
[13] (self-organizing incremental neuronal network). 
Some of the methods, such as self-organized Kohonen 
maps, set the number of clusters in the initial phase of 
training when using the algorithm of the k-means 
learning method. This is the reason that these methods 
cannot be used for the problem at hand, since in this 
case the number of clusters is unknown. However, the 
number of groups could be varied and the degree of 
waste compaction checked so that according to this 
value, the final number of groups could be set. This 
solution would require too much computing time and it 
would be difficult to limit the number of groups to 
include. The self-organized maps have other variants of 
learning methods that base their behaviour on methods 
similar to the NG. They create a mesh that is adjusted 
automatically to a specific area. The greatest 
disadvantage, however, is that both the number of 
neurons that are distributed over the surface and the 
degree of proximity are set beforehand, resulting in the 
number remaining constant throughout the entire 
training process, thus complicating, to a certain extend, 
the adaptation of the mesh. Unlike the self-organizing 
maps based on meshes, Growing Grid or GCS do not 
set the number of neurons, or the degree of 
connectivity, but they do establish the dimensionality 
of each mesh. This complicates the separation phase 
between groups once it is distributed evenly across the 
surface. 
After analyzing different techniques and checking 
the problems they might present so that they might be 
applied to the problem at hand, we have decided to use 
a variation of neural network SOINN [13], called 
ESOINN [14] (Enhanced  self-organizing incremental 
neuronal network). Unlike the SOINN, ESOINN 
consists of a single layer, so it is not necessary to 
determine the manner in which the training of the first 
layer changes to the second. With a single layer, 
ESOINN is able to incorporate both the distribution 
process along the surface and the separation between 
low density groups. The operation and training of the 
network presents many similarities with those used in 
GCS networks as far as distribution over the surface is 
concerned, but not as far as the dimensionality of the 
meshes. Nevertheless, it more closely resembles a 
merger between a CHL and a NG: it has characteristics 
of a network CHL in the initial phases of the algorithm, 
by which it could be understood as a phase of 
competition, while in a second phase, the network of 
nodes begins to expand just as with a NG network. 
This process is conducted in an iterative way until it 
reaches stability. Only the changes in training phase are 
detailed below: 
1. Update the weights of neurons by following a 
process similar to the SOINN, but introducing a 
new definition for the learning rate in order to 
provide greater stability for the model. This 
learning rate has produced good results in other 
networks such as SOM [17]. 
  ))((
111 1 aaa
WMnW    
))((1 ii aai WMnW    with 
iNi  
(6)   
Being
x
xn
1
)(1  , 
2
2
2
1
)(
x
xn

   
2. Delete the connections with higher age. The ages 
are typified and are removed those whose values 
are in the region of rejection with k>0. The value 
of   chosen is 0.05. 
3. If all input patterns have been passed then a KS-
Test [8] is carried out in order to determine if the 
density distribution for the neurons in each group 
follows a normal distribution. If so then the 
learning procedure is finished; otherwise the next 
pattern is processed. The value of   chosen is 
0.05. 
Once the clustering has been generated the 
extraction of knowledge using the CART [15]  
algorithm is carried out, and finally the new case is 
classified. The CART algorithm is a non parametric 
test that allows extracting rules that explain the 
classification carried out in the previous steps. There 
are others techniques to generate the decision trees, that 
is the case of the methods based on ID3 trees [16], 
although the most used currently is CART. This 
method allows to generate rules and to extract the most 
important variables to classify patients with high 
performance. 
2.3. Revise and Retain 
The revision is carried out by an expert who 
determines the correction with the group assigned by 
the system. If the assignation is considered correct, 
then the retrieve and reuse phases are carried out again 
so that the system is ready for the next classification. 
3. Results and Conclusions 
This paper has presented a CBR system which 
allows automatic cancer diagnosis for patients using 
data from microarrays. The model combines techniques 
for the reduction of the dimensionality of the original 
data set and a novel method of clustering for 
classifying patients. The system works in a way similar 
to how human specialists operate in the laboratory, but 
is able to work with great amounts of data and make 
decisions automatically, thus reducing significantly 
both the time required to make a prediction, and the 
rate of human error due to confusion. The CBR system 
presented in this work focused on identifying the 
important variables for each of the variants of blood 
cancer so that patients can be classified according to 
these variables.  
In the study of leukemia on the basis of data from 
microarrays, the process of filtering data acquires 
special importance. In the experiments reported in this 
paper, we worked with a database of bone marrow 
cases from 212 adult patients with five types of 
leukaemia. The retrieve stage of the proposed CBR 
system presents a novel technique to reduce the 
dimensionality of the data. The total number of probes 
in our experiments was reduced to 785, which 
increased the efficiency of the cluster probe. In 
addition, the selected variables resulted in a 
classification similar to that already achieved by 
experts from the laboratory of the Institute of Cancer. 
The error rates have remained fairly low especially for 
cases where the number of patients was high. To try to 
increase the reduction of the dimensionality of the data 
we applied principal components (PCA) [18], 
following the method of Eigen values over 1.  A total 
of 93 factors were generated, collecting 96% of the 
variability. However, this reduction of the 
dimensionality was not appropriate in order to obtain a 
correct classification of the patients, so this step was 
removed from the recovery phase. Figure 1 shows the 
classification performed for patients from groups MDS. 
As can be seen in Figure 1, represented in black, most 
of the people of the MDS group are together, 
coinciding with the previous classification given by the 
experts at the Institute of Cancer. Only a small portion 
of the individuals departed from the initial 
classification. Groups that have fewer individuals are 
those with a higher classification error. 
 
Figure 1. Classification obtained for MDS patients 
 
In a similar way we proceeded to evaluate the 
classification for the rest of the groups. Table 1 shows 
the total number of patients from each group and the 
number of misclassifications. As can be seen in Table 
1, groups with fewer patients are those with a greater 
error rate.  
 
Table 1. Classification errors numerical  
 Total Error 
ALL 10 3 
AML 49 11 
CLL 89 4 
CML 22 7 
MDS 42 5 
 
The final classification was compared with the data 
obtained using a dendogram [21] and PAM [20] 
(Partitioning Around Medoids). The proportion of 
errors in every group was calculated and the Kurskal-
Wallis [19] test was applied to determinate if the 
median of these proportions was equal. The results are 
shown in table 2. 
 
Table 2. Comparison of methods. * different 
median and = equal, (-) median of column less than 
median of row 
 CBR Dendogram PAM 
CBR    
Dendogram *(-)   
PAM *(-) *(-)  
 
Once checked that the retrieved probes allow 
classifying the patients in similar way to the original 
one, we can conclude that the retrieve phase works 
satisfactorily. Then, the extraction of knowledge is 
carried out bearing in mind the selected probes. The 
algorithm used was CART [12], and the results 
obtained are shown in Figure 2. 
n= 212  
 
node), split, n, loss, yval, (yprob) 
      * denotes terminal node 
 
 1) root 212 123 CLL (0.047 0.23 0.42 0.1 0.2)   
   2) x1555158_at>=0.125 90   4 CLL (0.033 0 0.96 0.011 0) * 
   3) x1555158_at< 0.125 122  73 AML (0.057 0.4 0.025 0.17 0.34)   
     6) x1555139_a_at>=0.375 44   7 AML (0.068 0.84 0.023 0.023 0.045) * 
     7) x1555139_a_at< 0.375 78  38 DSM (0.051 0.15 0.026 0.26 0.51)   
      14) x1554827_a_at< 0.625 30  12 CML (0.067 0.17 0.067 0.6 0.1)   
        28) x1554696_s_at>=0.375 8   3 AML (0.25 0.62 0 0 0.12) * 
        29) x1554696_s_at< 0.375 22   4 CML (0 0 0.091 0.82 0.091) * 
      15) x1554827_a_at>=0.625 48  11 MDS (0.042 0.15 0 0.042 0.77)   
        30) x1553244_at< 0.125 9   3 AML (0.22 0.67 0 0 0.11) * 
        31) x1553244_at>=0.125 39   3 MDS (0 0.026 0 0.051 0.92) * 
 
 
Figure 2. Extraction of knowledge 
 
The most important probes and their relevance in 
the classification of patients are extracted by means of  
this algorithm. In Figure 3 the most important probes in 
CLL are shown: 
 
Figure 3. Extraction of knowledge CLL 
 
The proposed model resolves this problem by using 
a technique that detects the genes of importance for the 
classification of diseases by analysing the available 
data. As demonstrated, the proposed system allows the 
reduction of the dimensionality based on the filtering of 
genes with little variability and those that do not allow 
a separation of individuals due to the distribution of 
data. It also presents a technique for clustering based in 
neuronal networks. The results obtained from empirical 
studies provide a tool that allows both the detection of 
genes and those variables that are most important for 
the detection of pathology, and the facilitation of a 
classification and reliable diagnosis, as shown by the 
results presented in this paper. 
Acknowledgments. Special thanks to the Institute of 
Cancer of Salamanca for the information and 
technology provided. 
4. References 
[1] J. Quackenbush, "Computational analysis of microarray 
data" Nature Review Genetics, vol. 2(6), (2001).  pp. 418-
427 
[2] R.J. Lipshutz, S.P.A. Fodor, T.R. Gingeras, D.H. 
Lockhart, "High density synthetic oligonucleotide arrays." 
Nature Genetics, Vol. 21, (1999) pp. 20-24 
[3] F. Riverola, F. Díaz, J. M. Corchado, "Gene-CBR: a 
case-based reasoning tool for cancer diagnosis using 
microarray datasets." Computational Intelligence, Vol. 22,  
(2006) pp 254-268 
[4] N.J. Armstrong, M.A. van de Wiel, "Microarray data 
analysis: From hypotheses to conclusions using gene 
expression data." Cellular Oncology, Vol. 26 (5-6), (2004)  
pp. 279-290 
[5] J.E. Rubnitz, N. Hijiya, Y. Zhou, M.L. Hancock, G.K. 
Rivera, C. Pui, "Lack of benefit of early detection of relapse 
after completion of therapy for acute lymphoblastic 
leukemia." Pediatric Blood & Cancer, Vol. 44 (2), (2005)  
pp. 138-141 
[6] J. Kolodner, "Case-Based Reasoning." Morgan 
Kaufmann (1993) 
[7] R.A. Irizarry, B. Hobbs, F. Collin, Y.D. Beazer-Barclay, 
K.J. Antonellis, U. Scherf, T.P. Speed, "Exploration, 
Normalization, and Summaries of High density 
Oligonucleotide Array Probe Level Data." Biostatistics, Vol. 
4 (2003) pp. 249-264 
[8] R. Brunelli, "Histogram Analysis for Image Retrieval." 
Pattern Recognition, Vol. 34, (2001) pp. 1625-1637 
[9] T. Kohonen, "Self-organized formation of topologically 
correct feature maps." Biological Cybernetics, (1982) pp. 59-
69 
[10] B. Fritzke, "A growing neural gas network learns 
topologies." Advances in Neural Information Processing 
Systems 7, (1995) pp. 625-632 
[11] T.  Martinetz, "Competitive Hebbian learning rule forms 
perfectly topology preserving maps." ICANN'93: 
International Conference on Artificial Neural Networks, 
(1993) pp. 427-434 
[12] T. Martinetz, K. Schulten, "A neural-gas network learns 
topologies." Artificial Neural Networks, (1991) pp. 397-402 
[13] F. Shen, "An algorithm for incremental unsupervised 
learning and topology representation." Tokyo: Ph.D. thesis. 
Tokyo Institute of Technology, (2006) 
[14] S. Furao, T. Ogura, O. Hasegawa, "An enhanced self-
organizing incremental neural network for online 
unsupervised learning." Neural Networks, Vol. 20, (2007). 
893-903 
[15] L. Breiman, J. Friedman, A. Olshen, C. Stone,  
"Classification and regression trees." Wadsworth 
International Group. (1984) 
[16] J. Quinlan, "Discovering rules by induction from large 
collections of examples." Expert systems in the micro 
electronic age, (1979) pp. 168-201 
[17] J.M. Corchado, J. Bajo, Y. De Paz, J.F. De Paz 
"Integrating Case Planning and RPTW Neuronal Networks to 
Construct an Intelligent Environment for Health Care." 
Expert Systems with Applications, In Press (2008) 
[18] I. Jolliffe, "Principal Component Analysis." Springer 
Series in Statistics (2002) 
[19] W. Kruskal, W. Wallis, "Use of ranks in one-criterion 
variance analysis." Journal  of American Statistics 
Association (1952) 
[20] L. Kaufman, P.J. Rousseeuw, "Finding Groups in Data: 
An Introduction to Cluster Analysis." Wiley, New York. 
(1990) 
[21] N. Saitou, M. Nie, "The neighbor-joining method: A 
new method for reconstructing phylogenetic trees." Mol. 
Biol, Vol. 4 (1987) pp. 406-425