Applying CBR Systems to Micro Array Data Classification

Sara Rodríguez, Juan F. De Paz, Javier Bajo, and Juan M. Corchado

Departamento de Informática y Automática, Universidad de Salamanca, Plaza de la Merced s/n, 37008, Salamanca, Spain {srg,fcofds,jbajope,corchado}@usal.es

Summary. Microarray technology allows the expression levels of thousands of genes to be measured in a single experiment. This technology requires computational solutions capable of dealing with great amounts of data, as well as techniques to explore the data and extract the knowledge needed to classify patients. This paper presents a system based on case-based reasoning (CBR) for the automatic classification of leukemia patients from microarray data. The system incorporates novel data mining algorithms for filtering and classification, as well as for knowledge extraction. The system has been tested and the results obtained are presented in this paper.

Key words: Case-based Reasoning, HG U133, dendrogram, leukemia classification, decision tree

1 Introduction

Progress in biomedicine [1] [24] and the incorporation of computational and artificial intelligence techniques have led to great advances in the detection and diagnosis of many illnesses. Microarray technology allows the expression levels of thousands of genes to be measured in a single experiment. It has been adopted by the research community for the study of a wide range of biological processes and makes diagnosis possible; currently it is widely used [23] for diagnosing cancers such as leukemias. The technique studies RNA chains, identifying the level of expression of each gene studied. It consists of hybridizing a sample from a patient and colouring the cellular material with a special dye. This produces different levels of luminescence that can be analyzed and represented as a data array [24]. These methods and tools need to work with expression arrays containing a large number of data points.
Specifically, the HG U133 Plus 2.0 chips are used for this kind of analysis. These chips analyze the expression level of over 47,000 transcripts and variants, including 38,500 well-characterized human genes, and comprise more than 54,000 probe sets and 1,300,000 distinct oligonucleotide features. The HG U133 Plus 2.0 provides multiple, independent measurements for each transcript; multiple probes yield a complete data set with accurate, reliable, reproducible results from every experiment. Eleven pairs of oligonucleotide probes are used to measure the level of transcription of each sequence represented on the GeneChip Human Genome Focus Array.

The process of studying a microarray is called expression analysis and consists of a series of phases: data collection, data preprocessing, statistical analysis, and biological interpretation. The analysis itself consists basically of three stages: normalization and filtering; clustering and classification; and knowledge extraction. These stages can be automated and included in a CBR [15] system. The first step is critical to achieve both a good normalization of the data and an initial filtering that reduces the dimensionality of the data set [5]. Moreover, the choice of a clustering technique allows data to be grouped according to the variables that dominate the behaviour of each group [6]. After organizing the data into groups, it is possible to extract the necessary information about the most significant probes that characterize each cluster. Based on this information, new individuals can be associated with a cluster. Finally, experts can learn from the analysis process. For some time now, we have been working on the identification of techniques to automate the reasoning cycle of several CBR systems applied to complex domains [22] [15].
Microarray analysis makes it possible to distinguish subclasses of a disease and to identify expression patterns associated with the disease through its genes; these patterns can then be used to classify types of leukemia. This paper presents a CBR system that facilitates the analysis and classification of microarray data from patients with leukemia. Leukemia, or blood cancer, is a disease with a significant potential for cure if detected early [4]. The model aims to improve cancer classification based on microarray data using CBR. The system presented in this paper uses a model that takes advantage of three methods for analyzing microarray data: a technique for filtering the data, a clustering technique, and a method for extracting knowledge.

The paper is structured as follows: the next section presents the problem that motivates this research, i.e., the classification of leukemia patients from samples obtained through microarrays. Sections 2 and 3 describe the proposed CBR model and how it is adapted to the problem under consideration. Finally, Section 4 presents the results and conclusions obtained after testing the model.

2 CBR System for Classifying Micro Array Data

The developed CBR tool receives data from the analysis of the chips and is responsible for classifying individuals based on evidence and existing data. Case-based reasoning is a type of reasoning based on the use of past experiences [7]. CBR systems solve new problems by adapting solutions that have been used to solve similar problems in the past, and they learn from each new experience. The primary concept when working with CBR is the concept of a case.
A case can be defined as a past experience, and is composed of three elements: a problem description, which delineates the initial problem; a solution, which provides the sequence of actions carried out in order to solve the problem; and the final state, which describes the state achieved once the solution was applied. A CBR system manages cases (past experiences) to solve new problems. The way cases are managed is known as the CBR cycle, and consists of four sequential phases: retrieve, reuse, revise and retain. The retrieve phase starts when a new problem description is received; in this phase a similarity algorithm is normally used to find the most similar cases in the case memory. In our case study, this phase instead conducts a filtering of variables, recovering the variables that are most influential for the classification, as explained in Section 2.1. Once the most important variables have been retrieved, the reuse phase begins, adapting the solutions of the retrieved cases to obtain the clustering; once this grouping is accomplished, the next step is to determine the group membership of the new individual to be evaluated. The revise phase consists of an expert revision of the proposed solution, and finally, the retain phase allows the system to learn from the experiences obtained in the three previous phases, updating the case memory accordingly.

2.1 Retrieve

Contrary to what usually happens in CBR, our case study is unique in that the number of variables is much greater than the number of cases. This leads to a change in the way the CBR functions: instead of recovering cases at this stage, important variables are retrieved. Traditionally, only the cases similar to the current problem are recovered, often for reasons of performance, and then adapted. In our case study the problem is not the number of cases but the number of variables.
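The four-phase cycle described above, with variable retrieval in place of case retrieval, can be sketched as a minimal skeleton. This is an illustration of the structure only; the class and method names are assumptions, not the paper's actual implementation.

```python
import numpy as np

class CBRClassifier:
    """Minimal CBR cycle skeleton (illustrative, not the authors' code)."""

    def __init__(self):
        self.case_memory = []          # list of (expression_vector, label) pairs
        self.selected_probes = None    # probe indices chosen in the retrieve phase

    def retrieve(self, filter_fn, data):
        # Here "retrieval" selects informative variables (probes), not similar
        # cases, because variables vastly outnumber cases in this domain.
        self.selected_probes = filter_fn(data)
        return self.selected_probes

    def reuse(self, cluster_fn, data):
        # Cluster the cases on the retained probes and place the new case.
        return cluster_fn(data[:, self.selected_probes])

    def revise(self, proposed_label, expert_label):
        # An expert confirms or corrects the proposed classification.
        return proposed_label == expert_label

    def retain(self, case, label):
        # Learn from the experience: store the validated case.
        self.case_memory.append((case, label))
```

A concrete filtering function and clustering function would be plugged in for `filter_fn` and `cluster_fn`, as described in the following sections.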
For this reason variables are retrieved at this stage and then, depending on the variables identified, the other stages of the CBR cycle are carried out. This phase is broken down into five steps, described below:

RMA: The RMA (Robust Multi-array Average) [8] algorithm is frequently used for pre-processing Affymetrix microarray data. RMA consists of three steps: (i) background correction; (ii) quantile normalization (the goal of which is to make the distribution of probe intensities the same for all arrays); and (iii) expression calculation, performed separately for each probe set n. To obtain an expression measure we assume that for each probe set n, the background-adjusted, normalized and log-transformed intensities, denoted Y, follow a linear additive model

Y_ijn = μ_in + α_jn + ε_ijn,  i = 1, ..., I,  j = 1, ..., J,  n = 1, ..., N,  with Σ_j α_jn = 0   (1)

where α_jn is a probe affinity effect, μ_in represents the log2-scale expression level for array i, and ε_ijn is an independent, identically distributed error term with mean 0. Median polish [21] is used to obtain estimates of these values.

Control and error: During this step, all probes used for testing hybridization are eliminated. These probes are of no relevance when individuals are classified, as they are merely control points that should contain the same values for all individuals; if they have different values, the case should be discarded. The control probes are therefore not useful for grouping individuals. On occasion, some of the measurements made during hybridization may be erroneous (not so the control variables); in this case, the erroneous probes marked during the execution of RMA must be eliminated.

Variability: Once both the control and the erroneous probes have been eliminated, the filtering begins. The first step is to remove the probes with low variability, according to the following steps:

1.
Calculate the standard deviation of each probe j:

σ·j = sqrt( (1/N) Σ_{i=1}^{N} (x̄·j − x_ij)² )   (2)

where N is the total number of individuals, x̄·j is the population mean of variable j, and x_ij is the value of probe j for individual i.

2. Standardize the values obtained:

z_j = (σ·j − μ) / σ   (3)

where μ is the mean of the deviations σ·j over all probes and σ is their standard deviation, so that z_j ∼ N(0, 1).

3. Discard the probes for which z < −1.0, given that P(z < −1.0) = 0.1587. This removes about 16% of the probes if the deviations follow a normal distribution.

Uniform distribution: Finally, all remaining variables that follow a uniform distribution are eliminated. Variables that follow a uniform distribution do not allow the separation of individuals; therefore, the variables that do not follow this distribution are the really useful ones for classifying the cases. The hypothesis test used is the Kolmogorov–Smirnov [13] test:

D = max { D⁺, D⁻ }   (4)

where

D⁺ = max_{1≤i≤n} { i/n − F₀(x_i) },  D⁻ = max_{1≤i≤n} { F₀(x_i) − (i−1)/n }

with i the rank of the observation, n the number of items, and F₀(x_i) the probability of observing values less than x_i when H₀ is true. The test statistic is compared with the critical value

D_α = C_α / k(n)   (5)

where, in the special case of the uniform distribution, k(n) = √n + 0.12 + 0.11/√n, and for a significance level α = 0.05, C_α = 1.358.

Correlations: In the last step of the filtering process, correlated variables are eliminated so that only independent variables remain. To this end, Pearson's linear correlation coefficient is calculated and the probes meeting the following condition are eliminated:

r_{x·i x·j} > α   (6)

with α = 0.95 and

r_{x·i x·j} = σ_{x·i x·j} / (σ_{x·i} σ_{x·j}),  σ_{x·i x·j} = (1/N) Σ_{s=1}^{N} (x̄·i − x_si) (x̄·j − x_sj)

where σ_{x·i x·j} is the covariance between probes i and j.
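The variability, uniformity, and correlation filters above can be sketched with NumPy and SciPy. The thresholds follow the text (z < −1.0, α = 0.05, correlation 0.95); the greedy order in which correlated probes are pruned is an assumption, as the paper does not specify it.

```python
import numpy as np
from scipy import stats

def filter_probes(X, z_cut=-1.0, ks_alpha=0.05, corr_cut=0.95):
    """X: cases x probes expression matrix. Returns indices of retained probes."""
    n_cases, n_probes = X.shape
    keep = np.arange(n_probes)

    # 1) Variability: standardize the probe deviations, drop probes with z < z_cut
    sd = X.std(axis=0)
    z = (sd - sd.mean()) / sd.std()
    keep = keep[z >= z_cut]

    # 2) Uniformity: Kolmogorov-Smirnov test against the uniform distribution;
    # keep only probes whose values reject uniformity (p <= alpha)
    non_uniform = []
    for j in keep:
        col = X[:, j]
        span = col.max() - col.min()
        if span == 0:
            continue                      # degenerate probe, drop it
        _, p = stats.kstest((col - col.min()) / span, 'uniform')
        if p <= ks_alpha:
            non_uniform.append(j)
    keep = np.array(non_uniform, dtype=int)

    # 3) Correlation: greedily drop one probe of each highly correlated pair
    retained = []
    for j in keep:
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) <= corr_cut
               for k in retained):
            retained.append(j)
    return retained
```

On a toy matrix with a constant probe, two independent probes and one probe that is a linear copy of another, only the two informative, independent probes survive.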
2.2 Reuse

Once the data have been filtered and standardized using the data mining techniques above, the system produces a set of values x_ij with i = 1...N, j = 1...s, where N is the total number of cases and s the number of final probes. The next step is to cluster the individuals based on the proximity of their probes. Since the problem on which this study is based contained no prior classification with which training could take place, an unsupervised classification technique was used. There is a wide range of possibilities in data mining. Some of these techniques are artificial neural networks such as SOM [9] (self-organizing maps), GNG [10] (growing neural gas), which results from the union of CHL [11] (competitive Hebbian learning) and NG [12] (neural gas), and GCS [10] (growing cell structures). There are other techniques with lower computational cost that provide efficient results; among them are the dendrogram and the PAM method [16] (Partitioning Around Medoids). A dendrogram [17] is an agglomerative hierarchical method with a graphical representation that facilitates the interpretation of results and provides an easy way to establish groups without fixing their number in advance. The PAM method, in contrast, requires the number of clusters to be selected before its execution.

Dendrograms are hierarchical methods that initially define one cluster for each available case. At each stage the method joins the two clusters at the smallest distance and computes the distance between the new cluster and all the others, updating the distance matrix accordingly. The process finishes when only one cluster remains (agglomerative method). The distance metric used in this paper is average linkage: it calculates the average distance over each pair of nodes from the two groups, and merges the groups based on these distances.
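This agglomerative average-linkage procedure can be sketched with SciPy. The toy data and the final cut into a fixed number of groups are illustrative assumptions, not the paper's data or code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy expression matrix: 6 patients x 4 probes, forming two obvious groups
X = np.array([[1.0, 1.1, 0.9, 1.0],
              [1.1, 1.0, 1.0, 0.9],
              [0.9, 1.0, 1.1, 1.0],
              [5.0, 5.1, 4.9, 5.0],
              [5.1, 5.0, 5.0, 4.9],
              [4.9, 5.0, 5.1, 5.0]])

# Build the dendrogram bottom-up: each case starts as its own cluster and the
# two clusters at the smallest *average* pairwise distance are merged
Z = linkage(X, method='average')   # 'average' linkage = UPGMA

# Cut the tree into a chosen number of groups; the first three patients end
# up sharing one label and the last three the other
groups = fcluster(Z, t=2, criterion='maxclust')
```

The linkage matrix `Z` is also what a plotting routine such as `scipy.cluster.hierarchy.dendrogram` would consume to draw the tree.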
The metric is known as the unweighted pair group method using arithmetic averages (UPGMA) [18]. Once the dendrogram has been generated, the error rate is calculated bearing in mind the previous cases. If the accuracy rate is above 80%, knowledge extraction using the CART (Classification and Regression Tree) [19] algorithm is carried out, and finally the new case is classified. CART is a non-parametric method that extracts rules explaining the classification carried out in the previous steps. There are other techniques for generating decision trees, such as the methods based on ID3 trees [20], although the most widely used at present is CART. This method generates rules and extracts the most important variables for classifying patients, with high performance.

2.3 Revise and Retain

The revision is carried out by an expert who determines whether the group assigned by the system is correct. If the assignation is considered correct, the retrieve and reuse phases are carried out again so that the system is ready for the next classification.

3 Case Study

In the case study presented in the framework of this research, 232 samples are available from analyses performed on patients, obtained either through bone marrow punctures or blood samples. The aim of the tests performed is to determine whether the system is able to classify new patients based on the previously analyzed and stored cases. Figure 1 shows a scheme of the bio-inspired model intended to resolve the problem described in Section 2. The proposed model follows the procedures performed in medical centres. As can be seen in Figure 1, a previous phase, external to the model, consists of a set of tests carried out by the laboratory personnel which allow us to obtain data from the chips. The chips are hybridized and explored by means of a scanner, obtaining information on the marking of several genes based on their luminescence.
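The CART rule-extraction step of Section 2.2 can be sketched with scikit-learn's decision tree, which implements a CART-style learner. The toy data and feature names are illustrative assumptions; the paper does not specify the tree's parameters.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: 8 patients x 3 probes, with cluster labels from a previous
# clustering step; probe 1 is the one that actually separates the groups
X = np.array([[0.2, 1.0, 0.5],
              [0.3, 1.2, 0.4],
              [0.1, 0.9, 0.6],
              [0.2, 1.1, 0.5],
              [0.3, 4.0, 0.5],
              [0.2, 4.2, 0.4],
              [0.1, 3.9, 0.6],
              [0.2, 4.1, 0.5]])
labels = np.array([1, 1, 1, 1, 2, 2, 2, 2])

# Fit a CART-style tree on the cluster assignment
tree = DecisionTreeClassifier(criterion='gini').fit(X, labels)

# The learned split rules expose the most discriminative probe, which is the
# kind of knowledge the system extracts for the experts
rules = export_text(tree, feature_names=['probe0', 'probe1', 'probe2'])
```

A new patient is then classified by running it through the extracted rules, i.e. `tree.predict(...)`.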
At that point, the CBR-based model starts to process the data obtained from the microarrays. The retrieve phase receives an array with a patient's data as input information. It should be noted that there is no filtering of patients, since that is the task of the researcher: the retrieve step filters genes, never patients. The aim of this phase is to reduce the search space in order to find data from previous cases which are similar to the current problem. The set of patients is represented as D = {d_1, ..., d_t}, where d_i ∈ IR^n represents patient i and n represents the number of probes taken into consideration. As explained in Section 2.1, during the retrieve phase the data are normalized by the RMA algorithm [8] and the dimensionality is reduced bearing in mind, above all, the variability, distribution and correlation of the probes. This phase removes any information not considered meaningful for performing the classification. The new set of patients is defined through s variables: D′ = {d′_1, ..., d′_t}, d′_i ∈ IR^s, s ≤ n.

[Fig. 1. Proposed CBR model]

The reuse phase uses the information obtained in the previous step to classify the patient into a leukemia group. The patients are first grouped into clusters. The data coming from the retrieve phase consist of a group of patients D′ = {d′_1, ..., d′_t} with d′_i ∈ IR^s, s ≤ n, each characterized by a set of meaningful attributes d′_i = (x_i1, ..., x_is), where x_ij is the luminescence value of probe j for patient i. In order to create clusters and consequently obtain patterns to classify the new patient, the reuse phase implements the hierarchical clustering method (dendrogram) explained in Section 2.2.
The system classifies the patients by taking into account their proximity and their density, in such a way that the result is a set G = {g_1, ..., g_r}, r < s, with g_i ⊂ D and g_i ∩ g_j = ∅ for i ≠ j, i, j ≤ r. The set G is composed of a group of clusters, each of them containing patients with a similar disease. The clusters are constructed by taking into account the similarity between the patients' meaningful symptoms. Once the clusters have been obtained, the accuracy rate is calculated; if it is greater than 80%, the clustering and the extraction of knowledge are carried out. The new patient is defined as d′_{t+1}, and his or her membership in a group is determined following the classification tree of Section 2.2. The result of the reuse phase is a group of clusters G = {g_1, ..., g′_i, ..., g_r}, r < s, where g′_i = g_i ∪ {d′_{t+1}}.

An expert from the Cancer Institute is in charge of the revision process. This expert determines whether g′_i = g_i ∪ {d′_{t+1}} can be considered correct. In the retain phase the system learns from the new experience: if the classification is considered successful, the patient is added to the case memory D = {d_1, ..., d_t, d_{t+1}}.

4 Results and Conclusions

This paper has presented a CBR system which allows automatic cancer diagnosis for patients using data from microarrays. The model combines techniques for the reduction of the dimensionality of the original data set with a clustering method and knowledge extraction. The system works in a way similar to how human specialists operate in the laboratory, but it is able to work with great amounts of data and make decisions automatically, thus significantly reducing both the time required to make a prediction and the rate of human error due to confusion. The CBR system presented in this work focuses on identifying the important variables for each of the variants of blood cancer, so that patients can be classified according to these variables.
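The 80% accuracy gate applied before knowledge extraction can be sketched as follows. The matching of clusters to known classes by majority vote is an assumption for illustration; the paper does not detail how agreement with previous cases is computed.

```python
import numpy as np

def accuracy_gate(cluster_labels, known_labels, threshold=0.80):
    """Check the >80% agreement gate before extracting knowledge.

    Each cluster is matched to a known class by majority vote (an assumed
    matching rule), and the fraction of cases agreeing with their cluster's
    majority class is compared against the threshold.
    """
    cluster_labels = np.asarray(cluster_labels)
    known_labels = np.asarray(known_labels)
    correct = 0
    for c in np.unique(cluster_labels):
        members = known_labels[cluster_labels == c]
        # count the members agreeing with the cluster's majority class
        _, counts = np.unique(members, return_counts=True)
        correct += counts.max()
    return correct / len(known_labels) >= threshold
```

If the gate fails, the system would fall back to expert revision instead of extracting rules from an unreliable clustering.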
In the study of leukemia on the basis of microarray data, the filtering process acquires special importance. In the experiments reported in this paper, we worked with a database of bone marrow cases from 212 adult patients with five types of leukaemia. The retrieve stage of the proposed CBR system presents a novel technique to reduce the dimensionality of the data. The total number of variables selected in our experiments was reduced to 785, which increased the efficiency of the probe clustering. In addition, the selected variables produced a classification similar to that already achieved by the experts from the laboratory of the Institute of Cancer. The error rates remained fairly low, especially for the cases where the number of patients was high. To try to reduce the dimensionality of the data further, we applied principal component analysis (PCA) [14], retaining the factors with eigenvalues over 1. A total of 93 factors were generated, accounting for 96% of the variability. However, this reduction of the dimensionality was not appropriate for obtaining a correct classification of the patients. Figure 2a shows the classification performed for patients from all the groups. On the left it is possible to observe the groups identified in the classification process; interspersed cases represent individuals whose classification differs from the previous one. As shown in Figure 2a, the number of misclassified individuals is low. Having checked that the retrieved probes allow the patients to be classified in a way similar to the original one, we can conclude that the retrieve phase works satisfactorily. Then, the extraction of knowledge is carried out bearing in mind the selected probes. The algorithm used was CART [19] and the results obtained are shown in Figure 2b. The proposed model resolves this problem by using a technique that detects the genes of importance for the classification of diseases by analysing the available data.
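The eigenvalues-over-1 (Kaiser) criterion mentioned above can be sketched with NumPy, working on the correlation matrix so that eigenvalues are comparable with 1. This is an illustration of the criterion, not the authors' PCA code.

```python
import numpy as np

def pca_kaiser(X):
    """Return the principal-component scores whose eigenvalues exceed 1
    (Kaiser criterion), plus the fraction of variance they explain."""
    # standardize the variables so PCA runs on the correlation matrix,
    # where the eigenvalue-over-1 rule is meaningful
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Xs, rowvar=False))
    order = np.argsort(eigvals)[::-1]          # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0
    explained = eigvals[keep].sum() / eigvals.sum()
    return Xs @ eigvecs[:, keep], explained
```

On data with two underlying factors, only two components survive the criterion while still accounting for most of the variability, mirroring the 93-factor / 96% figure reported above on a smaller scale.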
As demonstrated, the proposed system allows the reduction of the dimensionality by filtering out the genes with little variability and those that, due to the distribution of the data, do not allow a separation of individuals.

[Fig. 2. Classification obtained.]

It also presents a clustering technique based on hierarchical methods. The results obtained from the empirical studies are promising and highly appreciated by the specialists from the laboratory, as they are provided with a tool that allows both the detection of the genes and variables that are most important for identifying the pathology, and the facilitation of a reliable classification and diagnosis, as shown by the results presented in this paper.

Acknowledgments

Special thanks to the Institute of Cancer for the information and technology provided.

References

1. Shortliffe E, Cimino J (2006) Biomedical Informatics: Computer Applications in Health Care and Biomedicine. Springer, Berlin Heidelberg New York
2. Tsoka S, Ouzounis C (2000) Recent developments and future directions in computational genomics. FEBS Letters 480(1):42–48
3. Lander E et al. (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921
4. Rubnitz J, Hijiya N, Zhou Y, Hancock M, Rivera G, Pui C (2005) Lack of benefit of early detection of relapse after completion of therapy for acute lymphoblastic leukemia. Pediatric Blood & Cancer 44(2):138–141
5. Armstrong N, van de Wiel M (2004) Microarray data analysis: From hypotheses to conclusions using gene expression data. Cellular Oncology 26(5-6):279–290
6. Quackenbush J (2001) Computational analysis of microarray data. Nature Review Genetics 2(6):418–427
7. Kolodner J (1993) Case-Based Reasoning. Morgan Kaufmann
8. Irizarry R, Hobbs B, Collin F, Beazer-Barclay Y, Antonellis K, Scherf U, Speed T (2003) Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data.
Biostatistics 4:249–264
9. Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biological Cybernetics 59–69
10. Fritzke B (1995) A growing neural gas network learns topologies. In: Tesauro G, Touretzky D, Leen T (eds) Advances in Neural Information Processing Systems 7. Cambridge MA, 625–632
11. Martinetz T (1993) Competitive Hebbian learning rule forms perfectly topology preserving maps. In: ICANN'93: International Conference on Artificial Neural Networks. Springer, Amsterdam, 427–434
12. Martinetz T, Schulten K (1991) A neural-gas network learns topologies. In: Kohonen T, Makisara K, Simula O, Kangas J (eds) Artificial Neural Networks. Amsterdam, 397–402
13. Brunelli R (2001) Histogram Analysis for Image Retrieval. Pattern Recognition 34:1625–1637
14. Jolliffe I (2002) Principal Component Analysis, 2nd edn. Springer Series in Statistics
15. Fdez-Riverola F, Díaz F, Corchado J (2006) Gene-CBR: a case-based reasoning tool for cancer diagnosis using microarray datasets. Computational Intelligence 22(3-4):254–268
16. Kaufman L, Rousseeuw P (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York
17. Saitou N, Nei M (1987) The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4:406–425
18. Sneath P, Sokal R (1973) Numerical Taxonomy: The Principles and Practice of Numerical Classification. W.H. Freeman, San Francisco
19. Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and Regression Trees. Wadsworth International Group, Belmont, California
20. Quinlan J (1979) Discovering rules by induction from large collections of examples. In: Michie D (ed) Expert Systems in the Micro Electronic Age. Edinburgh University Press, Edinburgh, 168–201
21. Holder D, Raubertas R, Pikounis V, Svetnik V, Soper K (2001) Statistical analysis of high density oligonucleotide arrays: a SAFER approach.
In Proceedings of the ASA Annual Meeting, Atlanta, GA
22. Corchado J, Corchado E, Aiken J, Fyfe C, Fdez-Riverola F, Glez-Bedia M (2003) Maximum Likelihood Hebbian Learning Based Retrieval Method for CBR Systems. In: Proceedings of the 5th International Conference on Case-Based Reasoning, 107–121
23. Quackenbush J (2006) Microarray Analysis and Tumor Classification. The New England Journal of Medicine, 2463–2472
24. Zhenyu C, Jianping L, Liwei W (2007) A multiple kernel support vector machine scheme for feature selection and rule extraction from gene expression data of cancer tissue. Artificial Intelligence in Medicine 41:161–175