A Kernel Method for Classification Donald MacDonald1, Jos Koetsier1, Emilio Corchado1, Colin Fyfe1 and Juan Corchado2 1School of Information and Communication Technologies The University of Paisley, High Street, Paisley, PA1-2BE, Scotland. corc-ci0@paisley.ac.uk 2Departamento de Informática y Automática, Universidad de Salamanca, Spain. Abstract. Kernel Maximum Likelihood Hebbian Learning Scale Invariant Maps is a novel technique developed to facilitate the clustering of complex data effectively and effi- ciently and that is characterised for converging remarkably quickly. Kernels were origi- nally derived in the context of Support Vector Machines which identify the smallest num- ber of data points necessary to solve a particular problem, such as regression and classifi- cation. The combination of Maximum Likelihood Hebbian Learning Scale Invariant Map and the Kernel Space provides a very smooth scale invariant quantisation which can be used as a clustering technique. The efficiency of this method have been used to analyse an oceanographic problem. 1 Introduction This paper presents an efficient technique for data clustering. Kernel Maximum Like- lihood Hebbian Learning Scale Invariant Map (K-MLSIM) is based on a modification of a new type of topology preserving map that can be used for scale invariant classification [6]. Kernel models were first developed within the context of Support Vector Machines [16]. Support Vector Machines attempt to identify a small number of data points (the support vectors) which are necessary to solve a particular problem to the required accuracy. Kernels have been successfully used in the unsupervised investigation of structure in data sets[15] [11] [9]. Kernel methods map a data set into a Feature space using a nonlinear mapping. Then typically a linear operation is per- formed in the feature space; this is equivalent to performing a nonlinear operation on the original data set. The Scale Invariant Map is an implementation of the negative feedback network to form a topology preserving mapping. A kernel method is applied in this paper to an extension of the Scale Invariant Map (SIM) which is based on the application of the Maximum Likelihood Hebbian Learning (MLHL) method [4]and its possibilities are explored. The proposed methodology groups cases with similar structure, identifying clusters automatically in a data set in an unsupervised mode. The method has been successfully used to improve the reasoning process of a distrib- uted oceanographic system developed for monitoring and predicting toxic episodes in coastal waters. 2. Kernel Scale Invariant Map This section reviews the techniques used to construct the K-MLSIM method. 2.1 Scale Invariant Map Consider a network with N dimensional input data and having M output neurons. Then the activation of the ith output neuron is given by j N j iji xwact ∑ = = 1 (1) Now if we invoke a competition between the output neurons, it is possible to have a number of different competitions between output neurons, two obvious examples are: Type A: The neuron with greatest activation wins. Type B: The neuron closest to the input vector wins. The Kohonen network typically uses the second since the first requires specific re- normalisation to ensure that all neurons have a chance of winning a competition. We have shown that the scale invariant mapping produced by the new network does not require any additional competition limiting procedure when using the first criterion. In both cases, the winning neuron, the pth, is deemed to be maximally firing (=1) and all other output neurons are suppressed(=0). Its firing is then fed back through the same weights to the input neurons as inhibition. jallforwxe pjjj ___1.−← (2) where p is the winning neuron. Now the winning neuron excites those neurons close to it i.e. we have a neighbourhood function ),( jpΛ which ),(),( kpjp Λ≤Λ for all kpjpkj −≥−:, where . is the Euclidean norm. In the simulations described in this paper, we use a Gaussian whose radius is decreased during the course of the simulation. Then simple Hebbian learning gives )).(,().,( pjjtjtij wxipeipw −Λ=Λ=∆ ηη (3) where we have used xj as the activation of the jth input neuron and wij is the weight between this and the ith output neuron. For the pth winning neuron, the network is performing simple competitive learning but note the direct effect the pth output neu- ron's weight has on the learning of other neurons. This algorithm introduces competi- tion into the same network used in [5] to perform a Principal Component Analysis and in [7] to perform an Exploratory Projection Pursuit. 2.2 Kernel K-means Clustering We will follow the derivation of [14] who has shown that the k means algorithm can be performed in Kernel space. The basic idea of the set of methods known as kernel methods is that the data set is transformed into a nonlinear feature space ( )( xx )φφ →: . Any linear operation now performed in this feature space is equiva- lent to a nonlinear operation in the original space. The aim is to find k means, m , so that each point is close to one of the means. First we note that each mean may be described as lying in the manifold spanned by the observations, µ )( ixφ i.e. ∑=µ i ii xwm )(φµ . Now the k means algorithm chooses the means, , to minimise the Euclidean distance between the points and the clos- est mean µm ( ) ( ) ( ) ( ) ( ) ( ji ji ji i ii i ii kwwkwkwm xxxx,xx,xxx ,2 , 2 2 ∑∑∑ +−=−=− µµµµµ φφφ ) (4) i.e. the distance calculation can be accomplished in Kernel space by means of the K matrix alone. Let Miµ be the cluster assignment variable. i.e. Miµ=1 if )( ixφ is in the µth cluster and is 0 otherwise. [14] initialises the means to the first training patterns and then each new training point, ktt >++ 1),( 1xφ , is assigned to the closest mean and its cluster assignment variable calculated using ( )    ≠∀−<−= +++ otherwise xxif M ttt _0 ,mm)(__1 11 ,1 αµφφ µα α (5) In terms of the kernel function (noting that k(x,x) is common to all calculations) we have ( ) ( ) ( )    ≠∀−< − = ∑∑ ∑∑ + otherwise kwxxkww kwkwwif M ii iji jiji ii ijijji i t _0 ,2, 2),(__1 , , ,1 αµµµµ ααα α xx, xx,xx (6) We must then update the mean, mα to take account of the (t+1)th data point ( )( )tttt x αα φζα mmm −+= ++ 11 (7) where we have used the term m to designate the updated mean which takes into account the new data point and 1+t α ∑+= += 1 1 , ,1 t i i t M M α αζ (8) Now (7) may be written as ( ) ( ) ( ) ( )   −+= ∑∑∑ ++ i iiti i t ii i t i www xxxx φφςφφ ααα 11 (9) which leads to an update equation of   += +≠−=+ 1__ 1__)1(1 tifor tiforw w t it i ζ ζα α (10) 2.2.1 Kernel Self Organising Map We have previously used the above analysis to derive a Self Organising Map [10] in Kernel space. The SOM algorithm is a k means algorithm with an attempt to distrib- ute the means in an organised manner and so the first change to the above algorithm is to update the closest neuron's weights and those of its neighbours. Thus we find the winning neuron (the closest in feature space) as above but now instead of (6), we use ( ) µµαµ ∀Λ=+ ,,,1tM (11) where α is the identifier of the closest neuron and ( )µα ,Λ is a neighbourhood func- tion which in the experiments reported herein was a gaussian. Thus the winning neu- ron has a value of M=1 while the value of M for other neurons decreases monotoni- cally with distance (in neuron space) away from the winning neuron. For the experi- ments reported in this paper, we used a one dimensional vector of output neurons numbered 1 to 20 or 30. The remainder of the algorithm is exactly as reported in the previous section. The Kernel SOM was reported in [10][2] to have very fast convergence and results in that paper were based on Gaussian kernels. In [2], we showed that the reason for this very fast convergence was the interaction between the learning method (a combina- tion of one-shot learning and incremental decay) and the fact that we are working (for linear kernels) in data space but using the data points as basis vectors for that space. We now investigate the application of similar methods to the Scale Invariant Map. 2.2.2 The Kernel Scale Invariant Map We will, in this Section, consider only linear kernels though the extension to other kernels is straightforward. Since every point in data space can be represented by a linear combination of the training data set { }Nii ,...,1, =x , we have ∃ for all x in the data space. Similarly the weight vectors can be represented in the same way. Thus we can represent ∑== Ni iii vxv 1: x j N j jwy x∑ = = 1 µµ (12) as ∑∑∑∑ == j i jii j i jii kvxxvy )( xxµµµ (13) 2.2.2.1 The Competition Now we wish to have a competition to find out which output will win for a particular input, xl, and so the first question to be addressed is the nature of the competition. If we are working in data space with the above overcomplete basis, then every member of the training set is representable as a vector which is all zeros except for the lth posi- tion which is set to 1. Therefore we identify the lth column of the kernel matrix, and determine the output which wins the competition to represent xl as µµµµµµα vk lli i i xxvy maxargmaxargmaxarg === ∑ (14) where we have used kl as the vector from the lth column of the kernel matrix. 2.2.2.2 The Weight Update We have experimented with two methods: 1. The Kernel method of the previous section. With the notation above, we have ( )   = ≠−= lifor liforv v ii __ __1 ζ ζα α (15) with ζ defined as for the Kernel SOM. This continues to be one-shot learning with a subsequent gradual decay. The neural method used with the standard SIM. Note that since we are working in the space spanned by the data as basis vectors, the input vector has a zero everywhere but the lth position and we are subtracting vα ( α being the winning neuron). Thus αvxe −= (16) in this basis. We then apply the standard learning rule so that ( ) ii ev µαηµ ,Λ=∆ (17) Even though this method is an iterative method as is usual in neural methods, we find that only a few (less than 10, often just 2 or 3) iterations through a data set are enough to form the mapping. 2.3 Maximum Likelihood Hebbian Learning We have previously [2][10] considered a general cost function associated with a negative feedback PCA network. ( ){ }21 WyEJ T −= x (18) If the residual after the feedback has probability density function )||exp(1)( p Z p ee −= (19) Then we can denote a general cost function associated with this network as KpJ p +=−= ||)(log ee (20) where K is a constant. Finding the minimum of J corresponds to finding the maxi- mum of p(e) i.e. we are maximising the probability that the residual comes from a particular distribution. We do this by adjusting the weights. Therefore performing gradient descent on J we have Tp signpy W J W JW ))(||( 1 eee e −≈∂ ∂ ∂ ∂−=∂ ∂−∝∆ (21) We would expect that for leptokurtotic residuals (more kurtotic than a Gaussian dis- tribution), values of p < 2 would be appropriate, while platykurtotic residuals (less kurtotic than a Gaussian), values of p > 2 would be appropriate. We have previously [8] [3] [13]shown that this network can perform Exploratory Projection Pursuit. 2.3.1 Application to Kernel Scale Invariant Map Now the SIM was originally derived by introducing competition to the negative feed- back PCA network. Therefore we introduce the Maximum Likelihood Hebbian learn- ing concept of the last section to the Scale Invariant Map (SIM). Consider the feed- back in 16. Let us present a particular input, xl= (0,..,0,1,0,..0) which has 1 in the lth position, to the network. Then if neuron α wins the competition, it is because the weights vα have a high dot product with the elements of l's column of the k matrix; either vαl is large or vαk (for input k which will be grouped with l) is large. Thus the residuals after feedback will have a bimodal distribution - either the residuals will tend towards 0 or the residuals will be large. This suggests maximising the likelihood that the residuals come from a sub-gaussian distribution; therefore, we proposed the learning rules ( ) ( ) 2, jjj signv eeΛ µαηµ =∆ (22) This has an interesting effect on the learning rules in that a pie slice of the data is learned but the actual positions of the neuron centres themselves (when transformed back into data space) lie outside the data set. This enables a very smooth scale invari- ant quantisation as shown in Figure 1. The v weight vectors are shown in Figure 2. The data set, drawn uniformly from [-1,1]*[-1,1], is shown by the crosses. The weights of the converged Kernel Scale Invariant Map have been joined to form al- most a circle. The corresponding vector in data space based on the data as basis vec- tors is shown in Figure 2 each line of the diagram represents the weights of one out- put neuron in terms of the data points (the v weights in fact). Fig. 1. The data set is shown by the red crosses. It was drawn uniformly from[-1,1]*[-1,1]. The weights of the converged KSIM have been joined to form almost a circle. Fig. 2. The v weights as represented in the data basis. Each line is the weight vector into an output neuron and is shown. .2.3.2 The p parameter We have found that even a very small departure from the standard K-SIM parameter (with p=2) gives a visible change to the representation of the data set. In the top left of Figure 3, we show the converged weights after 7 iterations of the K-SIM algorithm when p=1.8; the weights are well enclosed in the data. In the top right of that figure, we show the weights when p=2.1 was used; the weights are beginning to move out- side the data set. In the bottom figure, we show what happens when p=2.5. Now the weights are well outside the data. Fig.3: The left figure shows the weights after 7 iterations of the K-SIM algorithm with p=1.8 on the standard artificial data set. The one in the middle shows the weights with p=2.1. The right figure shows the weights when p=2.5. Consider the situation in which there are n points in the pie slice won by yα . Without loss of generality let us write wα=(a1},a2,...,an,0,0,...,0) i.e. the vector wα has non- zero components corresponding to the points (in the training set), x1,…, xn while its components corresponding to xn+1…xN are all zero. Then ( ) ( ) 1, −=∆ psignw eeΛ µαηα (23) Let us consider only the situation in which the points x1,…,xn are presented to the network and so the output yα is the winner; thus ( )µα ,Λ =1. Let point x1 be pre- sented and so x=(1,0,0,...,0). Then e = (1-a1,-a2,...,-an, 0,...,0) (24) When point x2 is presented, x=(0,1,0,...,0) and e = (-a1,1-a2,...,-an, 0,...,0) (25) We will consider the effect of the update rules on this weight vector for different values of p. p=1. Focus now on the first element of wα, the element of the weight vector linking input x1 to output yα, then at convergence ( ) ( )( ) 011 ==∆ esignEwE α . Clearly if a1<0, sign(e1) is always positive. Therefore a1>0 .Thus there can only be two non- zero elements in this vector. Impossible. p=2. This is the standard K-SIM. Then at convergence ( ) ( )( ) 01111 ==∆ eesignEwE α and so 1-a1 + (n-1)*(-a1)=0 (26) Thus a1=1/n. This argument applies equally to all non-zero elements of wα and so   = 0,...,0,1,...,1,1 nnnα w which when we translate back to the original basis means that the centre of the KSIM is given by nnnn c xxx 1...11 21 +++=α (27) the mean of the data points for which neuron α is responsible for representing. p=3. At convergence ( ) ( )( ) 02111 ==∆ eesignEwE α and so, for 00. This implies 1 )1(1 1 −+ = n a The effect of varying p with n=10 is shown in Figure 4. We see that for values close to 1, a1 remains close to 0, while for values approaching 2, we reach the value. Subsequently, a1 continues to rise but shows signs of levelling off (right figure). Fig. 4: As we vary p from 1 to 2, a1 climbs slowly to 0.1 (left figure) and grows more slowly subsequently (right figure). 3. Clustering Real Data using the K-MLSIM In the following section we detail the results of the K-MLSIM on the red tides data. The K-MLSIM is specially suited to be used on the red tides data because of the flexibility of the p-parameter. Changing this parameter affects the update of the cen- tres and it determines how far each centre will move. This property allows the selec- tion of different clustering combinations, penalising, or accentuating the representa- tion of outliers in the clustering. In the case of the red tides data, as each instance of a red tide would be considered an outlier, it is necessary for us to use an algorithm to deal with these important data points in an appropriate manner. In this section we will show that the K-MLSIM is a powerful method that can extract the relevant clustering information. We have shown previously that the ε-insensitive SIM [12] in data space, which is equivalent to using a p value of 0, will place each centre in the median of the cluster. This property was extended by [1] to use different values of p to achieve different clustering combinations. The larger the value of p the greater the contribution outliers will have in the clustering. The effect that this parameter has on the clustering is important as it allows us to change the clustering to be more representative of the properties of the red tides data as it contains small numbers of outliers which are very important for the classifica- tion. If we were to cluster with a contemporary algorithm, we would likely be left with fewer centres that identify instances of red tides than we would like. In contrast the K-SIM can be tailored to give us more centres identifying red tides, resulting in a more expressive clustering. Figure 6 shows the clusters produced by the K-MLSIM with different values of the p parameter. This parameter will penalize the effect of outliers in the data. This results in the centres being placed in the median of the data cloud. Fig. 6.a Fig. 6.b Fig. 6.c Fig. 6: clustering of K-SIM on red tides data using 100 centres and a value of p = 0 (Fig.6a), a value of p = 0.5 (Fig 6b) and a value of p = 2 (Fig 6c). As we can see from Figure 6a, the clustering is a compact coding with at most 800 data samples being assigned to one centre. This coding is less interesting for the red tides data as the abnormal points are those which are important, and so we wish to find a clustering which promotes rather than penalises them. In the figure 6b we use a value of p = 0.5 which places more emphasis on the larger changes in the weights, which in turn means that outliers, or abnormal data, will have greater effect on the learning. Thus we can ensure that the data points representing red tides will be strongly represented in the clustering, which is ideal for this problem. In Figure 6b we can see that there is a more sparse representation as there is a greater emphasis placed on large differences between the winner and the input data. In this figure we can see that the dense clusters are assigned fewer centres and that less dense clusters that contain outliers are more favoured. Assigning more centres to the less dense clusters also allows us to get a better representation within clusters. Commonly in contempo- rary clustering, one centre in this sparse region of the data would capture a mixture of red tides and spurious outliers. As we are now placing more centres in these regions, we have a better chance to separate the true red tides from other outliers. Figure 6c. shows the clustering of the K-MLSIM on the red tides data using a p value of 2. The K-MLSIM is more penalizing to small changes in the weights than with p = 0.5. As can be seen from figure 6c it has provided an even more sparse representation of the clustering. This is a far more suitable clustering of the red tides data where we used smaller values of p. By changing the value of p in the weight update rule the K- MLSIM can be adapted to penalize or promote outliers in its clustering. 4 Conclusions We have demonstrated a new technique for clustering. Of interest too is the fact that the method allows investigation of the nonlinear projection matrix K that readily reveals when a new situation behaves similarly, which may be very important in the identification of toxin episodes in coastal water. References 1. Corchado E. and Fyfe C. The Scale Invariant Map and Maximum Likelihood Hebbian Learning. KES2002. Sixth International Conference on Knowledge-Based Intelligent Information Engineering Systems. Italy. (2002) 2. Corchado E. and Fyfe C. Relevance and kernel self-organising maps. In International Conference on Artificial Neural Networks, ICANN2003. (2003) 3. Corchado E., MacDonald D. and Fyfe C. Optimal projections of high dimensionald data. In IEEE International Conference on Data Mining, ICDM02. (2002) 4. Corchado E., MacDonald D. and Fyfe C. Maximum and Minimum Likelihood Hebbian Learning for Exploratory Projection Pursuit. Data Mining and Knowledge Discovery. In Press. 5. Fyfe C. PCA properties of interneurons. In From Neurobiology to Real World Computing, ICANN 93, (1993) pp. 183-188. 6. .Fyfe C. A scale-invariant feature map. Network:Computation in Neural Systems, 7: 269-275,1996. 7. Fyfe C. and Baddeley R. Non-linear data structure extraction using simple hebbian networks. Biologi- cal Cybernetics, (1995) 72(6):533-541. 8. Fyfe C., Baddeley R. and McGregor D.R. Exploratory Projection Pursuit: An Artificial Neural Net- work Approach, University of Strathclyde Research report/94/160. (1994). 9. Fyfe C. and Corchado J. M. Automating the construction of CBR Systems using Kernel Methods. International Journal of Intelligent Systems. Vol 16, No. 4, April 2001. ISSN 0884-8173. 10. Fyfe C. and MacDonald D. Epsilon-insensitive Hebbian Learning. Neurocomputing, (2002) 47:35-57. 11. Fyfe, C., MacDonald, D., Lai, P. L., Rosipal, R. and Charles, D. Unsupervised Learning with Radial Kernels in Recent Advances in Radial Basis Functions, Editors R. J. Howlett and L. C. Jain, Elsevier. (2000). 12. MacDonald D. Unsupervised Neural Networks for the Visualisation of Data. PhD Thesis, University of Paisley. (2002). 13. MacDonald D., Corchado E., Fyfe C. and Merenyi E. Maximum and Minimum Likelihood Hebbian Learning for Exploratory Projection Pursuit. In International Conference on Artificial Neural Net- works, ICANN2002. (2002). 14. Scholkopf B.,The Kernel Trick for Distances. Technical report, Microsoft Research, May 2000. 15. Scholkopf B., Smola A. and Muller K. R. Nonlinear component analysis as a kernel eigenvalue prob- lem. Neural Computation, (1998) 10:1299-1319. 16. Vapnik V. The nature of statistical learning theory, Springer Verlag. (1995)