A Comparative Performance Study of Feature Selection Methods for the Anti-spam Filtering Domain

Méndez, Jose R.; Fernández Riverola, Florentino; Díaz Gómez, Fernando; Iglesias, E. L.; Corchado Rodríguez, Juan Manuel

Title

A Comparative Performance Study of Feature Selection Methods for the Anti-spam Filtering Domain

Author(s)

Méndez, Jose R.

Fernández Riverola, Florentino

Díaz Gómez, Fernando

Iglesias, E. L.

Corchado Rodríguez, Juan Manuel

Keywords

Computer Science

Publication date

2006/07

Publisher

Springer Science + Business Media

Citation

Advances in Data Mining. Applications in Medicine, Web Mining, Marketing, Image and Signal Mining. Lecture Notes in Computer Science, Volume 4065, pp. 106-120.

Abstract

In this paper we analyse the strengths and weaknesses of the most widely used feature selection methods in text categorization when they are applied to the spam problem domain. Several experiments with different feature selection methods and content-based filtering techniques are carried out and discussed. The Information Gain, χ²-test, Mutual Information and Document Frequency feature selection methods are analysed in conjunction with Naïve Bayes, boosting trees, Support Vector Machines and ECUE models in different scenarios. From these experiments, the underlying ideas behind feature selection methods are identified and applied to improve the feature selection process of SpamHunting, a novel anti-spam filtering software able to accurately classify suspicious e-mails.

URI

http://hdl.handle.net/10366/135054

ISBN

978-3-540-36036-0 (Print) / 978-3-540-36037-7 (Online)

ISSN

0302-9743 (Print) / 1611-3349 (Online)
