Tokenising, Stemming and Stopword Removal on Anti-spam Filtering Domain

Méndez, Jose R.; Iglesias, E. L.; Fernández Riverola, Florentino; Díaz Gómez, Fernando; Corchado Rodríguez, Juan Manuel

Keywords

Computer Science

Publication date

2006

Publisher

Springer Science + Business Media

Citation

Current Topics in Artificial Intelligence. 11th Conference of the Spanish Association for Artificial Intelligence, CAEPIA 2005, Santiago de Compostela, Spain, November 16-18, 2005, Revised Selected Papers. Lecture Notes in Computer Science, Volume 4177, pp. 449-458.

Abstract

Junk e-mail detection and filtering can be considered a cost-sensitive classification problem. Nevertheless, the preprocessing methods and noise reduction strategies used to improve computational efficiency in text classification are not necessarily as effective in e-mail filtering. This is demonstrated here through a comparative study of stopword removal, stemming and different tokenising schemes. The final goal is to preprocess the training e-mail corpora of several content-based techniques for spam filtering (machine learning approaches and case-based systems). Sound conclusions are drawn from experiments covering different scenarios.
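
To make the three preprocessing steps named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. The stopword list, the suffix list, the two tokenising schemes and every function name are assumptions chosen for illustration; the suffix stripper in particular is a toy stand-in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "to", "of", "is", "are",
             "you", "your", "for", "in"}

# Hypothetical suffix list for a toy stemmer; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenise_alnum(text):
    """Scheme 1: lowercase, then keep runs of letters and digits only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tokenise_whitespace(text):
    """Scheme 2: lowercase, then split on whitespace, keeping punctuation."""
    return text.lower().split()


def remove_stopwords(tokens):
    """Drop every token that appears in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def toy_stem(token):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, tokenise=tokenise_alnum, drop_stopwords=True, stem=True):
    """Tokenise a message, then optionally remove stopwords and stem."""
    tokens = tokenise(text)
    if drop_stopwords:
        tokens = remove_stopwords(tokens)
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    return tokens


message = "Claim your FREE prizes now, you are the winner!"
print(preprocess(message))
print(preprocess(message, tokenise=tokenise_whitespace,
                 drop_stopwords=False, stem=False))
```

Each step is a switch on `preprocess`, so the same message can be run through several configurations and the resulting token lists compared, which is the kind of scenario-by-scenario comparison the abstract describes.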

URI

https://hdl.handle.net/10366/135055

ISBN

978-3-540-45914-9 (Print) / 978-3-540-45915-6 (Online)

ISSN

0302-9743 (Print) / 1611-3349 (Online)

Appears in collections