Title
Video violence detection using pre-trained VGG19 combined with manual logic, LSTM layers and Bi-LSTM layers
Author(s)
Keywords
Convolutional Neural Networks (CNN)
Video violence detection
Physical aggression
Manual feature
Long Short Term Memory (LSTM)
UNESCO classification
1203.04 Artificial Intelligence
Publication date
2026-02-03
Publisher
Springer Nature
Citation
Negre, P., Alonso, R.S., Prieto, J. et al. Video violence detection using pre-trained VGG19 combined with manual logic, LSTM layers and Bi-LSTM layers. Appl Intell 56, 72 (2026). https://doi.org/10.1007/s10489-026-07122-3
Abstract
Video violence detection using artificial intelligence plays a key role in public safety applications. Although convolutional and recurrent neural networks are widely adopted for this task, the actual contribution of temporal modeling over strong frame-level representations remains insufficiently analyzed. This work provides a systematic study of video violence detection models under a unified experimental framework. We investigate whether violence can be reliably detected from individual frames without explicit temporal modeling, evaluate the effectiveness of combining CNNs with LSTM and Bi-LSTM layers, and analyze the impact of architectural and hyperparameter choices, including neuron configuration and backbone selection (VGG-16 vs. VGG-19). Experiments are conducted on three widely used benchmark datasets. Our results show that frame-level analysis using a pre-trained VGG-19 network, combined with a simple aggregation strategy, achieves competitive performance, reaching 95% accuracy on Hockey Fights and 96% on Violent Flow. While Bi-LSTM layers can provide moderate improvements of up to 4% over standard LSTM models in certain datasets, these gains are not consistent across all scenarios. Furthermore, variations in hyperparameter configurations do not systematically lead to improved performance. Overall, this study highlights that increased architectural complexity does not always translate into better results and that, in several cases, simple frame-based approaches can rival more complex temporal models. These findings provide practical insights into the cost–benefit trade-off of temporal modeling for video-based violence detection.
URI
ISSN
0924-669X
DOI
10.1007/s10489-026-07122-3
Publisher's version
Appears in collections
- BISITE. Articles [369]
Files in this item
Size:
2.513Mb
Format:
Adobe PDF
Description:
Main article