Video violence detection using pre-trained VGG19 combined with manual logic, LSTM layers and Bi-LSTM layers

Negre, Pablo; Alonso Rincón, Ricardo Serafín; Prieto Tejedor, Javier; García García, Óscar

doi:10.1007/s10489-026-07122-3

Keywords

Convolutional Neural Networks (CNN)

Video violence detection

Physical aggression

Manual feature

Long Short Term Memory (LSTM)

UNESCO classification

1203.04 Artificial Intelligence

Publication date

2026-02-03

Publisher

Springer Nature

Citation

Negre, P., Alonso, R.S., Prieto, J. et al. Video violence detection using pre-trained VGG19 combined with manual logic, LSTM layers and Bi-LSTM layers. Appl Intell 56, 72 (2026). https://doi.org/10.1007/s10489-026-07122-3

Abstract

Video violence detection using artificial intelligence plays a key role in public safety applications. Although convolutional and recurrent neural networks are widely adopted for this task, the actual contribution of temporal modeling over strong frame-level representations remains insufficiently analyzed. This work provides a systematic study of video violence detection models under a unified experimental framework. We investigate whether violence can be reliably detected from individual frames without explicit temporal modeling, evaluate the effectiveness of combining CNNs with LSTM and Bi-LSTM layers, and analyze the impact of architectural and hyperparameter choices, including neuron configuration and backbone selection (VGG-16 vs. VGG-19). Experiments are conducted on three widely used benchmark datasets. Our results show that frame-level analysis using a pre-trained VGG-19 network, combined with a simple aggregation strategy, achieves competitive performance, reaching 95% accuracy on Hockey Fights and 96% on Violent Flow. While Bi-LSTM layers can provide moderate improvements of up to 4% over standard LSTM models in certain datasets, these gains are not consistent across all scenarios. Furthermore, variations in hyperparameter configurations do not systematically lead to improved performance. Overall, this study highlights that increased architectural complexity does not always translate into better results and that, in several cases, simple frame-based approaches can rival more complex temporal models. These findings provide practical insights into the cost–benefit trade-off of temporal modeling for video-based violence detection.
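As an illustration of what a frame-level approach with a simple aggregation step might look like, the sketch below turns per-frame violence scores into a single video-level label. It is not the authors' method: the abstract does not specify the aggregation rule, so the function name `aggregate_frame_scores`, the two thresholds, and the "fraction of flagged frames" rule are all assumptions made for this example. The per-frame scores are assumed to come from some frame classifier (for instance, a pre-trained CNN with a binary head), which is not shown.

```python
from typing import Sequence


def aggregate_frame_scores(
    frame_scores: Sequence[float],
    frame_threshold: float = 0.5,       # assumed cut-off for flagging one frame
    min_violent_fraction: float = 0.5,  # assumed share of flagged frames needed
) -> bool:
    """Return True if the video is labelled violent.

    frame_scores holds one violence probability in [0, 1] per sampled frame.
    A frame is flagged when its score reaches frame_threshold; the video is
    labelled violent when the flagged share reaches min_violent_fraction.
    """
    if not frame_scores:
        raise ValueError("frame_scores must not be empty")
    flagged = sum(1 for score in frame_scores if score >= frame_threshold)
    return flagged / len(frame_scores) >= min_violent_fraction


# Hypothetical scores for five sampled frames: three of five are flagged.
example_scores = [0.1, 0.8, 0.9, 0.7, 0.2]
print(aggregate_frame_scores(example_scores))
```

A rule of this kind uses no temporal ordering at all, which is the property the abstract contrasts with LSTM and Bi-LSTM models; other simple choices, such as averaging the scores, would fit the same description.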

URI

https://hdl.handle.net/10366/169634

ISSN

0924-669X

DOI

10.1007/s10489-026-07122-3

Publisher's version

https://doi.org/10.1007/s10489-026-07122-3
