Show simple item record

dc.contributor.author: Negre, Pablo
dc.contributor.author: Alonso Rincón, Ricardo Serafín
dc.contributor.author: Prieto Tejedor, Javier
dc.contributor.author: García García, Óscar
dc.date.accessioned: 2026-02-09T09:54:44Z
dc.date.available: 2026-02-09T09:54:44Z
dc.date.issued: 2026-02-03
dc.identifier.citation: Negre, P., Alonso, R.S., Prieto, J. et al. Video violence detection using pre-trained VGG19 combined with manual logic, LSTM layers and Bi-LSTM layers. Appl Intell 56, 72 (2026). https://doi.org/10.1007/s10489-026-07122-3
dc.identifier.issn: 0924-669X
dc.identifier.uri: http://hdl.handle.net/10366/169634
dc.description.abstract: [EN] Video violence detection using artificial intelligence plays a key role in public safety applications. Although convolutional and recurrent neural networks are widely adopted for this task, the actual contribution of temporal modeling over strong frame-level representations remains insufficiently analyzed. This work provides a systematic study of video violence detection models under a unified experimental framework. We investigate whether violence can be reliably detected from individual frames without explicit temporal modeling, evaluate the effectiveness of combining CNNs with LSTM and Bi-LSTM layers, and analyze the impact of architectural and hyperparameter choices, including neuron configuration and backbone selection (VGG-16 vs. VGG-19). Experiments are conducted on three widely used benchmark datasets. Our results show that frame-level analysis using a pre-trained VGG-19 network, combined with a simple aggregation strategy, achieves competitive performance, reaching 95% accuracy on Hockey Fights and 96% on Violent Flow. While Bi-LSTM layers can provide moderate improvements of up to 4% over standard LSTM models in certain datasets, these gains are not consistent across all scenarios. Furthermore, variations in hyperparameter configurations do not systematically lead to improved performance. Overall, this study highlights that increased architectural complexity does not always translate into better results and that, in several cases, simple frame-based approaches can rival more complex temporal models. These findings provide practical insights into the cost–benefit trade-off of temporal modeling for video-based violence detection.
dc.language.iso: eng
dc.publisher: Springer Nature
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject: Convolutional Neural Networks (CNN)
dc.subject: Video violence detection
dc.subject: Physical aggression
dc.subject: Manual feature
dc.subject: Long Short Term Memory (LSTM)
dc.title: Video violence detection using pre-trained VGG19 combined with manual logic, LSTM layers and Bi-LSTM layers
dc.type: info:eu-repo/semantics/article
dc.relation.publishversion: https://doi.org/10.1007/s10489-026-07122-3
dc.subject.unesco: 1203.04 Artificial Intelligence
dc.identifier.doi: 10.1007/s10489-026-07122-3
dc.relation.projectID: info:eu-repo/grantAgreement/EC/HORIZON/101120726
dc.rights.accessRights: info:eu-repo/semantics/openAccess
dc.identifier.essn: 1573-7497
dc.journal.title: Applied Intelligence
dc.volume.number: 56
dc.issue.number: 3
dc.type.hasVersion: info:eu-repo/semantics/publishedVersion
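The abstract mentions that frame-level VGG-19 predictions are combined with a "simple aggregation strategy" to produce a video-level decision. The record does not specify the paper's exact manual logic, so the sketch below is only an illustrative assumption: per-frame violence probabilities (from any frame classifier) are aggregated with a threshold-and-fraction rule, where both the frame threshold and the fraction of frames required are hypothetical parameters.

```python
# Illustrative sketch of frame-level aggregation, NOT the paper's exact rule.
# frame_probs: per-frame violence probabilities from a frame classifier
# (e.g. a VGG-19-based head). The 0.5 threshold and 0.2 fraction are
# assumptions chosen for the example.

def aggregate_frame_scores(frame_probs, frame_threshold=0.5, video_fraction=0.2):
    """Label a video as violent if at least `video_fraction` of its frames
    score at or above `frame_threshold`."""
    if not frame_probs:
        return False
    violent_frames = sum(1 for p in frame_probs if p >= frame_threshold)
    return violent_frames / len(frame_probs) >= video_fraction

# Example: 3 of 10 frames exceed the threshold -> 30% >= 20% -> violent.
scores = [0.1, 0.2, 0.9, 0.05, 0.8, 0.3, 0.1, 0.7, 0.2, 0.1]
print(aggregate_frame_scores(scores))  # True
```

A rule like this needs no temporal model at all, which is the point the abstract makes: strong per-frame representations plus simple aggregation can rival LSTM/Bi-LSTM pipelines on some benchmarks.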


File(s) in this item


This item appears in the following collection(s)


Attribution-NonCommercial-NoDerivatives 4.0 International
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International