A Transformer-Enhanced Human-Centric Unsupervised Framework for Multi-Person Video Anomaly Detection

Authors

  • Hiba Mohsin Abdulameer College of Computer Science and Information Technology, University of Al-Qadisiyah, Iraq
  • Ali Mohsin Al-juboori College of Computer Science and Information Technology, University of Al-Qadisiyah, Al-Qadisiyah, Iraq

DOI:

https://doi.org/10.29304/jqcsm.2026.18.22700

Keywords:

Anomaly detection, Skeleton-based analysis, HuVAD, Video surveillance

Abstract

Video anomaly detection in surveillance environments remains difficult because abnormal events are rare, take varied forms, and depend on the complex context of real scenes. Methods that rely on visual appearance are also affected by changes in lighting, camera angle, and background conditions, which can reduce detection accuracy, and they raise privacy concerns. For this reason, recent studies focus more on motion-based representations that describe human behavior and reduce the effect of irrelevant visual details.

In this work, a framework for video anomaly detection is proposed. Spatial motion features are extracted using a 2D convolutional neural network. These features are then passed to a GRU network to model motion over time. A Transformer module is also used to help capture longer temporal relationships in motion sequences.
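To illustrate how an attention step can relate distant time steps in a motion sequence, the following is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It is not the authors' implementation: the function names `softmax` and `self_attention` are hypothetical, the learned query/key/value projections of a real Transformer are replaced by identity maps, and the input is assumed to be a list of per-time-step feature vectors such as the hidden states of a recurrent encoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Single-head scaled dot-product self-attention with identity projections.

    sequence: list of T feature vectors, each a list of d floats.
    Returns (outputs, weights); weights[i][j] is how strongly step i attends to step j.
    ASSUMPTION: no learned projections, masking, or positional encoding.
    """
    d = len(sequence[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for query in sequence:
        # Similarity of this time step to every time step, near or far.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in sequence]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all time steps.
        outputs.append([sum(w * value[i] for w, value in zip(row, sequence)) for i in range(d)])
    return outputs, weights


# Three time steps; the first and last carry the same motion feature.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(states)
```

In this toy input the first step attends to the third as strongly as to itself, regardless of the distance between them, which is the property that makes attention useful for longer temporal relationships.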

The proposed framework is able to handle scenes that include more than one person. During training and testing, all detected persons are considered within fixed-length temporal windows. Information from each person is then combined to produce an anomaly score that represents the overall scene behavior. This helps the model detect abnormal activities even in crowded surveillance scenes.
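The windowing and per-person combination described above can be sketched as follows. The abstract does not state the window length, the stride, or the rule used to combine per-person information, so everything here is an assumption made for illustration: `sliding_windows` and `scene_score` are hypothetical helper names, and taking the maximum per-person score is only one plausible choice (a mean or a learned pooling would be others).

```python
def sliding_windows(frames, window, stride=1):
    """Split one person's frame sequence into fixed-length temporal windows.

    Sequences shorter than `window` yield no windows.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    return [frames[s:s + window] for s in range(0, len(frames) - window + 1, stride)]


def scene_score(person_scores, reduce=max):
    """Combine per-person anomaly scores for one window into a scene-level score.

    ASSUMPTION: the combination rule is not given in the source; max is used
    so that a single abnormal person is enough to raise the scene score.
    """
    if not person_scores:
        return 0.0  # assumption: no detected person means nothing to flag
    return reduce(person_scores)


# Hypothetical per-person scores for one window with three tracked people.
window_scores = [0.1, 0.9, 0.2]
print(scene_score(window_scores))
print(sliding_windows(list(range(5)), window=3))
```

With a max rule, a crowded scene is flagged as soon as any one person deviates strongly, at the cost of being more sensitive to a single noisy track.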

The proposed model uses an unsupervised one-class learning approach. Training is performed using normal motion data only. Abnormal events are identified by observing deviations from the learned patterns of normal behavior. The experiments were carried out on real surveillance datasets using standard evaluation metrics. The model achieved an AUC-ROC of 93% at the frame level, indicating stable and consistent performance across different cases. The integration of spatial and temporal features contributed to a more accurate representation of complex motion patterns and reduced the likelihood of confusing abnormal behavior with normal activity.
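As a rough illustration of one-class scoring and frame-level AUC-ROC, the sketch below fits a profile on normal data only, scores new samples by their deviation from it, and computes AUC-ROC from labels and scores. It is a stand-in under stated assumptions, not the authors' model: the mean-vector profile and Euclidean distance replace the learned network, and the names `fit_normal_profile`, `anomaly_score`, and `auc_roc` are hypothetical. The numbers are toy values, not results from the paper.

```python
import math


def fit_normal_profile(normal_features):
    """Per-dimension mean of feature vectors taken from normal data only."""
    n = len(normal_features)
    d = len(normal_features[0])
    return [sum(vec[i] for vec in normal_features) / n for i in range(d)]


def anomaly_score(feature, profile):
    """Euclidean distance from the normal profile (ASSUMPTION: a simple
    stand-in for the deviation measured by a learned model)."""
    return math.sqrt(sum((x - m) ** 2 for x, m in zip(feature, profile)))


def auc_roc(labels, scores):
    """AUC-ROC as the probability that a random abnormal frame (label 1)
    scores higher than a random normal frame (label 0); ties count as 0.5."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("both normal and abnormal frames are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy frame-level labels and scores, for illustration only.
frame_labels = [0, 0, 1, 1]
frame_scores = [0.1, 0.4, 0.35, 0.8]
print(auc_roc(frame_labels, frame_scores))  # 0.75
```

Because AUC-ROC depends only on the ranking of scores, it does not require choosing a decision threshold, which suits a setting where no abnormal data is available during training.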

Published

2026-06-28

How to Cite

Mohsin Abdulameer, H., & Mohsin Al-juboori, A. (2026). A Transformer-Enhanced Human-Centric Unsupervised Framework for Multi-Person Video Anomaly Detection. Journal of Al-Qadisiyah for Computer Science and Mathematics, 18(2), Comp 294–305. https://doi.org/10.29304/jqcsm.2026.18.22700

Section

Computer Articles