A Robust Statistical Framework for Outlier Detection and Its Influence on Predictive Modeling Accuracy

Authors

  • Hadeel Kamil Habeeb, Faculty of Nursing, University of Al-Qadisiya, Al-Qadisiya, Iraq.
  • Faten Hatem Hassan, Presidency of Al-Qadisiyah University, Department of Studies and Planning, Statistics Division, Iraq.

DOI:

https://doi.org/10.29304/jqcsm.2025.17.32424

Keywords:

Outlier detection

Abstract

Outliers, defined as observations that deviate substantially from the majority of the data, pose a serious challenge to predictive modeling: they distort estimation, inflate variance, and reduce model reliability. Although numerous statistical and machine learning approaches to outlier detection have been proposed, their direct influence on prediction accuracy across real-world domains has received limited attention. This study develops a robust statistical framework that integrates univariate, multivariate, and machine learning–based detection methods with confirmatory regression diagnostics and a bootstrap-driven model selection strategy. Candidate anomalies are first identified through histogram- and IQR-based screening, k-nearest-neighbor (kNN) and local outlier factor (LOF) density–proximity measures, and isolation forest and one-class SVM classifiers. They are then statistically validated using standardized residuals and Cook’s distance, while robustness is reinforced through MM-estimation and bounded loss functions. Evaluation uses both synthetic contamination experiments and real datasets from finance, healthcare, and marketing, comparing models trained with and without the detected outliers across classifiers such as SVM, logistic regression, kNN, random forest, and AdaBoost. The results demonstrate that excluding or down-weighting outliers consistently enhances predictive accuracy and stability, particularly in settings with heavy-tailed errors and heterogeneous distributions. The proposed framework provides a practical and statistically principled approach to improving model fidelity and is applicable across diverse domains where reliable prediction is essential.
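
The two-stage idea described in the abstract (screen for candidates, then confirm them with regression diagnostics) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple one-predictor linear regression, uses only IQR screening for the first stage and Cook's distance for the second, and adopts the common 4/n rule of thumb as the confirmation threshold. The function names, the threshold, and the toy data are all hypothetical choices made for illustration.

```python
import statistics


def iqr_flags(values, k=1.5):
    """Flag values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v < lo or v > hi for v in values]


def cooks_distance(x, y):
    """Cook's distance of each point in the least-squares fit y ~ a + b*x."""
    n, p = len(x), 2  # p = number of fitted coefficients (intercept, slope)
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - p)
    # Leverage of each point in simple linear regression.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * mse)) * h / (1 - h) ** 2
        for e, h in zip(resid, leverage)
    ]


def confirmed_outliers(x, y, k=1.5):
    """Indices flagged by IQR screening on y and confirmed by Cook's distance."""
    flags = iqr_flags(y, k)
    distances = cooks_distance(x, y)
    threshold = 4 / len(x)  # assumed rule of thumb, not taken from the paper
    return [
        i for i, (flag, d) in enumerate(zip(flags, distances))
        if flag and d > threshold
    ]


# Toy data (hypothetical): a roughly linear trend with one contaminated response.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 60.0]
print(confirmed_outliers(x, y))  # prints [9]
```

Requiring both stages to agree keeps the screening step cheap while letting the regression diagnostic decide whether a flagged point actually influences the fit. A fuller version would add the multivariate and machine learning detectors named above as further candidate sources.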

Published

2025-09-30

How to Cite

Kamil Habeeb, H., & Hatem Hassan, F. (2025). A Robust Statistical Framework for Outlier Detection and Its Influence on Predictive Modeling Accuracy. Journal of Al-Qadisiyah for Computer Science and Mathematics, 17(3), Static 17–40. https://doi.org/10.29304/jqcsm.2025.17.32424

Section

Statistic Articles