Improving Medical Diagnosis Using Missing Data Treatment Techniques: A Case Study on Thyroid Data

Authors

  • Sajjad Basim Abdulyasser, Wasit, 52001, Iraq

DOI:

https://doi.org/10.29304/jqcsm.2025.17.11974

Keywords:

Expectation maximization (EM), Multiple imputation by chained equations (MICE), Missing data, Classification model, Thyroid diseases, ML modeling, Random forests.

Abstract

In this paper, we focus on advanced imputation techniques for handling missing data in thyroid medical datasets, such as expectation maximization (EM) and multiple imputation by chained equations (MICE). We demonstrate that EM and MICE have a significant positive impact on prediction accuracy compared to their traditional counterparts (i.e., direct and ensemble imputation). Moreover, EM significantly enhanced predictions derived from random forests, in a manner consistent with previous findings of additive predictive power for clinical variables alone, confirming the promise of EM (and MICE) to deliver a fundamental improvement when modeling complex medical problems, as in comprehensive analyses of phenotypes comparable to heart failure. In conclusion, the results of the present study indicate that advanced imputation methods increase the accuracy of diagnostic and therapeutic predictions in patients with thyroid diseases.




Published

2025-03-30

How to Cite

Abdulyasser, S. B. (2025). Improving Medical Diagnosis Using Missing Data Treatment Techniques: A Case Study on Thyroid Data. Journal of Al-Qadisiyah for Computer Science and Mathematics, 17(1), Comp. 192–201. https://doi.org/10.29304/jqcsm.2025.17.11974

Issue

Vol. 17 No. 1 (2025)

Section

Computer Articles