Adaptive Feature Selection Technique for Efficient Heart Disease Prediction

Heart disease is a common cause of death and is difficult to detect manually. More efficient classification models that rely on machine learning methods to achieve higher classification accuracy have attracted the attention of researchers seeking to design effective prediction models. Moreover, such models play an important role in practical medical cardiology, with the aim of detecting heart disease early. In this paper


Introduction
According to the European Society of Cardiology, approximately 26 million people suffer from heart disease, which is considered a serious and life-threatening condition [1]. As a result, designing effective and efficient techniques for the accurate and early detection of this disease has become one of the most important research topics, since it can save the lives of many patients [2]; manual diagnosis faces many difficulties, being time-consuming and prone to inaccurate results.
Moreover, current research aims to increase the performance of automatic classification techniques so that they meet the sensitivity of medical applications to both time and accuracy. Many factors affect the performance of machine learning-based classification models; among them are the characteristics of the dataset: its accuracy and class balance, its size and dimensionality, its sparsity, and the importance and relevance of its features. Feature selection is therefore an important and essential step for improving the performance of a classification model, both by decreasing complexity (faster convergence) and by increasing accuracy (avoiding overfitting) [7,8].
Feature selection methods fall into three types, classified according to how they select the relevant features that maximize classification accuracy: filter (rank), wrapper, and embedded methods. Each type has advantages and disadvantages; therefore, choosing a suitable selection method depends on the dataset's characteristics and on the application of the implemented model [9].
Filter-based feature selection methods select features by ranking them according to their importance, relevance, and correlation with the target class, independently of the classifier algorithm. The ranking criteria used to evaluate features are: information measures (using information theory concepts), as in the mutual information feature selection method; statistical measures, such as correlation and Chi-square based selection; similarity measures, as in Spectral feature selection (SPEC) and the Laplacian Score (LS); and distance measures, as in the Relief method [10].
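A minimal sketch of the filter-based ranking idea just described, using scikit-learn's mutual information scorer on synthetic data (the dataset and parameter choices here are illustrative assumptions, not the paper's setup): each feature is scored against the target class with no classifier involved.

```python
# Score each feature by its mutual information with the class label,
# independently of any classifier, then rank features by that score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, random_state=0)

scores = mutual_info_classif(X, y, random_state=0)
# Rank features from most to least relevant to the target class.
ranking = np.argsort(scores)[::-1]
print("feature ranking:", ranking)
```

Any classifier can then be trained on the top-ranked features; the ranking itself never changes with the classifier, which is what distinguishes filter methods from wrappers.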
Wrapper-based feature selection methods select a subset of features by employing the classifier itself to choose the features with the highest estimated value, i.e., the features that score the highest classification accuracy. Support Vector Machine Recursive Feature Elimination (SVM-RFE) is the most common wrapper method. Filter-based methods are less complex and more efficient than wrapper-based methods, while wrapper-based methods achieve higher accuracy than filter-based methods [10].
All feature selection methods require a common input parameter: the required number of selected features (the size of the selected feature subset), and this parameter changes according to the dataset and the classifier. On the other hand, it is not reasonable to try every possible number of features, especially for datasets with many dimensions. In addition, some datasets do not need feature selection at all, since the classifier underfits [10].
In this context, this paper aims to produce, adaptively, the best number of selected features for any dataset, any feature selection method, and any classifier, with the goal of finding the best classifier model with the smallest useful number of features for two heart disease datasets. More generally, this work aims to develop an efficient heart disease classification framework with low complexity and high accuracy.
To achieve and investigate these aims, the work follows a set of objectives, from preprocessing the datasets to selecting and saving the best classifier model, which predicts whether a new test instance has heart disease or not. The rest of this paper is organized as follows. First, the related work is described in Section 2. Section 3 describes the methodology of the work. Section 4 discusses the results and analysis of the proposed work. Finally, the conclusions and future work are presented in Section 5.

Related works
Recently, many techniques based on machine learning methods have been proposed for designing heart disease prediction systems, all aiming to increase classification accuracy and efficiency, since medical applications are sensitive to both time and accuracy. Most of these techniques develop feature selection methods in an attempt to achieve more efficiency and accuracy.
A heart disease diagnosis system was proposed in [1] using six classification methods: K-nearest neighbor (KNN), artificial neural network (ANN), decision tree (DT), support vector machine (SVM), logistic regression (LR), and Naïve Bayes (NB), together with four feature selection methods: minimal redundancy maximal relevance (MRMR), Relief, local learning, and the least absolute shrinkage and selection operator (LASSO), with the aim of reducing time complexity and achieving an efficient and more accurate classification system. In that work, the authors also proposed a new feature selection method, called the fast conditional mutual information feature selection algorithm (FCMIM), and produced a comparative, analytical study of all the classification and feature selection methods. As a result, FCMIM with SVM achieved the highest accuracy, about 92%, with seven features selected from the original Cleveland heart disease dataset.
A comparative study analyzing different machine learning-based heart disease classification techniques was presented in [11], where the machine learning methods DT, LR, SVM, NB, and ANN, as well as a hybrid model combining LR and BN, were used. The results of that study show that the hybrid model with selected features achieves the highest classification accuracy, about 87.41%, compared with the other classifiers used.
Seven machine learning methods for heart disease classification (support vector machine (SVM), k-NN, Naive Bayes, decision tree, logistic regression (LR), neural network, and Vote) were employed in [8] to identify the best feature subset and the best classifier for predicting cardiac disease. That research achieved a precision of about 91.4%, and the authors concluded that using relevant features, together with a strong analysis of the data, has a high impact on obtaining high classification precision.
Another study, proving the impact of preprocessing, analyzing, understanding, and improving the quality of a heart disease dataset on the performance of classification techniques, was produced in [9]. The performance of several classification methods (K-nearest neighbor, logistic regression, Gaussian naive Bayes, decision tree, random forest, and support vector machine) was examined on the Cleveland heart disease dataset, with features selected using ten feature selection methods: recursive feature elimination (RFE), forward feature selection, ReliefF, Lasso regression, Ridge regression, ANOVA, Chi-square, mutual information, backward feature selection, and exhaustive feature selection. The outcomes show that the highest accuracy, 88.52%, was obtained by the decision tree classifier trained on a subset of features selected by the backward feature selection method.
Although all the related research above produced beneficial feature selection techniques, it has some drawbacks: some of the techniques do not achieve the required accuracy, and none of them identified the optimal number of features without trial and error. In this paper, an adaptive technique is used to identify the optimal relevant features for any dataset and classifier, producing efficient and more accurate classification performance.

Material and Methods
The theoretical background of the methods and materials used in this research is explained in this section.

Datasets Description
A brief description of the two heart disease datasets used, together with where they are available, follows:
• The first, the Cleveland heart disease database, contains 303 instances and 76 attributes, but only 14 of them, including the predicted attribute, are used as standard in most studies. It is available at (https://archive.ics.uci.edu/ml/datasets/heart+disease) [12].
• The second, the heart Statlog Cleveland Hungary database, contains 1190 instances and 11 attributes.

Datasets preprocessing
Data preprocessing is a significant step for acquiring good-quality data, which affects the performance of the overall model. The two datasets used here require scaling with the standard normal distribution (SND). As a result, the transformed, scaled data is distributed with zero mean and unit variance.
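The scaling step above can be sketched with scikit-learn's StandardScaler (the numeric values below are illustrative placeholders, not rows from the actual datasets): each feature column is transformed to zero mean and unit variance.

```python
# Standard-normal scaling: subtract each column's mean, divide by its
# standard deviation, so every feature has zero mean and unit variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[63., 145., 233.],
              [67., 160., 286.],
              [41., 130., 204.]])  # illustrative values only

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0))  # each column mean is ~0
print(X_scaled.std(axis=0))   # each column std is ~1
```

In practice the scaler is fit on the training split only and then applied to the test split, so no test-set statistics leak into training.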

Mutual Information-based Feature Selection Method (MI-FS)
The mutual information (MI) measurement is one of the concepts of information theory [13]. For two random variables X and Y, MI measures the relevance between them, i.e., the amount of information about X contained in Y and vice versa.

Definition 3.1 (Entropy). Entropy measures the average level of uncertainty, or information, of the possible outcomes of a random variable. For a random variable X, if p(x) is the probability of x, then the entropy H(X) is given in equation (1):

H(X) = -Σ_{x∈X} p(x) log p(x)    (1)

For two variables X and Y with joint probability p(x, y), the joint entropy H(X, Y) is given in equation (2):

H(X, Y) = -Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y)    (2)

On the other hand, the entropy of a variable conditioned on another variable is denoted H(X|Y), as in equation (3):

H(X|Y) = -Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x|y)    (3)

Definition 3.2 (Mutual Information). If the joint probability of the variables X and Y is p(x, y), and p(x) and p(y) are the marginal probability density functions of X and Y respectively, then the MI between X and Y is I(X; Y), as in equation (4):

I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]    (4)

i.e., MI is the relative entropy between p(x, y) and the product p(x)p(y).

Definition 3.3 (Conditional Mutual Information). The conditional MI between the two random variables X and Y given Z is defined in equation (5):

I(X; Y | Z) = Σ_{z∈Z} Σ_{x∈X} Σ_{y∈Y} p(x, y, z) log [ p(x, y|z) / (p(x|z) p(y|z)) ]    (5)

So, mutual information can be written in terms of entropy, as in equation (6):

I(X; Y) = H(X) - H(X|Y)    (6)

Relevancy and redundancy are the two notions on which filter-based feature selection methods depend: the method aims to select the features that are most relevant to the target class and to remove redundant features (features that depend on, or correlate with, each other). According to equations (4) and (6), mutual information can measure the relevancy between a feature f_i and the class C. If I(f_i; C) = 0, then by equation (6) H(f_i) = H(f_i|C), so f_i is independent of the target class C and carries no information relevant to classification. If I(f_i; C) > 0, then f_i is a relevant feature, and its classification information increases as the MI increases.
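The entropy and MI definitions above can be checked numerically. The sketch below verifies the relation I(X;Y) = H(X) - H(X|Y) of equation (6) against the direct sum of equation (4), using a small joint distribution whose probability values are chosen purely for illustration:

```python
# Numeric check of I(X;Y) = H(X) - H(X|Y) for a discrete joint distribution.
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a probability vector, equation (1)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_info(p_xy):
    """I(X;Y) computed directly from equation (4)."""
    px = p_xy.sum(axis=1)
    py = p_xy.sum(axis=0)
    mi = 0.0
    for i in range(p_xy.shape[0]):
        for j in range(p_xy.shape[1]):
            if p_xy[i, j] > 0:
                mi += p_xy[i, j] * np.log2(p_xy[i, j] / (px[i] * py[j]))
    return mi

# A dependent joint distribution, so the MI should be strictly positive.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

h_x = entropy(p_xy.sum(axis=1))   # H(X)
h_joint = entropy(p_xy.ravel())   # H(X,Y)
h_y = entropy(p_xy.sum(axis=0))   # H(Y)
h_x_given_y = h_joint - h_y       # H(X|Y) = H(X,Y) - H(Y)

print("I(X;Y)        =", mutual_info(p_xy))
print("H(X) - H(X|Y) =", h_x - h_x_given_y)
```

Both quantities agree, and they drop to zero if the joint distribution is replaced by an independent one such as all entries 0.25, matching the relevancy criterion I(f_i; C) = 0 discussed above.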

Recursive Feature Elimination-based Feature Selection Method (RFE-FS)
This method is based on repeatedly eliminating the least important features until the required number of selected features is obtained. At each iteration, the model is trained and the less important features are removed; the significance of the features is calculated from the weights of the algorithm at each iteration [10].
Initially, two parameters are required: the number of features to select and the estimator to be trained. The estimator is first trained on all the features in the original data to determine the significance of each feature; then the less significant features are eliminated until the number of features set in the initial step is reached [9].
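The two required inputs described above (the estimator and the target subset size) map directly onto scikit-learn's RFE class; the sketch below runs it on synthetic data, which is an assumption standing in for the heart disease datasets.

```python
# RFE: train an estimator, drop the least important features, repeat
# until the requested number of features remains.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# The two required inputs: the estimator and the number of features to keep.
selector = RFE(estimator=LogisticRegression(max_iter=1000),
               n_features_to_select=4)
selector.fit(X, y)
print("selected mask:", selector.support_)     # True for kept features
print("elimination ranking:", selector.ranking_)  # 1 = kept
```

Feature importance here comes from the logistic regression coefficients; with a tree-based estimator, RFE would instead use impurity-based importances, which is why the selected subset can vary with the classifier.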

Classifier-based machine learning methods
Four machine learning methods are used in this work: logistic regression (LR), decision tree (DT), random forest (RF), and support vector machine (SVM); detailed explanations of these methods can be found in references [3,5,6]. The parameters of each classifier used in this work are given in Table (1).
Table 1 - Parameter settings for each machine learning model.

Logistic regression
Penalty = l2; C (inverse of regularization strength) = 1; algorithm used in the optimization problem = 'lbfgs'

Decision tree
The function to measure the quality of a split = Gini; the strategy used to choose the split at each node = best

Random forest
The number of trees = 50; the function to measure the quality of a split = Gini

Support vector machine
Regularization parameter = 20; kernel = rbf; gamma (kernel coefficient) = 1

The rest of the parameters are set to the defaults of Python's built-in functions.
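The Table 1 settings can be expressed as scikit-learn constructors as follows; this is a sketch assuming the scikit-learn library (consistent with the note that remaining parameters use Python's defaults):

```python
# The four classifiers with the non-default parameters from Table 1;
# every other parameter keeps the scikit-learn default.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

models = {
    "LR": LogisticRegression(penalty="l2", C=1.0, solver="lbfgs"),
    "DT": DecisionTreeClassifier(criterion="gini", splitter="best"),
    "RF": RandomForestClassifier(n_estimators=50, criterion="gini"),
    "SVM": SVC(C=20, kernel="rbf", gamma=1),
}
```

Keeping the models in a dictionary like this makes it straightforward to loop over all four classifiers when evaluating each candidate feature subset.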

General Framework of Proposed Heart Disease Prediction Methodology
The adaptive best-feature-number selection and the general steps of this work are explained in this section, as shown in Figure 1.

Adaptively Getting the Best Number of Features
To avoid manually entering the number of features to be selected by the feature selection methods, an adaptive, automatic method for choosing the number of features that maximizes the evaluation metrics of the classification model is produced, as in Algorithm (1). This function tries every possible number of features as input to the feature selection method and then trains the model on each extracted feature subset, employing the k-fold method for dataset splitting. Finally, the model evaluation metrics are computed for each subset, and the number of selected features that achieves the highest evaluation metrics, in terms of accuracy, recall, precision, and f1-score, is returned. As a result, a new subset with the optimal number of features is constructed from the original training and testing dataset.
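The search described above can be sketched as follows, under the assumption that subsets are scored by k-fold cross-validated accuracy and that MI ranking provides the subset at each size (the function name `best_num_features` and the synthetic data are illustrative, not from the paper):

```python
# Adaptive search: for every possible subset size k, select the top-k
# features, cross-validate the model on them, and keep the best k.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

def best_num_features(X, y, model, k_folds=10):
    best_k, best_score = 1, -np.inf
    for k in range(1, X.shape[1] + 1):
        X_k = SelectKBest(mutual_info_classif, k=k).fit_transform(X, y)
        score = cross_val_score(model, X_k, y, cv=k_folds,
                                scoring="accuracy").mean()
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=3, random_state=0)
k, acc = best_num_features(X, y, LogisticRegression(max_iter=1000))
print(f"best number of features: {k}, CV accuracy: {acc:.3f}")
```

For an RFE variant, `SelectKBest` would be swapped for `RFE(estimator, n_features_to_select=k)`; the outer loop and scoring stay the same.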

The Model Cross-Validation Evaluation
This function enhances the performance of the model by controlling and avoiding overfitting. Utilizing the k-fold method with k=10, the function is designed as in Algorithm (2). The algorithm takes the newly constructed dataset and a model as input and returns the average of the evaluation metrics (accuracy, recall, precision, and f1-score), the evaluation metrics at each fold, and the training and testing sets at the fold with the maximum evaluation metrics.
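Algorithm (2) can be sketched as below, under the stated assumptions of k=10 and the four metrics; the helper name `cross_validate_model` and the synthetic data are illustrative, not the paper's implementation:

```python
# 10-fold CV evaluation: collect per-fold metrics, return their average
# and the train/test indices of the fold with the highest accuracy.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import StratifiedKFold

def cross_validate_model(X, y, model, k=10):
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    per_fold, best = [], (-1.0, None, None)
    for train_idx, test_idx in skf.split(X, y):
        m = clone(model).fit(X[train_idx], y[train_idx])
        pred = m.predict(X[test_idx])
        metrics = (accuracy_score(y[test_idx], pred),
                   precision_score(y[test_idx], pred),
                   recall_score(y[test_idx], pred),
                   f1_score(y[test_idx], pred))
        per_fold.append(metrics)
        if metrics[0] > best[0]:  # track the fold with maximum accuracy
            best = (metrics[0], train_idx, test_idx)
    return np.mean(per_fold, axis=0), per_fold, best

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
avg, folds, best = cross_validate_model(X, y,
                                        LogisticRegression(max_iter=1000))
print("average (acc, prec, rec, f1):", avg)
```

Returning the best fold's indices lets the framework later retrain and save the model on exactly the split where it performed best, as the algorithm's output specification requires.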

Input: the heart disease dataset feature columns as X, the target class label column as y, and a classifier model.
Output: the average of the evaluation metrics, the evaluation metrics at each fold, and the training and testing sets at the fold with the maximum evaluation metrics.

Proposed Heart Disease Prediction Technique
The procedure used to find the best model can be summarized in the pseudocode steps below. Finally, the saved best model can be used as a prediction algorithm to predict whether a new patient sample has heart disease or not.

Experimental results
The proposed classification technique with the adaptive feature selection algorithm, using the MI and RFE methods, has been implemented on two standard heart disease databases. The results of this implementation using four classification methods are illustrated in the tables below.
The results in Tables (2) and (4) are obtained by taking the maximum evaluation metrics after employing k-fold cross-validation, with k=10, on the first and second datasets respectively, while the results in Tables (3) and (5) are obtained by averaging the evaluation metrics.
Tables (2) and (3) show the number of features selected from the original first Cleveland heart disease database, containing 303 instances and 13 attributes, as well as the performance of each classifier in terms of accuracy, f1-score, precision, and recall. As is clear from the results in Table (2), the mutual information method returns the significance information corresponding to each feature, while the number of selected features with high importance varies with the classifier used. Therefore, the proposed adaptive algorithm determines, for each classifier, the appropriate number of features that achieves the highest classification accuracy.
The highest accuracy, about 96.7%, is obtained using SVM with 11 features. Although the remaining classifiers reached their best accuracy with fewer selected features, that accuracy was lower than the accuracy achieved by SVM.
In medical applications, the importance of accuracy has a superior priority since it concerns patient life.
The reason behind these results is the impact of dataset size on classifier performance: DT and RF are affected by the size of the dataset more than SVM. The accuracy of RF therefore decreases at a higher rate than that of SVM as the dataset shrinks, because the RF mechanism requires a large dataset, whereas in SVM the hyperplane depends only on the support vectors, so if the dataset contains the required support vectors, the impact of dataset size is irrelevant. These results agree with the conclusions of (Althnian et al., 2021).
Comparing MI and RFE, for all classifiers the classification accuracy achieved using RFE is less than the accuracy achieved using MI. For this database, therefore, the MI method is more suitable than RFE: the best accuracy achieved by the classifiers using RFE was 93.5%, a level that MI reached with DT and only three selected features.
Tables (4) and (5) show the number of features selected from the original second heart Statlog Cleveland Hungary database, which contains 1190 instances and 11 attributes, as well as the performance of each classifier in terms of accuracy, f1-score, precision, and recall.
Table 4 - Maximum evaluation measures at a specific k-fold cross-validation corresponding to the number of selected features and different classifiers implemented on the second dataset.

Table 5 - Average evaluation measures of k-fold cross-validation corresponding to the number of selected features and different classifiers implemented on the second dataset.
Due to the increased dataset size, the classification accuracy of DT and RF has improved, as shown in Tables (4) and (5), where the highest accuracy of 97.4% is obtained using RF with features selected by MI as well as by RFE. Comparing the results in Tables (3) and (4) shows the effect of the larger database on the performance of RF and DT, whose accuracy clearly increases, whereas the performance of SVM and LR decreases (with a smaller number of selected features) as the dataset dimensionality changes, since these classifiers do well with high-dimensional data, as is clear in Figures (2)-(5).

With this dataset, there is no obvious difference in accuracy among the classifiers whether the features are selected by MI or RFE, while the number of features selected by RFE is smaller than the number selected by MI at the same accuracy. This indicates that, since the RFE mechanism relies on the classifier's own criteria for weighting features, it suits classifiers that do not depend on information theory in making their decisions, such as LR and SVM.
All the above results are visualized in Figures 2-5, which show the behavior of all classifier models across all numbers of features and all folds of the k-fold cross-validation method. Although the MI-based feature selection method, being a filter-based method, works independently of the classifier, the choice of the number of selected features still depends on the classifier model; a random thresholding approach is therefore not always proper, especially when the significance information of the features is close together, or when the classifier does well with the full dimensionality of the dataset.
In light of all the above results, the best model for each dataset is nominated in Table (6), with a brief explanation of why each model was chosen as best. The results in Table (6) show that the best model, chosen to be saved as a predictor, is selected based on the highest accuracy, the smaller number of selected features, and the complexity of the feature selection method. The first priority is accuracy, since medical applications are more sensitive to accuracy than to other factors; then, when two results are equal in classification accuracy, the model with the smaller number of selected features is chosen. Finally, if two results are equal in both accuracy and the number of selected features, the model with the less expensive feature selection method is chosen.
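The three-level tie-break just described can be expressed as a lexicographic key. The sketch below is illustrative: the candidate tuples reuse two results reported in the text (SVM/MI/11/96.7% and DT/MI/3/93.5%) plus a hypothetical RFE entry, and the cost table simply encodes that a filter method (MI) is cheaper than a wrapper (RFE).

```python
# Rank candidates by: accuracy first, then fewer features,
# then the cheaper feature selection method.
fs_cost = {"MI": 0, "RFE": 1}  # filter methods are cheaper than wrappers

# (classifier, feature selection method, number of features, accuracy)
candidates = [
    ("SVM", "MI", 11, 0.967),
    ("DT", "MI", 3, 0.935),
    ("DT", "RFE", 3, 0.935),   # hypothetical tie on accuracy and size
]

best = max(candidates,
           key=lambda c: (c[3], -c[2], -fs_cost[c[1]]))
print("best model:", best)  # SVM wins on accuracy alone
```

Between the two tied DT entries, the key prefers the MI variant, mirroring the "less expensive feature selection method" rule; SVM still wins overall because accuracy dominates the ordering.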

Comparison Study of proposed Classification Framework with Previous Works
The performance of the proposed technique is compared, in terms of accuracy, with previously existing heart disease classification techniques. This comparison study is illustrated in Table (7). Furthermore, the best candidate method for each dataset is identified so that it can be saved and incorporated into practical medical applications in health organizations.

Conclusion
In this work, an adaptive feature selection technique based on the mutual information (MI) and recursive feature elimination (RFE) methods has been produced to design a complete heart disease detection system. The proposed system is implemented using four machine learning methods: support vector machine (SVM), logistic regression (LR), decision tree (DT), and random forest (RF). The proposed adaptive algorithm chooses the optimal number of selected features instead of thresholding or selecting randomly. Two standard heart datasets from the UCI repository are used to train the model: the first, Cleveland heart disease, and the second, Statlog Cleveland Hungary. The results show that the highest accuracy for the first database, about 96.7%, is achieved by SVM with 11 features selected by the mutual information-based feature selection method, and for the second database, about 97.4%, by RF with 10 features selected by MI. The outcomes of this work indicate that for each dataset, each classifier, and each feature selection method used, there is an optimal number of features in terms of classification accuracy. On the other hand, for each dataset a best candidate model, with a smaller number of features and acceptable classification accuracy, has been saved to achieve model efficiency: an accuracy of 93.5% was achieved by DT with only 3 features selected by MI for the first dataset, and 93.2% by SVM with only 9 features selected by MI for the second dataset. In the future, we aim to choose a model with a smaller number of features and improve it to achieve higher accuracy, ensuring the efficiency required for fast and light prediction models that can be used in medical applications.

Figures 2 and 3 describe the performance of the four machine learning methods implemented on the first heart disease database, in terms of f1-score, with different numbers of features selected using the MI and RFE-based feature selection methods and the proposed adaptive algorithm.

Fig. 5: F1-score vs. number of features for different classifiers with the RFE method applied to the second heart disease database.

The research objectives are summarized in the following steps:
1- The two heart disease datasets are preprocessed to implement the proposed work.
2- A function is built that returns the best number of features by utilizing the two feature selection methods, RFE and mutual information, and the four machine learning classifiers: logistic regression, support vector machine, decision tree, and random forest.
3- A new dataset including the k best features is constructed, and a function for model cross-validation evaluation is created by employing a k-fold algorithm. This function takes the new dataset and the different classifier models as input and returns the model evaluation metrics for each model, with the indices of the train and test sets at the fold where the model achieves its highest performance.
4- The best classifier model for each dataset is selected and saved for use in heart disease prediction, to predict whether a new test instance has heart disease or not.