Diagnosis of Chronic Kidney Diseases Using the Naive Bayes Algorithm

Chronic kidney disease (CKD) develops gradually, usually over months or years, as the kidneys lose function. In general, it may not be detected until the kidneys have lost 25% of their function. Patients may not recognize kidney failure at first because it may produce no symptoms in its early stages. Treatment for kidney failure aims to control the causes and slow the progression of the disease. If treatment is insufficient, the patient reaches end-stage kidney failure, at which point the only remaining treatments are dialysis or a kidney transplant. Early diagnosis is therefore necessary to avoid reaching this stage. We conclude in this paper that the Naive Bayes algorithm is one of the best algorithms for diagnosing the disease, achieving a high accuracy of 99.24% in approximately 0.003 seconds, because it is well suited to this kind of dataset.


Introduction
The kidneys are vital organs for the proper functioning of the human body. Their function is to filter blood, remove waste products, control fluid balance in the body, and form urine. Chronic kidney disease (CKD) is a condition in which kidney function is altered and the ability of the kidneys to work properly decreases, leading to an increase in the amount of waste products in the blood that makes the human body sick in the long term [1]. People with high blood pressure or diabetes, and those who have family members with chronic kidney disease, are at greater risk of developing kidney disease. The purpose of medical diagnosis is to mine useful information from the massive medical datasets that are accumulated frequently [2].
Data mining and machine learning can be used as informative tools to extract useful information that helps pathologists and doctors make important decisions [3]. Machine learning researchers create algorithms that can improve the solution to a problem involving huge amounts of data, such as medical data. Moreover, the more relevant problem-related data is available, the more accurate the solution becomes [4]. Data mining, which is a branch of machine learning and artificial intelligence, has evolved to such an extent that it can now be used in a variety of areas, including risk assessment, industrial process control, healthcare, insurance, financial reporting, forecasting of expense payments, and business, among many other fields [5].
The Naïve Bayes model has several advantages: it is a relatively simple algorithm to understand and build; it predicts classes faster than many other classification algorithms; and it can be trained easily on a small dataset, which makes it suitable for the kind of dataset used in this paper [4].
In one study, Saudi medical records were investigated for the first time in the process of diagnosing CKD using machine learning techniques. The authors used the correlation coefficient and recursive feature elimination for feature selection. Then, four classification algorithms were explored, namely ANN, SVM, Naïve Bayes, and k-NN. The performance of each classifier was examined by the classification accuracy, precision, recall, and f-measure it achieved. ANN, SVM, and NB all achieved an accuracy of 98%, while k-NN achieved an accuracy of 93.9%.
Padmanaban, K. A., et al., 2020 [5]: Chronic kidney disease was analyzed and predicted using different classifiers: Naïve Bayes, SVM, KNN, and decision tree. The WEKA tool was used to compare the performance of these classifiers. The results show that the decision tree algorithm provides the highest accuracy, 98%. The second most accurate classifier is SVM, with an accuracy of 97.75%. It is also observed that applying a ranking algorithm improves the performance of CKD prediction, provided that the correct number of attributes is selected. Major factors such as age, RBC, and blood pressure were considered for classification. Other parameters, such as nutrition, accommodation status, clean water availability, and surroundings, could also be considered for the detection of CKD. In the future, the performance of other classifiers, such as ANN and fuzzy logic, can be compared using the WEKA tool for a similar situation and dataset.
Deepika, B., et al., 2020 [6]: This project is a medical sector application that helps medical practitioners predict CKD based on CKD parameters. It automates the prediction of CKD and identifies the disease and its stages effectively and economically. This is accomplished with the KNN algorithm, which reached 97% accuracy, and the Naive Bayes classification algorithm, which reached 91% accuracy. This classification technique falls under data mining. The algorithm takes CKD parameters as input and predicts the disease based on data from previous CKD patients.

Data Mining Applications in Healthcare
The huge amount of data available in the health sector, and the need to extract knowledge from it, make data mining techniques the most effective solution for processing such a quantity of data [3]. Data mining is the process of analyzing and summarizing data into useful information that can be used to increase revenue, reduce costs, or both. It is the process of finding relationships or patterns among dozens of fields in large relational databases. Data mining consists of the following main components: information extraction, data storage and management, access provision, data analysis, and presentation of the data in a useful format [8].

Naïve Bayes
The Naive Bayes classifier is a powerful algorithm for classification tasks. Even when working on a data set with millions of records and some attributes, the Naïve Bayes approach is a good choice [3]. In Naïve Bayes, the probability of each target class is calculated for an instance, and the instance is assigned to the target class with the highest probability [8].
Bayes' theorem relates the posterior probability to the prior probability. The prior probability is the original probability of an event or hypothesis, obtained before any additional information is available. The revised probability of the event, obtained by using additional information or evidence, is known as the posterior probability [3].
The theorem is written as the following equation [9]:

$$P(C_i \mid A) = \frac{P(A \mid C_i)\, P(C_i)}{P(A)}$$

Where:
The prior probability of $A$ is $P(A)$.
The prior probability of $C_i$ is $P(C_i)$.
The conditional probability of $A$ given $C_i$ (the likelihood) is $P(A \mid C_i)$.
The posterior probability of $C_i$ given $A$ is $P(C_i \mid A)$.

The Naive Bayes classifier is a simple and convenient probabilistic classifier based on applying Bayes' theorem. Naive Bayes treats each attribute as an independent variable [10].
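To make the classification rule concrete, the following is a minimal sketch of a Gaussian Naive Bayes classifier written from scratch. It applies Bayes' theorem, P(Ci | A) = P(A | Ci) P(Ci) / P(A), under the independence assumption, and assigns an instance to the class with the highest posterior. The Gaussian likelihood, the function names, and the toy values are assumptions made for illustration; the paper does not state which Naive Bayes variant or implementation was used.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate the class prior and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    model = {}
    for label, members in grouped.items():
        prior = len(members) / len(rows)  # P(Ci)
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            # A small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (prior, stats)
    return model


def log_gaussian(x, mean, var):
    """Log density of a normal distribution, used as the per-feature likelihood."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def posterior(model, row):
    """Return P(Ci | A) for every class, normalized so the values sum to 1."""
    log_joint = {}
    for label, (prior, stats) in model.items():
        # Independence assumption: per-feature log likelihoods are summed.
        log_likelihood = sum(log_gaussian(x, m, v) for x, (m, v) in zip(row, stats))
        log_joint[label] = math.log(prior) + log_likelihood
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_joint.values())
    unnormalized = {label: math.exp(v - top) for label, v in log_joint.items()}
    evidence = sum(unnormalized.values())  # plays the role of P(A)
    return {label: v / evidence for label, v in unnormalized.items()}


def predict(model, row):
    """Pick the class with the highest posterior probability."""
    probs = posterior(model, row)
    return max(probs, key=probs.get)


# Hypothetical toy data: [serum creatinine, hemoglobin]; not taken from the paper.
X = [[1.0, 15.0], [0.9, 14.5], [1.1, 15.5], [4.0, 9.0], [5.2, 8.5], [3.8, 10.0]]
y = ["notckd", "notckd", "notckd", "ckd", "ckd", "ckd"]

model = fit_gaussian_nb(X, y)
print(predict(model, [4.5, 9.2]))
print(predict(model, [1.0, 15.2]))
```

Working in log space avoids underflow when many per-feature likelihoods are multiplied, which is a common design choice for this family of classifiers.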

Proposed System Framework
In general, the proposed system involves four main phases: preprocessing, statistical analysis, feature selection, and classification. Each phase includes a set of sub-steps, as Figure (2) shows:

Figure (2): Proposed System Framework
The following sections provide a detailed explanation of each phase of the proposed system framework:

Preprocessing Stage:
Preprocessing is one of the most important steps in a disease diagnosis and classification system. The data set adopted for the proposed system contains two types of data: numerical and nominal. When preparing the data, the identifier column is dropped first. Because the patient identifier is assigned arbitrarily and independently of the class, it carries no information useful for classification. Then the missing data is processed, as explained later, and all features in the data set are scanned to find which are similar and which are duplicates, so that these can be deleted. This process relies on the Gaussian distribution.
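The first two preparation steps, dropping the identifier column and removing duplicate features, might look like the sketch below. It is a simplification: duplicates are detected by exact equality of column values rather than by the Gaussian-based comparison the paper mentions, and the record layout, the column names, and the `bp_copy` column are hypothetical.

```python
def drop_column(records, name):
    """Return copies of the records without the given column."""
    return [{k: v for k, v in rec.items() if k != name} for rec in records]


def duplicate_columns(records):
    """List columns whose values repeat an earlier column in every record."""
    seen = set()
    duplicates = []
    for name in records[0]:
        values = tuple(rec[name] for rec in records)
        if values in seen:
            duplicates.append(name)
        else:
            seen.add(values)
    return duplicates


# Hypothetical records; "bp_copy" is an artificial duplicate added for illustration.
records = [
    {"id": 0, "age": 48, "bp": 80, "bp_copy": 80},
    {"id": 1, "age": 62, "bp": 70, "bp_copy": 70},
    {"id": 2, "age": 51, "bp": 90, "bp_copy": 90},
]

# The identifier carries no class information, so it is removed first.
records = drop_column(records, "id")
for name in duplicate_columns(records):
    records = drop_column(records, name)

print(list(records[0]))
```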

A. Handling Missing Values
The data set used in this paper contains two types of data: 10 numerical and 13 nominal attributes. Missing numerical values are replaced with the average value of the column, while missing nominal values are replaced with the adjacent value.
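A minimal sketch of the two imputation rules follows. It assumes that missing entries are represented as `None` and that "adjacent value" means the previous observed value in the column, falling back to the next one when the gap is at the start; the paper does not specify either detail, and the sample columns are hypothetical.

```python
def impute_mean(values):
    """Replace None in a numeric column with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def impute_adjacent(values):
    """Replace None in a nominal column with the previous value, else the next one."""
    filled = list(values)
    for i in range(1, len(filled)):
        if filled[i] is None:
            filled[i] = filled[i - 1]  # forward fill
    for i in range(len(filled) - 2, -1, -1):
        if filled[i] is None:
            filled[i] = filled[i + 1]  # backward fill for leading gaps
    return filled


# Hypothetical columns with gaps.
bp = [80, None, 70, 90]
rbc = [None, "normal", None, "abnormal"]

print(impute_mean(bp))
print(impute_adjacent(rbc))
```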

B. Encoding Categorical Features
In this step, each categorical value under a specified feature is converted to a numerical value so that it can be used in mathematical operations, as many machine learning algorithms cannot work with categorical data directly; the data must be numeric. This step converts each nominal attribute to a value of 1 or 0 for use in the arithmetic operations of training and testing, as shown in Table (2).
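The 0/1 encoding can be sketched as follows. The choice of which category maps to 1 is an assumption made here for illustration (the paper's Table (2) defines the actual mapping), and the helper name `encode_binary` and the sample values are hypothetical.

```python
def encode_binary(values, positive):
    """Map a two-valued nominal column to 1 (the positive category) or 0 (the other)."""
    return [1 if v == positive else 0 for v in values]


# Hypothetical choice of the category encoded as 1 for each nominal attribute.
POSITIVE = {"htn": "yes", "appet": "good", "rbc": "normal", "class": "ckd"}

columns = {
    "htn": ["yes", "no", "yes"],
    "appet": ["good", "poor", "good"],
    "rbc": ["normal", "abnormal", "normal"],
    "class": ["ckd", "notckd", "ckd"],
}

encoded = {name: encode_binary(vals, POSITIVE[name]) for name, vals in columns.items()}
print(encoded)
```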

Statistical Analysis
This section shows the results of the exploratory data analysis and the correlation matrix.

A. Exploratory Data Analysis (EDA)
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

B. Correlation Matrix
A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables, and the matrix serves as a diagnostic for more advanced analysis.
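As an illustration, a correlation matrix can be built by computing the Pearson coefficient for every pair of numeric columns. The sketch below uses only the standard library; the column values are hypothetical and were chosen so that the pairs are perfectly linearly related.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict where matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values, not taken from the paper's data set.
columns = {
    "sc": [1.0, 2.0, 3.0, 4.0],
    "bu": [20.0, 40.0, 60.0, 80.0],    # rises with sc
    "hemo": [16.0, 14.0, 12.0, 10.0],  # falls as sc rises
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```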
Pearson correlation measures the linear association between two variables. It has a value between -1 and 1, where:
• -1 indicates a perfect negative linear correlation between two variables.
• 0 indicates no linear correlation between two variables.
• 1 indicates a perfect positive linear correlation between two variables.

The feature abbreviations are:
age - age
bp - blood pressure
sg - specific gravity
al - albumin
su - sugar
rbc - red blood cells
pc - pus cell
pcc - pus cell clumps
ba - bacteria
bgr - blood glucose random
bu - blood urea
sc - serum creatinine
sod - sodium
pot - potassium
hemo - hemoglobin
pcv - packed cell volume
wc - white blood cell count
rc - red blood cell count
htn - hypertension
dm - diabetes mellitus
cad - coronary artery disease
appet - appetite
pe - pedal edema
ane - anemia
class - class

Conclusion
This paper presents a medical sector application that helps medical practitioners predict CKD based on CKD parameters. We conclude that the Naive Bayes algorithm is one of the best algorithms for classification and diagnosis in the medical field. We recommend the use of other algorithms that achieve the same level of diagnostic accuracy and speed.

Related Work
This section reviews some previous studies and explains the different techniques used to diagnose chronic kidney disease.
Alassaf, R. A., et al., 2018 [3]
Polat, H., et al., 2017 [3]: In this study, encapsulation and filtering methods were applied to the CKD data set, achieving 89% with the Naïve Bayes algorithm.
Naive Bayes and

Table 1: The features before and after handling the missing values. The table shows that no missing values remain after preprocessing.