Investigating the Applicability of Logistic Regression and Artificial Neural Networks in Predicting Breast Cancer

Although improvements in survival for this disease have been recorded in high-resource countries, the risk continues to rise, with high mortality rates recorded in developing countries [1]. The Iraqi Cancer Registry reported that breast cancer accounts for the largest share of cancer cases (19.1%) and has the highest annual estimated incidence among women (25.8 per 100,000 of the female population). Breast cancer also ranked second in cancer mortality (2.7 per 100,000 population) [3]. According to one study, the survival rate for early-stage disease is 88% at 5 years after diagnosis and 80% at 10 years, which means that about 88% of women diagnosed early will survive for at least 5 years; it is therefore necessary to detect breast cancer at the earliest possible stage [4].
Machine learning has become increasingly influential in cancer diagnosis because it allows inferences to be made that classical statistical procedures cannot make [5].
A classification model, known as a classifier, can help oncologists reach the right diagnosis from a breast biopsy. The classification problem refers to predicting the target class of new observations from a given set of predictor variables in the population dataset.
Since the outcome of a biopsy confirms the presence or absence of malignancy, it is considered a binary outcome.
Logistic regression (LR) has been applied increasingly in many fields, particularly medicine, and is a statistical algorithm well suited to binary classification, that is, to evaluating the relationship between one or more categorical or continuous predictor variables and a dichotomous dependent variable [6]. Logistic regression can assign observations to predefined classes: discrimination rules are estimated during the training phase and are then used to assign new observations to those classes [7].
Methods of variable selection differ according to the problem. It is essential to include all relevant variables in the model. Some researchers propose including all clinical and other predictive variables, regardless of their significance, in order to obtain a better fit to the data. However, additional variables affect the coefficients in the model and can lead to over-fitting. Moreover, a model with many insignificant predictors yields lower classification accuracy, and its results are harder to explain. Statistical model-building techniques commonly attempt to minimize the number of variables to obtain a numerically stable and generalizable model, although this can result in large standard errors. Variable selection can be done in two ways, filter and statistical [8,9]. In the filter method, the variables are reduced according to their importance, as was done in similar research. Statistical variable selection, on the other hand, can be done by any of the following methods [10]: "Enter: A procedure for variable selection in which all variables in a block are entered in a single step. Forward Selection (Conditional). Forward Selection (Likelihood Ratio). Forward Selection (Wald). Backward Elimination (Conditional). Backward Elimination (Likelihood Ratio). Backward Elimination (Wald)."
In forward selection, significant effects, once entered, cannot be removed from the model. In backward elimination, effects removed from the model cannot be entered again. In stepwise selection, which is the method of focus in our study, variables already included in the model need not remain: they can be entered into or eliminated from the model in such a way that every forward selection step may be followed by one or more backward elimination steps. The stepwise selection procedure stops when no additional effect can be added to the model [11]. In this study, two variable selection procedures, the Enter and Stepwise methods, were used to build the logistic regression models.
Artificial neural networks (ANNs) are commonly used as robust decision-making systems, especially for medical diagnosis, after being trained on a historical dataset. The advantages of ANNs can be summarized as follows: neural weights are tuned online without the need for a pre-training phase, and system persistence and performance are ensured. An ANN is a powerful classifier that represents a nonlinear relationship between input and output. A simple ANN consists of three kinds of layers: an input layer, one or more hidden layers, and an output layer. At the input layer the inputs are weighted, i.e. each input value is multiplied by a certain weight. At the hidden layer, all weighted inputs are summed together with a bias. Finally, at the output layer, the summed value is converted to an activation signal by a transform function. The ANN is trained with a learning algorithm chosen according to the type of problem. Learning algorithms are generally supervised, unsupervised, or reinforcement learning algorithms [12].
This study aims to evaluate the performance of two techniques, logistic regression and neural networks, in order to determine which is more powerful in classifying breast tumors as benign or malignant.

Data
The dataset used in this study is the breast-cancer-Wisconsin.data file, which was obtained from the UCI Machine Learning Repository [13]. The dataset consists of observations of 699 breast fine needle aspirations (FNAs). It is arranged in 11 columns, and each row holds the observations for one patient's breast FNA obtained from medical analysis. The first column is an identification code associated with each patient. The following 9 columns are the features used to analyze each FNA obtained from the patient's breast tumor: clump thickness (the extent to which cell aggregates are mono- or multilayered), uniformity of cell size, uniformity of cell shape, marginal adhesion (cohesion of the marginal cells of the cell aggregates), single epithelial cell size (diameter of the population of the largest cells relative to erythrocytes), bare nuclei (the proportion of single-cell nuclei devoid of surrounding cytoplasm), bland chromatin, normal nucleoli, and mitosis [14]. The last column is the dependent variable (cancer type: 4 for malignant and 2 for benign tumors).
"All malignant aspirates were histologically confirmed whereas FNAs diagnosed as benign masses were biopsied only at the patient's request. The remainder of benign cytologies were confirmed by clinical reexamination 3 and 12 months after the aspiration. Masses that produced unsatisfactory or suspicious FNAs were surgically biopsied" [15].
All the independent variables have numerical values ranging from 1 to 10, and these values were obtained through medical analysis or lab tests. The distribution of the dependent variable Class is: Benign, 458 (65.5%), and Malignant, 241 (34.5%).
The first step in this study was to convert the data into an Excel sheet to simplify building the statistical model; the data were then imported into SPSS (V19.0, SPSS Inc.), where the LR model was generated and analyzed.
The second step was data cleaning. Missing values are a well-known issue in datasets. There are several methods for handling them, such as listwise (case) deletion or substituting the missing values with the mean or mode of the variable [16]. In our study, missing values were replaced by the mean of the nearby attribute values. The identifier (Id) column was removed, as it is not required in the design and analysis of our model.
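The cleaning steps described above can be illustrated with a minimal Python sketch. It is not the SPSS workflow used in the study. The three rows, the `?` marker for a missing value, the function name `load_and_clean` and the use of the plain column mean (the text does not specify exactly how "nearby" values were chosen) are all assumptions made for illustration.

```python
import csv
import io

# Three illustrative rows in the layout of the data file:
# id, nine cytological features (1-10), class (2 = benign, 4 = malignant).
# The "?" missing-value marker is an assumption of this sketch.
SAMPLE = """1000025,5,1,1,1,2,1,3,1,1,2
1057013,8,4,5,1,2,?,7,3,1,4
1017122,8,10,10,8,7,10,9,7,1,4
"""

def load_and_clean(text):
    """Drop the id column, impute missing values, recode the class to 0/1."""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    # Remove the identifier column; it is not used by the models.
    rows = [r[1:] for r in rows]
    n_features = len(rows[0]) - 1
    # Column means over the observed (non-missing) entries only.
    means = []
    for j in range(n_features):
        observed = [float(r[j]) for r in rows if r[j] != "?"]
        means.append(sum(observed) / len(observed))
    X, y = [], []
    for r in rows:
        # Assumption: a missing entry is replaced by its column mean.
        X.append([means[j] if r[j] == "?" else float(r[j])
                  for j in range(n_features)])
        # Recode the class label: 2 (benign) -> 0, 4 (malignant) -> 1.
        y.append(1 if r[-1] == "4" else 0)
    return X, y

X, y = load_and_clean(SAMPLE)
```

Here the missing sixth feature of the second row is filled with the mean of the two observed values in that column.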

Logistic Regression (LR) Model
The conditional probability that the dependent variable occurs is given by the logistic function [17],

$$p(Y=1 \mid x_1, \dots, x_n) = \frac{1}{1 + e^{-z}}$$

where the probability estimates lie between 0 and 1 because of the logistic transformation, and $z$ is also called the logit. The logit is a linear multiple regression model of the independent variables,

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

where $\beta_0, \dots, \beta_n$ are the coefficients of the independent variables, estimated by maximum likelihood, $x_1, \dots, x_n$ are the independent variables and $n$ is the number of explanatory variables.
The reference probability is defined as

$$p(Y=0) = 1 - p(Y=1)$$

The log(odds), or log-odds ratio, is defined by

$$\log(\text{odds}) = \ln\left(\frac{p(Y=1)}{p(Y=0)}\right)$$

and expresses the natural logarithm of the ratio between the probability that an event will occur, $p(Y=1)$, and the probability that it will not occur, $p(Y=0)$; it is found by calculating the probability of each event. The odds ratio measures the change in the odds of the event when an independent variable increases by one unit. The odds ratio is defined as

$$\mathrm{OR}_i = e^{\beta_i}$$

where $\beta_i$ is the logistic regression coefficient of the $i$-th independent variable.
For the first method, two models were built: a full model using the standard Enter method with all 9 attributes, and a reduced model using the stepwise forward selection (Wald) method. The stepwise method tests the entry of variables according to the significance of the score statistic, while removal testing is based on the probability of the Wald statistic [10]. The reduced model was developed with only 5 attributes, namely clump thickness, uniformity of cell size, marginal adhesion, bare nuclei, and bland chromatin, since they were statistically significant at the 0.05 level according to the Wald statistic.
Validation is a crucial part of the model-building process and is important for measuring the stability and robustness of the coefficients resulting from logistic regression [18]. Validation applies the coefficient values obtained from the training data to a different dataset in order to calculate the percentage of correct classifications. The percentage of correctly predicted samples in the training set must be greater than or equal to that in the validation set [19].
Many statistical tools are available for validating model performance in binary logistic regression, such as data splitting, repeated data splitting, the jackknife technique and bootstrapping [20]. In our study the data-splitting technique was used: the data were randomly divided into two groups. The first group, consisting of approximately 80% of the data (550 samples; 373 benign and 177 malignant), was used to develop the LR models, and the second group, consisting of approximately 20% of the data (149 samples; 85 benign and 64 malignant), was used to validate the two models.
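A random split of this kind can be sketched in a few lines of Python. The function name `split_dataset`, the fixed seed and the placeholder records are assumptions made for illustration; the class counts in each part will generally differ from the 373/177 and 85/64 split reported above, which depends on the random draw made in the study.

```python
import random

def split_dataset(records, n_train, seed=0):
    """Shuffle a copy of the records and cut it into training and validation parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder records: only the class labels matter for this illustration.
records = ["benign"] * 458 + ["malignant"] * 241
train, valid = split_dataset(records, n_train=550)
print(len(train), len(valid))  # 550 149
```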
The training data were first used to fit both the full and reduced models; the validation data were then applied to the fitted models to evaluate their performance. The posterior probability obtained for the malignant class was classified into two categories: a posterior probability in the range 0-0.5 was labelled benign, and one in the range 0.5-1 was labelled malignant. The results were then evaluated in terms of accuracy (ACC), specificity, sensitivity, and area under the ROC curve.

Neural Networks
In our study, two types of ANN were used. The first is the multilayer perceptron (MLP) network (Fig. 1), a well-known network architecture that has been used in medical, engineering, and mathematical modeling research. In an MLP, the weighted sum of the inputs together with a fixed value (bias) is propagated to the hidden layer via a transfer function to generate the output; a network whose units are arranged in this feed-forward layered topology is called a feed-forward neural network (FFNN). The hidden layer greatly increases the learning ability of the MLP. The activation function of the network transforms the input so as to give the required output. Model building is strongly affected by the number of hidden nodes, the number of hidden layers, and the choice of activation function [21].
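The output of an MLP NN is given by

$$y(j) = T\left(\sum_{i=1}^{n} w_i\, x_i(j) + c\right)$$

where $y(j)$ is the output value, $x(j) = (x_1(j), \dots, x_n(j))$ is the input vector, $T$ is the transform function, $c$ is a constant, $w = (w_1, \dots, w_n)$ is the vector of weights, and $n$ is the size of the input vector. The equation is in discrete time $j$ [12].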
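The computation of a single unit, a weighted sum of the inputs plus a constant passed through a transform function, can be sketched as follows. The sigmoid transform, the function names and the numeric values are illustrative assumptions, not the configuration used in the study.

```python
import math

def sigmoid(s):
    """Logistic transform function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

def neuron_output(x, w, c, transfer=sigmoid):
    """Weighted sum of the inputs plus a constant, passed through the transform."""
    if len(x) != len(w):
        raise ValueError("input and weight vectors must have the same size")
    s = sum(wi * xi for wi, xi in zip(w, x)) + c
    return transfer(s)

# Hypothetical inputs and weights: 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0
y0 = neuron_output([1.0, 2.0], [0.5, -0.25], 0.0)
print(y0)  # 0.5, since sigmoid(0) = 0.5
```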
The second type of NN used is the radial basis function (RBF) neural network, which is based on supervised learning. RBF NNs are efficient in modeling nonlinear data and, in contrast to the MLP, this type of NN can be trained in one stage. In the hidden layer the RBF NN uses a nonlinear Gaussian transfer function, whereas in the output layer it uses a linear summation transfer function. The real values of the n-dimensional input vector $x$ are fed to all units in the hidden layer at the same time (Fig. 2). The Gaussian RBF is given by

$$\phi\left(\lVert x - c^{(i)} \rVert\right) = \exp\left(-\frac{\lVert x - c^{(i)} \rVert^{2}}{2\sigma_i^{2}}\right)$$

where the functions $\phi(\lVert x - c^{(i)} \rVert)$, $i = 1, 2, \dots, N$, are called the RBFs, $\lVert \cdot \rVert$ denotes a p-norm (often the Euclidean 2-norm), $c^{(i)}$ is the centre of the basis function and $\sigma_i$ is its radius. A linear combination of basis functions can be used to approximate a nonlinear function. The output $f: \mathbb{R}^n \to \mathbb{R}$ of the network is thus

$$f(x) = \sum_{i=1}^{N} w_i\, \phi\left(\lVert x - c^{(i)} \rVert\right)$$

where $N$ is the number of neurons in the hidden layer and the real parameters $w_i$, $i = 1, 2, \dots, N$, are the weights of the linear output neurons [22].
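A forward pass through such a network, with Gaussian hidden units followed by a linear output, can be sketched in Python. The exact scaling of the Gaussian (here the squared distance is divided by twice the squared radius), the function names and the toy centres, radii and weights are assumptions made for illustration only.

```python
import math

def gaussian_rbf(x, centre, sigma):
    """Gaussian basis function of the Euclidean distance between x and the centre."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(x, centre))
    # Assumption: the squared distance is scaled by 2 * sigma^2.
    return math.exp(-dist_sq / (2.0 * sigma ** 2))

def rbf_output(x, centres, sigmas, weights):
    """Linear combination of the hidden-unit activations."""
    return sum(w * gaussian_rbf(x, c, s)
               for w, c, s in zip(weights, centres, sigmas))

# Hypothetical two-unit network on a two-dimensional input.
centres = [[0.0, 0.0], [1.0, 1.0]]
sigmas = [1.0, 1.0]
weights = [2.0, -1.0]
out = rbf_output([0.0, 0.0], centres, sigmas, weights)
# First unit sits on the input (activation 1); the second is at squared
# distance 2, so its activation is exp(-1): out = 2 - exp(-1).
```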
To train an RBF network, once the type of radial basis function has been selected, all that remains is to choose the functions' dimensions and centres and to estimate the output neuron weights.
For the ANN approach, two models were developed using two different types of NN, namely MLP and RBF. The MLP had four layers: an input layer of 9 elements corresponding to the cytology data; two hidden layers with sigmoid activation functions, the first with 7 nodes and the second with 5; and an output layer with 2 neurons, representing 0 for benign and 1 for malignant lesions. A back-propagation algorithm based on the scaled conjugate gradient optimization technique was used to train the MLP on our dataset. To find the optimum network structure, a considerable number of neural networks were simulated by varying the number of hidden layers, hidden nodes, iterations and learning rates.
The feed-forward RBF network developed for this work was composed of 3 layers: an input layer with the 9 input elements; a single hidden layer of 9 neurons with a nonlinear RBF activation function, fully interconnected to the output layer units; and a linear output layer with 2 elements.
The error function E used to index the learning efficiency of both neural networks was the sum of squared errors (SSE) criterion, which was minimized over the training set. The performance of the NN models was assessed by dividing the dataset into two separate sets, 70% of the samples for training and 30% for validation. After the networks had been trained on the training data, each network was tested by presenting the testing set to it, and a diagnostic output vector of 0's and 1's was generated.
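To make the 9-7-5-2 MLP layout concrete, the following Python sketch runs a forward pass through an untrained network of that shape and evaluates the SSE criterion. It is only an illustration. The random weights, the sigmoid on the output layer, the rule of taking the index of the larger output as the class, and the sample feature vector are all assumptions, and no back-propagation training is shown.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer (untrained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    """Propagate an input vector through every layer in turn."""
    activation = list(x)
    for weights, biases in layers:
        activation = [
            sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation

def sum_squared_error(targets, outputs):
    """SSE criterion for one sample."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))

rng = random.Random(0)
sizes = [9, 7, 5, 2]  # input, two hidden layers, output
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

sample = [5, 1, 1, 1, 2, 1, 3, 1, 1]  # hypothetical feature vector (values 1-10)
outputs = forward(sample, layers)
# Assumption: output index 0 stands for benign and index 1 for malignant.
predicted_class = outputs.index(max(outputs))
error = sum_squared_error([1.0, 0.0], outputs)
```

In a trained network the weights would be adjusted to reduce the summed SSE over the training set; here they are random, so the predicted class carries no diagnostic meaning.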

Performance Metrics
Accuracy, the percentage of correct predictions, is the most widely used measure in classification tasks. Sensitivity and specificity must also be calculated, because the former indicates classification performance on the minority class, while the latter indicates the proportion of majority-class samples that are correctly identified. The area under the ROC curve (AUC) was also used to evaluate the performance of the feature selection method [23]. In our work, the models were evaluated using these metrics (Equations 9-13), based on the confusion matrix shown in Table 1.
where the TP rate is the sensitivity and the FP rate is 1 − specificity.
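The standard confusion-matrix definitions of these metrics can be sketched in Python as follows. The function name, the counts and in particular the single-threshold AUC estimate (1 + TP rate − FP rate) / 2 are assumptions made for illustration; the exact form of the study's equations is not reproduced here.

```python
def classification_metrics(tp, fn, tn, fp):
    """Metrics from confusion-matrix counts (malignant taken as the positive class)."""
    sensitivity = tp / (tp + fn)          # TP rate
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    fp_rate = 1.0 - specificity
    # Assumption: AUC estimated from a single operating point.
    auc_estimate = (1.0 + sensitivity - fp_rate) / 2.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "auc_estimate": auc_estimate,
    }

# Hypothetical counts, not taken from the study.
m = classification_metrics(tp=90, fn=10, tn=80, fp=20)
```

With these hypothetical counts the sensitivity is 0.90, the specificity 0.80, and both the accuracy and the AUC estimate come out at 0.85.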

Results
In this study, different models were built using IBM SPSS Statistics 19 software, and the performance of the classifiers was compared. The breast cancer dataset was downloaded from the UCI Machine Learning Repository and fed to our logistic regression models and to the MLP and RBF neural networks. Each classifier was trained on the dataset, a model was built and validated with test samples, and the results were obtained. The results of training the full LR model on the training sample are shown in Table 2.
For the neural networks, both the MLP and RBF models showed a considerable improvement over the logistic regression models in all performance metrics. The RBF NN outperformed all the other models developed in this work, with the highest correct classification rate (95.4%), sensitivity (98.5%) and AUC (96.125%); specificity was 93% for both MLP and RBF. Table 4 and Fig. 3 compare the performance of the logistic regression models and the two types of neural networks on the testing samples in terms of correct classification rate (accuracy), sensitivity, specificity, and area under the receiver operating characteristic curve (AUC).

Discussion
ANN and LR are widely used for prediction and classification tasks. In this work, the four models developed were compared on the validation dataset, after they had been sufficiently trained on the training data, to assess whether their output will predict future samples precisely. The first logistic regression model was the full model, which included all nine covariates. The second was the reduced model, built using the stepwise method, in which the variables with the largest Wald test p-values were removed, namely uniformity of cell shape, single epithelial cell size, mitosis, and normal nucleoli, retaining only the coefficients of the significant covariates. However, the reduced model did not show an improvement in any of the metrics used except specificity, as shown in Table 4 and Fig. 3.
The analysis of our results showed that the ability of the RBF NN to diagnose breast cancer is superior to that of the binary logistic regression models (both full and reduced) and of the MLP. The RBF NN achieved the highest results, with an accuracy of 95.4% and a sensitivity of 98.5%, followed by the MLP. In comparison with the neural networks, the logistic regression method showed much lower accuracy (74.7% for the full model and 74.03% for the reduced model), sensitivity (68.75% and 64.1%), specificity (83.53% and 85.33%), and AUC (76.1% and 74.79%).

Conclusion
This paper presents a comparative study of the diagnostic performance of two machine learning techniques, namely logistic regression and artificial neural networks, in classifying breast tumors as malignant or benign, using the breast-cancer-Wisconsin.data file which was collected

Figure 3. Comparison of the average sensitivity, specificity, accuracy, and AUC obtained for the full and reduced LR models and the MLP and RBF artificial neural networks.

Table 2 reports the values of the logistic regression parameters, standard errors, Wald statistics and p-values of the full logistic regression model. Considering all available variables, the logit of the full model is given by

$$
\begin{aligned}
\text{Logit}_1 = {} & -10.112 + 0.586\,(\text{clump thickness}) + 0.218\,(\text{uniformity of cell size}) \\
& + 0.159\,(\text{uniformity of cell shape}) + 0.304\,(\text{marginal adhesion}) \\
& - 0.110\,(\text{single epithelial cell size}) + 0.394\,(\text{bare nuclei}) \\
& + 0.483\,(\text{bland chromatin}) + 0.115\,(\text{normal nucleoli}) + 0.487\,(\text{mitosis})
\end{aligned}
$$

From Table 2, the small p-values of clump thickness, marginal adhesion, bare nuclei, and bland chromatin indicate that they are the most significant predictors of malignancy in the model at the 0.05 level. The coefficients of the reduced model, computed by training the model with the stepwise method on the training sample, are shown in Table 3, from which the logit of the reduced model is

Table 2. Parameter estimates of the full LR model fitted to the training sample.
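As an illustration of how the fitted full model turns a feature vector into a diagnosis, the following Python sketch evaluates the reported logit, converts it to a posterior probability with the logistic function, and applies the 0.5 cut-off. The dictionary keys, the function names, the two test vectors and the treatment of a probability of exactly 0.5 as malignant are assumptions made for illustration.

```python
import math

# Intercept and coefficients of the full model as reported in the text.
INTERCEPT = -10.112
COEFFICIENTS = {
    "clump_thickness": 0.586,
    "uniformity_of_cell_size": 0.218,
    "uniformity_of_cell_shape": 0.159,
    "marginal_adhesion": 0.304,
    "single_epithelial_cell_size": -0.110,
    "bare_nuclei": 0.394,
    "bland_chromatin": 0.483,
    "normal_nucleoli": 0.115,
    "mitosis": 0.487,
}

def logit_full(features):
    """Linear predictor z of the full model."""
    return INTERCEPT + sum(COEFFICIENTS[name] * features[name]
                           for name in COEFFICIENTS)

def probability_malignant(features):
    """Posterior probability of malignancy via the logistic function."""
    return 1.0 / (1.0 + math.exp(-logit_full(features)))

def classify(features, threshold=0.5):
    # Assumption: a probability exactly at the threshold counts as malignant.
    return "malignant" if probability_malignant(features) >= threshold else "benign"

# Hypothetical extreme cases: every feature at its minimum or maximum score.
all_ones = {name: 1 for name in COEFFICIENTS}
all_tens = {name: 10 for name in COEFFICIENTS}
print(classify(all_ones), classify(all_tens))  # benign malignant

# Odds ratio for a one-unit increase in clump thickness (about 1.8).
odds_ratio_clump = math.exp(COEFFICIENTS["clump_thickness"])
```

For the all-ones vector the logit is -10.112 + 2.636 = -7.476, which gives a posterior probability far below 0.5.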

Table 3. Parameter estimates of the reduced LR model (stepwise) fitted to the training samples.

Table 4. Comparative performance of the four models on validation samples.