Comparison of Min-Max normalization and Z-Score Normalization in the K-nearest neighbor (kNN) Algorithm to Test the Accuracy of Types of Breast Cancer

The purpose of this study was to examine the results of breast cancer prediction, in which cases are classified into two types, malignant and benign. The method used in this research is the k-NN algorithm with min-max and Z-score normalization, implemented in the R programming language. With min-max normalization, the highest accuracy, 98%, was obtained at k = 5 and k = 21, whereas with Z-score normalization the highest accuracy, 97%, was obtained at k = 5 and k = 15. Thus min-max normalization is considered better than Z-score normalization in this study. The novelty of this research lies in the comparison between min-max normalization and Z-score normalization in the k-NN algorithm.


Introduction
In some datasets, the attributes have different ranges of values. These differences cause attributes with much smaller values to have less influence than the other attributes. Therefore, the data need to be transformed by normalization, which brings the range of values of each attribute onto a common scale. Data transformation by normalization can be done in several ways, namely Min-Max normalization, Z-Score normalization, Decimal Scaling normalization, Sigmoidal normalization, and Softmax normalization [1]. In this study, two normalization methods were used, namely Min-Max and Z-Score. The algorithm used was k-Nearest Neighbors (k-NN), and the data used was a dataset of breast cancer types. k-NN is an algorithm used to classify data [2] [3]. Classification is an important stage in data mining: it assigns new data or objects to classes or labels based on certain attributes. k-NN is a nonparametric machine learning algorithm. A nonparametric model does not assume anything about the distribution of instances in the dataset. Nonparametric models are usually more difficult to interpret, but one advantage is that the decision boundaries they generate between classes can be very flexible and nonlinear.
In this research, the dataset being analyzed relates to breast cancer. Breast cancer ranks second as a cause of cancer death in women, after lung cancer [4]. Today, about 1 in 8 women (12%) will develop breast cancer in their lifetime. The American Cancer Society estimated that in 2017 about 252,710 women would be diagnosed with invasive breast cancer and about 40,610 would die from the disease. Only 5% to 10% of breast cancers occur in women with a clear genetic predisposition for this disease. Most breast cancers are "sporadic", which means there is no immediate family history of the disease. The risk of developing breast cancer increases as a woman ages.
Pandey and Jain (2017) [5] compared data normalization using the min-max and Z-score methods on the IRIS dataset and obtained 100% accuracy at k = 1 with min-max normalization and 85.71% with Z-score normalization. Chamidah et al. (2012) [1] obtained optimal classification results in breast cancer cases, with an accuracy of 96.86% for min-max normalization and 95.68% for Z-score normalization. Nasution et al. (2019) [6] compared data normalization for wine classification using the k-NN algorithm on the wine dataset and obtained 65.92% with min-max normalization and 65.85% with Z-score normalization.
In this study, we used the k-NN classification method and compared min-max and Z-score normalization to test the accuracy of classifying breast cancer types.
The purpose of this study was to examine the results of breast cancer prediction, in which cases are classified into two types, malignant and benign. The novelty proposed lies in the comparison between min-max normalization and Z-score normalization in the k-NN algorithm.

Research Method
This study uses the data mining classification method with the k-NN algorithm, implemented in the R programming language. Figure 1 shows the steps taken in conducting the research.

Classification Algorithm k-NN
In general, data mining is a scientific discipline that studies methods for extracting knowledge or finding patterns from large data. Data mining is also an iterative and interactive process for obtaining interesting, new, and useful patterns. An interactive process is one that still requires human interaction to be carried out. An iterative process is one that is not done only once: it must be repeated to obtain the important data in question. Models generated from the data mining process are expected to generalize so that they can be used for future purposes. Data mining is an activity that includes collecting and using historical data to find regularities, patterns, and relationships in large data sets [7] [8]. Data mining performs extraction to obtain important information from data that is implicit and previously unknown [9]. Other names used for data mining are knowledge discovery in databases (KDD), big data, business intelligence, knowledge extraction, pattern analysis, and information harvesting. The purpose of data mining is to extract and identify certain information from a large database or big data [10]. The main functions of data mining are estimation, forecasting, classification, clustering, and association [11].
Classification is the process of finding a model (or function) that describes and differentiates data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown [12]. The algorithms used in classification include Decision Tree (CART, ID3, C4.5, Credal DT, Credal C4.5, Adaptive Credal C4.5), Naive Bayes (NB), K-Nearest Neighbor (k-NN), Linear Discriminant Analysis (LDA), Logistic Regression (LogR), and others.
The k-nearest neighbors (k-NN) algorithm classifies data based on learning data (training datasets), using the k nearest neighbors of the new data [13], where k is the number of nearest neighbors. k-NN performs classification by projecting the learning data into a multi-dimensional space. This space is divided into sections that represent the criteria of the learning data. Each learning datum is represented as a point c in the multi-dimensional space. The new data to be classified is then projected into the same space, which contains the points c of the learning data. Classification is carried out by finding the points c nearest to the new point (its nearest neighbors). A common technique for finding the nearest neighbors uses the Euclidean distance [14], which can be calculated using the following formula.

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where x and y are two points in the n-dimensional attribute space and x_i, y_i are their values for the i-th attribute.
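The nearest-neighbor vote described above can be sketched as follows. The study itself was carried out in R, so this Python version is only an illustration and not the authors' code; the function names `euclidean_distance` and `knn_predict` and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def euclidean_distance(p, q):
    """Square root of the summed squared differences between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train_points, train_labels, query, k):
    """Return the majority label among the k training points closest to query."""
    # Pair each training point's distance to the query with its label, nearest first.
    distances = sorted(
        (euclidean_distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Vote among the k nearest neighbors.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy, made-up data: two features per sample, labels "B" (benign) and "M" (malignant).
train_points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.8, 0.9), (0.9, 0.8), (0.85, 0.95)]
train_labels = ["B", "B", "B", "M", "M", "M"]

prediction = knn_predict(train_points, train_labels, query=(0.3, 0.2), k=3)
```

With two classes, choosing an odd k avoids tied votes, which is one plausible reason for testing only odd values of k.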

Collecting and preparing data
This study uses the Wisconsin Breast Cancer Diagnostic [20] dataset from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml. The breast cancer data includes 569 cancer biopsy samples with 32 variables: an ID, the diagnosis, and 30 laboratory measurements. The id variable is the medical record number of a patient with breast cancer. The diagnosis variable records the patient's current condition and takes two values, "M" for malignant and "B" for benign. The other 30 variables are laboratory measurements consisting of the mean, se (standard error), and worst values. Each of these three groups consists of 10 different characteristics, namely radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. Figure 2 shows the Wisconsin Breast Cancer Diagnostic dataset.

Fig. 2. Wisconsin Breast Cancer Diagnostic Dataset
The patient id variable is a unique number for each patient in the data and does not provide meaningful information, so in this study the id variable was excluded from the model. The diagnosis variable is the outcome to be predicted; it indicates whether the cancer is malignant or benign. Table 1 shows the frequency of diagnosis of malignant and benign cancer. The other 30 variables are numerical measurements whose ranges differ across the 10 characteristics. The smoothness_mean ranges from 0.05263 to 0.16340, the radius_mean ranges from 6.981 to 28.110, and the area_mean ranges from 143.5 to 2501. Because area_mean takes much larger values than smoothness_mean, it has a much greater impact on the distance calculation. This creates a classification problem, so the data need to be normalized to rescale the features.
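To make the effect of the differing ranges concrete, the short Python sketch below computes the Euclidean distance between two hypothetical samples placed at opposite ends of the smoothness_mean and area_mean ranges quoted above. The two samples are invented for illustration, and the study itself used R.

```python
import math

# Extreme values reported in the text for two attributes.
smoothness_min, smoothness_max = 0.05263, 0.16340
area_min, area_max = 143.5, 2501.0

# Two hypothetical samples sitting at opposite ends of both ranges.
sample_a = (smoothness_min, area_min)
sample_b = (smoothness_max, area_max)

smoothness_term = (sample_a[0] - sample_b[0]) ** 2
area_term = (sample_a[1] - sample_b[1]) ** 2
distance = math.sqrt(smoothness_term + area_term)

# Fraction of the squared distance contributed by smoothness_mean.
smoothness_share = smoothness_term / (smoothness_term + area_term)
```

Even though both attributes differ by their full range, the unscaled distance is determined almost entirely by area_mean, which is why rescaling is needed before k-NN.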

Normalization of numeric data
Min-max normalization applies a linear transformation to the original data so that the relative comparison between values is preserved before and after the process [15] [16]. This method can use the following formula [17], given here in its common form, which maps each attribute to the interval [0, 1]:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
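A minimal Python sketch of the min-max transformation described above, assuming the common form x' = (x - min) / (max - min) that rescales an attribute to [0, 1]. The helper name `min_max_normalize`, the handling of constant attributes, and the sample values are assumptions for illustration; the study's own implementation was in R.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # A constant attribute has no spread; map every value to 0.0 (a choice made here).
        return [0.0 for _ in values]
    return [(v - lowest) / span for v in values]


# Made-up area_mean-like values spanning the range quoted in the text.
area_values = [143.5, 500.0, 1322.25, 2501.0]
area_scaled = min_max_normalize(area_values)
```

After this step the smallest value of each attribute becomes 0 and the largest becomes 1, so every attribute contributes to the distance on the same scale.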

Training and testing data
From the data, 569 biopsies were classified as benign or malignant. This study tests how well the model performs after normalization. The data is divided into two parts, namely training data and testing data. Training data is used to build the k-NN model, while testing data is used to estimate the accuracy of the model. This study used 469 samples as training data and 100 samples as testing data to simulate new cancer patients. Table 4 shows the percentage comparison of the frequency of breast cancer diagnosis: in the training data 36.9% of cases are malignant and 63.1% are benign, while in the testing data 39% are malignant and 61% are benign.
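The split and the class proportions can be sketched in Python as follows. The text does not state how the 469 training rows were selected, so taking the first 469 rows is an assumption, and the placeholder labels use counts (173 and 39 malignant cases) inferred from the reported percentages rather than taken from the paper. The helper names are hypothetical.

```python
def split_train_test(rows, n_train):
    """Use the first n_train rows for training and the remainder for testing."""
    return rows[:n_train], rows[n_train:]


def class_percentages(labels):
    """Return the percentage of each label, rounded to one decimal place."""
    total = len(labels)
    return {
        label: round(100 * labels.count(label) / total, 1)
        for label in sorted(set(labels))
    }


# Placeholder labels: 173 of 469 training rows and 39 of 100 testing rows are
# malignant, counts inferred from the percentages reported in the text.
labels = ["M"] * 173 + ["B"] * 296 + ["M"] * 39 + ["B"] * 61
train_labels, test_labels = split_train_test(labels, 469)

train_share = class_percentages(train_labels)
test_share = class_percentages(test_labels)
```

Comparing the class shares of the two subsets, as Table 4 does, is a quick check that the testing data resembles the training data.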

Result and Discussion
In data mining classification, the algorithm is evaluated to determine its level of accuracy. The data is divided into two sets, namely training data and testing data. Training data is used to build the pattern that forms the classification model, while testing data is used to measure whether the algorithm classifies correctly. Evaluation uses a confusion matrix, which assesses classification performance based on whether objects are classified correctly or incorrectly. To get better accuracy results, an experiment was carried out in which the overall average value was calculated. Figure 3 shows the result of the min-max normalization method with k = 21.

Fig. 3. Min-max normalization method with k = 21
Figure 3 shows that the values are divided into four categories, namely true negative, true positive, false negative, and false positive. The Benign column shows 61 true negatives, the cases where the cancer is benign and the k-NN algorithm identifies it correctly, while the false positive count is 0. The Malignant column shows 37 true positives, the cases where the cancer is malignant and is predicted as malignant. The remaining 2 cases are false negatives: the k-NN prediction does not agree with the actual label, so the predicted value is benign even though the cancer is actually malignant. This mistake is very dangerous because it can lead patients to believe that they are free of malignant cancer when in fact they have it, so that, if not treated properly, the cancer will continue to spread.
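The accuracy implied by these four counts can be checked with a few lines of Python. The counts are those reported in the text for min-max normalization with k = 21. Sensitivity and specificity are standard additional metrics that the paper does not report; they are included here only to quantify the false negative concern.

```python
# Counts reported in the text for min-max normalization with k = 21.
true_negative = 61   # benign, predicted benign
false_positive = 0   # benign, predicted malignant
false_negative = 2   # malignant, predicted benign
true_positive = 37   # malignant, predicted malignant

total = true_negative + false_positive + false_negative + true_positive

accuracy = (true_positive + true_negative) / total
# Sensitivity (recall): share of malignant cases that were detected.
sensitivity = true_positive / (true_positive + false_negative)
# Specificity: share of benign cases that were correctly identified as benign.
specificity = true_negative / (true_negative + false_positive)
```

The accuracy works out to 98 of 100 test cases, matching the 98% reported for k = 21, while the two false negatives mean that 37 of the 39 malignant cases were detected.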
Figure 4 compares the accuracy for different values of k in the min-max normalization method.

Fig. 4. Comparison of accuracy in the min-max normalization method
Based on Figure 4, the min-max normalization method achieves the highest accuracy, 98%, at k = 5 and k = 21, and the lowest accuracy, 96%, at k = 1, k = 7, k = 9 and k = 27.
The dataset was then transformed again using a different normalization method, Z-score normalization. The formula used in this method can be seen in equation 2. Z-score normalization uses the mean and standard deviation of the attribute values. Figure 5 shows the result of the Z-score normalization method with k = 21. Figure 6 shows that the Z-score normalization method achieves the highest accuracy, 97%, at k = 5 and k = 15, and the lowest accuracy, 95%, at k = 1, k = 13, k = 21, k = 23, k = 25 and k = 27. The test results indicate that the accuracy of the Z-score normalization method stays within a narrow range, between 95% and 97%. This accuracy is higher than the results reported by Pandey and Jain (2017) [5] on the IRIS dataset and by Nasution et al. (2019) [6] on the wine dataset.
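The Z-score transformation mentioned above, which uses the mean and standard deviation of each attribute, is commonly written z = (x - mean) / sd. The Python sketch below assumes the sample standard deviation (dividing by n - 1), which is what R's sd() computes; the paper does not state which variant was used. The helper name `z_score_normalize` and the sample values are illustrative.

```python
import statistics


def z_score_normalize(values):
    """Center values on their mean and scale by the sample standard deviation."""
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)  # sample standard deviation (n - 1)
    if std_dev == 0:
        # A constant attribute has no spread; map every value to 0.0 (a choice made here).
        return [0.0 for _ in values]
    return [(v - mean) / std_dev for v in values]


# Made-up attribute values.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_values = z_score_normalize(values)
```

Unlike min-max normalization, the result has no fixed minimum or maximum: values are expressed as the number of standard deviations above or below the mean, so extreme values remain far from zero.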

Conclusion
Breast cancer is classified into two types, namely benign and malignant. In the dataset used for this model, 62.7% of cases were benign and 37.3% were malignant. In the training dataset, 36.9% of cases were malignant and 63.1% were benign. In the testing dataset, 39% were malignant and 61% were benign. In the test using the min-max normalization method, 61% of the testing data were benign cases correctly predicted as benign (true negative), no benign case was predicted as malignant (false positive), 2% were malignant cases predicted as benign (false negative), and 37% were malignant cases correctly predicted as malignant (true positive). With min-max normalization, the highest accuracy, 98%, was obtained at k = 5 and k = 21. In the test using the Z-score normalization method, 61% of the testing data were benign cases correctly predicted as benign (true negative), no benign case was predicted as malignant (false positive), 5% were malignant cases predicted as benign (false negative), and 34% were malignant cases correctly predicted as malignant (true positive). With Z-score normalization, the highest accuracy, 97%, was obtained at k = 5 and k = 15. Thus the min-max method is considered better than Z-score normalization in this study, and this result strengthens the previous research conducted by Pandey and Jain (2017) [5].