Application of Data Mining Classification Method for Student Graduation Prediction Using K-Nearest Neighbor (K-NN) Algorithm

The student graduation rate is one of the indicators used to improve the accreditation of a study program. Monitoring and evaluating whether students tend to graduate on time is therefore necessary. One approach is to predict graduation using data mining techniques. The classification method used here is the K-Nearest Neighbor (K-NN) algorithm. The data come from student data, student grade data, and student graduation data for 2010-2012, with a total of 2,189 records. The attributes used are gender, school of origin, study program, and grade point index (IP) for semesters 1-6. The results show that the K-NN method achieved an accuracy of 89.04%.


Introduction
College is the level of education regarded as the last gate through which students gain knowledge before they enter the workforce. Institutions of higher education should improve the quality of their service and satisfy both students and the surrounding community in order to compete with other colleges. Higher education accreditation by BAN-PT (the National Accreditation Body for Higher Education) is one of the parameters that determine the quality of universities and study programs in Indonesia. The student graduation rate is one indicator for raising the accreditation of a study program. Therefore, it is necessary to monitor and evaluate whether students tend to graduate on time.
Stmik Amikom Purwokerto is a higher education institution in the field of management and computer science. The 2009 Education Regulation of Stmik Amikom Purwokerto, Chapter I, Article 1, paragraph 2, states that the regular undergraduate program (S1) is an academic education program following secondary education, with a study load of at least 144 and at most 160 SKS (semester credit units). The standard duration of the undergraduate program (S1) is 8 semesters, but many students graduate later than scheduled. The data indicate that only about 50% of the students in each cohort graduated on time. Monitoring and evaluation of the student graduation rate is therefore clearly needed. If left unchecked, the graduation rate will affect the accreditation score and may become an obstacle to Stmik Amikom Purwokerto advancing from a college (sekolah tinggi) to an institute.
This research analyzes the accuracy produced by the K-Nearest Neighbor (K-NN) classification method. K-NN was chosen because it is robust to training data that contains much noise and is more effective when the training data set is large [1] [2].
Based on the explanation above, the problem can be formulated as "how to implement data mining techniques that provide useful information about the graduation rate of students at Stmik Amikom Purwokerto using the K-Nearest Neighbor (K-NN) algorithm." The research aims to determine the level of accuracy achieved by the K-NN algorithm in predicting the graduation rate of students at Stmik Amikom Purwokerto. The practical benefit expected from this research is help in determining the level of student graduation, so that the results can serve as monitoring and evaluation material for improving the quality of the campus and as a reference for use with expert systems and decision-making systems. The materials used in this study were obtained from the student data, student grade data, and graduation data of Stmik Amikom Purwokerto students for 2010 to 2012.

Research Concept
This framework sets out the steps that will be taken to resolve the problem under study. The framework of this study is shown in Figure 2.

Fig. 2 Research Frameworks
The following describes the research framework shown in Figure 2:

a. Literature study
In the literature study, the authors reviewed journals and other relevant library sources relating to this study. These sources were then used as references to support the research process.

b. Data Collection
At the data collection stage, the authors use student graduation data from 2010-2012 as research data. The data obtained for 2010 to 2012 cover 2,189 students.

c. Preprocessing Data
In the data preprocessing phase, the researchers perform several processes to obtain a dataset free of missing values and inconsistent data, so that the dataset can be read and processed using WEKA.

d. Dataset distribution
Once the data have been obtained, the dataset is divided into two parts: training data and testing data.

e. Application of the K-Nearest Neighbor (K-NN) method
The two resulting datasets are then processed using the K-Nearest Neighbor (K-NN) method. The application used for the calculation is the Waikato Environment for Knowledge Analysis (WEKA) [13].

f. Accuracy results
Implementing the K-Nearest Neighbor (K-NN) method yields an accuracy result, which is then analyzed to determine whether the algorithm gives the best accuracy in classifying student graduation data as on time or not on time.

g. Results analysis
The results analysis phase draws conclusions from the implementation of the K-Nearest Neighbor (K-NN) classification method on the student graduation data. The accuracy results are then compared with those of the Decision Tree classification method analyzed in previous research [14] [15].

Literature Study
In the literature study phase, the researchers reviewed relevant journals, books, and other library resources related to the study. The reviewed sources were then used as references to support the research process.

Data Collection
The researchers collected the data needed for this study. The data come from Stmik Amikom Purwokerto and consist of student graduation data from

Data Preprocessing
a. Data cleanup and integration
Data cleanup is done so that the data obtained are relevant to the needs of the study, because not all attributes in the table will be used [4]. Cleanup is critical for improving performance in the mining process. It can be done by deleting incomplete records and deleting unused attributes [5].
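
The two cleanup operations described above, deleting unused attributes and deleting incomplete records, can be sketched in a few lines of Python. This is only an illustration: the study performed its preprocessing for WEKA, and the field names below (`gender`, `school_origin`, `study_program`, `ip_sem1`, `graduation`, `student_id`) are hypothetical, since the source does not list the raw column names.

```python
# Hypothetical attribute names; the source does not give the raw column names.
KEPT_ATTRIBUTES = ["gender", "school_origin", "study_program", "ip_sem1", "graduation"]


def clean_records(records, kept=KEPT_ATTRIBUTES):
    """Drop unused attributes, then drop records that have a missing value."""
    cleaned = []
    for record in records:
        # Keep only the attributes selected for mining.
        row = {name: record.get(name) for name in kept}
        # Treat None and blank strings as missing values.
        if any(value is None or str(value).strip() == "" for value in row.values()):
            continue
        cleaned.append(row)
    return cleaned


# Two made-up records: the second has a blank school of origin and is removed.
raw = [
    {"student_id": "A1", "gender": "F", "school_origin": "SMA",
     "study_program": "SI", "ip_sem1": "3.10", "graduation": "timely"},
    {"student_id": "A2", "gender": "M", "school_origin": "",
     "study_program": "TI", "ip_sem1": "2.75", "graduation": "No"},
]
cleaned = clean_records(raw)
```

After this call, `cleaned` holds only the first record, without the unused `student_id` attribute.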

b. Data transformation
In the data transformation phase, the data types of the gender, study program, and IP attributes are changed. Once the transformation is done, the last preprocessing step is to convert the dataset from an Excel file to CSV or ARFF format so that WEKA can recognize it as a data source [6]. Before it is saved as an ARFF file, however, the dataset is still a single mixed dataset, so it must be split into two: data to be used as training data and data to be used as testing data [7].
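
To illustrate the target format, the following minimal sketch renders a few records as ARFF text, the plain-text format WEKA reads (a `@relation` line, one `@attribute` line per column, then `@data` rows). The relation name, attribute names, and sample values are assumptions made for the example; the source does not say how the conversion was carried out.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    `attributes` is a list of (name, spec) pairs, where spec is either the
    string "numeric" or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, spec in attributes:
        if spec == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attributes list their allowed values in braces.
            lines.append(f"@attribute {name} {{{','.join(spec)}}}")
    lines += ["", "@data"]
    for row in rows:
        # One comma-separated line per record, in attribute order.
        lines.append(",".join(str(row[name]) for name, _ in attributes))
    return "\n".join(lines) + "\n"


# Hypothetical attributes and one made-up record.
attributes = [
    ("gender", ["F", "M"]),
    ("ip_sem1", "numeric"),
    ("graduation", ["timely", "No"]),
]
rows = [{"gender": "F", "ip_sem1": 3.1, "graduation": "timely"}]
arff_text = to_arff("graduation", attributes, rows)
```

Writing `arff_text` to a file with an `.arff` extension would give a file of the kind WEKA loads as a data source.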

Dataset Distributions
After the preprocessing stages, the resulting dataset contains 1,746 records.
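
A split into training and testing data can be sketched as below. The source does not state the split ratio or how records were assigned to each part, so the `train_fraction` and `seed` parameters are assumptions chosen only to make the example reproducible.

```python
import random


def split_dataset(records, train_fraction=0.5, seed=42):
    """Shuffle the records and split them into (training, testing) lists.

    train_fraction and seed are illustrative assumptions, not values
    reported in the study.
    """
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten stand-in record ids split 70/30.
train, test = split_dataset(range(10), train_fraction=0.7)
```

Every record ends up in exactly one of the two lists, so no testing record is also seen during training.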

Application of K-NN Method
After the dataset has been divided, it is mined and tested using the chosen algorithm, which in this study is K-Nearest Neighbor (K-NN) [8] [9]. The mining process implements the K-NN method using the intelligent system provided by the WEKA tool. Tests are conducted on each of the datasets described in the previous section [10] [11].

a. K-Fold Cross-Validation
This process finds the optimal value of k using K-Fold Cross-Validation. The number of folds used is 10. The value of k selected for use is k = 1.
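
As a rough illustration of 10-fold cross-validation, the sketch below partitions record indices into folds and averages a per-fold score. It is not WEKA's implementation: WEKA's own fold construction may differ (for example, it can stratify folds by class), and `score_fold` is a hypothetical callback standing in for "train K-NN on the training indices and return the accuracy on the test indices".

```python
def k_fold_indices(n_records, n_folds=10):
    """Yield (train_idx, test_idx) pairs; each record is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_records // n_folds + (1 if i < n_records % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_records))
        yield train_idx, test_idx
        start += size


def cross_validated_accuracy(n_records, score_fold, n_folds=10):
    """Average the accuracy returned by score_fold(train_idx, test_idx)."""
    scores = [score_fold(tr, te) for tr, te in k_fold_indices(n_records, n_folds)]
    return sum(scores) / len(scores)


# For 667 records and 10 folds: seven folds of 67 records, three of 66.
fold_sizes = [len(test_idx) for _, test_idx in k_fold_indices(667, 10)]
```

To choose k, `cross_validated_accuracy` would be run once per candidate value and the value with the highest average accuracy kept.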
b. Accuracy Test

1) Testing Data Training
Testing on the training data gave the highest accuracy at k = 1, namely 88.16%: of the 667 records, 588 were predicted correctly and 79 were predicted incorrectly.

2) Testing Data Testing
Testing on the testing data gave an accuracy of 89.04%: of the 867 records, 772 were predicted correctly and 95 were predicted incorrectly.

Value k = 1, Detailed Accuracy By Class:

The steps for calculating the K-Nearest Neighbor (K-NN) method are as follows:
1) Specify the parameter K.
2) Calculate the distance between the record to be evaluated and every training record.
3) Sort the resulting distances.
4) Select the K records with the smallest distances.
5) Pair each selected record with its class.
6) Count the classes among the nearest neighbors and assign the majority class to the record being evaluated.

The calculation computes the Euclidean distance from each training record to the testing record. After the Euclidean distances are calculated, K is specified, here K = 1, and the nearest record is taken. The classification result is then determined by rank order: the 1 record with the smallest Euclidean distance is retrieved [12].
The distances to all 667 training records are calculated, ranked, and sorted from smallest to largest. Once sorted, the classification result is read from the first-ranked record, that is, the record with the smallest distance.
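
The six steps listed above can be sketched as a small Python function. This is a minimal illustration rather than the WEKA implementation used in the study; the two numeric features and the four training records are made up, with class labels "timely" and "No" borrowed from the text.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=1):
    """Classify `query` by majority vote among its k nearest training records."""
    # Step 2: distance from the query to every training record.
    distances = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 3: sort from the smallest distance to the largest.
    distances.sort(key=lambda pair: pair[0])
    # Steps 4-5: take the k nearest records together with their classes.
    nearest_labels = [label for _, label in distances[:k]]
    # Step 6: the most frequent class among the neighbors is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up training data: two semester IP values per student.
train_X = [(3.5, 3.4), (3.2, 3.6), (2.0, 2.1), (1.8, 2.4)]
train_y = ["timely", "timely", "No", "No"]

prediction_high = knn_predict(train_X, train_y, (3.3, 3.3), k=1)
prediction_low = knn_predict(train_X, train_y, (2.1, 2.0), k=1)
```

With k = 1 the vote reduces to copying the class of the single closest record, which matches the first-rank rule described above.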
The researchers' manual calculation produced 772 records with the label "timely" and 95 records with the label "No". The resulting accuracy is calculated as follows:
Accuracy percentage = (Total correct predictions / Total data) x 100% = (772 / 867) x 100% = 89.04%
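
The accuracy formula can be checked directly with the record counts reported in this section. The function name below is an illustrative choice; the counts (588 of 667 for the training data, 772 of 867 for the testing data) are taken from the text.

```python
def accuracy_percent(correct, total):
    """Accuracy as a percentage, rounded to two decimal places."""
    return round(correct / total * 100, 2)


training_accuracy = accuracy_percent(588, 667)  # 88.16
testing_accuracy = accuracy_percent(772, 867)   # 89.04
```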

Accuracy Results
From the three dataset tests carried out in the previous section, the highest accuracy results obtained are as follows:

Result Analysis
This research used a graduation dataset consisting of training data and testing data. Applying the optimal k to both datasets gave an accuracy of 88.16% on the training data with k = 1, while testing on the testing data with k = 1 gave an accuracy of 89.04%.
From the research carried out by the authors, the following can also be observed:
a. The value of k affects the accuracy of each dataset test. This is shown by the accuracy obtained in tests 1 to 3, which used k values of 1, 3, and 5 respectively. On the training data, the highest accuracy, 88.16%, was obtained with k = 1, while the lowest accuracy was obtained with k = 5. It can be concluded that, for this dataset, the smallest k value gave the highest accuracy.
b. Testing on the training data of 667 records gave the highest accuracy, 88.16%, in the first test with k = 1, meaning that 588 of the 667 records were predicted correctly.
c. Testing on the testing data of 867 records gave an accuracy of 89.04%, meaning that 772 of the 867 records were predicted correctly.

Suggestions
a. Try other algorithms, such as Naive Bayes or Neural Networks.
b. Compare more algorithms in order to find the best level of accuracy, which can later serve as a reference for developing a system to predict the graduation rate of students at Stmik Amikom Purwokerto.
c. Use more attributes and more records in the data mining process.
d. Carry out data cleanup with a high level of care and thoroughness so that no noise remains.