IMPLEMENTATION OF K-NEAREST NEIGHBOUR (KNN) ALGORITHM TO PREDICT STUDENT'S PERFORMANCE

ABSTRACT
One of the elements of an accreditation assessment is the timeliness of student graduation. The existence of non-active students will certainly affect the timeliness of graduation, so prediction of student performance is needed to prevent students from becoming non-active. The KNN algorithm was used to predict student performance with a classification method, and this research optimizes the KNN algorithm for that task. The research, conducted using data from the Department of Informatics Engineering, Politeknik Harapan Bersama, concludes that the best values of K are 3, 6, and 9. This result was obtained by trying values of K from 3 to 60; the predicted values were then compared, and the K values with the smallest percentage of incorrect predictions are the best.

Keywords: KNN; optimization; student performance.


INTRODUCTION
Each department tries to improve its quality of education and accreditation. One of the elements of an accreditation assessment is the timeliness of student graduation [1]. The more students who graduate on time, the better the accreditation value. The existence of non-active students will certainly affect the timeliness of graduation: the more non-active students there are, the more students will graduate late. Thus, a higher number of non-active students can lower the accreditation value of a study program.
Prediction of student performance is needed to prevent students from becoming non-active. Research on predicting student performance has been conducted several times, including prediction of student activity using the KNN algorithm [2]. Similar studies have also been conducted: research predicting students' graduation using the KNN algorithm [3]; in addition to the KNN algorithm, a Fuzzy Inference System (FIS) to predict student activity [4]; the Random Forest algorithm to predict the length of student study [5]; and the C4.5 decision tree algorithm to predict potentially non-active students [6] and to predict students' study period [7]. Other similar studies include research on academic performance using decision tree techniques [8], predicting student performance using data mining techniques [9], and estimating student performance using the Weka environment [10].
The KNN algorithm is perhaps one of the simplest machine learning algorithms, yet it is still widely used [11]. The KNN algorithm depends heavily on the number of nearest neighbors, K, to get good predicted results. Several studies have been conducted to optimize K in the KNN algorithm: learning K in the KNN algorithm to make predictions [12], modified K-Nearest Neighbor classification optimized using a genetic algorithm [13], and optimization of the K parameter in the K-Nearest Neighbor algorithm for classification of diabetes mellitus [14]. This research optimizes the KNN algorithm to predict student performance with a classification method. Classification methods are widely applied in many sciences, such as health sciences [15], science education [16][17], building science [18], and others. The focus of this research is to optimize K in the KNN algorithm to predict (classify) student performance.

METHODS
The KNN algorithm was used to predict student performance with a classification method. Nearest-neighbor classifiers are defined by their characteristic of classifying unlabeled examples by assigning them the class of the most similar labeled examples. Despite the simplicity of this idea, nearest-neighbor methods are extremely powerful [11]. The steps of the research process are shown in Figure 1.
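As an illustration of the nearest-neighbor idea, the sketch below implements a minimal KNN classifier in Python (the study itself used R Studio); the feature vectors and labels here are made up for demonstration only.

```python
import math
from collections import Counter

def euclidean(a, b):
    # straight-line distance between two numeric feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k):
    # rank every labeled example by its distance to the query point
    ranked = sorted(zip(train_x, train_y), key=lambda p: euclidean(p[0], query))
    # majority vote among the k closest labels
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# toy data: two numeric attributes, labels "A" (active) / "N" (non-active)
train_x = [(3.2, 3.1), (3.5, 3.4), (1.9, 2.0), (2.1, 1.8), (3.0, 3.3)]
train_y = ["A", "A", "N", "N", "A"]
print(knn_predict(train_x, train_y, (3.1, 3.2), 3))  # -> A
```

The query point sits closest to three "A" examples, so a 3-neighbor vote labels it active.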

Collecting Data
We used the student academic data of the Department of Informatics Engineering, Politeknik Harapan Bersama. The data consisted of 1,530 rows with 7 numeric attributes: grade point, grade point average, hometown, type of school, major at school, parent's job, and student performance. Student performance is coded as "A" to indicate active or "N" to indicate non-active.

Exploring and Preparing Data
Data exploration and preparation was done to inspect the dataset to be used; if any data was inappropriate, it was corrected. Exploration was done using the str command in R Studio. The check shows the dataset is structured with 1,530 rows and 7 attributes, as expected. The first few lines of the output are shown in Figure 2.
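An equivalent structural check in Python (the study used R's str command) might look like the sketch below; the column names and the two sample rows are hypothetical stand-ins for the real dataset.

```python
import csv, io

# hypothetical sample in the same shape as the dataset described above
sample = io.StringIO(
    "gp,gpa,hometown,school_type,major,parent_job,performance\n"
    "3.1,3.2,1,2,1,3,A\n"
    "2.0,2.1,2,1,2,1,N\n"
)
rows = list(csv.DictReader(sample))
print(len(rows), "rows,", len(rows[0]), "attributes")   # row and column counts
print(sorted({r["performance"] for r in rows}))          # class labels found
```

On the real data, the same counts would confirm the expected 1,530 rows and 7 attributes.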

Training Model on The Data
In this step, the prepared dataset is used for classification. For the KNN algorithm, this stage does not build a model; the training process only involves storing the input data in a structured format. Training was done with K = 39, because one common practice is to begin with K equal to the square root of the number of training examples. This stage produced one model.
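The square-root rule of thumb can be checked directly; applied to the 1,530 rows described earlier, it gives the K = 39 used here.

```python
import math

n = 1530  # number of rows in the dataset described above
k = round(math.sqrt(n))
print(k)  # the square root of 1530 is about 39.1, so K = 39
```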

Evaluating Model Performance
The next step is to evaluate how well the model predicted. To do this, I used the CrossTable function in the gmodels package of R Studio. After loading the package, I created a cross-tabulation showing the agreement between two output vectors: the labels and the predictions. The cross-tabulation is shown in Figure 4: the number of false negatives is 0 and false positives is 9, so 9 predictions (1.8% of the 500 test cases) were classified incorrectly.
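The same tallies that CrossTable reports can be sketched in Python as below; the label vectors are hypothetical, and treating "N" (non-active) as the positive class is an assumption, since the paper does not define it explicitly.

```python
from collections import Counter

# hypothetical true labels and predictions ("A" = active, "N" = non-active)
actual    = ["A", "A", "N", "A", "N", "A"]
predicted = ["A", "N", "N", "A", "N", "A"]

# cross-tabulation of (actual, predicted) pairs, like gmodels::CrossTable in R
table = Counter(zip(actual, predicted))
false_neg = table[("N", "A")]  # non-active student predicted as active
false_pos = table[("A", "N")]  # active student predicted as non-active
print(f"false negatives: {false_neg}, false positives: {false_pos}")
```

Dividing the total miscounts by the number of test cases gives the percentage predicted incorrectly.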

Improving Model Performance
In this step, I attempted to improve the model by trying several different values of K. By trying different values of K, the best model is expected to be found. The same 500 labels were classified using different K values, and the numbers of false negatives and false positives were recorded for each iteration.
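The K sweep described here can be sketched as follows; this reuses a toy KNN classifier and made-up data rather than the study's 500 test labels, so the printed error rates are illustrative only.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    # classify a query by majority vote among its k nearest labeled examples
    ranked = sorted(train, key=lambda p: math.dist(p[0], query))
    return Counter(lbl for _, lbl in ranked[:k]).most_common(1)[0][0]

# toy labeled data; the study instead re-classified the same 500 test labels
train = [((3.2, 3.1), "A"), ((3.5, 3.4), "A"), ((1.9, 2.0), "N"),
         ((2.1, 1.8), "N"), ((3.0, 3.3), "A"), ((1.7, 2.2), "N")]
tests = [((3.1, 3.2), "A"), ((2.0, 1.9), "N")]

for k in (3, 6, 9):  # compare the error rate for each candidate K
    wrong = sum(knn_predict(train, x, k) != y for x, y in tests)
    print(f"K={k}: {100 * wrong / len(tests):.1f}% incorrect")
```

The K with the smallest percentage of incorrect predictions is kept as the best.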

RESULT AND DISCUSSION
The result of the research is the percentage of incorrectly predicted student performance. Table 1 shows the results with the attributes: K value, false negatives, false positives, and percentage predicted incorrectly. Twenty different K values were used to compare the percentage of incorrect predictions. The best values of K for making predictions are 3, 6, and 9, with 0% predicted incorrectly. Figure 5 shows that the greater the value of K, the greater the percentage of incorrect predictions, and the smaller the value of K, the smaller that percentage. Although the percentage of incorrect predictions grows with K, the differences are small, only fractions of a percent.
Figure 6 shows the relation between K values and false negatives and false positives. Whatever the value of K, the false negative count is 0, which indicates that the accuracy of predicting non-active students is very high. The greater the value of K, the higher the false positive count, and vice versa: as K grows, the accuracy of the prediction decreases.

CONCLUSIONS
The research, conducted using data of the Department of Informatics Engineering, Politeknik Harapan Bersama, concludes that the best values of K are 3, 6, and 9. This result was obtained by trying values of K from 3 to 60. The predicted values were then compared; the K values with the smallest percentage of incorrect predictions are the best.

Figure 2. Dataset structure. I then transformed the data by normalizing the numerical attributes to put them on the same scale. The normalization uses equation (1); the output is shown in Figure 3.
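Equation (1) itself is not reproduced in this excerpt; a common normalization for KNN preprocessing, and plausibly the one intended, is min-max normalization, x' = (x − min(x)) / (max(x) − min(x)), sketched below with a hypothetical column.

```python
def normalize(values):
    # min-max normalization: rescales a numeric column to the [0, 1] range
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

grade_points = [2.0, 2.5, 3.0, 4.0]  # hypothetical attribute column
print(normalize(grade_points))  # -> [0.0, 0.25, 0.5, 1.0]
```

Rescaling matters for KNN because distances would otherwise be dominated by attributes with larger numeric ranges.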

Figure 3. Dataset structure after normalizing. After the data was on the same scale, the next step was preparing the training and test datasets. I used about 70% of the data for training and about 30% for testing.
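A 70/30 split like the one described can be sketched as follows; the row indices stand in for the real records, and the random seed is an arbitrary choice for reproducibility.

```python
import random

random.seed(42)               # reproducible shuffle; seed choice is arbitrary
rows = list(range(1530))      # stand-ins for the 1,530 dataset rows
random.shuffle(rows)          # shuffle before splitting to avoid ordering bias
cut = int(0.7 * len(rows))    # about 70% for training, 30% for testing
train, test = rows[:cut], rows[cut:]
print(len(train), len(test))  # -> 1071 459
```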