Analysis of Data Mining Using K-Means Clustering Algorithm for Product Grouping

Rizki Barokah Store sells a variety of daily necessities such as food, drinks, snacks, and toiletries. A recurring problem at the store is the build-up of product stock, which results in products expiring before they are sold. This is caused by errors in product stocking decisions. In addition, although a large amount of sales data is stored in the database, the store has not mined or grouped the data to learn the potential of each product, even though such processing is already possible with data mining techniques. To overcome this problem, this research applies data mining with the clustering method using the K-means algorithm. The purpose of the research is to group products into those in high demand and those in low demand, to advise on product stock, and to identify which products are in high and low demand.


Introduction
The Rizki Barokah Store is located at Jalan Brigadir 17 No. 47 RT 001 RW 001, Rempoah Village, Baturraden District. The store sells various daily necessities such as food, beverages, snacks, and toiletries. It was established in February 2018. Because the store is classified as a large store, the longer it operates, the more data it accumulates. Based on observation, the number of products sold was 1127 in November, 1812 in December, and 2075 in January. However, the sales data stored in the database is still raw data that has not yet produced useful information.
A common problem in the store's sales activities is the accumulation of expired product stock. The store is supplied with stock by distributors and other stores. Stock accumulates because some expired products cannot be returned to the distributor. Mistakes in determining product stock are detrimental to the store because expired items generate no profit. According to store data, 426 products expired between August 2018 and March 2019. At present, the Rizki Barokah Store does not use its sales data to explore the potential of its products, even though such processing is already possible with data mining techniques [1] [2]. Data mining is a process for extracting useful and previously unknown information from a large body of data [3]. One method that can be used in the business world is data mining with a clustering technique using the K-means algorithm [4] [5].
The method used in this research is data mining with a clustering technique using the K-means algorithm. Products are grouped into the most desirable and the less desirable in order to determine product stock at the Rizki Barokah Store. K-means is an iterative grouping algorithm that partitions a data set into a preset number of k clusters [6]. The algorithm is simple to implement and run, comparatively fast, adaptable, and commonly used in practice. Historically, K-means became one of the most important algorithms in data mining [7].

Stock
According to [8], stock of goods covers goods owned by an organization to be sold within a certain business period, goods still in the production process, and raw materials awaiting use in production. Similarly, inventory is defined as goods acquired by a company for resale or further processing in order to carry out the company's activities [9] [10]. A company that controls its inventory system precisely is better able to sustain its operational activities and keep operations running smoothly. Inventory is therefore important, because successful planning and supervision of supplies has a significant impact on the success of a company, including the determination of its profit [11].

Data Mining
Data mining is a process that employs one or more machine learning techniques to analyze data and extract knowledge automatically [14]. Data mining can be defined as induction-based learning, in which definitions of general concepts are formed by observing specific examples of the concepts to be learned. Knowledge Discovery in Databases (KDD) is the application of scientific methods to data mining [12]. In this context, data mining is one step of KDD. The data mining process within KDD is shown in Figure 1 as follows:

Clustering
Clustering is also known as segmentation. This method identifies natural groups of cases based on a set of attributes, grouping together data with similar attributes [13][15].

K-Means Algorithm
The K-means algorithm first sets k cluster values at random. For the time being, these values become the centers of the clusters, commonly called centroids or "means." The distance from each data item to each centroid is then calculated using the Euclidean formula to find the centroid closest to each item, and each item is classified according to its proximity to a centroid. The process repeats until the centroid values no longer change [15]. The steps of clustering with the K-means method [12] are as follows:
a. Select the number of clusters k.
b. Initialize the k cluster centers. This can be done in various ways, but the most common is random initialization: the cluster centers are given initial values using random numbers.
c. Assign each record or object to the nearest cluster. The proximity of two objects is determined by the distance between them. Similarly, the proximity of a record to a particular cluster is determined by the distance between the record and the cluster center. The smallest distance between a record and a cluster center determines which cluster the record joins. The distance from each data item to each cluster center can be calculated using the Euclidean distance formulated in Figure 1.2.
d. Recalculate each cluster center as the average of the objects currently assigned to that cluster.
e. Reassign each object using the new cluster centers. If the cluster centers no longer change, the clustering process is complete. Otherwise, go back to step c until the cluster centers no longer change.
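The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this research: the function name `kmeans`, its parameters, and the sample points are assumptions introduced for the example, and only the standard library is used.

```python
import math
import random


def kmeans(points, k, initial_centers=None, max_iter=100, seed=0):
    """Group points (tuples of numbers) into k clusters; returns (centers, labels)."""
    rng = random.Random(seed)
    # Initialise the centers: randomly chosen data points unless the caller
    # supplies them (hypothetical convenience parameter).
    centers = [tuple(c) for c in (initial_centers or rng.sample(points, k))]
    labels = []
    for _ in range(max_iter):
        # Assign every point to the center with the smallest Euclidean distance.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Recompute each center as the mean of the points assigned to it.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        # Stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Hypothetical sample: two well-separated groups of (sold, stock) pairs.
sample = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, labels = kmeans(sample, k=2, initial_centers=[(1, 1), (8, 8)])
print(labels)
```

Passing `initial_centers` mirrors a manual calculation in which specific data items are chosen as the starting centers; leaving it out falls back to random initialization.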

Research Methods
The research methods used are as follows:

b. Data Collection Methods
The data collection methods used in this research are interviews, documentation, and observation, which were used to explore the existing problems. A total of 1397 data records were obtained for November 2018 through January 2019.

c. Research concept
The stages of this research are problem identification, data collection, preprocessing, application of the clustering method, and drawing conclusions from the results obtained. The research concept is shown in Figure 3 below:

Fig 3. Research Concept
The research concept in Figure 3 is explained as follows. The initial step is problem identification, which establishes the existing problems and the method used in this research, namely clustering with the K-means algorithm. Data is then collected over three months, amounting to 1397 records. The next step is preprocessing, which handles duplicate data and selects the attributes; the attributes used are item name, amount sold, and stock. The data is then processed using the clustering method with the K-means algorithm, and conclusions are drawn from the results obtained.

Problem Identification
In this study, a literature review was conducted on work related to product grouping in order to select an appropriate algorithm. Based on the results of this identification, the algorithm used is K-means clustering, which groups products into more desirable and less desirable products in order to advise the Rizki Barokah Store in determining product stock.

Data Collection
The data used in this research was obtained from the Rizki Barokah Store. It consists of stock data, which records the amount of remaining stock for each item, and product sales data from November 2018 to January 2019, which records the number of each product sold per month.

Preprocessing
This stage handles duplicated data and selects the attributes. The attributes used are item name, number sold, and stock. The final dataset resulting from this stage consists of 931 items with 3 attributes.
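A preprocessing step of this kind might look like the sketch below. How duplicates were actually resolved is not stated here, so the rule used (keep the first record for each item name) is an assumption, as are the field names `item_name`, `sold`, `stock`, and `price` and the sample rows.

```python
def preprocess(rows):
    """Keep item name, number sold, and stock; drop repeated item names."""
    seen = set()
    dataset = []
    for row in rows:
        # Normalise the name so "Soap" and "soap " count as the same item.
        key = row["item_name"].strip().lower()
        if key in seen:
            continue  # assumed rule: the first record for an item wins
        seen.add(key)
        # Attribute selection: only the three attributes used for clustering.
        dataset.append(
            {
                "item_name": row["item_name"].strip(),
                "sold": int(row["sold"]),
                "stock": int(row["stock"]),
            }
        )
    return dataset


# Hypothetical raw records with one duplicate and one unused attribute.
raw = [
    {"item_name": "Soap", "sold": "5", "stock": "12", "price": "3000"},
    {"item_name": "soap ", "sold": "5", "stock": "12", "price": "3000"},
    {"item_name": "Tea", "sold": "9", "stock": "4", "price": "7000"},
]
print(preprocess(raw))
```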

Use of Clustering Methods
Once the preprocessing stage is complete, the data is processed using the clustering method with the K-means algorithm. The following is a manual test on 10 sample data items, listed in Table 1, to show the product grouping produced by the K-means algorithm. The calculation steps are as follows:
a. Specify the number of clusters and then specify the initial cluster centers. In this research, the author selects 2 clusters, with the 2nd and 4th data items as the cluster centers, namely:
b. Once the cluster centers have been selected, the next step is to calculate the distances to the cluster centers in the 1st iteration.

1) Calculating the distances to the cluster centers
The distance from each data item to the cluster centers is computed using the Euclidean distance formula in Figure 4. The results of the calculation can be seen in Table 2:
2) Grouping the Data
The next step is to group the data based on the calculation in Table 2 by placing each object (data item) into a cluster (group) according to the minimum distance. The result of the data grouping in the first iteration is shown in Table 3:
Based on the results in Table 3, a value of 1 marks the cluster at the closest distance, found by taking the minimum of the distances to C1 and C2.
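For reference, the distance and grouping rule used in this step can be written out for the two numeric attributes (number sold and stock). The notation is an assumption introduced here: $x_i = (x_{i1}, x_{i2})$ denotes the $i$-th data item and $c_j = (c_{j1}, c_{j2})$ the center of cluster $C_j$.

```latex
d(x_i, c_j) = \sqrt{(x_{i1} - c_{j1})^2 + (x_{i2} - c_{j2})^2},
\qquad
\operatorname{cluster}(x_i) = \arg\min_{j \in \{1, 2\}} d(x_i, c_j)
```

Each data item is marked with a 1 in the column of the cluster that attains this minimum.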

3) Calculating the new cluster center
The next step is to calculate the new cluster centers by averaging the values of the data assigned to each cluster. The first cluster center (C1) is calculated from these averages, one per attribute, as follows:
C1 = (2 + 9 + 2 + 3 + 2 + 1)/6 = 19/6 ≈ 3.17 (truncated to 3.1)
C1 = (4 + 5 + 7 + 4 + 8 + 10)/6 = 38/6 ≈ 6.3
The second cluster center (C2), also calculated from the averages, is as follows:
1) The next step is to calculate the distances to the cluster centers in the 2nd iteration. The cluster centers are taken from the new cluster centers calculated in the 1st iteration.
2) The distance from each data item to the cluster centers is calculated using the Euclidean distance formula, just as in the first step. The calculation results can be seen in Table 4:
3) Grouping the Data
The next step is to group the data based on the calculation in Table 4 by placing each object (data item) into a cluster (group) according to the minimum distance. The result of the data grouping in the 2nd iteration is shown in Table 5:
1) The next step is to calculate the distances to the cluster centers in the 3rd iteration. The cluster centers are taken from the new cluster centers calculated in the 2nd iteration.
2) The distance from each data item to the cluster centers is calculated using the Euclidean distance formula, just as in the first step. The calculation results can be seen in Table 6:

3) Grouping the Data
The next step is to group the data based on the calculation in Table 6 by placing each object (data item) into a cluster (group) according to the minimum distance. The result of the data grouping in the 3rd iteration is shown in Table 7:
Based on the results in Table 7, a value of 1 marks the cluster at the closest distance, found by taking the minimum of the distances to C1 and C2. Because the groupings in the 2nd and 3rd iterations are identical, the iteration process stops at the 3rd iteration.
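The center update and the stopping rule used in the iterations above can be checked with a short Python sketch. The helper names `cluster_center` and `converged` are introduced for illustration only. The two value lists are the C1 sums from the first iteration; pairing them position by position into data items is an assumption, which does not affect the averages.

```python
def cluster_center(members):
    """Return the per-attribute average of the data items in one cluster."""
    return tuple(sum(values) / len(members) for values in zip(*members))


def converged(previous_grouping, current_grouping):
    """Iteration stops when two consecutive groupings are identical."""
    return previous_grouping == current_grouping


# Values summed for C1 in the first iteration, one list per attribute.
first_attribute = [2, 9, 2, 3, 2, 1]
second_attribute = [4, 5, 7, 4, 8, 10]
members = list(zip(first_attribute, second_attribute))

c1 = cluster_center(members)
print(tuple(round(value, 2) for value in c1))  # (3.17, 6.33)
```

The same `cluster_center` call applies to C2 once its members are known, and `converged` expresses the comparison of the 2nd and 3rd groupings that ends the process.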
Based on the calculation results, the product grouping at the Rizki Barokah Store for the 10-item sample consists of 2 clusters, namely cluster1 and cluster2. Cluster1 contains 4 less popular products, and cluster2 contains the 6 products most in demand. The results of the K-means clustering are shown in Table 8:

Conclusions
Based on the results of this research, it can be concluded that:
a. Clustering with the K-means algorithm can be used for product grouping at the Rizki Barokah Store, and the grouping can in turn be used in determining product stock.
b. Based on the calculations, 6 of the 10 sampled products are in demand and 4 are less desirable.
c. By knowing the most desirable and less desirable products, the Rizki Barokah Store can determine product stock by prioritizing purchases of the most in-demand products and reducing purchases of less popular products, thereby reducing the build-up of product stock.