Improved C45 performance with gain ratio for credit approval dataset

People's shopping behavior has changed considerably since the COVID-19 pandemic. Many people have switched to online marketplaces for buying and selling. Payment in a marketplace is relatively easy, especially with a credit card. Financial providers must handle the resulting increase in demand for credit carefully to minimize bad loans, and the most direct way to do so is to be more selective when approving credit customers. Data mining studies historical data to extract new knowledge, and classifying bad credit customers is one of its common applications. One algorithm that performs well on credit approval datasets is C45. C45 is widely used because it outputs a decision tree that is easy for people to interpret. The number of attributes in the data can affect the performance of the algorithm. Feature selection reduces the number of attributes to improve data quality and the performance of the classification algorithm. Gain ratio is a refinement of information gain and is a feature selection method widely used by researchers. This study classifies credit approval data with C45 and uses the gain ratio to select features. With gain ratio feature selection, the accuracy of the C45 classification algorithm increased from 94.12% to 95.29%.


Introduction
The COVID-19 pandemic gave rise to many new behaviors in human life. Social restrictions changed not only social behavior but also economic behavior, and people's spending habits in particular. These restrictions led to an increase in online buying and selling transactions [1]. The growth in online transactions during 2020 affected the trends in credit providers' data, so banks that provide credit must find new ways to minimize bad loans. Banks can use more than one model to decide whether to approve credit for a customer [2]. In addition to traditional methods, data mining is widely used for credit approval classification [3].
Data mining is the study of data to generate new knowledge, and it is widely used for classification [4] [5]. Many algorithms can be used for classification. One of the best is C45, which produces a decision tree as its output [6]. C45 has been shown to handle both numeric and nominal attributes [7]. One advantage of the C45 output is that people can understand it easily [3]. During its calculation, the C45 algorithm uses the gain value to measure the importance of each data attribute. The attribute with the highest gain value is used for the first node, and the process continues with the remaining attributes for the other nodes until all attributes are used up.
Credit approval classification using the C45 algorithm has been studied before [3]. That study built a decision support system with an accuracy rate of 94.12%. Other studies have sought to improve the accuracy of credit approval classification using the information gain method [8]. Information gain itself has since been improved in several ways. One of the most prominent improvements is the information gain ratio method [9], in which the information gain value is divided by the split information value. The information gain ratio has been shown to improve the performance of classification algorithms [10].
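The division described above can be written out explicitly. The formulation below is the standard textbook one, not taken from the original study; the symbols S (the set of records), A (an attribute), S_v (the records where A takes value v) and p_c (the proportion of records in class c) are notation assumed here for illustration.

```latex
\begin{align}
H(S) &= -\sum_{c} p_c \log_2 p_c \\
\mathrm{Gain}(S, A) &= H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v) \\
\mathrm{SplitInfo}(S, A) &= -\sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \\
\mathrm{GainRatio}(S, A) &= \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}
\end{align}
```

Dividing by the split information penalizes attributes that split the data into many small groups, which plain information gain tends to favor.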
This study uses the information gain ratio method to select features from credit customer data. The data is credit card customer data with 14 regular attributes and 1 label attribute, and it contains 766 records. After the information gain ratio is calculated, the threshold value is set to 0.21. Classification is carried out using the C45 algorithm, validated with 10-fold cross-validation and evaluated with a confusion matrix. With the information gain ratio, classification reached an accuracy rate of 95.29%. Without it, the accuracy rate is 94.12%. The information gain ratio therefore improves the classification accuracy of the C45 algorithm by 1.17 percentage points.

Related Research
Research on credit approval has been carried out with various results. A previous study [8] used the credit approval dataset with information gain feature selection and the K-NN classification algorithm. Its best result, obtained with K-NN and information gain, was an accuracy rate of 94.78%. A later development of information gain divides it by the split information and is known as the gain ratio. Classification research using C45 is also common. One example is the classification of a COVID-19 surveillance dataset [7]. The advantage of C45 is that its output is a decision tree that people can easily understand. C45 is also one of the best classification algorithms and is widely recommended by international researchers [11].

Data Mining
Data mining is a science that focuses on existing computerized datasets or records [12]. The current use of digital media keeps adding to the available digital data. Data that carries no meaning only becomes digital waste that fills storage media. Data mining helps process such data into new knowledge. Its main functions include estimation, prediction, clustering, association and classification. Classification is one of the most widely used functions because it can handle numeric and nominal data. Various algorithms are the mainstay of the classification process. One of the most popular, with proven good performance, is C45.

Research Methods
This study uses an experimental method. The experiment was carried out on an existing dataset, with RapidMiner as the calculation tool. Figure 1 shows the research method. The stages of the research are as follows:

1. Data collection
The data is a credit approval dataset obtained from a bank. It is private data on credit card usage at one bank. The dataset has 766 records, 14 regular attributes and 1 label attribute. Table 1 shows the metadata of the credit approval dataset.

2. Feature selection
The first step after data collection is feature selection. Feature selection in this study uses the gain ratio [9], which has been shown to improve the performance of classification algorithms [10]. Feature selection determines how strongly each attribute influences the classification, according to the gain ratio value of that attribute. Attributes that are considered to have no effect, or that have a low gain ratio, are set aside and not used in the classification process. The success of feature selection using the gain ratio also depends on the threshold used.
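The scoring step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the RapidMiner computation used in the study: it assumes nominal attributes (numeric attributes would first need discretizing), and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain ratio of one nominal attribute with respect to the class labels."""
    total = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy that remains after splitting on the attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information: entropy of the attribute's own value distribution.
    split_info = -sum(len(g) / total * math.log2(len(g) / total)
                      for g in groups.values())
    # An attribute with a single value carries no information.
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data, not taken from the credit approval dataset.
labels = ["bad", "bad", "good", "good"]
separating = ["yes", "yes", "no", "no"]   # perfectly separates the classes
uninformative = ["a", "b", "a", "b"]      # says nothing about the class

print(gain_ratio(separating, labels))
print(gain_ratio(uninformative, labels))
```

An attribute that separates the classes perfectly scores high, while one that is unrelated to the label scores zero and would be set aside by any positive threshold.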

3. Validation
Validation is carried out using 10-fold cross-validation. This ensures that every data record is used for both training and testing. Validation is run in the RapidMiner application. Figure 2 shows the process as run in RapidMiner. Figure 3 shows the cross-validation setup in RapidMiner.
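The way 10-fold cross-validation gives every record a turn in the test set can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, and the folds are taken in order without shuffling or stratification, so it does not reproduce RapidMiner's own sampling.

```python
def k_fold_indices(n_records, k=10):
    """Yield (train_idx, test_idx) pairs; each record is tested exactly once."""
    indices = list(range(n_records))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# 766 records, as in the dataset described in this study.
folds = list(k_fold_indices(766, k=10))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

In each round the model is trained on nine folds and tested on the remaining one, and the reported accuracy is aggregated over the ten rounds.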

4. Algorithm calculation and result evaluation
The algorithm calculation is done in RapidMiner, following the process shown in Figure 2. In that figure, a single dataset is duplicated into two copies using the Multiply operator. One copy is classified with C45 using the original dataset. The other copy goes through feature selection before C45 classification. The accuracy of the two results is then compared using a confusion matrix.
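Accuracy is read from a confusion matrix as the share of records on its diagonal. The sketch below illustrates that calculation; the function names and the five example labels are hypothetical and are not results from this study.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Fraction of records whose predicted label equals the actual label."""
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / sum(matrix.values())


# Hypothetical labels for five customers.
actual = ["good", "good", "bad", "bad", "good"]
predicted = ["good", "bad", "bad", "bad", "good"]

matrix = confusion_matrix(actual, predicted)
print(matrix)
print(accuracy(matrix))
```

Running the same calculation on both branches of the process, with and without feature selection, gives the two accuracy figures that are compared.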

Results
The gain ratio is used to determine the importance value of each attribute in the credit approval dataset. Table 2 shows the gain ratio values calculated in RapidMiner. These results show that the tunggakan pokok (principal arrears) attribute has the most influence on the classification, with the highest importance value, while the tipe pinjaman (loan type) attribute has the lowest importance value. Selecting features with the gain ratio improved the performance of the C45 algorithm. With the gain ratio, the accuracy of the C45 algorithm rises to 95.29%, compared with 94.12% without it. This increase of 1.17 percentage points is obtained with a gain ratio threshold of 0.21. Figure 4 shows the performance of the C45 algorithm with gain ratio feature selection.

Discussion
This study uses an experimental method, trying the possible threshold values in turn. Table 2 lists the importance of every attribute according to the gain ratio. From these gain ratio values, a threshold can be chosen for the classification process using C45. The threshold determines which attributes are kept and which are left out of the subsequent C45 classification. Table 3 shows the overall results of the C45 classification for thresholds derived from the gain ratio results, including the increase in accuracy obtained with the gain ratio. The greatest increase in accuracy occurs at a threshold of 0.21, which means that only the best 4 attributes are used in the C45 classification process. The gain ratio does not always improve the performance and accuracy of the algorithm. A threshold of 0.22 or higher can reduce the accuracy of C45.
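The effect of the threshold on the number of attributes kept can be sketched as below. The attribute names and scores are hypothetical placeholders, not the values in Table 2, and treating the threshold as inclusive (a score equal to the threshold is kept) is an assumption.

```python
def select_features(scores, threshold):
    """Keep the attributes whose gain ratio is at least the threshold."""
    return [name for name, score in scores.items() if score >= threshold]


# Hypothetical gain ratio scores, for illustration only.
scores = {
    "attr_a": 0.40,
    "attr_b": 0.30,
    "attr_c": 0.25,
    "attr_d": 0.21,
    "attr_e": 0.10,
}

# Raising the threshold leaves fewer attributes for the classifier.
for threshold in (0.10, 0.21, 0.30):
    kept = select_features(scores, threshold)
    print(threshold, len(kept), kept)
```

Each candidate threshold produces a different attribute subset, and the classifier is retrained and validated on each subset to find the one with the highest accuracy.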

Conclusion
This study shows that adding gain ratio feature selection increases the accuracy of the C45 algorithm on the credit approval dataset. The largest increase in accuracy, 1.17 percentage points, occurs at a threshold value of 0.21.