Improving the Accuracy of the C45 Classification Algorithm Using Information Gain Ratio Feature Selection for Classification of Type 2 Diabetes Mellitus Disease
DOI: https://doi.org/10.32497/jaict.v9i2.5845
Keywords: information gain ratio, decision tree, diabetes type 2
Abstract
Diabetes is a potentially fatal disease. It can lead to heart failure, chronic kidney disease, glaucoma, and several other conditions. WHO data state that there were more than 2 million deaths due to diabetes in 2019. Data from the International Diabetes Federation show that around 537 million adults are living with diabetes. The condition calls for prompt treatment, given that diabetes is one of the deadliest non-communicable diseases in the world. Patient records are mostly collected in hospitals, and large volumes of such data become digital waste if they are put to no further use. In 2020, Sylhet Diabetes Hospital in Sylhet donated patient data for further research. The data contain 520 patient records with 17 attributes that have been validated by specialist doctors. The early stage diabetes risk prediction dataset is released by the UCI repository as public data and can be used for research. Many studies have used this dataset, with a previous best accuracy of 95.96%. Those earlier studies used all attributes in the classification process, yet irrelevant attributes can degrade the performance of a classification algorithm. This study applies the information gain ratio for feature selection on the early stage diabetes risk prediction dataset. The C4.5 algorithm is used for classification, with evaluation by confusion matrix and validation by 10-fold cross-validation. The feature selection improves the performance of C4.5, which reaches an accuracy of 96.15%. The study also produces a decision tree for diabetes.
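The feature-selection step named in the abstract can be illustrated with a minimal sketch. It assumes the standard definition of the gain ratio (information gain divided by split information) and uses only the Python standard library. The function names, the two attribute columns, and the eight toy records are hypothetical and are not taken from the dataset or from the authors' implementation.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by the split information."""
    total = len(labels)
    # Group the class labels by the value the feature takes in each record.
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after splitting on the feature.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information is the entropy of the feature's own value distribution.
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy records (assumption: illustrative values only).
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
polyuria = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]  # separates the classes
itching = ["yes", "no", "yes", "no", "yes", "no", "yes", "no"]   # carries no class information

scores = {
    "polyuria": gain_ratio(polyuria, labels),
    "itching": gain_ratio(itching, labels),
}
# Rank attributes from most to least informative.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Under this kind of ranking, attributes with a low gain ratio would be the candidates for removal before the classifier is trained. The threshold or number of attributes to keep is a separate choice that this sketch does not make.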
License
Authors who publish with this journal agree to the following terms:
Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).