Improving the Accuracy of the C45 Classification Algorithm Using Information Gain Ratio Feature Selection for Classification of Type 2 Diabetes Mellitus Disease

Ivandari Ivandari, Much. Rifqi Maulana, Ichwan Kurniawan, M Adib Al Karomi

Abstract


Diabetes is a potentially fatal disease. It can lead to heart failure, chronic kidney disease, glaucoma, and several other conditions. WHO data state that diabetes caused more than 2 million deaths in 2019. Data from the International Diabetes Federation show that around 537 million adults are living with diabetes. The condition calls for prompt treatment, since diabetes is among the deadliest non-communicable diseases in the world. Most patient records are collected in hospitals, and large volumes of such data become digital waste if they are never put to further use. In 2020, Sylhet Diabetes Hospital in Sylhet, Bangladesh donated patient data for further research. The data comprise 520 patient records with 17 attributes, validated by specialist doctors. The resulting early stage diabetes risk prediction dataset is published by the UCI repository as public data and can be used for research testing. Many studies have used this dataset, with a previous best accuracy of 95.96%. Those earlier studies used all attributes in the classification process, but irrelevant attributes can degrade the performance of a classification algorithm. This study applies the information gain ratio for feature selection on the early stage diabetes risk prediction dataset. The C4.5 algorithm is used for classification, with evaluation by confusion matrix and validation by 10-fold cross-validation. Feature selection improves the performance of C4.5, which reaches an accuracy of 96.15%. The study also produces a decision tree for diabetes.
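
As a rough illustration of the feature selection step described in the abstract, the sketch below ranks categorical attributes by information gain ratio, that is, information gain divided by split information. It is not the authors' implementation. The attribute names (`symptom_a`, `symptom_b`), the toy records, and the helper functions `entropy` and `gain_ratio` are hypothetical and are not taken from the dataset or the paper.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` divided by its split information."""
    total = len(labels)
    # Group the class labels by the value the feature takes.
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    # Split information is the entropy of the feature's own value distribution.
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy records for illustration only (not real patient data).
toy = {
    "symptom_a": ["yes", "yes", "no", "no", "yes", "no"],
    "symptom_b": ["yes", "no", "yes", "no", "yes", "no"],
}
labels = ["positive", "positive", "negative",
          "negative", "positive", "negative"]

# Rank attributes from most to least informative.
ranking = sorted(toy, key=lambda name: gain_ratio(toy[name], labels),
                 reverse=True)
print(ranking)
```

In a workflow like the one the abstract outlines, attributes with a low gain ratio would be dropped before the classifier is trained and validated. The cut-off used to decide which attributes to keep is not stated in the abstract and is left open here.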


Keywords


information gain ratio, decision tree, diabetes type 2



DOI: http://dx.doi.org/10.32497/jaict.v9i2.5845



ISSN: 2541-6340
Online ISSN: 2541-6359


This work is licensed under a Creative Commons Attribution 4.0 International License.