CLASSIFICATION BOOSTING IN IMBALANCED DATA
Abstract
Most existing classification approaches assume the underlying training data set to be evenly distributed across classes. In imbalanced classification, however, the examples of the majority class can far outnumber those of the minority class. This is a problem because it usually produces biased classifiers that have higher predictive accuracy on the majority class but poorer predictive accuracy on the minority class. One popular method recently used to rectify this is SMOTE (Synthetic Minority Over-Sampling Technique), which addresses the imbalance at the data level. Therefore, this paper presents a novel approach for learning from imbalanced data sets, based on a combination of the SMOTE algorithm and the boosting procedure, focusing on a two-class problem. The Bidikmisi data set is imbalanced: the majority class contains 15 times as many examples as the minority class. All models are evaluated using stratified 5-fold cross-validation, and performance criteria such as Recall, F-value and G-mean are examined. The results show that the SMOTE-Boosting algorithm has better classification performance than the AdaBoost.M2 method, with the G-mean value increasing 4-fold after the SMOTE method is used. We can say that the SMOTE-Boosting algorithm is quite successful in exploiting the advantages of boosting together with SMOTE. Whereas boosting affects the accuracy of the random forest by focusing on all data classes, the SMOTE algorithm alters the performance of the random forest only on the minority class.
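As a minimal sketch of the approach described above (not the paper's implementation), the following Python fragment combines SMOTE oversampling with a boosting classifier and evaluates Recall, F-value and G-mean under stratified 5-fold cross-validation. It assumes the scikit-learn and imbalanced-learn libraries; note that scikit-learn's AdaBoostClassifier implements the SAMME variant rather than AdaBoost.M2, and the synthetic 15:1 data set merely mimics the Bidikmisi class ratio rather than the real data.

# A minimal sketch, assuming scikit-learn and imbalanced-learn:
# SMOTE oversampling combined with boosting, scored by stratified 5-fold CV.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import recall_score, f1_score
from imblearn.metrics import geometric_mean_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Synthetic stand-in for the Bidikmisi data: majority class is ~15x larger.
X, y = make_classification(n_samples=3200, n_features=10,
                           weights=[0.9375, 0.0625], random_state=0)

# The pipeline applies SMOTE only when fitting, i.e. on the training folds,
# so each test fold keeps its original imbalanced distribution.
model = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("boost", AdaBoostClassifier(n_estimators=100, random_state=0)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
recalls, fvals, gmeans = [], [], []
for train_idx, test_idx in cv.split(X, y):
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    recalls.append(recall_score(y[test_idx], y_pred))         # minority recall
    fvals.append(f1_score(y[test_idx], y_pred))               # F-value
    gmeans.append(geometric_mean_score(y[test_idx], y_pred))  # sqrt(TPR * TNR)

print("Recall = %.3f, F-value = %.3f, G-mean = %.3f"
      % (np.mean(recalls), np.mean(fvals), np.mean(gmeans)))

Resampling inside the pipeline, rather than once before cross-validation, keeps synthetic minority examples out of the test folds, which would otherwise inflate the reported scores.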
Article Details
Licensee MJS, Universiti Malaya, Malaysia. This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
References
Bühlmann, P. & Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4): 477-505.
Cahyani, N., Fithriasari, K., Irhamah & Iriawan, N. (2018). On the comparison of deep learning neural network and binary logistic regression for classifying the acceptance status of Bidikmisi scholarship applicants in East Java. MATEMATIKA: Malaysian Journal of Industrial and Applied Mathematics, 34(Special Issue): 83-90.
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16: 321-357.
Chawla, N.V., Lazarevic, A., Hall, L.O. & Bowyer, K.W. (2003). SMOTEBoost: Improving the prediction of the minority class in boosting. Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, 22-26 September, 107–119, Springer.
Freund, Y. & Schapire, R. E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. Proceedings of the 2nd European Conference on Computational Learning Theory, Barcelona, Spain, 13-15 March, 23-37, Springer.
Freund, Y. & Schapire, R. (1996). Experiments with a new boosting algorithm. Proceedings of the 13th International Conference on Machine Learning, 325-332.
Han, J., Kamber, M. & Pei, J. (2006). Data Mining: Concepts and Techniques, 2nd Edition. USA: Morgan Kaufmann.
Han, J., Kamber, M. & Pei, J. (2012). Data Mining: Concepts and Techniques, 3rd Edition. USA: Morgan Kaufmann.
Imran, M., Afroze, M., Sanampudi, S. K. & Qyser, A. A. M. (2016). Data mining of imbalanced dataset in educational data using Weka tool. International Journal of Engineering Science and Computing, 6(6): 7666-7669.
Japkowicz, N. & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5): 429-449.
Leaes, A., Fernandes, P., Lopes, L. & Assunção, J. (2017). Classifying with AdaBoost.M1: The training error threshold myth. Proceedings of the Thirtieth International Florida Artificial Intelligence Research Society Conference, Marco Island, Florida, 22-24 May.
Li, X., Wang, L. & Sung, E. (2008). AdaBoost with SVM-based component classifiers. Engineering Applications of Artificial Intelligence, 21(5): 785-795. From University of Wollongong Publications: http://ro.uow.edu.au/eispapers/602.
Schapire, R. & Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37: 297–336.
Sun, Y., Wong, A. K. C. & Wang, Y. (2005). Parameter inference of cost-sensitive boosting algorithm. Proceedings of the 4th International Conference on Machine Learning and Data Mining in Pattern Recognition, Leipzig, Germany, 9-11 July, 21-30, Springer.
Suryaningtyas, W., Iriawan, N., Fithriasari, K., Ulama, B. S. S., Susanto, I. & Pravitasari, A. A. (2018). On the Bernoulli mixture model for Bidikmisi scholarship classification with Bayesian MCMC. Journal of Physics: Conference Series, 1090: 1-8.
Ting, K. (2000). A comparative study of cost-sensitive boosting algorithms. Proceedings of the 17th International Conference on Machine Learning, Stanford, CA, 983-990.