Predicting DNA Sequence Similarity Across Species Using Machine Learning: a K-mer Based Approach

Authors

  • Vinodkumar R. Patil, Department of Computer Engineering, GES’s R. H. Sapat College of Engineering, Management Studies and Research, Nashik, INDIA. https://orcid.org/0000-0001-8803-9724
  • Archana S. Vaidya, Department of Computer Engineering, GES’s R. H. Sapat College of Engineering, Management Studies and Research, Nashik, INDIA.
  • Manisha S. Patil, Department of Computer Science and Engineering (Data Science), R. C. Patel Institute of Technology, Shirpur, INDIA.

DOI:

https://doi.org/10.22452/

Keywords:

DNA classification, machine learning, k-mer, naive Bayes

Abstract

This study examines the effectiveness of machine learning methods for classifying DNA sequences from human, chimpanzee, and dog samples. We used k-mer encoding to transform DNA sequences into numerical representations suitable for machine learning models. Four classifiers, namely naive Bayes (together with a weighted naive Bayes variant), random forest, k-nearest neighbors, and decision tree, were applied and evaluated using accuracy, precision, recall, and F1-score. The naive Bayes classifier consistently outperformed the others across all three datasets, achieving its highest accuracy on human DNA (98.4%), followed by chimpanzee DNA (91.4%), with markedly lower accuracy on dog DNA (68.9%). This performance disparity is attributed to the increasing evolutionary distance from human DNA. Additionally, a weighted naive Bayes model trained on human data achieved very high accuracy in predicting chimpanzee (99.3%) and dog (92.6%) DNA sequences. These results show the importance of accounting for evolutionary relationships and dataset characteristics when developing and training classification models for genetic sequence analysis. This work extends existing research by evaluating several algorithms on separate DNA datasets, identifying their strengths and weaknesses, and suggesting avenues for future research on advanced feature engineering and algorithm selection for improved cross-species classification.
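
The k-mer encoding and naive Bayes classification described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the k-mer length, the Laplace smoothing constant `alpha`, the class name `KmerNaiveBayes`, and the toy class labels in the usage note are all assumptions chosen for illustration, and the abstract does not state which values or libraries were actually used.

```python
import math
from collections import Counter, defaultdict


def kmers(sequence, k=6):
    """Return the overlapping k-mers of a DNA sequence, lowercased.

    The default k=6 is an illustrative assumption, not a value from the paper.
    """
    seq = sequence.lower()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class KmerNaiveBayes:
    """Multinomial naive Bayes over k-mer counts with Laplace smoothing (hypothetical helper)."""

    def __init__(self, k=6, alpha=1.0):
        self.k = k
        self.alpha = alpha  # smoothing constant (assumed)

    def fit(self, sequences, labels):
        self.class_counts = Counter(labels)
        self.kmer_counts = defaultdict(Counter)
        self.vocab = set()
        for seq, label in zip(sequences, labels):
            counts = Counter(kmers(seq, self.k))
            self.kmer_counts[label].update(counts)
            self.vocab.update(counts)
        # Total number of k-mer occurrences seen per class.
        self.totals = {c: sum(self.kmer_counts[c].values()) for c in self.class_counts}
        return self

    def predict(self, sequence):
        n = sum(self.class_counts.values())
        v = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / n)  # log prior
            denom = self.totals[label] + self.alpha * v
            for kmer in kmers(sequence, self.k):
                if kmer not in self.vocab:
                    continue  # skip k-mers never seen in training
                score += math.log((self.kmer_counts[label][kmer] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

As a usage example with made-up data, `KmerNaiveBayes(k=3).fit(["AAAAAAAA", "AAAAAAAT", "CGCGCGCG", "GCGCGCGC"], ["a", "a", "b", "b"])` learns two toy classes, after which `predict("AAAAAA")` selects class `"a"` because its k-mers occur only in that class's training sequences. Working in log space avoids numerical underflow when a sequence contains many k-mers.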

Published

30-06-2026

Section

Original Articles