Simulation modelling of single nucleotide genetic polymorphisms

  • Mikalai M. Yatskou Belarusian State University, 4 Niezaliezhnasci Avenue, Minsk 220030, Belarus
  • Vladimir V. Apanasovich Independent researcher, Minsk, Belarus
  • Vasily V. Grinev Belarusian State University, 4 Niezaliezhnasci Avenue, Minsk 220030, Belarus

Abstract

We propose an approach to identifying single nucleotide polymorphisms (SNPs) in DNA sequences. It is based on simulation modelling of single-nucleotide sites: random events are generated from beta or normal distributions whose parameters are estimated from the available experimental data. The approach improves the accuracy of SNP determination in DNA molecules, makes it possible to investigate the reliability of specific experiments, and allows the errors of parameters obtained under real experimental conditions to be estimated. The simulation model and the analysis methods are verified on a set of reference human genomic DNA sequencing data provided by the Genome in a Bottle Consortium. Existing statistical SNP identification algorithms are compared with machine learning methods trained on simulated data from the genomic sequencing of human DNA molecules. The machine learning models give the best results: their SNP identification accuracy is 2–5% higher than that of the classical statistical methods.
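
The simulation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the modelled quantity is the alternative-allele fraction at a site, that the beta parameters are fitted by the method of moments, and that the observed values below are made-up illustrative numbers; the function names `fit_beta_moments` and `simulate_sites` are hypothetical.

```python
import random
import statistics


def fit_beta_moments(values):
    """Method-of-moments estimates (alpha, beta) for values in (0, 1).

    Assumption: the abstract does not say how the parameters are
    estimated; the method of moments is used here only for illustration.
    """
    m = statistics.fmean(values)
    v = statistics.variance(values)
    # A beta distribution requires 0 < mean < 1 and variance < mean * (1 - mean).
    if not 0.0 < m < 1.0 or v <= 0.0 or v >= m * (1.0 - m):
        raise ValueError("sample is not compatible with a beta distribution")
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common


def simulate_sites(alpha, beta, n_sites, rng):
    """Draw n_sites simulated allele fractions from Beta(alpha, beta)."""
    return [rng.betavariate(alpha, beta) for _ in range(n_sites)]


# Hypothetical alternative-allele fractions (illustrative values only).
observed_snp = [0.42, 0.55, 0.48, 0.51, 0.46, 0.58, 0.44, 0.53]  # SNP sites
observed_ref = [0.01, 0.03, 0.02, 0.05, 0.01, 0.04, 0.02, 0.03]  # non-SNP sites

rng = random.Random(0)  # fixed seed so the sketch is reproducible
snp_params = fit_beta_moments(observed_snp)
ref_params = fit_beta_moments(observed_ref)

# Labelled simulated data: (allele fraction, is_snp).
simulated = (
    [(x, True) for x in simulate_sites(*snp_params, 1000, rng)]
    + [(x, False) for x in simulate_sites(*ref_params, 1000, rng)]
)
```

Labelled pairs of this kind could then serve as training or benchmarking data for a classifier or a statistical test, since the true class of every simulated site is known.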

Author Biographies

Mikalai M. Yatskou, Belarusian State University, 4 Niezaliezhnasci Avenue, Minsk 220030, Belarus

PhD (physics and mathematics), docent; head of the department of systems analysis and computer simulation, faculty of radiophysics and computer technologies

Vladimir V. Apanasovich, Independent researcher, Minsk, Belarus

Doctor of science (physics and mathematics), full professor; independent researcher

Vasily V. Grinev, Belarusian State University, 4 Niezaliezhnasci Avenue, Minsk 220030, Belarus

PhD (biology), docent; associate professor at the department of genetics, faculty of biology

Published
2024-08-02
Keywords: single nucleotide polymorphism, SNP, SNP identification, simulation modelling, machine learning
Supporting Agencies: This work was carried out in the framework of the state programme of scientific research «Convergence-2025» (grant No. 3.04.3.1, state registration No. 20211918).
How to Cite
Yatskou, M. M., Apanasovich, V. V., & Grinev, V. V. (2024). Simulation modelling of single nucleotide genetic polymorphisms. Journal of the Belarusian State University. Mathematics and Informatics, 2, 104–112. Retrieved from https://journals.bsu.by/index.php/mathematics/article/view/6114
Section
Theoretical Foundations of Computer Science