Simulation modelling of single nucleotide genetic polymorphisms
Abstract
We propose an approach to identifying single nucleotide polymorphisms (SNPs) in DNA sequences. It is based on simulation modelling of single-nucleotide sites, in which random events are generated according to the beta or normal distribution, with parameters estimated from the available experimental data. The approach improves the accuracy of SNP determination in DNA molecules and makes it possible to investigate the reliability of specific experiments and to estimate the errors of parameters determined under real experimental conditions. The simulation model and the analysis methods are verified on a set of reference human genomic DNA sequencing data provided by the Genome in a Bottle Consortium. A comparative analysis is carried out of existing statistical SNP identification algorithms and machine learning methods trained on simulated data from genomic sequencing of human DNA molecules. The best results are obtained with machine learning models, whose SNP identification accuracy is 2–5% higher than that of classical statistical methods.
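The idea of simulating single-nucleotide sites from a fitted beta distribution can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say how the distribution parameters are estimated or how reads are generated, so the method-of-moments fit, the binomial read sampling, the fixed read depth, the threshold classifier, and all input values below are assumptions made for the example.

```python
import random
import statistics


def fit_beta_moments(fractions):
    """Method-of-moments estimate of beta(a, b) from fractions in (0, 1).

    Assumption: the source does not state which estimator is used.
    """
    m = statistics.mean(fractions)
    v = statistics.variance(fractions)
    if not 0.0 < m < 1.0 or v <= 0.0 or v >= m * (1.0 - m):
        raise ValueError("moments are not compatible with a beta distribution")
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common


def simulate_site(a, b, depth, rng):
    """Simulate the observed alternative-allele fraction at one site.

    The true fraction is drawn from beta(a, b); each of `depth` reads then
    carries the alternative allele with that probability.
    """
    p = rng.betavariate(a, b)
    alt_reads = sum(rng.random() < p for _ in range(depth))
    return alt_reads / depth


def simulate_dataset(snp_params, ref_params, n_sites, depth, seed=0):
    """Return (fraction, label) pairs: label 1 for SNP sites, 0 otherwise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_sites):
        data.append((simulate_site(*snp_params, depth, rng), 1))
        data.append((simulate_site(*ref_params, depth, rng), 0))
    return data


def threshold_accuracy(data, threshold):
    """Accuracy of a simple rule: call a SNP when fraction >= threshold."""
    correct = sum((frac >= threshold) == bool(label) for frac, label in data)
    return correct / len(data)


# Hypothetical alternative-allele fractions, made up for illustration only.
observed_snp = [0.42, 0.55, 0.48, 0.51, 0.60, 0.39, 0.47, 0.53]
observed_ref = [0.01, 0.03, 0.02, 0.05, 0.01, 0.04, 0.02, 0.03]

snp_params = fit_beta_moments(observed_snp)
ref_params = fit_beta_moments(observed_ref)
data = simulate_dataset(snp_params, ref_params, n_sites=1000, depth=30, seed=1)
accuracy = threshold_accuracy(data, threshold=0.2)
print(f"beta parameters (SNP sites): {snp_params}")
print(f"threshold-rule accuracy on simulated sites: {accuracy:.3f}")
```

Because the simulated sites carry known labels, the same data could in principle be used to train and score any classifier, which is the role the simulated data plays in the comparison described above.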
Copyright (c) 2024 Journal of the Belarusian State University. Mathematics and Informatics
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.