Selecting informative features of human gene exons

Andrei V. Volkau; Mikalai M. Yatskou; Vasily V. Grinev

doi:10.33581/2520-6508-2019-1-77-89

Andrei V. Volkau, Belarusian State University, 4 Niezaliežnasci Avenue, Minsk 220030, Belarus
Mikalai M. Yatskou, Belarusian State University, 4 Niezaliežnasci Avenue, Minsk 220030, Belarus
Vasily V. Grinev, Belarusian State University, 4 Niezaliežnasci Avenue, Minsk 220030, Belarus

Abstract

Dimensionality reduction of the feature space of human gene exons is considered with the aim of gene identification. To evaluate the performance of various feature selection algorithms, computational experiments were carried out on the exons of 14 known human genes. The experiments show that exons are clearly separable by gene affiliation. Feature selection algorithms are sensitive to noise features and make it possible to estimate their number. Reducing the number of features lowers CPU time and memory usage, reduces model complexity, and makes the model easier to interpret. Our findings indicate that using features of flanking intronic sequences yields better prediction models than using exon features. The results open new opportunities for studying human gene data with machine learning algorithms.
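
The claim that feature selection is sensitive to noise features can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not name the algorithms used, so a simple Fisher-type filter score (between-class scatter divided by within-class scatter) is swapped in here as an assumption. The data are synthetic, and the names `make_toy_data`, `fisher_score`, `rank_features` and the cut-off `threshold` are hypothetical, introduced only for this illustration.

```python
import random
from statistics import mean, pvariance


def make_toy_data(n_per_class=50, n_classes=3, n_informative=4, n_noise=6, seed=0):
    """Synthetic stand-in for an exon feature table (not real genomic data)."""
    rng = random.Random(seed)
    rows, labels = [], []
    for c in range(n_classes):
        for _ in range(n_per_class):
            # Informative features: the mean shifts with the class ("gene") label.
            informative = [rng.gauss(3.0 * c, 1.0) for _ in range(n_informative)]
            # Noise features: the same distribution for every class.
            noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
            rows.append(informative + noise)
            labels.append(c)
    return rows, labels


def fisher_score(column, labels):
    """Between-class scatter divided by within-class scatter for one feature."""
    overall = mean(column)
    between = within = 0.0
    for c in set(labels):
        vals = [x for x, y in zip(column, labels) if y == c]
        between += len(vals) * (mean(vals) - overall) ** 2
        within += len(vals) * pvariance(vals)
    return between / within


def rank_features(rows, labels):
    """Return (feature_index, score) pairs, best first."""
    columns = list(zip(*rows))
    scores = [(j, fisher_score(col, labels)) for j, col in enumerate(columns)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


rows, labels = make_toy_data()
ranking = rank_features(rows, labels)
threshold = 1.0  # hypothetical cut-off, chosen by eye for this toy data
n_noise_estimated = sum(1 for _, score in ranking if score < threshold)
print(ranking[:4], n_noise_estimated)
```

In this toy setting the informative columns (indices 0 to 3) receive scores far above those of the noise columns, so counting the features below the cut-off gives an estimate of the number of noise features. On real exon data the gap is unlikely to be this clean, and the cut-off would have to be chosen by validation rather than by eye.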

Author Biographies

Andrei V. Volkau, Belarusian State University, 4 Niezaliežnasci Avenue, Minsk 220030, Belarus

Postgraduate student at the Department of System Analysis and Computer Simulation, Faculty of Radiophysics and Computer Technologies

Mikalai M. Yatskou, Belarusian State University, 4 Niezaliežnasci Avenue, Minsk 220030, Belarus

PhD (physics and mathematics), docent; associate professor at the Department of System Analysis and Computer Simulation, Faculty of Radiophysics and Computer Technologies

Vasily V. Grinev, Belarusian State University, 4 Niezaliežnasci Avenue, Minsk 220030, Belarus

PhD (biology), docent; associate professor at the Department of Genetics, Faculty of Biology
