Selecting informative features of human gene exons
Abstract
Dimensionality reduction of the feature space of human gene exons is considered with the aim of gene identification. To evaluate the performance of various feature selection algorithms, computational experiments were carried out on the exons of 14 known human genes. The experiments show that exons are clearly separable by gene affiliation. Feature selection algorithms are sensitive to noise features and make it possible to estimate their number. Reducing the number of features lowers CPU time and memory usage, reduces model complexity, and makes the model easier to interpret. Our findings indicate that using features of flanking intronic sequences leads to better prediction models than using exon features. The results provide new opportunities for studying human gene data with machine learning algorithms.
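The idea that a feature selection algorithm can both rank features and estimate how many of them are noise can be illustrated with a minimal sketch. It is not the procedure used in the paper: the Fisher-score filter, the label-shuffling baseline, the function names and the synthetic three-class data are all assumptions chosen for illustration, and no exon data are involved.

```python
import random
import statistics


def fisher_score(values, labels):
    """Between-class scatter of one feature divided by its within-class scatter."""
    overall_mean = statistics.fmean(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    between = within = 0.0
    for group in groups.values():
        mean = statistics.fmean(group)
        between += len(group) * (mean - overall_mean) ** 2
        within += sum((v - mean) ** 2 for v in group)
    return between / within if within > 0 else 0.0


def make_toy_data(n_classes=3, n_per_class=50, n_informative=4, n_noise=6, seed=0):
    """Hypothetical data: informative columns shift with the class, noise columns do not."""
    rng = random.Random(seed)
    rows, labels = [], []
    for cls in range(n_classes):
        for _ in range(n_per_class):
            informative = [rng.gauss(3.0 * cls, 1.0) for _ in range(n_informative)]
            noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
            rows.append(informative + noise)
            labels.append(cls)
    return rows, labels


def rank_and_flag_noise(rows, labels, n_permutations=20, seed=1):
    """Score every feature and flag those that do not beat a label-shuffling baseline."""
    rng = random.Random(seed)
    columns = list(zip(*rows))
    scores = [fisher_score(col, labels) for col in columns]
    # Null baseline: the best score any feature reaches once labels are shuffled.
    baseline = 0.0
    shuffled = list(labels)
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        baseline = max(baseline, max(fisher_score(col, shuffled) for col in columns))
    # Feature indices ordered from most to least discriminative.
    ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    noise = [j for j in ranking if scores[j] <= baseline]
    return ranking, noise, scores, baseline


rows, labels = make_toy_data()
ranking, noise, scores, baseline = rank_and_flag_noise(rows, labels)
print("ranking:", ranking)
print("features flagged as noise:", sorted(noise))
```

In this toy setting columns 0 to 3 are informative by construction, so they should appear at the top of the ranking, and the length of the flagged list serves as a rough estimate of the number of noise features. A univariate filter of this kind ignores feature interactions and redundancy, which multivariate methods are designed to address.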
Copyright (c) 2019 Journal of the Belarusian State University. Mathematics and Informatics
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.