To the question of optimal division of the full sample during machine learning

Abstract

The article examines modern approaches to dividing a data set into training, control (validation) and test samples used in machine learning for forecasting purposes. It addresses the topical issue of choosing the optimal split of the entire available data set into these samples. The author analyses the results produced by a software algorithm developed to find the optimal division of a data set into disjoint samples for training predictive models. A recommendation is given to allocate 80 % of the full sample to training in order to minimise the forecast error of the developed models.
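In practice such a recommendation is usually obtained empirically: several candidate training shares are tried, and the share that yields the smallest out-of-sample forecast error is selected. The sketch below is a hypothetical illustration of this idea, not the author's algorithm; the synthetic series, the simple trend model and the candidate shares are assumptions chosen only for demonstration.

```python
import numpy as np

# Hypothetical sketch: compare candidate training-sample shares by the
# forecast error each split produces on the held-out (test) observations.

rng = np.random.default_rng(0)
n = 200
t = np.arange(n)
y = 0.5 * t + 10 * np.sin(t / 12) + rng.normal(0, 3, n)  # synthetic series

def mape(actual, forecast):
    """Mean absolute percentage error (cf. Hyndman & Koehler, 2006)."""
    return np.mean(np.abs((actual - forecast) / actual)) * 100

results = {}
for share in (0.6, 0.7, 0.8, 0.9):             # candidate training shares
    split = int(n * share)
    train_t, train_y = t[:split], y[:split]
    test_t, test_y = t[split:], y[split:]
    coef = np.polyfit(train_t, train_y, deg=1)  # simple linear trend model
    forecast = np.polyval(coef, test_t)
    results[share] = mape(test_y, forecast)

best = min(results, key=results.get)
print({k: round(v, 2) for k, v in results.items()})
print(f"Training share with the lowest MAPE: {best:.0%}")
```

On real data the candidate grid, the model class and the error measure would of course be chosen to match the forecasting task at hand.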

Author Biography

Valentin O. Suvalov, Belarusian State University, 4 Niezaliežnasci Avenue, Minsk 220030, Belarus

Postgraduate student at the Department of Digital Economy, Faculty of Economics

References

  1. Dobbin KK, Simon RM. Optimally splitting cases for training and testing high dimensional classifiers [Internet; cited 2020 March 19]. Available from: https://bmcmedgenomics.biomedcentral.com/articles/10.1186/1755-8794-4-31.
  2. Afendras G, Markatou M. Optimality of training/test size and resampling effectiveness of cross-validation estimators of the generalization error. Journal of Statistical Planning and Inference. 2019;199:286–301.
  3. Hyndman RJ, Khandakar Y. Automatic time series forecasting: the forecast package for R. Journal of Statistical Software. 2008;27(3):1–22. DOI: 10.18637/jss.v027.i03.
  4. Wang X, Smith K, Hyndman R. Characteristic-based clustering for time series data. Data Mining and Knowledge Discovery. 2006;13(3):335–364. DOI: 10.1007/s10618-005-0039-x.
  5. Hyndman RJ, Athanasopoulos G. Forecasting: principles and practice. Melbourne: OTexts; 2013. 291 p.
  6. Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974;19(6):716–723. DOI: 10.1109/TAC.1974.1100705.
  7. Schwarz G. Estimating the dimension of a model. Annals of Statistics. 1978;6(2):461–464. DOI: 10.1214/aos/1176344136.
  8. Sugiura N. Further analysis of the data by Akaike’s information criterion and the finite corrections. Communications in Statistics. 1978;7(1):13–26. DOI: 10.1080/03610927808827599.
  9. Hyndman RJ, Koehler AB. Another look at measures of forecast accuracy. International Journal of Forecasting. 2006;22(4):679–688. DOI: 10.1016/j.ijforecast.2006.03.001.
Published
2021-07-30
Keywords: machine learning, big data, data analysis, econometric tools, training sample, control sample, test sample
Supporting Agencies
The author is grateful to K. S. Bogolyubskaya-Sinyakova, 1st category specialist of the liquidity regulation department of the National Bank of the Republic of Belarus, and A. A. Kazakevich, deputy head of the liquidity regulation department of the National Bank of the Republic of Belarus, for their valuable comments and suggestions, which helped to improve the article.
How to Cite
Suvalov, V. O. (2021). To the question of optimal division of the full sample during machine learning. Journal of the Belarusian State University. Economics, 1, 37-45. Retrieved from https://journals.bsu.by/index.php/economy/article/view/3687
Section
C. Mathematical and Quantitative Methods