Conditions for privatizing the elements of arrays by computing threads

  • Nikolai A. Likhoded Belarusian State University, Niezaliežnasci Avenue, 4, 220030, Minsk, Belarus
  • Maksim A. Paliashchuk Belarusian State University, Niezaliežnasci Avenue, 4, 220030, Minsk, Belarus

Abstract

The set of operations of the parallel algorithm for implementation on the GPU must be split into computation threads.  The threads must be grouped into computation units that run atomically on stream processors, also called multiprocessors.  For good GPU performance, it is important that as much data as possible can fit into fast register and shared memory,  otherwise slow global and local memory are used. The degree of memory usage with fast access reflects the computational  property of the algorithm, called locality. When implementing algorithms on multiprocessor computing devices, the use  of locality plays a crucial role in achieving high performance. In this paper, necessary conditions and sufficient conditions  have been formulated and proved, the use of which allows receiving threads with privatized data, i. e. it allows to receive such computation threads that the array element is used only by one thread and therefore it is advisable to place it in the  register.

Author Biographies

Nikolai A. Likhoded, Belarusian State University, Niezaliežnasci Avenue, 4, 220030, Minsk, Belarus

doctor of science (physics and mathematics), full professor; professor at the department of computational mathematics, faculty of applied mathematics and computer science

Maksim A. Paliashchuk, Belarusian State University, Niezaliežnasci Avenue, 4, 220030, Minsk, Belarus

assistant at the department of computational mathematics, faculty of applied mathematics and  computer science

References

  1. Voevodin VlV, Voevodin VadV. [The fortunate locality of supercomputers]. Otkrytye sistemy [Open systems]. 2013;9:12–15. Russian.
  2. Buluc A, Gilberta JR, Budak C. Solving path problems on the GPU. Parallel Computing. 2010;36(5– 6):241–253.
  3. Likhoded NA, Paliashchuk MA. [Estimate of locality of parallel algorithms implemented on GPUs]. Vestnik Yuzhno-Ural’skogo gosudarstvennogo universiteta. Seriya: Vychislitel’naya matematika i informatika [Bulletin of the South Ural State University. Series: Computational Mathematics and Software Engineering]. 2016:5(3):96 –111. Russian. DOI: 10.14529/cmse160307.
  4. Likhoded NA, Poleshchuk MA. Method of ranking tiles size parameters of parallel algorithm. Doklady of the National Academy of Sciences of Belarus. 2015;59(4):25–33. Russian.
  5. Kandemir M, Ramanujam J, Irwin M, Narayanan V, Kadayif I, Parikh A. A compiler based approach for dynamically managing scratch-pad memories in embedded systems. IEEE Transactions on Computer-Aided Design. 2004;23(2):243–260.
  6. Baskaran M, Bondhugula U, Krishnamoorthy S, Ramanujam J, Rountev A, Sadayappan P. Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming; 2008 February 20 –23; Salt Lake City, USA. New York: ACM; 2008. p. 1–10. DOI: 10.1145/1345206.1345210.
  7. Poleshchuk MA, Likhoded NA. [Array privatization by computing threads]. In: CSIST’2016. Mezhdunarodnyi kongress po informatike: informatsionnye sistemy i tekhnologii; 24 –27 oktyabrya 2016 g.; Minsk, Belarus’ [CSIST’2016. International Congress on Computer Science: Information Systems and Technologies; 2016 October 24 –27; Minsk, Belarus]. Minsk: Belarusian State University; 2016. p. 883–888. Russian.
  8. Baskaran M, Ramanujam J, Sadayappan P. Automatic C-to-CUDA code generation for affine programs. In: Compiler Construction. 19th International Conference. Part of the Joint European Conferences on Theory and Practice of Software; 2010 March 20 –28; Paphos, Cyprus. Berlin, Heidelberg: Springer-Verlag; 2010. DOI: 10.1007/978-3-642-11970-5_14.
  9. Katz GJ, Kider J. All-pairs shortest-paths for large graphs on the GPU. In: Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware; 2008 June 20–21; Sarajevo, Bosnia and Herzegovina. Geneve: Eurographics Association; 2008. p. 47–55.
  10. Lund BD, Smith JW. A Multi-Stage {CUDA} Kernel for Floyd-Warshall. CoRR [Internet]. 2010;abs/1001.4108. Available from: http://arxiv.org/abs/1001.4108.
  11. Adutskevich EV, Likhoded NA, Sikorsky AO. [Parallelization of sequential programs: distribution of arrays among processors and structurization of communications]. Kibernetika i sistemnyi analiz [Cybernetics and System Analysis]. 2012;48(1):144 –163. Russian.
  12. Voevodin VV. Informatsionnaya struktura algoritmov [Informational structure of algorithms]. Moscow: MGU; 1997. Russian.
  13. Likhoded NA, Tolstikov AA. Formalization of communication operations of multidimensional loops. Proceedings of the National Academy of Sciences of Belarus. Physics and Mathematics Series. 2010;3:109 –114. Russian.
Published
2019-01-19
Keywords: parallel computations, GPU, tiling, array privatization, registers
Supporting Agencies The prepared report was sponsored by the government program of scientific research of the Republic of Belarus «Convergence-2020» (subprogram «Methods of mathematical modeling of complex systems»).
How to Cite
Likhoded, N. A., & Paliashchuk, M. A. (2019). Conditions for privatizing the elements of arrays by computing threads. Journal of the Belarusian State University. Mathematics and Informatics, 3, 59-67. Retrieved from https://journals.bsu.by/index.php/mathematics/article/view/1013