TY - JOUR
T1 - Parallel QR factorization using Givens rotations in MPI-CUDA for multi-GPU
AU - Tapia-Romero, Miguel
AU - Meneses-Viveros, Amilcar
AU - Hernandez-Rubio, Erika
N1 - Publisher Copyright:
© 2020 Science and Information Organization.
PY - 2020
Y1 - 2020
AB - Modern supercomputers combine multi-core processors with graphics processing units (GPUs). Applications running on these machines exploit both technologies through scalable programs that target multi-core processors and accelerators such as GPUs. QR factorization is essential for several numerical tasks, such as solving systems of linear equations, computing a matrix inverse, or computing a diagonal matrix, to name a few. There are several factorization algorithms, such as LU, Cholesky, Givens, and Householder, among others. The efficiency of a parallel implementation of each algorithm depends on the structure of the data and the type of parallel architecture used. A common strategy in parallel programming is to break a problem into subproblems that are solved on different processing units. This is very useful when dealing with complex problems or when the data is too large to fit in the available memory. However, it is not clear how data partitioning affects subtask performance when the subtasks are mapped to processing units, specifically to GPUs. This work explores the partitioning of large symmetric matrices for QR factorization using Givens rotations and presents a parallel implementation using MPI and CUDA.
KW - CUDA
KW - Givens factorization
KW - Heterogeneous programming
KW - Scalable parallelism
UR - http://www.scopus.com/inward/record.url?scp=85085772673&partnerID=8YFLogxK
U2 - 10.14569/IJACSA.2020.0110578
DO - 10.14569/IJACSA.2020.0110578
M3 - Article
AN - SCOPUS:85085772673
SN - 2158-107X
VL - 11
SP - 636
EP - 645
JO - International Journal of Advanced Computer Science and Applications
JF - International Journal of Advanced Computer Science and Applications
IS - 5
ER -