PIER
 
Progress In Electromagnetics Research
ISSN: 1070-4698, E-ISSN: 1559-8985

A MEMORY EFFICIENT AND FAST SPARSE MATRIX VECTOR PRODUCT ON A GPU

By A. Dziekonski, A. Lamecki, and M. Mrozowski

Abstract:
This paper proposes a new sparse matrix storage format that allows an efficient implementation of the sparse matrix-vector product on a Fermi Graphics Processing Unit (GPU). Unlike previous formats, it combines a low memory footprint with good throughput. The new format, which we call Sliced ELLR-T, has been designed specifically for accelerating the iterative solution of large, sparse, complex-valued systems of linear equations arising in computational electromagnetics. Numerical tests show that the new implementation reaches 69 GFLOPS in complex single-precision arithmetic. Compared to an optimized six-core Central Processing Unit (CPU) implementation (Intel Xeon 5680), this corresponds to a speedup by a factor of six. The new format is as fast as the best format published so far, yet it does not introduce the redundant zero elements that must otherwise be stored to ensure fast memory access. As a result, significantly larger problems than with previously published solutions can be handled on low-cost commodity GPUs with a limited amount of on-board memory.
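
The abstract does not give the actual layout of Sliced ELLR-T, so the following is only a rough illustration of the general idea behind sliced, row-length-aware ELLPACK-style storage, not the authors' format or GPU kernel. It assumes a plain-Python, CPU-side model in which rows are grouped into slices of a hypothetical `slice_height`, each slice is padded only to its own longest row, and a per-row length lets the product skip the padding. All function names are made up for illustration. Note that this simplified sketch still pads within a slice, so it only reduces the number of stored zeros relative to plain ELL.

```python
from typing import List, Tuple

# One sparse row: (column index, complex value) pairs. Illustrative type only.
Row = List[Tuple[int, complex]]


def ell_padded_size(rows: List[Row]) -> int:
    """Stored slots in plain ELL: every row is padded to the longest row."""
    width = max((len(r) for r in rows), default=0)
    return width * len(rows)


def build_sliced_ell(rows: List[Row], slice_height: int):
    """Group rows into slices; pad each slice only to its own longest row."""
    slices = []
    for start in range(0, len(rows), slice_height):
        chunk = rows[start:start + slice_height]
        width = max(len(r) for r in chunk)
        cols = [[c for c, _ in r] + [0] * (width - len(r)) for r in chunk]
        vals = [[v for _, v in r] + [0j] * (width - len(r)) for r in chunk]
        row_len = [len(r) for r in chunk]  # lets the product skip padding
        slices.append((start, width, cols, vals, row_len))
    return slices


def sliced_size(slices) -> int:
    """Stored slots (including padding) across all slices."""
    return sum(width * len(row_len) for _, width, _, _, row_len in slices)


def spmv(slices, x, n_rows: int):
    """Compute y = A x, visiting only the real entries of each row."""
    y = [0j] * n_rows
    for start, _, cols, vals, row_len in slices:
        for i, n in enumerate(row_len):
            acc = 0j
            for k in range(n):
                acc += vals[i][k] * x[cols[i][k]]
            y[start + i] = acc
    return y


# Small complex-valued example: one long row and three short ones.
rows = [
    [(0, 2 + 1j), (1, 1j), (2, 1), (3, -1)],
    [(1, 3)],
    [(2, 1 - 1j)],
    [(0, 1), (3, 2j)],
]
x = [1, 1j, 2, -1]
slices = build_sliced_ell(rows, slice_height=2)
print(ell_padded_size(rows), sliced_size(slices))
print(spmv(slices, x, len(rows)))
```

In this toy case the single long row forces padding only inside its own slice, which is why the sliced layout stores fewer slots than plain ELL. On a GPU the slice height would typically be tied to the thread-block or warp size, but that mapping is outside what the abstract states.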

Citation:
A. Dziekonski, A. Lamecki, and M. Mrozowski, "A memory efficient and fast sparse matrix vector product on a GPU," Progress In Electromagnetics Research, Vol. 116, 49-63, 2011.
doi:10.2528/PIER11031607
http://www.jpier.org/pier/pier.php?paper=11031607

