Optimizing CUDA code by kernel fusion: application on BLAS

Warning

This publication does not fall under the Institute of Computer Science, but under the Faculty of Informatics. The official publication page is on the muni.cz website.

Title in Czech	Optimalizace CUDA kódu pomocí fúzí kernelů: aplikace na BLAS
Authors	FILIPOVIČ Jiří, MADZIN Matúš, FOUSEK Jan, MATYSKA Luděk
Year of publication	2015
Type	Article in a scholarly journal
Journal / Source	The Journal of Supercomputing
MU Faculty / Unit	Faculty of Informatics
Citation
www	http://link.springer.com/article/10.1007/s11227-015-1483-z
DOI	http://dx.doi.org/10.1007/s11227-015-1483-z
Field	Informatics
Keywords	GPU; CUDA; BLAS; Kernel fusion; Code generation
Description	Contemporary GPUs have significantly higher arithmetic throughput than memory throughput. Hence, many GPU kernels are memory bound and cannot exploit the arithmetic power of the GPU. Examples of memory-bound kernels are BLAS-1 (vector–vector) and BLAS-2 (matrix–vector) operations. However, when kernels share data, kernel fusion can improve memory locality by placing the shared data, originally passed via off-chip global memory, into faster but distributed on-chip memory. In this paper, we show how kernels performing map, reduce, or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of their sequences can be improved by fusion. Compared with similar sequences using CUBLAS, our compiler is able to generate code that is up to 2.24x faster for the examples tested.
Related projects:	Employing the best young scientists to develop international cooperation; Large-scale computing systems: models, applications and verification IV; Large-scale computing systems: models, applications and verification V
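
The fusion idea in the description can be illustrated with a small CPU-side analogy. The sketch below is not the paper's compiler output and not CUDA code; it is a plain C++ illustration under the assumption that a map step (z = alpha*x + y) is followed by a reduce step (the sum of z_i squared). The function names `axpy_then_dot_unfused` and `axpy_then_dot_fused` are hypothetical. The unfused version writes the intermediate vector to memory and reads it back, which stands in for passing data between two kernels through global memory. The fused version keeps each intermediate value in a local variable, which stands in for on-chip storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: two passes, standing in for two separate kernels.
// The intermediate vector z is written to memory and then read back.
double axpy_then_dot_unfused(double alpha,
                             const std::vector<double>& x,
                             const std::vector<double>& y) {
    assert(x.size() == y.size());
    std::vector<double> z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {  // pass 1: map
        z[i] = alpha * x[i] + y[i];
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < z.size(); ++i) {  // pass 2: reduce
        sum += z[i] * z[i];
    }
    return sum;
}

// Fused: one pass. Each intermediate value lives only in a local
// variable, so no intermediate vector is stored or reloaded.
double axpy_then_dot_fused(double alpha,
                           const std::vector<double>& x,
                           const std::vector<double>& y) {
    assert(x.size() == y.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double zi = alpha * x[i] + y[i];    // map
        sum += zi * zi;                           // reduce
    }
    return sum;
}
```

Both functions visit the elements in the same order and therefore return the same value. The fused one reads x and y once and performs no intermediate store, which is the kind of memory-traffic saving the description attributes to kernel fusion on memory-bound BLAS-1 sequences.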