
coder.gpu.kernel

Pragma that maps for-loops to GPU kernels

Description

coder.gpu.kernel() is a loop-level pragma that you must place immediately before a for loop. It generates a kernel with the dimensions computed from the loop parameters.

Note

The coder.gpu.kernel pragma overrides all parallel loop analysis checks that the software performs. Use coder.gpu.kernelfun first before using the more advanced functionality of the coder.gpu.kernel pragma.

coder.gpu.kernel(B,T) is a loop-level pragma that you must place immediately before a for loop. It generates a kernel with the dimensions specified by B and T. B = [Bx,By,1] is an array that defines the number of blocks in the grid along dimensions x and y (z is not used). T = [Tx,Ty,Tz] is an array that defines the number of threads in the block along dimensions x, y, and z.

A value of -1 for B and T indicates that GPU Coder™ must infer the grid and block dimensions automatically. The coder.gpu.kernel pragma generates errors for invalid grid and block dimensions.
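For example, a minimal sketch (the function name addVectors is illustrative, not from this page) that lets GPU Coder infer the launch configuration automatically:

```matlab
% Hypothetical entry-point function. Both B and T are -1, so
% GPU Coder infers the grid and block dimensions from the
% loop bounds during code generation.
function v = addVectors(x, y) %#codegen
v = coder.nullcopy(zeros(size(x)));
coder.gpu.kernel(-1, -1);
for i = 1:numel(x)
    v(i) = x(i) + y(i);
end
end
```

The pragma has no effect when the function runs in MATLAB; the dimensions are computed only during code generation.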


coder.gpu.kernel(B,T,M,name) expects the same B and T arguments. You can specify the optional arguments M and name. M is a positive integer specifying the minimum number of blocks per streaming multiprocessor. Sometimes, increasing M can reduce the register usage within a kernel and improve kernel occupancy. A value of -1 for M indicates that GPU Coder must use the default value of 1. name is a character array that allows you to customize the name of the generated kernel.
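As a sketch of the four-argument form (the function name scaleVector and the literal values are assumptions for illustration), the pragma below requests eight blocks of 512 threads, at least two blocks per streaming multiprocessor, and a custom kernel name:

```matlab
% Hypothetical sketch: B = [8,1,1], T = [512,1,1], M = 2, and
% the generated kernel name is derived from 'scaleKernel'.
function y = scaleVector(x, s) %#codegen
y = coder.nullcopy(zeros(size(x)));
coder.gpu.kernel([8,1,1], [512,1,1], 2, 'scaleKernel');
for i = 1:4096
    y(i) = s * x(i);
end
end
```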

Specifying the kernel pragma overrides all parallel loop analysis checks. This override allows loops to be parallelized in situations where parallel loop analysis cannot prove that all iterations are independent of each other. First, ensure that the loop is safe to parallelize.
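One situation where the override helps is an indirectly indexed loop. A sketch, assuming the index vector idx is known by the author to be a permutation, which the dependence analysis cannot prove on its own:

```matlab
% Hypothetical sketch: out(idx(i)) defeats parallel loop analysis
% because two iterations could, in principle, write the same
% element. If idx is a permutation of 1:numel(x), every write
% targets a distinct element, so forcing the kernel is safe.
function out = scatterCopy(x, idx) %#codegen
out = coder.nullcopy(zeros(size(x)));
coder.gpu.kernel(-1, -1);
for i = 1:numel(x)
    out(idx(i)) = x(i);
end
end
```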

This function is a code generation function. It has no effect in MATLAB®.

Examples


This example shows how to use the kernel pragma in a function and generate CUDA® code.

In one file, write the entry-point function scalars that accepts two vector inputs x, y of size 1x4096 and one scalar input scale. The function has two for-loops of different iteration lengths, one for vector addition and one for finding the cumulative sum. Place the coder.gpu.kernel(1,1024) pragma outside the first loop. This pragma creates a kernel with one block having 1024 threads. Place the coder.gpu.kernel(8,512,512,'reduction') pragma outside the second loop. This pragma creates a kernel with eight blocks having 512 threads per block. The kernel created for this loop is named reduction.

function [vout, sout1] = scalars(x, y, scale)
    sout1 = 0;
    vout = coder.nullcopy(zeros(size(x)));
    coder.gpu.kernel(1,1024);
    for i = 1:1024
        vout(i) = x(i) + y(i);
    end
    coder.gpu.kernel(8,512,512,'reduction');
    for i = 1:4096
        sout1 = (x(i)*scale) + sout1;
    end
end

Use the codegen function to generate a CUDA MEX function.

codegen -config coder.gpuConfig('mex') ...
    -args {ones(1,4096,'double'),ones(1,4096,'double'),coder.typeof(0)} ...
    -report scalars

GPU Coder creates two kernels: scalars_kernel1 for the vector addition and scalars_reduction for the cumulative sum. No kernel is needed to initialize sout1 = 0.

cudaMemcpy(gpu_y, y, 32768U, cudaMemcpyHostToDevice);
cudaMemcpy(gpu_x, x, 32768U, cudaMemcpyHostToDevice);
scalars_kernel1<<<dim3(1U, 1U, 1U), dim3(1024U, 1U, 1U)>>>(gpu_y, gpu_x, gpu_vout);
cudaMemcpy(gpu_sout1, sout1, 8U, cudaMemcpyHostToDevice);
scalars_reduction<<<dim3(8U, 1U, 1U), dim3(512U, 1U, 1U)>>>(scale, gpu_x, gpu_sout1);
cudaMemcpy(vout, gpu_vout, 32768U, cudaMemcpyDeviceToHost);
cudaMemcpy(sout1, gpu_sout1, 8U, cudaMemcpyDeviceToHost);

scalars_kernel1 has one block with 1024 threads per block, one thread for adding each element. scalars_reduction has eight blocks with 512 threads per block, resulting in a total of 4096 threads.

You can use variables or expressions when specifying the kernel dimensions. For example, you can rewrite the scalars entry-point function such that the grid and block dimensions are specified at compile time.

function [vout, sout1] = scalars(x, y, scale, a, b)
    sout1 = 0;
    vout = zeros(size(x));
    coder.gpu.kernel(1,1024);
    for i = 1:1024
        vout(i) = x(i) + y(i);
    end
    coder.gpu.kernel([a,a*b,1], [a*b,1,1], 'reduction');
    for i = 1:length(x)
        sout1 = (x(i)*scale) + sout1;
    end
end

Use the codegen function to generate a CUDA MEX function.

codegen -config coder.gpuConfig('mex') ...
    -args {ones(1,4096,'double'),ones(1,4096,'double'),20,8,4} ...
    -report scalars

Version History

Introduced in R2017b