GPU¶
Usage
use GPU;
or
import GPU;
Supports utility functions for operating with GPUs.
Warning
This module is unstable and its interface is subject to change in the future.
GPU support is a relatively new feature in Chapel and is under active development.
For the most up-to-date information about GPU support see the technical note about it.
- attribute @gpu.assertEligible¶
This attribute can be applied to loops to ensure that they are eligible for GPU execution. Unlike @assertOnGpu, this attribute has no execution-time effect. It only asserts that the code could be executed on the GPU, and not that it will be executed.

  @gpu.assertEligible
  foreach i in 1..128 { /* ... */ }

  // variable version (applies to loop expressions and promoted expressions)
  @gpu.assertEligible
  var A = (foreach i in 1..128 do i*i) + 1;
- config param silenceAssertOnGpuWarning = false¶
This configuration parameter is used to disable warnings that are emitted when @assertOnGpu is used in a non-GPU compilation. Since @assertOnGpu’s execution-time semantics are to halt execution if it is not on the GPU, it will always halt execution when the program is not compiled for the GPU. This is likely an issue, so the warning is emitted by default. However, if the user is aware of this and wants to silence the warning, they can set this configuration parameter to true.
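For example (a minimal sketch; the file name and compile line are illustrative), a program using @assertOnGpu can be compiled for a non-GPU configuration without the warning by setting the parameter on the chpl command line:

  // main.chpl -- compiled without GPU support, e.g.:
  //   chpl -ssilenceAssertOnGpuWarning=true main.chpl
  use GPU;

  @assertOnGpu
  foreach i in 1..128 {
    // ... at execution time this still halts when not running on a GPU ...
  }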
- attribute @assertOnGpu¶
This attribute can be applied to loops to ensure that they are executed on the GPU. It has the effect of @gpu.assertEligible, halting compilation if the construct it is applied to cannot be executed on the GPU. In addition, this attribute causes an execution-time check to be performed when it is reached, ensuring that the code is executed on the GPU.

  @assertOnGpu
  foreach i in 1..128 { /* ... */ }

  // variable version (applies to loop expressions and promoted expressions)
  @assertOnGpu
  var A = (foreach i in 1..128 do i*i) + 1;
- attribute @gpu.blockSize(blockSize: integral)¶
This attribute can be applied to loops to specify the GPU block size to use when executing the loop on the GPU.
  // loop version
  @gpu.blockSize(64)
  foreach i in 1..128 { /* ... */ }

  // variable version (applies to loop expressions and promoted expressions)
  @gpu.blockSize(64)
  var A = (foreach i in 1..128 do i*i) + 1;
- attribute @gpu.itersPerThread(itersPerThread: integral, param cyclic: bool = false)¶
This attribute requests that the kernel execute each consecutive itersPerThread iterations of the affected loop sequentially within the same GPU thread. Users must ensure that the arguments to this attribute are positive.

  // loop version
  @gpu.itersPerThread(4)
  foreach i in 1..128 { /* ... */ }

  // variable version (applies to loop expressions and promoted expressions)
  @gpu.itersPerThread(4)
  var A = (foreach i in 1..128 do i*i) + 1;
Specifying the cyclic argument to be true distributes the iterations across GPU threads in cyclic fashion instead of the default block discipline.
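A minimal sketch of the cyclic form (assuming the attribute accepts a named cyclic argument, as in the signature above):

  // each GPU thread handles 4 iterations, assigned in round-robin fashion
  @gpu.itersPerThread(4, cyclic=true)
  foreach i in 1..128 { /* ... */ }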
- proc gpuWrite(const args ...?k)¶
This function is intended to be called from within a GPU kernel and is useful for debugging purposes.
Currently, using write to send output to stdout will make a loop ineligible for GPU execution; use gpuWrite instead.

Currently, this function will only work if values of type c_ptrConst(c_char) are passed.

On NVIDIA GPUs the written values will be flushed to the terminal after the kernel has finished executing. Note that there is a 1MB limit on the size of this buffer.
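A minimal sketch of calling gpuWrite from a kernel, assuming a string literal’s c_str() (which yields a c_ptrConst(c_char)) is usable inside the loop:

  use GPU;

  on here.gpus[0] {
    @assertOnGpu
    foreach i in 1..4 {
      // only c_ptrConst(c_char) arguments are currently supported
      gpuWrite("hello from a GPU thread\n".c_str());
    }
  }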
- proc gpuClock() : uint¶
Returns the value of a per-multiprocessor counter that increments every clock cycle. This function is meant to be called to time sections of code within a GPU-enabled loop.
- proc gpuClocksPerSec(devNum: int)¶
Returns the number of clock cycles per second of a GPU multiprocessor. Note: currently we don’t support calling this function from within a kernel.
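A minimal sketch combining the two procedures above: each thread records how many cycles a small computation takes, and the host converts one count to seconds afterwards (device number 0 is assumed to correspond to here.gpus[0]):

  use GPU;

  on here.gpus[0] {
    var cycles: [0..<128] uint;
    var results: [0..<128] int;

    @assertOnGpu
    foreach i in 0..<128 {
      const start = gpuClock();
      var acc = 0;
      for j in 1..1000 do acc += (i * j) % 7;  // work being timed
      results[i] = acc;
      cycles[i] = gpuClock() - start;          // elapsed cycles for this thread
    }

    // gpuClocksPerSec cannot be called from within a kernel
    writeln("thread 0 took about ",
            (cycles[0]:real) / gpuClocksPerSec(0), " seconds");
  }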
- proc syncThreads()¶
Synchronize threads within a GPU block.
- proc syncWarp(mask: uint(32) = 0xffffffff)¶
Causes the executing thread to wait until all warp lanes named in mask have executed a syncWarp() (with the same mask) before resuming execution. Each calling thread must have its own bit set in the mask, and all non-exited threads named in mask must execute a corresponding syncWarp() with the same mask, or the result is undefined.
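A minimal sketch of where a syncWarp() call sits inside a kernel, using the default full mask:

  use GPU;

  on here.gpus[0] {
    var A: [0..<128] int;
    @assertOnGpu
    foreach i in 0..<128 {
      A[i] = i;    // per-lane work
      syncWarp();  // default mask 0xffffffff: wait for every lane in the warp
      // ... work that assumes all lanes in the warp have reached this point ...
    }
  }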
- proc createSharedArray(type eltType, param size) : c_ptr(eltType)¶
Allocate block shared memory, enough to store size elements of eltType. Returns a CTypes.c_ptr to the allocated array. Note that although every thread in a block calls this procedure, the same shared array is returned to all of them.
- Arguments:
eltType – the type of elements to allocate the array for.
size – the number of elements in each GPU thread block’s copy of the array.
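A hedged sketch combining createSharedArray with syncThreads and the @gpu.blockSize attribute: each 64-thread block stages its loop indices in block-shared memory, synchronizes, and then one thread per block writes that block’s partial sum. The linear mapping of consecutive iterations to threads within a block is an assumption of this example:

  use GPU, CTypes;

  on here.gpus[0] {
    param numThreads = 64;        // threads per block; must be a param for createSharedArray
    var blockSums: [0..<2] int;   // one result slot per block (128 iterations / 64)

    @gpu.blockSize(64)
    @assertOnGpu
    foreach i in 0..<128 {
      // every thread in the block sees the same shared buffer
      var buf = createSharedArray(int, numThreads);
      const lane = i % numThreads;
      buf[lane] = i;
      syncThreads();              // make all writes visible block-wide
      if lane == 0 {              // one thread per block sums the staged values
        var sum = 0;
        for j in 0..<numThreads do sum += buf[j];
        blockSums[i / numThreads] = sum;
      }
    }
    writeln(blockSums); // 2016 6112
  }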
- proc gpuAtomicAdd(ref x: ?T, val: T) : T¶
When run on a GPU, atomically add ‘val’ to ‘x’ and store the result in ‘x’. The operation returns the old value of x.
- proc gpuAtomicSub(ref x: ?T, val: T) : T¶
When run on a GPU, atomically subtract ‘val’ from ‘x’ and store the result in ‘x’. The operation returns the old value of x.
- proc gpuAtomicExch(ref x: ?T, val: T) : T¶
When run on a GPU, atomically exchange the value stored in ‘x’ with ‘val’. The operation returns the old value of x.
- proc gpuAtomicMin(ref x: ?T, val: T) : T¶
When run on a GPU, atomically compare ‘x’ and ‘val’ and store the minimum in ‘x’. The operation returns the old value of x.
- proc gpuAtomicMax(ref x: ?T, val: T) : T¶
When run on a GPU, atomically compare ‘x’ and ‘val’ and store the maximum in ‘x’. The operation returns the old value of x.
- proc gpuAtomicInc(ref x: ?T, val: T) : T¶
When run on a GPU, if the original value of ‘x’ is greater than or equal to ‘val’, atomically increment ‘x’ and store the result in ‘x’; otherwise ‘x’ is set to 0. The operation returns the old value of x.
- proc gpuAtomicDec(ref x: ?T, val: T) : T¶
When run on a GPU, atomically determine whether ‘x’ equals 0 or is greater than ‘val’. If so, store ‘val’ in ‘x’; otherwise decrement ‘x’ by 1. The operation returns the old value of x.
- proc gpuAtomicAnd(ref x: ?T, val: T) : T¶
When run on a GPU, atomically perform a bitwise ‘and’ operation on ‘x’ and ‘val’ and store the result in ‘x’. The operation returns the old value of x.
- proc gpuAtomicOr(ref x: ?T, val: T) : T¶
When run on a GPU, atomically perform a bitwise ‘or’ operation on ‘x’ and ‘val’ and store the result in ‘x’. The operation returns the old value of x.
- proc gpuAtomicXor(ref x: ?T, val: T) : T¶
When run on a GPU, atomically perform a bitwise ‘xor’ operation on ‘x’ and ‘val’ and store the result in ‘x’. The operation returns the old value of x.
- proc gpuAtomicCAS(ref x: ?T, cmp: T, val: T) : T¶
When run on a GPU, atomically compare the value in ‘x’ and ‘cmp’, if they are equal store ‘val’ in ‘x’. The operation returns the old value of x.
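The atomic procedures above are intended to be called from within GPU-eligible loops, with the target stored in GPU-accessible memory. A minimal sketch using gpuAtomicAdd to build a small histogram (the expected output comment assumes Chapel’s usual space-separated array formatting):

  use GPU;

  on here.gpus[0] {
    var hist: [0..<10] int;          // GPU-accessible
    const Data = [1, 4, 4, 7, 9, 1]; // GPU-accessible

    @assertOnGpu
    foreach i in Data.domain {
      // atomically bump the bucket for this element; the old value is ignored
      gpuAtomicAdd(hist[Data[i]], 1);
    }
    writeln(hist); // 0 2 0 0 2 0 0 1 0 1
  }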
- proc gpuSumReduce(const ref A: [] ?t)¶
Add all elements of an array together on the GPU (that is, perform a sum-reduction). The array must be in GPU-accessible memory and the function must be called from outside a GPU-eligible loop. Only arrays with int, uint, and real types are supported. A simple example is the following:
  on here.gpus[0] {
    var Arr = [3, 2, 1, 5, 4]; // will be GPU-accessible
    writeln(gpuSumReduce(Arr)); // 15
  }
- proc gpuMinReduce(const ref A: [] ?t)¶
Return the minimum element of an array on the GPU (that is, perform a min-reduction). The array must be in GPU-accessible memory and the function must be called from outside a GPU-eligible loop. Only arrays with int, uint, and real types are supported. A simple example is the following:
  on here.gpus[0] {
    var Arr = [3, 2, 1, 5, 4]; // will be GPU-accessible
    writeln(gpuMinReduce(Arr)); // 1
  }
- proc gpuMaxReduce(const ref A: [] ?t)¶
Return the maximum element of an array on the GPU (that is, perform a max-reduction). The array must be in GPU-accessible memory and the function must be called from outside a GPU-eligible loop. Only arrays with int, uint, and real types are supported. A simple example is the following:
  on here.gpus[0] {
    var Arr = [3, 2, 1, 5, 4]; // will be GPU-accessible
    writeln(gpuMaxReduce(Arr)); // 5
  }
- proc gpuMinLocReduce(const ref A: [] ?t)¶
For an array on the GPU, return a tuple with the value and the index of the minimum element (that is, perform a minloc-reduction). If there are multiple elements with the same minimum value, the index of the first one is returned. The array must be in GPU-accessible memory and the function must be called from outside a GPU-eligible loop. Only arrays with int, uint, and real types are supported. A simple example is the following:
  on here.gpus[0] {
    var Arr = [3, 2, 1, 5, 4]; // will be GPU-accessible
    writeln(gpuMinLocReduce(Arr)); // (1, 2). Note that Arr[2]==1.
  }
- proc gpuMaxLocReduce(const ref A: [] ?t)¶
For an array on the GPU, return a tuple with the value and the index of the maximum element (that is, perform a maxloc-reduction). If there are multiple elements with the same maximum value, the index of the first one is returned. The array must be in GPU-accessible memory and the function must be called from outside a GPU-eligible loop. Only arrays with int, uint, and real types are supported. A simple example is the following:
  on here.gpus[0] {
    var Arr = [3, 2, 1, 5, 4]; // will be GPU-accessible
    writeln(gpuMaxLocReduce(Arr)); // (5, 3). Note that Arr[3]==5.
  }
- proc gpuScan(ref gpuArr: [] ?t) where isNumericType(t) && !isComplexType(t)¶
Calculates an exclusive prefix sum (scan) of an array on the GPU. The array must be in GPU-accessible memory and the function must be called from outside a GPU-eligible loop. Arrays of numeric types are supported. A simple example is the following:
  on here.gpus[0] {
    var Arr = [3, 2, 1, 5, 4]; // will be GPU-accessible
    gpuScan(Arr);
    writeln(Arr); // [0, 3, 5, 6, 11]
  }
- proc gpuSort(ref gpuInputArr: [] ?t)¶
Sort an array on the GPU. The array must be in GPU-accessible memory and the function must be called from outside a GPU-eligible loop. Only arrays of numeric types are supported. A simple example is the following:
  on here.gpus[0] {
    var Arr = [3, 2, 1, 5, 4] : uint; // will be GPU-accessible
    gpuSort(Arr);
    writeln(Arr); // [1, 2, 3, 4, 5]
  }
- proc deviceAttributes(loc)¶
- const CHPL_GPU_ATTRIBUTE__MAX_THREADS_PER_BLOCK : c_int¶
- const CHPL_GPU_ATTRIBUTE__MAX_BLOCK_DIM_X : c_int¶
- const CHPL_GPU_ATTRIBUTE__MAX_BLOCK_DIM_Y : c_int¶
- const CHPL_GPU_ATTRIBUTE__MAX_BLOCK_DIM_Z : c_int¶
- const CHPL_GPU_ATTRIBUTE__MAX_GRID_DIM_X : c_int¶
- const CHPL_GPU_ATTRIBUTE__MAX_GRID_DIM_Y : c_int¶
- const CHPL_GPU_ATTRIBUTE__MAX_GRID_DIM_Z : c_int¶
- const CHPL_GPU_ATTRIBUTE__MAX_SHARED_MEMORY_PER_BLOCK : c_int¶
- const CHPL_GPU_ATTRIBUTE__TOTAL_CONSTANT_MEMORY : c_int¶
- const CHPL_GPU_ATTRIBUTE__WARP_SIZE : c_int¶
- const CHPL_GPU_ATTRIBUTE__MAX_PITCH : c_int¶
- const CHPL_GPU_ATTRIBUTE__MAXIMUM_TEXTURE1D_WIDTH : c_int¶
- const CHPL_GPU_ATTRIBUTE__MAXIMUM_TEXTURE2D_WIDTH : c_int¶
- const CHPL_GPU_ATTRIBUTE__MAXIMUM_TEXTURE2D_HEIGHT : c_int¶
- const CHPL_GPU_ATTRIBUTE__MAXIMUM_TEXTURE3D_WIDTH : c_int¶
- const CHPL_GPU_ATTRIBUTE__MAXIMUM_TEXTURE3D_HEIGHT : c_int¶
- const CHPL_GPU_ATTRIBUTE__MAXIMUM_TEXTURE3D_DEPTH : c_int¶
- const CHPL_GPU_ATTRIBUTE__MAX_REGISTERS_PER_BLOCK : c_int¶
- const CHPL_GPU_ATTRIBUTE__CLOCK_RATE : c_int¶
- const CHPL_GPU_ATTRIBUTE__TEXTURE_ALIGNMENT : c_int¶
- const CHPL_GPU_ATTRIBUTE__TEXTURE_PITCH_ALIGNMENT : c_int¶
- const CHPL_GPU_ATTRIBUTE__MULTIPROCESSOR_COUNT : c_int¶
- const CHPL_GPU_ATTRIBUTE__KERNEL_EXEC_TIMEOUT : c_int¶
- const CHPL_GPU_ATTRIBUTE__INTEGRATED : c_int¶
- const CHPL_GPU_ATTRIBUTE__CAN_MAP_HOST_MEMORY : c_int¶
- const CHPL_GPU_ATTRIBUTE__COMPUTE_MODE : c_int¶
- const CHPL_GPU_ATTRIBUTE__PROCESS : c_int¶
- const CHPL_GPU_ATTRIBUTE__CONCURRENT_KERNELS : c_int¶
- const CHPL_GPU_ATTRIBUTE__ECC_ENABLED : c_int¶
- const CHPL_GPU_ATTRIBUTE__PCI_BUS_ID : c_int¶
- const CHPL_GPU_ATTRIBUTE__PCI_DEVICE_ID : c_int¶
- const CHPL_GPU_ATTRIBUTE__MEMORY_CLOCK_RATE : c_int¶
- const CHPL_GPU_ATTRIBUTE__GLOBAL_MEMORY_BUS_WIDTH : c_int¶
- const CHPL_GPU_ATTRIBUTE__L2_CACHE_SIZE : c_int¶
- const CHPL_GPU_ATTRIBUTE__MAX_THREADS_PER_MULTIPROCESSOR : c_int¶
- const CHPL_GPU_ATTRIBUTE__COMPUTE_CAPABILITY_MAJOR : c_int¶
- const CHPL_GPU_ATTRIBUTE__COMPUTE_CAPABILITY_MINOR : c_int¶
- const CHPL_GPU_ATTRIBUTE__MAX_SHARED_MEMORY_PER_MULTIPROCESSOR : c_int¶
- const CHPL_GPU_ATTRIBUTE__MANAGED_MEMORY : c_int¶
- const CHPL_GPU_ATTRIBUTE__MULTI_GPU_BOARD : c_int¶
- const CHPL_GPU_ATTRIBUTE__PAGEABLE_MEMORY_ACCESS : c_int¶
- const CHPL_GPU_ATTRIBUTE__CONCURRENT_MANAGED_ACCESS : c_int¶
- const CHPL_GPU_ATTRIBUTE__PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES : c_int¶
- const CHPL_GPU_ATTRIBUTE__DIRECT_MANAGED_MEM_ACCESS_FROM_HOST : c_int¶
- record DeviceAttributes¶
- var gpuId : int¶
- proc init(loc)¶
- proc name : string¶
- proc maxThreadsPerBlock : int¶
- proc maxBlockDimX : int¶
- proc maxBlockDimY : int¶
- proc maxBlockDimZ : int¶
- proc MaxGridDimX : int¶
- proc maxGridDimY : int¶
- proc maxGridDimZ : int¶
- proc totalConstantMemory : int¶
- proc warpSize : int¶
- proc maxPitch : int¶
- proc maximumTexture1dWidth : int¶
- proc maximumTexture2dWidth : int¶
- proc maximumTexture2dHeight : int¶
- proc maximumTexture3dWidth : int¶
- proc maximumTexture3dHeight : int¶
- proc maximumTexture3dDepth : int¶
- proc maxRegistersPerBlock : int¶
- proc clockRate : int¶
- proc textureAlignment : int¶
- proc texturePitch_alignment : int¶
- proc multiprocessorCount : int¶
- proc kernelExecTimeout : int¶
- proc integrated : int¶
- proc canMapHostMemory : int¶
- proc computeMode : int¶
- proc concurrentKernels : int¶
- proc eccEnabled : int¶
- proc pciBusId : int¶
- proc pciDeviceId : int¶
- proc memoryClockRate : int¶
- proc globalMemoryBusWidth : int¶
- proc l2CacheSize : int¶
- proc maxThreadsPerMultiprocessor : int¶
- proc computeCapabilityMajor : int¶
- proc computeCapabilityMinor : int¶
- proc managedMemory : int¶
- proc multiGpuBoard : int¶
- proc pageableMemoryAccess : int¶
- proc concurrentManagedAccess : int¶
- proc pageableMemoryAccessUsesHostPageTables : int¶
- proc directManagedMemAccessFromHost : int¶
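The per-device information exposed by deviceAttributes and the DeviceAttributes record can be queried from the host. A minimal sketch, assuming deviceAttributes(loc) yields a DeviceAttributes value for the given GPU sublocale:

  use GPU;

  const attrs = deviceAttributes(here.gpus[0]);
  writeln("device name:        ", attrs.name);
  writeln("warp size:          ", attrs.warpSize);
  writeln("multiprocessors:    ", attrs.multiprocessorCount);
  writeln("compute capability: ", attrs.computeCapabilityMajor, ".",
          attrs.computeCapabilityMinor);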