GPU

Usage

use GPU;

or

import GPU;

Provides utility functions for working with GPUs.

Warning

This module is unstable and its interface is subject to change in the future.

GPU support is a relatively new feature to Chapel and is under active development.

For the most up-to-date information about GPU support see the technical note about it.

attribute @gpu.assertEligible

This attribute can be applied to loops to ensure that they are eligible for GPU execution. Unlike @assertOnGpu, this attribute has no execution-time effect. It only asserts that the code could be executed on the GPU, and not that it will be executed.

@gpu.assertEligible
foreach i in 1..128 { /* ... */ }

// variable version (applies to loop expressions and promoted expressions)
@gpu.assertEligible
var A = (foreach i in 1..128 do i*i) + 1;

config param silenceAssertOnGpuWarning = false

This configuration parameter is used to disable warnings that are emitted when @assertOnGpu is used in a non-GPU compilation. Since @assertOnGpu’s execution-time semantics are to halt execution if it is not on the GPU, it will always halt execution when the program is not compiled for the GPU. This is likely an issue, so the warning is emitted by default. However, in case the user is aware of this and wants to silence the warning, they can set this configuration parameter to true.

attribute @assertOnGpu

This attribute can be applied to loops to ensure that they are executed on the GPU. It has the effect of @gpu.assertEligible, halting compilation if the construct it is applied to cannot be executed on the GPU. In addition, this attribute causes an execution-time check to be performed when it is reached, ensuring that the code is executed on the GPU.

@assertOnGpu
foreach i in 1..128 { /* ... */ }

// variable version (applies to loop expressions and promoted expressions)
@assertOnGpu
var A = (foreach i in 1..128 do i*i) + 1;

attribute @gpu.blockSize(blockSize: integral)

This attribute can be applied to loops to specify the GPU block size to use when executing the loop on the GPU.

// loop version
@gpu.blockSize(64)
foreach i in 1..128 { /* ... */ }

// variable version (applies to loop expressions and promoted expressions)
@gpu.blockSize(64)
var A = (foreach i in 1..128 do i*i) + 1;

attribute @gpu.itersPerThread(itersPerThread: integral, param cyclic: bool = false)

This attribute requests that the kernel execute each group of itersPerThread consecutive iterations of the affected loop sequentially within the same GPU thread. Users must ensure that the arguments to this attribute are positive.

// loop version
@gpu.itersPerThread(4)
foreach i in 1..128 { /* ... */ }

// variable version (applies to loop expressions and promoted expressions)
@gpu.itersPerThread(4)
var A = (foreach i in 1..128 do i*i) + 1;

Specifying the cyclic argument to be true distributes the iterations across GPU threads in cyclic fashion instead of the default block discipline.
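To make the two disciplines concrete, the following sketch models which thread would own each iteration. The helper names blockOwner and cyclicOwner are hypothetical and not part of the GPU module, and the round-robin formula used for the cyclic case is an assumption about how iterations are dealt out. The code runs on the CPU and launches no kernel.

```chapel
// Hypothetical helpers (not part of the GPU module) that model which GPU
// thread owns a zero-based iteration index under each discipline.

// Block discipline: each run of `itersPerThread` consecutive iterations
// belongs to one thread.
proc blockOwner(iter: int, itersPerThread: int): int {
  return iter / itersPerThread;
}

// Cyclic discipline (assumed round-robin): iterations are dealt out one at
// a time across `numThreads` threads.
proc cyclicOwner(iter: int, numThreads: int): int {
  return iter % numThreads;
}

const numIters = 8, perThread = 4;
const numThreads = (numIters + perThread - 1) / perThread; // ceiling division

for i in 0..<numIters do
  writeln("iteration ", i,
          ": block -> thread ", blockOwner(i, perThread),
          ", cyclic -> thread ", cyclicOwner(i, numThreads));
```

With 8 iterations and 4 per thread, the block model assigns iterations 0 through 3 to thread 0, while the assumed cyclic model assigns it iterations 0, 2, 4, and 6.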

proc gpuWrite(const args ...?k)

This function is intended to be called from within a GPU kernel and is useful for debugging purposes.

Currently using write to send output to stdout will make a loop ineligible for GPU execution; use gpuWrite instead.

Currently this function will only work if values of type c_ptrConst(c_char) are passed.

On NVIDIA GPUs the written values will be flushed to the terminal after the kernel has finished executing. Note that there is a 1MB limit on the size of this buffer.

proc gpuWriteln(const args ...?k)

Pass arguments to gpuWrite and follow with a newline.

proc gpuClock() : uint

Returns the value of a per-multiprocessor counter that increments every clock cycle. This function is meant to be called to time sections of code within a GPU-enabled loop.

proc gpuClocksPerSec(devNum: int)

Returns the number of clock cycles per second of a GPU multiprocessor. Note: currently we don’t support calling this function from within a kernel.
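The two clock functions can be combined to estimate elapsed time for part of a kernel. The sketch below is a minimal illustration under a few assumptions: the program is built with GPU support, at least one GPU is present, and device number 0 corresponds to here.gpus[0]. The names clockDiff, A, and elapsedSeconds are chosen for this example only.

```chapel
use GPU;

var elapsedSeconds: real;

on here.gpus[0] {
  var clockDiff: [0..0] uint;   // allocated in GPU-accessible memory
  var A: [0..<16] int;

  foreach i in 0..<16 {
    const start = gpuClock();
    for j in 0..<1000 do A[i] += i * j;          // work being timed
    const stop = gpuClock();
    if i == 0 then clockDiff[0] = stop - start;  // record one thread's count
  }

  // Called outside the kernel, as the note above requires.
  elapsedSeconds = (clockDiff[0]: real) / (gpuClocksPerSec(0): real);
}

writeln("elapsed: ", elapsedSeconds, " seconds");
```

Only one thread records its cycle count here, because the counter is per-multiprocessor and differences taken on different multiprocessors are not directly comparable.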

proc syncThreads()

Synchronize threads within a GPU block.

proc syncWarp(mask: uint(32) = 0xffffffff)

Causes the executing thread to wait until all warp lanes named in mask have executed a syncWarp() (with the same mask) before resuming execution. Each calling thread must have its own bit set in the mask and all non-exited threads named in mask must execute a corresponding syncWarp() with the same mask, or the result is undefined.

proc createSharedArray(type eltType, param size) : c_ptr(eltType)

Allocate block shared memory, enough to store size elements of eltType. Returns a CTypes.c_ptr to the allocated array. Note that although every thread in a block calls this procedure, the same shared array is returned to all of them.

Arguments:
  • eltType – the type of elements to allocate the array for.

  • size – the number of elements in each GPU thread block’s copy of the array.
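A short sketch of how createSharedArray and syncThreads might be used together: each thread writes one slot of the shared array, the block synchronizes, and each thread then reads a slot written by a different thread. It assumes a GPU-enabled build, a loop small enough to fit in a single block, and that iteration i is executed by thread i of that block. The names threadsPerBlock, smem, and host are illustrative.

```chapel
use GPU, CTypes;

param threadsPerBlock = 64;    // a param, because createSharedArray needs a compile-time size
var host: [0..<threadsPerBlock] int;

on here.gpus[0] {
  var A: [0..<threadsPerBlock] int;   // GPU-accessible result array

  @gpu.blockSize(threadsPerBlock)
  foreach i in 0..<threadsPerBlock {
    // Every thread in the block receives the same shared array.
    var smem = createSharedArray(int, threadsPerBlock);
    smem[i] = i;
    syncThreads();             // wait until all threads have written their slot
    A[i] = smem[threadsPerBlock - 1 - i];
  }

  host = A;                    // copy the result back to the host
}

writeln(host[0], " ", host[threadsPerBlock - 1]);
```

Under the stated assumptions the kernel reverses the sequence 0..63, so on a GPU the first element of host should hold 63 and the last should hold 0. Without the syncThreads call a thread could read a slot before its owner has written it.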

proc gpuAtomicAdd(ref x: ?T, val: T) : T

When run on a GPU, atomically add ‘val’ to ‘x’ and store the result in ‘x’. The operation returns the old value of x.
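As a minimal sketch, many threads can accumulate into a single location with gpuAtomicAdd. The example assumes a GPU-enabled build and uses uint(32) as a conservative element type; which other types are accepted is not stated here. The names counter and total are illustrative.

```chapel
use GPU;

var total: uint(32);

on here.gpus[0] {
  var counter: [0..0] uint(32);   // one-element accumulator in GPU memory

  foreach i in 1..128 {
    // All iterations update the same location; the atomic makes this safe.
    gpuAtomicAdd(counter[0], i: uint(32));
  }

  total = counter[0];
}

writeln(total);   // 1 + 2 + ... + 128 = 8256
```

The return value (the old value of the location) is ignored here; it is useful when each thread needs a unique offset, for example when appending to a shared buffer.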

proc gpuAtomicSub(ref x: ?T, val: T) : T

When run on a GPU, atomically subtract ‘val’ from ‘x’ and store the result in ‘x’. The operation returns the old value of x.

proc gpuAtomicExch(ref x: ?T, val: T) : T

When run on a GPU, atomically exchange the value stored in ‘x’ with ‘val’. The operation returns the old value of x.

proc gpuAtomicMin(ref x: ?T, val: T) : T

When run on a GPU, atomically compare ‘x’ and ‘val’ and store the minimum in ‘x’. The operation returns the old value of x.

proc gpuAtomicMax(ref x: ?T, val: T) : T

When run on a GPU, atomically compare ‘x’ and ‘val’ and store the maximum in ‘x’. The operation returns the old value of x.

proc gpuAtomicInc(ref x: ?T, val: T) : T

When run on a GPU, atomically increment ‘x’ by 1 if the original value of ‘x’ is less than ‘val’ and store the result in ‘x’. Otherwise (when ‘x’ is greater than or equal to ‘val’) ‘x’ is set to 0. The operation returns the old value of x.

proc gpuAtomicDec(ref x: ?T, val: T) : T

When run on a GPU, atomically determine whether ‘x’ equals 0 or is greater than ‘val’. If so, store ‘val’ in ‘x’; otherwise decrement ‘x’ by 1. The operation returns the old value of x.
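The wrap-around rule is easier to see in a small model. The procedure below is a hypothetical, non-atomic CPU stand-in named modelAtomicDec; it is not part of the GPU module and only mirrors the update rule: store ‘val’ when ‘x’ is 0 or greater than ‘val’, otherwise decrement by 1.

```chapel
// Hypothetical, non-atomic CPU model of the update rule; for illustration only.
proc modelAtomicDec(ref x: uint(32), val: uint(32)): uint(32) {
  const old = x;
  if old == 0 || old > val {
    x = val;        // wrap around to the upper bound
  } else {
    x = old - 1;    // ordinary decrement
  }
  return old;
}

var x: uint(32) = 2;
const bound: uint(32) = 3;
for i in 1..4 do
  write(modelAtomicDec(x, bound), " ");   // old values: 2 1 0 3
writeln();
```

Repeated calls therefore count down to 0 and then restart from the bound, which makes the operation suitable for cyclic counters.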

proc gpuAtomicAnd(ref x: ?T, val: T) : T

When run on a GPU, atomically perform a bitwise ‘and’ operation on ‘x’ and ‘val’ and store the result in ‘x’. The operation returns the old value of x.

proc gpuAtomicOr(ref x: ?T, val: T) : T

When run on a GPU, atomically perform a bitwise ‘or’ operation on ‘x’ and ‘val’ and store the result in ‘x’. The operation returns the old value of x.

proc gpuAtomicXor(ref x: ?T, val: T) : T

When run on a GPU, atomically perform a bitwise ‘xor’ operation on ‘x’ and ‘val’ and store the result in ‘x’. The operation returns the old value of x.

proc gpuAtomicCAS(ref x: ?T, cmp: T, val: T) : T

When run on a GPU, atomically compare the value in ‘x’ with ‘cmp’ and, if they are equal, store ‘val’ in ‘x’. The operation returns the old value of x.
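A common use of compare-and-swap is letting exactly one thread claim a resource. The sketch below assumes a GPU-enabled build and uses uint(32) as a conservative element type; slot, wins, and claims are names chosen for this example.

```chapel
use GPU;

var claims: uint(32);

on here.gpus[0] {
  var slot: [0..0] uint(32);     // 0 means "unclaimed"
  var wins: [0..0] uint(32);     // counts successful claims

  foreach i in 1..128 {
    // Try to replace 0 with this iteration's id; only one thread can succeed.
    const old = gpuAtomicCAS(slot[0], 0: uint(32), i: uint(32));
    if old == 0 then gpuAtomicAdd(wins[0], 1: uint(32));
  }

  claims = wins[0];
}

writeln(claims);   // exactly one compare-and-swap observes the initial 0
```

Which iteration wins is not deterministic, but because every id is nonzero, every later attempt sees a nonzero old value and leaves the slot unchanged.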

proc gpuSumReduce(const ref A: [] ?t)

Add all elements of an array together on the GPU (that is, perform a sum-reduction). The array must be in GPU-accessible memory and the function must be called from outside a GPU-eligible loop. Only arrays with int, uint, and real types are supported. A simple example is the following:

on here.gpus[0] {
  var Arr = [3, 2, 1, 5, 4]; // will be GPU-accessible
  writeln(gpuSumReduce(Arr)); // 15
}

proc gpuMinReduce(const ref A: [] ?t)

Return the minimum element of an array on the GPU (that is, perform a min-reduction). The array must be in GPU-accessible memory and the function must be called from outside a GPU-eligible loop. Only arrays with int, uint, and real types are supported. A simple example is the following:

on here.gpus[0] {
  var Arr = [3, 2, 1, 5, 4]; // will be GPU-accessible
  writeln(gpuMinReduce(Arr)); // 1
}

proc gpuMaxReduce(const ref A: [] ?t)

Return the maximum element of an array on the GPU (that is, perform a max-reduction). The array must be in GPU-accessible memory and the function must be called from outside a GPU-eligible loop. Only arrays with int, uint, and real types are supported. A simple example is the following:

on here.gpus[0] {
  var Arr = [3, 2, 1, 5, 4]; // will be GPU-accessible
  writeln(gpuMaxReduce(Arr)); // 5
}

proc gpuMinLocReduce(const ref A: [] ?t)

For an array on the GPU, return a tuple with the value and the index of the minimum element (that is, perform a minloc-reduction). If there are multiple elements with the same minimum value, the index of the first one is returned. The array must be in GPU-accessible memory and the function must be called from outside a GPU-eligible loop. Only arrays with int, uint, and real types are supported. A simple example is the following:

on here.gpus[0] {
  var Arr = [3, 2, 1, 5, 4]; // will be GPU-accessible
  writeln(gpuMinLocReduce(Arr)); // (1, 2). Note that Arr[2]==1.
}

proc gpuMaxLocReduce(const ref A: [] ?t)

For an array on the GPU, return a tuple with the value and the index of the maximum element (that is, perform a maxloc-reduction). If there are multiple elements with the same maximum value, the index of the first one is returned. The array must be in GPU-accessible memory and the function must be called from outside a GPU-eligible loop. Only arrays with int, uint, and real types are supported. A simple example is the following:

on here.gpus[0] {
  var Arr = [3, 2, 1, 5, 4]; // will be GPU-accessible
  writeln(gpuMaxLocReduce(Arr)); // (5, 3). Note that Arr[3]==5.
}

proc gpuScan(ref gpuArr: [] ?t)  where isNumericType(t) && !isComplexType(t)

Calculates an exclusive prefix sum (scan) of an array on the GPU. The array must be in GPU-accessible memory and the function must be called from outside a GPU-eligible loop. Arrays of non-complex numeric types are supported. A simple example is the following:

on here.gpus[0] {
  var Arr = [3, 2, 1, 5, 4]; // will be GPU-accessible
  gpuScan(Arr);
  writeln(Arr); // [0, 3, 5, 6, 11]
}

proc gpuSort(ref gpuInputArr: [] ?t)

Sort an array on the GPU. The array must be in GPU-accessible memory and the function must be called from outside a GPU-eligible loop. Only arrays of numeric types are supported. A simple example is the following:

on here.gpus[0] {
  var Arr = [3, 2, 1, 5, 4] : uint; // will be GPU-accessible
  gpuSort(Arr);
  writeln(Arr); // [1, 2, 3, 4, 5]
}

proc deviceAttributes(loc)
const CHPL_GPU_ATTRIBUTE__MAX_THREADS_PER_BLOCK : c_int
const CHPL_GPU_ATTRIBUTE__MAX_BLOCK_DIM_X : c_int
const CHPL_GPU_ATTRIBUTE__MAX_BLOCK_DIM_Y : c_int
const CHPL_GPU_ATTRIBUTE__MAX_BLOCK_DIM_Z : c_int
const CHPL_GPU_ATTRIBUTE__MAX_GRID_DIM_X : c_int
const CHPL_GPU_ATTRIBUTE__MAX_GRID_DIM_Y : c_int
const CHPL_GPU_ATTRIBUTE__MAX_GRID_DIM_Z : c_int
const CHPL_GPU_ATTRIBUTE__MAX_SHARED_MEMORY_PER_BLOCK : c_int
const CHPL_GPU_ATTRIBUTE__TOTAL_CONSTANT_MEMORY : c_int
const CHPL_GPU_ATTRIBUTE__WARP_SIZE : c_int
const CHPL_GPU_ATTRIBUTE__MAX_PITCH : c_int
const CHPL_GPU_ATTRIBUTE__MAXIMUM_TEXTURE1D_WIDTH : c_int
const CHPL_GPU_ATTRIBUTE__MAXIMUM_TEXTURE2D_WIDTH : c_int
const CHPL_GPU_ATTRIBUTE__MAXIMUM_TEXTURE2D_HEIGHT : c_int
const CHPL_GPU_ATTRIBUTE__MAXIMUM_TEXTURE3D_WIDTH : c_int
const CHPL_GPU_ATTRIBUTE__MAXIMUM_TEXTURE3D_HEIGHT : c_int
const CHPL_GPU_ATTRIBUTE__MAXIMUM_TEXTURE3D_DEPTH : c_int
const CHPL_GPU_ATTRIBUTE__MAX_REGISTERS_PER_BLOCK : c_int
const CHPL_GPU_ATTRIBUTE__CLOCK_RATE : c_int
const CHPL_GPU_ATTRIBUTE__TEXTURE_ALIGNMENT : c_int
const CHPL_GPU_ATTRIBUTE__TEXTURE_PITCH_ALIGNMENT : c_int
const CHPL_GPU_ATTRIBUTE__MULTIPROCESSOR_COUNT : c_int
const CHPL_GPU_ATTRIBUTE__KERNEL_EXEC_TIMEOUT : c_int
const CHPL_GPU_ATTRIBUTE__INTEGRATED : c_int
const CHPL_GPU_ATTRIBUTE__CAN_MAP_HOST_MEMORY : c_int
const CHPL_GPU_ATTRIBUTE__COMPUTE_MODE : c_int
const CHPL_GPU_ATTRIBUTE__PROCESS : c_int
const CHPL_GPU_ATTRIBUTE__CONCURRENT_KERNELS : c_int
const CHPL_GPU_ATTRIBUTE__ECC_ENABLED : c_int
const CHPL_GPU_ATTRIBUTE__PCI_BUS_ID : c_int
const CHPL_GPU_ATTRIBUTE__PCI_DEVICE_ID : c_int
const CHPL_GPU_ATTRIBUTE__MEMORY_CLOCK_RATE : c_int
const CHPL_GPU_ATTRIBUTE__GLOBAL_MEMORY_BUS_WIDTH : c_int
const CHPL_GPU_ATTRIBUTE__L2_CACHE_SIZE : c_int
const CHPL_GPU_ATTRIBUTE__MAX_THREADS_PER_MULTIPROCESSOR : c_int
const CHPL_GPU_ATTRIBUTE__COMPUTE_CAPABILITY_MAJOR : c_int
const CHPL_GPU_ATTRIBUTE__COMPUTE_CAPABILITY_MINOR : c_int
const CHPL_GPU_ATTRIBUTE__MAX_SHARED_MEMORY_PER_MULTIPROCESSOR : c_int
const CHPL_GPU_ATTRIBUTE__MANAGED_MEMORY : c_int
const CHPL_GPU_ATTRIBUTE__MULTI_GPU_BOARD : c_int
const CHPL_GPU_ATTRIBUTE__PAGEABLE_MEMORY_ACCESS : c_int
const CHPL_GPU_ATTRIBUTE__CONCURRENT_MANAGED_ACCESS : c_int
const CHPL_GPU_ATTRIBUTE__PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES : c_int
const CHPL_GPU_ATTRIBUTE__DIRECT_MANAGED_MEM_ACCESS_FROM_HOST : c_int
record DeviceAttributes
var gpuId : int
proc init(loc)
proc name : string
proc maxThreadsPerBlock : int
proc maxBlockDimX : int
proc maxBlockDimY : int
proc maxBlockDimZ : int
proc MaxGridDimX : int
proc maxGridDimY : int
proc maxGridDimZ : int
proc maxSharedMemoryPerBlock : int
proc totalConstantMemory : int
proc warpSize : int
proc maxPitch : int
proc maximumTexture1dWidth : int
proc maximumTexture2dWidth : int
proc maximumTexture2dHeight : int
proc maximumTexture3dWidth : int
proc maximumTexture3dHeight : int
proc maximumTexture3dDepth : int
proc maxRegistersPerBlock : int
proc clockRate : int
proc textureAlignment : int
proc texturePitch_alignment : int
proc multiprocessorCount : int
proc kernelExecTimeout : int
proc integrated : int
proc canMapHostMemory : int
proc computeMode : int
proc concurrentKernels : int
proc eccEnabled : int
proc pciBusId : int
proc pciDeviceId : int
proc memoryClockRate : int
proc globalMemoryBusWidth : int
proc l2CacheSize : int
proc maxThreadsPerMultiprocessor : int
proc computeCapabilityMajor : int
proc computeCapabilityMinor : int
proc maxSharedMemoryPerMultiprocessor : int
proc managedMemory : int
proc multiGpuBoard : int
proc pageableMemoryAccess : int
proc concurrentManagedAccess : int
proc pageableMemoryAccessUsesHostPageTables : int
proc directManagedMemAccessFromHost : int
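The listing above carries no descriptions, so the following is only a sketch of how the record might be queried. It assumes a GPU-enabled build with at least one GPU, and that the loc argument of init accepts a GPU sublocale such as here.gpus[0]. Only accessors that appear in the listing are used.

```chapel
use GPU;

// Assumption: `loc` may be a GPU sublocale of the current locale.
const attrs = new DeviceAttributes(here.gpus[0]);

writeln("device name:           ", attrs.name);
writeln("warp size:             ", attrs.warpSize);
writeln("max threads per block: ", attrs.maxThreadsPerBlock);
writeln("multiprocessors:       ", attrs.multiprocessorCount);
```

Values such as maxThreadsPerBlock can help choose an argument for @gpu.blockSize that the device supports.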