The library

Code and text by Sean Baxter, NVIDIA Research.

(Click here for license. Click here for contact information.)

The Library

To ease development MGPU includes a sample framework, defined in util/mgpucontext.h. At the start of your program create a CudaContext object. This encapsulates an event, a timer, a stream, and an allocator. Allocations made through this context are recycled after being freed, reducing calls to cudaMalloc.

include/util/mgpucontext.h

ContextPtr CreateCudaDevice(int ordinal);
ContextPtr CreateCudaDevice(int argc, char** argv, bool printInfo = false);

ContextPtr CreateCudaDeviceStream(int ordinal);
ContextPtr CreateCudaDeviceStream(int argc, char** argv, 
	bool printInfo = false);

Call CreateCudaDevice to create a context on the default stream or CreateCudaDeviceStream to create a context on the new stream. The (argc, argv) overloads parse the command-line arguments for a device ordinal. You can pass true for printInfo to print device attributes:

int main(int argc, char** argv) {
	ContextPtr context = CreateCudaDevice(argc, argv, true);
	return 0;
}

GeForce GTX 480 : 1401.000 Mhz   (Ordinal 1)
15 SMs enabled. Compute Capability sm_20
FreeMem:   1086MB   TotalMem:   1535MB.
Mem Clock: 1848.000 Mhz x 384 bits   (177.408 GB/s)
ECC Disabled

MGPU context and device objects are managed with the reference-counting pointer types ContextPtr and DevicePtr. MGPU-allocated memory is reference counted with intrusive_ptr< CudaDeviceMem<type> > which is bound to the MGPU_MEM(type) macro for ease of use.

#include "moderngpu.cuh"

using namespace mgpu;

int main(int argc, char** argv) {
	ContextPtr context = CreateCudaDevice(argc, argv);

	MGPU_MEM(uint) data = context->Malloc<uint>(1000);

	MGPU_MEM(int) a = context->FillAscending<int>(50, 0, 5);
	MGPU_MEM(float) b = context->GenRandom<float>(50, 0.0f, 10.0f);
	MGPU_MEM(double) c = context->SortRandom<double>(50, 0.0, 20.0);

	printf("A:\n");
	PrintArray(*a, "%6d", 10);

	printf("\nB:\n");
	PrintArray(*b, "%6.2lf", 10);

	printf("\nC:\n");
	PrintArray(*c, "%6.2lf", 10);

	return 0;
}

A:
    0:       0      5     10     15     20     25     30     35     40     45
   10:      50     55     60     65     70     75     80     85     90     95
   20:     100    105    110    115    120    125    130    135    140    145
   30:     150    155    160    165    170    175    180    185    190    195
   40:     200    205    210    215    220    225    230    235    240    245

B:
    0:    8.15   1.35   9.06   8.35   1.27   9.69   9.13   2.21   6.32   3.08
   10:    0.98   5.47   2.78   1.88   5.47   9.93   9.58   9.96   9.65   9.68
   20:    1.58   7.26   9.71   9.81   9.57   1.10   4.85   7.98   8.00   2.97
   30:    1.42   0.05   4.22   1.12   9.16   6.40   7.92   8.78   9.59   5.04
   40:    6.56   7.98   0.36   3.61   8.49   2.12   9.34   6.81   6.79   3.99

C:
    0:    0.64   0.69   0.73   0.92   1.02   1.94   2.50   2.52   2.98   3.42
   10:    3.48   3.74   4.20   5.54   6.04   6.33   6.34   7.63   7.84   8.17
   20:    8.44   8.77   8.91   9.16   9.50   9.75   9.80   9.81  12.93  13.11
   30:   13.27  13.90  14.12  14.19  14.81  14.86  15.09  15.15  15.28  15.31
   40:   15.88  15.90  15.95  16.15  16.44  16.47  17.45  18.42  19.00  19.88

CudaContext::Malloc allocates memory from its caching allocator. The class supports a variety of methods to fill device memory with data to accelerate testing and debugging. FillAscending, GenRandom, and SortRandom are demonstrated above. PrintArray prints CudaDeviceMem arrays to the console using printf-style format specifiers.

When MGPU_MEM-wrapped objects fall out of scope, the underlying device memory is inserted into a weighted least-recently-used cache. Subsequent queries check the pool and reuse allocations of a similar size before calling cudaMalloc. Once a program gets hot, the client can make small requests from CudaContext with high confidence that the call will return immediately.

Users can opt-out of the default caching allocator by deriving CudaAlloc and providing their own implementation, or simply by using CudaAllocSimple, which calls cudaFree immediately on device memory falling out of scope.

include/util/mgpucontext.h

class CudaAlloc : public CudaBase {
public:
	virtual cudaError_t Malloc(size_t size, void** p) = 0;
	virtual bool Free(void* p) = 0;
	virtual ~CudaAlloc() { }

	const CudaDevice& Device() const { return *_device; }
	CudaDevice& Device() { return *_device; }
protected:
	CudaAlloc(CudaDevice* device) : _device(device) { }
	DevicePtr _device;
};

CudaAlloc is an interface that defines two abstract methods for users to implement: Malloc allocates size bytes and returns the pointer in p. Free releases memory allocated by Malloc.

int main(int argc, char** argv) {
	ContextPtr context = CreateCudaDevice(argc, argv, true);
	AllocPtr standardAlloc(new CudaAllocSimple(&context->Device()));
	context->SetAllocator(standardAlloc);

Instantiate your allocator and associate it with the device context with CudaContext::SetAllocator. The provided caching allocator is not optimal for all applications; use the simple allocator to get back to a baseline.

int main(int argc, char** argv) {
	ContextPtr context = CreateCudaDevice(argc, argv, true);

	// Cast CudaAlloc* to CudaAllocBuckets*
	CudaAllocBuckets* buckets = dynamic_cast<CudaAllocBuckets*>
		(context->GetAllocator());

	// Set the capacity of the LRU cache to 500MB.
	buckets->SetCapacity(500000000);

You can set the capacity of the LRU cache dynamically. CudaContext::GetAllocator returns a CudaContext* pointer to the currently-selected allocator. Because we know it's a caching allocator, we use RTTI's dynamic_cast to retrieve a CudaAllocBuckets* pointer. We call SetCapacity with a request of 500MB to set the cache size. If the size of an allocation request plus the size of items allocated in the cache exceeds 500MB, the caching allocator frees the least-recently-used requests to make space for the new memory.

include/kernels/bulkremove.cuh

    KernelBulkRemove<Tuning><<<numBlocks, launch.x, 0, context.Stream()>>>(
		source_global, sourceCount, indices_global, indicesCount, 
		partitionsDevice->get(), dest_global);

The context object attempts to support CUDA streams in as non-obtrusive a manner as possible. All MGPU host functions take a CudaContext object by reference and pass the stream handle to the launch chevrons. This enqueues the kernel launch into the stream that attached to the context.

Some MGPU functions—namely reduce, join, and some variants of scan and vectorized sorted search—use cudaMemcpyDeviceToHost to move kernel outputs into host memory. This is a synchronizing function; it will cause the thread to wait on the transfer, preventing it from queueing launches on other streams. If this creates scheduling inefficiences, the programmer can split apart the host function, use cudaMemcpyAsync to asynchronously move data into CPU-pinned memory, and overlap scheduling of operations on other threads. This is an invasive and application-specific way to program, so it is not directly support by the MGPU library.

Load/store functions

MGPU functions are aggressively register-blocked. Register blocking amortizes per-CTA and per-thread costs by increasing the number of items processed per thread. To improve clarity and reduce errors, common routines for moving portions of data between memory spaces (global memory, shared memory, and register) have been factored into functions in the include/device/loadstore.cuh header.

The common template argument VT is the kernel's grain size; it specifies the number of values processed per thread. The argument NT is the number of threads in the CTA.

Most of these functions operate in strided order, in which elements are assigned to threads according to NT * i + tid, where i is the index of the element in the register and tid is the thread ID. Data should be loaded and stored in strided order, as this organizes warp transfers into cache lines, which maximizes data throughput.

Many MGPU algorithms work with data in thread order, in which elements are assigned to threads according to VT * tid + i. Each thread has access to VT consecutive elements which makes performing sequential operations like scan and merge very easy. However data should not be loaded or stored to global memory in thread order, as warp transfers would touch VT different cache lines, wasting memory bandwidth.

By choosing an odd number for VT we avoid bank conflicts that would otherwise be incurred when re-ordering data between strided and thread orders. Within a warp, all banks (VT * tid + i) % 32 are accessed exactly once for each step i when VT is odd. If VT is a power-of-two, you can expect VT-way conflicts at each step.

Load/store function prototypes are found in mgpudevice.cuh. Most functions have names matching the pattern Device[Source]To[Dest]:

include/mgpudevice.cuh

// For 0 <= i < VT: 
//		index = NT * i + tid;
//		if(index < count) reg[i] = data[index];
// Synchronize after load.
template<int NT, int VT, typename InputIt, typename T>
MGPU_DEVICE void DeviceSharedToReg(int count, InputIt data, int tid, 
	T* reg, bool sync = true);

Functions of this form are parameterized over NT and VT arguments—these are typically communicated to the kernel using the LaunchBox mechanism. The first argument is the count of items to move across the entire CTA. If NT * VT == count, an optimized implementation may be used which eliminates per-item predication to reduce latency and promote parallelism. The second argument is the source data, and its memory space should match the [Source] part of the function name. The third argument is the thread ID. The fourth argument is the destination data and its memory space should match the [Dest] part of the function name. The final argument is used to hit a __syncthreads after the operation. Data movement functions with Shared in the name synchronize by default; other functions do not.

Data can be loaded from shared memory into registers in thread order with DeviceSharedToThread. Data can be stored to shared from registers in thread order with DeviceThreadToShared. A common practice is to:

Regimented application of these utility functions to move data between global memory, shared memory, and register helps highlight the novel aspects of the kernel (the stuff in step 3).

Framework

Load/store functions