© 2013, NVIDIA CORPORATION. All rights reserved.
Code and text by Sean Baxter, NVIDIA Research.
To ease development, MGPU includes a sample framework defined in util/mgpucontext.h. At the start of your program, create a CudaContext object. This encapsulates an event, a timer, a stream, and an allocator. Allocations made through this context are recycled after being freed, reducing calls to cudaMalloc.
ContextPtr CreateCudaDevice(int ordinal);
ContextPtr CreateCudaDevice(int argc, char** argv, bool printInfo = false);
ContextPtr CreateCudaDeviceStream(int ordinal);
ContextPtr CreateCudaDeviceStream(int argc, char** argv, bool printInfo = false);
Call CreateCudaDevice to create a context on the default stream or CreateCudaDeviceStream to create a context on a new stream. The (argc, argv) overloads parse the command-line arguments for a device ordinal. Pass true for printInfo to print device attributes:
int main(int argc, char** argv) {
	ContextPtr context = CreateCudaDevice(argc, argv, true);
	return 0;
}
GeForce GTX 480 : 1401.000 Mhz   (Ordinal 1)
15 SMs enabled. Compute Capability sm_20
FreeMem: 1086MB   TotalMem: 1535MB.
Mem Clock: 1848.000 Mhz x 384 bits   (177.408 GB/s)
ECC Disabled
MGPU context and device objects are managed with the reference-counting pointer types ContextPtr and DevicePtr. MGPU-allocated memory is reference counted with intrusive_ptr<CudaDeviceMem<type> >, which is bound to the MGPU_MEM(type) macro for ease of use.
#include "moderngpu.cuh" using namespace mgpu; int main(int argc, char** argv) { ContextPtr context = CreateCudaDevice(argc, argv); MGPU_MEM(uint) data = context->Malloc<uint>(1000); MGPU_MEM(int) a = context->FillAscending<int>(50, 0, 5); MGPU_MEM(float) b = context->GenRandom<float>(50, 0.0f, 10.0f); MGPU_MEM(double) c = context->SortRandom<double>(50, 0.0, 20.0); printf("A:\n"); PrintArray(*a, "%6d", 10); printf("\nB:\n"); PrintArray(*b, "%6.2lf", 10); printf("\nC:\n"); PrintArray(*c, "%6.2lf", 10); return 0; }
A:
0: 0 5 10 15 20 25 30 35 40 45
10: 50 55 60 65 70 75 80 85 90 95
20: 100 105 110 115 120 125 130 135 140 145
30: 150 155 160 165 170 175 180 185 190 195
40: 200 205 210 215 220 225 230 235 240 245

B:
0: 8.15 1.35 9.06 8.35 1.27 9.69 9.13 2.21 6.32 3.08
10: 0.98 5.47 2.78 1.88 5.47 9.93 9.58 9.96 9.65 9.68
20: 1.58 7.26 9.71 9.81 9.57 1.10 4.85 7.98 8.00 2.97
30: 1.42 0.05 4.22 1.12 9.16 6.40 7.92 8.78 9.59 5.04
40: 6.56 7.98 0.36 3.61 8.49 2.12 9.34 6.81 6.79 3.99

C:
0: 0.64 0.69 0.73 0.92 1.02 1.94 2.50 2.52 2.98 3.42
10: 3.48 3.74 4.20 5.54 6.04 6.33 6.34 7.63 7.84 8.17
20: 8.44 8.77 8.91 9.16 9.50 9.75 9.80 9.81 12.93 13.11
30: 13.27 13.90 14.12 14.19 14.81 14.86 15.09 15.15 15.28 15.31
40: 15.88 15.90 15.95 16.15 16.44 16.47 17.45 18.42 19.00 19.88
CudaContext::Malloc allocates memory from its caching allocator. The class supports a variety of methods to fill device memory with data to accelerate testing and debugging. FillAscending, GenRandom, and SortRandom are demonstrated above. PrintArray prints CudaDeviceMem arrays to the console using printf-style format specifiers.
When MGPU_MEM-wrapped objects fall out of scope, the underlying device memory is inserted into a weighted least-recently-used cache. Subsequent requests check this pool and reuse allocations of a similar size before calling cudaMalloc. Once a program gets hot, the client can make small requests from CudaContext with high confidence that the call will return immediately.
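A minimal sketch of how this plays out in practice (the sizes and variable names here are purely illustrative):

#include "moderngpu.cuh"

using namespace mgpu;

int main(int argc, char** argv) {
	ContextPtr context = CreateCudaDevice(argc, argv);

	{
		// The first request of this size goes through cudaMalloc.
		MGPU_MEM(float) x = context->Malloc<float>(1 << 20);
	}	// x falls out of scope here; its storage moves into the LRU cache.

	// A request of similar size should now be served from the cache rather
	// than triggering another cudaMalloc.
	MGPU_MEM(float) y = context->Malloc<float>(1 << 20);
	return 0;
}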
Users can opt out of the default caching allocator by deriving CudaAlloc and providing their own implementation, or simply by using CudaAllocSimple, which calls cudaFree immediately when device memory falls out of scope.
class CudaAlloc : public CudaBase {
public:
	virtual cudaError_t Malloc(size_t size, void** p) = 0;
	virtual bool Free(void* p) = 0;
	virtual ~CudaAlloc() { }

	const CudaDevice& Device() const { return *_device; }
	CudaDevice& Device() { return *_device; }

protected:
	CudaAlloc(CudaDevice* device) : _device(device) { }
	DevicePtr _device;
};
CudaAlloc is an interface that defines two abstract methods for users to implement: Malloc allocates size bytes and returns the pointer in p. Free releases memory allocated by Malloc.
int main(int argc, char** argv) {
	ContextPtr context = CreateCudaDevice(argc, argv, true);

	AllocPtr standardAlloc(new CudaAllocSimple(&context->Device()));
	context->SetAllocator(standardAlloc);
	return 0;
}
Instantiate your allocator and associate it with the device context by calling CudaContext::SetAllocator. The provided caching allocator is not optimal for all applications; use the simple allocator to get back to a baseline.
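As a sketch of the first option (deriving CudaAlloc), a user-defined allocator only needs to implement the two abstract methods. The class name LoggingAlloc below is hypothetical; it simply forwards to cudaMalloc and cudaFree while logging each request:

// Hypothetical allocator that forwards directly to the CUDA runtime and
// logs each request. Derives from the CudaAlloc interface shown above.
class LoggingAlloc : public CudaAlloc {
public:
	explicit LoggingAlloc(CudaDevice* device) : CudaAlloc(device) { }

	virtual cudaError_t Malloc(size_t size, void** p) {
		printf("Allocating %llu bytes.\n", (unsigned long long)size);
		return cudaMalloc(p, size);
	}
	virtual bool Free(void* p) {
		printf("Freeing %p.\n", p);
		return cudaSuccess == cudaFree(p);
	}
};

Attach it the same way as CudaAllocSimple: AllocPtr alloc(new LoggingAlloc(&context->Device())); context->SetAllocator(alloc);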
int main(int argc, char** argv) {
	ContextPtr context = CreateCudaDevice(argc, argv, true);

	// Cast CudaAlloc* to CudaAllocBuckets*.
	CudaAllocBuckets* buckets = dynamic_cast<CudaAllocBuckets*>(
		context->GetAllocator());

	// Set the capacity of the LRU cache to 500MB.
	buckets->SetCapacity(500000000);
	return 0;
}
You can set the capacity of the LRU cache dynamically. CudaContext::GetAllocator returns a CudaAlloc* pointer to the currently-selected allocator. Because we know it's a caching allocator, we use RTTI's dynamic_cast to retrieve a CudaAllocBuckets* pointer. We call SetCapacity with a request of 500MB to set the cache size. If the size of an allocation request plus the size of items already held in the cache exceeds 500MB, the caching allocator frees the least-recently-used allocations to make space for the new memory.
include/kernels/bulkremove.cuh
KernelBulkRemove<Tuning><<<numBlocks, launch.x, 0, context.Stream()>>>(
	source_global, sourceCount, indices_global, indicesCount,
	partitionsDevice->get(), dest_global);
The context object attempts to support CUDA streams in as non-obtrusive a manner as possible. All MGPU host functions take a CudaContext object by reference and pass the stream handle to the launch chevrons. This enqueues the kernel launch into the stream attached to the context.
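Client code can use the same plumbing. The following is only a sketch; MyKernel and its doubling operation are placeholders, but the context creation and stream handling follow the API above:

#include "moderngpu.cuh"

using namespace mgpu;

// Placeholder kernel; any user kernel can be launched on the context's stream.
__global__ void MyKernel(const int* input_global, int* output_global,
	int count) {

	int gid = blockIdx.x * blockDim.x + threadIdx.x;
	if(gid < count) output_global[gid] = 2 * input_global[gid];
}

int main(int argc, char** argv) {
	// Create a context attached to a newly-created stream on device 0.
	ContextPtr context = CreateCudaDeviceStream(0);

	int count = 1000;
	MGPU_MEM(int) input = context->FillAscending<int>(count, 0, 1);
	MGPU_MEM(int) output = context->Malloc<int>(count);

	// Pass the context's stream through the launch chevrons, as the MGPU
	// host functions do internally.
	int numBlocks = (count + 255) / 256;
	MyKernel<<<numBlocks, 256, 0, context->Stream()>>>(input->get(),
		output->get(), count);
	return 0;
}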
Some MGPU functions (namely reduce, join, and some variants of scan and vectorized sorted search) use cudaMemcpyDeviceToHost to move kernel outputs into host memory. This is a synchronizing copy; it causes the calling thread to wait on the transfer, preventing it from queueing launches on other streams. If this creates scheduling inefficiencies, the programmer can split apart the host function, use cudaMemcpyAsync to asynchronously move data into CPU-pinned memory, and overlap scheduling of operations on other threads. This is an invasive and application-specific way to program, so it is not directly supported by the MGPU library.
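A rough sketch of that alternative, assuming the result is a single int written to device memory by earlier kernel launches (the variable names are illustrative):

// Copy a device-side result into CPU-pinned memory without blocking the
// calling thread, then synchronize on the stream only when the value is
// actually needed.
int* total_host;
cudaMallocHost((void**)&total_host, sizeof(int));    // pinned allocation

MGPU_MEM(int) total_device = context->Malloc<int>(1);
// ... launch the kernels that write their result into total_device ...

cudaMemcpyAsync(total_host, total_device->get(), sizeof(int),
	cudaMemcpyDeviceToHost, context->Stream());

// The thread is free to queue launches on other streams here.

cudaStreamSynchronize(context->Stream());            // wait before reading
printf("Total: %d\n", *total_host);
cudaFreeHost(total_host);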
MGPU functions are aggressively register-blocked. Register blocking amortizes per-CTA and per-thread costs by increasing the number of items processed per thread. To improve clarity and reduce errors, common routines for moving portions of data between memory spaces (global memory, shared memory, and register) have been factored into functions in the include/device/loadstore.cuh header.
The common template argument VT is the kernel's grain size; it specifies the number of values processed per thread. The argument NT is the number of threads in the CTA.
Most of these functions operate in strided order, in which elements are assigned to threads according to NT * i + tid, where i is the index of the element in the register and tid is the thread ID. Data should be loaded and stored in strided order, as this organizes warp transfers into cache lines, which maximizes data throughput.
Many MGPU algorithms work with data in thread order, in which elements are assigned to threads according to VT * tid + i. Each thread has access to VT consecutive elements, which makes performing sequential operations like scan and merge very easy. However, data should not be loaded from or stored to global memory in thread order, as warp transfers would touch VT different cache lines, wasting memory bandwidth.
By choosing an odd number for VT we avoid bank conflicts that would otherwise be incurred when re-ordering data between strided and thread orders. Within a warp, all banks (VT * tid + i) % 32 are accessed exactly once for each step i when VT is odd. If VT is a power-of-two, you can expect VT-way conflicts at each step.
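The two conventions can be summarized with a small sketch (this is illustrative code, not an MGPU function):

// Illustrative only: the two indexing conventions inside a CTA with NT
// threads and a grain size of VT values per thread.
template<int NT, int VT, typename T>
MGPU_DEVICE void OrderingSketch(const T* shared, int tid, T* strided,
	T* threadOrder) {

	// Strided order: element NT * i + tid. Consecutive threads touch
	// consecutive elements, so warp transfers map onto whole cache lines.
	#pragma unroll
	for(int i = 0; i < VT; ++i)
		strided[i] = shared[NT * i + tid];

	// Thread order: element VT * tid + i. Each thread holds VT consecutive
	// values, which suits sequential work like scan or merge. With odd VT
	// the banks (VT * tid + i) % 32 are each hit once per step i; a
	// power-of-two VT causes VT-way bank conflicts.
	#pragma unroll
	for(int i = 0; i < VT; ++i)
		threadOrder[i] = shared[VT * tid + i];
}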
Load/store function prototypes are found in mgpudevice.cuh. Most functions have names matching the pattern Device[Source]To[Dest]:
// For 0 <= i < VT:
//		index = NT * i + tid;
//		if(index < count) reg[i] = data[index];
// Synchronize after load.
template<int NT, int VT, typename InputIt, typename T>
MGPU_DEVICE void DeviceSharedToReg(int count, InputIt data, int tid,
	T* reg, bool sync = true);
Functions of this form are parameterized over NT and VT arguments; these are typically communicated to the kernel using the LaunchBox mechanism. The first argument is the count of items to move across the entire CTA. If NT * VT == count, an optimized implementation may be used which eliminates per-item predication to reduce latency and promote parallelism. The second argument is the source data; its memory space should match the [Source] part of the function name. The third argument is the thread ID. The fourth argument is the destination data; its memory space should match the [Dest] part of the function name. The final argument controls whether a __syncthreads is issued after the operation. Data movement functions with Shared in the name synchronize by default; other functions do not.
Data can be loaded from shared memory into registers in thread order with DeviceSharedToThread. Data can be stored to shared memory from registers in thread order with DeviceThreadToShared. A common practice is to:

1. Cooperatively load data into register in strided order and store it to shared memory with DeviceGlobalToShared.
2. Read out values in thread order into register with DeviceSharedToThread.
3. Operate on data that is now sequentially ordered by thread. Scan works in this manner.
4. Store results from register in thread order into shared memory with DeviceThreadToShared.
5. Cooperatively load data from shared memory into register in strided order and store it to global memory with DeviceSharedToGlobal.

Regimented application of these utility functions to move data between global memory, shared memory, and register helps highlight the novel aspects of the kernel (the stuff in step 3).
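A minimal sketch of the full pattern, assuming the loadstore.cuh prototypes follow the form shown above; the kernel name and the per-thread doubling in step 3 are placeholders:

template<int NT, int VT, typename T>
__global__ void KernelLoadStoreSketch(const T* input_global, int count,
	T* output_global) {

	const int NV = NT * VT;
	__shared__ T shared[NV];

	int tid = threadIdx.x;
	int gid = NV * blockIdx.x;
	int count2 = min(NV, count - gid);

	// 1. Cooperatively load in strided order and store to shared memory.
	DeviceGlobalToShared<NT, VT>(count2, input_global + gid, tid, shared);

	// 2. Read values from shared memory into register in thread order.
	T values[VT];
	DeviceSharedToThread<VT>(shared, tid, values);

	// 3. Operate on sequentially-ordered data (the interesting part of a
	//    real kernel goes here).
	#pragma unroll
	for(int i = 0; i < VT; ++i)
		values[i] = 2 * values[i];

	// 4. Store results from register back to shared memory in thread order.
	DeviceThreadToShared<VT>(values, tid, shared);

	// 5. Cooperatively read from shared memory in strided order and store
	//    to global memory.
	DeviceSharedToGlobal<NT, VT>(count2, shared, tid, output_global + gid);
}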