moderngpu 1.0

moderngpu 2.0 is a completely new project with no direct continuity from 1.0. Many of its algorithms are similar to those in 1.0, however, and these pages cover them.

  1. FAQ
  2. Introduction
    1. Libraries
    2. Goals
    3. Two-phase decomposition
    4. From scan to load-balancing search
    5. Expand
    6. Expand with load-balancing search
    7. Algorithms
  3. Performance
    1. Occupancy and latency
    2. Launch bounds
    3. Getting more performance out of MGPU
    4. LaunchBox
  4. The Library
    1. Framework
    2. Load/store functions
  5. Reduce and Scan
    1. Benchmark and usage
    2. Host functions
    3. Algorithm
    4. CTAReduce
    5. Reduce kernel
    6. CTAScan
    7. Scan kernel
  6. Bulk Remove and Bulk Insert
    1. Benchmark and usage
    2. Host functions
    3. Bulk remove algorithm
    4. BinarySearchPartitions
    5. KernelBulkRemove
    6. Bulk insert partitioning
    7. Merge Path
    8. Bulk insert algorithm
    9. Bulk insert host function and kernel
  7. Merge
    1. Benchmark and usage
    2. Host functions
    3. Algorithm
  8. Mergesort
    1. Benchmark and usage
    2. Host functions
    3. Algorithm
    4. Sorting networks
    5. Blocksort
    6. Flexible merge partitioning
    7. MergePathPartitions
    8. Launching from the host
  9. Segmented Sort and Locality Sort
    1. Benchmark and usage
    2. Host functions
    3. Algorithm
    4. Segmented blocksort
    5. Early-exit
    6. Filling the work queue
    7. Servicing the work queue
  10. Vectorized Sorted Search
    1. Benchmark and usage
    2. Host functions
    3. Algorithm
    4. Parallel sorted search
    5. CTASortedSearch
    6. SortedEqualityCount
  11. Load-Balancing Search
    1. Benchmark and usage
    2. Host function
    3. Algorithm
    4. CTALoadBalance
  12. IntervalExpand and IntervalMove
    1. Benchmark and usage
    2. Host functions
    3. IntervalExpand
    4. IntervalMove
  13. Relational Joins
    1. Benchmark and usage
    2. Host functions
    3. Algorithm
  14. Multisets
    1. Benchmark and usage
    2. Host functions
    3. The four multiset operations
    4. Balanced Path
    5. Serial multiset operations
    6. Multisets kernel
  15. Segmented Reduction
    1. Benchmarks and usage: Segmented reduction (CSR)
    2. Benchmarks and usage: Reduce-by-key
    3. Benchmarks and usage: Sparse matrix * vector (CSR)
    4. User notes
    5. Intra-tile algorithm
    6. Carry-out and carry-in
    7. CTASegReduce
    8. Reduce-by-key front-end
    9. Preprocessed segmented reduction - Apply
    10. CSR to COO
    11. Preprocessed segmented reduction - Construct
    12. Segmented reduction (CSR) front-end