CUDA编程模型基础

CUDA是英伟达为GPU编程提供的异构编程库。

异构模型

CUDA编程模型是一个异构模型。程序运行在一个异构系统之上，这个异构系统由CPU和GPU构成，它们之间由总线分开，程序运行时候是由CPU和GPU协同工作。

在CUDA之中，有两个重要概念：host和device。

Host ：CPU及其内存。
Device ：GPU及其内存。

因此，CUDA 架构下的一个程序也对应分为两个部份：Host 代码和Device代码，它们分别在CPU和GPU上运行。host与device之间可以通信进行数据拷贝。

主机代码（Host Code）：在 CPU 上执行的部份，使用Linux（GNU gcc）和Windows（Microsoft Visual C）编译器来编译。大致可以认为认为C语言工作对象是CPU和内存条。
设备代码（Device Code）：在GPU上执行的部份，使用 NVIDIA NVCC 编译器来编译。大致可以认为 CUDA C工作对象是GPU及GPU上内存（也叫设备内存）。

+-------------------+        +--------------------+
|                   |        |                    |
|   +----------+    |        |    +----------+    |
|   |          |    |        |    |          |    |
|   |   RAM    |    |        |    |   RAM    |    |
|   |          |    |        |    |          |    |
|   +----+-----+    |        |    +----+-----+    |
|        |          +--------+         |          |
|        |          |        |         |          |
|   +----+-----+    |        |    +----+-----+    |
|   |          |    |        |    |          |    |
|   |   CPU    |    |        |    |   GPU    |    |
|   |          |    |        |    |          |    |
|   +----------+    |        |    +----------+    |
|                   |        |                    |
+-------------------+        +--------------------+

      Host                           Device

并行思想

CUDA 编程的思路是并行思想，大致如下：

把一个很大的执行任务划分成若干个简单的可以重复的操作，然后使用若干个线程来分别执行这些操作，达到并行的目的。
执行任务处理的数据也要对应分组成多个小数据块。比如一个大数据分成若干个GPU组，每个GPU组要再次分成多个线程组，线程组内的张量可能需要再细分为张量处理器能处理的小组。

因此，一个典型的CUDA程序包括串行代码和并行代码。

串行代码是标准C代码，由host执行。
并行代码是CUDA C代码，在device中执行。

CUDA 主程序由CPU开始，即程序由host执行串行代码开始，当遇到需要数据并行处理的部分，则由device执行并行代码来作为补足。device可以独立于host进行大部分操作。当一个device代码启动之后，控制权会立刻返还给CPU来执行其他任务，所以这是一个异步过程。

图来自 [https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html。](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html%E3%80%82)

图来自 https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html。

处理流程

典型的CUDA程序的执行流程如下：

分配host内存空间并且初始化数据。
分配device显存空间。
将要计算的数据从Host内存之上复制到device显存之上。
调用CUDA核函数在device上完成用户指定的运算。
将计算后GPU内存上的结果复制到Host内存上。
释放device和host上分配的内存。

具体可以参见下图。

函数

核函数

核函数是在device线程中并行执行的函数。在 CUDA 程序中，主程序在调用GPU内核之前需要对核进行执行配置，以确定线程块数，每个线程块中线程数和共享内存大小。比如在调用时需要用<<参数1,参数2>>来指定核函数需要的线程数量以及线程是如何组织，这样在GPU之中就会启动若干个线程来并行执行这个核函数，每个线程被分配一个唯一的线程号。

CUDA通过函数类型限定词来区别host和device上的函数，主要的三个函数类型限定词为：

具体如下：

限定符	执行	调用
global	设备端执行	可以从主机调用也可以从某些特定设备调用
device	设备端执行	设备端调用
host	主机端执行	主机调用

具体如下：

+------------------------+   +------------------------+
|                        |   |                        |
|                        |   |                        |
| __host__   __global__  |   |          __device__    |
|    +           +       |   |                        |
|    |           |       |   |              +         |
|    |           |       |   |              |         |
|    |           v--------------->          |         |
|    |                   |   |   |          |         |
|    |                   |   |   |          |         |
|    |                   |   |   |          |         |
|    |                   |   |   |          |         |
|    |                   |   |   |          |         |
|    |                   |   |   |          |         |
|    |                   |   |   |          |         |
|    |                   |   |   |          |         |
|    |                   |   |   |          |         |
|    |           +<--------------v          |         |
|    |           |       |   |              |         |
|    |           |       |   |              |         |
|    |           |       |   |              |         |
|    v           v       |   |              v         |
|                        |   |                        |
+------------------------+   +------------------------+

      Host                           Device

这三个限定词其实也是 CUDA 中常见的三种运行场景。其中，device 函数和global函数因为需要在GPU上运行，因此不能调用常见的一些 C/C++ 函数（因为这些函数没有对应的 GPU 实现）。

如下代码是 NVIDIA 的例子，使用内置的 threadIdx 变量，把 A 和 B 两个张量进行相加，得到 C。因此，N 个线程之中每个都会执行 VecAdd() 。

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}

PyTorch 样例

我们从 third_party/cub/cub/device/dispatch/dispatch_reduce.cuh 找一个核函数例子来看看。

/**
 * Reduce region kernel entry point (multi-block).  Computes privatized reductions, one per thread block.
 */
template <
    typename                ChainedPolicyT,             ///< Chained tuning policy
    typename                InputIteratorT,             ///< Random-access input iterator type for reading input items \iterator
    typename                OutputIteratorT,            ///< Output iterator type for recording the reduced aggregate \iterator
    typename                OffsetT,                    ///< Signed integer type for global offsets
    typename                ReductionOpT>               ///< Binary reduction functor type having member <tt>T operator()(const T &a, const T &b)</tt>
__launch_bounds__ (int(ChainedPolicyT::ActivePolicy::ReducePolicy::BLOCK_THREADS))
__global__ void DeviceReduceKernel(
    InputIteratorT          d_in,                       ///< [in] Pointer to the input sequence of data items
    OutputIteratorT         d_out,                      ///< [out] Pointer to the output aggregate
    OffsetT                 num_items,                  ///< [in] Total number of input data items
    GridEvenShare<OffsetT>  even_share,                 ///< [in] Even-share descriptor for mapping an equal number of tiles onto each thread block
    ReductionOpT            reduction_op)               ///< [in] Binary reduction functor
{
    // The output value type
    typedef typename If<(Equals<typename std::iterator_traits<OutputIteratorT>::value_type, void>::VALUE),  // OutputT =  (if output iterator's value type is void) ?
        typename std::iterator_traits<InputIteratorT>::value_type,                                          // ... then the input iterator's value type,
        typename std::iterator_traits<OutputIteratorT>::value_type>::Type OutputT;                          // ... else the output iterator's value type

    // Thread block type for reducing input tiles
    typedef AgentReduce<
            typename ChainedPolicyT::ActivePolicy::ReducePolicy,
            InputIteratorT,
            OutputIteratorT,
            OffsetT,
            ReductionOpT>
        AgentReduceT;

    // Shared memory storage
    __shared__ typename AgentReduceT::TempStorage temp_storage;

    // Consume input tiles
    OutputT block_aggregate = AgentReduceT(temp_storage, d_in, reduction_op).ConsumeTiles(even_share);

    // Output result
    if (threadIdx.x == 0)
        d_out[blockIdx.x] = block_aggregate;
}