Dispatcher
Next, let's look at this through the source code.
Virtual function table
Schema examples
Every kernel operator (a "virtual function" in this analogy) has a corresponding schema. We can find examples of these schemas in aten/src/ATen/native/native_functions.yaml, where they are written as strings. As we can see, a schema includes the operator name (for example zero_), the number and types of the input arguments, the return type, whether a device check is needed, how to dispatch, and so on.
# The "virtual function table" for the zero_ op
- func: zero_(Tensor(a!) self) -> Tensor(a!)
  device_check: NoCheck   # TensorIterator
  variants: method, function
  dispatch:
    CPU, CUDA: zero_
    Meta: zero_meta_
    SparseCPU, SparseCUDA: zero_sparse_
    MkldnnCPU: mkldnn_zero_

# The "virtual function table" for sub.out
- func: sub.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)
  device_check: NoCheck   # TensorIterator
  structured: True
  structured_inherits: TensorIteratorBase
  dispatch:
    CPU, CUDA: sub_out
    SparseCPU, SparseCUDA: sub_out_sparse

# The "virtual function table" for sub.Tensor
- func: sub.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
  device_check: NoCheck   # TensorIterator
  variants: function, method
  structured_delegate: sub.out
  dispatch:
    SparseCPU, SparseCUDA: sub_sparse
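These YAML entries are turned into registrations by the code generator when PyTorch is built. As a quick sanity check, the registered schema can be retrieved at runtime through the Dispatcher singleton. Below is a minimal sketch (my own example, using the findSchemaOrThrow API that also appears later in this post):

#include <ATen/core/dispatch/Dispatcher.h>
#include <iostream>

void print_zero_schema() {
  // "aten::zero_" is the operator name, "" is its (empty) overload name.
  auto handle = c10::Dispatcher::singleton().findSchemaOrThrow("aten::zero_", "");
  // Prints something like: aten::zero_(Tensor(a!) self) -> Tensor(a!)
  std::cout << handle.schema() << std::endl;
}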
Operator implementations
Let's look at two of the implementations behind zero_. Below is the MkldnnCPU one:
Tensor& mkldnn_zero_(Tensor& self) {
  // ... (MKLDNN-specific body elided) ...
}
And here is the corresponding SparseCPU / SparseCUDA implementation:
// --------------------------------------------------------------------
// zero_(SparseTensor)
// --------------------------------------------------------------------
SparseTensor& zero_sparse_(SparseTensor& self) {
  // ... (sparse-specific body elided) ...
}
Dispatcher definition
Next let's look at the definition of Dispatcher; only some of its member variables are shown here.
class TORCH_API Dispatcher final {
 private:
  // Stores all registered operators
  std::list<OperatorDef> operators_;
  // Lookup table from operator name to OperatorHandle
  LeftRight<ska::flat_hash_map<OperatorName, OperatorHandle>> operatorLookupTable_;
  std::mutex mutex_;
};
The logic is roughly as follows, where operators_ stores all the operators:
+--------------------------------------------+
| Dispatcher                                 |
|                                            |
|    std::list<OperatorDef> operators_       |
|                                            |
|    operatorLookupTable_                    |
|                                            |
+--------------------------------------------+
Registration
Next, let's look at the method that registers entries into this dispatch table.
RegistrationHandleRAII Dispatcher::registerImpl(
  OperatorName op_name,
  c10::optional<DispatchKey> dispatch_key,
  KernelFunction kernel,
  c10::optional<impl::CppSignature> cpp_signature,
  std::unique_ptr<FunctionSchema> inferred_function_schema,
  std::string debug
) {
  std::lock_guard<std::mutex> lock(mutex_);
  auto op = findOrRegisterName_(op_name);
  auto handle = op.operatorDef_->op.registerKernel( // perform the registration
    *this,
    dispatch_key,
    std::move(kernel),
    std::move(cpp_signature),
    std::move(inferred_function_schema),
    std::move(debug)
  );
  ++op.operatorDef_->def_and_impl_count;
  return RegistrationHandleRAII([this, op, op_name, dispatch_key, handle] {
    deregisterImpl_(op, op_name, dispatch_key, handle);
  });
}
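In extension code this path is usually reached through the torch::Library registration macros rather than by calling registerImpl directly. Below is a minimal sketch using a made-up operator myops::myadd (the namespace, operator, and kernel names are hypothetical, for illustration only); m.def goes through Dispatcher::registerDef and m.impl ends up in Dispatcher::registerImpl:

#include <torch/library.h>
#include <ATen/ATen.h>

// Hypothetical CPU kernel for the made-up operator myops::myadd.
at::Tensor myadd_cpu(const at::Tensor& self, const at::Tensor& other) {
  return self + other;
}

// Registers the operator's schema; this goes through Dispatcher::registerDef.
TORCH_LIBRARY(myops, m) {
  m.def("myadd(Tensor self, Tensor other) -> Tensor");
}

// Registers a CPU kernel for it; this path ends up in Dispatcher::registerImpl
// with dispatch_key = DispatchKey::CPU, which in turn updates the dispatch table.
TORCH_LIBRARY_IMPL(myops, CPU, m) {
  m.impl("myadd", myadd_cpu);
}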
Registry
OperatorEntry represents one operator together with that operator's dispatch table; only the member variables are shown here.
class TORCH_API OperatorEntry final { // represents one operator and its dispatch table
public:
  OperatorName name_;
  c10::optional<AnnotatedSchema> schema_;
  // Stores the kernel implementation for each key (CPU, CUDA, Autograd, ...);
  // every registered version of this operator ends up in this table.
  std::array<KernelFunction, static_cast<uint8_t>(DispatchKey::NumDispatchKeys)> dispatchTable_;
  DispatchKeyExtractor dispatchKeyExtractor_;
  // Each DispatchKey maps to the kernel implementations registered for it.
  ska::flat_hash_map<DispatchKey, std::list<AnnotatedKernel>> kernels_;
};
The logic is as follows:
+--------------------------+        +--------------------------------------+
| OperatorEntry            |        |                                      |
|                          |        | std::array<KernelFunction, uint8_t>  |
|                          |        |                                      |
|                          |        |  int('CPU')   : CPU_kernel           |
|    dispatchTable_        +------->|                                      |
|                          |        |  int('GPU')   : GPU_kernel           |
|                          |        |                                      |
|                          |        |  ......                              |
|                          |        |                                      |
|                          |        |  int('Metal') : Metal_kernel         |
|                          |        |                                      |
+--------------------------+        +--------------------------------------+
Registration behavior
The registration ultimately amounts to setting an entry in dispatchTable_:
void OperatorEntry::updateDispatchTableEntry_(const c10::Dispatcher& dispatcher, DispatchKey dispatch_key) {
  auto dispatch_ix = static_cast<uint8_t>(dispatch_key);
  dispatchTable_[dispatch_ix] = computeDispatchTableEntry(dispatcher, dispatch_key);
  dispatchKeyExtractor_.setOperatorHasFallthroughForKey(dispatch_key, dispatchTable_[dispatch_ix].isFallthrough());
}
The expanded Dispatcher data structure therefore looks roughly as follows. It contains two OperatorEntry instances corresponding to op1 and op2; in other words, the system currently has two operators in total, and each operator has four kernel functions, one for each of four backends such as CPU and GPU.
+-----------------------------------------+
| Dispatcher                              |
|                                         |
|                                         |
|   std::list<OperatorDef> operators_ +--------+
|                                         |    |
|                                         |    |
|   operatorLookupTable_                  |    |
|                                         |    |
+-----------------------------------------+    |
                                               |
                                               |
                                               v
+----------------------------------------------+-------------------------------+
|                                                                              |
| +--------------------------+        +--------------------------------------+ |
| | OperatorEntry            |        |                                      | |
| |                          |        | std::array<KernelFunction, uint8_t>  | |
| | name_ = op1              |        |                                      | |
| |                          |        |  int('CPU')   : op1_cpu              | |
| |    dispatchTable_        +------->|                                      | |
| |                          |        |  int('GPU')   : op1_gpu              | |
| |                          |        |                                      | |
| |                          |        |  int('XLA')   : op1_xla              | |
| |                          |        |                                      | |
| |                          |        |  int('Metal') : op1_metal            | |
| |                          |        |                                      | |
| +--------------------------+        +--------------------------------------+ |
|                                                                              |
|                                                                              |
| +--------------------------+        +--------------------------------------+ |
| | OperatorEntry            |        |                                      | |
| |                          |        | std::array<KernelFunction, uint8_t>  | |
| | name_ = op2              |        |                                      | |
| |                          |        |  int('CPU')   : op2_cpu              | |
| |    dispatchTable_        +------->|                                      | |
| |                          |        |  int('GPU')   : op2_gpu              | |
| |                          |        |                                      | |
| |                          |        |  int('XLA')   : op2_xla              | |
| |                          |        |                                      | |
| |                          |        |  int('Metal') : op2_metal            | |
| |                          |        |                                      | |
| +--------------------------+        +--------------------------------------+ |
+------------------------------------------------------------------------------+
How dispatching works
What dispatch is based on
PyTorch dispatches to different operators based on dtype, device, and layout.
- Most dtypes (such as int32) can be mapped directly with templates, but some operators do not support templating, which is where a dynamic dispatcher is needed.
- A PyTorch tensor can run not only on the CPU but also on GPU, mkldnn, xla, and other devices, which also calls for dynamic dispatch.
- layout refers to how a tensor's elements are arranged; strided layout and sparse layout differ, for example, so dynamic dispatch is needed here as well (see the sketch below).
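As a concrete illustration, the following sketch (my own example using the public ATen C++ API; the CUDA line assumes a CUDA-enabled build) shows the same sub call reaching different kernels from the dispatch table above, purely based on the tensors' device and layout:

#include <ATen/ATen.h>

void dispatch_demo() {
  at::Tensor a = at::rand({4, 4});   // strided CPU tensor
  at::Tensor b = a.to_sparse();      // sparse CPU tensor
  at::Tensor c = a.cuda();           // strided CUDA tensor (requires a CUDA build)

  auto r1 = at::sub(a, a);  // dispatched via DispatchKey::CPU       (dense CPU kernel)
  auto r2 = at::sub(b, b);  // dispatched via DispatchKey::SparseCPU (sub_sparse)
  auto r3 = at::sub(c, c);  // dispatched via DispatchKey::CUDA      (dense CUDA kernel)
}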
Dispatch code
Only part of the code is shown here.
The operator dispatch logic is:
- Use the Dispatcher singleton together with the operator name and overload name to look up the operator's schema; the schema defines the operator's inputs, outputs, arguments, and so on.
- Call Dispatcher::call to carry out the operation:
  - Compute the DispatchKeySet from the arguments.
  - Use op.lookup to find the highest-priority key and, from that key, the corresponding KernelFunction.
  - Invoke the kernel.
First, let's take the definition of range as a concrete example of the schema lookup; inside findSchemaOrThrow, the op is found through operatorLookupTable_:
at::Tensor range::call(const at::Scalar & start, const at::Scalar & end, c10::optional<at::ScalarType> dtype, c10::optional<at::Layout> layout, c10::optional<at::Device> device, c10::optional<bool> pin_memory) {
  static auto op = c10::Dispatcher::singleton()
      .findSchemaOrThrow("aten::range", "")
      .typed<at::Tensor (const at::Scalar &, const at::Scalar &, c10::optional<at::ScalarType>, c10::optional<at::Layout>, c10::optional<at::Device>, c10::optional<bool>)>();
  return op.call(start, end, dtype, layout, device, pin_memory);
}
Next, Dispatcher::call is defined as follows:
template<class Return, class... Args>
C10_DISPATCHER_INLINE_UNLESS_MOBILE Return Dispatcher::call(const TypedOperatorHandle<Return(Args...)>& op, Args... args) const {
  detail::unused_arg_(args...);
  // Compute the dispatch key set from the arguments
  auto dispatchKeySet = op.operatorDef_->op.dispatchKeyExtractor()
    .template getDispatchKeySetUnboxed<Args...>(args...);
  TORCH_INTERNAL_ASSERT_DEBUG_ONLY(!c10::isAliasDispatchKey(dispatchKeySet.highestPriorityTypeId()));
  // Look up the kernel for the highest-priority key
  const KernelFunction& kernel = op.operatorDef_->op.lookup(dispatchKeySet.highestPriorityTypeId());
  // Dispatch
  bool pre_sampled = false;
  if (C10_UNLIKELY(at::shouldRunRecordFunction(&pre_sampled))) {
    return callWithDispatchKeySlowPath<Return, Args...>(op, pre_sampled, dispatchKeySet, kernel, std::forward<Args>(args)...);
  }
  // Fast path: invoke the kernel directly
  return kernel.template call<Return, Args...>(op, dispatchKeySet, std::forward<Args>(args)...);
}
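The lookup step is essentially an array index into dispatchTable_. A simplified sketch of what OperatorEntry::lookup amounts to (based on the data structures shown above, not necessarily the verbatim source):

// Simplified sketch of OperatorEntry::lookup: index dispatchTable_ by the
// numeric value of the key and report an error if nothing is registered there.
const KernelFunction& OperatorEntry::lookup(DispatchKey k) const {
  const auto& kernel = dispatchTable_[static_cast<uint8_t>(k)];
  if (C10_UNLIKELY(!kernel.isValid())) {
    reportError(k);  // no kernel registered for this dispatch key
  }
  return kernel;
}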
key
Next let's look at the definition of the key; there are too many values to list, so only some of them are shown here.
enum class DispatchKey : uint8_t {
  CPU, // registered at build/aten/src/ATen/RegisterCPU.cpp
  CUDA, // registered at build/aten/src/ATen/RegisterCUDA.cpp
  HIP, // NB: I think this is not actually used, due to Note [Masquerading as
       // CUDA]
  FPGA, // Xilinx support lives out of tree at
        // https://gitlab.com/pytorch-complex/vitis_kernels
  MSNPU, // unused externally, but tested at
         // test/cpp_extensions/msnpu_extension.cpp
  XLA, // lives out of tree at https://github.com/pytorch/xla
  MLC, // lives out of tree at https://github.com/pytorch/MLCompute
  Vulkan,
  Metal,
  XPU, // For out of tree Intel's heterogeneous computing plug-in
  HPU, // For out of tree & closed source integration of HPU / Habana
  VE, // For out of tree & closed source integration of SX-Aurora / NEC
  Lazy, // For lazy tensor backends
  // A meta tensor is a tensor without any data associated with it. (They
  // have also colloquially been referred to as tensors on the "null" device).
  // A meta tensor can be used to dry run operators without actually doing any
  // computation, e.g., add on two meta tensors would give you another meta
  // tensor with the output shape and dtype, but wouldn't actually add anything.
  Meta,
  // Here are backends which specify more specialized operators
  // based on the dtype of the tensor.
  QuantizedCPU, // registered at build/aten/src/ATen/RegisterQuantizedCPU.cpp
  QuantizedCUDA, // registered at build/aten/src/ATen/RegisterQuantizedCUDA.cpp
  QuantizedXPU, // For out of tree Intel's heterogeneous computing plug-in
  // This backend is to support custom RNGs; it lets you go
  // to a different kernel if you pass in a generator that is not a
  // traditional CPUGeneratorImpl/CUDAGeneratorImpl. To make use of this
  // key:
  // 1) set it as a second parameter of at::Generator constructor call in
  //    the user-defined PRNG class.
  // 2) use it as a dispatch key while registering custom kernels
  //    (templatized kernels specialized for user-defined PRNG class)
  // intended for out of tree use; tested by aten/src/ATen/test/rng_test.cpp
  CustomRNGKeyId,
  // Here are backends which specify more specialized operators
  // based on the layout of the tensor. Note that the sparse backends
  // are one case where ordering matters: sparse multi-dispatches with
  // the corresponding dense tensors, and must be handled before them.
  MkldnnCPU, // registered at build/aten/src/ATen/RegisterMkldnnCPU.cpp
             // NB: not to be confused with MKLDNN, which is Caffe2 only
  SparseCPU, // registered at build/aten/src/ATen/RegisterSparseCPU.cpp
  SparseCUDA, // registered at build/aten/src/ATen/RegisterSparseCUDA.cpp
  SparseHIP, // TODO: I think this is not actually used, due to Note
             // [Masquerading as CUDA]
  SparseXPU, // For out of tree Intel's heterogeneous computing plug-in
  SparseVE, // For out of tree & closed source integration of SX-Aurora / NEC
  SparseCsrCPU,
  SparseCsrCUDA,
  AutogradOther,
  AutogradCPU,
  AutogradCUDA,
  AutogradXLA,
  AutogradLazy,
  AutogradXPU,
  AutogradMLC,
  AutogradHPU,
  ......
};
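These keys end up in each tensor's DispatchKeySet, which is exactly what the key extraction in Dispatcher::call reads. Here is a small sketch that inspects those sets through the public Tensor::key_set() accessor (the exact set printed depends on the PyTorch version and build):

#include <ATen/ATen.h>
#include <iostream>

void show_dispatch_keys() {
  at::Tensor dense = at::zeros({2, 2});    // strided CPU tensor
  at::Tensor sparse = dense.to_sparse();   // sparse CPU tensor

  std::cout << dense.key_set() << std::endl;   // includes CPU (plus autograd-related keys)
  std::cout << sparse.key_set() << std::endl;  // includes SparseCPU (plus autograd-related keys)
}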
How the key is used
Due to space constraints we cannot dig into every case, so here we only show the scenario that starts from DeviceType. From the following function we can see how a DeviceType is mapped to a DispatchKey.
template <typename Func>
inline CppFunction dispatch(c10::DeviceType type, Func&& raw_f) {
  auto deviceTypeToDispatchKey = [](c10::DeviceType t){
    switch (t) {
      // This list is synchronized with the k-constants in c10/core/DeviceType.h
      case c10::DeviceType::CPU:
        return c10::DispatchKey::CPU;
      case c10::DeviceType::CUDA:
        return c10::DispatchKey::CUDA;
      case c10::DeviceType::XLA:
        return c10::DispatchKey::XLA;
      case c10::DeviceType::Lazy:
        return c10::DispatchKey::Lazy;
      case c10::DeviceType::MLC:
        return c10::DispatchKey::MLC;
      case c10::DeviceType::Meta:
        return c10::DispatchKey::Meta;
      case c10::DeviceType::HIP:
        return c10::DispatchKey::HIP;
      case c10::DeviceType::MSNPU:
        return c10::DispatchKey::MSNPU;
      case c10::DeviceType::HPU:
        return c10::DispatchKey::HPU;
      default:
        TORCH_CHECK(false,
          "Device type ", t, " cannot be overloaded at dispatch time, "
          "please file a bug report explaining what you were trying to do.");
    }
  };
  return dispatch(deviceTypeToDispatchKey(type), std::forward<Func>(raw_f));
}
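For completeness, here is a sketch of how this DeviceType overload is used when registering a kernel; it is an alternative spelling of the TORCH_LIBRARY_IMPL registration shown earlier, again with the hypothetical myops::myadd operator:

#include <torch/library.h>
#include <ATen/ATen.h>

// Hypothetical kernel from the earlier registration sketch.
at::Tensor myadd_cpu(const at::Tensor& self, const at::Tensor& other);

// Same registration as before, but naming the device instead of using
// TORCH_LIBRARY_IMPL(myops, CPU, ...): torch::dispatch maps
// c10::DeviceType::CPU to DispatchKey::CPU through the lambda shown above.
TORCH_LIBRARY(myops, m) {
  m.def("myadd(Tensor self, Tensor other) -> Tensor");
  m.impl("myadd", torch::dispatch(c10::DeviceType::CPU, TORCH_FN(myadd_cpu)));
}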
Summary
At this point we know that, through the Dispatcher mechanism, PyTorch can dispatch to different operators based on dtype, device, and layout. This answers our third question: how do we switch seamlessly between CPU and GPU operations?
As for the fourth question (do we need to move the loss function onto the GPU?), we now have an answer as well:
The loss function's arguments are the outputs of the forward pass and the labels. The outputs are already on the GPU (because the training data is already on the GPU), and the labels have also been moved to the GPU manually by the user. Since all of the loss function's arguments are already on the GPU, the Dispatcher will, based on the device, select the corresponding GPU operator, so there is no need to move the loss function itself onto the GPU.
The overall logic is summarized below; the sequence is as follows (a C++ sketch of the sequence appears after the list):
- Move the training data inputs to the GPU.
- Run the forward pass. Suppose there is only one operator, op1; the dispatch key device='GPU' is used to look it up in the Dispatcher.
- op1-gpu is found and executed, producing outputs.
- outputs therefore automatically lives on the GPU.
- Move the labels onto the GPU as well.
- Run the loss computation. Suppose there is again only one operator, op2; since all of the loss function's arguments are on the GPU, the dispatch key device='GPU' is again used to look it up in the Dispatcher.
- op2-gpu is found and executed, producing loss.
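A minimal C++-frontend sketch of this sequence (my own example; it assumes a CUDA build and that the model's parameters have already been moved to the GPU with model->to(torch::kCUDA)):

#include <torch/torch.h>

// Hypothetical training step; the model is assumed to already be on the GPU.
void train_step(torch::nn::Linear& model, torch::Tensor inputs, torch::Tensor labels) {
  torch::Device gpu(torch::kCUDA);
  inputs = inputs.to(gpu);                   // 1. move the training data to the GPU
  auto outputs = model->forward(inputs);     // 2-4. op1 is looked up with the GPU key; outputs lives on the GPU
  labels = labels.to(gpu);                   // 5. move the labels to the GPU
  auto loss = torch::nn::functional::mse_loss(outputs, labels);  // 6-7. op2 also dispatches to a GPU kernel
  loss.backward();                           // gradients are likewise computed by GPU kernels
}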
[Figure: overall dispatch flow: inputs and labels are moved to the GPU, op1-gpu produces outputs, and op2-gpu produces loss.]