Dispatcher

Next, let's look at this through the source code.

Virtual function table

  • Schema examples

    Every kernel operator ("virtual function") has a corresponding schema. We can find examples of these schemas in aten/src/ATen/native/native_functions.yaml, all expressed as strings. As we can see, a schema includes the operator name (e.g. zero_sparse_), the number and types of input arguments, the return type, whether a device check is needed, how to dispatch, and so on.

# dispatch table ("virtual function table") for the zero_ operation
- func: zero_(Tensor(a!) self) -> Tensor(a!)
  device_check: NoCheck   # TensorIterator
  variants: method, function
  dispatch:
    CPU, CUDA: zero_
    Meta: zero_meta_
    SparseCPU, SparseCUDA: zero_sparse_
    MkldnnCPU: mkldnn_zero_

# dispatch table for sub.out
- func: sub.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)
  device_check: NoCheck   # TensorIterator
  structured: True
  structured_inherits: TensorIteratorBase
  dispatch:
    CPU, CUDA: sub_out
    SparseCPU, SparseCUDA: sub_out_sparse

# dispatch table for sub.Tensor
- func: sub.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
  device_check: NoCheck   # TensorIterator
  variants: function, method
  structured_delegate: sub.out
  dispatch:
    SparseCPU, SparseCUDA: sub_sparse

Operator implementations

Let's look at two of the implementations of zero_. Below is the MkldnnCPU implementation.

Tensor& mkldnn_zero_(Tensor& self) {
  using Vec = vec::Vectorized<float>;

  ideep::tensor& x = itensor_from_mkldnn(self);

  auto n = x.get_nelems();
  auto* x_ = static_cast<float*>(x.get_data_handle());
  parallel_for(0, n, 2048, [x_](int64_t begin, int64_t end) {
    vec::map(
        [](Vec /* unused */) { return 0.0; },
        x_ + begin,
        x_ + begin,
        end - begin);
  });

  return self;
}

And here, for comparison, is the corresponding SparseCPU / SparseCUDA implementation:

// --------------------------------------------------------------------
// zero_(SparseTensor)
// --------------------------------------------------------------------
// hummu hummu
SparseTensor& zero_sparse_(SparseTensor& self) {
  AT_ASSERT(self.is_sparse());
  at::zeros_out(self, get_sparse_impl(self)->sizes());
  return self._coalesced_(true);
}

Dispatcher definition

Next we look at the definition of Dispatcher; only some of its member variables are shown here.

class TORCH_API Dispatcher final {
 private:
  // For direct access to backend fallback information
  friend class impl::OperatorEntry;

  struct OperatorDef final {
    explicit OperatorDef(OperatorName&& op_name)
        : op(std::move(op_name)) {}
    impl::OperatorEntry op;
    size_t def_count = 0;
    size_t def_and_impl_count = 0;
  };
  friend class OperatorHandle;
  template<class> friend class TypedOperatorHandle;

 public:

  static Dispatcher& realSingleton();

  // Stores all operators; each entry in turn stores the different
  // versions of that operator (CPU, CUDA, autograd, ...)
  std::list<OperatorDef> operators_;
  // When an operator is registered, its name and handle are also stored
  // here, so an operator (which wraps its OperatorDef) can be found
  // quickly by name
  LeftRight<ska::flat_hash_map<OperatorName, OperatorHandle>> operatorLookupTable_;
  // Map from namespace to debug string (saying, e.g., where the library was defined)
  ska::flat_hash_map<std::string, std::string> libraries_;
  std::array<impl::AnnotatedKernel, static_cast<uint8_t>(DispatchKey::NumDispatchKeys)> backendFallbackKernels_;
  std::unique_ptr<detail::RegistrationListenerList> listeners_;
  std::mutex mutex_;
};

The logic is roughly as follows; operators_ stores all of the operators:

+--------------------------------------------+
| Dispatcher                                 |
|                                            |
|                                            |
|   std::list<OperatorDef> operators_        |
|                                            |
|   operatorLookupTable_                     |
|                                            |
+--------------------------------------------+
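
To make the relationship between operators_ and operatorLookupTable_ concrete, here is a minimal sketch with invented names (the real code additionally wraps the table in LeftRight for concurrent reads, and keys it by OperatorName rather than std::string):

#include <list>
#include <string>
#include <unordered_map>

// Invented stand-ins for OperatorDef / OperatorName.
struct ToyOperatorDef { std::string name; };

struct ToyDispatcher {
  // Owns every operator definition (the role of Dispatcher::operators_).
  std::list<ToyOperatorDef> operators_;
  // Maps a name to its definition for fast lookup
  // (the role of Dispatcher::operatorLookupTable_).
  std::unordered_map<std::string, ToyOperatorDef*> lookupTable_;

  // Mirrors findOrRegisterName_: return the existing definition,
  // or create and index a new one. std::list never invalidates
  // element addresses, so storing pointers into it is safe.
  ToyOperatorDef& findOrRegister(const std::string& name) {
    auto it = lookupTable_.find(name);
    if (it != lookupTable_.end()) return *it->second;
    operators_.push_back(ToyOperatorDef{name});
    lookupTable_[name] = &operators_.back();
    return operators_.back();
  }
};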

Registration

  • Next we show the method that registers kernels into the dispatch table.

RegistrationHandleRAII Dispatcher::registerImpl(
  OperatorName op_name,
  c10::optional<DispatchKey> dispatch_key,
  KernelFunction kernel,
  c10::optional<impl::CppSignature> cpp_signature,
  std::unique_ptr<FunctionSchema> inferred_function_schema,
  std::string debug
) {
  std::lock_guard<std::mutex> lock(mutex_);

  auto op = findOrRegisterName_(op_name);

  auto handle = op.operatorDef_->op.registerKernel( // perform the registration
      *this,
      dispatch_key,
      std::move(kernel),
      std::move(cpp_signature),
      std::move(inferred_function_schema),
      std::move(debug)
  );

  ++op.operatorDef_->def_and_impl_count;

  return RegistrationHandleRAII([this, op, op_name, dispatch_key, handle] {
    deregisterImpl_(op, op_name, dispatch_key, handle);
  });
}
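
For context, registerImpl is usually reached through the user-facing registration macros rather than called directly. Here is a hedged sketch of that path, where my_zero_ is a hypothetical kernel (TORCH_LIBRARY_IMPL and m.impl are the real registration API):

#include <torch/library.h>

// Hypothetical CPU kernel matching the schema of aten::zero_.
at::Tensor& my_zero_(at::Tensor& self) {
  return self.fill_(0);  // illustrative implementation
}

// Registers my_zero_ under the CPU dispatch key of aten::zero_.
// Under the hood this ends up in Dispatcher::registerImpl above.
TORCH_LIBRARY_IMPL(aten, CPU, m) {
  m.impl("zero_", my_zero_);
}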

    Registry

    OperatorEntry represents one operator together with that operator's dispatch table; only the member variables are shown here.

class TORCH_API OperatorEntry final { // represents one operator and its dispatch table
 public:
  OperatorName name_;
  c10::optional<AnnotatedSchema> schema_;
  // One kernel per dispatch key (CPU, CUDA, autograd, ...); every
  // version of this operator lives in this table
  std::array<KernelFunction, static_cast<uint8_t>(DispatchKey::NumDispatchKeys)> dispatchTable_;
  DispatchKeyExtractor dispatchKeyExtractor_;
  // Each DispatchKey maps to the kernel implementations registered for it
  ska::flat_hash_map<DispatchKey, std::list<AnnotatedKernel>> kernels_;
};

    The logic is as follows:

+---------------------------+      +------------------------------------------+
| OperatorEntry             |      |                                          |
|                           |      |  std::array<KernelFunction, uint8_t>     |
|                           |      |                                          |
|                           |      |                                          |
|                           |      |  int('CPU')   : CPU_kernel               |
|   dispatchTable_ +-------------->|                                          |
|                           |      |  int('GPU')   : GPU_kernel               |
|                           |      |                                          |
|                           |      |  ......                                  |
|                           |      |                                          |
|                           |      |  int('Metal') : Metal_kernel             |
|                           |      |                                          |
+---------------------------+      +------------------------------------------+
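
The key point of the diagram is that dispatchTable_ is nothing more than an array indexed by the numeric value of the dispatch key, so a lookup is a single array access. A minimal sketch with invented names:

#include <array>
#include <cstdint>

// Invented stand-ins for DispatchKey / KernelFunction.
enum class ToyKey : uint8_t { CPU, GPU, Metal, NumKeys };
using ToyKernel = void (*)();

struct ToyOperatorEntry {
  // One slot per dispatch key, like OperatorEntry::dispatchTable_.
  std::array<ToyKernel, static_cast<uint8_t>(ToyKey::NumKeys)> dispatchTable_{};

  // Lookup is a plain array index, which is why dispatch is cheap.
  ToyKernel lookup(ToyKey k) const {
    return dispatchTable_[static_cast<uint8_t>(k)];
  }
};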

    Registration behavior

    Ultimately, the act of registration just writes into dispatchTable_.

void OperatorEntry::updateDispatchTableEntry_(const c10::Dispatcher& dispatcher, DispatchKey dispatch_key) {
  auto dispatch_ix = static_cast<uint8_t>(dispatch_key);
  dispatchTable_[dispatch_ix] = computeDispatchTableEntry(dispatcher, dispatch_key);
  dispatchKeyExtractor_.setOperatorHasFallthroughForKey(dispatch_key, dispatchTable_[dispatch_ix].isFallthrough());
}

    So the Dispatcher data structure, expanded, looks roughly like the following. It contains two OperatorEntry instances corresponding to op1 and op2; that is, the system currently holds two operators in total, and each operator has four kernel functions corresponding to four backends such as CPU and GPU.

+-----------------------------------------+
| Dispatcher                              |
|                                         |
|                                         |
|   std::list<OperatorDef> operators_ +------+
|                                         |  |
|                                         |  |
|   operatorLookupTable_                  |  |
|                                         |  |
+-----------------------------------------+  |
                                             |
                                             |
                                             v
+------------------------------------------------------------------------------+
|  +---------------------------+      +--------------------------------------+ |
|  | OperatorEntry             |      |                                      | |
|  |                           |      | std::array<KernelFunction, uint8_t>  | |
|  |                           |      |                                      | |
|  | name_ = op1               |      |                                      | |
|  |                           |      | int('CPU')   : op1_cpu               | |
|  |   dispatchTable_ +-------------->|                                      | |
|  |                           |      | int('GPU')   : op1_gpu               | |
|  |                           |      |                                      | |
|  |                           |      | int('XLA')   : op1_xla               | |
|  |                           |      |                                      | |
|  |                           |      | int('Metal') : op1_metal             | |
|  |                           |      |                                      | |
|  +---------------------------+      +--------------------------------------+ |
|                                                                              |
|                                                                              |
|  +---------------------------+      +--------------------------------------+ |
|  | OperatorEntry             |      |                                      | |
|  |                           |      | std::array<KernelFunction, uint8_t>  | |
|  |                           |      |                                      | |
|  | name_ = op2               |      |                                      | |
|  |                           |      | int('CPU')   : op2_cpu               | |
|  |   dispatchTable_ +-------------->|                                      | |
|  |                           |      | int('GPU')   : op2_gpu               | |
|  |                           |      |                                      | |
|  |                           |      | int('XLA')   : op2_xla               | |
|  |                           |      |                                      | |
|  |                           |      | int('Metal') : op2_metal             | |
|  |                           |      |                                      | |
|  +---------------------------+      +--------------------------------------+ |
+------------------------------------------------------------------------------+

How dispatch works

  • Dispatch criteria

    PyTorch dispatches to different operators according to differences in dtype, device, and layout.

    • Most types (such as int32) can be mapped directly with templates, but some operators do not lend themselves to templates, and then a dynamic dispatcher is needed.
    • A PyTorch tensor can live not only on the CPU but also on GPU, mkldnn, xla and other devices, which also requires dynamic dispatch.
    • Layout refers to how the elements of a tensor are arranged; strided layout and sparse layout differ, so dynamic dispatch is needed here as well.
  • Dispatch code

    Part of the code is shown here.

    The operator dispatch logic is:

    1. Find the operator schema through the combination of the dispatcher singleton + operator name + overload name; the schema defines the operator's inputs, outputs, arguments, and so on.
    2. Call Dispatcher::call to complete the operation (a toy sketch of these sub-steps follows this list).
      1. Obtain the dispatchKeySet from the dispatcher.
      2. Use op.lookup to find the highest-priority key and, from it, the corresponding KernelFunction.
      3. Call the kernel.
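
    As referenced above, here is a self-contained toy version of the three sub-steps, with invented names, showing how a key set computed from the arguments selects one slot of the kernel table:

#include <array>
#include <cstdint>
#include <functional>

// Toy versions of DispatchKey / DispatchKeySet / Tensor (all invented).
enum class ToyKey : uint8_t { CPU, GPU, NumKeys };

struct ToyTensor { bool on_gpu; };

struct ToyKeySet {
  bool has_gpu = false;
  // The highest-priority key wins (GPU over CPU here).
  ToyKey highestPriorityKey() const { return has_gpu ? ToyKey::GPU : ToyKey::CPU; }
};

using ToyKernel = std::function<ToyTensor(const ToyTensor&, const ToyTensor&)>;

struct ToyOperator {
  std::array<ToyKernel, static_cast<uint8_t>(ToyKey::NumKeys)> table;

  ToyTensor call(const ToyTensor& a, const ToyTensor& b) const {
    ToyKeySet ks{a.on_gpu || b.on_gpu};           // step 1: key set from the arguments
    ToyKey k = ks.highestPriorityKey();           // step 2: pick the highest-priority key
    return table[static_cast<uint8_t>(k)](a, b);  // step 3: call that kernel
  }
};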

    First, let's take the definition of range to see how the schema is looked up; inside findSchemaOrThrow, the op is found through operatorLookupTable_:

at::Tensor range::call(const at::Scalar & start, const at::Scalar & end, c10::optional<at::ScalarType> dtype, c10::optional<at::Layout> layout, c10::optional<at::Device> device, c10::optional<bool> pin_memory) {
  static auto op = c10::Dispatcher::singleton()
      .findSchemaOrThrow("aten::range", "")
      .typed<at::Tensor (const at::Scalar &, const at::Scalar &, c10::optional<at::ScalarType>, c10::optional<at::Layout>, c10::optional<at::Device>, c10::optional<bool>)>();
  return op.call(start, end, dtype, layout, device, pin_memory);
}

    Second, Dispatcher::call is defined as follows:

template<class Return, class... Args>
C10_DISPATCHER_INLINE_UNLESS_MOBILE Return Dispatcher::call(const TypedOperatorHandle<Return(Args...)>& op, Args... args) const {
  detail::unused_arg_(args...);

  // compute the dispatch key set from the arguments
  auto dispatchKeySet = op.operatorDef_->op.dispatchKeyExtractor()
      .template getDispatchKeySetUnboxed<Args...>(args...);
  TORCH_INTERNAL_ASSERT_DEBUG_ONLY(!c10::isAliasDispatchKey(dispatchKeySet.highestPriorityTypeId()));

  // look up the kernel for the highest-priority key
  const KernelFunction& kernel = op.operatorDef_->op.lookup(dispatchKeySet.highestPriorityTypeId());

  // dispatch to the kernel
#ifndef PYTORCH_DISABLE_PER_OP_PROFILING
  bool pre_sampled = false;
  if (C10_UNLIKELY(at::shouldRunRecordFunction(&pre_sampled))) {
    return callWithDispatchKeySlowPath<Return, Args...>(op, pre_sampled, dispatchKeySet, kernel, std::forward<Args>(args)...);
  }
#endif // PYTORCH_DISABLE_PER_OP_PROFILING

  return kernel.template call<Return, Args...>(op, dispatchKeySet, std::forward<Args>(args)...);
}
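
    From the caller's side all of this is invisible: the same line of C++ triggers Dispatcher::call, and the kernel is picked from the tensor arguments. A small usage sketch (assuming a CUDA-enabled build; torch::sub, Tensor::cuda and torch::cuda::is_available are the ordinary public C++ API):

#include <torch/torch.h>

int main() {
  auto a = torch::ones({2, 2});
  auto b = torch::ones({2, 2});

  // Dispatches sub.Tensor; the key set of (a, b) selects the CPU kernel.
  auto c = torch::sub(a, b);

  if (torch::cuda::is_available()) {
    // Same operator, but the arguments now carry the CUDA key,
    // so the CUDA kernel is selected instead.
    auto cg = torch::sub(a.cuda(), b.cuda());
  }
  return 0;
}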

Key

  • Next we look at how the keys are defined. There are too many to show in full, so only some of the values are given.

enum class DispatchKey : uint8_t {
  CPU, // registered at build/aten/src/ATen/RegisterCPU.cpp
  CUDA, // registered at build/aten/src/ATen/RegisterCUDA.cpp
  HIP, // NB: I think this is not actually used, due to Note [Masquerading as
       // CUDA]
  FPGA, // Xilinx support lives out of tree at
        // https://gitlab.com/pytorch-complex/vitis_kernels
  MSNPU, // unused externally, but tested at
         // test/cpp_extensions/msnpu_extension.cpp
  XLA, // lives out of tree at https://github.com/pytorch/xla
  MLC, // lives out of tree at https://github.com/pytorch/MLCompute
  Vulkan,
  Metal,
  XPU, // For out of tree Intel's heterogeneous computing plug-in
  HPU, // For out of tree & closed source integration of HPU / Habana
  VE, // For out of tree & closed source integration of SX-Aurora / NEC
  Lazy, // For lazy tensor backends
  // A meta tensor is a tensor without any data associated with it. (They
  // have also colloquially been referred to as tensors on the "null" device).
  // A meta tensor can be used to dry run operators without actually doing any
  // computation, e.g., add on two meta tensors would give you another meta
  // tensor with the output shape and dtype, but wouldn't actually add anything.
  Meta,
  // Here are backends which specify more specialized operators
  // based on the dtype of the tensor.
  QuantizedCPU, // registered at build/aten/src/ATen/RegisterQuantizedCPU.cpp
  QuantizedCUDA, // registered at build/aten/src/ATen/RegisterQuantizedCUDA.cpp
  QuantizedXPU, // For out of tree Intel's heterogeneous computing plug-in
  // This backend is to support custom RNGs; it lets you go
  // to a different kernel if you pass in a generator that is not a
  // traditional CPUGeneratorImpl/CUDAGeneratorImpl. To make use of this
  // key:
  //  1) set it as a second parameter of at::Generator constructor call in
  //     the user-defined PRNG class.
  //  2) use it as a dispatch key while registering custom kernels
  //     (templatized kernels specialized for user-defined PRNG class)
  // intended for out of tree use; tested by aten/src/ATen/test/rng_test.cpp
  CustomRNGKeyId,

  // Here are backends which specify more specialized operators
  // based on the layout of the tensor. Note that the sparse backends
  // are one case where ordering matters: sparse multi-dispatches with
  // the corresponding dense tensors, and must be handled before them.
  MkldnnCPU, // registered at build/aten/src/ATen/RegisterMkldnnCPU.cpp
             // NB: not to be confused with MKLDNN, which is Caffe2 only
  SparseCPU, // registered at build/aten/src/ATen/RegisterSparseCPU.cpp
  SparseCUDA, // registered at build/aten/src/ATen/RegisterSparseCUDA.cpp
  SparseHIP, // TODO: I think this is not actually used, due to Note
             // [Masquerading as CUDA]
  SparseXPU, // For out of tree Intel's heterogeneous computing plug-in
  SparseVE, // For out of tree & closed source integration of SX-Aurora / NEC
  SparseCsrCPU,
  SparseCsrCUDA,

  AutogradOther,
  AutogradCPU,
  AutogradCUDA,
  AutogradXLA,
  AutogradLazy,
  AutogradXPU,
  AutogradMLC,
  AutogradHPU,

  ......
};
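
    One detail worth noting: in this version the enum value doubles as a priority, and highestPriorityTypeId effectively returns the highest set bit of the key set, so keys declared later in the enum (such as the Autograd* keys) win over earlier backend keys. A minimal sketch of that selection, assuming the 64-bit-mask representation of a key set:

#include <cstdint>

// Illustrative only: a key set is a 64-bit mask with one bit per
// DispatchKey, and the highest-priority key is simply the highest
// set bit, i.e. the key declared latest in the enum.
uint8_t highestPriorityBit(uint64_t keyset) {
  uint8_t highest = 0;
  for (uint8_t i = 0; i < 64; ++i) {
    if (keyset & (1ULL << i)) highest = i;
  }
  return highest;
}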

  • How keys are used

    Space does not allow us to analyze every case in depth, so here we only show the scenario starting from DeviceType. The following function shows how a DeviceType is mapped to a DispatchKey.

template <typename Func>
inline CppFunction dispatch(c10::DeviceType type, Func&& raw_f) {
  auto deviceTypeToDispatchKey = [](c10::DeviceType t){
    switch (t) {
      // This list is synchronized with the k-constants in c10/core/DeviceType.h
      case c10::DeviceType::CPU:
        return c10::DispatchKey::CPU;
      case c10::DeviceType::CUDA:
        return c10::DispatchKey::CUDA;
      case c10::DeviceType::XLA:
        return c10::DispatchKey::XLA;
      case c10::DeviceType::Lazy:
        return c10::DispatchKey::Lazy;
      case c10::DeviceType::MLC:
        return c10::DispatchKey::MLC;
      case c10::DeviceType::Meta:
        return c10::DispatchKey::Meta;
      case c10::DeviceType::HIP:
        return c10::DispatchKey::HIP;
      case c10::DeviceType::MSNPU:
        return c10::DispatchKey::MSNPU;
      case c10::DeviceType::HPU:
        return c10::DispatchKey::HPU;
      default:
        TORCH_CHECK(false,
          "Device type ", t, " cannot be overloaded at dispatch time, "
          "please file a bug report explaining what you were trying to do.");
    }
  };
  return dispatch(deviceTypeToDispatchKey(type), std::forward<Func>(raw_f));
}
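
    As a hedged usage sketch, this helper lets you register a kernel for a custom operator by device type. Here myadd_cpu and the myops namespace are invented for illustration; TORCH_LIBRARY, m.def, m.impl and TORCH_FN are the real registration API:

#include <torch/library.h>

// Hypothetical kernel for a custom op.
at::Tensor myadd_cpu(const at::Tensor& a, const at::Tensor& b) {
  return a + b;
}

TORCH_LIBRARY(myops, m) {
  m.def("myadd(Tensor a, Tensor b) -> Tensor");
  // torch::dispatch maps DeviceType::CPU to DispatchKey::CPU
  // via the function shown above.
  m.impl("myadd", torch::dispatch(c10::DeviceType::CPU, TORCH_FN(myadd_cpu)));
}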

Summary

At this point we know that, through the Dispatcher mechanism, PyTorch can dispatch to different operators according to dtype, device, and layout. This answers our third question: how to switch seamlessly between CPU and GPU operations?

As for the fourth question, whether the loss function needs to be moved onto the GPU, we now have an answer as well:

The arguments to the loss function are the outputs of the forward pass and the labels. The outputs are already on the GPU (because the training data is already on the GPU), and the labels were also manually placed on the GPU by the user. Since all of the loss function's arguments are already on the GPU, the Dispatcher will route to the GPU operator based on the device, so there is no need to move the loss function itself onto the GPU.

We organize the overall logic below; the sequence is:

  1. Move the training data inputs onto the GPU.
  2. Run the forward pass. Suppose there is only one operator, op1; the dispatch key device='GPU' is used to look it up in the Dispatcher.
  3. The operator op1-gpu is found and run, producing outputs.
  4. outputs therefore automatically lives on the GPU.
  5. The Labels are also placed on the GPU.
  6. Run the loss computation. Suppose there is only one operator, op2. The loss function's arguments are all on the GPU, so the dispatch key device='GPU' is used to look it up in the Dispatcher.
  7. The operator op2-gpu is found and run, producing the loss.
                                 +--------------------+
+-----------+                    | Forward            |       +------------+        +------------------+
|   GPU     |                    |                    |       |    GPU     |        |  Loss Function   |
|           +------------------->|  op1   op1-gpu()   +------>|            +------->|                  |       +--------+
|  Inputs   |         1          |                    |   4   |  Outputs   |        |                  |       |  GPU   |
|           |                    |    +         ^     |       |            |        |                  |       |        |
+-----------+                    |    |         |     |       +------------+        | op2   op2-gpu()  +------>|  loss  |
                                 |    |         |     |                             |                  |       |        |
                                 +--------------------+       +------------+        |   +         ^    |       |        |
                                      |         |             |    GPU     |   5    |   |         |    |       +--------+
                                      |         |             |            +------->|   |         |    |
                                    2 |         | 3           |  Labels    |        | 6 |         | 7  |
                                      |         |             |            |        |   |         |    |
                                      |         |             +------------+        +------------------+
                                      |         |                                       |         |
                        'device=GPU'  |         |                         'device=GPU' |         |
                                      v         |                                       v         |
+------------------------------------------------------------------------------------------------------+
|                                              Dispatcher                                               |
|                                                                                                       |
|          | XLA         | CPU         | Metal         | GPU                                            |
|   +------+-------------+-------------+---------------+-------------+                                  |
|   | OP1  | op1-xla     | op1-cpu     | op1-metal     | op1-gpu +-------->  3                          |
|   +------+-------------+-------------+---------------+-------------+                                  |
|   | OP2  | op2-xla     | op2-cpu     | op2-metal     | op2-gpu +-------->  7                          |
|   +------+-------------+-------------+---------------+-------------+                                  |
|   | OP3  | op3-xla     | op3-cpu     | op3-metal     | op3-gpu     |                                  |
|   +------+-------------+-------------+---------------+-------------+                                  |
+------------------------------------------------------------------------------------------------------+

A screenshot version of the diagram follows:

https://img2020.cnblogs.com/blog/1850883/202111/1850883-20211106210302126-278214823.png