Dispatcher
Next, let's look at this through the source code.
Virtual function table
Schema examples
Every kernel operator (a "virtual function" in this analogy) has a corresponding schema. We can find examples of these schemas in aten/src/ATen/native/native_functions.yaml, where they are written as strings. As we can see, a schema includes the operator name (for example zero_), the number and types of the input arguments, the return type, whether a device check is needed, how to dispatch, and so on.
# The "virtual function table" for the zero_ op
- func: zero_(Tensor(a!) self) -> Tensor(a!)
  device_check: NoCheck   # TensorIterator
  variants: method, function
  dispatch:
    CPU, CUDA: zero_
    Meta: zero_meta_
    SparseCPU, SparseCUDA: zero_sparse_
    MkldnnCPU: mkldnn_zero_

# The "virtual function table" for sub.out
- func: sub.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)
  device_check: NoCheck   # TensorIterator
  structured: True
  structured_inherits: TensorIteratorBase
  dispatch:
    CPU, CUDA: sub_out
    SparseCPU, SparseCUDA: sub_out_sparse

# The "virtual function table" for sub.Tensor
- func: sub.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
  device_check: NoCheck   # TensorIterator
  variants: function, method
  structured_delegate: sub.out
  dispatch:
    SparseCPU, SparseCUDA: sub_sparse
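These YAML entries are turned into registrations by the code generator when PyTorch is built. As a quick sanity check, the registered schema can be retrieved at runtime through the Dispatcher singleton. Below is a minimal sketch (my own example, using the findSchemaOrThrow API that also appears later in this post):

#include <ATen/core/dispatch/Dispatcher.h>
#include <iostream>

void print_zero_schema() {
  // "aten::zero_" is the operator name, "" is its (empty) overload name.
  auto handle = c10::Dispatcher::singleton().findSchemaOrThrow("aten::zero_", "");
  // Prints something like: aten::zero_(Tensor(a!) self) -> Tensor(a!)
  std::cout << handle.schema() << std::endl;
}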
Operator implementations
Let's look at two of the implementations behind zero_. Below is the MkldnnCPU one:
Tensor& mkldnn_zero_(Tensor& self) {
  // ... (MKLDNN-specific body elided) ...
}
And here is the corresponding SparseCPU / SparseCUDA implementation:
// --------------------------------------------------------------------
// zero_(SparseTensor)
// --------------------------------------------------------------------
SparseTensor& zero_sparse_(SparseTensor& self) {
  // ... (sparse-specific body elided) ...
}
Dispatcher definition
Next let's look at the definition of Dispatcher; only some of its member variables are shown here.
class TORCH_API Dispatcher final {
 private:
  // Stores all registered operators
  std::list<OperatorDef> operators_;
  // Lookup table from operator name to OperatorHandle
  LeftRight<ska::flat_hash_map<OperatorName, OperatorHandle>> operatorLookupTable_;
  std::mutex mutex_;
};
The logic is roughly as follows, where operators_ stores all the operators:
+--------------------------------------------+
| Dispatcher                                 |
|                                            |
|    std::list<OperatorDef> operators_       |
|                                            |
|    operatorLookupTable_                    |
|                                            |
+--------------------------------------------+
Registration
Next, let's look at the method that registers entries into this dispatch table.
RegistrationHandleRAII Dispatcher::registerImpl(
  OperatorName op_name,
  c10::optional<DispatchKey> dispatch_key,
  KernelFunction kernel,
  c10::optional<impl::CppSignature> cpp_signature,
  std::unique_ptr<FunctionSchema> inferred_function_schema,
  std::string debug
) {
  std::lock_guard<std::mutex> lock(mutex_);
  auto op = findOrRegisterName_(op_name);
  auto handle = op.operatorDef_->op.registerKernel( // perform the registration
    *this,
    dispatch_key,
    std::move(kernel),
    std::move(cpp_signature),
    std::move(inferred_function_schema),
    std::move(debug)
  );
  ++op.operatorDef_->def_and_impl_count;
  return RegistrationHandleRAII([this, op, op_name, dispatch_key, handle] {
    deregisterImpl_(op, op_name, dispatch_key, handle);
  });
}
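In extension code this path is usually reached through the torch::Library registration macros rather than by calling registerImpl directly. Below is a minimal sketch using a made-up operator myops::myadd (the namespace, operator, and kernel names are hypothetical, for illustration only); m.def goes through Dispatcher::registerDef and m.impl ends up in Dispatcher::registerImpl:

#include <torch/library.h>
#include <ATen/ATen.h>

// Hypothetical CPU kernel for the made-up operator myops::myadd.
at::Tensor myadd_cpu(const at::Tensor& self, const at::Tensor& other) {
  return self + other;
}

// Registers the operator's schema; this goes through Dispatcher::registerDef.
TORCH_LIBRARY(myops, m) {
  m.def("myadd(Tensor self, Tensor other) -> Tensor");
}

// Registers a CPU kernel for it; this path ends up in Dispatcher::registerImpl
// with dispatch_key = DispatchKey::CPU, which in turn updates the dispatch table.
TORCH_LIBRARY_IMPL(myops, CPU, m) {
  m.impl("myadd", myadd_cpu);
}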
Registry
OperatorEntry represents one operator together with that operator's dispatch table; only the member variables are shown here.
class TORCH_API OperatorEntry final { // represents one operator and its dispatch table
public:
  OperatorName name_;
  c10::optional<AnnotatedSchema> schema_;
  // Stores the kernel implementation for each key (CPU, CUDA, Autograd, ...);
  // every registered version of this operator ends up in this table.
  std::array<KernelFunction, static_cast<uint8_t>(DispatchKey::NumDispatchKeys)> dispatchTable_;
  DispatchKeyExtractor dispatchKeyExtractor_;
  // Each DispatchKey maps to the kernel implementations registered for it.
  ska::flat_hash_map<DispatchKey, std::list<AnnotatedKernel>> kernels_;
};
The logic is as follows:
+--------------------------+        +--------------------------------------+
| OperatorEntry            |        |                                      |
|                          |        | std::array<KernelFunction, uint8_t>  |
|                          |        |                                      |
|                          |        |  int('CPU')   : CPU_kernel           |
|    dispatchTable_        +------->|                                      |
|                          |        |  int('GPU')   : GPU_kernel           |
|                          |        |                                      |
|                          |        |  ......                              |
|                          |        |                                      |
|                          |        |  int('Metal') : Metal_kernel         |
|                          |        |                                      |
+--------------------------+        +--------------------------------------+
Registration behavior
The registration ultimately amounts to setting an entry in dispatchTable_:
void OperatorEntry::updateDispatchTableEntry_(const c10::Dispatcher& dispatcher, DispatchKey dispatch_key) {
  auto dispatch_ix = static_cast<uint8_t>(dispatch_key);
  dispatchTable_[dispatch_ix] = computeDispatchTableEntry(dispatcher, dispatch_key);
  dispatchKeyExtractor_.setOperatorHasFallthroughForKey(dispatch_key, dispatchTable_[dispatch_ix].isFallthrough());
}
The expanded Dispatcher data structure therefore looks roughly as follows. It contains two OperatorEntry instances corresponding to op1 and op2; in other words, the system currently has two operators in total, and each operator has four kernel functions, one for each of four backends such as CPU and GPU.
+-----------------------------------------+
| Dispatcher                              |
|                                         |
|                                         |
|   std::list<OperatorDef> operators_ +--------+
|                                         |    |
|                                         |    |
|   operatorLookupTable_                  |    |
|                                         |    |
+-----------------------------------------+    |
                                               |
                                               |
                                               v
+----------------------------------------------+-------------------------------+
|                                                                              |
| +--------------------------+        +--------------------------------------+ |
| | OperatorEntry            |        |                                      | |
| |                          |        | std::array<KernelFunction, uint8_t>  | |
| | name_ = op1              |        |                                      | |
| |                          |        |  int('CPU')   : op1_cpu              | |
| |    dispatchTable_        +------->|                                      | |
| |                          |        |  int('GPU')   : op1_gpu              | |
| |                          |        |                                      | |
| |                          |        |  int('XLA')   : op1_xla              | |
| |                          |        |                                      | |
| |                          |        |  int('Metal') : op1_metal            | |
| |                          |        |                                      | |
| +--------------------------+        +--------------------------------------+ |
|                                                                              |
|                                                                              |
| +--------------------------+        +--------------------------------------+ |
| | OperatorEntry            |        |                                      | |
| |                          |        | std::array<KernelFunction, uint8_t>  | |
| | name_ = op2              |        |                                      | |
| |                          |        |  int('CPU')   : op2_cpu              | |
| |    dispatchTable_        +------->|                                      | |
| |                          |        |  int('GPU')   : op2_gpu              | |
| |                          |        |                                      | |
| |                          |        |  int('XLA')   : op2_xla              | |
| |                          |        |                                      | |
| |                          |        |  int('Metal') : op2_metal            | |
| |                          |        |                                      | |
| +--------------------------+        +--------------------------------------+ |
+------------------------------------------------------------------------------+
How dispatching works
What dispatch is based on
PyTorch dispatches to different operators based on dtype, device, and layout.
- Most dtypes (such as int32) can be mapped directly with templates, but some operators do not support templating, which is where a dynamic dispatcher is needed.
- A PyTorch tensor can run not only on the CPU but also on GPU, mkldnn, xla, and other devices, which also calls for dynamic dispatch.
- layout refers to how a tensor's elements are arranged; strided layout and sparse layout differ, for example, so dynamic dispatch is needed here as well (see the sketch below).
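As a concrete illustration, the following sketch (my own example using the public ATen C++ API; the CUDA line assumes a CUDA-enabled build) shows the same sub call reaching different kernels from the dispatch table above, purely based on the tensors' device and layout:

#include <ATen/ATen.h>

void dispatch_demo() {
  at::Tensor a = at::rand({4, 4});   // strided CPU tensor
  at::Tensor b = a.to_sparse();      // sparse CPU tensor
  at::Tensor c = a.cuda();           // strided CUDA tensor (requires a CUDA build)

  auto r1 = at::sub(a, a);  // dispatched via DispatchKey::CPU       (dense CPU kernel)
  auto r2 = at::sub(b, b);  // dispatched via DispatchKey::SparseCPU (sub_sparse)
  auto r3 = at::sub(c, c);  // dispatched via DispatchKey::CUDA      (dense CUDA kernel)
}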
Dispatch code
Only part of the code is shown here.
The operator dispatch logic is:
- Use the Dispatcher singleton together with the operator name and overload name to look up the operator's schema; the schema defines the operator's inputs, outputs, arguments, and so on.
- Call Dispatcher::call to carry out the operation:
  - Compute the DispatchKeySet from the arguments.
  - Use op.lookup to find the highest-priority key and, from that key, the corresponding KernelFunction.
  - Invoke the kernel.
First, let's take the definition of range as a concrete example of the schema lookup; inside findSchemaOrThrow, the op is found through operatorLookupTable_:
at::Tensor range::call(const at::Scalar & start, const at::Scalar & end, c10::optional<at::ScalarType> dtype, c10::optional<at::Layout> layout, c10::optional<at::Device> device, c10::optional<bool> pin_memory) {
  static auto op = c10::Dispatcher::singleton()
      .findSchemaOrThrow("aten::range", "")
      .typed<at::Tensor (const at::Scalar &, const at::Scalar &, c10::optional<at::ScalarType>, c10::optional<at::Layout>, c10::optional<at::Device>, c10::optional<bool>)>();
  return op.call(start, end, dtype, layout, device, pin_memory);
}
Next, Dispatcher::call is defined as follows:
template<class Return, class... Args>
C10_DISPATCHER_INLINE_UNLESS_MOBILE Return Dispatcher::call(const TypedOperatorHandle<Return(Args...)>& op, Args... args) const {
  detail::unused_arg_(args...);
  // Compute the dispatch key set from the arguments
  auto dispatchKeySet = op.operatorDef_->op.dispatchKeyExtractor()
    .template getDispatchKeySetUnboxed<Args...>(args...);
  TORCH_INTERNAL_ASSERT_DEBUG_ONLY(!c10::isAliasDispatchKey(dispatchKeySet.highestPriorityTypeId()));
  // Look up the kernel for the highest-priority key
  const KernelFunction& kernel = op.operatorDef_->op.lookup(dispatchKeySet.highestPriorityTypeId());
  // Dispatch
  bool pre_sampled = false;
  if (C10_UNLIKELY(at::shouldRunRecordFunction(&pre_sampled))) {
    return callWithDispatchKeySlowPath<Return, Args...>(op, pre_sampled, dispatchKeySet, kernel, std::forward<Args>(args)...);
  }
  // Fast path: invoke the kernel directly
  return kernel.template call<Return, Args...>(op, dispatchKeySet, std::forward<Args>(args)...);
}
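The lookup step is essentially an array index into dispatchTable_. A simplified sketch of what OperatorEntry::lookup amounts to (based on the data structures shown above, not necessarily the verbatim source):

// Simplified sketch of OperatorEntry::lookup: index dispatchTable_ by the
// numeric value of the key and report an error if nothing is registered there.
const KernelFunction& OperatorEntry::lookup(DispatchKey k) const {
  const auto& kernel = dispatchTable_[static_cast<uint8_t>(k)];
  if (C10_UNLIKELY(!kernel.isValid())) {
    reportError(k);  // no kernel registered for this dispatch key
  }
  return kernel;
}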
key
Next let's look at the definition of the key; there are too many values to list, so only some of them are shown here.
enum class DispatchKey : uint8_t {
  CPU, // registered at build/aten/src/ATen/RegisterCPU.cpp
  CUDA, // registered at build/aten/src/ATen/RegisterCUDA.cpp
  HIP, // NB: I think this is not actually used, due to Note [Masquerading as
       // CUDA]
  FPGA, // Xilinx support lives out of tree at
        // https://gitlab.com/pytorch-complex/vitis_kernels
  MSNPU, // unused externally, but tested at
         // test/cpp_extensions/msnpu_extension.cpp
  XLA, // lives out of tree at https://github.com/pytorch/xla
  MLC, // lives out of tree at https://github.com/pytorch/MLCompute
  Vulkan,
  Metal,
  XPU, // For out of tree Intel's heterogeneous computing plug-in
  HPU, // For out of tree & closed source integration of HPU / Habana
  VE, // For out of tree & closed source integration of SX-Aurora / NEC
  Lazy, // For lazy tensor backends
  // A meta tensor is a tensor without any data associated with it. (They
  // have also colloquially been referred to as tensors on the "null" device).
  // A meta tensor can be used to dry run operators without actually doing any
  // computation, e.g., add on two meta tensors would give you another meta
  // tensor with the output shape and dtype, but wouldn't actually add anything.
  Meta,
  // Here are backends which specify more specialized operators
  // based on the dtype of the tensor.
  QuantizedCPU, // registered at build/aten/src/ATen/RegisterQuantizedCPU.cpp
  QuantizedCUDA, // registered at build/aten/src/ATen/RegisterQuantizedCUDA.cpp
  QuantizedXPU, // For out of tree Intel's heterogeneous computing plug-in
  // This backend is to support custom RNGs; it lets you go
  // to a different kernel if you pass in a generator that is not a
  // traditional CPUGeneratorImpl/CUDAGeneratorImpl. To make use of this
  // key:
  // 1) set it as a second parameter of at::Generator constructor call in
  //    the user-defined PRNG class.
  // 2) use it as a dispatch key while registering custom kernels
  //    (templatized kernels specialized for user-defined PRNG class)
  // intended for out of tree use; tested by aten/src/ATen/test/rng_test.cpp
  CustomRNGKeyId,
  // Here are backends which specify more specialized operators
  // based on the layout of the tensor. Note that the sparse backends
  // are one case where ordering matters: sparse multi-dispatches with
  // the corresponding dense tensors, and must be handled before them.
  MkldnnCPU, // registered at build/aten/src/ATen/RegisterMkldnnCPU.cpp
             // NB: not to be confused with MKLDNN, which is Caffe2 only
  SparseCPU, // registered at build/aten/src/ATen/RegisterSparseCPU.cpp
  SparseCUDA, // registered at build/aten/src/ATen/RegisterSparseCUDA.cpp
  SparseHIP, // TODO: I think this is not actually used, due to Note
             // [Masquerading as CUDA]
  SparseXPU, // For out of tree Intel's heterogeneous computing plug-in
  SparseVE, // For out of tree & closed source integration of SX-Aurora / NEC
  SparseCsrCPU,
  SparseCsrCUDA,
  AutogradOther,
  AutogradCPU,
  AutogradCUDA,
  AutogradXLA,
  AutogradLazy,
  AutogradXPU,
  AutogradMLC,
  AutogradHPU,
  ......
};
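These keys end up in each tensor's DispatchKeySet, which is exactly what the key extraction in Dispatcher::call reads. Here is a small sketch that inspects those sets through the public Tensor::key_set() accessor (the exact set printed depends on the PyTorch version and build):

#include <ATen/ATen.h>
#include <iostream>

void show_dispatch_keys() {
  at::Tensor dense = at::zeros({2, 2});    // strided CPU tensor
  at::Tensor sparse = dense.to_sparse();   // sparse CPU tensor

  std::cout << dense.key_set() << std::endl;   // includes CPU (plus autograd-related keys)
  std::cout << sparse.key_set() << std::endl;  // includes SparseCPU (plus autograd-related keys)
}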
How the key is used
Due to space constraints we cannot dig into every case, so here we only show the scenario that starts from DeviceType. From the following function we can see how a DeviceType is mapped to a DispatchKey.
template <typename Func>
inline CppFunction dispatch(c10::DeviceType type, Func&& raw_f) {
  auto deviceTypeToDispatchKey = [](c10::DeviceType t){
    switch (t) {
      // This list is synchronized with the k-constants in c10/core/DeviceType.h
      case c10::DeviceType::CPU:
        return c10::DispatchKey::CPU;
      case c10::DeviceType::CUDA:
        return c10::DispatchKey::CUDA;
      case c10::DeviceType::XLA:
        return c10::DispatchKey::XLA;
      case c10::DeviceType::Lazy:
        return c10::DispatchKey::Lazy;
      case c10::DeviceType::MLC:
        return c10::DispatchKey::MLC;
      case c10::DeviceType::Meta:
        return c10::DispatchKey::Meta;
      case c10::DeviceType::HIP:
        return c10::DispatchKey::HIP;
      case c10::DeviceType::MSNPU:
        return c10::DispatchKey::MSNPU;
      case c10::DeviceType::HPU:
        return c10::DispatchKey::HPU;
      default:
        TORCH_CHECK(false,
          "Device type ", t, " cannot be overloaded at dispatch time, "
          "please file a bug report explaining what you were trying to do.");
    }
  };
  return dispatch(deviceTypeToDispatchKey(type), std::forward<Func>(raw_f));
}
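For completeness, here is a sketch of how this DeviceType overload is used when registering a kernel; it is an alternative spelling of the TORCH_LIBRARY_IMPL registration shown earlier, again with the hypothetical myops::myadd operator:

#include <torch/library.h>
#include <ATen/ATen.h>

// Hypothetical kernel from the earlier registration sketch.
at::Tensor myadd_cpu(const at::Tensor& self, const at::Tensor& other);

// Same registration as before, but naming the device instead of using
// TORCH_LIBRARY_IMPL(myops, CPU, ...): torch::dispatch maps
// c10::DeviceType::CPU to DispatchKey::CPU through the lambda shown above.
TORCH_LIBRARY(myops, m) {
  m.def("myadd(Tensor self, Tensor other) -> Tensor");
  m.impl("myadd", torch::dispatch(c10::DeviceType::CPU, TORCH_FN(myadd_cpu)));
}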
Summary
At this point we know that, through the Dispatcher mechanism, PyTorch can dispatch to different operators based on dtype, device, and layout. This answers our third question: how do we switch seamlessly between CPU and GPU operations?
As for the fourth question (do we need to move the loss function onto the GPU?), we now have an answer as well:
The loss function's arguments are the outputs of the forward pass and the labels. The outputs are already on the GPU (because the training data is already on the GPU), and the labels have also been moved to the GPU manually by the user. Since all of the loss function's arguments are already on the GPU, the Dispatcher will, based on the device, select the corresponding GPU operator, so there is no need to move the loss function itself onto the GPU.
The overall logic is summarized below; the sequence is as follows (a C++ sketch of the sequence appears after the list):
- Move the training data inputs to the GPU.
- Run the forward pass. Suppose there is only one operator, op1; the dispatch key device='GPU' is used to look it up in the Dispatcher.
- op1-gpu is found and executed, producing outputs.
- outputs therefore automatically lives on the GPU.
- Move the labels onto the GPU as well.
- Run the loss computation. Suppose there is again only one operator, op2; since all of the loss function's arguments are on the GPU, the dispatch key device='GPU' is again used to look it up in the Dispatcher.
- op2-gpu is found and executed, producing loss.
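A minimal C++-frontend sketch of this sequence (my own example; it assumes a CUDA build and that the model's parameters have already been moved to the GPU with model->to(torch::kCUDA)):

#include <torch/torch.h>

// Hypothetical training step; the model is assumed to already be on the GPU.
void train_step(torch::nn::Linear& model, torch::Tensor inputs, torch::Tensor labels) {
  torch::Device gpu(torch::kCUDA);
  inputs = inputs.to(gpu);                   // 1. move the training data to the GPU
  auto outputs = model->forward(inputs);     // 2-4. op1 is looked up with the GPU key; outputs lives on the GPU
  labels = labels.to(gpu);                   // 5. move the labels to the GPU
  auto loss = torch::nn::functional::mse_loss(outputs, labels);  // 6-7. op2 also dispatches to a GPU kernel
  loss.backward();                           // gradients are likewise computed by GPU kernels
}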
[Figure: overall dispatch flow: inputs and labels are moved to the GPU, op1-gpu produces outputs, and op2-gpu produces loss.]