A 10,000-Word Long Read: The 60 Minute Blitz

O聽_海_軒O 2021-03-30


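Hi everyone, I'm Jack.

This article is a translation of the official tutorial DEEP LEARNING WITH PYTORCH: A 60 MINUTE BLITZ, which gets you started with PyTorch in about 60 minutes.

Bookmarking == learning, so let's get started!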

Author: 鑫鑫淼淼焱焱 (republished with permission)
https://zhuanlan.zhihu.com/p/66543791

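1. What Is PyTorch

PyTorch is a Python-based scientific computing package aimed at two groups of users:

Those who want a replacement for NumPy that can use the power of GPUs;
Those who want a deep learning research platform that offers more flexibility and speed.

PyTorch is a deep learning framework developed by Facebook. It is based on Torch, moving from the rarely used Lua language to Python. Torch was a well-known deep learning framework before TensorFlow was open-sourced. After PyTorch was open-sourced, its ease of use and dynamic computation graphs attracted a great deal of attention, and it became TensorFlow's biggest competitor. Its GitHub repository currently has more than 47,000 stars.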

GitHub: https://github.com/pytorch/pytorch
Website: https://pytorch.org/
Forum: https://discuss.pytorch.org/

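1.1 Installation

To install PyTorch, follow the official guide at:

https://pytorch.org/get-started/locally/

Following the prompts, choose your operating system (Linux, Mac, or Windows), the package manager (Conda, Pip, LibTorch, or building from source), and the language (Python 2.7, Python 3.5/3.6/3.7, or C++). For the GPU build you also need to choose a CUDA version. For example, choosing Conda with CUDA 9.0 gives this install command: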

conda install pytorch torchvision cudatoolkit=9.0 -c pytorch

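Conda (that is, Anaconda) is recommended here, mainly because it lets you keep separate environments with different configurations. For more on Anaconda, see my earlier article "Stop Fiddling with Your Development Environment: A Once-and-for-All Setup".

This installs the latest version of PyTorch, which is 1.1 at the time of writing. To install an earlier version, go to:

https://pytorch.org/get-started/previous-versions/

That page gives, for example, the commands for installing PyTorch 0.4.1 with different CUDA versions or without CUDA.

Other installation methods are available as well; see the links above for details.

After installation, run the following code: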

from __future__ import print_function
import torch
x = torch.rand(5, 3)
print(x)

If the output looks similar to the following, the installation succeeded:

tensor([[0.3380, 0.3845, 0.3217],
        [0.8337, 0.9050, 0.2650],
        [0.2979, 0.7141, 0.9069],
        [0.1449, 0.1132, 0.1375],
        [0.4675, 0.3947, 0.1426]])

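Next, verify that PyTorch can run on the GPU. In the code below, torch.cuda.is_available() checks whether the current GPU can be used: if it returns True, GPU execution is available; otherwise it is not.

import torch
print(torch.cuda.is_available())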

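1.2 Tensors

One major use of PyTorch is as a replacement for NumPy, so we start with Tensors. A tensor is the counterpart of a NumPy multi-dimensional array (ndarray); the difference is that tensors can also be placed on a GPU to speed up computation.

First import the required library, mainly torch: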

from __future__ import print_function
import torch

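1.2.1 Declaring and Defining Tensors

Tensors can be declared and defined in several ways:

torch.empty(): declares an uninitialized matrix.

# Create a 5*3 matrix
x = torch.empty(5, 3)
print(x)

Output (the values are whatever happened to be in memory):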

tensor([[9.2737e-41, 8.9074e-01, 1.9286e-37],
        [1.7228e-34, 5.7064e+01, 9.2737e-41],
        [2.2803e+02, 1.9288e-37, 1.7228e-34],
        [1.4609e+04, 9.2737e-41, 5.8375e+04],
        [1.9290e-37, 1.7228e-34, 3.7402e+06]])

torch.rand(): creates a randomly initialized matrix.

# Create a randomly initialized 5*3 matrix
rand_x = torch.rand(5, 3)
print(rand_x)

Output:

tensor([[0.4311, 0.2798, 0.8444],
        [0.0829, 0.9029, 0.8463],
        [0.7139, 0.4225, 0.5623],
        [0.7642, 0.0329, 0.8816],
        [1.0000, 0.9830, 0.9256]])

torch.zeros(): creates a matrix filled with zeros.

# Create a matrix of zeros with dtype long
zero_x = torch.zeros(5, 3, dtype=torch.long)
print(zero_x)

Output:

tensor([[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]])

Similarly, torch.ones creates a matrix filled with ones.

torch.tensor(): creates a tensor directly from the given values.

# The tensor values are [5.5, 3]
tensor1 = torch.tensor([5.5, 3])
print(tensor1)

Output:

tensor([5.5000, 3.0000])

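Besides the methods above, you can create a new tensor from an existing one. The advantage is that the new tensor keeps properties of the existing tensor, such as its dtype (and, for the *_like functions, its size), unless you explicitly override them. The methods are:

tensor.new_ones(): the new_*() methods take the size as an argument.

# Explicitly set the new size to 5*3 and the dtype to torch.double
tensor2 = tensor1.new_ones(5, 3, dtype=torch.double)  # new_* methods take the tensor size
print(tensor2)

Output: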

tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]], dtype=torch.float64)

torch.randn_like(old_tensor): keeps the same size.

# Override the dtype
tensor3 = torch.randn_like(tensor2, dtype=torch.float)
print('tensor3: ', tensor3)

The output is shown below. The new tensor is created from tensor2 declared above, so its size is still 5*3, but its dtype has changed.

tensor3:  tensor([[-0.4491, -0.2634, -0.0040],
        [-0.1624,  0.4475, -0.8407],
        [-0.6539, -1.2772,  0.6060],
        [ 0.2304,  0.0879, -0.3876],
        [ 1.2900, -0.7475, -1.8212]])

Finally, a tensor's size can be obtained with tensor.size():

print(tensor3.size())
# Output: torch.Size([5, 3])

Note: torch.Size is in fact a tuple, so it supports all tuple operations.

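1.2.2 Operations

There are many operations, each with several syntaxes. As this is a quick introduction, only addition is used as an example here. For more operations, including transposing, indexing, slicing, mathematical operations, linear algebra, and random numbers, see the official documentation:

https://pytorch.org/docs/stable/torch.html

Addition can be written in several ways:

The + operator
torch.add(tensor1, tensor2, [out=tensor3])
tensor1.add_(tensor2): modifies tensor1 in place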

tensor4 = torch.rand(5, 3)
print('tensor3 + tensor4= ', tensor3 + tensor4)
print('tensor3 + tensor4= ', torch.add(tensor3, tensor4))
# Declare a new tensor to hold the result of the addition
result = torch.empty(5, 3)
torch.add(tensor3, tensor4, out=result)
print('add result= ', result)
# Modify the tensor in place
tensor3.add_(tensor4)
print('tensor3= ', tensor3)

Output:

tensor3 + tensor4=  tensor([[ 0.1000,  0.1325,  0.0461],
        [ 0.4731,  0.4523, -0.7517],
        [ 0.2995, -0.9576,  1.4906],
        [ 1.0461,  0.7557, -0.0187],
        [ 2.2446, -0.3473, -1.0873]])

tensor3 + tensor4=  tensor([[ 0.1000,  0.1325,  0.0461],
        [ 0.4731,  0.4523, -0.7517],
        [ 0.2995, -0.9576,  1.4906],
        [ 1.0461,  0.7557, -0.0187],
        [ 2.2446, -0.3473, -1.0873]])

add result=  tensor([[ 0.1000,  0.1325,  0.0461],
        [ 0.4731,  0.4523, -0.7517],
        [ 0.2995, -0.9576,  1.4906],
        [ 1.0461,  0.7557, -0.0187],
        [ 2.2446, -0.3473, -1.0873]])

tensor3=  tensor([[ 0.1000,  0.1325,  0.0461],
        [ 0.4731,  0.4523, -0.7517],
        [ 0.2995, -0.9576,  1.4906],
        [ 1.0461,  0.7557, -0.0187],
        [ 2.2446, -0.3473, -1.0873]])

Note: any operation that modifies a tensor in place has a trailing underscore _. For example, x.copy_(y) and x.t_() both change x.

Besides arithmetic, tensors can be accessed with NumPy-style indexing to select data along a given dimension, as shown below:

# Access the first column of tensor3
print(tensor3[:, 0])

Output:

tensor([0.1000, 0.4731, 0.2995, 1.0461, 2.2446])

To change a tensor's shape, use tensor.view():

x = torch.randn(4, 4)
y = x.view(16)
# -1 means this dimension is inferred from the remaining dimensions
z = x.view(-1, 8)
print(x.size(), y.size(), z.size())

Output:

torch.Size([4, 4]) torch.Size([16]) torch.Size([2, 8])
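
How the -1 gets resolved can be sketched in plain Python, without PyTorch. The helper name infer_view_shape is made up for this illustration and is not a PyTorch function; it only mirrors the rule that at most one dimension may be -1 and that the sizes must multiply to the total number of elements.

```python
def infer_view_shape(numel, shape):
    """Resolve a single -1 entry in `shape` so the sizes multiply to `numel`."""
    known = 1
    unknown_index = None
    for i, dim in enumerate(shape):
        if dim == -1:
            if unknown_index is not None:
                raise ValueError("only one dimension can be -1")
            unknown_index = i
        else:
            known *= dim
    if unknown_index is None:
        # Nothing to infer: the shape must already match
        if known != numel:
            raise ValueError("shape does not match the number of elements")
        return tuple(shape)
    if known == 0 or numel % known != 0:
        raise ValueError("shape does not match the number of elements")
    resolved = list(shape)
    resolved[unknown_index] = numel // known
    return tuple(resolved)


# A 4*4 tensor has 16 elements
print(infer_view_shape(16, (16,)))    # (16,)
print(infer_view_shape(16, (-1, 8)))  # (2, 8)
```

This matches the sizes printed above: 16 elements viewed as (-1, 8) must have 2 rows.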

If a tensor has only one element, use .item() to get its value as a Python number:

x = torch.randn(1)
print(x)
print(x.item())

Output:

tensor([0.4549])
0.4549027979373932

For more operations, see the official documentation:

https://pytorch.org/docs/stable/torch.html

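1.3 Converting to and from NumPy Arrays

Tensors and NumPy arrays can be converted to each other. When the tensor lives on the CPU, the two share the same underlying memory, so changing one also changes the other.

1.3.1 Converting a Tensor to a NumPy Array

Calling tensor.numpy() performs the conversion, as in this example: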

a = torch.ones(5)
print(a)
b = a.numpy()
print(b)

Output:

tensor([1., 1., 1., 1., 1.])
[1. 1. 1. 1. 1.]

As mentioned, the two share the same memory. In the example below, we modify the tensor a and check whether the NumPy array b obtained from it changes too.

a.add_(1)
print(a)
print(b)

The output shows that b changes along with a:

tensor([2., 2., 2., 2., 2.])
[2. 2. 2. 2. 2.]

1.3.2 Converting a NumPy Array to a Tensor

The conversion is done by calling torch.from_numpy(numpy_array). For example:

import numpy as np
a = np.ones(5)
b = torch.from_numpy(a)
np.add(a, 1, out=a)
print(a)
print(b)

Output:

[2. 2. 2. 2. 2.]
tensor([2., 2., 2., 2., 2.], dtype=torch.float64)

On the CPU, all tensor types except CharTensor support conversion to and from NumPy arrays.

1.4 CUDA Tensors

Tensors can be moved to a different device, CPU or GPU, with the .to method. For example:

# Run this block only when CUDA is available; torch.device objects control whether tensors are computed on the GPU
if torch.cuda.is_available():
    device = torch.device('cuda')          # a CUDA device object
    y = torch.ones_like(x, device=device)  # create a tensor directly on the GPU
    x = x.to(device)                       # or simply .to('cuda')
    z = x + y
    print(z)
    print(z.to('cpu', torch.double))       # .to() can also change the dtype

In the output, the first result is on the GPU, so it is printed with device='cuda:0'; the second is on the CPU.

tensor([1.4549], device='cuda:0')
tensor([1.4549], dtype=torch.float64)

Tutorial for this section:

https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html

Code for this section:

https://github.com/ccc013/DeepLearning_Notes/blob/master/Pytorch/practise/basic_practise.ipynb

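2. autograd

For neural networks in PyTorch, a key package is autograd. It provides automatic differentiation for all operations on Tensors, that is, it computes gradients. It is a define-by-run framework: backpropagation is defined by how your code runs, so every iteration can be different.

The following examples briefly show what the package does.

2.1 Tensors

torch.Tensor is the central class of PyTorch. If you set its attribute .requires_grad=True, it starts tracking all operations on that tensor. When the computation is finished, you can call .backward() to compute all the gradients automatically; the gradients are accumulated into the .grad attribute.

To stop a tensor from tracking history, call .detach(), which detaches it from the computation history and prevents future computation from being tracked.

To prevent tracking history (and using memory) for a whole block of code, wrap it in with torch.no_grad():. This is especially useful when evaluating a model, because the model may have trainable parameters with requires_grad=True whose gradients are not needed.

One more class is very important for the autograd implementation: Function.

Tensor and Function are interconnected and build up an acyclic graph that encodes the complete history of computation. Every tensor has a .grad_fn attribute referencing the Function that created it (except for tensors created by the user, whose grad_fn is None).

To compute derivatives, call .backward() on a Tensor. If the tensor is a scalar, i.e. it holds a single element, .backward() needs no arguments. If it has more elements, you must pass a gradient argument, a tensor of matching shape; see section 2.2 on gradients.

Let's go through this in code.

First import the required library:

import torch

Create a tensor and set requires_grad=True to track computation on it: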

x = torch.ones(2, 2, requires_grad=True)
print(x)

Output:

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)

Perform any operation; here a simple addition:

y = x + 2
print(y)

Output:

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward>)

y was created as the result of an operation, so it has a grad_fn:

print(y.grad_fn)

Output:

<AddBackward object at 0x00000216D25DCC88>

Do more operations on y:

z = y * y * 3
out = z.mean()

print('z=', z)
print('out=', out)

Output:

z= tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward>)

out= tensor(27., grad_fn=<MeanBackward1>)

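By default, a tensor's requires_grad is False. You can set it to True when creating the tensor, as above, or afterwards by calling .requires_grad_(True). The trailing underscore means the call changes the tensor in place, as explained for add_() in the previous section. For example: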

a = torch.randn(2, 2)
a = ((a * 3) / (a - 1))
print(a.requires_grad)
a.requires_grad_(True)
print(a.requires_grad)
b = (a * a).sum()
print(b.grad_fn)

The output is shown below. The first line is the result before requires_grad is set; after the explicit call to .requires_grad_(True), the result is True.

False

True

<SumBackward0 object at 0x00000216D25ED710>

2.2 Gradients

Now we compute gradients by backpropagation. out, defined in the previous subsection, is a scalar, so out.backward() is equivalent to out.backward(torch.tensor(1.)):

out.backward()
# Print the gradient d(out)/dx
print(x.grad)

Output:

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])

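The result is a matrix whose entries are all 4.5. Writing o for out, the definitions above give:

$$o = \frac{1}{4}\sum_i z_i, \qquad z_i = 3(x_i + 2)^2, \qquad z_i\big|_{x_i=1} = 27$$

In detail: x is initialized as a matrix of ones, the addition x + 2 gives y, and y * y * 3 gives z. Since z is a 2*2 matrix, taking the mean divides the sum by 4 to give out, which yields the three formulas above.

Therefore, the gradient is:

$$\frac{\partial o}{\partial x_i} = \frac{3}{2}(x_i + 2), \qquad \frac{\partial o}{\partial x_i}\bigg|_{x_i=1} = \frac{9}{2} = 4.5$$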
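
The 4.5 value can also be checked numerically, without autograd. The following is a minimal plain-Python sketch that rebuilds out = mean(3 * (x + 2)^2) by hand and estimates each partial derivative with a central difference; the helper names out and numerical_grad are chosen for this illustration only.

```python
def out(x):
    """out = mean(3 * (x_i + 2)^2) over the elements of x."""
    return sum(3 * (xi + 2) ** 2 for xi in x) / len(x)


def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of d f / d x_i for every i."""
    grads = []
    for i in range(len(x)):
        plus = list(x)
        plus[i] += eps
        minus = list(x)
        minus[i] -= eps
        grads.append((f(plus) - f(minus)) / (2 * eps))
    return grads


x = [1.0, 1.0, 1.0, 1.0]  # the 2*2 matrix of ones, flattened
print(out(x))             # 27.0
grads = numerical_grad(out, x)
print([round(g, 4) for g in grads])  # every entry is close to 4.5
```

Finite differences only approximate the derivative, which is why the values are rounded before printing; autograd computes the exact value from the recorded graph.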

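Mathematically, if you have a vector-valued function

$$\vec{y} = f(\vec{x})$$

then the gradient of $\vec{y}$ with respect to $\vec{x}$ is a Jacobian matrix:

$$J = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{pmatrix}$$

Generally speaking, torch.autograd is an engine for computing vector-Jacobian products. We skip the rest of the math here and go straight to a code example: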

x = torch.randn(3, requires_grad=True)

y = x * 2
while y.data.norm() < 1000:
    y = y * 2

print(y)

Output:

tensor([ 237.5009, 1774.2396,  274.0625], grad_fn=<MulBackward>)

Here y is no longer a scalar. torch.autograd cannot compute the full Jacobian matrix directly, but passing a vector to backward() as an argument gives the vector-Jacobian product:

v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(v)

print(x.grad)

Output:

tensor([ 102.4000, 1024.0000,    0.1024])
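
To see where these numbers come from, here is a small plain-Python sketch of a vector-Jacobian product. It assumes, based on the printed gradient, that y ended up as 1024 * x in this run (ten doublings in total), so the Jacobian is 1024 times the identity; the function name vector_jacobian_product is made up for the illustration.

```python
def vector_jacobian_product(v, jacobian):
    """Return v^T J, where jacobian[i][j] = d y_i / d x_j."""
    n_inputs = len(jacobian[0])
    return [sum(v[i] * jacobian[i][j] for i in range(len(v)))
            for j in range(n_inputs)]


# Assumption: y = 1024 * x in the run above, so J = 1024 * I
scale = 1024.0
jacobian = [[scale if i == j else 0.0 for j in range(3)] for i in range(3)]
v = [0.1, 1.0, 0.0001]

vjp = vector_jacobian_product(v, jacobian)
print(vjp)  # approximately [102.4, 1024.0, 0.1024]
```

In other words, y.backward(v) stores v^T J in x.grad, and autograd obtains it without ever building the full matrix J.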

Finally, wrapping code in with torch.no_grad() stops autograd from tracking history on tensors:

print(x.requires_grad)
print((x ** 2).requires_grad)

with torch.no_grad():
    print((x ** 2).requires_grad)

Output:

True

True

False

More about autograd and Function:

https://pytorch.org/docs/stable/autograd.html

Tutorial for this section:

https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html

Code for this section:

https://github.com/ccc013/DeepLearning_Notes/blob/master/Pytorch/practise/basic_practise.ipynb

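3. Neural Networks

In PyTorch, neural networks are built with the torch.nn package. An nn.Module contains the layers of a network and a method forward(input) that returns the network's output.

A classic example is LeNet, a network for classifying character images.

A typical training procedure for a neural network is:

Define a multi-layer neural network
Preprocess the dataset and prepare it as the network's input
Feed the data into the network
Compute the network's loss
Backpropagate to compute the gradients
Update the network's weights; a simple update rule is weight = weight - learning_rate * gradient

3.1 Defining the Network

First define a neural network. The one below is a 5-layer convolutional network with two convolutional layers and three fully connected layers: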

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # The input image has 1 channel; conv1 uses a 5*5 kernel and outputs 6 channels
        self.conv1 = nn.Conv2d(1, 6, 5)
        # conv2 uses a 5*5 kernel and outputs 16 channels
        self.conv2 = nn.Conv2d(6, 16, 5)
        # Fully connected layers
        self.fc1 = nn.Linear(16*5*5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square, a single number is enough: 2 instead of (2, 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        # All dimensions except the batch dimension
        size = x.size()[1:]
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

net = Net()
print(net)

The printed network structure:

Net(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)

You only have to implement the forward function; the backward function is defined automatically by autograd. Any tensor operation can be used in forward.

net.parameters() returns the network's learnable parameters. For example:

params = list(net.parameters())
print('Number of parameters: ', len(params))
# conv1.weight
print('Size of the first parameter: ', params[0].size())

Output:

Number of parameters:  10
Size of the first parameter:  torch.Size([6, 1, 5, 5])

Now test the network with a random 32*32 input:

# Feed a random input into the network
input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)

Output:

tensor([[ 0.1005,  0.0263,  0.0013, -0.1157, -0.1197, -0.0141,  0.1425, -0.0521,
          0.0689,  0.0220]], grad_fn=<ThAddmmBackward>)
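
Why does fc1 expect 16*5*5 = 400 input features? The arithmetic can be traced with a short plain-Python sketch, assuming the 32*32 input used above, 5*5 kernels with stride 1 and no padding, and 2*2 max pooling. The helper names conv_out and pool_out are illustrative only.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window):
    """Spatial output size of non-overlapping max pooling."""
    return size // window


size = 32
size = pool_out(conv_out(size, 5), 2)  # conv1 + pool: 32 -> 28 -> 14
size = pool_out(conv_out(size, 5), 2)  # conv2 + pool: 14 -> 10 -> 5
flat_features = 16 * size * size       # conv2 outputs 16 channels
print(size, flat_features)             # 5 400
```

This is also why the network expects 32*32 inputs: a different input size would change the 5*5 spatial size and no longer match fc1.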

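Before backpropagating, clear the gradient buffers; then backpropagate with a random gradient:

# Zero the gradient buffers of all parameters, then backpropagate a random gradient
net.zero_grad()
out.backward(torch.randn(1, 10))

Note:

torch.nn only supports mini-batches, so the input cannot be a single sample. For example, nn.Conv2d takes a 4-D tensor of nSamples * nChannels * Height * Width. If you have a single sample, use input.unsqueeze(0) to add a fake batch dimension, turning the 3-D tensor into a 4-D one.

3.2 Loss Function

A loss function takes the pair (output, target), i.e. the network output and the ground-truth labels, and returns a value that measures how far the output is from the target.

PyTorch already provides many loss functions. Here we use a simple one, the mean squared error nn.MSELoss: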

output = net(input)
# Dummy target
target = torch.randn(10)
# Reshape it to the same size as output
target = target.view(1, -1)
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

Output:

tensor(0.6524, grad_fn=<MseLossBackward>)
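
For reference, the mean squared error itself is simple to write out. The sketch below is plain Python rather than the nn.MSELoss implementation, and the output and target values are made up for the illustration.

```python
def mse_loss(output, target):
    """Mean of the squared element-wise differences."""
    if len(output) != len(target):
        raise ValueError("output and target must have the same length")
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)


output = [0.5, -1.0, 2.0]  # hypothetical network outputs
target = [0.0, -1.0, 1.0]  # hypothetical targets
loss_value = mse_loss(output, target)
print(loss_value)          # (0.25 + 0.0 + 1.0) / 3
```

The loss is zero only when every output equals its target, and large errors dominate because the differences are squared.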

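The computation graph from the network's input to its output and loss looks like this; it is simply the path data takes from the input layer to the output layer and then to the loss: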

input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d
      -> view -> linear -> relu -> linear -> relu -> linear
      -> MSELoss
      -> loss

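When loss.backward() is called, the whole graph is differentiated with respect to the loss, and every tensor in the graph with requires_grad=True has its .grad tensor accumulated with the gradient.

To illustrate in code: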

# MSELoss
print(loss.grad_fn)
# Linear layer
print(loss.grad_fn.next_functions[0][0])
# Relu
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])

Output:

<MseLossBackward object at 0x0000019C0C349908>

<ThAddmmBackward object at 0x0000019C0C365A58>

<ExpandBackward object at 0x0000019C0C3659E8>

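3.3 Backpropagation

To backpropagate, just call loss.backward(). You need to clear the existing gradients first with .zero_grad(); otherwise the new gradients are accumulated onto the old ones, which corrupts the weight update.

The example below shows the gradient of conv1's bias before and after the backward pass:

# Zero the gradient buffers of all parameters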
net.zero_grad()
print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

Output:

conv1.bias.grad before backward
tensor([0., 0., 0., 0., 0., 0.])

conv1.bias.grad after backward
tensor([ 0.0069,  0.0021,  0.0090, -0.0060, -0.0008, -0.0073])

To learn more about the torch.nn package, see the official documentation:

https://pytorch.org/docs/stable/nn.html

3.4 Updating the Weights

The simplest update rule, used by stochastic gradient descent (SGD), is:

weight = weight - learning_rate * gradient

In code, this rule looks like:

# A simple weight update
learning_rate = 0.01
for f in net.parameters():
    f.data.sub_(f.grad.data * learning_rate)
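
To watch the same rule converge, here is a toy example in plain Python. It minimizes the made-up loss f(w) = (w - 3)^2, whose gradient is 2 * (w - 3); the loss, the starting point, and the learning rate are assumptions chosen for the illustration, not part of the tutorial's network.

```python
def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


weight = 0.0
learning_rate = 0.1
for step in range(100):
    # weight = weight - learning_rate * gradient
    weight = weight - learning_rate * grad(weight)

print(round(weight, 4))  # 3.0, the minimizer of the toy loss
```

Each step shrinks the distance to the minimum by a factor of 0.8 here; with a learning rate that is too large the iterates would overshoot and diverge instead.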

This is only the simplest rule. Deep learning uses many optimization algorithms besides SGD, such as Nesterov-SGD, Adam, and RMSProp. To use them, PyTorch provides the torch.optim package:

import torch.optim as optim
# Create the optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# In the training loop:
optimizer.zero_grad() # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
# Update the weights
optimizer.step()

Note that optimizer.zero_grad() must again be called to clear the gradient buffers.

Tutorial for this section:

https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html

Code for this section:

https://github.com/ccc013/DeepLearning_Notes/blob/master/Pytorch/practise/neural_network.ipynb

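4. Training a Classifier

The previous section covered how to build a neural network, compute the loss, and update the network's weights. Next we put this together to build an image classifier.

4.1 Training Data

Before training a classifier we need to think about the data. Generally, when dealing with image, text, audio, or video data, you can load it into a NumPy array with standard Python packages and then convert the array into a PyTorch tensor.

For images, Pillow and OpenCV are useful;
For audio, there are scipy and librosa;
For text, load the data with raw Python or Cython, or use NLTK and SpaCy.

For computer vision, PyTorch provides a dedicated package, torchvision. It includes data loaders for common datasets such as Imagenet, CIFAR10, and MNIST, as well as data transformers for images; the relevant modules are torchvision.datasets and torch.utils.data.DataLoader.

This tutorial uses the CIFAR10 dataset. It has 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The images are of size 3x32x32, i.e. 3-channel color images of 32x32 pixels.

4.2 Training an Image Classifier

The training procedure is:

1. Load and normalize the CIFAR10 training and test sets using torchvision;
2. Define a convolutional neural network;
3. Define a loss function;
4. Train the network on the training set;
5. Test the network on the test set.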

4.2.1 Loading and Normalizing CIFAR10

First import the required packages:

import torch
import torchvision
import torchvision.transforms as transforms

The torchvision datasets output PILImage images with values in the range [0, 1]. We transform them to tensors normalized to the range [-1, 1]:

# Normalize image data from [0, 1] to [-1, 1]
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
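
The effect of Normalize with mean 0.5 and standard deviation 0.5 can be checked by hand: each channel value x becomes (x - 0.5) / 0.5. The plain-Python sketch below applies that formula and its inverse to a few pixel values; the function names are illustrative, not torchvision APIs.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Per-channel normalization: (pixel - mean) / std."""
    return (pixel - mean) / std


def unnormalize(value, mean=0.5, std=0.5):
    """Inverse mapping: value * std + mean."""
    return value * std + mean


for pixel in (0.0, 0.5, 1.0):
    value = normalize(pixel)
    # Prints 0.0 -> -1.0 -> 0.0, then 0.5 -> 0.0 -> 0.5, then 1.0 -> 1.0 -> 1.0
    print(pixel, "->", value, "->", unnormalize(value))
```

So [0, 1] is mapped onto [-1, 1], and multiplying by 0.5 and adding 0.5 (equivalently, dividing by 2 and adding 0.5) recovers the original values for display.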

Once the data is downloaded, we can visualize some training images:

import matplotlib.pyplot as plt
import numpy as np

# Function to show an image
def imshow(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()


# Get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)

# Show the images
imshow(torchvision.utils.make_grid(images))
# Print the class labels
print(' '.join('%5s' % classes[labels[j]] for j in range(4)))

The images are displayed in a grid, and their class labels are:

 frog plane   dog  ship

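4.2.2 Defining a Convolutional Neural Network

We reuse the network defined in the previous section; the only change is conv1's input channels, from 1 to 3, because the input is now a 3-channel color image.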

import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()

4.2.3 Defining the Loss Function and Optimizer

We use the classification cross-entropy loss and SGD with momentum:

import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
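
For intuition about the loss values seen during training, the cross-entropy for a single sample can be written in plain Python. nn.CrossEntropyLoss applies log-softmax to the raw scores and takes the negative log-probability of the true class, averaged over the batch by default; the sketch below covers one sample only, and the scores are made up.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax probability of the true class for one sample."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


uniform = [0.0] * 10             # an untrained net that scores all classes equally
chance_loss = cross_entropy(uniform, 3)
print(round(chance_loss, 3))     # 2.303, i.e. log(10)

confident = [0.0] * 10
confident[3] = 10.0              # strongly favors the correct class
confident_loss = cross_entropy(confident, 3)
print(confident_loss < 0.001)    # True
```

A loss near log(10) ≈ 2.3 therefore means the network is doing no better than guessing among the 10 classes, which is a useful baseline when reading a training log.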

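4.2.4 Training the Network

The fourth step is to train the network: choose the number of epochs, feed in the data, and every fixed number of iterations print information about the network, such as the loss or the accuracy.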

import time
start = time.time()
for epoch in range(2):

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # Get the inputs
        inputs, labels = data
        # Zero the gradient buffers
        optimizer.zero_grad()

        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # Print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:
            # Print once every 2000 iterations
            print('[%d, %5d] loss: %.3f' % (epoch + 1, i+1, running_loss / 2000))
            running_loss = 0.0
print('Finished Training! Total cost time: ', time.time()-start)

Training runs for 2 epochs here. The log is shown below; it took about 77 s.

[1,  2000] loss: 2.226
[1,  4000] loss: 1.897
[1,  6000] loss: 1.725
[1,  8000] loss: 1.617
[1, 10000] loss: 1.524
[1, 12000] loss: 1.489
[2,  2000] loss: 1.407
[2,  4000] loss: 1.376
[2,  6000] loss: 1.354
[2,  8000] loss: 1.347
[2, 10000] loss: 1.324
[2, 12000] loss: 1.311

Finished Training! Total cost time:  77.24696755409241

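4.2.5 Testing the Model

Once the network is trained, we test it on the test set to check how well it generalizes. For image classification, accuracy is the usual metric.

First, a small test on one batch of images. Here batch=4, i.e. 4 images: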

dataiter = iter(testloader)
images, labels = next(dataiter)

# Show the images
imshow(torchvision.utils.make_grid(images))
print('GroundTruth: ', ' '.join('%5s' % classes[labels[j]] for j in range(4)))

The images are displayed in a grid, and their labels are:

GroundTruth:    cat  ship  ship plane

Now feed these four images to the network and look at its predictions:

# Network outputs
outputs = net(images)

# Predictions
_, predicted = torch.max(outputs, 1)
print('Predicted: ', ' '.join('%5s' % classes[predicted[j]] for j in range(4)))

Output:

Predicted:    cat  ship  ship  ship

The first three images are predicted correctly; the fourth, a plane, is misclassified as a ship.
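
torch.max(outputs, 1) returns, for each row, the largest score together with its index, and that index is the predicted class. The same idea in plain Python, with made-up scores for two images and an illustrative helper named argmax_rows:

```python
def argmax_rows(scores):
    """Index of the largest score in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# Hypothetical network outputs: one row of 10 class scores per image
scores = [
    [0.1, -1.2, 0.3, 2.5, 0.0, 1.1, 0.2, -0.4, 0.6, -0.9],
    [1.0, 0.8, -0.5, -0.2, -1.0, -0.7, -1.5, -0.3, 3.2, 1.4],
]

predicted = argmax_rows(scores)
print([classes[i] for i in predicted])  # ['cat', 'ship']
```

The scores are not probabilities; only their ordering matters for the prediction.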

Next, let's see what accuracy the network reaches on the whole test set:

correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (100 * correct / total))

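Output:

Accuracy of the network on the 10000 test images: 55 %

Your accuracy may differ: the official tutorial reports 51%, and results vary somewhat with the random weight initialization. Compared with random guessing among 10 classes (10% accuracy), this is a decent result, although in absolute terms it is still poor; we only use a 5-layer network, and it is meant purely as tutorial example code.

We can go one step further and look at the accuracy for each class. The difference from the code above is the line c = (predicted == labels).squeeze(), which yields 1 or 0 (true or false) for each sample depending on whether the prediction equals the true label. When counting correct predictions per class we simply add this value: a correct prediction adds 1 and a wrong one adds 0, leaving the count unchanged.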

class_correct = list(0. for i in range(10))
class_total = list(0. for i in range(10))
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        c = (predicted == labels).squeeze()
        for i in range(4):
            label = labels[i]
            class_correct[label] += c[i].item()
            class_total[label] += 1


for i in range(10):
    print('Accuracy of %5s : %2d %%' % (classes[i], 100 * class_correct[i] / class_total[i]))

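In the output, cat, deer, and bird have the three lowest accuracies, i.e. they are the hardest classes to predict, whereas ship and truck are the most accurate.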

Accuracy of plane : 58 %
Accuracy of   car : 59 %
Accuracy of  bird : 40 %
Accuracy of   cat : 33 %
Accuracy of  deer : 39 %
Accuracy of   dog : 60 %
Accuracy of  frog : 54 %
Accuracy of horse : 66 %
Accuracy of  ship : 70 %
Accuracy of truck : 72 %
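
The per-class bookkeeping can be isolated in a few lines of plain Python. The sketch below uses made-up predictions and labels for three classes, and the helper name per_class_accuracy is illustrative; unlike the loop above, it does not assume a batch size of 4.

```python
def per_class_accuracy(predicted, labels, num_classes):
    """Fraction of correct predictions for each true class (None if unseen)."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for p, y in zip(predicted, labels):
        total[y] += 1
        correct[y] += int(p == y)  # adds 1 when correct, 0 otherwise
    return [c / t if t else None for c, t in zip(correct, total)]


labels = [0, 0, 1, 1, 1, 2]     # hypothetical ground truth
predicted = [0, 1, 1, 1, 0, 2]  # hypothetical predictions
accuracy = per_class_accuracy(predicted, labels, 3)
print(accuracy)  # class 0: 1 of 2 correct, class 1: 2 of 3, class 2: 1 of 1
```

Counts are grouped by the true label, so each value answers "how often is this class recognized", not "how often is this prediction right".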

4.3 Training on GPU

Deep learning naturally benefits from a GPU to speed up training. This section shows how to train on a GPU.

First check whether a GPU is available for training:

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)

The output below means that your first (or only) GPU is available; otherwise cpu is printed.

cuda:0

Since a GPU is available, we can train on it. The code changes needed are to move both the network parameters and the data to the GPU:

net.to(device)
inputs, labels = inputs.to(device), labels.to(device)

The modified training code:

import time
# To train on the GPU, move both the network and the data to the GPU
net.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

start = time.time()
for epoch in range(2):

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # Get the inputs
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        # Zero the gradient buffers
        optimizer.zero_grad()

        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # Print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:
            # Print once every 2000 iterations
            print('[%d, %5d] loss: %.3f' % (epoch + 1, i+1, running_loss / 2000))
            running_loss = 0.0
print('Finished Training! Total cost time: ', time.time() - start)

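Note that the optimizer is defined after calling net.to(device), so the parameters passed to it are CUDA tensors. The training results are similar to before. Because this network is very small, moving it to the GPU does not bring much of a speedup; in my run it was actually slower, possibly because of my laptop's GPU.

To speed things up further, consider using multiple GPUs, which is the topic of the next section.

Tutorial for this section:

https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html

Code for this section: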

https://github.com/ccc013/DeepLearning_Notes/blob/master/Pytorch/practise/train_classifier_example.ipynb

5. Data Parallelism

In this part we learn how to use DataParallel to train a network on multiple GPUs.

First, training a model on a GPU is simple: define a device object and use .to() to put the model's parameters on the chosen GPU.

device = torch.device('cuda:0')
model.to(device)

Then put all the tensors on the GPU:

mytensor = my_tensor.to(device)

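Note that my_tensor.to(device) returns a new copy of my_tensor on the device rather than modifying my_tensor in place, so you need to assign the result to a new tensor and use that tensor.

By default PyTorch uses only one GPU. To use multiple GPUs, wrap the model in DataParallel: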

model = nn.DataParallel(model)

This line is the core of this section; the details follow.

5.1 Imports and Parameters

First import the required libraries and define some parameters:

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Parameters and DataLoaders
input_size = 5
output_size = 2

batch_size = 30
data_size = 100

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

These define the network's input and output sizes, the batch size, and the size of the dataset, along with a device object.

5.2 Building a Dummy Dataset

Next, build a dummy (random) dataset:

class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

5.3 A Simple Model

Next, build a simple model with just one fully connected layer. A print() call is added to monitor the sizes of the input and output tensors:

class Model(nn.Module):
    # Our model

    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print('\tIn Model: input size', input.size(),
              'output size', output.size())

        return output

5.4 Creating the Model and DataParallel

This is the core part of this section. First create a model instance and check whether there are multiple GPUs. If so, wrap the model in nn.DataParallel; then call model.to(device):

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
  print("Let's use", torch.cuda.device_count(), "GPUs!")
  # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
  model = nn.DataParallel(model)

model.to(device)

5.5 Running the Model

Now run the model and look at the printed sizes:

for data in rand_loader:
    input = data.to(device)
    output = model(input)
    print('Outside: input size', input.size(),
          'output_size', output.size())

Output:

        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
        In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
        In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
        In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])

5.6 Results

With one GPU or none, a batch of 30 inputs gives the model 30 inputs and 30 outputs. With multiple GPUs, the results look like this:

2 GPUs

# on 2 GPUs
Let's use 2 GPUs!
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
    In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])

3 GPUs

Let's use 3 GPUs!
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
    In Model: input size torch.Size([10, 5]) output size torch.Size([10, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])

8 GPUs

Let's use 8 GPUs!
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([4, 5]) output size torch.Size([4, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
    In Model: input size torch.Size([2, 5]) output size torch.Size([2, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])
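
The per-GPU sizes in these logs follow a simple pattern: the batch is cut along dimension 0 into chunks of ceil(batch_size / num_gpus) rows, with a smaller final chunk if needed. The plain-Python sketch below reproduces that pattern; it is an assumption inferred from the logs above rather than the DataParallel source code, and chunk_sizes is a made-up helper name.

```python
import math


def chunk_sizes(batch_size, num_gpus):
    """Sizes of the per-GPU chunks when splitting a batch along dim 0."""
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


# 100 samples with batch_size=30 give three batches of 30 and a last one of 10
for gpus in (2, 3, 8):
    print(gpus, chunk_sizes(30, gpus), chunk_sizes(10, gpus))
# 2 [15, 15] [5, 5]
# 3 [10, 10, 10] [4, 4, 2]
# 8 [4, 4, 4, 4, 4, 4, 4, 2] [2, 2, 2, 2, 2]
```

Note that with 8 GPUs the last batch of 10 samples only occupies 5 of them, which matches the five "In Model" lines at the end of the 8-GPU log.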