What is NeurIPS?
• Neural Information Processing Systems
– History of the abbreviation
• Up to 2017: NIPS
• From 2018: NeurIPS
• A top international conference in machine learning
– Originally founded as a conference on neuroscience and neural networks
– Other top international machine learning conferences include
• International Conference on Machine Learning
• International Conference on Learning Representations
Machine learning ≠ deep learning
• At NeurIPS 2020, for example:
– Best papers: only 1 of the 3 was deep-learning research
• The deep-learning one: Language Models are Few-Shot Learners
– The paper introducing the so-called GPT-3
– Invited talks, for example:
• Researchers' social responsibility for the outputs of machine learning models
– You Can't Escape Hyperparameters and Latent Variables: Machine Learning as a Software Engineering Enterprise
• Improving the efficiency of crowd workers
– A Future of Work for the Invisible Workers in A.I.
• This talk, however, focuses on deep-learning research related to natural language processing
The rise of pre-trained models
• Pre-trained models such as BERT and GPT have taken over
– Train on a large corpus → apply to the target task
– They achieve high performance across a wide range of tasks
[Figure: the pre-training pipeline, with translation as the target task]
1. Train a language model on a large collection of documents (e.g. "Baudolino, a new novel by Umberto Eco, is the story of a peasant boy ……" and its Japanese counterpart).
2. Train on parallel sentences (e.g. "Where is my cat?" ↔ "私の猫はどこですか?").
3. Perform translation on the target task.
Supplement: overview of pre-training methods
Masked language model (BERT-style)
• Mask part of the input and predict the masked words
• For an input sequence X, maximize the masked-word likelihood (reconstructed below)
[Figure: the words to mask are chosen at random; the input "<BOS> MASK MASK a … dream" is fed to some neural network, which predicts the masked words "I have".]
Language model (GPT-style)
• Predict the next word given the preceding context
• For an input sequence X, maximize the next-word likelihood (reconstructed below): the words up to position k-1 form the context and the k-th word is predicted
[Figure: the input "<BOS> I have a" is fed to some neural network, which predicts the continuation "I have a dream".]
• Pre-training methods are broadly grouped by their training objective
– The architecture is in most cases the Transformer [Vaswani+ 17]
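For reference, the two objectives (shown only as images on the original slide) take roughly the following standard forms, where M denotes the set of masked positions and θ the model parameters; this is a reconstruction, not the slide's exact notation.

```latex
% Masked language model (BERT-style): predict each masked word from the rest of X
\max_{\theta} \; \sum_{i \in \mathcal{M}} \log P_{\theta}\!\left(x_i \mid X_{\setminus \mathcal{M}}\right)

% Language model (GPT-style): predict the k-th word from the preceding k-1 words
\max_{\theta} \; \sum_{k} \log P_{\theta}\!\left(x_k \mid x_1, \ldots, x_{k-1}\right)
```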
[NeurIPS 2020] Language Models are Few-Shot Learners
• One of the NeurIPS 2020 best papers
• The paper introducing GPT-3
– Examines GPT-3's performance in the zero-shot setting
• Even zero-shot, it performs well on many tasks
– Performance improves with the number of examples supplied at test time
• Performance scales logarithmically with the number of parameters
– GPT-3 has 175 billion parameters
– GPT-1: Improving Language Understanding by Generative Pre-training
– GPT-2: Language Models are Unsupervised Multitask Learners
What is zero-shot?
• Zero-shot: solve the target task without ever training on it
[Figure: conventional training vs. zero-shot translation with a language model]
• Conventional training for translation: train a language model on a large collection of documents (e.g. "Baudolino, a new novel by Umberto Eco, is the story of a peasant boy ……"), then train on parallel sentences such as "Where is my cat?" ↔ "私の猫はどこですか?".
• Zero-shot translation with a language model: train the language model on the documents only, then prompt it with "English to Japanese: Where is my cat?" and let it generate "私の猫はどこですか?".
Zero-, one-, and few-shot (as used in this work)
• The settings differ in how many examples are supplied when solving the task
– None of them trains on the target task
• In common usage, all of these settings would be called zero-shot
Zero-shot (give only the task description and the input):
  English to Japanese:
  Where is my cat? ->

One-shot (give the task description, one solved example, and the input):
  English to Japanese:
  This is a pen. -> これはペンです
  Where is my cat? ->

Few-shot (give the task description, several solved examples, and the input):
  English to Japanese:
  This is a pen. -> これはペンです
  Have a good night -> 良い夜を
  From the nothing, with love -> 虚無より愛をこめて
  Where is my cat? ->

(See the sketch below for how such prompts are put together.)
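A minimal sketch of assembling these three prompt formats; the function and variable names are illustrative, not taken from the paper. The settings differ only in how many solved examples are concatenated into the input text, and no parameters are updated in any of them.

```python
def build_prompt(task_description, examples, query):
    """Concatenate a task description, optional solved examples, and the query.

    Zero-shot: examples == []        (description + query only)
    One-shot:  len(examples) == 1    (description + one example + query)
    Few-shot:  len(examples) >= 2    (description + several examples + query)
    """
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} -> {target}")
    lines.append(f"{query} ->")
    return "\n".join(lines)

# Few-shot prompt for English-to-Japanese translation, as on the slide
prompt = build_prompt(
    "English to Japanese:",
    [("This is a pen.", "これはペンです"), ("Have a good night", "良い夜を")],
    "Where is my cat?",
)
print(prompt)
```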
Performance scales logarithmically with parameter count
• Average performance across the datasets as a function of parameter count
– The more parameters, the higher the performance
• Performance ranks few-shot > one-shot > zero-shot
– The more examples seen at inference time, the higher the performance
[Figure 1.3 from Language Models are Few-Shot Learners: aggregate performance over all 42 accuracy-denominated benchmarks. Zero-shot performance improves steadily with model size, while few-shot performance increases more rapidly, showing that larger models are more proficient at in-context learning.]
Figure taken from Language Models are Few-Shot Learners.
Performance trends per task

Task: Cloze tests (e.g. "Alice was friends with Bob. Alice went to visit her friend, __." -> Bob)
GPT-3 few-shot performance: Good (on par with the existing state of the art)

Task: Translation (En-Fr, En-De, En-Ru)
GPT-3 few-shot performance: Good (at or above unsupervised translation; on par with supervised translation for some language pairs)

Task: Question answering / reading comprehension (answering questions about a document, e.g. long-passage reading comprehension from the National Center Test English exam)
GPT-3 few-shot performance: Mixed (varies by dataset)

Task: Commonsense reasoning / recognizing textual entailment (entailment: does a statement hold given a premise? e.g. "A is the author of B" → "A is a writer")
GPT-3 few-shot performance: Mixed (varies by dataset; lower than the existing state of the art)
Strategies for reducing the number of parameters
• Distillation
– Reproduce the outputs of a trained model with a smaller model
• MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
• Model compression (pruning)
– Remove unnecessary parameters from a trained model
• Unnecessary: parameters that contribute little to the predictions
• The Lottery Ticket Hypothesis for Pre-trained BERT Networks
• Movement Pruning: Adaptive Sparsity by Fine-Tuning
• Pruning neural networks without any data by iteratively conserving synaptic flow
• Lightweight model design
– Train with a small number of parameters from the start
• O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers
• All Word Embeddings from One Embedding
Distillation
• Reproduce the outputs of a trained model with a smaller model [Hinton+14]
– Illustrated here with language-model training (a sketch of the loss follows below)
[Figure: two-stage training on text such as "Baudolino, a new novel by Umberto Eco, is the story of a peasant boy ……".
1. Train the original model (the teacher) in the usual way: feed it the context and have it predict the next word (e.g. "Baudolino, a new" → "novel").
2. Train the small model (the student) on the same data, but so that it reproduces the teacher's output for the next word.]
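A minimal sketch of such a distillation objective in the spirit of [Hinton+14], assuming PyTorch; the temperature T, the mixing weight alpha, and the exact loss form are common choices rather than details from the slide.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Mix the usual next-word cross-entropy with a term that makes the
    student reproduce the teacher's output distribution (soft targets)."""
    # Standard objective: predict the correct next word.
    hard_loss = F.cross_entropy(student_logits, targets)
    # Distillation objective: match the teacher's softened distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```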
Imitating the self-attention module
• The student imitates the self-attention module of the teacher's final layer
– Minimize the KL divergence for each of the two matrices (see the sketch below)
[Figure: in both the teacher and the student, the final layer's queries Q, keys K, and values V form an attention matrix QKᵀ and a value-relation matrix VVᵀ; the student imitates the teacher's attention matrix and the matrix built from V.]
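A rough sketch of this final-layer imitation, assuming PyTorch and per-head Q, K, V tensors of shape (heads, seq_len, head_dim); the function names are illustrative and details such as aligning teacher and student heads are omitted.

```python
import torch
import torch.nn.functional as F

def relation(x, log=False):
    """Row-wise (log-)softmax of the scaled Gram matrix X X^T / sqrt(d)."""
    scores = x @ x.transpose(-1, -2) / x.size(-1) ** 0.5
    return F.log_softmax(scores, dim=-1) if log else F.softmax(scores, dim=-1)

def attention_dist(q, k, log=False):
    """Row-wise (log-)softmax of Q K^T / sqrt(d), i.e. the attention matrix."""
    scores = q @ k.transpose(-1, -2) / q.size(-1) ** 0.5
    return F.log_softmax(scores, dim=-1) if log else F.softmax(scores, dim=-1)

def self_attention_distillation_loss(t_q, t_k, t_v, s_q, s_k, s_v):
    """KL between teacher and student attention matrices plus KL between
    their value-relation matrices, both taken from the final layer."""
    attn_loss = F.kl_div(attention_dist(s_q, s_k, log=True),
                         attention_dist(t_q, t_k), reduction="batchmean")
    value_loss = F.kl_div(relation(s_v, log=True),
                          relation(t_v), reduction="batchmean")
    return attn_loss + value_loss
```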
Strategies for reducing the number of parameters
• Distillation
– Reproduce the outputs of a trained model with a smaller model
• MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
• Model compression (pruning)
– Remove unnecessary parameters from a trained model
• Unnecessary: parameters that contribute little to the predictions
• The Lottery Ticket Hypothesis for Pre-trained BERT Networks
• Movement Pruning: Adaptive Sparsity by Fine-Tuning
• Pruning neural networks without any data by iteratively conserving synaptic flow
• Lightweight model design
– Train with a small number of parameters from the start
• O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers
• All Word Embeddings from One Embedding
Results: prune on one task → apply to another task
[Figure 2 from The Lottery Ticket Hypothesis for Pre-trained BERT Networks: transferring IMP subnetworks between tasks. Each row is a source task S (the task used for pruning), each column a target task T (the task the pruned network is applied to), and each cell is TRANSFER(S, T). A dark cell means performance on par with unpruned BERT.]
Pruning through the Masked Language Model task (MLM, the task used for pre-training) yields a subnetwork that performs well across a variety of downstream tasks.
Figure taken from The Lottery Ticket Hypothesis for Pre-trained BERT Networks.
Problems with existing methods that prune before training
• Some methods compute scores once on the training data and then prune [Lee+19, Wang+ 20] (a sketch follows after the figure below)
– Each parameter is scored based on the loss on the training data
• Only scores are computed; no training is performed
– Parameters are then pruned according to their scores
[Excerpt and Figure 1 from Pruning neural networks without any data by iteratively conserving synaptic flow: pruning at initialization first scores the parameters by some metric and then masks those with the smallest scores, either globally across the network or layer-wise. Global masking empirically works better, but existing algorithms can suffer from layer-collapse: all parameters of a single weight layer are pruned while parameters remain elsewhere, which renders the network untrainable and sharply limits the achievable accuracy. Figure 1: layer-collapse leads to a sudden drop in accuracy (top-1 test accuracy as a function of the compression ratio for a pruned VGG-16 model).]
Beyond a certain amount of pruning, performance drops sharply.
Why does performance drop so suddenly? → Because all parameters of some layer end up being pruned away.
Figure taken from Pruning neural networks without any data by iteratively conserving synaptic flow.
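A minimal sketch of this single-shot, loss-based scoring (in the style of SNIP [Lee+19]), assuming a PyTorch model; the helper names are illustrative. The score |∂L/∂θ ⊙ θ| is computed once on a batch of training data, the lowest-scoring parameters are masked globally, and no training step is taken.

```python
import torch

def loss_based_scores(model, loss_fn, batch):
    """Score every parameter once from the training loss (SNIP-style):
    score = |dL/dtheta * theta|.  No parameter update is performed."""
    inputs, targets = batch
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(inputs), targets)
    grads = torch.autograd.grad(loss, params)
    return [(g * p).abs() for g, p in zip(grads, params)]

def global_prune_masks(scores, keep_ratio):
    """Keep the top `keep_ratio` fraction of parameters network-wide.
    If every score in one layer happens to be small, the whole layer is
    removed at once -- exactly the layer-collapse failure discussed above."""
    flat = torch.cat([s.flatten() for s in scores])
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    return [s >= threshold for s in scores]
```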
Proposed method: compute the scores from the parameter values
• Using the training loss makes iterative scoring impractical
– The computation over the training data would have to be repeated many times
→ Compute the scores from the parameter values alone
– Scores can then be recomputed iteratively without any training data
[Excerpt from Pruning neural networks without any data by iteratively conserving synaptic flow: synaptic saliency is a general class of gradient-based scores of the form

    S(θ) = (∂R/∂θ) ⊙ θ,

where R is a scalar function of the network output. Existing work uses the training loss L as R: with R = L the metric is equivalent (modulo sign) to (∂L/∂θ) ⊙ θ, the score used in Skeletonization [1], and closely related forms give the scores used in SNIP [13], GraSP [14], and Taylor-FO [28]. The paper proves that a positive, conservative, iterative scoring scheme with global masking achieves Maximal Critical Compression, which motivates an efficient, data-independent score. The proposed Synaptic Flow loss is

    R_SF = 1ᵀ ( ∏_{l=1}^{L} |θ^[l]| ) 1,

where |θ^[l]| is the element-wise absolute value of the parameters of the l-th layer; the resulting synaptic saliency scores (∂R_SF/∂θ) ⊙ θ are termed Synaptic Flow.]
General form of the score; existing work uses the loss on the training data as R.
The quantity used to compute the scores in this work is determined by the parameter values alone (see the sketch below).
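A rough sketch of the resulting data-free score, assuming a PyTorch feed-forward model; in the full SynFlow algorithm this scoring and pruning step is repeated over many iterations with a compression schedule, which is omitted here.

```python
import torch

def synflow_scores(model, input_shape):
    """Data-free synaptic-flow scores: take |theta|, run an all-ones input
    through the network so that R_SF = 1^T (prod_l |theta^[l]|) 1, and score
    each parameter as (dR_SF/dtheta) * theta.  No training data is needed."""
    # Temporarily replace every parameter by its absolute value.
    with torch.no_grad():
        signs = [p.sign() for p in model.parameters()]
        for p in model.parameters():
            p.abs_()

    r_sf = model(torch.ones(1, *input_shape)).sum()
    r_sf.backward()
    scores = [(p.grad * p).detach().clone() for p in model.parameters()]

    # Restore the original parameters and clear the gradients.
    with torch.no_grad():
        for p, s in zip(model.parameters(), signs):
            p.mul_(s)
            p.grad = None
    return scores
```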
Experimental results
• Experiments on multiple datasets and multiple networks
– Parameters can be removed while maintaining performance
[Figure 6 and accompanying text from Pruning neural networks without any data by iteratively conserving synaptic flow: SynFlow (the proposed method) is benchmarked against random pruning, magnitude pruning, SNIP [13], and GraSP [14] on 12 combinations of architectures (VGG-11, VGG-16, ResNet-18, WideResNet-18) and datasets (CIFAR-10, CIFAR-100, Tiny ImageNet) over an exponential sweep of compression ratios. SynFlow consistently outperforms the other algorithms in the high-compression regime and is more stable, while remaining competitive at low compression, where SNIP and GraSP can partially outperform it but suffer from layer-collapse.]
Figure taken from Pruning neural networks without any data by iteratively conserving synaptic flow.
Strategies for reducing the number of parameters
• Distillation
– Reproduce the outputs of a trained model with a smaller model
• MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
• Model compression (pruning)
– Remove unnecessary parameters from a trained model
• Unnecessary: parameters that contribute little to the predictions
• The Lottery Ticket Hypothesis for Pre-trained BERT Networks
• Movement Pruning: Adaptive Sparsity by Fine-Tuning
• Pruning neural networks without any data by iteratively conserving synaptic flow
• Lightweight model design
– Train with a small number of parameters from the start
• O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers
• All Word Embeddings from One Embedding
Sparse attention [Child+19] and its expressive power
• Restrict which positions attention is computed over
– Two patterns are proposed: strided and fixed
– Restricting the targets → lower computational cost (see the sketch after the figure)
• If, once layers are stacked, every pair of words is (indirectly) connected by attention, the model is as expressive as full attention
– In the example below, two layers give an attention path x1 → x3 → x5
[Figure: attention patterns over the sequence x1 … x5]
• Full attention: computed between all pairs of words.
• Strided attention: each word attends to the preceding N words (N = 3 in the figure).
• Fixed attention: each word attends to the words within its block of N positions (N = 3 in the figure).
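A minimal sketch of the two restricted patterns as described on this slide (simplified relative to the exact patterns in [Child+19]); `strided_mask` lets each position attend to the previous N positions, `fixed_mask` to earlier positions within the same block of N.

```python
import torch

def strided_mask(seq_len, n):
    """Each position attends only to itself and the previous n-1 positions."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - n)

def fixed_mask(seq_len, n):
    """Each position attends only to earlier positions in its own block of size n."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (i // n == j // n)

# With seq_len = 5 and n = 3, these reproduce the two sparse patterns in the figure.
print(strided_mask(5, 3).int())
print(fixed_mask(5, 3).int())
```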
Constructing the random vector m_w
• Combine M random vectors
• It is very unlikely that two words end up with exactly the same combination
– The collision probability is 1 - exp(-V² / (2cᴹ))
– For translation (c = 64, M = 8, V = 37K), for example, this is about 1.0 × 10⁻⁶
[Figure: each word is randomly assigned one column vector from each of the M random matrices (each of shape D_o × c), and the chosen vectors are combined to form m_w.]
Because m_w is built as a combination of random vectors, only D_o × c × M values are needed, compared with D_e × V for a conventional embedding matrix (a sketch follows below).
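A rough sketch of this construction and of the collision estimate, assuming NumPy; the additive combination follows the "+" signs in the slide's figure, and all names (build_mw, codes, etc.) are illustrative rather than taken from the paper.

```python
import math
import numpy as np

def build_mw(vocab_size, d_o, c, M, seed=0):
    """Assign each word a combination of M random column vectors and sum them.

    Only the M random matrices of shape (d_o, c) need to be stored (or
    regenerated from the seed), so the cost is d_o * c * M instead of d_e * V.
    """
    rng = np.random.default_rng(seed)
    random_mats = rng.standard_normal((M, d_o, c))       # the M random matrices
    codes = rng.integers(0, c, size=(vocab_size, M))     # one column index per slot
    m_w = np.zeros((vocab_size, d_o))
    for slot in range(M):
        m_w += random_mats[slot][:, codes[:, slot]].T    # add the chosen column
    return m_w

def collision_probability(vocab_size, c, M):
    """Birthday-problem estimate that two words share the exact same combination."""
    return 1.0 - math.exp(-vocab_size ** 2 / (2 * c ** M))

print(collision_probability(37_000, 64, 8))   # on the order of 1e-6
```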
References
• Takase & Kobayashi 20: All Word Embeddings from One Embedding
• Brown+ 20: Language Models are Few-Shot Learners
• Vaswani+ 17: Attention Is All You Need
• Radford+ 18: Improving Language Understanding by Generative Pre-Training
• Radford+ 19: Language Models are Unsupervised Multitask Learners
• Wang+ 20: MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
• Frankle+ 19: The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
• Sanh+ 20: Movement Pruning: Adaptive Sparsity by Fine-Tuning
• Tanaka+ 20: Pruning neural networks without any data by iteratively conserving synaptic flow
• Yun+ 20: O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers
• Hinton+ 14: Distilling the Knowledge in a Neural Network
• Chen+ 20: The Lottery Ticket Hypothesis for Pre-trained BERT Networks
• Lee+ 19: SNIP: Single-shot Network Pruning based on Connection Sensitivity
• Wang+ 20: Picking Winning Tickets Before Training by Preserving Gradient Flow
• Child+ 19: Generating Long Sequences with Sparse Transformers
• Mehta+ 20: DeFINE: DEep Factorized INput Token Embeddings for Neural Sequence Modeling
• Lan+ 20: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations