FASPell: A fast, adaptable, simple, powerful Chinese spell checker based on DAE-decoder paradigm
Yuzhong Hong, Xianguo Yu, Neng He, Nan Liu, Junhui Liu
Intelligent Platform Division, iQIYI, Inc.
{hongyuzhong, yuxianguo, heneng, liunan, liujunhui}@qiyi.com
chines are different from those for humans. Therefore, in practice, it is very likely that the correct candidates for substitution do not exist in a given confusion set, which harms recall. Also, considering more similar characters to preserve recall will risk lowering precision.

2. insufficiency in utilizing character similarity. Since a cut-off threshold of quantified character similarity (Liu et al., 2010; Wang et al., 2018) is used to produce the confusion set, similar characters are actually treated indiscriminately in terms of their similarity. This means the information of character similarity is not sufficiently utilized. To compensate for this, Zhang et al. (2015) propose a spell checker that has to consider many less salient features such as word segmentation, which add more unnecessary noise to their model.

1.2 Motivation and contributions

The motivation of this paper is to circumvent the two bottlenecks in subsection 1.1 by changing the paradigm for Chinese spell checking.

As a major contribution, and as exemplified by our proposed Chinese spell checking model in Figure 1, the most general form of the new paradigm consists of a denoising autoencoder[2] (DAE) and a decoder. To prove that it is indeed a novel contribution, we compare it with two similar paradigms and show their differences as follows:

1. Similar to the old paradigm used in previous Chinese spell checking models, a model under the DAE-decoder paradigm also produces candidates (by the DAE) and then filters the candidates (by the decoder). However, candidates are produced on the fly based on contexts. If the DAE is powerful enough, we should expect all contextually suitable candidates to be recalled, which prevents the inflexibility issue caused by using a confusion set. The DAE also prevents the overfitting issue because it can be trained unsupervisedly on a large number of natural texts. Moreover, character similarity can be used by the decoder without losing any information.

2. The DAE-decoder paradigm is sequence-to-sequence, which makes it resemble the encoder-decoder paradigm in tasks like machine translation, grammar checking, etc. However, in the encoder-decoder paradigm, the encoder extracts semantic information and the decoder generates texts that embody that information. In contrast, in the DAE-decoder paradigm, the DAE provides candidates to reconstruct texts from the corrupted ones based on contextual features, and the decoder[3] selects the best candidates by incorporating other features (a minimal code sketch of this two-stage pipeline is given below, after subsection 1.3).

Besides the new paradigm per se, there are two additional contributions in our proposed Chinese spell checking model:

• we propose a more precise quantification method of character similarity than the ones proposed by Liu et al. (2010) and Wang et al. (2018) (see subsection 2.2);

• we propose an empirically effective decoder to filter candidates under the principle of getting the highest possible precision with minimal harm to recall (see subsection 2.3).

1.3 Achievements

Thanks to our contributions mentioned in subsection 1.2, our model can be characterized by the following achievements relative to previous state-of-the-art models, and is thus named FASPell:

• Our model is Fast. It is shown (subsection 3.3) to be faster in filtering than previous state-of-the-art models, either in terms of absolute time consumption or in terms of time complexity.

• Our model is Adaptable. To demonstrate this, we test it on texts from different scenarios – texts by humans, such as learners of Chinese as a Foreign Language (CFL), and texts by machines, such as Optical Character Recognition (OCR) output. It can also be applied to both simplified Chinese and traditional Chinese, despite the challenging issue that some erroneous usages of characters in traditional texts are considered valid usages in simplified texts (see Table 1). To the best of our knowledge, all previous state-of-the-art models focus only on human errors in traditional Chinese texts.

[2] The term denoising autoencoder follows the same sense used by Yang et al. (2019), which is arguably more general than the one used by Vincent et al. (2008).

[3] The term decoder here is analogous to a Viterbi decoder, in the sense of finding the best path along candidates.
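As referenced above, the following minimal sketch (our own illustration in Python, not the released FASPell implementation; all names are placeholders) shows the shape of the two-stage DAE-decoder pipeline: a denoising autoencoder such as a masked language model proposes candidates for every character on the fly, and a decoder then selects substitutions by combining contextual confidence with other features such as character similarity.

# Minimal sketch of the DAE-decoder paradigm (illustrative placeholders only).
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (candidate character, contextual confidence)


def dae_decoder_spell_check(
    sentence: str,
    dae: Callable[[str], List[List[Candidate]]],
    decoder: Callable[[str, List[Candidate]], str],
) -> str:
    """The DAE proposes candidates on the fly; the decoder picks substitutions."""
    # Stage 1 (DAE): for each character, candidates conditioned on its context,
    # e.g. the top-c predictions of a masked language model.
    candidates_per_char = dae(sentence)
    # Stage 2 (decoder): choose at most one substitution per character, combining
    # contextual confidence with other features (e.g. character similarity).
    return "".join(
        decoder(orig_char, candidates)
        for orig_char, candidates in zip(sentence, candidates_per_char)
    )


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    toy_dae = lambda s: [[(ch, 1.0)] for ch in s]   # proposes the input itself
    toy_decoder = lambda orig, cands: cands[0][0]   # keeps the top candidate
    print(dae_decoder_spell_check("国际新闻", toy_dae, toy_decoder))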
Table 1: Examples on the left are considered valid usages in simplified Chinese (SC) but not in traditional Chinese (TC); the notes describe the TC usage.

SC Examples | Notes on TC usage
周末 (weekend): 周 → 週 | 周 only in 周到, etc.
1. given a sentence, we mask the erroneous tokens with themselves and set their target labels as their corresponding correct characters;

2. to prevent overfitting, we also mask tokens that are not erroneous with themselves and set their target labels as themselves, too.

The two types of training examples are balanced to have roughly similar quantity (a code sketch of this construction is given after Figure 2).

Fine-tuning a pre-trained MLM has proven to be very effective in many downstream tasks (Devlin et al., 2018; Yang et al., 2019; Song et al., 2019), so one might argue that this is where the power of FASPell mainly comes from. However, we would like to emphasize that the power of FASPell should not be attributed solely to the MLM. In fact, we show in our ablation studies (subsection 3.2) that the MLM by itself can only serve as a very weak Chinese spell checker (its performance can be as poor as an F1 of only 28.9%), and that the decoder utilizing character similarity (see subsections 2.2 and 2.3) is also indispensable to producing a strong Chinese spell checker.

贫 : ⿱⿱⿰丿乁⿹𠃌丿⿵⿰丨𠃌⿰丿乁
--------------------------------------------------
[Figure 2 also shows tree forms ①-③ of this IDS at different granularity levels.]

Figure 2: The IDS of a character can be given at different granularity levels, as shown in the tree forms ①-③ for the simplified character 贫 (meaning poor). In FASPell, we only use the stroke-level IDS in the form of a string, like the one above the dashed ruling line. Unlike using only actual strokes (Wang et al., 2018), the Unicode standard Ideographic Description Characters (e.g., the non-leaf nodes in the trees) describe the layout of a character. They help us to model the subtle nuances between different characters that are composed of identical strokes (see examples in Table 2). Therefore, IDS gives us a more precise shape representation of a character.
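As a concrete illustration of the two kinds of fine-tuning examples listed above (a sketch under assumed names and toy data; the actual fine-tuning follows the standard masked-LM recipe of Devlin et al., 2018), the code below builds (position, input character, target label) triples: erroneous characters stay in the input but are labeled with their corrections, and a balanced sample of error-free characters is labeled with itself.

import random
from typing import List, Tuple


def build_finetuning_examples(
    wrong_sent: str, correct_sent: str, seed: int = 0
) -> List[Tuple[int, str, str]]:
    """Return (position, input_char, target_label) triples for MLM fine-tuning."""
    assert len(wrong_sent) == len(correct_sent)
    error_pos = [i for i, (w, c) in enumerate(zip(wrong_sent, correct_sent)) if w != c]
    correct_pos = [i for i in range(len(wrong_sent)) if i not in error_pos]

    # Type 1: erroneous tokens masked with themselves, labeled with the correction.
    examples = [(i, wrong_sent[i], correct_sent[i]) for i in error_pos]

    # Type 2: a roughly equal number of error-free tokens labeled with themselves,
    # to prevent overfitting.
    random.seed(seed)
    sampled = random.sample(correct_pos, min(len(error_pos), len(correct_pos)))
    examples += [(i, wrong_sent[i], wrong_sent[i]) for i in sampled]
    return examples


# Hypothetical sentence pair with a single erroneous character.
print(build_finetuning_examples("因际新闻", "国际新闻"))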
Table 2: Examples of the computation of character similarities. IDS is used to compute visual similarity (V-sim), and pronunciation representations in Mandarin Chinese (MC), Cantonese Chinese (CC), Japanese On'yomi (JO), Korean (K) and Vietnamese (V) are used to compute phonological similarity (P-sim). Note that the normalization of edit distance gives us the desired property that the less complex character pair (午, 牛) has a smaller visual similarity than the more complex pair (田, 由), even though both of their IDS edit distances are 1. Also, note that 午 and 牛 have more similar pronunciations in some languages than in others; combining the pronunciations in multiple languages gives us a more continuous phonological similarity.
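As a sketch of the similarity computations exemplified in Table 2 (pronunciation handling is detailed in subsection 2.2.2 below): visual similarity is one minus a normalized Levenshtein edit distance between stroke-level IDS strings, and phonological similarity is the mean of the same quantity over the pronunciation representations available in the CJK languages. Normalizing by the length of the longer string, and the Mandarin-only toy input, are our assumptions for illustration.

from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """1 - edit distance / length of the longer string (normalization assumed)."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)


def visual_similarity(ids_a: str, ids_b: str) -> float:
    """Similarity of two stroke-level IDS strings."""
    return normalized_similarity(ids_a, ids_b)


def phonological_similarity(pron_a: Dict[str, str], pron_b: Dict[str, str]) -> float:
    """Mean similarity over the CJK languages available for both characters."""
    shared: List[str] = [lang for lang in pron_a if lang in pron_b]
    if not shared:
        return 0.0
    return sum(normalized_similarity(pron_a[l], pron_b[l]) for l in shared) / len(shared)


# Toy Mandarin-only pronunciations for 午 and 牛 (illustrative values).
print(phonological_similarity({"MC": "wu3"}, {"MC": "niu2"}))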
2.2.2 Phonological similarity

Different Chinese characters sharing identical pronunciations is very common (Yang et al., 2012), and this is the case for any CJK language. Thus, if we were to use character pronunciations in only one CJK language, the phonological similarity of character pairs would be limited to a few discrete values. However, a more continuous phonological similarity is preferred because it can make the curve used for filtering candidates smoother (see subsection 2.3).

Therefore, we utilize character pronunciations of all CJK languages (see examples in Table 2), which are provided by the Unihan Database. To compute the phonological similarity of two characters, we first calculate one minus the normalized Levenshtein edit distance between their pronunciation representations in all CJK languages (where applicable). Then, we take the mean of the results. Hence, the similarity ranges from 0 to 1.

2.3 Confidence-Similarity Decoder

Candidate filters in many previous models are based on setting various thresholds and weights for multiple features of candidate characters. Instead of this naive approach, we propose a method that is empirically effective under the principle of getting the highest possible precision with minimal harm to recall. Since the decoder utilizes contextual confidence and character similarity, we refer to it as the confidence-similarity decoder (CSD). The mechanism of the CSD is explained, and its effectiveness justified, as follows.

First, consider the simplest case where only one candidate character is provided for each original character. For those candidates that are the same as their original characters, we do not substitute the original characters. For those that are different, we can draw a confidence-similarity scatter graph. If we compare the candidates with the ground truths, the graph will resemble plot ① of Figure 3. We can observe that true-detection-and-true-correction candidates are denser toward the upper-right corner, false-detection candidates toward the lower-left corner, and true-detection-and-false-correction candidates in the middle area. If we draw a curve to filter out false-detection candidates (plot ② of Figure 3) and use the rest as substitutions, we can optimize character-level precision with minimal harm to character-level recall for detection; if true-detection-and-false-correction candidates are also filtered out (plot ③ of Figure 3), we get the same effect for correction. In FASPell, we optimize correction performance and manually find the filtering curve using a training set, assuming its consistency with the corresponding testing set. In practice, we have to find two curves – one for each type of similarity – and then take the union of the filtering results.

Now, consider the case where there are c > 1 candidates. To reduce it to the previously described simplest case, we rank the candidates for each original character according to their contextual confidence and put candidates that have the same rank into the same group (i.e., c groups in total). Thus, we can find a filter as previously described for each group of candidates. All c filters combined further alleviate the harm to recall because more candidates are taken into account.

In the example of Figure 1, there are c = 4 groups of candidates. We get a correct substitution 丰 → 主 from the group whose rank = 1, another one, 苦 → 著, from the group whose rank = 2, and no more from the other two groups.
[Figure 3: four scatter plots ①-④ of Similarity (y-axis, 0 to 1) against Confidence (x-axis, 0 to 1); markers distinguish T-d&T-c and filtered-out candidates.]
Figure 3: All four plots show the same confidence-similarity graph of candidates, categorized as true-detection-and-true-correction (T-d&T-c), true-detection-and-false-correction (T-d&F-c) and false-detection (F-d). Each plot shows a different way of filtering candidates: in plot ①, no candidates are filtered; in plot ②, the filtering optimizes detection performance; in plot ③, as adopted in FASPell, the filtering optimizes correction performance; in plot ④, as adopted by previous models, candidates are filtered out by setting a threshold for weighted confidence and similarity (0.8 × confidence + 0.2 × similarity < 0.8 as an example in the plot). Note that the four plots use the actual first-rank candidates (using visual similarity) for our OCR data (Trnocr), except that we randomly sampled only 30% of the candidates to make the plots more viewable on paper.
We first describe the data, metrics and model configurations adopted in our experiments in subsection 3.1. Then, in subsection 3.2, we show the performance on spell checking texts written by humans to compare FASPell with previous state-of-the-art models; we also show the performance on data harvested from OCR results to prove the adaptability of the model. In subsection 3.3, we compare the speed of FASPell and three state-of-the-art models. In subsection 3.4, we investigate how hyper-parameters affect the performance of FASPell.

3.1 Data, metrics and configurations

We adopt the benchmark datasets (all in traditional Chinese) and the sentence-level[7] accuracy, precision, recall and F1 given by the SIGHAN13-15 shared tasks on Chinese spell checking (Wu et al., 2013; Yu et al., 2014; Tseng et al., 2015). We also harvested 4575 sentences (4516 of which are simplified Chinese) from OCR results of Chinese subtitles in videos, using the OCR method by Shi et al. (2017). Detailed data statistics are given in Table 3.

Table 3: Detailed data statistics.

Dataset | # erroneous sent | # sent | Avg. length
Trn13 | 350 | 700 | 41.8
Trn14 | 3432 | 3435 | 49.6
Trn15 | 2339 | 2339 | 31.3
Tst13 | 996 | 1000 | 74.3
Tst14 | 529 | 1062 | 50.0
Tst15 | 550 | 1100 | 30.6
Trnocr | 3575 | 3575 | 10.1
Tstocr | 1000 | 1000 | 10.2

[7] Note that although we do not use character-level metrics (Fung et al., 2017) in evaluation, they are actually important in the justification of the effectiveness of the CSD, as in subsection 2.3.
We use the pre-trained masked language model[8] provided by Devlin et al. (2018). Settings of its hyper-parameters and pre-training are available at https://github.com/google-research/bert. Other configurations of FASPell used in our major experiments (subsections 3.2-3.3) are given in Table 4. For the ablation experiments, the same configurations are used, except that when the CSD is removed, we take the candidates ranked first as the default outputs. Note that we do not fine-tune the masked language model for the OCR data because we learned in preliminary experiments that fine-tuning worsens performance for this type of data[9].

Table 4: Configurations of FASPell. FT means the training set for fine-tuning; CSD means the training set for the CSD; r means the number of rounds and c means the number of candidates for each character. U is the union of all the spell checking data from SIGHAN13-15.

FT | CSD | Test set | r | c | FT steps
U − Tst13 | Trn13 | Tst13 | 1 | 4 | 10k
U − Tst14 | Trn14 | Tst14 | 3 | 4 | 10k
U − Tst15 | Trn15 | Tst15 | 3 | 4 | 10k
(-) | Trnocr | Tstocr | 2 | 4 | (-)

3.2 Performance

As shown in Table 6, FASPell achieves state-of-the-art F1 performance on both the detection level and the correction level. It is better in precision than the model by Wang et al. (2018) and better in recall than the model by Zhang et al. (2015). In comparison with Zhao et al. (2017), it is better by every metric. It also reaches comparable precision on OCR data. The lower recall on OCR data is partially because many OCR errors are harder to correct, even for humans (Wang et al., 2018).

Table 6 also shows that all the components of FASPell contribute effectively to its good performance. FASPell without both fine-tuning and the CSD is essentially the pre-trained masked language model. Fine-tuning it improves recall because FASPell can learn about common errors and how they are corrected. The CSD improves precision with minimal harm to recall because this is the underlying principle of the design of the CSD.

3.3 Filtering Speed[10]

First, we measure the filtering speed of Chinese spell checking in terms of absolute time consumption per sentence (see Table 5). We compare the speed of FASPell with the model by Wang et al. (2018) in this manner because they have reported their absolute time consumption[11]. Table 5 clearly shows that FASPell is much faster.

Table 5: Speed comparison (ms/sent). Note that the speed of FASPell is the average over several rounds.

Test set | FASPell | Wang et al. (2018)
Tst13 | 446 | 680
Tst14 | 284 | 745
Tst15 | 177 | 566

Second, to compare FASPell with models (Zhang et al., 2015; Zhao et al., 2017) whose absolute time consumption has not been reported, we analyze the time complexity. The time complexity of FASPell is O(scmn + sc log c), where s is the sentence length, c is the number of candidates, mn accounts for computing edit distance and c log c for ranking candidates. Zhang et al. (2015) use more features than just edit distance, so the time complexity of their model has additional factors. Moreover, since we do not use a confusion set, the number of candidates for each character in their model is practically larger than ours: x × 10 vs. 4. Thus, FASPell is faster than their model. Zhao et al. (2017) filter candidates by finding the single-source shortest path (SSSP) in a directed graph consisting of all candidates for every token in a sentence. The algorithm they use has a time complexity of O(|V| + |E|), where |V| is the number of vertices and |E| is the number of edges in the graph (Eppstein, 1998). Translated in terms of s and c, the time complexity of their model is O(sc + c^s). This implies that their model is exponentially slower than FASPell for long sentences.

[8] https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip

[9] This is probably because OCR errors are subject to random noise in source pictures rather than to learnable patterns as in human errors. However, since this paper is not about OCR, we do not elaborate on this here.

[10] We consider only the filtering speed because the Transformer, the Bi-LSTM and the language models used before filtering, by previous state-of-the-art models or by us, are already well studied in the literature.

[11] We have no access to the 4-core Intel Core i5-7500 CPU used by Wang et al. (2018). To minimize the difference in speed caused by hardware, we only use 4 cores of a 12-core Intel(R) Xeon(R) CPU E5-2650 in our experiments.
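To relate the stated complexity to the computation, the schematic loop below (not the released code; all names are illustrative) marks where each term of O(scmn + sc log c) arises: for each of the s characters, ranking its c candidates costs O(c log c), and scoring each candidate requires an edit distance between similarity representations (e.g. IDS strings) of lengths m and n, costing O(mn).

from typing import Callable, List, Tuple


def filter_costs(
    originals: str,                                 # s characters
    candidates: List[List[Tuple[str, float]]],      # c candidates per character
    repr_of: Callable[[str], str],                  # char -> IDS or pronunciation string
    edit_distance: Callable[[str, str], int],
) -> List[List[Tuple[str, float, float]]]:
    """Schematic filtering loop annotated with its cost terms."""
    scored_groups = []
    for orig, cands in zip(originals, candidates):            # s iterations
        cands = sorted(cands, key=lambda x: -x[1])            # O(c log c) ranking
        scored = []
        for cand, conf in cands:                              # c iterations
            a, b = repr_of(orig), repr_of(cand)
            d = edit_distance(a, b)                           # O(mn) per pair
            sim = 1.0 - d / max(len(a), len(b), 1)
            scored.append((cand, conf, sim))
        scored_groups.append(scored)
    # Total: O(s * c * m * n) for edit distances plus O(s * c log c) for ranking.
    return scored_groups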
Table 6: This table shows spell checking performances on both detection and correction level. Our model –
FASPell achieves similar performance to that of previous state-of-the-art models. Note that fine-tuning and CSD
both contribute effectively to its performance according to the results of ablation experiments. (− FT means
removing fine-tuning; − CSD means removing CSD.)
3.4 Exploring hyper-parameters

First, we change only the number of candidates in Table 4 to see its effect on spell checking performance. As illustrated in Figure 4, when more candidates are taken into account, additional detections and corrections are recalled while maximizing precision. Thus, increasing the number of candidates always results in an improvement of F1. The reason we set the number of candidates to c = 4 in Table 4 and no larger is the trade-off with time consumption.

Second, we do the same for the number of rounds of spell checking in Table 4. We can observe in Figure 4 that the correction performance [...] number of rounds is 3. For Tst13 and Tstocr, that [...] can achieve high precision in detection in each [...] detected and corrected in the next round without [...]

[Figure 4: eight plots, one column per test set (Tst13, Tst14, Tst15, Tstocr).]

Figure 4: The four plots in the first row show how the number of candidates for each character affects F1 performance. The four plots in the second row show the impact of the number of rounds of spell checking.

4 Conclusion

We propose a Chinese spell checker – FASPell – that reaches state-of-the-art performance. It is based on the DAE-decoder paradigm, which requires only a small amount of spell checking data and gives up the troublesome notion of a confusion set. With FASPell as an example, each component of the paradigm is shown to be effective. We make our code and data publicly available at https://github.com/iqiyi/FASPell.

Future work may include studying whether the DAE-decoder paradigm can be used to detect and correct grammatical errors or other less frequently studied types of Chinese spelling errors, such as dialectical colloquialism (Fung et al., 2017) and insertion/deletion errors.
Acknowledgments

The authors would like to thank the anonymous reviewers for their comments. We also thank our colleagues from the IT Infrastructure team of iQIYI, Inc. for the hardware support. Special thanks go to Prof. Yves Lepage from the Graduate School of IPS, Waseda University, for his insightful advice about the paper.

References

Chao-Huang Chang. 1995. A new approach for automatic Chinese spelling correction. In Proceedings of Natural Language Processing Pacific Rim Symposium, volume 95, pages 278–283. Citeseer.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

David Eppstein. 1998. Finding the k shortest paths. SIAM Journal on Computing, 28(2):652–673.

Gabriel Fung, Maxime Debosschere, Dingmin Wang, Bo Li, Jia Zhu, and Kam-Fai Wong. 2017. NLPTEA 2017 shared task – Chinese spelling check. In Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017), pages 29–34, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Chao-Lin Liu, Min-Hua Lai, Yi-Hsuan Chuang, and Chia-Ying Lee. 2010. Visually and phonologically similar characters in incorrect simplified Chinese words. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 739–747, Beijing, China. Association for Computational Linguistics.

Mateusz Pawlik and Nikolaus Augsten. 2015. Efficient computation of the tree edit distance. ACM Transactions on Database Systems, 40(1):3:1–3:40.

Mateusz Pawlik and Nikolaus Augsten. 2016. Tree edit distance: Robust and memory-efficient. Information Systems, 56:157–173.

Baoguang Shi, Xiang Bai, and Cong Yao. 2017. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11):2298–2304.

DS Shih et al. 1992. A statistical method for locating typo in Chinese sentences. CCL Research Journal, pages 19–26.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.

Xiang Tong and David A. Evans. 1996. A statistical approach to automatic OCR error correction in context. In Fourth Workshop on Very Large Corpora.

Yuen-Hsien Tseng, Lung-Hao Lee, Li-Ping Chang, and Hsin-Hsi Chen. 2015. Introduction to SIGHAN 2015 bake-off for Chinese spelling check. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, pages 32–37.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM.

Dingmin Wang, Yan Song, Jing Li, Jialong Han, and Haisong Zhang. 2018. A hybrid approach to automatic corpus generation for Chinese spelling check. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2517–2527, Brussels, Belgium. Association for Computational Linguistics.

Shih-Hung Wu, Chao-Lin Liu, and Lung-Hao Lee. 2013. Chinese spelling check evaluation at SIGHAN bake-off 2013. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 35–42, Nagoya, Japan. Asian Federation of Natural Language Processing.

Shaohua Yang, Hai Zhao, Xiaolin Wang, and Bao-Liang Lu. 2012. Spell checking for Chinese. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Jui-Feng Yeh, Sheng-Feng Li, Mei-Rong Wu, Wen-Yi Chen, and Mao-Chuan Su. 2013. Chinese word spelling correction based on n-gram ranked inverted index list. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 43–48, Nagoya, Japan. Asian Federation of Natural Language Processing.

Junjie Yu and Zhenghua Li. 2014. Chinese spelling error detection and correction based on language model, pronunciation, and shape. In Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 220–223.
Liang-Chih Yu, Lung-Hao Lee, Yuen-Hsien Tseng, and
Hsin-Hsi Chen. 2014. Overview of SIGHAN 2014
bake-off for Chinese spelling check. In Proceed-
ings of The Third CIPS-SIGHAN Joint Conference
on Chinese Language Processing, pages 126–132,
Wuhan, China. Association for Computational Lin-
guistics.
Shuiyuan Zhang, Jinhua Xiong, Jianpeng Hou, Qiao
Zhang, and Xueqi Cheng. 2015. HANSpeller++: A
unified framework for Chinese spelling correction.
In Proceedings of the Eighth SIGHAN Workshop on
Chinese Language Processing, pages 38–45, Bei-
jing, China. Association for Computational Linguis-
tics.