
FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker

Based On DAE-Decoder Paradigm

Yuzhong Hong, Xianguo Yu, Neng He, Nan Liu, Junhui Liu
Intelligent Platform Division, iQIYI, Inc.
{hongyuzhong, yuxianguo, heneng, liunan, liujunhui}@qiyi.com

Abstract

We propose a Chinese spell checker – FASPell – based on a new paradigm which consists of a denoising autoencoder (DAE) and a decoder. In comparison with previous state-of-the-art models, the new paradigm allows our spell checker to be Faster in computation, readily Adaptable to both simplified and traditional Chinese texts produced by either humans or machines, and to require a much Simpler structure while being as Powerful in both error detection and correction. These four achievements are made possible because the new paradigm circumvents two bottlenecks. First, the DAE curtails the amount of Chinese spell checking data needed for supervised learning (to <10k sentences) by leveraging the power of an unsupervisedly pre-trained masked language model as in BERT, XLNet, MASS, etc. Second, the decoder helps to eliminate the use of a confusion set, which is deficient in flexibility and in sufficiently utilizing the salient feature of Chinese character similarity.

1 Introduction

There has been a long line of research on detecting and correcting spelling errors in Chinese texts since some trailblazing work in the early 1990s (Shih et al., 1992; Chang, 1995). However, despite the spelling errors being reduced to substitution errors in most research¹ and the efforts of multiple recent shared tasks (Wu et al., 2013; Yu et al., 2014; Tseng et al., 2015; Fung et al., 2017), Chinese spell checking remains a difficult task. Moreover, methods for languages like English can hardly be used directly for Chinese, because Chinese has no delimiters between words and lacks morphological variation, which makes the syntactic and semantic interpretation of any Chinese character highly dependent on its context.

¹ Likewise, this paper only covers substitution errors.

1.1 Related work and bottlenecks

Almost all previous Chinese spell checking models deploy a common paradigm where a fixed set of similar characters for each Chinese character (called a confusion set) is used as candidates, and a filter selects the best candidates as substitutions for a given sentence. This naive design is subject to two major bottlenecks, whose negative impact has not been successfully mitigated:

• Overfitting to under-resourced Chinese spell checking data. Since Chinese spell checking data require tedious professional manual work, they have always been under-resourced. To prevent the filter from overfitting, Wang et al. (2018) propose an automatic method to generate pseudo spell checking data. However, the precision of their spell checking model ceases to improve when the generated data reach 40k sentences. Zhao et al. (2017) use an extensive amount of ad hoc linguistic rules to filter candidates, only to achieve worse performance than ours, even though our model does not leverage any linguistic knowledge.

• Inflexibility and insufficiency of the confusion set in utilizing character similarity. The feature of Chinese character similarity is very salient, as it is related to the main cause of spelling errors (see subsection 2.2). However, the idea of a confusion set is troublesome in utilizing it:

  1. Inflexibility to address the issue that confusing characters in one scenario may not be confusing in another. The difference between simplified and traditional Chinese shown in Table 1 is an example. Wang et al. (2018) also suggest that confusing characters for machines are different from those for humans. Therefore, in practice, it is very likely that the correct candidates for substitution do not exist in a given confusion set, which harms recall. Also, considering more similar characters to preserve recall will risk lowering precision.

  2. Insufficiency in utilizing character similarity. Since a cut-off threshold on quantified character similarity (Liu et al., 2010; Wang et al., 2018) is used to produce the confusion set, similar characters are actually treated indiscriminately in terms of their similarity. This means the information of character similarity is not sufficiently utilized. To compensate for this, Zhang et al. (2015) propose a spell checker that has to consider many less salient features such as word segmentation, which adds more unnecessary noise to their model.

1.2 Motivation and contributions

The motivation of this paper is to circumvent the two bottlenecks in subsection 1.1 by changing the paradigm for Chinese spell checking.

As a major contribution, and as exemplified by our proposed Chinese spell checking model in Figure 1, the most general form of the new paradigm consists of a denoising autoencoder² (DAE) and a decoder. To show that it is indeed a novel contribution, we compare it with two similar paradigms and point out the differences:

1. Similar to the old paradigm used in previous Chinese spell checking models, a model under the DAE-decoder paradigm also produces candidates (by the DAE) and then filters the candidates (by the decoder). However, the candidates are produced on the fly based on context. If the DAE is powerful enough, we can expect all contextually suitable candidates to be recalled, which prevents the inflexibility issue caused by using a confusion set. The DAE also prevents the overfitting issue because it can be trained unsupervisedly on a large number of natural texts. Moreover, character similarity can be used by the decoder without losing any information.

2. The DAE-decoder paradigm is sequence-to-sequence, which makes it resemble the encoder-decoder paradigm in tasks like machine translation, grammar checking, etc. However, in the encoder-decoder paradigm, the encoder extracts semantic information and the decoder generates texts that embody that information. In contrast, in the DAE-decoder paradigm, the DAE provides candidates to reconstruct texts from the corrupted ones based on contextual features, and the decoder³ selects the best candidates by incorporating other features (a minimal code sketch of this division of labour is given after the list of contributions below).

Besides the new paradigm per se, there are two additional contributions in our proposed Chinese spell checking model:

• we propose a more precise quantification method of character similarity than the ones proposed by Liu et al. (2010) and Wang et al. (2018) (see subsection 2.2);

• we propose an empirically effective decoder to filter candidates under the principle of getting the highest possible precision with minimal harm to recall (see subsection 2.3).
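To make the division of labour between the two components concrete, the following is a minimal sketch of the DAE-decoder loop in Python. It is an illustration of the paradigm, not the authors' implementation; the interfaces (a DAE returning ranked candidates with confidences per character, and a decoder picking one substitution per character) are assumptions made for the sake of the example.

```python
# Illustrative sketch of the DAE-decoder paradigm (not the released FASPell code).
# A DAE proposes ranked candidates with contextual confidences for every character;
# a decoder chooses the substitution using additional features such as similarity.
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[str, float]                         # (character, contextual confidence)
DAE = Callable[[str], List[List[Candidate]]]          # sentence -> candidates per position
Decoder = Callable[[str, Sequence[Candidate]], str]   # (original char, candidates) -> chosen char


def spell_check(sentence: str, dae: DAE, decoder: Decoder) -> str:
    """One pass of the DAE-decoder loop over a sentence."""
    candidates_per_char = dae(sentence)
    return "".join(decoder(orig, cands)
                   for orig, cands in zip(sentence, candidates_per_char))


# Toy usage with trivial stand-ins for both components:
toy_dae: DAE = lambda s: [[(ch, 1.0)] for ch in s]       # proposes each input character itself
toy_decoder: Decoder = lambda orig, cands: cands[0][0]   # keeps the top-ranked candidate
print(spell_check("国际电台苦名丰持人", toy_dae, toy_decoder))
```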
1.3 Achievements

Thanks to the contributions mentioned in subsection 1.2, our model can be characterized by the following achievements relative to previous state-of-the-art models, and is thus named FASPell.

• Our model is Fast. It is shown (subsection 3.3) to be faster in filtering than previous state-of-the-art models, either in terms of absolute time consumption or in terms of time complexity.

• Our model is Adaptable. To demonstrate this, we test it on texts from different scenarios – texts by humans, such as learners of Chinese as a Foreign Language (CFL), and by machines, such as Optical Character Recognition (OCR). It can also be applied to both simplified Chinese and traditional Chinese, despite the challenging issue that some erroneous usages of characters in traditional texts are considered valid usages in simplified texts (see Table 1). To the best of our knowledge, all previous state-of-the-art models only focus on human errors in traditional Chinese texts.

• Our model is Simple. As shown in Figure 1, it has only a masked language model and a filter, as opposed to the multiple feature-producing models and filters used in previous state-of-the-art proposals. Moreover, only a small training set and a set of visual and phonological features of characters are required in our model. No extra data are necessary, including a confusion set. This makes our model even simpler.

• Our model is Powerful. On benchmark data sets, it achieves F1 performance (subsection 3.2) similar to that of previous state-of-the-art models at both the detection and the correction level. It also achieves arguably high precision (78.5% in detection and 73.4% in correction) on our OCR data set.

² The term denoising autoencoder follows the same sense used by Yang et al. (2019), which is arguably more general than the one used by Vincent et al. (2008).
³ The term decoder here is analogous to the one in Viterbi decoder, in the sense of finding the best path among candidates.

Table 1: Examples on the left are considered valid usages in simplified Chinese (SC). Notes on the right describe how they are erroneous in traditional Chinese (TC) and suggest corrections. This inconsistency exists because multiple traditional characters were merged into identical characters in the simplification process. Our model makes corrections for this type of error only in traditional texts; in simplified texts, they are not detected as errors.

SC examples    | Notes on TC usage
周末 (weekend) | 周 → 週; 周 is used only in 周到, etc.
旅游 (trip)    | 游 → 遊; 游 is used only in 游泳, etc.
制造 (make)    | 制 → 製; 制 is used only in 制度, etc.

[Figure 1 shows the erroneous input 国际电台苦名丰持人 passing through the masked language model, which proposes four ranked candidate characters with confidences for every position, and then through the confidence-similarity decoder, which outputs 国际电台著名主持人.]

Figure 1: A real example of how an erroneous sentence which is supposed to mean "a famous international radio broadcaster" is successfully spell-checked, with the two erroneous characters 苦 and 丰 detected and corrected by FASPell. Note that with our proposed confidence-similarity decoder, the final choice for substitution is not necessarily the candidate ranked first.

2 FASPell

As shown in Figure 1, our model uses a masked language model (see subsection 2.1) as the DAE to produce candidates and a confidence-similarity decoder (see subsections 2.2 and 2.3) to filter the candidates. In practice, running several rounds of the whole process is also proven to be helpful (subsection 3.4).

2.1 Masked language model

A masked language model (MLM) guesses the tokens that are masked in a tokenized sentence. It is intuitive to use an MLM as the DAE to detect and correct Chinese spelling errors because it is in line with the task of Chinese spell checking. In the original training process of the MLM in BERT (Devlin et al., 2018), the errors are the random masks, which are the special token [MASK] 80% of the time, a random token from the vocabulary 10% of the time, and the original token 10% of the time. In the cases where a random token is used as the mask, the model actually learns how to correct an erroneous character; in the cases where the original token is kept, the model actually learns how to detect whether a character is erroneous or not. For simplicity, FASPell adopts the MLM architecture of BERT (Devlin et al., 2018). Recent variants – XLNet (Yang et al., 2019) and MASS (Song et al., 2019) – have more complex MLM architectures, but they are also suitable.
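For concreteness, the snippet below shows one way to obtain ranked candidates with confidences for every character from a pre-trained Chinese MLM. It uses the Hugging Face port of BERT rather than the original release the paper relies on, and it reads the output distribution from an unmasked forward pass; both choices are assumptions made for illustration rather than a description of FASPell's exact procedure.

```python
# Sketch: a pre-trained MLM as the DAE, proposing top-c candidates per character.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese").eval()


def mlm_candidates(sentence: str, c: int = 4):
    """Return, for each character, the top-c candidate tokens and their confidences."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]   # (seq_len, vocab)
    results = []
    for pos in range(1, probs.size(0) - 1):                 # skip [CLS] and [SEP]
        conf, ids = probs[pos].topk(c)
        tokens = tokenizer.convert_ids_to_tokens(ids.tolist())
        results.append(list(zip(tokens, conf.tolist())))
    return results


print(mlm_candidates("国际电台苦名丰持人")[:2])
```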

However, just using a pre-trained MLM raises the issue that the errors introduced by random masking may be very different from the actual errors in spell checking data. Therefore, we propose the following method to fine-tune the MLM on spell checking training sets:

• For texts that have no errors, we follow the original training process as in BERT.

• For texts that have errors, we create two types of training examples:

  1. given a sentence, we mask the erroneous tokens with themselves and set their target labels to the corresponding correct characters;

  2. to prevent overfitting, we also mask tokens that are not erroneous with themselves and set their target labels to themselves, too.

The two types of training examples are balanced to have roughly similar quantities.

Fine-tuning a pre-trained MLM has proven to be very effective in many downstream tasks (Devlin et al., 2018; Yang et al., 2019; Song et al., 2019), so one might argue that this is where the power of FASPell mainly comes from. However, we would like to emphasize that the power of FASPell should not be attributed solely to the MLM. In fact, we show in our ablation studies (subsection 3.2) that the MLM by itself can only serve as a very weak Chinese spell checker (its F1 can be as low as 28.9%), and that the decoder utilizing character similarity (see subsections 2.2 and 2.3) is also indispensable to producing a strong Chinese spell checker.
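One possible rendering of this example-construction recipe in code is sketched below. The label convention (-100 as the ignore index) and the `char_to_id` mapping are assumptions borrowed from common PyTorch-style MLM training setups, not details taken from the paper or its released code.

```python
# Sketch: building fine-tuning examples for erroneous sentences. Erroneous
# characters are "masked with themselves" and labelled with their corrections;
# a balanced number of correct characters are labelled with themselves.
import random

IGNORE = -100  # positions with this label do not contribute to the loss


def build_example(wrong: str, correct: str, char_to_id):
    """Return (input ids, label ids) for one <wrong, correct> sentence pair."""
    assert len(wrong) == len(correct)
    error_pos = [i for i, (w, c) in enumerate(zip(wrong, correct)) if w != c]
    ok_pos = [i for i in range(len(wrong)) if i not in error_pos]
    # keep as many correct positions as there are errors, so that the two kinds
    # of training signal stay roughly balanced
    keep_pos = set(error_pos) | set(random.sample(ok_pos, min(len(error_pos), len(ok_pos))))

    input_ids = [char_to_id[ch] for ch in wrong]   # the input keeps the errors as-is
    labels = [char_to_id[correct[i]] if i in keep_pos else IGNORE
              for i in range(len(wrong))]
    return input_ids, labels
```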

2.2 Character similarity

Erroneous characters in Chinese texts written by humans are usually either visually (subsection 2.2.1) or phonologically (subsection 2.2.2) similar to the corresponding correct characters, or both (Chang, 1995; Liu et al., 2010; Yu and Li, 2014). It is also true that erroneous characters produced by OCR possess visual similarity (Tong and Evans, 1996).

We base our similarity computation on two open databases – the Kanji Database Project⁴ and the Unihan Database⁵ – because they provide shape and pronunciation representations for all CJK Unified Ideographs in all CJK languages.

2.2.1 Visual similarity

The Kanji Database Project uses the Unicode standard Ideographic Description Sequence (IDS) to represent the shape of a character. As illustrated by the examples in Figure 2, the IDS of a character is formally a string, but it is essentially the preorder traversal path of an ordered tree.

[Figure 2 shows the stroke-level IDS string of 贫 above a dashed ruling line, and three tree forms (①-③) of the same character at coarser granularity levels.]

Figure 2: The IDS of a character can be given at different granularity levels, as shown in the tree forms ①-③ for the simplified character 贫 (meaning poor). In FASPell, we only use the stroke-level IDS in the form of a string, like the one above the dashed ruling line. Unlike using only actual strokes (Wang et al., 2018), the Unicode standard Ideographic Description Characters (e.g., the non-leaf nodes in the trees) describe the layout of a character. They help us to model the subtle nuances between different characters that are composed of identical strokes (see examples in Table 2). Therefore, IDS gives us a more precise shape representation of a character.

In our model, we only adopt the string-form IDS. We define the visual similarity between two characters as one minus the normalized⁶ Levenshtein edit distance between their IDS representations. The reason for normalization is twofold. Firstly, we want the similarity to range from 0 to 1 for the convenience of later filtering. Secondly, if a pair of more complex characters have the same edit distance as a pair of less complex characters, we want the similarity of the more complex pair to be slightly higher than that of the less complex pair (see examples in Table 2).

We do not use the tree-form IDS, for two reasons, even though it seems to make more sense intuitively. Firstly, even with the most efficient algorithm so far (Pawlik and Augsten, 2015, 2016), tree edit distance (TED) has far greater time complexity than the edit distance of strings (O(mn(m + n)) vs. O(mn)). Secondly, we did try TED in preliminary experiments, but there was no significant difference from using the edit distance of strings in terms of spell checking performance.

⁴ http://kanji-database.sourceforge.net/
⁵ https://unicode.org/charts/unihan.html
⁶ Since the maximal value of the Levenshtein edit distance is the maximum of the lengths of the two strings in question, we normalize it simply by dividing by that maximum length.
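As a concrete illustration of this definition, the small self-contained sketch below reproduces the visual similarity of 午 and 牛 from Table 2. The IDS strings are copied from that table; a full system would read them from the Kanji Database Project files rather than hard-coding them.

```python
# Sketch: visual similarity as 1 - normalized Levenshtein distance between IDS strings.

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def visual_similarity(ids_a: str, ids_b: str) -> float:
    """1 - edit distance normalized by the longer IDS length (range 0..1)."""
    return 1.0 - levenshtein(ids_a, ids_b) / max(len(ids_a), len(ids_b))


ids = {"午": "⿱⿰丿一⿻一丨", "牛": "⿻⿰丿一⿻一丨"}     # stroke-level IDS from Table 2
print(round(visual_similarity(ids["午"], ids["牛"]), 3))  # 0.857, as in Table 2
```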

Table 2: Examples of the computation of character similarities. IDS is used to compute visual similarity (V-sim), and pronunciation representations in Mandarin Chinese (MC), Cantonese Chinese (CC), Japanese On'yomi (JO), Korean (K) and Vietnamese (V) are used to compute phonological similarity (P-sim). Note that the normalization of edit distance gives us the desired fact that the less complex character pair (午, 牛) has a smaller visual similarity than the more complex character pair (田, 由), even though both of their IDS edit distances are 1. Also note that 午 and 牛 have more similar pronunciations in some languages than in others; the combination of the pronunciations in multiple languages gives us a more continuous phonological similarity. V-sim and P-sim are given per character pair.

           | IDS              | MC    | CC    | JO   | K   | V    | V-sim | P-sim
午 (noon)  | ⿱⿰丿一⿻一丨      | wu3   | ng5   | go   | o   | ngọ  | 0.857 | 0.280
牛 (cow)   | ⿻⿰丿一⿻一丨      | niu2  | ngau4 | gyuu | wu  | ngưu |       |
田 (field) | ⿵⿰丨𠃌⿱⿻一丨一   | tian2 | tin4  | den  | cen | điền | 0.889 | 0.090
由 (from)  | ⿻⿰丨𠃌⿱⿻一丨一   | you2  | jau4  | yuu  | yu  | do   |       |

2.2.2 Phonological similarity

Different Chinese characters sharing identical pronunciations is very common (Yang et al., 2012), which is the case in any CJK language. Thus, if we were to use character pronunciations in only one CJK language, the phonological similarity of character pairs would be limited to a few discrete values. However, a more continuous phonological similarity is preferred because it can make the curve used for filtering candidates smoother (see subsection 2.3).

Therefore, we utilize character pronunciations in all CJK languages (see examples in Table 2), which are provided by the Unihan Database. To compute the phonological similarity of two characters, we first calculate one minus the normalized Levenshtein edit distance between their pronunciation representations in every CJK language (where applicable). Then, we take the mean of the results. Hence, the similarity again ranges from 0 to 1.
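The sketch below spells out this computation and reproduces the P-sim of 0.280 for 午 and 牛 using the readings listed in Table 2. Pulling readings from the Unihan Database and handling characters with several readings per language are left out; treat it as an illustration of the formula only.

```python
# Sketch: phonological similarity as the mean, over CJK languages available for
# both characters, of 1 - normalized edit distance between their readings.
from functools import lru_cache


@lru_cache(maxsize=None)
def lev(a: str, b: str) -> int:
    """Levenshtein edit distance (compact memoized recursion; fine for short readings)."""
    if not a or not b:
        return len(a) or len(b)
    return min(lev(a[1:], b) + 1,
               lev(a, b[1:]) + 1,
               lev(a[1:], b[1:]) + (a[0] != b[0]))


def phonological_similarity(prons_a, prons_b) -> float:
    """Mean of 1 - normalized edit distance over languages present for both characters."""
    pairs = [(pa, pb) for pa, pb in zip(prons_a, prons_b) if pa and pb]
    sims = [1.0 - lev(pa, pb) / max(len(pa), len(pb)) for pa, pb in pairs]
    return sum(sims) / len(sims)


# Readings of 午 and 牛 in MC, CC, JO, K, V, copied from Table 2:
print(round(phonological_similarity(("wu3", "ng5", "go", "o", "ngọ"),
                                    ("niu2", "ngau4", "gyuu", "wu", "ngưu")), 3))  # 0.28
```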
2.3 Confidence-Similarity Decoder

Candidate filters in many previous models are based on setting various thresholds and weights for multiple features of candidate characters. Instead of this naive approach, we propose a method that is empirically effective under the principle of getting the highest possible precision with minimal harm to recall. Since the decoder utilizes contextual confidence and character similarity, we refer to it as the confidence-similarity decoder (CSD). The mechanism of the CSD is explained, and its effectiveness justified, as follows.

First, consider the simplest case, where only one candidate character is provided for each original character. For the candidates that are the same as their original characters, we do not substitute the original characters. For those that are different, we can draw a confidence-similarity scatter graph. If we compare the candidates with the ground truths, the graph will resemble plot 1 of Figure 3. We can observe that the true-detection-and-true-correction candidates are denser toward the upper-right corner, false-detection candidates toward the lower-left corner, and true-detection-and-false-correction candidates in the middle area. If we draw a curve to filter out false-detection candidates (plot 2 of Figure 3) and use the rest as substitutions, we can optimize character-level precision with minimal harm to character-level recall for detection; if true-detection-and-false-correction candidates are also filtered out (plot 3 of Figure 3), we get the same effect for correction. In FASPell, we optimize correction performance and manually find the filtering curve using a training set, assuming its consistency with the corresponding testing set. In practice, we have to find two curves – one for each type of similarity – and then take the union of the filtering results.

Now, consider the case where there are c > 1 candidates. To reduce it to the previously described simplest case, we rank the candidates for each original character according to their contextual confidence and put candidates that have the same rank into the same group (i.e., c groups in total). Thus, we can find a filter as previously described for each group of candidates. All c filters combined further alleviate the harm to recall, because more candidates are taken into account.

In the example of Figure 1, there are c = 4 groups of candidates. We get a correct substitution 丰 → 主 from the group whose rank is 1, another one, 苦 → 著, from the group whose rank is 2, and no more from the other two groups.
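To make the decoding rule concrete, here is an illustrative sketch of a per-character CSD step. The per-rank acceptance curves are hand-set thresholds standing in for the manually tuned curves described above, and taking the highest-ranked accepting group as the substitution is my assumption about tie-breaking; neither detail is specified as code in the paper.

```python
# Illustrative sketch of one CSD decision: each rank group has its own acceptance
# curve over (confidence, similarity); the substitution comes from the best-ranked
# group whose (non-identical) candidate passes its curve.
from typing import Callable, Sequence, Tuple

Curve = Callable[[float, float], bool]   # (confidence, similarity) -> accept?


def csd_decode(orig: str,
               candidates: Sequence[Tuple[str, float]],  # (char, confidence), best rank first
               similarity: Callable[[str, str], float],
               curves: Sequence[Curve]) -> str:
    """Return the substitution for one character, or the character itself."""
    for (cand, conf), accept in zip(candidates, curves):
        if cand == orig:
            continue                      # identical candidates are never substituted
        if accept(conf, similarity(orig, cand)):
            return cand
    return orig


# Example curves: lower-ranked groups demand higher similarity before accepting.
example_curves = [
    lambda conf, sim: sim > 0.4 and conf > 0.05,
    lambda conf, sim: sim > 0.6 and conf > 0.05,
    lambda conf, sim: sim > 0.8 and conf > 0.10,
    lambda conf, sim: sim > 0.8 and conf > 0.10,
]
# For contrast, the weighted-sum filter of previous models (plot 4 of Figure 3)
# would correspond to a single curve such as:
#   lambda conf, sim: 0.8 * conf + 0.2 * sim >= 0.8
```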

[Figure 3: four confidence-similarity scatter plots (confidence on the x-axis, similarity on the y-axis) of the same candidates, with different filtering curves drawn in each plot.]

Figure 3: All four plots show the same confidence-similarity graph of candidates, categorized as true-detection-and-true-correction (T-d&T-c), true-detection-and-false-correction (T-d&F-c) and false-detection (F-d). Each plot shows a different way of filtering candidates: in plot 1, no candidates are filtered; in plot 2, the filtering optimizes detection performance; in plot 3, as adopted in FASPell, the filtering optimizes correction performance; in plot 4, as adopted by previous models, candidates are filtered out by setting a threshold on weighted confidence and similarity (0.8 × confidence + 0.2 × similarity < 0.8 as an example in the plot). Note that the four plots use the actual first-rank candidates (using visual similarity) for our OCR data (Trn_ocr), except that we randomly sampled only 30% of the candidates to make the plots more viewable on paper.

3 Experiments and results

We first describe the data, metrics and model configurations adopted in our experiments in subsection 3.1. Then, in subsection 3.2, we show the performance on spell checking texts written by humans to compare FASPell with previous state-of-the-art models; we also show the performance on data harvested from OCR results to prove the adaptability of the model. In subsection 3.3, we compare the speed of FASPell and three state-of-the-art models. In subsection 3.4, we investigate how hyper-parameters affect the performance of FASPell.

3.1 Data, metrics and configurations

We adopt the benchmark datasets (all in traditional Chinese) and the sentence-level⁷ accuracy, precision, recall and F1 given by the SIGHAN13-15 shared tasks on Chinese spell checking (Wu et al., 2013; Yu et al., 2014; Tseng et al., 2015). We also harvested 4575 sentences (4516 of them simplified Chinese) from OCR results of Chinese subtitles in videos, using the OCR method by Shi et al. (2017). Detailed data statistics are given in Table 3.

Table 3: Statistics of datasets.

Dataset | # erroneous sent | # sent | Avg. length
Trn13   | 350              | 700    | 41.8
Trn14   | 3432             | 3435   | 49.6
Trn15   | 2339             | 2339   | 31.3
Tst13   | 996              | 1000   | 74.3
Tst14   | 529              | 1062   | 50.0
Tst15   | 550              | 1100   | 30.6
Trn_ocr | 3575             | 3575   | 10.1
Tst_ocr | 1000             | 1000   | 10.2

⁷ Note that although we do not use character-level metrics (Fung et al., 2017) in evaluation, they are important in justifying the effectiveness of the CSD, as in subsection 2.3.
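For readers reproducing the numbers, the helper below is the standard arithmetic by which precision, recall and F1 are derived from sentence-level counts. What counts as a true positive (e.g., whether every error in a sentence must be found, and at which level) is defined by the SIGHAN evaluation tools and is assumed to be handled elsewhere; only the final formulas are shown.

```python
# Generic precision / recall / F1 arithmetic for sentence-level counts.
def prf(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```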

We use the pre-trained masked language model⁸ provided by Devlin et al. (2018); the settings of its hyper-parameters and its pre-training are available at https://github.com/google-research/bert. The other configurations of FASPell used in our major experiments (subsections 3.2-3.3) are given in Table 4. For the ablation experiments, the same configurations are used, except that when the CSD is removed, we take the candidates ranked first as the default outputs. Note that we do not fine-tune the masked language model for the OCR data because we learned in preliminary experiments that fine-tuning worsens performance for this type of data⁹.

Table 4: Configurations of FASPell. FT means the training set for fine-tuning; CSD means the training set for the CSD; r means the number of rounds and c means the number of candidates for each character. U is the union of all the spell checking data from SIGHAN13-15.

FT        | CSD     | Test set | r | c | FT steps
U − Tst13 | Trn13   | Tst13    | 1 | 4 | 10k
U − Tst14 | Trn14   | Tst14    | 3 | 4 | 10k
U − Tst15 | Trn15   | Tst15    | 3 | 4 | 10k
(-)       | Trn_ocr | Tst_ocr  | 2 | 4 | (-)

3.2 Performance

As shown in Table 6, FASPell achieves state-of-the-art F1 performance at both the detection level and the correction level. It is better in precision than the model by Wang et al. (2018) and better in recall than the model by Zhang et al. (2015). In comparison with Zhao et al. (2017), it is better by every metric. It also reaches comparable precision on the OCR data. The lower recall on the OCR data is partially because many OCR errors are harder to correct even for humans (Wang et al., 2018).

Table 6 also shows that all the components of FASPell contribute effectively to its good performance. FASPell without both fine-tuning and the CSD is essentially the pre-trained masked language model. Fine-tuning it improves recall because FASPell can learn about common errors and how they are corrected. The CSD improves its precision with minimal harm to recall, because this is the underlying principle of the design of the CSD.

3.3 Filtering Speed¹⁰

First, we measure the filtering speed of Chinese spell checking in terms of absolute time consumption per sentence (see Table 5). We compare the speed of FASPell with the model by Wang et al. (2018) in this manner because they have reported their absolute time consumption¹¹. Table 5 clearly shows that FASPell is much faster.

Table 5: Speed comparison (ms/sent). Note that the speed of FASPell is the average over several rounds.

Test set | FASPell | Wang et al. (2018)
Tst13    | 446     | 680
Tst14    | 284     | 745
Tst15    | 177     | 566

Second, to compare FASPell with models (Zhang et al., 2015; Zhao et al., 2017) whose absolute time consumption has not been reported, we analyze time complexity. The time complexity of FASPell is O(scmn + sc log c), where s is the sentence length, c is the number of candidates, mn accounts for computing the edit distance and c log c for ranking the candidates. Zhang et al. (2015) use more features than just the edit distance, so the time complexity of their model has additional factors. Moreover, since we do not use a confusion set, the number of candidates for each character in their model is in practice larger than ours (on the order of tens vs. 4). Thus, FASPell is faster than their model. Zhao et al. (2017) filter candidates by finding the single-source shortest path (SSSP) in a directed graph consisting of all candidates for every token in a sentence. The algorithm they use has a time complexity of O(|V| + |E|), where |V| is the number of vertices and |E| is the number of edges in the graph (Eppstein, 1998). Translated into s and c, the time complexity of their model is O(sc + c^s). This implies that their model is exponentially slower than FASPell for long sentences.

⁸ https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
⁹ This is probably because OCR errors are subject to random noise in the source pictures rather than to learnable patterns as in human errors. However, since this paper is not about OCR, we do not elaborate on this here.
¹⁰ We consider only the filtering speed because the Transformer, the Bi-LSTM and the language models used before filtering, by previous state-of-the-art models or by us, are already well studied in the literature.
¹¹ We have no access to the 4-core Intel Core i5-7500 CPU used by Wang et al. (2018). To minimize the difference in speed caused by hardware, we only use 4 cores of a 12-core Intel(R) Xeon(R) CPU E5-2650 in the experiments.

Table 6: Spell checking performance at both the detection and the correction level. Our model, FASPell, achieves performance similar to that of previous state-of-the-art models. Note that fine-tuning and the CSD both contribute effectively to its performance according to the results of the ablation experiments (− FT means removing fine-tuning; − CSD means removing the CSD).

                                          |        Detection Level        |        Correction Level
Test set | Models                         | Acc.(%)| Prec.(%)| Rec.(%)| F1(%) | Acc.(%)| Prec.(%)| Rec.(%)| F1(%)
Tst13    | Wang et al. (2018)             | (-)    | 54.0    | 69.3   | 60.7  | (-)    | (-)     | (-)    | 52.1
         | Yeh et al. (2013)              | (-)    | (-)     | (-)    | (-)   | 62.5   | 70.3    | 62.5   | 66.2
         | FASPell                        | 63.1   | 76.2    | 63.2   | 69.1  | 60.5   | 73.1    | 60.5   | 66.2
         | FASPell − FT                   | 40.9   | 75.5    | 40.9   | 53.0  | 39.6   | 73.2    | 39.6   | 51.4
         | FASPell − CSD                  | 41.0   | 42.3    | 41.1   | 41.6  | 31.3   | 32.2    | 31.3   | 31.8
         | FASPell − FT − CSD             | 47.9   | 65.2    | 47.8   | 55.2  | 35.6   | 48.4    | 35.4   | 40.9
Tst14    | Zhao et al. (2017)             | (-)    | (-)     | (-)    | (-)   | (-)    | 55.5    | 39.1   | 45.9
         | Wang et al. (2018)             | (-)    | 51.9    | 66.2   | 58.2  | (-)    | (-)     | (-)    | 56.1
         | FASPell                        | 70.0   | 61.0    | 53.5   | 57.0  | 69.3   | 59.4    | 52.0   | 55.4
         | FASPell − FT                   | 57.8   | 54.5    | 18.1   | 27.2  | 57.7   | 53.7    | 17.8   | 26.7
         | FASPell − CSD                  | 49.0   | 31.0    | 42.3   | 35.8  | 44.9   | 25.0    | 34.2   | 28.9
         | FASPell − FT − CSD             | 56.3   | 38.4    | 26.8   | 31.6  | 52.1   | 26.0    | 18.0   | 21.3
Tst15    | Zhang et al. (2015)            | 70.1   | 80.3    | 53.3   | 64.0  | 69.2   | 79.7    | 51.5   | 62.5
         | Wang et al. (2018)             | (-)    | 56.6    | 69.4   | 62.3  | (-)    | (-)     | (-)    | 57.1
         | FASPell                        | 74.2   | 67.6    | 60.0   | 63.5  | 73.7   | 66.6    | 59.1   | 62.6
         | FASPell − FT                   | 61.5   | 74.1    | 25.5   | 37.9  | 61.3   | 72.5    | 24.9   | 37.1
         | FASPell − CSD                  | 65.5   | 49.3    | 59.1   | 53.8  | 60.0   | 40.2    | 48.2   | 43.8
         | FASPell − FT − CSD             | 63.7   | 59.1    | 35.3   | 44.2  | 57.6   | 38.3    | 22.7   | 28.5
Tst_ocr  | FASPell                        | 18.6   | 78.5    | 18.6   | 30.1  | 17.4   | 73.4    | 17.4   | 28.1
         | FASPell − CSD                  | 34.5   | 65.8    | 34.5   | 45.3  | 18.9   | 36.1    | 18.9   | 24.8

3.4 Exploring hyper-parameters

First, we change only the number of candidates in Table 4 to see its effect on spell checking performance. As illustrated in Figure 4, when more candidates are taken into account, additional detections and corrections are recalled while precision is maximized. Thus, increasing the number of candidates always improves F1. We set the number of candidates to c = 4 in Table 4 and no larger because of the trade-off with time consumption.

Second, we do the same for the number of rounds of spell checking in Table 4. We can observe in Figure 4 that the correction performance on Tst14 and Tst15 reaches its peak when the number of rounds is 3. For Tst13 and Tst_ocr, that number is 1 and 2, respectively. A larger number of rounds sometimes helps because FASPell achieves high precision in detection in each round, so errors left undiscovered in one round may be detected and corrected in the next round without falsely detecting too many non-errors. A minimal sketch of such multi-round checking follows Figure 4.

[Figure 4: eight line plots of F1 against the number of candidates (first row) and against the number of rounds (second row), for Tst13, Tst14, Tst15 and Tst_ocr, with separate curves for detection and correction.]

Figure 4: The four plots in the first row show how the number of candidates for each character affects F1 performance. The four plots in the second row show the impact of the number of rounds of spell checking.
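The sketch below illustrates what running several rounds amounts to: the corrected output of one pass is fed back as the input of the next, so errors whose context was corrupted by a neighbouring error can be caught once that neighbour has been fixed. `check_once`, standing for one complete MLM-plus-CSD pass, is an assumed callable, not an API from the released code.

```python
# Multi-round spell checking: re-run the whole pipeline on its own output.
def check_with_rounds(sentence: str, check_once, rounds: int = 3) -> str:
    for _ in range(rounds):
        corrected = check_once(sentence)
        if corrected == sentence:    # converged; further rounds change nothing
            break
        sentence = corrected
    return sentence
```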

4 Conclusion

We propose a Chinese spell checker, FASPell, that reaches state-of-the-art performance. It is based on the DAE-decoder paradigm, which requires only a small amount of spell checking data and gives up the troublesome notion of a confusion set. With FASPell as an example, each component of the paradigm is shown to be effective. We make our code and data publicly available at https://github.com/iqiyi/FASPell.

Future work may include studying whether the DAE-decoder paradigm can be used to detect and correct grammatical errors or other less frequently studied types of Chinese spelling errors, such as dialectical colloquialism (Fung et al., 2017) and insertion/deletion errors.

Acknowledgments

The authors would like to thank the anonymous reviewers for their comments. We also thank our colleagues from the IT Infrastructure team of iQIYI, Inc. for the hardware support. Special thanks go to Prof. Yves Lepage from the Graduate School of IPS, Waseda University for his insightful advice about the paper.

References

Chao-Huang Chang. 1995. A new approach for automatic Chinese spelling correction. In Proceedings of the Natural Language Processing Pacific Rim Symposium, volume 95, pages 278–283. Citeseer.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

David Eppstein. 1998. Finding the k shortest paths. SIAM Journal on Computing, 28(2):652–673.

Gabriel Fung, Maxime Debosschere, Dingmin Wang, Bo Li, Jia Zhu, and Kam-Fai Wong. 2017. NLPTEA 2017 shared task – Chinese spelling check. In Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017), pages 29–34, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Chao-Lin Liu, Min-Hua Lai, Yi-Hsuan Chuang, and Chia-Ying Lee. 2010. Visually and phonologically similar characters in incorrect simplified Chinese words. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 739–747, Beijing, China. Association for Computational Linguistics.

Mateusz Pawlik and Nikolaus Augsten. 2015. Efficient computation of the tree edit distance. ACM Transactions on Database Systems, 40(1):3:1–3:40.

Mateusz Pawlik and Nikolaus Augsten. 2016. Tree edit distance: Robust and memory-efficient. Information Systems, 56:157–173.

Baoguang Shi, Xiang Bai, and Cong Yao. 2017. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11):2298–2304.

D. S. Shih et al. 1992. A statistical method for locating typo in Chinese sentences. CCL Research Journal, pages 19–26.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.

Xiang Tong and David A. Evans. 1996. A statistical approach to automatic OCR error correction in context. In Fourth Workshop on Very Large Corpora.

Yuen-Hsien Tseng, Lung-Hao Lee, Li-Ping Chang, and Hsin-Hsi Chen. 2015. Introduction to SIGHAN 2015 bake-off for Chinese spelling check. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, pages 32–37.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM.

Dingmin Wang, Yan Song, Jing Li, Jialong Han, and Haisong Zhang. 2018. A hybrid approach to automatic corpus generation for Chinese spelling check. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2517–2527, Brussels, Belgium. Association for Computational Linguistics.

Shih-Hung Wu, Chao-Lin Liu, and Lung-Hao Lee. 2013. Chinese spelling check evaluation at SIGHAN bake-off 2013. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 35–42, Nagoya, Japan. Asian Federation of Natural Language Processing.

Shaohua Yang, Hai Zhao, Xiaolin Wang, and Bao-Liang Lu. 2012. Spell checking for Chinese. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Jui-Feng Yeh, Sheng-Feng Li, Mei-Rong Wu, Wen-Yi Chen, and Mao-Chuan Su. 2013. Chinese word spelling correction based on n-gram ranked inverted index list. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 43–48, Nagoya, Japan. Asian Federation of Natural Language Processing.

Junjie Yu and Zhenghua Li. 2014. Chinese spelling error detection and correction based on language model, pronunciation, and shape. In Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 220–223.

Liang-Chih Yu, Lung-Hao Lee, Yuen-Hsien Tseng, and Hsin-Hsi Chen. 2014. Overview of SIGHAN 2014 bake-off for Chinese spelling check. In Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 126–132, Wuhan, China. Association for Computational Linguistics.

Shuiyuan Zhang, Jinhua Xiong, Jianpeng Hou, Qiao Zhang, and Xueqi Cheng. 2015. HANSpeller++: A unified framework for Chinese spelling correction. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, pages 38–45, Beijing, China. Association for Computational Linguistics.

Hai Zhao, Deng Cai, Yang Xin, Yuzhu Wang, and Zhongye Jia. 2017. A hybrid model for Chinese spelling check. ACM Transactions on Asian and Low-Resource Language Information Processing, 16(3):21:1–21:22.
