FASPell: A fast, adaptable, simple, powerful Chinese spell checker based on DAE-decoder paradigm
Yuzhong Hong, Xianguo Yu, Neng He, Nan Liu, Junhui Liu
Intelligent Platform Division, iQIYI, Inc.
{hongyuzhong, yuxianguo, heneng, liunan, liujunhui}@qiyi.com
chines are different from those for humans. Therefore, in practice, it is very likely that the correct candidates for substitution do not exist in a given confusion set, which harms recall. Also, considering more similar characters to preserve recall will risk lowering precision.

2. insufficiency in utilizing character similarity. Since a cut-off threshold of quantified character similarity (Liu et al., 2010; Wang et al., 2018) is used to produce the confusion set, similar characters are actually treated indiscriminately in terms of their similarity. This means the information of character similarity is not sufficiently utilized. To compensate for this, Zhang et al. (2015) propose a spell checker that has to consider many less salient features such as word segmentation, which add more unnecessary noise to their model.

1.2 Motivation and contributions

The motivation of this paper is to circumvent the two bottlenecks in subsection 1.1 by changing the paradigm for Chinese spell checking.

As a major contribution, and as exemplified by our proposed Chinese spell checking model in Figure 1, the most general form of the new paradigm consists of a denoising autoencoder[2] (DAE) and a decoder. To prove that it is indeed a novel contribution, we compare it with two similar paradigms and show their differences as follows:

1. Similar to the old paradigm used in previous Chinese spell checking models, a model under the DAE-decoder paradigm also produces candidates (by the DAE) and then filters the candidates (by the decoder). However, candidates are produced on the fly based on contexts. If the DAE is powerful enough, we should expect all contextually suitable candidates to be recalled, which prevents the inflexibility issue caused by using a confusion set. The DAE also prevents the overfitting issue because it can be trained unsupervisedly on a large number of natural texts. Moreover, character similarity can be used by the decoder without losing any information.

2. The DAE-decoder paradigm is sequence-to-sequence, which makes it resemble the encoder-decoder paradigm in tasks like machine translation, grammar checking, etc. However, in the encoder-decoder paradigm, the encoder extracts semantic information and the decoder generates texts that embody that information. In contrast, in the DAE-decoder paradigm, the DAE provides candidates to reconstruct texts from the corrupted ones based on contextual features, and the decoder[3] selects the best candidates by incorporating other features (a minimal code sketch of this two-stage pipeline is given below, after subsection 1.3).

Besides the new paradigm per se, there are two additional contributions in our proposed Chinese spell checking model:

• we propose a more precise quantification method of character similarity than the ones proposed by Liu et al. (2010) and Wang et al. (2018) (see subsection 2.2);

• we propose an empirically effective decoder to filter candidates under the principle of getting the highest possible precision with minimal harm to recall (see subsection 2.3).

1.3 Achievements

Thanks to our contributions mentioned in subsection 1.2, our model can be characterized by the following achievements relative to previous state-of-the-art models, and is thus named FASPell:

• Our model is Fast. It is shown (subsection 3.3) to be faster in filtering than previous state-of-the-art models, either in terms of absolute time consumption or in terms of time complexity.

• Our model is Adaptable. To demonstrate this, we test it on texts from different scenarios – texts by humans, such as learners of Chinese as a Foreign Language (CFL), and texts by machines, such as Optical Character Recognition (OCR) output. It can also be applied to both simplified Chinese and traditional Chinese, despite the challenging issue that some erroneous usages of characters in traditional texts are considered valid usages in simplified texts (see Table 1). To the best of our knowledge, all previous state-of-the-art models focus only on human errors in traditional Chinese texts.

[2] The term denoising autoencoder follows the same sense used by Yang et al. (2019), which is arguably more general than the one used by Vincent et al. (2008).

[3] The term decoder here is analogous to a Viterbi decoder, in the sense of finding the best path along candidates.
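As referenced above, the following minimal sketch (our own illustration in Python, not the released FASPell implementation; all names are placeholders) shows the shape of the two-stage DAE-decoder pipeline: a denoising autoencoder such as a masked language model proposes candidates for every character on the fly, and a decoder then selects substitutions by combining contextual confidence with other features such as character similarity.

# Minimal sketch of the DAE-decoder paradigm (illustrative placeholders only).
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (candidate character, contextual confidence)


def dae_decoder_spell_check(
    sentence: str,
    dae: Callable[[str], List[List[Candidate]]],
    decoder: Callable[[str, List[Candidate]], str],
) -> str:
    """The DAE proposes candidates on the fly; the decoder picks substitutions."""
    # Stage 1 (DAE): for each character, candidates conditioned on its context,
    # e.g. the top-c predictions of a masked language model.
    candidates_per_char = dae(sentence)
    # Stage 2 (decoder): choose at most one substitution per character, combining
    # contextual confidence with other features (e.g. character similarity).
    return "".join(
        decoder(orig_char, candidates)
        for orig_char, candidates in zip(sentence, candidates_per_char)
    )


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    toy_dae = lambda s: [[(ch, 1.0)] for ch in s]   # proposes the input itself
    toy_decoder = lambda orig, cands: cands[0][0]   # keeps the top candidate
    print(dae_decoder_spell_check("国际新闻", toy_dae, toy_decoder))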
Table 1: Examples on the left are considered valid usages in simplified Chinese (SC) but not in traditional Chinese (TC); the notes describe the TC usage.

SC Examples | Notes on TC usage
周末 (weekend): 周 → 週 | 周 only in 周到, etc.
1. given a sentence, we mask the erroneous tokens with themselves and set their target labels as their corresponding correct characters;

2. to prevent overfitting, we also mask tokens that are not erroneous with themselves and set their target labels as themselves, too.

The two types of training examples are balanced to have roughly similar quantity (a code sketch of this construction is given after Figure 2).

Fine-tuning a pre-trained MLM has proven to be very effective in many downstream tasks (Devlin et al., 2018; Yang et al., 2019; Song et al., 2019), so one might argue that this is where the power of FASPell mainly comes from. However, we would like to emphasize that the power of FASPell should not be attributed solely to the MLM. In fact, we show in our ablation studies (subsection 3.2) that the MLM by itself can only serve as a very weak Chinese spell checker (its performance can be as poor as an F1 of only 28.9%), and that the decoder utilizing character similarity (see subsections 2.2 and 2.3) is also indispensable to producing a strong Chinese spell checker.

贫 : ⿱⿱⿰丿乁⿹𠃌丿⿵⿰丨𠃌⿰丿乁
--------------------------------------------------
[Figure 2 also shows tree forms ①-③ of this IDS at different granularity levels.]

Figure 2: The IDS of a character can be given at different granularity levels, as shown in the tree forms ①-③ for the simplified character 贫 (meaning poor). In FASPell, we only use the stroke-level IDS in the form of a string, like the one above the dashed ruling line. Unlike using only actual strokes (Wang et al., 2018), the Unicode standard Ideographic Description Characters (e.g., the non-leaf nodes in the trees) describe the layout of a character. They help us to model the subtle nuances between different characters that are composed of identical strokes (see examples in Table 2). Therefore, IDS gives us a more precise shape representation of a character.
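As a concrete illustration of the two kinds of fine-tuning examples listed above (a sketch under assumed names and toy data; the actual fine-tuning follows the standard masked-LM recipe of Devlin et al., 2018), the code below builds (position, input character, target label) triples: erroneous characters stay in the input but are labeled with their corrections, and a balanced sample of error-free characters is labeled with itself.

import random
from typing import List, Tuple


def build_finetuning_examples(
    wrong_sent: str, correct_sent: str, seed: int = 0
) -> List[Tuple[int, str, str]]:
    """Return (position, input_char, target_label) triples for MLM fine-tuning."""
    assert len(wrong_sent) == len(correct_sent)
    error_pos = [i for i, (w, c) in enumerate(zip(wrong_sent, correct_sent)) if w != c]
    correct_pos = [i for i in range(len(wrong_sent)) if i not in error_pos]

    # Type 1: erroneous tokens masked with themselves, labeled with the correction.
    examples = [(i, wrong_sent[i], correct_sent[i]) for i in error_pos]

    # Type 2: a roughly equal number of error-free tokens labeled with themselves,
    # to prevent overfitting.
    random.seed(seed)
    sampled = random.sample(correct_pos, min(len(error_pos), len(correct_pos)))
    examples += [(i, wrong_sent[i], wrong_sent[i]) for i in sampled]
    return examples


# Hypothetical sentence pair with a single erroneous character.
print(build_finetuning_examples("因际新闻", "国际新闻"))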
Table 2: Examples of the computation of character similarities. IDS is used to compute visual similarity (V-sim), and pronunciation representations in Mandarin Chinese (MC), Cantonese Chinese (CC), Japanese On'yomi (JO), Korean (K) and Vietnamese (V) are used to compute phonological similarity (P-sim). Note that the normalization of edit distance gives us the desired property that the less complex character pair (午, 牛) has a smaller visual similarity than the more complex pair (田, 由), even though both of their IDS edit distances are 1. Also, note that 午 and 牛 have more similar pronunciations in some languages than in others; combining the pronunciations in multiple languages gives us a more continuous phonological similarity.
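As a sketch of the similarity computations exemplified in Table 2 (pronunciation handling is detailed in subsection 2.2.2 below): visual similarity is one minus a normalized Levenshtein edit distance between stroke-level IDS strings, and phonological similarity is the mean of the same quantity over the pronunciation representations available in the CJK languages. Normalizing by the length of the longer string, and the Mandarin-only toy input, are our assumptions for illustration.

from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """1 - edit distance / length of the longer string (normalization assumed)."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)


def visual_similarity(ids_a: str, ids_b: str) -> float:
    """Similarity of two stroke-level IDS strings."""
    return normalized_similarity(ids_a, ids_b)


def phonological_similarity(pron_a: Dict[str, str], pron_b: Dict[str, str]) -> float:
    """Mean similarity over the CJK languages available for both characters."""
    shared: List[str] = [lang for lang in pron_a if lang in pron_b]
    if not shared:
        return 0.0
    return sum(normalized_similarity(pron_a[l], pron_b[l]) for l in shared) / len(shared)


# Toy Mandarin-only pronunciations for 午 and 牛 (illustrative values).
print(phonological_similarity({"MC": "wu3"}, {"MC": "niu2"}))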
2.2.2 Phonological similarity

Different Chinese characters sharing identical pronunciations is very common (Yang et al., 2012), and this is the case for any CJK language. Thus, if we were to use character pronunciations in only one CJK language, the phonological similarity of character pairs would be limited to a few discrete values. However, a more continuous phonological similarity is preferred because it can make the curve used for filtering candidates smoother (see subsection 2.3).

Therefore, we utilize character pronunciations of all CJK languages (see examples in Table 2), which are provided by the Unihan Database. To compute the phonological similarity of two characters, we first calculate one minus the normalized Levenshtein edit distance between their pronunciation representations in all CJK languages (where applicable). Then, we take the mean of the results. Hence, the similarity ranges from 0 to 1.

2.3 Confidence-Similarity Decoder

Candidate filters in many previous models are based on setting various thresholds and weights for multiple features of candidate characters. Instead of this naive approach, we propose a method that is empirically effective under the principle of getting the highest possible precision with minimal harm to recall. Since the decoder utilizes contextual confidence and character similarity, we refer to it as the confidence-similarity decoder (CSD). The mechanism of the CSD is explained, and its effectiveness justified, as follows.

First, consider the simplest case where only one candidate character is provided for each original character. For those candidates that are the same as their original characters, we do not substitute the original characters. For those that are different, we can draw a confidence-similarity scatter graph. If we compare the candidates with the ground truths, the graph will resemble plot ① of Figure 3. We can observe that true-detection-and-true-correction candidates are denser toward the upper-right corner, false-detection candidates toward the lower-left corner, and true-detection-and-false-correction candidates in the middle area. If we draw a curve to filter out false-detection candidates (plot ② of Figure 3) and use the rest as substitutions, we can optimize character-level precision with minimal harm to character-level recall for detection; if true-detection-and-false-correction candidates are also filtered out (plot ③ of Figure 3), we get the same effect for correction. In FASPell, we optimize correction performance and manually find the filtering curve using a training set, assuming its consistency with the corresponding testing set. In practice, we have to find two curves – one for each type of similarity – and then take the union of the filtering results.

Now, consider the case where there are c > 1 candidates. To reduce it to the previously described simplest case, we rank the candidates for each original character according to their contextual confidence and put candidates that have the same rank into the same group (i.e., c groups in total). Thus, we can find a filter as previously described for each group of candidates. All c filters combined further alleviate the harm to recall because more candidates are taken into account.

In the example of Figure 1, there are c = 4 groups of candidates. We get a correct substitution 丰 → 主 from the group whose rank = 1, another one, 苦 → 著, from the group whose rank = 2, and no more from the other two groups.
[Figure 3: four scatter plots ①-④ of Similarity (y-axis, 0 to 1) against Confidence (x-axis, 0 to 1); markers distinguish T-d&T-c and filtered-out candidates.]
Figure 3: All four plots show the same confidence-similarity graph of candidates, categorized as true-detection-and-true-correction (T-d&T-c), true-detection-and-false-correction (T-d&F-c) and false-detection (F-d). Each plot shows a different way of filtering candidates: in plot ①, no candidates are filtered; in plot ②, the filtering optimizes detection performance; in plot ③, as adopted in FASPell, the filtering optimizes correction performance; in plot ④, as adopted by previous models, candidates are filtered out by setting a threshold for weighted confidence and similarity (0.8 × confidence + 0.2 × similarity < 0.8 as an example in the plot). Note that the four plots use the actual first-rank candidates (using visual similarity) for our OCR data (Trnocr), except that we randomly sampled only 30% of the candidates to make the plots more viewable on paper.
We first describe the data, metrics and model configurations adopted in our experiments in subsection 3.1. Then, in subsection 3.2, we show the performance on spell checking texts written by humans to compare FASPell with previous state-of-the-art models; we also show the performance on data harvested from OCR results to prove the adaptability of the model. In subsection 3.3, we compare the speed of FASPell and three state-of-the-art models. In subsection 3.4, we investigate how hyper-parameters affect the performance of FASPell.

3.1 Data, metrics and configurations

We adopt the benchmark datasets (all in traditional Chinese) and the sentence-level[7] accuracy, precision, recall and F1 given by the SIGHAN13-15 shared tasks on Chinese spell checking (Wu et al., 2013; Yu et al., 2014; Tseng et al., 2015). We also harvested 4575 sentences (4516 of which are simplified Chinese) from OCR results of Chinese subtitles in videos, using the OCR method by Shi et al. (2017). Detailed data statistics are given in Table 3.

Table 3: Detailed data statistics.

Dataset | # erroneous sent | # sent | Avg. length
Trn13 | 350 | 700 | 41.8
Trn14 | 3432 | 3435 | 49.6
Trn15 | 2339 | 2339 | 31.3
Tst13 | 996 | 1000 | 74.3
Tst14 | 529 | 1062 | 50.0
Tst15 | 550 | 1100 | 30.6
Trnocr | 3575 | 3575 | 10.1
Tstocr | 1000 | 1000 | 10.2

[7] Note that although we do not use character-level metrics (Fung et al., 2017) in evaluation, they are actually important in the justification of the effectiveness of the CSD, as in subsection 2.3.
We use the pre-trained masked language model[8] provided by Devlin et al. (2018). Settings of its hyper-parameters and pre-training are available at https://github.com/google-research/bert. Other configurations of FASPell used in our major experiments (subsections 3.2-3.3) are given in Table 4. For the ablation experiments, the same configurations are used, except that when the CSD is removed, we take the candidates ranked first as the default outputs. Note that we do not fine-tune the masked language model for the OCR data because we learned in preliminary experiments that fine-tuning worsens performance for this type of data[9].

Table 4: Configurations of FASPell. FT means the training set for fine-tuning; CSD means the training set for the CSD; r means the number of rounds and c means the number of candidates for each character. U is the union of all the spell checking data from SIGHAN13-15.

FT | CSD | Test set | r | c | FT steps
U − Tst13 | Trn13 | Tst13 | 1 | 4 | 10k
U − Tst14 | Trn14 | Tst14 | 3 | 4 | 10k
U − Tst15 | Trn15 | Tst15 | 3 | 4 | 10k
(-) | Trnocr | Tstocr | 2 | 4 | (-)

3.2 Performance

As shown in Table 6, FASPell achieves state-of-the-art F1 performance on both the detection level and the correction level. It is better in precision than the model by Wang et al. (2018) and better in recall than the model by Zhang et al. (2015). In comparison with Zhao et al. (2017), it is better by every metric. It also reaches comparable precision on OCR data. The lower recall on OCR data is partially because many OCR errors are harder to correct, even for humans (Wang et al., 2018).

Table 6 also shows that all the components of FASPell contribute effectively to its good performance. FASPell without both fine-tuning and the CSD is essentially the pre-trained masked language model. Fine-tuning it improves recall because FASPell can learn about common errors and how they are corrected. The CSD improves precision with minimal harm to recall because this is the underlying principle of the design of the CSD.

3.3 Filtering Speed[10]

First, we measure the filtering speed of Chinese spell checking in terms of absolute time consumption per sentence (see Table 5). We compare the speed of FASPell with the model by Wang et al. (2018) in this manner because they have reported their absolute time consumption[11]. Table 5 clearly shows that FASPell is much faster.

Table 5: Speed comparison (ms/sent). Note that the speed of FASPell is the average over several rounds.

Test set | FASPell | Wang et al. (2018)
Tst13 | 446 | 680
Tst14 | 284 | 745
Tst15 | 177 | 566

Second, to compare FASPell with models (Zhang et al., 2015; Zhao et al., 2017) whose absolute time consumption has not been reported, we analyze the time complexity. The time complexity of FASPell is O(scmn + sc log c), where s is the sentence length, c is the number of candidates, mn accounts for computing edit distance and c log c for ranking candidates. Zhang et al. (2015) use more features than just edit distance, so the time complexity of their model has additional factors. Moreover, since we do not use a confusion set, the number of candidates for each character in their model is practically larger than ours: x × 10 vs. 4. Thus, FASPell is faster than their model. Zhao et al. (2017) filter candidates by finding the single-source shortest path (SSSP) in a directed graph consisting of all candidates for every token in a sentence. The algorithm they use has a time complexity of O(|V| + |E|), where |V| is the number of vertices and |E| is the number of edges in the graph (Eppstein, 1998). Translated in terms of s and c, the time complexity of their model is O(sc + c^s). This implies that their model is exponentially slower than FASPell for long sentences.

[8] https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip

[9] This is probably because OCR errors are subject to random noise in source pictures rather than to learnable patterns as in human errors. However, since this paper is not about OCR, we do not elaborate on this here.

[10] We consider only the filtering speed because the Transformer, the Bi-LSTM and the language models used before filtering, by previous state-of-the-art models or by us, are already well studied in the literature.

[11] We have no access to the 4-core Intel Core i5-7500 CPU used by Wang et al. (2018). To minimize the difference in speed caused by hardware, we only use 4 cores of a 12-core Intel(R) Xeon(R) CPU E5-2650 in our experiments.
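To relate the stated complexity to the computation, the schematic loop below (not the released code; all names are illustrative) marks where each term of O(scmn + sc log c) arises: for each of the s characters, ranking its c candidates costs O(c log c), and scoring each candidate requires an edit distance between similarity representations (e.g. IDS strings) of lengths m and n, costing O(mn).

from typing import Callable, List, Tuple


def filter_costs(
    originals: str,                                 # s characters
    candidates: List[List[Tuple[str, float]]],      # c candidates per character
    repr_of: Callable[[str], str],                  # char -> IDS or pronunciation string
    edit_distance: Callable[[str, str], int],
) -> List[List[Tuple[str, float, float]]]:
    """Schematic filtering loop annotated with its cost terms."""
    scored_groups = []
    for orig, cands in zip(originals, candidates):            # s iterations
        cands = sorted(cands, key=lambda x: -x[1])            # O(c log c) ranking
        scored = []
        for cand, conf in cands:                              # c iterations
            a, b = repr_of(orig), repr_of(cand)
            d = edit_distance(a, b)                           # O(mn) per pair
            sim = 1.0 - d / max(len(a), len(b), 1)
            scored.append((cand, conf, sim))
        scored_groups.append(scored)
    # Total: O(s * c * m * n) for edit distances plus O(s * c log c) for ranking.
    return scored_groups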
Table 6: This table shows spell checking performances on both detection and correction level. Our model –
FASPell achieves similar performance to that of previous state-of-the-art models. Note that fine-tuning and CSD
both contribute effectively to its performance according to the results of ablation experiments. (− FT means
removing fine-tuning; − CSD means removing CSD.)
3.4 Exploring hyper-parameters

First, we change only the number of candidates in Table 4 to see its effect on spell checking performance. As illustrated in Figure 4, when more candidates are taken into account, additional detections and corrections are recalled while maximizing precision. Thus, increasing the number of candidates always results in an improvement of F1. The reason we set the number of candidates to c = 4 in Table 4 and no larger is the trade-off with time consumption.

Second, we do the same for the number of rounds of spell checking in Table 4. We can observe in Figure 4 that the correction performance [...] number of rounds is 3. For Tst13 and Tstocr, that [...] can achieve high precision in detection in each [...] detected and corrected in the next round without [...]

[Figure 4: eight plots, one column per test set (Tst13, Tst14, Tst15, Tstocr).]

Figure 4: The four plots in the first row show how the number of candidates for each character affects F1 performance. The four plots in the second row show the impact of the number of rounds of spell checking.

4 Conclusion

We propose a Chinese spell checker – FASPell – that reaches state-of-the-art performance. It is based on the DAE-decoder paradigm, which requires only a small amount of spell checking data and gives up the troublesome notion of a confusion set. With FASPell as an example, each component of the paradigm is shown to be effective. We make our code and data publicly available at https://github.com/iqiyi/FASPell.

Future work may include studying whether the DAE-decoder paradigm can be used to detect and correct grammatical errors or other less frequently studied types of Chinese spelling errors, such as dialectical colloquialism (Fung et al., 2017) and insertion/deletion errors.
Acknowledgments

The authors would like to thank the anonymous reviewers for their comments. We also thank our colleagues from the IT Infrastructure team of iQIYI, Inc. for the hardware support. Special thanks go to Prof. Yves Lepage from the Graduate School of IPS, Waseda University, for his insightful advice about the paper.

References

Chao-Huang Chang. 1995. A new approach for automatic Chinese spelling correction. In Proceedings of Natural Language Processing Pacific Rim Symposium, volume 95, pages 278–283. Citeseer.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

David Eppstein. 1998. Finding the k shortest paths. SIAM Journal on Computing, 28(2):652–673.

Gabriel Fung, Maxime Debosschere, Dingmin Wang, Bo Li, Jia Zhu, and Kam-Fai Wong. 2017. NLPTEA 2017 shared task – Chinese spelling check. In Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017), pages 29–34, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Chao-Lin Liu, Min-Hua Lai, Yi-Hsuan Chuang, and Chia-Ying Lee. 2010. Visually and phonologically similar characters in incorrect simplified Chinese words. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 739–747, Beijing, China. Association for Computational Linguistics.

Mateusz Pawlik and Nikolaus Augsten. 2015. Efficient computation of the tree edit distance. ACM Transactions on Database Systems, 40(1):3:1–3:40.

Mateusz Pawlik and Nikolaus Augsten. 2016. Tree edit distance: Robust and memory-efficient. Information Systems, 56:157–173.

Baoguang Shi, Xiang Bai, and Cong Yao. 2017. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11):2298–2304.

DS Shih et al. 1992. A statistical method for locating typo in Chinese sentences. CCL Research Journal, pages 19–26.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.

Xiang Tong and David A. Evans. 1996. A statistical approach to automatic OCR error correction in context. In Fourth Workshop on Very Large Corpora.

Yuen-Hsien Tseng, Lung-Hao Lee, Li-Ping Chang, and Hsin-Hsi Chen. 2015. Introduction to SIGHAN 2015 bake-off for Chinese spelling check. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, pages 32–37.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM.

Dingmin Wang, Yan Song, Jing Li, Jialong Han, and Haisong Zhang. 2018. A hybrid approach to automatic corpus generation for Chinese spelling check. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2517–2527, Brussels, Belgium. Association for Computational Linguistics.

Shih-Hung Wu, Chao-Lin Liu, and Lung-Hao Lee. 2013. Chinese spelling check evaluation at SIGHAN bake-off 2013. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 35–42, Nagoya, Japan. Asian Federation of Natural Language Processing.

Shaohua Yang, Hai Zhao, Xiaolin Wang, and Bao-Liang Lu. 2012. Spell checking for Chinese. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Jui-Feng Yeh, Sheng-Feng Li, Mei-Rong Wu, Wen-Yi Chen, and Mao-Chuan Su. 2013. Chinese word spelling correction based on n-gram ranked inverted index list. In Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pages 43–48, Nagoya, Japan. Asian Federation of Natural Language Processing.

Junjie Yu and Zhenghua Li. 2014. Chinese spelling error detection and correction based on language model, pronunciation, and shape. In Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 220–223.
Liang-Chih Yu, Lung-Hao Lee, Yuen-Hsien Tseng, and
Hsin-Hsi Chen. 2014. Overview of SIGHAN 2014
bake-off for Chinese spelling check. In Proceed-
ings of The Third CIPS-SIGHAN Joint Conference
on Chinese Language Processing, pages 126–132,
Wuhan, China. Association for Computational Lin-
guistics.
Shuiyuan Zhang, Jinhua Xiong, Jianpeng Hou, Qiao
Zhang, and Xueqi Cheng. 2015. HANSpeller++: A
unified framework for Chinese spelling correction.
In Proceedings of the Eighth SIGHAN Workshop on
Chinese Language Processing, pages 38–45, Bei-
jing, China. Association for Computational Linguis-
tics.