0% found this document useful (0 votes)
100 views8 pages

Gurmukhi to Shahmukhi Transliteration

The document describes an omni-font Gurmukhi to Shahmukhi transliteration system that can transliterate Gurmukhi text to Shahmukhi at over 98.6% accuracy regardless of the Gurmukhi font used. It first detects the Gurmukhi font through a character-level trigram language model and 41 keyboard mappings. It then converts the text to Unicode using mapping tables. Special processing handles long vowels, conjunct characters, and the interaction of the Sihari vowel with subjoined consonants. The transliterated text is then output in Shahmukhi.

Uploaded by

music2850
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
100 views8 pages

Gurmukhi to Shahmukhi Transliteration

The document describes an omni-font Gurmukhi to Shahmukhi transliteration system that can transliterate Gurmukhi text to Shahmukhi at over 98.6% accuracy regardless of the Gurmukhi font used. It first detects the Gurmukhi font through a character-level trigram language model and 41 keyboard mappings. It then converts the text to Unicode using mapping tables. Special processing handles long vowels, conjunct characters, and the interaction of the Sihari vowel with subjoined consonants. The transliterated text is then output in Shahmukhi.

Uploaded by

music2850
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Proceedings of COLING 2012: Demonstration Papers, pages 313320,

COLING 2012, Mumbai, December 2012.


An Omni-font Gurmukhi to Shahmukhi Transliteration
System
Gurpreet Singh LEHAL
1

Tejinder Singh SAINI
2
Savleen Kaur CHOWDHARY
3

(1) DCS, Punjabi University, Patiala
(2) ACTDPL, Punjabi University, Patiala
(3) CEC, Chandigarh Group of Colleges, Landran, Mohali
gslehal@[Link], tej@[Link], iridiumsavleen@[Link]
ABSTRACT
This paper describes a font independent Gurmukhi-to-Shahmukhi transliteration system. Even
though Unicode is gaining popularity, but still there is lot of material in Punjabi, which is
available in ASCII based fonts. A problem with ASCII fonts for Punjabi is there is no
standardisation of mapping of Punjabi characters and a Gurmukhi character may be internally
mapped to different keys in different Punjabi fonts. In fact there are more than 40 mapping tables
in use for the commonly used Punjabi fonts. Thus there is an urgent need to convert the ASCII
fonts to standard mapping, such as Unicode without any loss of information. Already many such
font converters have been developed and are available online. But one limitation is that, all these
systems need manual intervention in which the user has to know the name of source font. In the
first stage, we have proposed a statistical model for automatic font detection and conversion into
Unicode. Our system supports around 225 popular Gurmukhi font encodings. The ASCII to
Unicode conversion accuracy of the system is 99.73% at word level with TOP1 font detection.
The second stage is conversion of Gurmukhi to Shahmukhi at high accuracy. The proposed
Gurmukhi to Shahmukhi transliteration system can transliterate any Gurmukhi text to Shahmukhi
at more than 98.6% accuracy at word level.
KEYWORDS: n-gram language model, Shahmukhi, Gurmukhi, Machine Transliteration, Punjabi,
Font, Font detection, font Conversion, Unicode
313
1 Introduction
There are thousands of fonts used for publishing text in Indian languages. Punjabi (Gurmukhi)
alone has more than 225 popular fonts which are still in use along with Unicode. Though online
web content has shown the sign of migration from legacy-font to Unicode but books, magazine,
news paper and other publishing industry is still working with ASCII based fonts. They have not
adopted Unicode due to following reasons:
Lack of awareness of Unicode standard
People resist to change, due to Unicode typing issues and little support of Unicode in
publishing software they are working with
Less availability of Unicode fonts has shown very less verity of text representation
A problem with ASCII fonts for Punjabi is there is no standardisation of mapping of Punjabi
characters and a Gurmukhi character may be internally mapped to different keys in different
Punjabi fonts. For example, the word is internally stored at different keys as shown in
figure 1. Therefore, there is
an urgent need to develop a
converter to convert text in
ASCII based fonts to
Unicode without any loss
of information. Already
many such font converters
have been developed and
are available online. But
one limitation with all these systems is that the user has to know the name of source font. This
may not be a big issue but there are many occasions when a user may not be aware of the name
of the ASCII font and in that case he cannot convert his text to Unicode. To overcome this
hurdle, we have developed a font
identification system, which
automatically detects the font and
then converts the text to Unicode.
2 Omni-Font Detection and
Conversion to Unicode
Our system supports around 225
popular Gurmukhi font encodings.
Some of the popular fonts
supported are Akhar, Anmol Lipi,
Chatrik, Joy, Punjabi, Satluj etc. In
fact, these fonts correspond to 41
keyboard mappings. It means if
41 2 1
..... , k k k be the 41 keyboard
mappings and
225 2 1
...... , f f f be the Gurmukhi fonts, then each of fonts
i
f will belong to one of the
keyboard mapping
i
k . We could also have multiple fonts belonging to same keyboard map and
fonts belonging to same keyboard map have same internal mappings for all the Gurmukhi
0070+004D+006A+0077+0062+0049 in Akhar
0066+002E+0075+006A+0057+0067 in Gold
0070+00B5+006A+003B+0062+0049 in AnandpurSahib
0067+007A+0069+006B+0070+0068 in Asees
0050+005E+004A+0041+0042+0049 in Sukhmani
00EA+00B3+00DC+00C5+00EC+00C6 in Satluj
FIGURE 1 Internal representation of in different fonts
Input
ASCII
Gurmukhi
text
Character-level
Analysis
Font Detection
and Unicode
Conversion Char Maps
Unicode text
Trigram Language
Model
41 2 1
..... , k k k

11 Million
Word Corpus
Keyboard Mappings
FIGURE 2 Model Components
314
characters. For example, Akhar2010R and Joy font belong to same keyboard map family. The
problem is now reduced from 225 distinct fonts to just 41 group classes corresponding to each
keyboard map. Therefore, our font detection problem is to classify the input text to one of these
41 keyboard mappings. It could be thought of as a 41 class pattern recognition problem. The
proposed system is based on character-level trigram language model (see figure 2). We have
trained the trigrams corresponding to each keyboard map. For training purpose a raw corpus of
11 million words has been used. To identify the font of a text, all the character trigrams are
extracted and their probability is determined in each of the keyboard map. The keyboard map
having maximum product of trigram probability is identified as the keyboard map corresponding
to the input text.
2.1 Font Detection
The omni font detection problem is
formulated as character level trigram
language model. The single font has 255
character code points. Therefore, in a
trigram model, we need to process 255x255x255=255 code points for a single keyboard map.
But we are dealing with 41 distinct keyboard maps so the memory requirement will be further
increased to process and hold 41x 255 code points. After detailed analysis we found that the
array representation of this task is sparse in nature i.e. the majority of code points in omni fonts
have zero values. We observed that each keyboard map has around 26,000 non-zero code points
which is 0.156% of the original code points. Hence, to avoid sparse array formation, we have
created trigram link list representation (see figure 3) having on an average 26,000 trigram nodes
for only valid code points with respect to each keyboard map. The structure of the node is
expressed in figure 2. The trigram probability of a word
l
w having length l is:

l
w =

=

l
i
i i i
c c c P
1
2 1
) , | ( (1)
The total probability of input text sentence is:
n
w
, 1
=

n
l
w
1
and the keyboard map is detected by K= arg max

k n
l
w
1 1
(2)
2.2 Font Conversion
After the identification of the keyboard map, the handcrafted mapping table of identified font is
used by the system to transform input text into Unicode characters and finally produce the output
Unicode text. Special care has been given for transforming long vowels [], [u], [o], [],
[], [], [i], [e], [], [] and Gurmukhi short vowel sign []. This is because in Unicode
these nine independent vowels [], [u], [o], [], [], [], [i], [e], [], [] with three
bearer characters Ura [], Aira [] and Iri [] have single code points and need to be mapped
accordingly as shown in figure 4. There are no explicit code points for half characters in Unicode.
They will be generated automatically by the Unicode rendering system when it finds special
0 105 58 0.86775

Ch1 Ch2 Ch3 Probability
ptr
Data next
FIGURE 3 Tri-gram Character Node
315
symbol called halant or Viram []. Therefore, Gurmukhi transformation of subjoined consonants
is performed by prefixing halant symbol along with the respective consonant.
Long Vowels
[] + [] + [] + []

[] + [] + [u] + [i]

[] + [] + [o] + [e]
Short Vowel [] [k] +

Nukta Symbol
+ = ; + = + = ; + = ; + = ; + = ;
Subjoined
Consonants
++ = [n
h
] ++ = [pr] ++= [sv]
FIGURE 4 Unicode Transforming Rules
The Unicode transformation
becomes complex when Gurmukhi
short vowel Sihari [] and
subjoined consonants comes
together at a single word position.
For example, consider the word
. In omni font Sihari [] come as first character. But according to Unicode it must be after the
bearing consonant. Therefore, it must go after the subjoined consonant + as shown in figure 5.
3 Gurmukhi-to-Shahmukhi Transliteration System
The overall transliteration process is divided into three tasks as shown in figure 6.
3.1 Pre-Processing
The integration of omni font detection and conversion module enhances the scope, usefulness and
utilization of this transliteration system. The Gurmukhi word is cleaned and prepared for
transliteration by passing it through the Gurmukhi spell-checker and normalizing it according to
the Shahmukhi spellings and pronunciation as shown in Table 1.
Sr. Gurmukhi Word Spell-Checker Normalized Shahmukhi
1
/khudgaraj/ /udaraz/ /daraz/

.-
2
/khush/ /ush/ /sh/ -
3
/pharg/ /farg/ /farg/ ,

TABLE 1 Spell-Checking and Normalization of Gurmukhi Words


3.2 Transliteration Engine
The normalised form of Gurmukhi word is first transliterated to Shahmukhi by using dictionary
lookup. Dictionary lookup is a limited but effective and fast method for handling the complex
spelling words of Shahmukhi and transliteration of proper nouns. Therefore, a one-to-one
FIGURE 5 Complex Unicode Transformation
316
Gurmukhi-Shahmukhi dictionary resource has been created and used for directly transliterating
frequently occurring Gurmukhi words and transliterating Gurmukhi words with typical
Shahmukhi spellings.
Gurmukhi Stemmer: In our case, stemming is primarily a process of suffix removal. A list of
common suffixes has been created. We have taken only the most common Gurmukhi suffixes
such as , , , , etc. The Shahmukhi transliteration of these suffixes is stored in the
suffix list. Thus if the word is not found, we search the suffix list and find suffix . This
suffix is removed from
and the resultant word
is then searched again in
the dictionary. If found then
the transliterated word r is
then appended with , [Link] is
the Shahmukhi transliteration
of the suffix . Thus, the
correct transliteration i.e. r
is achieved.
3.2.1 Rule-base
Transliteration
The rule based transliteration
is used when dictionary lookup
and Gurmukhi stemmer failed
to transliterate the input word.
Using the direct grapheme
based approach; Gurmukhi
consonants and vowel are
directly mapped to similar
sounding Shahmukhi
characters. In case of multiple
equivalent Shahmukhi
characters or one-to-many
mappings, the most frequently
occurring Shahmukhi character
is selected. Thus, is mapped
to and is mapped to . Besides, these simple mapping rules, some special pronunciation based
rules have also been developed. For example, if two vowels in Gurmukhi come together, then
Shahmukhi symbol Hamza is placed in between them. After simple character based mapping we
resolve character ambiguity. That is, the Gurmukhi characters with multiple Shahmukhi
mappings, all word forms using all possible mappings are generated and the word with the
highest frequency of occurrence in the Shahmukhi word frequency list is selected. For example,
consider the Gurmukhi word . It has two ambiguous character [s] and [h]. The system
will generate all the possible forms and then choose the most frequent .
-
. (6432) unigram as
output as shown in the figure 7. Finally, the output word is inspected for correct spelling using
Shahmukhi resources.
Gurmukhi Text
Normalized Words
Transliterated Words
Gurmukhi,
Shahmukhi
Unigram, bi-
gram, trigram
tables
Transliteration Engine
Pre-Processing
Post-Processing
Text Normalization
Omni font to Unicode Converter
Spell-Checking
Rule based Transliteration
Check Shahmukhi Spellings
Gurmukhi-Shahmukhi Dictionary
Shahmukhi Text
HMM Model for
Shahmukhi
Word Disambiguation
Using HMM
Gurmukhi
Spell-Checker
Gurmukhi Stemmer
FIGURE 6 System Architecture
317
Char1



All forms
(with unigram frequency)
Char2

.

.~ (o) .
-
.~ (o)
Char3



.

. (o) .
-
. (o4sz)
Char4

nil .

.
_
(o) .
-
.
_
(o)
Char5


Selected
(most frequent)
.
-
.
FIGURE 7 Steps for handling Multiple Character Mapping
3.3 Post-Processing
The transliteration of Gurmukhi word has two Shahmukhi spellings with different senses as
(state, condition, circumstance) and (Hall; big room). In post-processing we have
modelled 2
nd
order HMM for Shahmukhi word disambiguation where the number of HMM states
are corresponding to each word in the sentence and their possible interpretations or ambiguity.
The transition probabilities and inter-state observation probabilities are calculated as proposed by
Thede and Harper (1999) for POS system.
4 Evaluation and Results
There are two main stages of our system first is font detection and Conversion of detected font to
Unicode and the second stage is conversion of Gurmukhi text to Shahmukhi. The evaluation
results of both the stages are:
Font Detection and Conversion Accuracy: In order to test font detection module we have a set
of 35 popular Gurmukhi fonts. A randomly selected test data of having 194 words and having
962 characters (with spaces) is converted into different Gurmukhi fonts and used for system
testing. The font detection module has correctly predicted 32 fonts. Hence, they have 100%
conversion rate into Unicode. The remaining 3 fonts were not correctly recognized and confused
with other fonts with which the character mapping table is almost similar. The fonts varied by
only 2-3 code maps and hence the confusion. But this confusion did not create much problem in
the subsequent Unicode and then Shahmukhi conversion.
Transliteration Accuracy: We have tested our system on more that 100 pages of text compiled
from newspapers, books and poetry. The overall transliteration accuracy of this system is 98.6%
at word level, which is quite high. In the error analysis we found that this system fails to produce
good transliteration when the input words have typical spellings. The accuracy of this system can
be increased further by increasing the size of the training corpus and having plentiful of data
covering maximum senses of all ambiguous words in the target script.
Conclusion
For the first time a high accuracy Gurmukhi-Shahmukhi transliteration system has been
developed, which can accept data in any of the popular Punjabi font encoding. This will fulfil the
demand from the users to develop such a system, where the Gurmukhi text in any font could be
converted to Shahmukhi.
318
References
Lehal G.S. (2009), A Gurmukhi to Shahmukhi Transliteration System. In Proceedings of ICON:
7
th
International Conference on Natural Language Procesing, pages 167-173, Hyderabad, India.
Lehal, G. S. (2007). Design and Implementation of Punjabi Spell Checker, International
Journal of Systemics, Cybernetics and Informatics, pages 70-75.
Malik, M.G.A. (2006). Punjabi Machine Transliteration. In Proceedings of the 21st
International Conference on Computational Linguistics and 44th Annual Meeting of the ACL,
pages 1137-1144.
Saini, T. S., Lehal, G. S. and Kalra, V. S. (2008). Shahmukhi to Gurmukhi Transliteration
System. In Proceedings of 22nd international Conference on Computational Linguistics
(Coling), pages 177-180, Manchester, UK.
Saini, T. S. and Lehal, G. S. (2008). Shahmukhi to Gurmukhi Transliteration System: A Corpus
based Approach. Research in Computing Science, 33:151-162, Mexico.
Thede, S.M., Harper, M.P. (1999). A Second-Order Hidden Markov Model for Part-of-speech
Tagging, In Proceedings of the 37th annual meeting of the ACL on Computational Linguistics,
pages 175-182.
Davis, M., Whistler, K. (Eds.) (2010). Unicode Normalization Forms, Technical Reports,
Unicode Standard Annex #15, Revision 33. Retrieved February 22, 2011, from
[Link]
319

You might also like