Proposal Bangla LGR 20may20 en
1. General Information
This document lays down the Label Generation Rule Set (LGR) for the Bangla (or
‘Bengali’)1 script under the general rubric of the Neo-Brāhmī Writing System. The three
main components of the Bangla Script LGR, i.e. (i) Code point repertoire, (ii) Variants
and (iii) Whole Label Evaluation Rules, are described in detail here, following
a brief historical background of the Script in Section 3.
1 The term ‘Bangla’ is used in the descriptive text and the term ‘Bengali’ is used in the normative part of this
proposal.
well as in the Andaman and Nicobar Islands (close to a hundred thousand) - accounting
for 8.3% of India. It is a major language in Jharkhand (2.6 million), too and a language
with a sizable population in Bihar (0.44 million). Apart from these, there are a huge
number of Bangla-speaking diasporas spread all over the world. It is the seventh largest
spoken and written language in the world. Bangla is the national and official language of
Bangladesh, and one of the 22 Official languages in India (listed in the 8th Schedule of
the Indian Constitution). It is also one of the official languages of Sierra Leone. The
script is also called Bangla [102], which is an eastern variety of the ‘Brāhmī’ Writing
System, written from left to right. Historically it derives from the Brāhmī alphabet as
used in the Ashokan inscriptions (269-232 BC).
Bangla and its cognate languages, as mentioned above, together form a linguistic group
known as the Eastern New Indo-Aryan (NIA). There is a gross inadequacy of
inscriptions and manuscripts in the Eastern Apabhraṅśa or ‘Avahaṭṭha’, except for small
inscriptions and the manuscripts of the Tantric Buddhist text titled
‘Caryyācaryyaviniścaya’ or the Caryā-Pada [114] dating back to the 9th-11th century. As
a result, there is not much epigraphic evidence for the development of its writing
system. However, the evidence that is available on the genesis of the Bangla writing system is
discussed in Section 3.1 [109].
Historically, the Bangla language is divided into three periods as evident from various
sources:
Bangla prose developed two literary styles during the 19th-20th century: the
Sādhubhāṣā (সাধু ভাষা, "Elegant Language or Style") and the Calitabhāṣā (চলিতভাষা,
"Current Language, or Modern Style"). It is the latter style that is prevalent today in
written prose.
The Language Movement in Bangladesh (then East Pakistan) began in 1948, when civil
society protested against the removal of the Bangla script from currency and stamps,
on which it had been in use since the British Raj. The movement reached its pinnacle in 1952,
when on 21 February the police fired on demonstrating students and civilians,
causing numerous injuries and deaths2. Later, following the Language Movement, on
27 April 1952, the All Party National Language Committee decided to demand the
establishment of an organization for the promotion of the Bengali language. Bangla
Academy, Dhaka, has since its inception in 1955 been engaged in promoting and
fostering Bangla as the lingua franca of the country, both before and after independence from
Pakistan in 1971. Through the various commissions and committees constituted by the
Government of Bangladesh (Bāṅlādeśa Jātīya Śikṣā Kamiśana in 1972, Jātīya Śikṣā
Upadeṣṭā Pariṣad in 1979, Bāṅlā Bhāṣā Bāstabāyana Sela in 1982, Bāṅlā Bhāṣā Kamiṭi in
1983, etc.3) after independence in 1971, Bangla was made the primary medium of
instruction and communication in all governmental and educational activities. Through a
great struggle and bloodshed, the Bengalis established Bangla as an official language of
the state.4
2 The UN declared Ekuśe February (21st February) as the International Mother Language Day at the UNESCO
General Conference in Paris on 17 November 1999 “in recognition of the sanctity and preservation of all
vernacular languages in the world.”22
3 Bāṅlā Bhāṣā Kamiṭi. 1983. Bāṅlā Bhāṣā Kamiṭi Riporṭ (Report of the Bangla Bhasha Committee). Dhaka: Śikṣā,
Dharma, Krīṛā O Saṅskṛti Mantraṇālaya, Peoples Republic of Bangladesh.
4 Chakraborty, Rajib. 2018. The Fishermen’s Community: A Language-Culture Interplay (A Study of Post-1971
Select Bangla Novels). Unpublished Ph.D. Dissertation, Visva-Bharati.
5 William Dwight Whitney in his Sanskrit Grammar unequivocally said, “This name (Devanāgarī) is of
doubtful origin and value” (Whitney, William Dwight. 1994 reprint. Sanskrit Grammar. New Delhi: Motilal
Banarasidass Publishers, p. 1).
Mithilākṣara was used for Maithili from the 14th century until the early 20th century
[106]. In this context, one finds a mention of ‘Sylheti Nāgarī lipi’ or ‘Siloṭi’ (added to the
Unicode Standard in March 2005 with the release of version 4.1), the details of which
could be of interest only to historians and historical linguists (See 137 and 144). For all
practical purposes, however, Sylheti Bangla is now generally written in the modern-day
Bangla script. Originally, during the reign of the Pāla dynasty (750-1154 AD) in
eastern India, and even earlier, perhaps during the Malla period (694 AD onwards),
the Bangla writing system took a shape comparable to the modern one
[111, 119]. A pictorial description of Brāhmī to Modern Bangla Script could be
presented here in a tabular form:
Modern
ক জ ম র স অ
k j m r s a
The inscriptional evidence in Brāhmī is found in the Archaic Brāhmī from the 3rd
century B.C. to the 1st century B.C., in Middle Brāhmī soon after (1st-3rd century
A.D.), and then in the Late Brāhmī (4th-6th century A.D.). This evidence can be seen
in both Bangladesh and West Bengal [108] in (1) the Mahāsthānagaṛa (Bogra district,
Bangladesh; the ancient name being Puṇḍranagara or Pauṇḍravardhanapura)
inscriptions, (2) Brāhmī (and Kharoṣṭhī) inscriptions from the lower ‘Gangetic Bengal’,
and (3) copper plate inscriptions of the Imperial Guptas from the northern part of West
Bengal and North-West Bangladesh, in the areas under Dharmāditya, Gopachandra
and Samācāradeva (about whom one knows only from five copper plates found in
Kotālipāṛā in the Faridpur district in Bangladesh, one in Mallasārul in the Burdwan
district (West Bengal), and one in Jayrāmapura (Balleśvara district, now in Odisha)).
These epigraphs from the eastern part of Undivided India (dating back to the 4th-6th
centuries A.D.) showed some characteristic features of letters (especially in ম ‘ma’, ল ‘la’,
শ ‘śa’, স ‘sa’ and হ ‘ha’), which led to the development of the eastern variety of the Gupta script.
Epigraphic records from Bangladesh demonstrate remarkable developments in Eastern
Brāhmī. Noteworthy in this context are the Tippera copper plate inscription of the ‘Samataṭa’ rulers
(139, p. 265) such as Lokanātha (dated to the latter half of the 7th century A.D.), the
Kailan inscription of Śrīdharaṇa Rāta, and the Astafpur copper plates. The letters
seem to hang down from wedge-shaped solid triangles, with right-hand verticals bending
at the bottom, because of which the script was described by Prinsep and Fleet as Kuṭila-lipi
(literally, ‘Cursive writing style’), whereas the term Siddhamātrikā (as a mātrā or
bar is placed over each of the letters) was used by Al Biruni (973-1048) to designate the
script of Northern India. The next stage of development is illustrated by the 9th century
copper plate inscriptions from Khalimpur of the reign of Dharmapāla, from Monghyr
and Nālandā of the time of Devapāla in Bihar, and from Jagjīvanpura (Malda) of the
reign of Mahendrapāla. The Siddhamātrikā (mentioned as ‘Siddham’ in Chinese sources)
is said to have been prevalent in this region as well up to the end of the tenth century. Also
called the Gauri (i.e. Gandi) in Pūrvadeśā, or the Eastern country, it is regarded as the
script that showed Proto-Bangla characteristics in
rudimentary form in the period between A.D. 875 and A.D. 1025.
The evolution of the Bangla script (Cf. 136) is aligned with the story of the advancement of
printing technology. The first movable type for the script was created and used in
printing Nathaniel Brassey Halhed's (1751-1830) 1778 book, 'A Grammar of the
Bengal Language'. In 1785, Governor-General Warren Hastings (1732-1818) requested
another civilian, Charles Wilkins (1749-1836), to cut punches for Bangla printing
characters. The current printed form of Bangla script appeared soon after. It is generally
agreed that Wilkins developed the Bangla print script [111]. He passed on this knowledge
to Pañcānana Karmakāra (?-1804), a renowned artist in Bengal. Later it was Karmakar
and his family that became famous in Bangla printing technology. Shepherd was
another assistant of Wilkins in the designing of this script, which became more angular with
sharper turns and edges [133]. A few archaic letters were modernized during the 19th
century. The script was standardized by Pandit Ishwar Chandra Vidyasagar when the Bangla
type fonts were to be used to publish on a large scale under the Calcutta School Book
Society [116 for several references].
Much later, in 1935, the Linotype technique, invented by Ottmar Mergenthaler
(1854-1899) in 1886, was introduced into Bangla printing through the efforts of Suresh
Chandra Majumdar (1888-1954), Rajsekhar Basu (1880-1960), Jatindra Kumar Sen
(1882-1966) and his disciple, Sushil Kumar Bhattacharya, and began to be used by
the Ānandabāzara Patrikā group, later followed by others. Within a few years the more
advanced monotype technology came to be used in Bangla printing. However, in Bangla
printing culture, monotype had very limited acceptance, and linotype held the stage until
digital technology eventually replaced all earlier techniques.
PERIOD | DESCRIPTION | NAMES
3rd Century B.C. | Use of Brāhmī and Kharoṣṭhī scripts begins in the subcontinent. Brāhmī was widely used during the reign of the Mauryan King, Aśoka. In one theory, Brāhmī is based on the North Semitic alphabet but suitably modified to fit the needs of local languages. It is currently believed to have been an independent development. | Brāhmī
1st-3rd Century AD | The Kuṣāṇa script, named after the Kuṣāṇa royal dynasty. | Kuṣāṇa script
4th-5th Century AD | The next stage of its evolution was into the Gupta script, named after the Gupta royal dynasty. | Gupta script
8th Century AD | Some copper plate inscriptions are found in Khalimpur, Bangladesh during the reign of Dharmapāla, from Monghyr and Nālandā in Bihar, of the time of Devapāla, and from Jagjīvanapura in West Bengal of the reign of Mahendrapāla. | Siddhamātrikā
The overall development of Bangla Script from the Kuṭila-lipi period to Modern Bangla
could be seen here in Table 3 ([102 and 146] and also see the web-page in 147).
Table 3: Bangla Script in Different Centuries
EGIDS Scale 1 | EGIDS Scale 2 | EGIDS Scale 3 | EGIDS Scale 4 | EGIDS Scale 5 | EGIDS 6
● Vowels can be written as independent letters, or by using a variety of diacritical
marks which are written above, below, before, after, or both before and after
the consonant they follow in pronunciation [105].
● All Bangla consonants when pronounced in isolation are uttered with an inherent
vowel /ɔ/; hence ক ‘k’, খ ‘kh’ or গ ‘g’ are usually pronounced as [kɔ], [khɔ], or
[gɔ], etc. Phonologically, the Bangla vowel /ɔ/ corresponds to the Hindi schwa /ə/.
● When consonants occur together in clusters, special conjunct letters are formed.
In printed Bangla, many of these consonantal clusters or conjoined consonants
are in use. The letters for the consonants other than the final one in the group are
generally reduced. But there are a few special conjunct characters which are
compounds of the consonant characters, e.g. ক(k)+ষ(ṣ)=ক্ষ(kṣ),
ঞ(ñ)+জ(j)=ঞ্জ(ñj), জ(j)+ঞ(ñ)=জ্ঞ(jñ), হ(h)+ম(m)=হ্ম(hm). There are other issues
also: র as the second member of a cluster is reduced to a secondary symbol, e.g.
প(p)+র(r)=প্র(pr), ষ(ṣ)+ট(ṭ)+র(r)=ষ্ট্র(ṣṭr) (as in উষ্ট্র uṣṭra “camel”); য (y), when used
as a primary symbol, represents /jɔ/ in Bangla. But its secondary symbol
(allograph) jɔ-phalā has two phonetic values. When added to the initial
consonant in a word, it is a vowel /æ/ (as in শ্যামল (śyāmala) “green”, র‍্যাপার
(ryāpāra) “wrapper”, etc.). But after a non-initial consonant, it just doubles it in
pronunciation (as in কার্য, ধার্য, etc.). The র(r)+য(y) combination has two
renderings: র‍্য(ry) and র্য(ry). In case of দ(d)+ধ(dh), গ(g)+ধ(dh), ন(n)+ধ(dh) the
shape of the second member is changed, e.g. দ্ধ(ddh), গ্ধ(gdh), and ন্ধ(ndh)
respectively. The solitary example of র(r)+ঋ(ṛ)=র্ঋ (as in নৈর্ঋত nairṛt
"Southwest"), used mostly in cases of Classical borrowings, shows the use of the
secondary symbol of a consonant followed by the primary symbol of a vowel.
The inherent vowel only applies to the final consonant of the cluster.
● In consonant clusters, many consonants take a completely different form. Some
typical examples are ক্ত (kt), ক্র (kr), ক্ষ (kṣ), গ্ধ (gdh), জ্ঞ (jñ), ঞ্চ (ñc), ঞ্জ (ñj), ট্ট (ṭṭ), ন্ত
(nt), ন্ধ (ndh), ব্ধ (bdh), ভ্র (bhr), ম্ব (mb), স্ত (st) etc. র has two allographs, apart from
its full shape: one is ‘repha’, as found in র্ক (rk), র্প (rp); and another is ra-phalā,
as in প্র (pr), ক্র (kr). ষ্ণ (ṣ+ṇ) is another one, where the cerebral nasal consonant
sign takes an unusual shape. [151]
● The Bangla script has at least fifty-two primary symbols and quite a few
allographs (positional variants of them), corresponding to forty-four phonemes
(7 oral and 7 nasal vowels and 30 consonants) (150), or functional speech sounds,
with some obvious redundancies, although in one of the first phonemic analyses
the number was thought to be thirty-five phonemes [140].
● As mentioned above, in Bangla, several graphemic symbols have secondary
shapes, technically called ‘allographs’ with a complementary distribution in each
case. These graphs or markings are generally added to the following positions of
the primary symbol [113] in the following manner:
Similarly, there could be vowel modifiers of ঊ or ‘(Long) ū’ as well; e.g.
ভ (bh) + র (r) (ভ্রূ bhrū “eyebrow”), শ (ś) + র (r) (শ্রূ śrū), ঋ (ṛ) after হ (h) (হৃ hṛ), etc.
● There have been many notable contributions in simplifying and modifying Bangla
spellings and combinatory techniques, especially by scholars such as Pabitra
Sarkar (1992) [134]. In this there has been an attempt to reduce the number of
allographs of both vowels and consonants in clusters, and it has been widely
accepted in the printing of school texts in both Bangladesh and West Bengal [151,
152]. As of now, two systems, the old (traditional), and the new, go on side by
side, operative in different domains.
However, in preparation of this LGR document, the aim has been to consider the widely
used and usable sequences and combinations and their variations across the sister
scripts belonging to the basket of Brāhmī writing systems.
Bangla Academy, Dhaka published Standard Bangla Spelling Rules in 1992 following the
recommendations of a committee constituted through a workshop jointly organized by
the Jātīya Śikṣākrama and Pāṭhyapustaka Board in 1988. A thoroughly revised edition of
the Rules was published in September 2012.6
After the establishment of the Bāṅlā Ākādemi of West Bengal in 1986, its first President,
Annadasankar Ray (1904-2002), in his inaugural address, gave a direction for the
standardization of the Bangla alphabet, script and spelling system, and clearly argued that
they would not blindly follow the Sanskritic model of conventional grammar. A broad
list of proposals was sent to experts on Bangla, and a broad agreement was reached for
‘homogenization of Bangla spelling’ by 1988. Based on opinions received from different
quarters, a unanimous list of ‘rules’ was agreed upon. This was published as a ‘Spelling
Dictionary’ titled Ākādemi Bānāna Abhidhāna (1997), which was more
comprehensive than ‘The University of Calcutta proposals’ made in 1936. Along with
the ‘rationalization’ of spellings, another step was taken to make the writing system
easier to read, by making the symbols used, both single and combined ones, more
‘transparent’. These reforms were originally suggested by Sarkar (1987, first published
in 1978) [134] [153], where he used the terms Swaccha (‘Transparent’) and Aswaccha
(‘Opaque’ or non-transparent), even adding Ardha Swaccha (‘half transparent’) in
between the two. Some examples are:
Transparent: ন্ন (nn), প্ত (pt), স্ত (st), where both members of the cluster can be
recognized.
6 Bangla Academy. 2012. Bāṅlā Ekaḍemī Pramita Bāṅlā Bānānera Niyama (Bangla Academy Standard
Bangla Spelling Rules). Dhaka: Bangla Academy.
Opaque: where neither of the two could be (easily) recognized, e.g. ক্ষ kṣ (ক k + ষ ṣ), জ্ঞ jñ
(জ j + ঞ ñ), ঙ্গ ṅg (ঙ ṅ + গ g), হ্ম hm (হ h + ম m).
Semi-transparent: প্র (pr), র্প (rp), where one symbol is recognizable and the other is
not. In case of three-term clusters, at least one symbol will not be transparent, e.g. স্ত্র str
(স s+ত t+র r), ষ্ট্র ṣṭr (ষ ṣ+ট ṭ+র r), etc.
There were, in fact, two types of proposals. One concerned the shape of the letters,
those of consonant + vowel (CV) combinations and conjuncts, which are consonant +
consonant combinations. There were further complex shapes, i.e. those of consonant +
consonant + (consonant +) vowel (CC(C)V) signs, as in প্রু (pru) or স্ক্রু (skru). Some
decisions in this area were necessary because a few of the CC(C) symbols represented
complexities that made learning them difficult for children. The other dealt with the
spellings of words only, without any reference to the shapes of letters in which they
were written. The basic objective here was ‘one word, one spelling’, to the greatest
extent possible. [151]
Below we place a statement of the most salient changes that affect the consonant +
vowel combinations. [153]
a. The variants of the short u (হ্রস্ব উ-কার hrasva u-kāra) vowel sign have been
brought down to one, i.e., ◌ু. So (gu) is now গু. Similarly (ru) > রু,
(śu) > শু, (hu) > হু; and therefore, for cluster + short u sign: (ntu) > ন্তু
(ন+◌্+ত+উ), (stu) > স্তু (স+◌্+ত+উ).
b. The variants of long u (দীর্ঘ ঊ-কার dīrgha u-kāra) have also been reduced:
(rū) > রূ; (bhrū) > ভ্রূ (ভ bh+◌্+র r+ঊ ū); (drū) > দ্রূ (দ d+◌্+র r+ঊ ū); (śrū) >
শ্রূ (শ ś+◌্+র r+ঊ ū).
c. The variants of ঋ-কার (ṛ-kāra "secondary symbol of ṛ") have been brought
down to one: (hṛ) > হৃ.
Regarding consonant + consonant + (consonant)… + (vowel) clusters, the Paschimbanga
Bangla Akademi proposed transparent or semi-transparent shapes for clusters to the
extent admissible in the Bangla writing system. Some examples will clarify the proposal
(each transliterated cluster is followed by its component letters; the Bangla Akademi
innovation keeps those components recognizable where the traditional cluster-shape
fuses them). [153]
bdh (ব b+ধ dh), ddh (দ d+ধ dh), nth (ন n+থ th), ñc (ঞ ñ+চ c), ñch (ঞ ñ+ছ ch),
ñj (ঞ ñ+জ j), kt (ক k+ত t), kr (ক k+র r), gdh (গ g+ধ dh), ṅk (ঙ ṅ+ক k),
ṅg (ঙ ṅ+গ g), ṣṇ (ষ ṣ+ণ ṇ), ndhr (ন n+ধ dh+র r), ṇḍr (ণ ṇ+ড ḍ+র r),
ktr (ক k+ত t+র r)
3.3.1 The Consonants
As per traditional classification Bangla Consonants are categorized according to their
phonetic properties, especially in terms of place and manner of articulation [107]. There
are Five ‘Varga’ (pronounced as ‘Barga’ in Bangla) or Groups (sets or classes)
distinguished by Place of Articulation, and one Non-‘varga’ group [105]. Each Varga,
which corresponds to Stops at a certain place of articulation, contains a series of five
consonants classified as per their phonetic qualities (i.e. manner of articulation),
beginning from Unvoiced and Unaspirated to Voiced and Aspirated (in the fourth
column), finally ending with a Homorganic or Corresponding nasal [107]. Consider the
following table:
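The five-by-five arrangement described above can be sketched in Python. This is a minimal illustration, not part of the proposal: the grid uses the conventional Bangla varga letters, and the names `COLUMNS`, `VARGAS` and `classify` are chosen here for illustration only.

```python
# Columns: manner of articulation, in the traditional order.
COLUMNS = [
    "unvoiced unaspirated",
    "unvoiced aspirated",
    "voiced unaspirated",
    "voiced aspirated",
    "nasal",
]

# Rows: the five vargas, keyed by place of articulation.
VARGAS = {
    "velar": "কখগঘঙ",
    "palatal": "চছজঝঞ",
    "retroflex": "টঠডঢণ",
    "dental": "তথদধন",
    "labial": "পফবভম",
}


def classify(letter):
    """Return (varga, manner) for a varga consonant, or None otherwise."""
    if len(letter) != 1:
        return None
    for varga, letters in VARGAS.items():
        idx = letters.find(letter)
        if idx != -1:
            return varga, COLUMNS[idx]
    return None
```

A letter such as র, which belongs to the non-varga group, yields `None` under this sketch.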
Table 6: Non-Varga consonants (Not falling into any of the five categories)
3.3.2 The Implicit Vowel Killer: Hasanta (called ‘Halant’ or ‘Halanta’ in other
Brāhmī-based scripts)
As stated earlier, all consonants are pronounced in isolation with an implicit vowel
(central back /-ɔ/ in Bangla as the neutral vowel) assumed to be associated with them
[121]. The term ‘Hasanta’ (= ‘Halant’ or ‘Halanta’ in other Brāhmī-based scripts) and the term
‘Virāma’7 (= ‘Dā̃ri’ in Bangla), as preferred in UNICODE (cf. Unicode 3.0 and above), are
both used in this report to denote the character that marks
the absence of this inherent vowel. It may be noted that the term virāma has been
adopted in UNICODE in a sense that is different from the traditional definition in
grammar, and hence it requires some explanation here. Considering the importance of
the document, this note should be a part of this LGR document, so that anybody referring
to it is able to know the proper grammatical explanation of the term. A
special sign is needed whenever this implicit vowel is stripped off; this symbol is known
as the Hasanta (= Halant) "◌্" (U+09CD). By placing the Hasanta under the first
consonant of a combination or cluster, one can, in common parlance, “kill” its vowel
and create conjuncts. In this manner, conjunct characters can generally be written by
joining two to four consonants. In rare cases, this process can join up to
five consonants. However, the notion of a maximum number of consonants joining to
form one akṣara8 is to be bounded empirically. This is an observation based on the CIIL-
Emille Corpora of Bangla words [132 & 133] as seen in print these days. Given the
mixture of scripts and languages happening on the web, the possibility that one may
want a generic Top Level Domain [gTLD] which has more than the observed
maximum cannot be ruled out. This can be the case when a foreign language word,
which admits a large number of consonants, is transliterated into Bangla. Hence, in the
Bangla LGR work, this limit will not be enforced.
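The way Hasanta chains consonants into a conjunct can be illustrated with a short Python sketch. The helper names `make_conjunct` and `consonants_in_cluster` are hypothetical; the counter simply counts code points of general category `Lo` and would therefore also count independent vowels, which is an accepted simplification here.

```python
import unicodedata

HASANTA = "\u09CD"  # BENGALI SIGN VIRAMA


def make_conjunct(*consonants):
    """Join consonants with Hasanta so each non-final one loses its inherent vowel."""
    return HASANTA.join(consonants)


def consonants_in_cluster(cluster):
    """Count letters (general category Lo) in a Hasanta-joined cluster."""
    return sum(1 for ch in cluster if unicodedata.category(ch) == "Lo")


ksha = make_conjunct("\u0995", "\u09B7")            # ka + Hasanta + ssa
stra = make_conjunct("\u09B8", "\u09A4", "\u09B0")  # sa + Hasanta + ta + Hasanta + ra
```

Because the LGR does not enforce a maximum cluster length, a count like this would serve only for observation, not for rejecting labels.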
3.3.3 Vowels
Separate symbols exist for all ‘Swara’ or Vowels in Bangla, which are pronounced
independently either at the beginning of the word or after another vowel or consonant
sound. To indicate a Vowel sound other than the implicit one, a Vowel sign, called ‘kār’
in Bangla or Mātrā in Nāgarī9, is attached to the consonant. Since the consonant has this
built-in neutral vowel at the end, there are equivalent kāras (Mātrās) for all vowels
except অ (pronounced /-ɔ/). The correlation is shown as follows:
7 Virāma, as used here, is also a misnomer according to the Indian grammatical traditions. Nowhere is the mere
absence of a vowel marked as virāma. Hasanta just marks the absence of a vowel, nothing else.
(Abhyankar, Kashinath Vasudev & J. M. Shukla. 1961. A Dictionary of Sanskrit Grammar. Baroda: Oriental
Institute.)
8 This term needs to be disambiguated. Akṣara also means ‘syllable’ in Indian grammatical traditions.
9 The term ‘Mātrā’ in Bangla, however, stands for an altogether different concept, viz. the top bar placed
over a letter, typically present in Hindi and Bangla but missing in Gujarati.
Vowel | Corresponding vowel sign (kāras (Mātrās))
অ ‘A’ U+0985 | -
- | ◌ৗ U+09D7
- | ঽ U+09BD Avagraha
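The vowel-to-kāra correlation can be sketched as a lookup table in Python. The pairs below follow the Unicode Bengali block (each `BENGALI LETTER X` has a matching `BENGALI VOWEL SIGN X`); the names `VOWEL_TO_KARA` and `syllable` are illustrative and not taken from the proposal.

```python
# Independent vowel -> dependent vowel sign (kara).
# The vowel A (U+0985) has no sign: the bare consonant already carries it.
VOWEL_TO_KARA = {
    "\u0986": "\u09BE",  # AA
    "\u0987": "\u09BF",  # I
    "\u0988": "\u09C0",  # II
    "\u0989": "\u09C1",  # U
    "\u098A": "\u09C2",  # UU
    "\u098B": "\u09C3",  # VOCALIC R
    "\u098F": "\u09C7",  # E
    "\u0990": "\u09C8",  # AI
    "\u0993": "\u09CB",  # O
    "\u0994": "\u09CC",  # AU
}


def syllable(consonant, vowel):
    """Return consonant + kara; for A (U+0985) return the bare consonant."""
    if vowel == "\u0985":
        return consonant
    return consonant + VOWEL_TO_KARA[vowel]
```

For example, `syllable("\u0995", "\u0987")` yields ka followed by the vowel sign I, which renders as কি.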
3.3.4 The Anusvāra /onuʃʃār/ (◌ং - U+0982)
The Anusvāra or /onuʃʃār/ in Bangla at times, but not always, represents a homorganic nasal.
It replaces a conjunct group of ‘Nasal Consonant + Hasanta + Consonant’ where
the second consonant belongs to the Velar varga or set, as in লংকা. But it often also appears
for such combinations involving non-velars as the last member of the
combination, as in ল্যাংটা “naked”, or ল্যাংচা “a kind of sweet/to limp”. Before a non-varga
consonant, the Anusvāra represents a nasal sound that may have an alternative
conjoined writing symbol representing the corresponding nasal consonant of the
particular set. Although Modern Hindi, Marathi and Konkani prefer the anusvāra to the
corresponding half-nasal, in Bangla it is clearly demarcated where one must use
the Anusvāra and where it has to be a conjunct cluster with a nasal as the first or the
second component.
The IDNA Protocol (RFC 5891) states that IDNs must be in Unicode Normalization Form
C (NFC). RFC 7940 applies this requirement to LGRs. The definition of NFC in the
Unicode Standard contains a number of composition exclusions. As a result, the Bangla
letters য় YYA, ড় RRA and ঢ় RRHA have to be represented in this LGR by using the
sequences (YA + Nukta: U+09AF + U+09BC), (DDA + Nukta: U+09A1 + U+09BC), and
(DDHA + Nukta: U+09A2 + U+09BC) instead of the single code points YYA (U+09DF), RRA
(U+09DC), and RRHA (U+09DD), although the use of ‘Nukta’ is otherwise completely
unnatural in Bangla.
It is noted that in the current Unicode Standard chart, these characters are listed as
additional consonants. As per the LGR Procedure, however, these decisions depend on
the IDNA Protocol through a set of procedures developed by the IETF. Even though the
Unicode Standard also encodes these three characters as
atomic characters (09DC for ড় [ṛ], 09DD for ঢ় [ṛh], and 09DF for য় [y], each entered with a
single keystroke), the IDNA protocol requires that we treat them as sequences of a base
consonant followed by Nukta, using the code points allocated in the Unicode Bengali block.
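The effect of the composition exclusions can be checked with Python's standard `unicodedata` module. This is a small sketch; the table name `PRECOMPOSED_TO_SEQUENCE` and the helper `to_nfc` are illustrative.

```python
import unicodedata

# Single code point -> the base + Nukta sequence that NFC produces for it.
PRECOMPOSED_TO_SEQUENCE = {
    "\u09DC": "\u09A1\u09BC",  # RRA  -> DDA  + Nukta
    "\u09DD": "\u09A2\u09BC",  # RRHA -> DDHA + Nukta
    "\u09DF": "\u09AF\u09BC",  # YYA  -> YA   + Nukta
}


def to_nfc(label):
    """Normalize a label to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", label)


for single, sequence in PRECOMPOSED_TO_SEQUENCE.items():
    # The atomic code point decomposes and, being excluded, is not recomposed.
    assert to_nfc(single) == sequence
    # The sequence is already in NFC and stays unchanged.
    assert to_nfc(sequence) == sequence
```

Since labels must be in NFC, only the two-code-point sequences can ever appear in a normalized label, which is why the repertoire lists them as sequences.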
It may be noted that there are sporadic cases of writing Muslim names,
Urdu poetic words and Perso-Arabic loan words with a nukta under ক (k), খ (kh), গ (g), জ
(j) and ফ (ph), only for the sake of correct pronunciation and for maintaining the
sanctity of the loan word. This amounts to using the Bangla writing system like
the IPA script. The practice is, however, not in use in printed Bangla.
The Avagraha "ঽ" (U+09BD) is mainly used in Sanskrit, Pāli, Prākṛt or Maithili texts
written in Bangla. It is gradually being replaced by an upper comma (e.g. নরোঽপরাণি
rewritten as নরো’পরাণি). It is rarely used now even in other languages using the Bangla script.
The Avagraha is not part of the LGR repertoire: it has been decided not to retain
Avagraha (ঽ) (U+09BD) because it is blocked in TLDs as per the Maximal
Starting Repertoire (MSR).
Please see Appendix II in section 11 for a complete list of Bangla consonants and their
allographs.
3.3.8 Zero Width Non-joiner (U+200C) and Zero Width Joiner (U+200D)
This note is pertinent to the use of the Zero Width Joiner (ZWJ) and Zero Width Non-joiner
(ZWNJ) in Bangla. It needs to be noted that Nepali, Konkani and Hindi use these
two signs in a different manner.
ZWJ (U+200D) and ZWNJ (U+200C) are code points that have been provided by the
Unicode Standard to instruct the rendering of a string where the script has the option
between joining and non-joining characters. Without the use of these control codes, the
string may be rendered in a form other than the one intended.
Use of ZWJ
• Insofar as Bangla is concerned, ZWJ is used for the proper rendering of characters
such as khaṇḍa-ta /ৎ/ as in সত্যজিৎ (satyajit) “Satyajit” and সৎ (sat) “honest”. This
is typed as follows:
ta + Hasanta + ZWJ (U+200D)
• However, ZWJ is more important where the same combination of consonantal
characters is represented differently depending upon the context. E.g. র+◌্+য
has two representations in Bangla: র্য and র‍্য. To get the form র্য one has to
type র+◌্+য, but for র‍্য the sequence is
র+ZWJ+◌্+য [154]. In other words, ZWJ is used in the rendering of words
demanding ya-phalā after ra, which is otherwise not possible to type (render)
because the same sequence ra+hasanta+antastha ja, in medial and/or final
position, is used to type repha on the
consonant antastha ja, as in কার্য (kaarjo). In order to get a ya-phalā after the
consonant ra it is therefore obligatory to use ZWJ after ra, as in র‍্যাপার
(wrapper), র‍্যাশ (rash), র‍্যালি (rally) etc. The typing sequence is given below:
ra (র) + ZWJ + hasanta (◌্) + antastha ja (য) = র‍্য
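The ZWJ sequences listed above can be written out code point by code point. The following Python sketch only builds the strings and prints their code points; how they render depends on the font and shaping engine, and the variable and function names are chosen here for illustration.

```python
RA, YA, TA = "\u09B0", "\u09AF", "\u09A4"
HASANTA, ZWJ = "\u09CD", "\u200D"

repha_on_ya = RA + HASANTA + YA        # ra + Hasanta + ya: repha over ya
ra_ya_phala = RA + ZWJ + HASANTA + YA  # ra + ZWJ + Hasanta + ya: ra with ya-phala
khanda_ta = TA + HASANTA + ZWJ         # ta + Hasanta + ZWJ, as described in the text


def code_points(s):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


print(code_points(repha_on_ya))
print(code_points(ra_ya_phala))
print(code_points(khanda_ta))
```

The two ra + ya strings differ only by the invisible ZWJ, which is the reason the distinction has to be made in the underlying sequence rather than by eye.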
Use of ZWNJ
• ZWNJ is used in Bangla to represent the explicit Hasanta or Halant. In
order to avoid conjunct formation in cases where there is an explicit hasanta
before the succeeding consonant, the ZWNJ is used.
The Zero Width Non-joiner (ZWNJ) is an invisible character used in certain cases (after
Hasanta) where default conjunct formation is to be explicitly restricted and the Hasanta
joining the two consonants participating in the conjunct formation needs to be explicitly
shown.
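The contrast between default conjunct formation and an explicit Hasanta can be sketched in Python as two strings that differ only by ZWNJ. The example letters (ka and ta) and the helper `strip_join_controls` are assumptions made for illustration.

```python
KA, TA = "\u0995", "\u09A4"
HASANTA, ZWNJ, ZWJ = "\u09CD", "\u200C", "\u200D"

conjunct = KA + HASANTA + TA                 # default: shaped as the conjunct kt
explicit_hasanta = KA + HASANTA + ZWNJ + TA  # ZWNJ after Hasanta blocks the conjunct


def strip_join_controls(s):
    """Remove ZWJ and ZWNJ to expose strings that differ only by join controls."""
    return s.replace(ZWNJ, "").replace(ZWJ, "")
```

Comparing `strip_join_controls(explicit_hasanta)` with `conjunct` shows that the two strings share the same visible letters and differ only in the invisible control.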
Ya-Phalaa sequences are two instances in Bangla where Hasanta is preceded by a full
vowel (U+0985 অ - BENGALI LETTER A and U+098F এ - BENGALI LETTER E).
• অ্যা 0985 09CD 09AF 09BE
BENGALI LETTER A + BENGALI SIGN VIRAMA +
BENGALI LETTER YA + BENGALI VOWEL SIGN AA
• এ্যা 098F 09CD 09AF 09BE
BENGALI LETTER E + BENGALI SIGN VIRAMA + BENGALI LETTER YA +
BENGALI VOWEL SIGN AA
For rendering Ya-phalā after অ and এ, it is necessary to type U+09CD Hasanta
plus U+09AF ya preceded by the said vowels. This is a purely ligatural entity, and the
addition of Ya-phalā and ākāra is used to elicit the /æ/ sound as in English 'acid' অ্যাসিড,
'association' অ্যাসোসিয়েশন, ‘bat’ ব্যাট, ‘fat’ ফ্যাট, ‘mat’ ম্যাট, ‘cap’ ক্যাপ etc.
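The two sequences above can be inspected with Python's `unicodedata` module. The sketch below names each code point and detects a Hasanta that directly follows an independent vowel; the helper names are illustrative, and the range U+0985 to U+0994 is used as a simple approximation of the independent vowels.

```python
import unicodedata

a_ya_aa = "\u0985\u09CD\u09AF\u09BE"  # A + Hasanta + YA + vowel sign AA
e_ya_aa = "\u098F\u09CD\u09AF\u09BE"  # E + Hasanta + YA + vowel sign AA


def describe(s):
    """Return the Unicode character name of every code point in s."""
    return [unicodedata.name(ch) for ch in s]


def hasanta_after_vowel(s):
    """True if U+09CD directly follows a code point in U+0985..U+0994."""
    return any(
        "\u0985" <= first <= "\u0994" and second == "\u09CD"
        for first, second in zip(s, s[1:])
    )
```

An ordinary consonant cluster such as ka + Hasanta + ya returns `False` under this check, while both sequences above return `True`.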
The Brāhmī script by nature does not have Hasanta after a vowel. Hasanta is generally
described as a ‘vowel killer’, although it actually indicates the absence of a vowel after the
marked consonant. Only consonants can be marked with the Hasanta. But as we see
here, Bangla ends up with a deviant orthographic feature in which Hasanta
comes immediately after a vowel in the ligatures অ্যা and এ্যা (cf. Unicode 10.0, p. 473 [100]).
Owing to co-occurrence with HASANTA, RA either loses its own implicit vowel (REPHA),
or suppresses the implicit vowel of the preceding consonant (RA-PHALĀ). For instance,
repha = ra + Hasanta + C (e.g. র্ক i.e. ra + Hasanta + ka, as in অর্ক arka “the sun”); ra-phalā =
C + Hasanta + ra (e.g. ক্র i.e. ka + Hasanta + ra, as in চক্র chakra “cycle”). The point is that in
both cases the slot for ra could be filled by Bangla ra র (U+09B0) or Assamese ra ৰ
(U+09F0), followed or preceded by the common Hasanta (U+09CD), whereas the shapes
of repha and ra-phalā remain the same in both cases. The LGR makes a note of this
point of concern with respect to the two RAs in disguise, as it would be completely
impossible to distinguish between them with the naked eye in a label so generated, which
may consequently lead to concerns related to spoofing and other kinds of cyber
irregularities. The motive for classing these two CPs as (blocking) variants is that fully
rendered labels may mask the distinction between Bangla ra র (U+09B0) and
Assamese ra ৰ (U+09F0). That provides the justification for Variant Set 4, though only in
the context of a following Hasanta. The difference between the RAs is distinguishable only
if one looks at their Unicode values. Therefore, labels such as অর্ক arka, শীর্ষ śīrṣa ‘top/
apex’, অভ্র abhra ‘cloud/the sky’, শ্রম śrama ‘physical labour’ could be extremely
dangerous, as the web user may never verify the digital content (the labels) against its
Unicode values/code points. This point is made explicitly with reference to Table 9 (of
sequences, p. 36) and Table 16 (of WLE Symbols, p. 47) that are to follow. Moreover, it
is noteworthy that the REPHA can also occur with KHANDA TA. The conditions in this
context of KHANDA TA are liable to be such that the C should be either RA U+09B0 (র)
(used in Bangla) or RA U+09F0 (ৰ) (used in Assamese).
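The confusability of the two RAs next to Hasanta can be illustrated with a small Python sketch. It is not the LGR's variant mechanism: `fold_ra` and `confusable` are hypothetical helpers that fold Assamese ra to Bangla ra only when it is adjacent to Hasanta, under the assumption that this covers both the repha and the ra-phalā positions.

```python
BANGLA_RA, ASSAMESE_RA = "\u09B0", "\u09F0"
HASANTA = "\u09CD"


def fold_ra(label):
    """Replace Assamese ra with Bangla ra where it is adjacent to Hasanta."""
    out = []
    for i, ch in enumerate(label):
        before = i > 0 and label[i - 1] == HASANTA
        after = i + 1 < len(label) and label[i + 1] == HASANTA
        if ch == ASSAMESE_RA and (before or after):
            ch = BANGLA_RA
        out.append(ch)
    return "".join(out)


def confusable(a, b):
    """True if two distinct labels become identical after folding ra."""
    return a != b and fold_ra(a) == fold_ra(b)


arka_bangla = "\u0985" + BANGLA_RA + HASANTA + "\u0995"      # arka with U+09B0
arka_assamese = "\u0985" + ASSAMESE_RA + HASANTA + "\u0995"  # arka with U+09F0
```

Here `confusable(arka_bangla, arka_assamese)` holds even though the two strings differ in one code point, which mirrors the reason given above for treating the pair as blocking variants.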
The Neo-Brāhmī Generation Panel (NBGP) has been formed by members having
experience in Linguistics (especially in NLP / Computational linguistics), Literature,
Language History and Epigraphy. Under the Neo-Brāhmī Generation Panel, Bangla and
eight other scripts belonging to separate Unicode blocks are being taken up to assign a
separate LGR for each. However, an attempt is made to ensure that the fundamental
philosophy behind building those LGRs is consistent with all other Brāhmī-derived scripts.
The present LGR will cater to multiple languages belonging to EGIDS scale 1 to 4 (see
Table 4) that use Bangla script.
The following guiding principles are used in making decisions about Bangla LGR Code-
points:
4.2.1 External Limits of Scope
The code point repertoire for root zone being a very special case, at the top of protocol
hierarchies, the canvas of available characters for selection as a part of the Root Zone
code point repertoire is already constrained by various protocol layers beneath it. The
following three main protocols/standards act as successive filters:
Out of all the characters that are needed by the script in question, if a particular
character is not encoded in Unicode, it cannot be incorporated in the code point
repertoire. Such cases are quite rare, and especially so in the Bangla-Asamiyā-Maṇipuri
Writing System, given the elaborate and exhaustive character inclusion efforts made by
the Unicode Consortium.
Unicode being the character-encoding standard for providing the maximum possible
representation of a given script/language, it has encoded as far as possible all the
possible characters needed by the script. However, the Domain name being a
specialized case, it is governed by an additional protocol known as IDNA
(Internationalized Domain Names in Applications). The IDNA protocol excludes some
characters out of Unicode repertoire from being part of the domain names.
The Root-zone LGR being the repertoire of characters which are going to be used for
creation of the Root-zone TLDs, which in turn constitute an even more specialized case
of domain names, the ROOT LGR procedure introduces additional exclusions on the
IDNA’s allowed set of characters.
Example: Bangla Sign Avagraha "ঽ" (U+09BD), even though allowed by the IDNA protocol,
is not permitted in the Root Zone Repertoire as per the MSR.
To sum up, the restrictions start off with admitting only such characters as are part of
the code-block of the given script/language. The IDNA Protocol further narrows this
down and finally an additional filter in the form of Maximal Starting Repertoire restricts
the character set associated with the given language even more.
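The three successive filters can be pictured as nested set membership tests. The following Python sketch is only an illustration: the three sets are tiny hand-picked samples, not the real Unicode, IDNA or MSR tables, and the function names are hypothetical.

```python
# Toy samples only (assumptions for illustration, not the real tables).
ENCODED_IN_UNICODE = {0x0995, 0x09B0, 0x09BD, 0x09FA}
# Assumed IDNA-permitted subset: the symbol U+09FA (ISSHAR) is left out.
ALLOWED_BY_IDNA = {0x0995, 0x09B0, 0x09BD}
# Assumed MSR subset: the Avagraha U+09BD is left out, as in the example above.
IN_MSR = {0x0995, 0x09B0}

FILTERS = [
    ("Unicode", ENCODED_IN_UNICODE),
    ("IDNA", ALLOWED_BY_IDNA),
    ("MSR", IN_MSR),
]


def first_failed_filter(cp):
    """Return the name of the first filter that rejects cp, or None."""
    for name, allowed in FILTERS:
        if cp not in allowed:
            return name
    return None


def root_zone_candidates(needed):
    """Keep only the code points that pass all three filters."""
    return sorted(cp for cp in needed if first_failed_filter(cp) is None)


if __name__ == "__main__":
    for cp in (0x0995, 0x09BD, 0x09FA):
        print(f"U+{cp:04X}", first_failed_filter(cp))
```

Each stage can only remove code points, never add them, which is why the order of the filters matters for reporting the reason for an exclusion but not for the final result.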
4.2.1.2 No Symbols and Abbreviations
Abbreviations, weights and measures and other such iconic characters like BANGLA
ISSHAR "৺" (U+09FA), BANGLA CURRENCY DENOMINATOR SIXTEEN "৹" (U+09F9) etc.
will also not be included.
4.2.1.5 ABNF
The Augmented Backus-Naur Formalism (ABNF) is described in Section 5.4.1 and
Appendix (Section 10.1).
5. Repertoire
The Bangla Writing System is represented in UNICODE using the Bengali (Bangla) script
name as enumerated in ISO 15924 corresponding to languages such as Asamiyā
(Assamese), Bangla (Bengali) and Maṇipuri. The BENGALI block used for
Bangla-Asamiyā-Maṇipuri in the UNICODE has 93 entries. This section details the code-point
repertoire that the Neo-Brāhmī Generation Panel [NBGP] proposes to be included in the
Bangla LGR.
It may be mentioned here that the Government of Assam has submitted a proposal to
Bureau of Indian Standards (BIS) on 26th February 2016 for dis-unification of Bangla
and Asamiyā Scripts. The BIS in its 8th Meeting of Indian Language Technologies and
Products Sectional Committee, LITD 20, held on 23rd Aug 2017, decided to refer the
proposal for recognition of Assamese script in ISO/IEC 10646 to ISO. Until the UNICODE
Consortium takes any further action, it will be assumed that the Code Point Repertoire
under Table 11 will be valid for all the three languages as above.
For each of the code points, language references have been given in the last column,
titled "Reference", of Table 8, the “Code Point Repertoire”. For full coverage
of Bangla code points, references from Bangla, Asamiyā (Assamese), Maṇipuri (Meitei) and
Bishnupriya are given. Kokborok, written in Bangla script, is not known to have
introduced many new complications, except for one particular character. Though only a
few representative languages under EGIDS Scale 1-4 have been chosen for referencing,
they together cover all the code points required for all the languages that the NBGP has
considered, as given under Bangla Unicode Points (as in UNICODE 6.3).
However, before the details are presented, it is useful to look at the Bangla Code Point
Chart from the Maximal Starting Repertoire [MSR] Version 3. It may be noted that the shapes of
the reference glyphs given below in the code charts are based on one of the many fonts
designed, and are not prescriptive, because there could be some variations in actual
fonts – both UNICODE-compatible and True-Type ones. Consider the following code
point table:
Colour convention:
Given the Bangla Unicode Block as in Figure 1, among the code points that are included in the
MSR, the following symbols will need separate treatment:
ৎ U+09CE Bangla Letter Khaṇḍa-Ta
ৰ U+09F0 Asamiyā-Bangla Letter Ra With Middle Diagonal
ৱ U+09F1 Asamiyā-Bangla Letter Ra With Lower Diagonal
5.1 Code Point Repertoire Inclusion
No. Unicode Gly Character Category Language(s), References and
Code ph Name with EGIDS Comment
Point Value
Apart from the above individual code-points, the Neo-Brāhmī Generation Panel also
proposes some specific sequences which enable conditional inclusion of the "Bangla
LETTER A and E" followed by Bangla SIGN VIRAMA and Bangla LETTER YA again
followed by Bangla VOWEL SIGN AA in the repertoire for enabling inclusion of /æ/
sound as in English ‘bat’, ‘cat’ etc.
Sr. Unicode Seque Character Names Example Reference
No. Code nce languages
Points using the
code-point
(Not
exhaustive
list)
Table 9: Sequences
5.3 Code point not used alone
BENGALI SIGN NUKTA U+09BC (See 3.3.6) is excluded from the repertoire as a standalone
code point since it is never used alone. It is used only as part of a sequence, in the
normalized forms of the three special characters ড়, ঢ়, য়.
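The normalized forms mentioned above can be checked with Python's standard `unicodedata` module. The sketch below assumes the three letters are entered as the precomposed code points U+09DC, U+09DD and U+09DF; under NFC each of them becomes a base letter followed by the Nukta.

```python
import unicodedata

NUKTA = "\u09BC"  # BENGALI SIGN NUKTA

# Precomposed letter -> base letter it is built on.
PRECOMPOSED_TO_BASE = {
    "\u09DC": "\u09A1",  # RRA  = DDA  + NUKTA
    "\u09DD": "\u09A2",  # RHA  = DDHA + NUKTA
    "\u09DF": "\u09AF",  # YYA  = YA   + NUKTA
}


def normalized_sequence(ch):
    """Return the NFC form of a single character."""
    return unicodedata.normalize("NFC", ch)


if __name__ == "__main__":
    for pre in PRECOMPOSED_TO_BASE:
        seq = normalized_sequence(pre)
        print(f"U+{ord(pre):04X} ->", " ".join(f"U+{ord(c):04X}" for c in seq))
```

This is why the Nukta only ever needs to appear inside a defined sequence: a label in normalized form carries the base letter plus U+09BC rather than the precomposed letter.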
The Augmented Backus-Naur Formalism (ABNF) will use the following Operators:
1 “|“ Alternative
2 “[ ]” Optional
4 “( )” Sequence Group
Table 11: The ABNF Formalism
In what follows, the Vowel Sequence and the Consonant Sequence pertinent to Bangla
are given to facilitate understanding.
A vowel sequence is made up of a single vowel, which may optionally be followed by an
Anusvāra /onuʃʃār/ (B), a Candrabindu (D) or a Visarga /biʃɔrgo/ (X).
More than one of D, B or X may follow a V in Bangla, but the same sign may not be
repeated: going by the rules illustrated in this document, formations such as VDD,
VBB and VXX are invalid orthographic units. It is, however, valid and possible to have
sequences that combine a candrabindu with an anusvāra on the one hand, and a
candrabindu with a visarga on the other, as in হ্যাঁংচা ‘hænchā’ and হ্যাঁঃ ‘hæn’
respectively.
Examples: V অ अ
Examples:
VB অং अं
VD অঁ अँ
VX অঃ अः
VDB অঁং अँं
VDX অঁঃ अँः
VHCM অ্যা / এ্যা
Examples:
VHCMB অ্যাং / এ্যাং
VHCMD অ্যাঁ / এ্যাঁ
VHCMX অ্যাঃ / এ্যাঃ
VHCMDB অ্যাঁং / এ্যাঁং
VHCMDX অ্যাঁঃ / এ্যাঁঃ
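The vowel-sequence patterns listed above can be sketched as a single regular expression. This is a minimal illustration and not the normative rule set: the set of independent vowels is an assumption (the repertoire table is the authority), and the order D before B or X simply follows the VDB and VDX examples.

```python
import re

# Assumption: independent vowels of the Bengali block (see the repertoire table).
V = "[\u0985-\u098C\u098F\u0990\u0993\u0994]"
# The special sequence: A or E + Hasanta + YA + vowel sign AA (VHCM above).
S = "[\u0985\u098F]\u09CD\u09AF\u09BE"
D = "\u0981"  # Candrabindu
B = "\u0982"  # Anusvara
X = "\u0983"  # Visarga

# A vowel (or the special sequence), then optional D, then optional B or X.
VOWEL_SEQ = re.compile(f"(?:{S}|{V}){D}?[{B}{X}]?")


def is_vowel_sequence(text):
    """True if text is exactly one vowel sequence as sketched here."""
    return VOWEL_SEQ.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("\u0985", "\u0985\u0981\u0982", "\u0985\u0982\u0982"):
        print([f"U+{ord(c):04X}" for c in sample], is_vowel_sequence(sample))
```

Because each of D, B and X appears at most once in the pattern, repeated signs such as VDD, VBB and VXX are rejected without any extra check.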
Example: C ক क
Example:
CM কি / কৃ कि / कृ
CB কং कं
CD কঁ कँ
CX কঃ कः
CH ক্ क् (Pure consonant)
CDB কঁং कँं
CDX কঁঃ कँः
5.4.3.3 CM Sequence
A CM sequence can be optionally followed by B, D, X, DB or DX.
Example:
CMB কীং / কৃং कीं / कृं
CMD কাঁ काँ
CMX বীঃ वीः
CMDB কাঁং काँं
CMDX কাঁঃ काँः
*3(CH)C
Example:
CHC ন্ত → ন + ◌্ + ত    न्त → न + ◌् + त
5.4.3.5 Subsets:
CHCX ক্কঃ → ক ◌্ ক ◌ঃ    क्कः → क ◌् क ◌ः
CHCDB ক্কঁং → ক ◌্ ক ◌ঁ ◌ং    क्कँं → क ◌् क ◌ँ ◌ं
CHCDX ক্কঁঃ → ক ◌্ ক ◌ঁ ◌ঃ    क्कँः → क ◌् क ◌ँ ◌ः
[CH]Z
Example:
র + ◌্ + ৎ = র্ৎ as in ভর্ৎসনা (bhartsanā) "scolding"
Note: In this context of KHANDA TA, the C must be either RA
U+09B0 (র) (used in Bangla) or RA U+09F0 (ৰ) (used in Assamese).
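The two patterns *3(CH)C and [CH]Z, together with the RA condition in the note, can be sketched at the code point level as follows. The consonant test is an assumption made for the sketch (assigned letters in U+0995..U+09B9 plus U+09F0 and U+09F1); the repertoire table remains the authority, and the function names are illustrative.

```python
import re
import unicodedata

HASANTA = "\u09CD"             # H
KHANDA_TA = "\u09CE"           # Z
RA_SET = {"\u09B0", "\u09F0"}  # the two RAs allowed before H + Z


def is_consonant(ch):
    """Assumed consonant test; the repertoire table is authoritative."""
    cp = ord(ch)
    if cp in (0x09F0, 0x09F1):
        return True
    return 0x0995 <= cp <= 0x09B9 and unicodedata.category(ch) == "Lo"


def to_symbols(text):
    """Map each code point to C, H, Z or '?' (anything else)."""
    out = []
    for ch in text:
        if ch == HASANTA:
            out.append("H")
        elif ch == KHANDA_TA:
            out.append("Z")
        elif is_consonant(ch):
            out.append("C")
        else:
            out.append("?")
    return "".join(out)


CLUSTER = re.compile(r"(?:CH){0,3}C")  # *3(CH)C


def is_consonant_cluster(text):
    return CLUSTER.fullmatch(to_symbols(text)) is not None


def is_khanda_ta_sequence(text):
    """[CH]Z, where the optional C must be one of the two RAs."""
    symbols = to_symbols(text)
    if symbols == "Z":
        return True
    return symbols == "CHZ" and text[0] in RA_SET
```

For example, `is_khanda_ta_sequence("\u09B0\u09CD\u09CE")` accepts the sequence found in ভর্ৎসনা, while the same sequence starting with ক (U+0995) is rejected.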
Two special cases involving sequences (referred to as S and P in Table 16 under Section
7) can be described briefly here. Let us take up S first. It is noteworthy
that there are two instances in Bangla where Hasanta (U+09CD) is preceded by a full
vowel (U+0985 অ - BANGLA LETTER A and U+098F এ - BANGLA LETTER E). To
render Ya-phalā after অ and এ, it is necessary to type U+09CD plus U+09AF ya
after the said vowels. This is a purely ligatural entity, and the addition of
Ya-phalā and ā-kāra is used to elicit the /æ/ sound as in English ‘bat’, ‘fat’ etc. The Brāhmī
script, by nature, does not have Hasanta after a vowel. Hasanta is generally described as a
‘vowel killer’, although it actually indicates the absence of a vowel after the marked
consonant. Only consonants can have the Hasanta marked. But as we see here,
Bangla ends up with a deviant feature in its orthography, in which Hasanta comes
immediately after a vowel in the ligatures অ্যা and এ্যা (Cf Unicode 10.0 p. 473 [100]).
10 Refer to Rule P in Section 7, Table 16.
Another case concerns the formation of repha and ra-phalā in the said script,
referred to as P. Owing to co-occurrence with HASANTA, RA either
loses its own implicit vowel (REPHA), or suppresses the implicit vowel of the preceding
consonant (RA-PHALĀ). For instance, repha = ra + Hasanta + C (e.g. র্ক i.e. ra + Hasanta +
ka, as in অর্ক arka “the sun”); ra-phalā = C + Hasanta + ra (e.g. ক্র i.e. ka + Hasanta + ra, as
in চক্র cakra “cycle”). The point is that in both cases the slot for ra could be filled by
Bangla ra র (U+09B0) or Assamese ra ৰ (U+09F0), followed or preceded by the common Hasanta
(U+09CD), whereas the shapes of repha and ra-phalā remain the same in both cases.
6. Variants
This section discusses the variants in the Bangla script. The NBGP categorizes these
confusable variants into two groups.
For Group 1, any identical code points are defined as variants. The confusable, but not
identical, cases are not proposed, as there is another panel (the String Similarity Assessment
Panel) entrusted to deal with such cases. However, cases which belong to Group 2 are
proposed to be considered as variants. These are not cases of mere visual similarity, as
they involve some deviations from the widely accepted norms of Bangla Akshar
formation. They can cause confusion even to a careful observer and are hence being
proposed as variants.
Variants arise in a script when two or more forms that look the same are stored with
different code points. In Bangla the e-kāra, ā-kāra and the o-kāra have
different code points. One can type o-kāra after a consonant in one go, or obtain the same
result by typing e-kāra and ā-kāra as two separate keys. A reader cannot
differentiate between the two ko (কো), one typed with a single key and the other
typed with two different keys. However, this will not be considered a case of variant,
because a kāra followed by a kāra is not allowed.
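The o-kāra case can be verified with Python's standard `unicodedata` module: the two-key spelling composes to the single o-kāra code point under NFC. The variable names are illustrative.

```python
import unicodedata

KA = "\u0995"       # BENGALI LETTER KA
E_KARA = "\u09C7"   # BENGALI VOWEL SIGN E
AA_KARA = "\u09BE"  # BENGALI VOWEL SIGN AA
O_KARA = "\u09CB"   # BENGALI VOWEL SIGN O

one_key = KA + O_KARA               # 'ko' typed with a single o-kara
two_keys = KA + E_KARA + AA_KARA    # 'ko' typed as e-kara followed by aa-kara

# NFC composes e-kara + aa-kara into the single o-kara code point.
same_after_nfc = unicodedata.normalize("NFC", two_keys) == one_key

print("raw strings equal:", one_key == two_keys)
print("equal after NFC:  ", same_after_nfc)
```

The raw strings differ, but they become identical once normalized, which is consistent with the text's position that this pair does not need to be declared as a variant.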
CASE I:
As far as true variants in Bangla are concerned, we may draw attention to cases
wherein থ (tha, U+09A5) appears with Hasanta in a conjunct with স (sa, U+09B8) or
ন (na, U+09A8).
The above combinations, if written in traditional orthography, can be a little confusing,
because the থ (tha) in the conjunct appears like a হ (ha). The conjunct can occur in initial,
medial or final position (as shown below in example no. 1). It could also be typed wrongly,
in the belief that it was a হ (ha) U+09B9, increasing the risks in label writing and
identification.
Examples:
1. স্থ and স্হ (as in স্থান sthāna, স্থূল sthūla, স্বাস্থ্য svāsthya, অস্থায়ী asthāyī)
2. ন্থ and ন্হ (as in গ্রন্থ grantha)
Fonts which represent the traditional Bangla writing system tend to create this
problem. Therefore, these may be taken as cases of variants in Bangla.
CASE II:
Another interesting example of a variant is encountered in the ra + Hasanta and Hasanta + ra
combinations when writing labels in the Bangla script (for languages such as Bangla,
Assamese and Maṇipuri). The variant cases arise in typing ‘repha’ (involving ra +
Hasanta) and ‘ra-phalā’ (involving Hasanta + ra).
‘Repha’ can be formed by two sequences (mainly because both Assamese and Bangla
share the same UNICODE block, and ‘B_RA’ as well as ‘A_RA’ refer to the same
phonetic element). Here, the final ligatures look the same, and will be as follows:
(1) B_RA + H + C
(2) A_RA + H + C
Where
B_RA = U+09B0 BENGALI LETTER RA (র)
A_RA = U+09F0 BENGALI LETTER RA WITH MIDDLE DIAGONAL (ৰ)
H = U+09CD BENGALI SIGN VIRAMA (◌্)
C = any consonant (theoretically)
Example:
U+09B0 (র) U+09CD (◌্) U+0995 (ক) → র্ক    U+09F0 (ৰ) U+09CD (◌্) U+0995 (ক) → ৰ্ক
U+09B0 (র) U+09CD (◌্) U+09A0 (ঠ) → র্ঠ    U+09F0 (ৰ) U+09CD (◌্) U+09A0 (ঠ) → ৰ্ঠ
Note: As Bangla and Assamese ক and ঠ look exactly the same, the resultant
combinations with 'Repha' look identical. Addition of 'Repha' does not make any
difference.
‘Ra-phalā’ can be formed by two sequences on similar grounds, and the final ligatures
would look the same:
(1) C1 + H + B_RA
(2) C1 + H + A_RA
Where
C1 = any consonant except Khaṇḍa-ta
Example:
U+0995 (ক) U+09CD (◌্) U+09B0 (র) → ক্র    U+0995 (ক) U+09CD (◌্) U+09F0 (ৰ) → ক্ৰ
U+09A8 (ন) U+09CD (◌্) U+09B0 (র) → ন্র    U+09A8 (ন) U+09CD (◌্) U+09F0 (ৰ) → ন্ৰ
As the Assamese and Bangla Repha and Ra-phalā conjunct forms look the same, this
could cause confusion for end-users. Hence, the repha and ra-phalā cases need to
be defined as variants.
The NBGP decided to define র and ৰ as variant code points, so that a single variant set
between র and ৰ covers all cases. But this creates blocked variant labels: e.g. if
someone registers “ররর”, the label “ৰৰৰ” is generated as its variant and is
blocked, and vice versa. However, blocking applies only at the label level; if someone else
needs to register other labels, e.g. ৰৰ or ৰৰৰৰ, that is still possible.
After the public comment, the NBGP reviewed the disposition for the র and ৰ variants.
These code points are used equally. Therefore, for the sake of usability, the NBGP decided
that র and ৰ are “allocatable” variants. In addition, the code points 09B0 and 09F0 should
not be used in the same label; therefore a no-mix rule should be implemented.
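The interaction between the variant pair and the no-mix rule can be sketched as follows. This is a minimal illustration under the assumptions stated in the comments, not the normative LGR processing; the function names are hypothetical.

```python
from itertools import product

B_RA = "\u09B0"  # BENGALI LETTER RA
A_RA = "\u09F0"  # BENGALI LETTER RA WITH MIDDLE DIAGONAL
RA_SET = {B_RA, A_RA}


def violates_no_mix(label):
    """True if both RAs occur in the same label."""
    return B_RA in label and A_RA in label


def variant_labels(label):
    """Variant labels of `label` that satisfy the no-mix rule.

    Every RA position may take either RA; mixed results and the
    original label itself are then dropped.
    """
    choices = [sorted(RA_SET) if ch in RA_SET else [ch] for ch in label]
    candidates = {"".join(parts) for parts in product(*choices)}
    return sorted(v for v in candidates
                  if v != label and not violates_no_mix(v))


if __name__ == "__main__":
    print(variant_labels(B_RA * 3) == [A_RA * 3])
    print(violates_no_mix(B_RA + A_RA))
```

With the no-mix rule in place, a label containing n RAs has exactly one surviving variant (all RAs swapped) instead of 2^n - 1 permutations, which keeps the variant set small.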
6.2 Cross Script Variants
A brief cross-script study for Bangla has been done with respect to sister scripts such as
Devanāgarī, Gurmukhī and Odia11 (formerly Oriya), keeping in mind the visual and
technical confusion they may cause as labels on the web domain. Moreover, there is no
in-script variant in Bangla as far as the orthography is concerned. The following
characters are being proposed by the NBGP as variants. Although there are certain
other characters which are somewhat similar, they have not been included here. They
have been provided in the Appendix (10.2) for reference.
Bangla           Devanāgarī
ম (U+09AE)       म (U+092E)
ি◌ (U+09BF)      ि◌ (U+093F)
Bangla           Gurmukhī
ম (U+09AE)       ਸ (U+0A38)
ি◌ (U+09BF)      ਿ◌ (U+0A3F)
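The cross-script pairs above can be held in a small lookup table. The Python sketch below is illustrative only: it maps each Devanāgarī or Gurmukhī code point back to its Bangla counterpart and reports whether a label in another script collides with a Bangla label. The function names are hypothetical.

```python
# Bangla code point -> cross-script variants listed in the tables above.
CROSS_SCRIPT_VARIANTS = {
    "\u09AE": {"\u092E", "\u0A38"},  # MA vs Devanagari MA, Gurmukhi SA
    "\u09BF": {"\u093F", "\u0A3F"},  # vowel sign I in the three scripts
}

# Reverse lookup: other-script code point -> Bangla code point.
TO_BANGLA = {
    other: bangla
    for bangla, others in CROSS_SCRIPT_VARIANTS.items()
    for other in others
}


def to_bangla(label):
    """Replace every listed cross-script variant by its Bangla counterpart."""
    return "".join(TO_BANGLA.get(ch, ch) for ch in label)


def is_cross_script_variant(bangla_label, other_label):
    """True if other_label differs from bangla_label only by listed variants."""
    return bangla_label != other_label and to_bangla(other_label) == bangla_label


if __name__ == "__main__":
    print(is_cross_script_variant("\u09AE\u09BF", "\u092E\u093F"))
```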
This section provides the WLEs that are required by all the languages mentioned in
section 3.2 when written in Bangla12 Script. The rules have been drafted in such a
way that they can be easily translated into the LGR specifications.
11 Unicode uses Oriya for the script, although Odia is now the official term used.
12 As used by the Unicode, denoting and including both Assamese and Maṇipuri.
Below are the symbols used in the WLE rules, for each of the "Indic Syllabic
Category" as mentioned in the table provided in Code point repertoire (Section 5.1).
C → Consonant
M → Kāra (Mātrā)
V → Vowel
B → Anusvāra
D → Candrabindu
X → Visarga
H → Hasanta
Z → Khaṇḍa Ta
S → S1, S2 (from Table 9)
or
P → Ra-Hasanta (C2 H)
where
C2 is either 09B0 (র - BENGALI LETTER RA)
or 09F0 (ৰ - ASSAMESE LETTER RA/
Unicode name: BENGALI LETTER RA
WITH MIDDLE DIAGONAL)
H is 09CD (◌্ - BENGALI SIGN VIRAMA)
It is also worth mentioning here that in Bangla, the consonant letters (or
graphemes) are physically joined to form “clusters” that can theoretically conjoin
two to four consonants and combine to create new shapes. Dash and Chaudhuri
(1998) state that there are “nearly 380 unique consonant...clusters”, of which
bi-consonantal combinations number 290, three-letter combinations account for another 80
and the rarer ones with four letters number 10 more [136, Pg 4]. More details of such
combinations can be seen in Pabitra Sarkar (1993) [135].
Given the prevalent patterns in Bangla and the various restrictions, below are the
specific WLE rules that need to be implemented.
Now let us elaborate each rule with examples from the script, keeping in mind
the Bangla, Assamese and Maṇipuri communities. Some combinations of
characters may seem unrealistic or rare in usage, but there is no harm in allowing
such ligatures: any user can easily create them, even though they may
not be attested combinations.
7.1.1 Case of V Preceded by H:
This is the case where two different words are joined together, the first of which ends
with an H (অu) and the second of which begins with a V (ইন্ডিয়া). Some sections of
the linguistic community require the explicit presence of H for full
representation of the sound intended. However, by and large, the form of the
first word without an H (U+09CD) is considered enough for full representation of
the sound intended for the first word.
This is a unique situation necessitated by the lack of a hyphen, a space or the Zero
Width Non-joiner character in the permissible set of characters in the Root zone
repertoire. Otherwise, V is never required to be allowed to follow an H.
Permitting this may create a perceived similarity between two labels (with and
without H) for the majority of the linguistic communities; hence it is explicitly
prohibited by the NBGP.
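The prohibition can be expressed as a simple pattern test: a label fails if a Hasanta is immediately followed by an independent vowel. The sketch below is illustrative; the vowel set is an assumption, and the repertoire table remains the authority.

```python
import re

HASANTA = "\u09CD"
# Assumption: independent vowels of the Bengali block.
VOWELS = "\u0985-\u098C\u098F\u0990\u0993\u0994"

H_THEN_V = re.compile(f"{HASANTA}[{VOWELS}]")


def violates_h_before_v(label):
    """True if any Hasanta in the label is directly followed by a vowel."""
    return H_THEN_V.search(label) is not None


if __name__ == "__main__":
    # KA + Hasanta + I: the prohibited H-then-V order.
    print(violates_h_before_v("\u0995\u09CD\u0987"))
    # A + Hasanta + YA + AA: V-then-H, which this rule does not target.
    print(violates_h_before_v("\u0985\u09CD\u09AF\u09BE"))
```

Note that the special sequences with অ and এ place the vowel before the Hasanta, so they are not affected by this test.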
Below are a few examples which help one understand some of the rules ABNF
puts in place. These are just given for reference purposes and are not meant to
be comprehensive.
Indian language syllable and is quasi-automatically applied wherever supported
by the OS.
কংঃ कंः
কঃং कःं
8. Contributors
Professor Udaya Narayana Singh, Chair-Professor of Linguistics & Dean, Faculty of Arts,
Amity University Haryana, Gurgaon; Pachgaon, Manesar PIN 122431 (Haryana), India.
Dr Atiur Rahman Khan, Principal Technical Officer, GIST Group, C-DAC, Pune, PIN
411008 (Maharashtra), India.
Mr Akshat Joshi, Project Engineer, GIST Group, C-DAC, Pune, PIN 411008 (Maharashtra),
India.
Ms Moumita Chowdhury, Senior Technical Officer, GIST Group, C-DAC, Pune, PIN
411008 (Maharashtra), India.
Prof Rafiqul Islam, National Professor of Humanities, Dhaka.
Prof Jinnat Imtiaz Ali, Director-General, International Mother Language Institute, Dhaka
Janab Istiaque Arif, Senior Assistant Director, Bangladesh Telecommunications
Regulatory Authority, Dhaka
Ms. Afifa Abbas, Information Security and Governance Lead Engineer at Banglalink, and
ICANN Fellow.
Mr Imran Hossen, CEO, EyeSoft and key member of Bangladesh Association of Software
& Information Services (BASIS).
Ms Shahida Khatun, Director, Folklore, Museum & Archive Division, Bangla Academy,
Dhaka
9. References
[100] Unicode Consortium. 2017. Unicode Standard 10.0. Mountain View CA.
[102] Banerji, R.D. 1919. The Origin of the Bengali Script. Kolkata. New Delhi; Asian
Educational Services; 2003 reprint.
[103] Chatterji, S.K. 1926. The Origin and Development of the Bengali Language.
Calcutta: Calcutta University Press.
[105] Hai, Muhammad Abdul. 1964. Dhvani Vijnan O Bangla Dhvani-tattwa (Phonetics
and Bengali Phonology), Dhaka: Bangla Academy.
[106] Jha, Subhadra. 1958. The Formation of Maithili. London: Luzac & Co.
[107] Kostic, Djordje; Das, Rhea S. 1972. A Short Outline of Bengali Phonetics, Calcutta:
Statistical Publishing Company.
[109] Mazumdar, Bijaychandra. 1920/2000. The History of the Bengali Language (Repr.
Calcutta, 1920. ed.). New Delhi: Asian Educational Services.
[110] Pandey, Anshuman. 2001. Proposal to Encode the Tirhuta Script in ISO/IEC
10646.
[111] Pal, Palash Baran. 2001. Dhwanimala Barnamala. Kolkata: Papyrus.
[112] -----. 2007. ‘Bangla Harapher Panch Parba’. In Swapan Chakraborty, ed.
Mudraner Sanskriti O Bangla Boi. Kolkata: Ababhas.
[113] Ross, Fiona. 1999. The Printed Bengali Character and its Evolution. London:
Curzon.
[114] Shastri, Mahamahopadhyay Hara Prasad. 1916. Hājār Bacharēr Purāṇa Bāṅgālā
Bhāṣāy Bauddha Gān ō Dōhā. Calcutta: Bangiya Sahitya Parishat.
[116] -----. 1987. A Bibliography of Bengali Linguistics. Mysore: CIIL. xii+316 pp.
[117] -----. 2017. (with Rajib Chakraborty, Bidisha Bhattacharjee & Arimardan Kumar
Tripathy) Languages and Cultures on the Margin: Guidelines for Fieldwork on Endangered
Languages. Mimeo. Centre for Endangered Languages, Visva-Bharati.
[118] -----. 1980. Scriptal choice and spelling reform: An essay in language and
planning. Journal of the M.S. University of Baroda, Social Science Number, 29.2 : 173-
186. A modified version reprinted E. Annamalai, Bjorn Jernudd and Joan Rubin, eds.
Language Planning: Proceedings of an Institute. Mysore: CIIL. 405-417.
[120] Sur, Atul. 1986. Bangla Mudraner Dusho Bachar. Kolkata: Jijnasa.
[121] Script Behaviour for Bengali, Version 1.1, TDIL and C-DAC Pune.
[122] Bora, Mahendra. 1981. The Evolution of Assamese Script. Jorhat: Assam Sahitya
Sabha.
[131] The EMILLE/CIIL Corpus,
http://metashare.elda.org/repository/browse/the-emilleciil-corpus/abdd35c8de6f11e2b1e400259011f6ea6bce74d38dbb42d881da76c64a6adb20/
accessed on 10.5.2018
[134] Sarkar, Pabitra. 1992. Bangla Banan Sanskar: Samasya o Sambhabana. Kolkata:
Chirayata Prakashan.
[135] Sarkar, Pabitra. 1993. Bangla Bhashar Yuktabyanjan. Bhasha 1.1: 23-45.
[136] Dash, Niladri Shekhar and B.B.Chaudhuri. 1998. Bangla Script: A Structural
Study. Linguistics Today 1.2: 1-28. Also available at
https://www.academia.edu/9967428/Bangla_Script_A_Structural_Study
[137] Dani, Ahmed Hasan. (1957) ‘Srīhaṭṭa-Nāgarī Lipir Utpatti o Bikāś.’ Bangla
Academy Patrika (Dhaka), Vol 1.2. (Bhadra-Agrahayan, 1364 Bangabda Number).pg 1.
[139] Furui, Ryosuke. (2015). ‘Variegated Adaptations: State Formation in Bengal from
the Fifth to Seventh Century’, in Bhairabi Prasad Sahu & Hermann Kulke, eds.
Interrogating Political Systems: Integrative Processes and States in Pre-Modern India.
Chapter 9. Pp 255-73. New Delhi: Manohar.
[141] Shahidullah, Muhammad. (2007) Buddhist Mystic Songs. Dhaka: Mowla Brothers.
[143] Hai, Muhammad Abdul. (1960) A phonetic and phonological study of nasals and
nasalization in Bengali. Dhaka: University of Dhaka.
[144] Unicode Consortium, Proposal Summary Form to Accompany Submissions for
Additions to the Repertoire of ISO/IEC 10646 / UNICODE,
https://www.unicode.org/L2/L2002/02387r-syloti-form.pdf accessed on May 21,
2018
[149] Das, Sisir Kumar. (1975) Sahibs and Munshis: An Account of the College of Fort
William. Calcutta.
[150] Islam, Rafiqul, Pabitra Sarkar, Mahbubul Haq & Rajib Chakraborty (eds.). (2014)
Bangla Academy Promito Bangla Byabaharik Byakaran (A Functional Grammar of
Standard Bangla). Dhaka: Bangla Academy.
[151] Sarkar, Pabitra. [2013] ‘Bangla Spelling Reform: the Long and Short of It’. Bangla
Journal 19: 215-232.
[152] Bangla Academy. (2012) Bangla Academy Promito Bangla Bananer Niyam
(Standard Bangla Spelling as adopted by Bangla Academy). Dhaka: Bangla Academy.
[153] Sarkar, Pabitra & Rajib Chakraborty. 2018. “What has happened So Far In terms
of Script Reforms”. Paper presented at the Face to Face meeting jointly held by the
Bangla Academy, Dhaka & ICANN at Bangla Academy, Dhaka on 10.07.2018.
[154] The Unicode Consortium. 2018. The Unicode® Standard Version 11.0 – Core
Specification. Chapter 12, P. 473.
10. Appendix- I
1. Khaṇḍa ta (ৎ) is NOT allowed at the beginning of an IDN label. The same
applies to ঞ and the velar nasal ঙ in the Bangla scheme of five-fold ‘varga’ (as
defined under Table 5). Bangla does not normally allow ya (য়) at the
beginning of a word either, although a couple of native examples can be cited,
such as the word য়্যাব্বড়ো (yæbbɔRo) from the poem ‘Lichuchor’ written by
Kazi Nazrul Islam. There are also instances of it being used in names,
mostly of foreign origin, such as Yaqub, which may be written with ya (য়) at
the beginning (as in য়াকুব). In very recent times, while transliterating some
Chinese and Japanese names in Bangla, one does come across the possibility
of Khaṇḍa ta (ৎ) followed by sa (স) at the beginning of a word, for example
ৎসেরিং (Tsering).
2. CH can come with Khaṇḍa Ta only in the case where C is ra (র) (09B0).
র্ৎ as in ভর্ৎসনা
13 This section specifically takes up issues of restrictions pertaining to Bangla (Bengali) language. Assamese
and Maṇipuri have not been covered in this section.
Nāgarī or Siloti (129) – such as ‘Jālālābāda Nāgarī’, ‘Fula (flower) Nāgarī’, ‘Muslim
Nāgarī’, or ‘Muhāmmad Nāgarī’. It is said that Shāh Jālāla had brought the script with
him in the 13th-14th Century to Sylhet (138), although some suggested that it was an
invention by the Afghan rulers of Sylhet (137). Some ascribe the credit to the
Buddhist Bhikkhus from Nepal. Purely for historical reasons, the details of the script
with 32 symbols are reproduced here (138):
Bangla Devanāgarī NBGP Decision
◌ঃ U+0983 ◌ः U+0903 Confusable
ও U+0993 उ U+0909 Confusable
ঘ U+0998 घ U+0918 Confusable
◌ঁ U+0981 ◌ॅ U+0945 Confusable
Table 18: Bangla and Devanāgarī confusable code points
10.3.2 Bangla and Gurmukhi
Bangla Gurmukhi NBGP Decision
ঘ U+0998 ਬ U+0A2C Confusable
◌ঁ U+0981 ◌ੱ U+0A71 Confusable
11. Appendix -II
Bengali consonants and their allographs
Consonants Phonetic Value Allographs
Clusters Transparent
Form (Bangla
Akademi font)
†/† (•+ফ)
» (p+ট), ¼ (½+ট)
ঠ /ʈʰ/ ঠ8 (¾+য)
¿ (À+ঠ), Á (½+ঠ)
ঢ /ɖʱ/ ঢ8 (Å+য)
Æ (À+ঢ)
Í (Î+চ), Ï (Ð+চ)
# (A+চ)
ছ /t͡ʃʰ/ Ñ (Ë+র)
Ú (Î+জ) % (A+জ)
× (Õ+ঝ), Û (Î+ঝ)
t (ç+ক), è (•+p+র)
) (H+ক), J
(:+5+র)
é (ç+খ)
õ (ç+ঘ)
(À+য), ú (À+ব)
কং, অং
1 (•+ন)
â (p+ষ), ã (p+½+ণ), ä
(p+½+ম)
æ (p+স)
ড় /ɽ/ C (D+গ)
◌ঁ / ̃/ অঁ, বঁ