Lexical Similarity TSD2

The document presents a new database and visualization of lexical similarity among contemporary lexicons, utilizing an automated approach based on an 8-million-entry cognate database. This method contrasts with traditional comparative linguistics, which relies on small, curated vocabularies, by providing a more comprehensive dataset of over 27,000 language pairs. The authors aim to enhance cross-lingual NLP applications and offer insights into modern language diversity through their findings and visualizations.

A Database and Visualization of the Similarity of Contemporary Lexicons

Gábor Bella¹ [0000-0002-3868-1740],
Khuyagbaatar Batsuren² [0000-0002-6819-5444], and
Fausto Giunchiglia¹ [0000-0002-5903-6150]
¹ University of Trento, via Sommarive, 5, 38123 Trento, Italy
{gabor.bella, fausto.giunchiglia}@unitn.it
² National University of Mongolia, Ulaanbaatar, Mongolia
[email protected]

Abstract. Lexical similarity data, quantifying the “proximity” of languages
based on the similarity of their lexicons, has been increasingly
used to estimate the cross-lingual reusability of language resources, for
tasks such as bilingual lexicon induction or cross-lingual transfer. Ex-
isting similarity data, however, originates from the field of comparative
linguistics, computed from very small expert-curated vocabularies that
are not supposed to be representative of modern lexicons. We explore
a different, fully automated approach to lexical similarity computation,
based on an existing 8-million-entry cognate database created from on-
line lexicons orders of magnitude larger than the word lists typically
used in linguistics. We compare our results to earlier efforts, and auto-
matically produce intuitive visualizations that have traditionally been
hand-crafted. With a new, freely available database of over 27 thousand
language pairs over 331 languages, we hope to provide more relevant data
to cross-lingual NLP applications, as well as material for the synchronic
study of contemporary lexicons.

Keywords: lexical similarity · cognate · language diversity · lexicostatistics · visualization

1 Introduction

The notion of lexical similarity, also known as lexical distance, refers to a
quantified comparison of the proportion of words shared across languages. The
Ethnologue defines it as follows: “the percentage of lexical similarity between
two linguistic varieties is determined by comparing a set of standardized
wordlists and counting those forms that show similarity in both form and
meaning.”³ Computation methods are typically based on the number of cognates
(words of common origin with more or less similar pronunciation and meaning)
found for a given language pair. The resulting similarity data is used in
comparative linguistics to infer or back up hypotheses of phylogeny among
languages. In computational linguistics, lexical similarity has also been used
in bilingual lexicon induction and, more generally, in the context of the
cross-lingual transfer of language processing tools and resources, in order to
estimate the differing performance of specific language pairs or directly as
input features [7, 11, 13]. Graphical visualizations of lexical similarity,
beyond their popularity among the general public, are useful for a quick
qualitative interpretation of the similarity data.

³ https://www.ethnologue.com/about/language-info

The typical approach in comparative linguistics has been to use a small number
(usually fewer than 100) of carefully selected words with equivalent meanings
in each language studied. The word meanings are deliberately chosen from the
core vocabularies, and comparisons are made strictly on phonetic
representations, also taking sound changes into account in historical studies.
Because these methods have been carefully tuned to the needs of language
genealogy, they are less adapted to studies characterizing contemporary vocab-
ularies. For the purposes of computational linguistics or the synchronic study
of language diversity [8], similarity information computed on “everyday” writ-
ten lexicons is more representative than data deliberately tuned for historical
studies. English, for instance, borrowed a significant portion of its vocabulary
from (the otherwise only distantly related) French. Due to the relative lexical
homogeneity of the Romance family, these French borrowings bring the English
lexicon closer to Spanish, Portuguese, or Romanian as well. While such evidence
of lexical proximity can be useful for computational applications, similarity data
from comparative linguistics does not provide this type of insight, as it treats
borrowings as “noise” over phylogenetic evidence and excludes them by design.
We investigate a different approach based on the free online CogNet database⁴
of 8.1 million cognate pairs covering 338 languages, itself computed from
large-scale online lexicons. CogNet can be considered reliable (its precision
was evaluated at 96%) and is based on a permissive interpretation of the notion
of cognacy that includes loanwords; as such, it is well suited to practical
cross-lingual applications. From CogNet we compute pairwise similarities among
331 languages, which we make freely downloadable for downstream uses in
computational linguistics, e.g. cross-lingual NLP applications. We also provide
visualizations of our results that offer an immediate qualitative interpretation
of the similarity data and that, contrary to prior work, are computed fully
automatically.
The rest of the paper is organized as follows. Section 2 presents the state
of the art with respect to known lexical similarity databases and computation
methods, as well as existing visualization techniques. Section 3 describes our
lexical similarity computation method. Section 4 compares our results quanti-
tatively against existing lexicostatistical similarity data. Section 5 presents our
visualization method and results, as well as providing a qualitative visual in-
terpretation of historic versus contemporary lexical similarity. Section 6, finally,
provides conclusions.
⁴ http://cognet.ukc.disi.unitn.it

2 State of the Art

The comparison of lexicons has a methodology established in the framework of
lexicostatistics, with the underlying idea of inferring the phylogeny of languages
from their lexicons considered in diachrony [17, 14]. Studies typically span a
large number (hundreds or even thousands) of languages, using a small but fully
meaning-aligned vocabulary selected from each language. To be able to consider
phonetic evolution spanning millennia, very basic words are used—such as water,
sun, or hand —and only in phonetic representations, such as from the well-known
Swadesh list [16]. While such data are of the highest possible quality, they are
scarce and only reflect a tiny fraction of the lexicon. Thus, while well-suited for
diachronic studies, by design they provide less information about the present
state of lexicons and the more recent linguistic and cultural influences to which
they were subjected.
There are many examples of popular graph-based visualizations of such data.⁵
While informative to non-experts, they are typically human-drawn based on
only a handful of language pairs, and therefore are prone to subjective and
potentially biased emphasis on certain languages or relationships. For example,
for Estonian, the second graph listed in the footnote highlights its two distant
European phylogenetic relatives (Hungarian and Finnish), as well as Latvian
from the neighboring country, while it does not say anything about its significant
Germanic and Slavic loans.
The most similar project we know of is EZ Glot.⁶ It uses a total of roughly
1.5 million contemporary dictionary words taken from 93 languages overall,
mined from resources such as Wiktionary, OmegaWiki, FreeDict, or Apertium.
The precision of its input evidence was self-evaluated to be about 80%.
While our work is also based on comparing online lexicons, we took as our
starting point a high-quality cognate database, CogNet [1, 2], evaluated through
multiple methods to a precision of 96% and covering 338 languages. CogNet
employs etymologic and phonetic evidence, as well as transliteration across
40 scripts, expanding the language pairs covered. In terms of visualization, in
contrast to hand-produced graphs, our approach is entirely automatic and free
from the bias of manual cherry-picking, favoring a global optimum as it is com-
puted over the entire similarity graph.

⁵ To cite a few: https://en.wikipedia.org/wiki/Romance_languages,
https://elms.wpcomstaging.com/2008/03/04/lexical-distance-among-languages-of-europe/,
https://alternativetransport.wordpress.com/2015/05/05/34/
⁶ http://www.ezglot.com

3 Automated Similarity Computation

Our input data, v2 of the CogNet database, consists of over 8 million
sense-tagged cognate pairs. CogNet was computed from the Universal Knowledge Core
resource [9], itself built from wordnets, Wiktionary, and other high-quality lexical
resources [6, 4, 12, 3].
The identification of cognate pairs having already been done by CogNet,
we compute the cognate-content-based similarity between the lexicons of lan-
guages A and B as follows:
\[
S_{AB} = \frac{\sum_{\langle c_i^A, c_i^B \rangle} \left[ \alpha + (1 - \alpha)\,\mathrm{sim}(c_i^A, c_i^B) \right]}{\dfrac{2\,|L_A|\,|L_B|}{|L_A| + |L_B|}}
\]
where $\langle c_i^A, c_i^B \rangle$ is the $i$-th cognate word pair retrieved from CogNet for the
languages A and B and $\mathrm{sim}(c_i^A, c_i^B)$ is a string similarity value:
\[
\mathrm{sim}(w_1, w_2) = \frac{\max(l_{w_1}, l_{w_2}) - \mathrm{LD}(w_1, w_2)}{\max(l_{w_1}, l_{w_2})}
\]
where LD is the Levenshtein distance and $l_w$ is the length of word $w$, our
hypothesis being that the more similar the cognate words between two languages,
the closer the languages themselves to each other. In case $w_1$ and $w_2$ use
different writing systems, we compare their Latin transliterations, also provided
by CogNet. The smoothing factor 0 < α < 1 lets us avoid penalizing dissimilar
cognates excessively, while α = 1 cancels word similarity and simplifies the
numerator to cognate counting.
The denominator of $S_{AB}$ normalizes the sum by the harmonic mean of the
lexicon sizes $|L_A|$ and $|L_B|$: these can range from tens to more than a hundred
thousand word senses. Normalization addresses lexicon incompleteness, in order
to avoid bias towards larger lexicons that obviously provide more cognates. The
harmonic mean we use is lower than the arithmetic and geometric means but
higher than the minimum value (i.e. the size of the smaller lexicon). This choice
is intuitively explained by the fact that the amount of cognates found between
two lexicons depends on the sizes of both, but is more strongly determined by
the smaller lexicon.
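
The computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, cognate pairs are assumed to be given as plain (word, word) tuples that are already transliterated to a common script, and the returned value is the raw, unscaled score.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def word_similarity(w1: str, w2: str) -> float:
    """sim(w1, w2) = (max length - LD) / max length, a value in [0, 1]."""
    longest = max(len(w1), len(w2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (longest - levenshtein(w1, w2)) / longest


def harmonic_mean(size_a: int, size_b: int) -> float:
    """Harmonic mean of two (non-zero) lexicon sizes."""
    return 2 * size_a * size_b / (size_a + size_b)


def lexical_similarity(cognate_pairs: Iterable[Tuple[str, str]],
                       size_a: int, size_b: int,
                       alpha: float = 0.5) -> float:
    """Raw S_AB: smoothed sum over cognate pairs, normalized by lexicon size."""
    total = sum(alpha + (1 - alpha) * word_similarity(w1, w2)
                for w1, w2 in cognate_pairs)
    return total / harmonic_mean(size_a, size_b)
```

For two lexicons of 100 and 300 senses, the harmonic mean is 150, which lies between the minimum (100) and the geometric and arithmetic means (about 173.2 and 200), in line with the ordering stated above.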
Another source of bias is the presence of specialized vocabulary inside lexi-
cons. Even though CogNet was built solely from general lexicons, some of them
still contain a significant amount of domain terms (such as binomial nomencla-
ture or medical terms), as the boundary between the general and the specialized
vocabulary is never clear-cut. Domain terms such as myocardiopathy or inter-
ferometer tend to be shared across a large number of languages. Due to the
tendency of domain terminology to be universal and potentially to grow orders
of magnitude larger than the general lexicon, their presence in our input lexicons
would have resulted in the uniformization of the similarities computed.
In order to exclude domain terms, we filtered our input to include only a subset
of about 2,500 concepts that correspond to basic-level categories, i.e. that are
neither too abstract nor too specialised and that are the most frequently used in
general language. Note that the core vocabulary words used in comparative
linguistics are also taken from basic-level categories, representing everyday
objects and phenomena. Thus, in our case, dog or heart would remain in our input
while the too specific (and from our perspective irrelevant) Staffordshire
bullterrier or myocardial infarction would be filtered out. As an existing list
of basic-level categories, we used the BLC resource created as part of the
development of the Basque wordnet [15]. From this resource we used the broadest,
frequency-based category list, as it also takes corpus-based frequencies into
account and is therefore more representative of general language.

Table 1. Evaluation results with respect to ASJP data (root mean square error,
standard deviation, and correlation), for three robustness levels (full dataset
including all language pairs, pairs with medium or high robustness, and pairs with
high robustness). We also provide comparisons to EZ Glot over the 27 language
pairs it supports.

                                              Difference w.r.t. ASJP
Dataset                               Size     RMSE       σ       R
CogNet full data                     6,420     9.61    8.26    0.61
CogNet high+medium robustness        3,975     8.83    7.61    0.65
CogNet high robustness               1,399    10.72    8.94    0.69
CogNet over EZ Glot language pairs      27    23.01   16.01    0.69
EZ Glot                                 27    30.07   17.27    0.48

Finally, we annotated our similarity scores in terms of the robustness of
supporting evidence as low, medium, or high, depending on the lexicon sizes
used to compute cognates: robustness is considered low below a harmonic mean
of 1,000 senses, and high above 10,000.
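
As a minimal sketch of this annotation rule, the following Python function maps two lexicon sizes to a robustness label. The function name is hypothetical, and the treatment of the exact boundary values (1,000 and 10,000 both counted as medium) is an assumption, since only "below" and "above" are specified.

```python
def robustness(size_a: int, size_b: int) -> str:
    """Label a language pair by the harmonic mean of its two lexicon sizes."""
    if size_a <= 0 or size_b <= 0:
        return "low"  # assumption: an empty lexicon provides no usable evidence
    mean = 2 * size_a * size_b / (size_a + size_b)
    if mean < 1_000:
        return "low"
    if mean > 10_000:
        return "high"
    return "medium"  # assumption: both boundary values fall here
```

Because the harmonic mean is dominated by the smaller lexicon, a pair such as 500 and 2,000 senses (harmonic mean 800) is labeled low even though one of the two lexicons is fairly large.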
Our final result is a database of 27,196 language pairs, containing language
names, ISO 639-3 language tags, similarity values, and a robustness annotation
for each similarity value. 11.4% of all similarities are highly robust, while 34.9%
have medium robustness.

4 Comparison to Results from Lexicostatistics

In lexicostatistics, the standard benchmark is the ability of similarity data to
predict well-established phylogenetic classifications. As we have different goals
and work with different input data (e.g. we do not restrict our study to the core
historical vocabularies), we cannot consider phylogeny as a gold standard against
which to evaluate our results. Instead, we have quantified the difference between
our similarity data and recent results from lexicostatistics, as produced by the
state-of-the-art ASJP tool (based on the latest v19 of the ASJP Database).⁷
We have also compared the (scarcer) symmetric similarity data that was
available from EZ Glot to ASJP data.
The intersection of our output with ASJP contained 6,420 language pairs, and
the intersection with EZ Glot contained 27 European language pairs. After
linearly scaling similarities to fall between 0 and 100, we computed the Pearson
correlation coefficient R, the standard deviation σ, as well as the root mean
square error RMSE with respect to ASJP, for both CogNet and EZ Glot. Among these
three measures, we consider correlation to be the most robust, being invariant
to linear transformations such as how data is scaled. We generated three test
sets: one restricted to high-robustness results (1,399 pairs), one containing
both high- and medium-robustness results (3,975 pairs), and finally the full
dataset (6,420 pairs).

⁷ https://asjp.clld.org

Fig. 1. Comparison of our similarity results from the high-robustness dataset (y axis)
with the corresponding language pairs from ASJP (x axis). With respect to the core
historical vocabularies covered by ASJP, the green trendline shows a generally higher
similarity among genetically unrelated languages and a lower similarity among strongly
related ones.

The results are shown in Table 1 and in the scatterplot in Fig. 1. From
both we see significant variance with respect to ASJP results. Yet, correlation
with ASJP remains generally strong and is clearly increasing with robustness
(from 0.61 up to 0.69). This result suggests that our robustness annotations are
meaningful. EZ Glot results are more distant from ASJP and are more weakly
correlated (R = 0.48, while for CogNet R = 0.69 over the same subset of 27
language pairs). We experimentally set α = 0.5, although its effect was minor:
over the full dataset, for instance, R(α = 0) = 0.590 and R(α = 0.5) = 0.610,
while simple cognate counting gives R(α = 1) = 0.597.
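
The three comparison measures can be illustrated with a short, dependency-free Python sketch. It rests on two assumptions that are not confirmed by the text: the linear scaling is taken to be min-max scaling, and σ is taken to be the population standard deviation of the pairwise differences. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def scale_0_100(values: Sequence[float]) -> List[float]:
    """Min-max scaling to the range [0, 100] (assumed form of linear scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant series")
    return [100 * (v - lo) / (hi - lo) for v in values]


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rmse(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Root mean square error between two equally long series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


def std_of_differences(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Population standard deviation of the pairwise differences (assumed sigma)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    mean = sum(diffs) / len(diffs)
    return math.sqrt(sum((d - mean) ** 2 for d in diffs) / len(diffs))
```

Under these definitions, rescaling either series changes RMSE and σ but leaves R unchanged, which is the invariance property that motivates treating correlation as the most robust of the three measures.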
The green trendline on the scatterplot shows that, on the whole, we compute
higher similarities than ASJP for genetically unrelated languages (bottom left,
S < 10) and lower similarities for genetically strongly related ones (top right). We
attribute these non-negligible differences in part to borrowings across
contemporary globalized lexicons and, in particular, to universal words (e.g.
“tennis”, “sumo”, or “internet”) that increase the similarity of otherwise
unrelated languages. On the other hand, language change, which is well known to
affect the lexicon to a greater extent than it affects grammar, explains why the
vocabularies of historically related languages generally show a higher
dissimilarity today. This is our interpretation of the slope of the green
trendline, which always stays below unity.

5 Automated Visualization
Our aim was to reproduce the popular graph-based visualization of lexical simi-
larities in a fully automated manner and based on the entire similarity graph, as
opposed to human-produced illustrations based on cherry-picked data. We used
the well-known Sigma graph visualization library,⁸ combined with the JavaScript
implementation of the ForceAtlas2 algorithm [10]. The latter applies a physical
model that considers graph edges to be springs, with tensions being proportional
to the edge weights. We modeled languages as nodes and their lexical similarities
as weighted edges, resulting in more similar languages displayed closer together.
Because of the nature of the solution based on a physical tension-based model
that dynamically evolves towards a global equilibrium, our visualizations favor
a global optimum as opposed to locally precise distances. Thus, the visualiza-
tions produced give a realistic view of the “big picture”, but distances of specific
language pairs should be interpreted qualitatively rather than quantitatively.
Figure 2 shows a small portion of the graph computed.⁹ In order to keep the
graph compact, we restricted it to high-robustness similarities, covering about
a hundred languages. In order to get an intuitive idea of the effect of both
phylogeny and geography on the similarity of contemporary lexicons, we created
two versions of the graph: in the first one (top), nodes are colored according to
language families, as is usually done in comparative linguistics, which focuses on
phylogenetic relationships. In the second version (bottom), we colored the nodes
according to the approximate geographic position of language speakers, taking
into account latitude–longitude coordinates as well as continents. Simply put,
speakers of similar-colored languages live closer together. Both phylogenetic and
geographic metadata were retrieved from the World Atlas of Language Structures
[5].¹⁰
The visualization tool can also be used to display similarity data from dif-
ferent sources (provided that it is converted to the input format expected by
the Sigma library). In particular, we used the tool to obtain a visual impression
of the difference between our contemporary similarity data and those produced
by the phylogeny-oriented ASJP tool. The result on ASJP data can be seen in
Figure 3.
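
The conversion of similarity records into a weighted graph can be sketched as follows. This is a hypothetical illustration only: the record fields (lang_a, lang_b, similarity, robustness) and the nodes/edges dictionary layout are assumptions made for the example and do not claim to reproduce the input format expected by the Sigma library or the layout of the published database.

```python
from typing import Dict, Iterable, List, Tuple


def build_graph(records: Iterable[Dict[str, object]],
                keep: Tuple[str, ...] = ("high",)) -> Dict[str, List[dict]]:
    """Turn similarity records into a node list and a weighted edge list.

    Each record is assumed to hold two language codes, a similarity value and
    a robustness label; only records whose label is in `keep` are retained.
    """
    nodes: Dict[str, dict] = {}
    edges: List[dict] = []
    for rec in records:
        if rec["robustness"] not in keep:
            continue
        for code in (rec["lang_a"], rec["lang_b"]):
            nodes.setdefault(code, {"id": code})  # one node per language
        edges.append({
            "source": rec["lang_a"],
            "target": rec["lang_b"],
            "weight": rec["similarity"],  # higher weight pulls nodes closer
        })
    return {"nodes": list(nodes.values()), "edges": edges}
```

Filtering by robustness before building the graph mirrors the restriction to high-robustness similarities used to keep the displayed graph compact. The same function could be fed records derived from another source, provided they are first brought into the same assumed record shape.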
The visualizations in Figs. 2 and 3 provide remarkable insight into language
change and the state of modern lexicons. In both the contemporary and the ASJP
datasets, language families are clearly identifiable as their respective nodes tend
to aggregate together. The clusters are, however, much more salient in the ASJP
data (Figure 3), where subfamilies within the Indo-European phylum form
separate groups, all the while remaining within an Indo-European “macro-cluster”.
Dravidian languages from the Indian subcontinent (right-hand side, in yellow)
are far removed from the culturally and geographically close Indic group. Such
a result was expected, as ASJP lexicons are optimized to highlight phylogenetic
relationships. Due to borrowings, modern lexicons are less clearly distinguished
from each other. This is evident, in Figure 2, from the proximity of the Germanic,
Romance, and Slavic families, or from unrelated languages such as Japanese or
Tagalog “approaching” the Indo-European families due to borrowings. Likewise,
the fact that English is detached from the Germanic cluster to move closer to
the Romance family reflects its massive French loanword content.

⁸ http://sigmajs.org
⁹ The full graphs are visible on the page http://ukc.datascientia.eu/lexdist.
¹⁰ http://wals.info

Fig. 2. Detail from the automatically-computed lexical similarity visualization, with
colors corresponding to language families (top) and to the geographic location of
speakers (bottom).

Fig. 3. Detail from the visualization of ASJP similarity data.

Further insights are gained on the effect of geography on contemporary
lexicons. The bottom image in Figure 2 shows that similarly colored
(geographically close) nodes tend to group together (with self-evident
exceptions such as Afrikaans). Remarkably, the languages of India aggregate into
a single, clearly separate cluster, despite the internal linguistic
heterogeneity of the Indian subcontinent, which is home to three fully distinct
language families (Indic, Dravidian, and Sino-Tibetan), and despite the
Indo-European relatedness of the Indic family. In this case, the effect of
geography and culture seems stronger than phylogeny or English borrowings.

6 Conclusions and Future Work

We have found significant correlation between the similarity data obtained from
large contemporary lexicons and from lexicostatistical databases geared towards
language phylogeny research. At the same time, we have also found that, on
the whole, large contemporary lexicons tend to resemble each other more. We
believe that the uniformizing effect of globalized culture on languages plays a
role in this observation.
Due to these differences, we consider our data to be more relevant to cross-
lingual uses applied to contemporary language, such as machine translation,
cross-lingual transfer, or bilingual lexicon induction, where—other things being
equal—lexical similarities may predict efficiency over language pairs. On the
other hand, our data is not suitable for use in historical linguistics, which is
based on a stricter definition of cognacy and on a more controlled concept set.
Our full lexical similarity data, as well as the dynamic visualizations, are
made freely available online.¹¹

¹¹ http://ukc.datascientia.eu/

Acknowledgments
This paper was partly supported by the InteropEHRate project, co-funded by
the European Union (EU) Horizon 2020 programme under grant number 826106.

References
1. Batsuren, K., Bella, G., Giunchiglia, F.: CogNet: a large-scale cognate database.
In: Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics. pp. 3136–3145 (2019)
2. Batsuren, K., Bella, G., Giunchiglia, F.: A large and evolving cognate database.
Language Resources and Evaluation (2021). https://doi.org/10.1007/s10579-021-09544-6
3. Batsuren, K., Ganbold, A., Chagnaa, A., Giunchiglia, F.: Building the Mongolian
wordnet. In: Proceedings of the 10th Global Wordnet Conference. pp. 238–244
(2019)
4. Bella, G., McNeill, F., Gorman, R., Donnaile, C.O., MacDonald, K.,
Chandrashekar, Y., Freihat, A.A., Giunchiglia, F.: A major wordnet for a minority
language: Scottish Gaelic. In: Proceedings of the 12th Language Resources and
Evaluation Conference. pp. 2812–2818 (2020)
5. Comrie, B.: The World Atlas of Language Structures. Oxford University Press (2005)
6. Dellert, J., Daneyko, T., Münch, A., Ladygina, A., Buch, A., Clarius, N., Grigorjew,
I., Balabel, M., Boga, H.I., Baysarova, Z., et al.: NorthEuraLex: a wide-coverage
lexical database of Northern Eurasia. Language Resources and Evaluation 54(1),
273–301 (2020)
7. Garcia, M., Gómez-Rodríguez, C., Alonso, M.A.: New treebank or repurposed? On
the feasibility of cross-lingual parsing of Romance languages with Universal
Dependencies. Natural Language Engineering 24(1), 91–122 (2018)
8. Giunchiglia, F., Batsuren, K., Bella, G.: Understanding and exploiting language
diversity. In: Proceedings of the Twenty-Sixth International Joint Conference on
Artificial Intelligence (IJCAI-17). pp. 4009–4017 (2017)
9. Giunchiglia, F., Batsuren, K., Freihat, A.A.: One world—seven thousand
languages. In: Proceedings 19th International Conference on Computational
Linguistics and Intelligent Text Processing, CiCling2018, 18-24 March 2018 (2018)
10. Jacomy, M., Venturini, T., Heymann, S., Bastian, M.: ForceAtlas2, a continuous
graph layout algorithm for handy network visualization designed for the Gephi
software. PLoS ONE 9(6), e98679 (2014)
11. Lin, Y.H., Chen, C.Y., Lee, J., Li, Z., Zhang, Y., Xia, M., Rijhwani, S., He, J.,
Zhang, Z., Ma, X., et al.: Choosing transfer languages for cross-lingual learning.
arXiv preprint arXiv:1905.12688 (2019)
12. Nair, N.C., Velayuthan, R.S., Batsuren, K.: Aligning the IndoWordNet with the
Princeton WordNet. In: Proceedings of the 3rd International Conference on Natural
Language and Speech Processing. pp. 9–16 (2019)
13. Nasution, A.H., Murakami, Y., Ishida, T.: Constraint-based bilingual lexicon
induction for closely related languages. In: Proceedings of the Tenth International
Conference on Language Resources and Evaluation (LREC’16). pp. 3291–3298
(2016)
14. Petroni, F., Serva, M.: Measures of lexical distance between languages. Physica A:
Statistical Mechanics and its Applications 389(11), 2280–2283 (2010)
15. Pociello, E., Agirre, E., Aldezabal, I.: Methodology and construction of the Basque
wordnet. Language Resources and Evaluation 45(2), 121–142 (2011)
16. Swadesh, M.: Towards greater accuracy in lexicostatistic dating. International
Journal of American Linguistics 21(2), 121–137 (1955)
17. Wichmann, S., Müller, A., Velupillai, V., Brown, C.H., Holman, E.W., Brown,
P., Sauppe, S., Belyaev, O., Urban, M., Molochieva, Z., et al.: The ASJP database
(version 13). URL: http://email.eva.mpg.de/~wichmann/ASJPHomePage.htm
3 (2010)

0% (1)
Historical and Comparative Linguistics
38 pages
CA Sumary Phuong
No ratings yet
CA Sumary Phuong
91 pages
1 Corpus Linguistics
No ratings yet
1 Corpus Linguistics
38 pages
Understanding Language Similarity
No ratings yet
Understanding Language Similarity
1 page
Computer Modelling of Innovations Relative To Lati
No ratings yet
Computer Modelling of Innovations Relative To Lati
32 pages
Text
No ratings yet
Text
102 pages
Cognacy and Computational Cladistics Iss
No ratings yet
Cognacy and Computational Cladistics Iss
21 pages
Understanding Lexicons in Computational Linguistics
50% (4)
Understanding Lexicons in Computational Linguistics
29 pages
10.1.1.103.627 Near-Synonymy
No ratings yet
10.1.1.103.627 Near-Synonymy
40 pages
Principles of Insurance
No ratings yet
Principles of Insurance
2 pages
Solomon Press C2B
No ratings yet
Solomon Press C2B
18 pages
Grimbergen Brand Identity Guidelines
No ratings yet
Grimbergen Brand Identity Guidelines
42 pages
Hurricane Katrina: Race and Poverty Analysis
No ratings yet
Hurricane Katrina: Race and Poverty Analysis
3 pages
Agriculture: Paper 5038/01 Paper 1
No ratings yet
Agriculture: Paper 5038/01 Paper 1
6 pages
Kundalini: Divine Awakening Guide
100% (2)
Kundalini: Divine Awakening Guide
5 pages
Aicff 16iff Tou ST Nomi 2025fm
No ratings yet
Aicff 16iff Tou ST Nomi 2025fm
8 pages
The State of AI Benchmark Report
No ratings yet
The State of AI Benchmark Report
28 pages
Psychaitry and Control
No ratings yet
Psychaitry and Control
210 pages
Konica Copier 7145-7222-7228-7235 Parts & Service
67% (3)
Konica Copier 7145-7222-7228-7235 Parts & Service
563 pages
Kruger's Confrontational Installations
100% (1)
Kruger's Confrontational Installations
29 pages
Of Mice and Men Quiz
No ratings yet
Of Mice and Men Quiz
5 pages
Health Benefits and Bioactive Compounds of Eggplant
No ratings yet
Health Benefits and Bioactive Compounds of Eggplant
9 pages
Fruit Development Stages
No ratings yet
Fruit Development Stages
13 pages
Mzo4 Systematics Biodiversity Part-01
No ratings yet
Mzo4 Systematics Biodiversity Part-01
15 pages
George R Knight A Search For Identity 58-61
No ratings yet
George R Knight A Search For Identity 58-61
2 pages
Short Story Terms & Authors Guide
No ratings yet
Short Story Terms & Authors Guide
2 pages
Class 12 Accountancy Pre-Board Exam 2021-22
No ratings yet
Class 12 Accountancy Pre-Board Exam 2021-22
5 pages
Student List with Matric Numbers
No ratings yet
Student List with Matric Numbers
128 pages
Digest 2024
No ratings yet
Digest 2024
1,182 pages
Dual RSA Cryptanalysis Report
No ratings yet
Dual RSA Cryptanalysis Report
26 pages
12 Angry Men Movie Questions
100% (1)
12 Angry Men Movie Questions
3 pages
Licensure Examination For Teachers
No ratings yet
Licensure Examination For Teachers
102 pages
David's Greatest Sin and Confession
No ratings yet
David's Greatest Sin and Confession
6 pages
The Influence of Peer Pressure On The Social Behavior of Grade 12 Humanities and Social Science Learners in Speaker Eugenio Perez National Agricultural School
No ratings yet
The Influence of Peer Pressure On The Social Behavior of Grade 12 Humanities and Social Science Learners in Speaker Eugenio Perez National Agricultural School
36 pages
Emotional Intelligence Test PDF
100% (1)
Emotional Intelligence Test PDF
4 pages
Flapper's Diary: Society & Wit
No ratings yet
Flapper's Diary: Society & Wit
74 pages
Solid Edge Overview: Siemens PLM Software
No ratings yet
Solid Edge Overview: Siemens PLM Software
3 pages
Examples of Good and Bad Chute Design
100% (2)
Examples of Good and Bad Chute Design
13 pages
Merisiel: Level 7 Elf Rogue Profile
No ratings yet
Merisiel: Level 7 Elf Rogue Profile
1 page
