CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation

Nikola Ljubešić, Taja Kuzman

Jožef Stefan Institute, University of Ljubljana, Institute of Contemporary History
Jamova cesta 39, Večna pot 113, Privoz 11, 1000 Ljubljana, Slovenia
[email protected], [email protected]

arXiv:2403.12721v2 [cs.CL] 26 Mar 2024

Abstract
This paper presents a collection of highly comparable web corpora of Slovenian, Croatian, Bosnian, Montenegrin,
Serbian, Macedonian, and Bulgarian, covering thereby the whole spectrum of official languages in the South Slavic
language space. The collection of these corpora comprises a total of 13 billion tokens of texts from 26 million
documents. The comparability of the corpora is ensured by a comparable crawling setup and the usage of identical
crawling and post-processing technology. All the corpora were linguistically annotated with the state-of-the-art
CLASSLA-Stanza linguistic processing pipeline, and enriched with document-level genre information via the
Transformer-based multilingual X-GENRE classifier, which further enhances comparability at the level of linguistic
annotation and metadata enrichment. The genre-focused analysis of the resulting corpora shows a rather consistent
distribution of genres throughout the seven corpora, with variations in the most prominent genre categories being
well-explained by the economic strength of each language community. A comparison of the distribution of genre
categories across the corpora indicates that web corpora from less developed countries primarily consist of news
articles. Conversely, web corpora from economically more developed countries exhibit a smaller proportion of news
content, with a greater presence of promotional and opinionated texts.

Keywords: web corpora, South Slavic languages, linguistic processing, genre identification

1. Introduction

The South Slavic languages constitute one of the three major branches within the Slavic language family. They are mostly spoken in Central and Southeastern Europe, spanning from the Slovenian Alps, across the Adriatic and Dinarides regions, to Bulgaria and the Black Sea. Despite their widespread use, many languages within this group remain relatively low-resourced and under-represented in the field of natural language processing. According to a recent report on the state of technologies for European languages (Rehm and Way, 2023), the support with the core Language Technologies (LT) – which include corpora, models and language technologies, such as text and speech processing and machine translation – was shown to be less than moderate for South Slavic languages, with Bosnian and Macedonian having weak support to no support in all LT areas. In order to facilitate the development of language technologies for these languages, it is crucial to have access to substantial amounts of textual data. Such data serve as the foundation for language technologies, including language models and machine translation systems. Additionally, general text corpora – that is, corpora that cover broad language use and are not specifically restricted to any particular subject, register or genre – enable linguists and other researchers to analyze linguistic phenomena based on statistically significant amounts of data. Web crawling, an automated method for collecting text corpora, is one of the primary approaches for quick collection of such data. Recently, the MaCoCu¹ project (Bañón et al., 2022) used this method to develop and make freely available monolingual and parallel datasets for over 10 under-resourced languages, including the South Slavic languages.

¹ https://macocu.eu/

This paper introduces the CLASSLA-web corpora, which are based on the MaCoCu datasets and have been additionally enriched with linguistic annotation and genre information. This collection of corpora represents, to the best of our knowledge, the first comparable corpus collection that encompasses the entire language group. The corpora are considered comparable as they were collected and processed with the same tools and within the same time frame. The paper presents a comprehensive overview of the various stages involved in creating the corpora and offers initial insights into their content based on genre information. The creation of the CLASSLA-web corpora involved three main steps: 1) web crawling based on top-level national domains, and subsequent data post-processing and filtering; 2) linguistic annotation with the latest CLASSLA-Stanza pipeline (Ljubešić and Dobrovoljc, 2019; Terčon and Ljubešić, 2023), the state-of-the-art pipeline for processing South Slavic languages; and 3) genre annotation, which provides an insight into the functional content of these corpora. Genres are text categories that are defined
considering the author’s purpose, common function of the text, and the text’s conventional form (Orlikowski and Yates, 1994). Examples of genres are News, Promotion, Legal, etc. Genre annotation provides valuable insights into the functional content of the corpora, enabling genre-based corpus studies and more focused natural language processing applications in machine translation (Van der Wees et al., 2018) or summarization (Stewart and Callan, 2009).

The paper is organized as follows. Firstly, we provide an overview of the related work on construction of web corpora and the automatic genre identification in Section 2. Next, in Section 3, we present the process of creating and curating the CLASSLA-web corpora, whose content is analyzed in Section 4. Lastly, the paper concludes with Section 5, where we summarize the main findings and present future work.

2. Related Work

2.1. Web Corpora

The tradition of building web corpora can be traced back to the WaCky initiative (Baroni et al., 2009), inside which the first web corpora for large European languages were built. Two other notable initiatives include the CoW corpora (Schäfer et al., 2012), developed for large European languages, and the TenTen corpora (Jakubíček et al., 2013), created for many languages, including some of the South Slavic languages, namely Slovenian² and Bulgarian³. However, an important limitation of the TenTen corpora is that they are accessible only through concordancers of the Lexical Computing company, which require a paid subscription.

² https://www.sketchengine.eu/sltenten-slovenian-corpus/
³ https://www.sketchengine.eu/bgtenten-bulgarian-corpus/

Recently, numerous web-based datasets have emerged from the Common Crawl⁴ and the Internet Archive data collections⁵. Most prominent collections include the cc100 dataset (Conneau et al., 2020), the mC4 dataset (Xue et al., 2021), and the OSCAR dataset (Suárez et al., 2019). While these multilingual collections are highly useful for multilingual language modelling tasks, they suffer, inter alia, from blind spots, such as an unexplainably small amount of all Western South Slavic (Slovenian, Croatian, Bosnian, Montenegrin, Serbian) data in OSCAR⁶ and an unexpectedly small amount of Croatian content in the mC4 dataset (Xue et al., 2021).

⁴ https://commoncrawl.org
⁵ https://archive.org
⁶ https://oscar-project.github.io/documentation/versions/oscar-2301/

The previously developed open web corpora for South Slavic languages adhere to the naming convention of the WaCky (Baroni et al., 2009) initiative (e.g., slWaC, hrWaC, srWaC). However, they were actually built using the TenTen technology, more specifically, the SpiderLing crawler (Suchomel et al., 2012), the jusText text extraction tool, and the Onion near-deduplication tool (Pomikálek, 2011).

The first web corpora of South Slavic languages were compiled for Slovenian (slWaC) and Croatian (hrWaC) (Ljubešić and Erjavec, 2011). Three years later, the list of languages was expanded to include Bosnian (bsWaC) and Serbian (srWaC) (Ljubešić and Klubička, 2014). Additionally, the Croatian and Slovenian crawls were later updated (Erjavec et al., 2015). However, there have been no recent activities related to open web corpora for South Slavic languages.

2.2. Genre Prediction

Until recently, the available technologies for text classification, including support vector machines and logistic regression classifiers, were insufficient to accurately identify genres across languages and datasets (Sharoff et al., 2010; Kuzman et al., 2023a). However, recent advancements in deep neural technologies led to a breakthrough in this field. Transformer-based language models (Vaswani et al., 2017), specifically those fine-tuned on manually-annotated genre datasets, demonstrated the capability to identify genres in various web corpora and languages (see Rönnqvist et al. (2021); Kuzman et al. (2022)). The main advantage of Transformer models in this task is that their representations incorporate both lexical and syntactic knowledge about a text, which is crucial for accurate genre prediction (Kuzman and Ljubešić, 2022). The advancement in these technologies has facilitated the development of multilingual Transformer-based genre classifiers, which can now be effectively applied to large-scale datasets in various languages. Recently, Laippala et al. (2022) published the Register Oscar dataset, consisting of 351 million documents in 14 languages to which genre labels were automatically assigned. The dataset was created by applying the multilingual genre classifier (Rönnqvist et al., 2021) on the OSCAR datasets (Suárez et al., 2019). Their classifier was trained on genre datasets in multiple languages and uses a schema with 8 genre labels, such as Narrative, Opinion and Interactive Discussion. The unseen-language classification was evaluated based on annotated datasets in eight new languages (Arabic, Catalan, Chinese, Hindi, Indonesian, Portuguese, Spanish and Urdu). The evaluation results for these new
languages demonstrated promising performance, with F1 scores ranging from 0.58 to 0.82.

These endeavors were closely followed by the first experiments with automatic genre annotation on the MaCoCu data. Kuzman et al. (2023b) applied the X-GENRE classifier, which is described in more detail in Section 3.2, to the English part of the parallel MaCoCu corpora. Based on a preliminary manual evaluation, the genre classifier achieved a macro F1 score of 0.73 and a micro F1 score of 0.88 on a small sample of English texts derived from the Slovenian-English MaCoCu corpus.

3. Construction of the Corpora

3.1. Collecting and Curating Web Corpora

The CLASSLA-web corpora are derived from the web crawls that were gathered and curated inside the MaCoCu project⁷ (Bañón et al., 2022). Namely, the following datasets were used: Bosnian web corpus MaCoCu-bs 1.0 (Bañón et al., 2023a), Bulgarian web corpus MaCoCu-bg 2.0 (Bañón et al., 2023b), Croatian web corpus MaCoCu-hr 2.0 (Bañón et al., 2023c), Macedonian web corpus MaCoCu-mk 2.0 (Bañón et al., 2023d), Montenegrin web corpus MaCoCu-cnr 1.0 (Bañón et al., 2023e), Serbian web corpus MaCoCu-sr 1.0 (Bañón et al., 2023f), and Slovenian web corpus MaCoCu-sl 2.0 (Bañón et al., 2023g). They comprise 7 South Slavic languages written in Latin and/or Cyrillic script.

⁷ https://macocu.eu/

The MaCoCu corpora were collected by crawling the national top-level domains, such as, in the case of Slovenian, the Slovenian top-level domain .si. The primary focus of the crawl was on the top-level domain, but texts from generic domains (.com, .net etc.) were also collected if they were 1) either in the list of seed URLs obtained from previous crawls, or 2) connected well through hyperlinks with the websites in the national top-level domain, and containing sufficient data in the target language. The MaCoCu crawler⁸, which is based on the SpiderLing crawler (Suchomel et al., 2012), was used for the crawling. The crawling process was followed by thorough post-processing to ensure high-quality data. Firstly, the jusText⁹ tool was employed to remove boilerplate content (Pomikálek, 2011). Secondly, the onion¹⁰ tool was used to identify near-duplicate documents (Pomikálek, 2011). Thirdly, the Monotextor¹¹ tool was used to enhance language identification accuracy and to evaluate the fluency of paragraphs using a language model. Lastly, we further refined the corpora by removing very short texts (shorter than 75 words or consisting only of paragraphs shorter than 70 characters) and texts originating from web domains that had been manually or automatically identified as automatically-generated¹².

⁸ https://github.com/macocu/MaCoCu-crawler
⁹ http://corpus.tools/wiki/Justext
¹⁰ http://corpus.tools/wiki/Onion
¹¹ https://github.com/bitextor/monotextor/releases/tag/v1.1
¹² Further details regarding the preparation of corpora can be found at https://github.com/macocu/Monolingual-Curation/.

Language identification for certain under-resourced languages within the MaCoCu corpora has proven to be a challenge, as these languages are frequently under-represented in the training data of wide-coverage language identification tools. This challenge is particularly pronounced when attempting to differentiate between Bosnian, Croatian, Montenegrin, and Serbian, as these South Slavic languages exhibit a significant level of mutual intelligibility. Thus, to ensure a high level of accuracy, we employed multiple language identification tools at various stages of the pipeline. Firstly, the identification process began with Google’s Compact Language Detector 2 (CLD2)¹³ at the document level during the crawling phase. Secondly, the FastSpell¹⁴ tool was used for a more refined language identification at the paragraph level during the post-processing of the corpora with the Monotextor tool. In these initial two steps, Bosnian, Croatian, Serbian and Montenegrin languages were treated as a single macro-language, and the objective in the case of these languages was to determine whether the text belongs to this macro-language or not. Finally, a specialized classifier was employed to distinguish between these four languages. Specifically, they were identified using a Naive Bayes classifier (Rupnik et al., 2023), which relied on lists of words specific to each variety, which were extracted from web texts published on national top-level domains. This approach assumed that, for instance, the Croatian top-level domain (.hr) is predominantly associated with the Croatian language.

¹³ https://github.com/CLD2Owners/cld2
¹⁴ https://github.com/mbanon/fastspell

3.2. Genre Annotation

One crucial part of the creation of CLASSLA-web corpora was also enrichment of the corpora with metadata on genres. This information offers valuable insights into the functional content of the corpora. Furthermore, it facilitates the creation of subcorpora based on genres, which can be used for genre-based linguistic analyses. We used the multilingual X-GENRE classifier¹⁵ (Kuzman et al., 2023a) to automatically annotate the corpora with genre labels. The classifier uses the following genre categories: Information/Explanation, Instruction, News, Legal, Promotion, Opinion/Argumentation, Prose/Lyrical, Forum and Other (for a detailed description of labels, refer to Kuzman et al. (2023a)). When tested in the in-dataset scenario, i.e., on the test split that is derived from the same dataset as the training data, the X-GENRE classifier achieves a micro F1 score of 0.80 and a macro F1 score of 0.79. In the cross-dataset scenario, the classifier still maintains a strong performance with a micro F1 score of 0.68 and macro F1 score of 0.69 (Kuzman et al., 2023a).

¹⁵ https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier

We applied the genre classifier to each of the seven CLASSLA-web corpora. The processing was performed on the NVIDIA A100 40GB GPU, with approximately 2,000 predictions executed per second. This resulted in a total processing time of 300 hours.

In addition to genre-specific categories, the classifier also returns the category Other, which denotes texts that do not fit inside any of the genres present in the schema, as would be the case with an exam, interview, etc. This phenomenon is present in around 2 percent of the documents of each corpus.

Additionally, given that the classifier returns per-class logits that can then be transformed into class probabilities via the softmax function, we introduced a new label, Mix. This label is used in cases where none of the categories reached a probability of 0.8, which is a relatively infrequent phenomenon, affecting between 3 and 5 percent of the documents in each corpus. We decided to use the Mix label after performing a manual analysis of samples of instances with lower probabilities of the most probable label. The analysis showed that the suggested upper threshold of 0.8 isolates the documents containing multiple genres very well, with both high precision and high recall.

Table 1 shows the proportion of documents annotated as Other or Mix, that is, documents whose label does not have a straightforward application. If we omit these labels, we can report that the X-GENRE classifier assigned a specific genre label to approximately 92%–96% of documents in each corpus.

Language             Other   Mix
Slovenian (sl)       2.7%    5.0%
Croatian (hr)        1.9%    4.1%
Bosnian (bs)         1.5%    2.9%
Montenegrin (cnr)    2.0%    3.3%
Serbian (sr)         1.8%    3.7%
Macedonian (mk)      1.2%    3.3%
Bulgarian (bg)       2.1%    4.1%

Table 1: Percentage of documents annotated with genre labels that do not relate to a specific genre (labels Other and Mix).

After annotation, we performed a manual analysis of samples from Slovenian, Croatian and Macedonian corpora. Samples consisted of 10 instances per genre label (excluding Other and Mix), amounting to 80 instances per evaluated corpus. The evaluation showed very high performance of the model, both in Latin and Cyrillic scripts, namely 0.88 in macro F1 for Croatian, 0.93 in macro F1 for Macedonian and 0.94 in macro F1 in the case of Slovenian¹⁶.

¹⁶ More details on the evaluation of the model’s cross-lingual performance will be provided in a future publication that we are currently working on.

3.3. Linguistic Annotation

The final layer of enrichment of the original MaCoCu datasets was linguistic annotation, which enables linguistic analyses and simplified querying of the corpora through concordancers. For linguistic annotation the CLASSLA-Stanza pipeline (Ljubešić and Dobrovoljc, 2019; Terčon and Ljubešić, 2023) was used, which provides the state-of-the-art linguistic annotation of Slovenian, Croatian, Serbian, Bulgarian, and Macedonian.

The CLASSLA-Stanza pipeline is based on the Stanza neural pipeline (Qi et al., 2020), but was further improved for processing of South Slavic languages (Terčon and Ljubešić, 2023). Notable enhancements include the support of external inflectional lexicons, which greatly increases performance for morphologically rich languages (Ljubešić and Dobrovoljc, 2019). Additionally, the training datasets used for all models in the pipeline were expanded beyond the Universal Dependencies data. Moreover, the pipeline uses CLARIN.SI-embed word embeddings (Terčon et al., 2023; Terčon and Ljubešić, 2023a,b,c,d) which were trained on larger and more diverse datasets compared to the embeddings used by Stanza. As a result, the CLASSLA-Stanza pipeline demonstrates superior performance compared to Stanza, with error reduction between 34% and 98% on the Slovenian official benchmark SloBENCH¹⁷ (Žitnik and Dragar, 2021) (see Terčon and Ljubešić (2023) for further details).

¹⁷ https://slobench.cjvt.si/

One highly useful feature of the CLASSLA-Stanza pipeline is that it provides non-standard linguistic processing models for Slovenian, Croatian and Serbian. These models are trained on non-standard social media texts, as well as on texts from closely related languages, such as Croatian in the case of the Serbian language, and vice versa.
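As an aside on the genre annotation step (Section 3.2), the Mix rule can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the label order and the treatment of a probability of exactly 0.8 as "reached" are assumptions made for illustration. Only the label inventory, the softmax step and the 0.8 threshold come from the text.

```python
import math

# Genre labels as listed in Section 3.2; the order is an assumption.
LABELS = [
    "Information/Explanation", "Instruction", "News", "Legal", "Promotion",
    "Opinion/Argumentation", "Prose/Lyrical", "Forum", "Other",
]
MIX_THRESHOLD = 0.8  # threshold reported in the paper


def softmax(logits):
    """Convert per-class logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def assign_label(logits, labels=LABELS, threshold=MIX_THRESHOLD):
    """Return the most probable label, or "Mix" if no class reaches the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    # Assumption: a probability exactly equal to the threshold counts as reached.
    return labels[best] if probs[best] >= threshold else "Mix"


# Hypothetical logits, invented for illustration only.
confident = [0.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # one clear winner
ambiguous = [2.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0]   # two competing genres
print(assign_label(confident))  # News
print(assign_label(ambiguous))  # Mix
```

In this sketch a document keeps its top label only when that label dominates the probability mass, so documents that split their mass between two genres fall under Mix.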
Furthermore, the training data used to train the non-standard models had diacritics partially removed, which is a phenomenon often seen in non-curated online texts. The CLASSLA-Stanza system includes a “web” processing module that uses a standard tokenizer, but relies on non-standard models for morphosyntactic processing and lemmatization. This setup was shown to be a great solution for linguistic processing of the variety of texts that can be found online. Additionally, the flexibility of these models allows the Croatian web module to be applied to Bosnian and Montenegrin web corpora with high accuracy. Note that the Bosnian and Montenegrin languages can be considered, in simplified terms, a mixture of the Croatian and Serbian languages (Ljubešić et al., 2018) and that the Croatian web module in CLASSLA-Stanza is capable of handling features of both.

Thus, we linguistically annotated the CLASSLA-web corpora using the web module of the CLASSLA-Stanza pipeline, which is implemented as a Python library¹⁸. The linguistic processing involved tokenization, morphosyntactic annotation and lemmatization. The CLASSLA-Stanza pipeline also allows annotation on the levels of dependency parsing and named entity recognition. However, these two processing stages are not supported for the Macedonian language due to the lack of required training data for that language. If the research community expresses interest in these two annotation layers in CLASSLA-web corpora for the available languages, or, even more, if appropriate training data for Macedonian are produced in the meantime, we will add these two additional annotation layers in the next version of the CLASSLA-web corpora. The final CLASSLA-web corpora have been made freely available on the CLARIN.SI repository as:

• Bosnian CLASSLA-web.bs corpus (Ljubešić et al., 2024a),

• Bulgarian CLASSLA-web.bg corpus (Ljubešić et al., 2024b),

• Montenegrin CLASSLA-web.cnr corpus (Ljubešić et al., 2024e),

• Croatian CLASSLA-web.hr corpus (Ljubešić et al., 2024c),

• Macedonian CLASSLA-web.mk corpus (Ljubešić et al., 2024d),

• Slovenian CLASSLA-web.sl corpus (Ljubešić et al., 2024g),

• Serbian CLASSLA-web.sr corpus (Ljubešić et al., 2024f).

The corpora are also available on the CLARIN.SI NoSketch Engine concordancers¹⁹ which enable easy querying and linguistic analyses.

¹⁸ https://pypi.org/project/classla/
¹⁹ https://www.clarin.si/ske/

4. CLASSLA-web Corpora

In this section, we perform basic analyses of the seven newly introduced corpora. In the first part, we run some general analyses on the size of each of the corpora in terms of token and document count, as well as some basic analyses of the top-level domains from which the data originate. We proceed with a genre-based analysis of the corpora, exploring the relationship between corpora as well as the relationship of specific genres within them.

4.1. General Analysis

The sizes of the resulting corpora for all 7 official South Slavic languages are presented in Table 2. The combined corpora amount to almost 13 billion tokens of running text coming from 26 million documents. Among them, the Montenegrin web corpus is the smallest, consisting of 177 million tokens from 401 thousand documents. The largest corpus was derived from the Bulgarian web, consisting of almost 4 billion tokens, obtained from more than 7.4 million documents.

Corpus              # tokens   # docs
CLASSLA-web.sl      2,153M     4,063k
CLASSLA-web.hr      2,575M     5,422k
CLASSLA-web.bs      802M       1,993k
CLASSLA-web.cnr     177M       401k
CLASSLA-web.sr      2,765M     5,256k
CLASSLA-web.mk      557M       1,482k
CLASSLA-web.bg      3,917M     7,456k
Total               12,948M    26,076k

Table 2: Sizes of CLASSLA-web corpora in millions of tokens and thousands of documents. The following language codes are used: “sl” for Slovenian, “hr” for Croatian, “bs” for Bosnian, “cnr” for Montenegrin, “sr” for Serbian, “mk” for Macedonian, and “bg” for Bulgarian.

Each of the presented corpora represents the largest general corpus available for the respective language. What is more, the Macedonian CLASSLA-web corpus is the first linguistically-annotated general corpus of Macedonian. This achievement was made possible not only through the crawl performed inside the MaCoCu project, but especially due to the recent inclusion of basic linguistic processing of Macedonian into the CLASSLA-Stanza linguistic processing pipeline (Terčon and Ljubešić, 2023).
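The sizes reported in Table 2 can be cross-checked with a short Python sketch. The per-corpus figures are copied from Table 2; the variable and function names are ours, and the average document length is a derived quantity that the paper itself does not report. Summing the rows gives 12,946M tokens and 26,073k documents, slightly below the reported totals, which would be consistent with each row having been rounded independently (an assumption, since the paper does not comment on it).

```python
# Table 2 figures: (millions of tokens, thousands of documents) per corpus.
TABLE_2 = {
    "sl": (2153, 4063),
    "hr": (2575, 5422),
    "bs": (802, 1993),
    "cnr": (177, 401),
    "sr": (2765, 5256),
    "mk": (557, 1482),
    "bg": (3917, 7456),
}


def mean_tokens_per_doc(tokens_m, docs_k):
    """Average document length in tokens, given millions of tokens and thousands of documents."""
    return tokens_m * 1_000_000 / (docs_k * 1_000)


# Sums over the rounded rows; these need not match the reported totals exactly.
total_tokens_m = sum(tokens for tokens, _ in TABLE_2.values())
total_docs_k = sum(docs for _, docs in TABLE_2.values())

smallest = min(TABLE_2, key=lambda code: TABLE_2[code][0])
largest = max(TABLE_2, key=lambda code: TABLE_2[code][0])

for code, (tokens_m, docs_k) in TABLE_2.items():
    print(f"{code:>3}: {mean_tokens_per_doc(tokens_m, docs_k):6.0f} tokens per document")
print(f"sum of rows: {total_tokens_m}M tokens, {total_docs_k}k documents")
print(f"smallest: {smallest}, largest: {largest}")
```

The same dictionary can be reused to express each corpus as a share of the whole collection, for instance when sampling documents proportionally to corpus size.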
[Figure 1 image not reproduced. Axes: genre (News, Information/Explanation, Instruction, Legal, Promotion, Opinion/Argumentation, Forum, Prose/Lyrical) against percentage of texts (axis ticks up to 80%). Series: Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, Bulgarian.]

Figure 1: Genre distribution in CLASSLA-web corpora (in percentages of texts).

Table 3 shows the differences in the distribution of texts that were derived from the national top-level domain, e.g., .si, and all other, generic domains, such as .com, .net etc. The percentage of texts originating from the national top-level domain varies significantly, ranging from only 47% of texts for Montenegrin to 95% for Macedonian. On average, approximately 70% of texts were sourced from the national top-level domain, highlighting the importance of crawling beyond the national top-level domain.

Corpus              National   Generic
CLASSLA-web.sl      78.2%      21.8%
CLASSLA-web.hr      75.8%      24.2%
CLASSLA-web.bs      62.5%      37.5%
CLASSLA-web.cnr     46.6%      53.4%
CLASSLA-web.sr      63.6%      36.4%
CLASSLA-web.mk      95.2%      4.8%
CLASSLA-web.bg      71.1%      28.9%

Table 3: Distribution of texts, derived from national top-level domains and other, generic domains.

4.2. Genre-Based Analysis

We start our genre-based analysis of our seven new corpora by plotting the genre distribution in texts across these corpora in Figure 1. The first observation to be made is that genres in general are similarly distributed across all seven corpora. The most prevalent genre categories in the analyzed web corpora are News, Information/Explanation, Promotion, Opinion/Argumentation and Forum. Conversely, Instruction, Legal and Prose/Lyrical are the least represented, with Prose/Lyrical accounting for 1% to 4% of texts in the corpora.

On the other hand, the most significant disparities in genre distribution across corpora are observed in the case of Promotion and News. While the Slovenian web corpus consists only of 28% of News texts but 23% of Promotion texts, two-thirds of the Bosnian, Montenegrin and Macedonian webs are dominated by the News genre, with Promotion content constituting only 5–8% of the total texts.

The most significant differences in the occurrence of Promotion in web corpora appear to be observed among corpora derived from webs of countries with the greatest disparity in development levels, as measured by gross domestic product (GDP) per capita. Among the seven South Slavic nations, Slovenia has the highest GDP per capita, corrected for purchasing power parity (GDP PPP per capita), whereas the GDP PPP per capita values of Bosnia and Herzegovina, Montenegro and Macedonia are almost half of that of Slovenia. Given this background knowledge, we decided to examine the relationship between the genre distributions across the 7 web corpora, and the GDP (PPP) per capita for each of the corresponding nations, which serves as a strong metric of development. We inspect the relationship based on the Pearson correlation test. Table 4 shows the results of the correlation test that are statistically significant with p-value below the 0.05 threshold. As anticipated, the Promotion genre exhibits the strongest positive correlation with the GDP PPP per capita, with an exceptionally high correlation coefficient of 0.938. Interestingly, the Information/Explanation, Legal and Opinion/Argumentation genres also demonstrate a relatively strong correlation ranging from 0.86 to 0.87. In contrast, the prevalence of News displays a significant negative correlation with the GDP (PPP) per capita, with a correlation coefficient of -0.90.

Genre          Pearson r   p-value
Promotion      0.938       0.002**
Opinion        0.873       0.010*
Information    0.861       0.013*
Legal          0.861       0.013*
News           -0.900      0.006*

Table 4: Results of the Pearson correlation test of the GDP (PPP) per capita for each South Slavic country and the distribution of each genre in CLASSLA-web corpora. Asterisks denote p-values: ** for p<0.005 and * for p<0.05. Only correlations that are statistically significant, i.e., with the p-value below 0.05, are included. Information/Explanation is shortened to Information, and Opinion/Argumentation is shortened to Opinion.

Given the observation that News appears to have a negative correlation with the other genres, as evidenced by the contrasting shapes in Figure 1, we also conducted a pairwise Pearson correlation test among all the genres. The results, presented in Table 5, reveal a nearly perfect negative correlation of -0.972 between the relative frequency of News and Promotion. This suggests that, as a country evolves from less-developed to better-developed, there is a phenomenon wherein newspaper content in the country’s web is increasingly replaced by promotional material. Additionally, high, albeit lower, correlation coefficients can be observed between the News genre on one side and the Information/Explanation, Opinion/Argumentation and Legal genres on the other. This analysis allows us to classify the genres into two distinct clusters – one consisting of the News content, and the other consisting of Information/Explanation, Opinion/Argumentation, Promotion and Legal texts. Based on these findings, we can suggest a preliminary hypothesis that, as a country experiences economic development, its web becomes more diverse and incorporates a greater variety of promotional, opinionated, informational, and legal content. It is important to note that this observation does not take into account other possible factors that might have an impact on the differences between South Slavic webs. Despite being very preliminary, we put forward this hypothesis in hopes of sparking interest for further research on this phenomenon in the wider research community.

Genre pair                Pearson r   p-value
News, Promotion           -0.972      0.000**
News, Information         -0.833      0.020*
News, Opinion             -0.812      0.026*
News, Legal               -0.777      0.040*
Information, Opinion      0.783       0.037*
Information, Promotion    0.813       0.026*
Promotion, Opinion        0.831       0.021*
Legal, Promotion          0.834       0.020*
Legal, Opinion            0.851       0.015*

Table 5: Results of the Pearson correlation test over pairwise genre categories. Categories Mix and Other are not included in the analysis. Only correlations that are statistically significant, i.e., with the p-value below 0.05, are included. Asterisks denote p-values: ** for p<0.005 and * for p<0.05. Information/Explanation is shortened to Information, and Opinion/Argumentation is shortened to Opinion.

5. Conclusions

This paper introduces a collection of comparable web corpora for South Slavic languages. To the best of our knowledge, this is the first corpus collection that comprises comparable general corpora for all languages within a language group. Furthermore, these corpora represent the largest general corpora available for each respective language. The corpus collection comprises 13 billion tokens and 26 million documents in total. Additionally, for the least resourced language in this group, the Macedonian language, the Macedonian CLASSLA-web corpus is the first general linguistically-annotated corpus.

The creation of these corpora was made possible through the contribution of multiple separate endeavors. This paper aims to document all the necessary steps that were undertaken in order to create the corpora. Firstly, the MaCoCu project (Bañón et al., 2022) played a crucial role by collecting the corpora based on a comprehensive crawl of the national top-level domains and well-interconnected general domains. Additionally, the project implemented various post-processing methods to ensure the high quality of the final datasets. Notably, that
research highlighted the significant challenges associated with language identification of less prevalent languages, especially between closely-related South Slavic languages, such as the mutually intelligible Croatian, Bosnian, Montenegrin and Serbian languages.

Secondly, high-quality linguistic annotation of the CLASSLA-web corpora was made possible by recent improvements of the CLASSLA-Stanza pipeline (Terčon and Ljubešić, 2023). In addition to improving annotation accuracy with extended training datasets and embeddings, the new version of the CLASSLA-Stanza pipeline provides a “web” module, specifically tailored for linguistic annotation of web corpora. This is particularly advantageous, as web corpora pose a unique challenge to linguistic processing pipelines due to their composition of standard and non-standard texts. What is more, while Bosnian and Montenegrin do exhibit combinations of linguistic features that distinguish them from Croatian and Serbian, most of these features are covered in the training data of one or the other language. This enables the Croatian “web” module, trained on standard and non-standard datasets of both Croatian and Serbian, to annotate the Bosnian and Montenegrin CLASSLA-web corpora very successfully. Linguistic annotation facilitates comprehensive analyses of language phenomena for corpus linguists, who perform analyses based on part-of-speech and other linguistic information.

Thirdly, we applied to the corpora a recently-developed multilingual genre classifier. This automatic annotation with genre categories allowed us to enrich the datasets with valuable information on the communicative function of documents inside the seven corpora. Our findings demonstrate that the CLASSLA-web corpora exhibit similar genre distributions. However, we observed that certain corpora, such as Bosnian, Montenegrin, and Macedonian, predominantly consist of news content. In contrast, the Slovenian corpus contains a smaller proportion of news content, with promotional texts representing a significant portion of the corpus. In future research, we plan to significantly extend explorations of genres inside the CLASSLA-web corpora. Firstly, we plan to perform manual analysis on all 7 languages to ascertain the overall performance of the multilingual genre classifier on each of the corpora. Secondly, we will investigate possible patterns of biases that could negatively impact downstream research relying on genre labels obtained from multilingual models. Finally, we plan to use genre information to analyze the linguistic characteristics of genres in South Slavic web corpora.

Given the substantial sizes of the CLASSLA-web corpora, they are immensely valuable for the development of language technologies for South Slavic languages and future linguistic analyses. The corpora are already being used for the development of BERT-like and generative pretrained (GPT) language models20, specific to South Slavic languages. Furthermore, the data can be useful as the starting point for development of manually-annotated training data for numerous tasks. For instance, as part of the Slovenian EMMA project21, focused on providing NLP solutions to the media industry, samples of the texts annotated as News are planned to be used for development of datasets for multilingual topic prediction in news. Moreover, the corpora are highly useful for linguistic analyses, as was already shown by their predecessors slWaC, hrWaC (Ljubešić and Erjavec, 2011), srWaC (Ljubešić and Klubička, 2014) and others. Accordingly, we have made the corpora available through the CLARIN.SI concordancers22 as well, to enable easy corpus querying. To promote their use in the wider linguistic community, which also includes language teachers and digital humanists, the use of the corpora through concordancers will be presented in CLASSLA Express23, a series of five workshops that will take place in four South Slavic countries.

The MaCoCu approach demonstrated significant success in automatically collecting texts by crawling the top-level domains and beyond, resulting in the creation of the largest general text collections for each of the targeted South Slavic languages. Further enrichment of these text collections with linguistic and genre information resulted in a corpus collection that was not just collected, but also enriched in a highly comparable way, enabling comparable insights in the functional composition of these corpora, as well as unlocking the linguistic research potential of the corpora by performing multi-layer linguistic annotation. To ensure transparency and reproducibility, all steps involved in the process are publicly available. Our future plans involve conducting iterative crawling and enrichment for the South Slavic languages, following the MaCoCu web corpora collection approach, on a yearly or bi-yearly basis. We have set up a crawling infrastructure inside the CLARIN.SI research infrastructure24 which is dedicated to iterative crawling of South Slavic webs and webs in other languages. We have already started performing a new run of crawling for Slovenian, Croatian, Serbian, Bosnian and Montenegrin, which will be followed by crawling of Bulgarian and Macedonian. This will allow us to further expand and update the current version of the CLASSLA-web corpora. Consistently updating the corpora will also enable research on how the texts and their distribution on the web are evolving through time, and also enable research in the field of semantic change for South Slavic languages. Additionally, by presenting this process in this paper, we aim to inspire similar initiatives to develop web corpora for other languages lacking large high-quality corpora.

20 See for instance the XLM-R-BERTić (https://huggingface.co/classla/xlm-r-bertic) and YugoGPT (https://huggingface.co/gordicaleksa/YugoGPT) models.
21 https://emma.ijs.si/en/about-project/
22 https://www.clarin.si/ske/
23 https://www.clarin.si/info/k-centre/workshops/classla-express/
24 https://www.clarin.si/

6. Acknowledgments

The research presented in this paper was conducted within the research project “Basic Research for the Development of Spoken Language Resources and Speech Technologies for the Slovenian Language” (J7-4642), the research project “Embeddings-based techniques for Media Monitoring Applications” (L2-50070, co-funded by the Kliping d.o.o. agency) and within the research programme “Language resources and technologies for Slovene” (P6-0411), all funded by the Slovenian Research and Innovation Agency (ARIS).

This work has received funding from the European Union’s Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.

We would like to thank Petra Bago, Virna Karlić and Lidija Milković for helping to validate and improve the content of the Croatian corpus. We would also like to extend our gratitude to Marija Runić for giving guidance through the complexity of the Bosnian web. We are finally grateful to all our collaborators in the MaCoCu project who made these corpora significantly better.

6.1. Ethical Considerations and Limitations

We are aware that using data that was collected from the web can raise questions of respecting the intellectual property and privacy rights of the original authors of the texts. The authors of the MaCoCu datasets, on which the CLASSLA-web corpora are based, assured that no sensitive data would be included by only collecting the texts that were freely accessible. Nevertheless, we are aware that the datasets might still include some texts that the authors do not consent to be included. To mitigate this, the CLASSLA-web corpora are published with a notice, which informs the authors of the texts that the texts can be taken out of the corpora upon their request.

7. Bibliographical References

Marta Bañón, Malina Chichirau, Miquel Esplà-Gomis, Mikel L. Forcada, Aarón Galiano-Jiménez, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Marija Runić, Peter Rupnik, Vít Suchomel, Antonio Toral, and Jaume Zaragoza-Bernabeu. 2023a. Bosnian web corpus MaCoCu-bs 1.0. Slovenian language resource repository CLARIN.SI.

Marta Bañón, Malina Chichirau, Miquel Esplà-Gomis, Mikel L. Forcada, Aarón Galiano-Jiménez, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, and Jaume Zaragoza-Bernabeu. 2023b. Bulgarian web corpus MaCoCu-bg 2.0. Slovenian language resource repository CLARIN.SI.

Marta Bañón, Malina Chichirau, Miquel Esplà-Gomis, Mikel L. Forcada, Aarón Galiano-Jiménez, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, and Jaume Zaragoza-Bernabeu. 2023c. Croatian web corpus MaCoCu-hr 2.0. Slovenian language resource repository CLARIN.SI.

Marta Bañón, Malina Chichirau, Miquel Esplà-Gomis, Mikel L. Forcada, Aarón Galiano-Jiménez, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, and Jaume Zaragoza-Bernabeu. 2023d. Macedonian web corpus MaCoCu-mk 2.0. Slovenian language resource repository CLARIN.SI.

Marta Bañón, Malina Chichirau, Miquel Esplà-Gomis, Mikel L. Forcada, Aarón Galiano-Jiménez, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, and Jaume Zaragoza-Bernabeu. 2023e. Montenegrin web corpus MaCoCu-cnr 1.0. Slovenian language resource repository CLARIN.SI.

Marta Bañón, Malina Chichirau, Miquel Esplà-Gomis, Mikel L. Forcada, Aarón Galiano-Jiménez, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo
Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, and Jaume Zaragoza-Bernabeu. 2023f. Serbian web corpus MaCoCu-sr 1.0. Slovenian language resource repository CLARIN.SI.

Marta Bañón, Malina Chichirau, Miquel Esplà-Gomis, Mikel L. Forcada, Aarón Galiano-Jiménez, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, and Jaume Zaragoza-Bernabeu. 2023g. Slovene web corpus MaCoCu-sl 2.0. Slovenian language resource repository CLARIN.SI.

Marta Bañón, Miquel Esplà-Gomis, Mikel L Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, et al. 2022. MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages. In 23rd Annual Conference of the European Association for Machine Translation, pages 301–302.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language resources and evaluation, 43:209–226.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Tomaž Erjavec, Nikola Ljubešić, and Nataša Logar. 2015. The slWaC Corpus of the Slovene Web. Informatica, 39(1).

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, and Vít Suchomel. 2013. The TenTen corpus family. In 7th international corpus linguistics conference CL, pages 125–127.

Taja Kuzman and Nikola Ljubešić. 2022. Exploring the Impact of Lexical and Grammatical Features on Automatic Genre Identification. In Odkrivanje znanja in podatkovna skladišča - SiKDD: 10. oktober 2022, 10 October 2022, Ljubljana, Slovenia. Institut “Jožef Stefan”.

Taja Kuzman, Nikola Ljubešić, and Senja Pollak. 2022. Assessing Comparability of Genre Datasets via Cross-Lingual and Cross-Dataset Experiments. In Jezikovne tehnologije in digitalna humanistika: zbornik konference, pages 100–107. Institute of Contemporary History.

Taja Kuzman, Igor Mozetič, and Nikola Ljubešić. 2023a. Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models. Machine Learning and Knowledge Extraction, 5(3):1149–1175.

Taja Kuzman, Peter Rupnik, and Nikola Ljubešić. 2023b. Get to Know Your Parallel Data: Performing English Variety and Genre Classification over MaCoCu Corpora. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 91–103.

Veronika Laippala, Anna Salmela, Samuel Rönnqvist, Alham Fikri Aji, Li-Hsin Chang, Asma Dhifallah, Larissa Goulart, Henna Kortelainen, Marc Pàmies, Deise Prina Dutra, and Valtteri Skantsi. 2022. Towards better structured and less noisy Web data: Oscar with Register annotations. In Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022), pages 215–221.

Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 29–34, Florence, Italy. Association for Computational Linguistics.

Nikola Ljubešić and Tomaž Erjavec. 2011. hrWaC and slWaC: Compiling web corpora for Croatian and Slovene. In Text, Speech and Dialogue: 14th International Conference, TSD 2011, Pilsen, Czech Republic, September 1-5, 2011. Proceedings 14, pages 395–402. Springer.

Nikola Ljubešić and Filip Klubička. 2014. bs,hr,srWaC - web corpora of Bosnian, Croatian and Serbian. In Proceedings of the 9th Web as Corpus Workshop (WaC-9), pages 29–35, Gothenburg, Sweden. Association for Computational Linguistics.

Nikola Ljubešić, Maja Miličević Petrović, and Tanja Samardžić. 2018. Borders and boundaries in Bosnian, Croatian, Montenegrin and Serbian: Twitter data to the rescue. Journal of Linguistic Geography, 6(2):100–124.

Nikola Ljubešić, Peter Rupnik, and Taja Kuzman. 2024a. Bosnian web corpus CLASSLA-web.bs
1.0. Slovenian language resource repository CLARIN.SI.

Nikola Ljubešić, Peter Rupnik, and Taja Kuzman. 2024b. Bulgarian web corpus CLASSLA-web.bg 1.0. Slovenian language resource repository CLARIN.SI.

Nikola Ljubešić, Peter Rupnik, and Taja Kuzman. 2024c. Croatian web corpus CLASSLA-web.hr 1.0. Slovenian language resource repository CLARIN.SI.

Nikola Ljubešić, Peter Rupnik, and Taja Kuzman. 2024d. Macedonian web corpus CLASSLA-web.mk 1.0. Slovenian language resource repository CLARIN.SI.

Nikola Ljubešić, Peter Rupnik, and Taja Kuzman. 2024e. Montenegrin web corpus CLASSLA-web.cnr 1.0. Slovenian language resource repository CLARIN.SI.

Nikola Ljubešić, Peter Rupnik, and Taja Kuzman. 2024f. Serbian web corpus CLASSLA-web.sr 1.0. Slovenian language resource repository CLARIN.SI.

Nikola Ljubešić, Peter Rupnik, and Taja Kuzman. 2024g. Slovenian web corpus CLASSLA-web.sl 1.0. Slovenian language resource repository CLARIN.SI.

Wanda J Orlikowski and JoAnne Yates. 1994. Genre repertoire: The structuring of communicative practices in organizations. Administrative science quarterly, pages 541–574.

Jan Pomikálek. 2011. Removing boilerplate and duplicate content from web corpora.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Georg Rehm and Andy Way. 2023. European Language Equality: A Strategic Agenda for Digital Language Equality. Springer Nature.

Samuel Rönnqvist, Valtteri Skantsi, Miika Oinonen, and Veronika Laippala. 2021. Multilingual and Zero-Shot is Closing in on Monolingual Web Register Classification. In 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 157–165.

Peter Rupnik, Taja Kuzman, and Nikola Ljubešić. 2023. BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 113–120.

Roland Schäfer, Felix Bildhauer, et al. 2012. Building large corpora from the web using a new efficient tool chain. In LREC, pages 486–493.

Serge Sharoff, Zhili Wu, and Katja Markert. 2010. The Web Library of Babel: evaluating genre collections. In LREC. Citeseer.

Jade Goldstein Stewart and J Callan. 2009. Genre oriented summarization. Ph.D. thesis, Carnegie Mellon University, Language Technologies Institute, School of Computer Science.

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache.

Vít Suchomel, Jan Pomikálek, et al. 2012. Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7), pages 39–43.

Luka Terčon and Nikola Ljubešić. 2023. CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages. arXiv preprint arXiv:2308.04255.

Luka Terčon and Nikola Ljubešić. 2023a. Word embeddings CLARIN.SI-embed.bg 1.0. Slovenian language resource repository CLARIN.SI.

Luka Terčon and Nikola Ljubešić. 2023b. Word embeddings CLARIN.SI-embed.hr 2.0. Slovenian language resource repository CLARIN.SI.

Luka Terčon and Nikola Ljubešić. 2023c. Word embeddings CLARIN.SI-embed.mk 2.0. Slovenian language resource repository CLARIN.SI.

Luka Terčon and Nikola Ljubešić. 2023d. Word embeddings CLARIN.SI-embed.sr 2.0. Slovenian language resource repository CLARIN.SI.

Luka Terčon, Nikola Ljubešić, and Tomaž Erjavec. 2023. Word embeddings CLARIN.SI-embed.sl 2.0. Slovenian language resource repository CLARIN.SI.

Marlies Van der Wees, Arianna Bisazza, and Christof Monz. 2018. Evaluation of machine translation performance across multiple genres and languages. In Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.

Slavko Žitnik and Frenk Dragar. 2021. SloBENCH evaluation framework. Slovenian language resource repository CLARIN.SI.

