uOttawa at LegalLens-2024: Transformer-based Classification Experiments
Nima Meghdadi and Diana Inkpen
School of Electrical Engineering and Computer Science
University of Ottawa,
Ottawa, ON, K1N 6N5
[email protected]  [email protected]

Abstract

This paper presents the methods used for the LegalLens-2024 shared task, which focused on detecting legal violations within unstructured textual data and associating these violations with potentially affected individuals. The shared task included two subtasks: A) Legal Named Entity Recognition (L-NER) and B) Legal Natural Language Inference (L-NLI). For subtask A, we utilized the spaCy library, while for subtask B, we employed a combined model incorporating RoBERTa and CNN. Our results were 86.3% in the L-NER subtask and 88.25% in the L-NLI subtask. Overall, our paper demonstrates the effectiveness of transformer models in addressing complex tasks in the legal domain. The source code for our implementation is publicly available at https://github.com/NimaMeghdadi/uOttawa-at-LegalLens-2024-Transformer-based-Classification
Portuguese, Bonifacio et al. (2020) and Albu-
1 Introduction
querque et al. (2023) focused on NER models spe-
The huge amount of information and massive use cific to the legal domain. The former developed a
of the internet has propelled to ignore legal viola- model using ELMo and BERT with the LeNER-Br
tions, individual rights, cultural values and societal dataset (Luz de Araujo et al., 2018), while the lat-
norms. These hidden violations demand serious at- ter evaluated BiLSTM+CRF and fine-tuned BERT
tention and urgent solution due to serious effects on models on legal and legislative domains to auto-
rights and justice and it requires advanced tools for mate and accelerate tasks such as analysis, cat-
professionals to effectively manage large amount egorization, search, and summarization. In Ital-
of paperwork. ian, Pozzi et al. (2023) created a model that com-
Legal violation identification seeks to automati- bines transformer-based Named Entity Recognition
cally detect legal violations within unstructured text (NER), transformer-based Named Entity Linking
and link these violations to potential victims. The (NEL), and NIL prediction. In Chinese, Zhang
LegalLens 2024 shared task (Bernsohn et al., 2024) et al. (2023) proposed a NER method for the legal
aims to foster a legal research community by tack- domain named RoBERTa-GlobalPointer, combin-
ling two key challenges in the legal domain. Sub- ing character-level and word-level feature repre-
task A focuses on identifying legal violations (a.k.a sentations using RoBERTa and Skip-Gram, which
Identification Setup) using Named Entity Recog- were then concatenated and scored with the Global-
nition (NER). Subtask B focuses on linking these Pointer method. Lee et al. (2023) also developed a
violations to potentially affected individuals (a.k.a legal domain NER model called LeArNER, which
Identification Setup) using Natural Language Infer- employs Bouma’s unsupervised learning for fea-
ence (NLI). ture extraction and utilizes the LERT and LSTM
models for sequence annotation. Type Number of documents
Kim et al. (2024) described methods for the Training 568
COLIEE 2023 competition, using a sentence trans- Validation 142
former model for case law retrieval and a fine- Test 617
tuned DeBERTa model for legal entailment that
used SNLI (Bowman et al., 2015) and MultiNLI Table 1: The number of documents used to train the
model is detailed
(Williams et al., 2018) datasets for training. Tang
(2023) explored improving legal Natural Language
Hyperparameter Value
Inference (NLI) by employing general NLI datasets
Learning Rate 5e-5
with supervised fine-tuning and examining the im-
Batch Size 16
pact of transfer learning from Adversarial NLI
Maximum Steps 20,000
to ContractNLI. The objective of Valentino and
Dropout Rate 0.1
Freitas (2024) is to offer a theoretically grounded
Optimizer Adam
characterization of explanation-based Natural Lan-
guage Inference (NLI) by integrating contempo- Table 2: Hyperparameters of the fine-tuned model for
rary philosophical accounts of scientific explana- subtask A (L-NER)
tion with an analysis of natural language explana-
tion corpora. Gubelmann et al. (2023) investigated
how large language models (LLMs) handle differ- sohn et al., 2024).
ent pragmatic sentence types, like questions and
3.2 Preprocessing
commands, in natural language inference (NLI),
highlighting the insensitivity of MNLI and its fine- For this subtask, we configure the spaCy pipeline
tuned models to these sentence types. It developed with an emphasis on tokenization and vector
and publicly released fine-tuning datasets to ad- initialization. The tokenizer used is the stan-
dress this issue and explored ChatGPT’s approach dard spaCy tokenizer, which splits the text into
to entailment. tokens for downstream tasks. We utilize the
3 Subtask A: Legal Named Entity Recognition (L-NER)

This section describes Subtask A, which involves finding named entities of specific types that may appear in legal texts. We developed a BERT-based model for this subtask as part of the LegalLens task, achieving a macro F1-score of 86.3%.

3.1 Dataset Details

We used the dataset provided by the organizers of the shared task. The provided data was split into training and test sets, with each set consisting of tokenized text and the corresponding entities for those tokens. It is important to note that the provided test set includes labeled data, which is different from the separate test data that the organizers use to evaluate the models. The validation split used in this research consists of 20% of the training data; the resulting splits are shown in Table 1. The entity types are fully described in (Bernsohn et al., 2024). The labels include four entity types: violation, violation by, violation on, and law, with detailed counts for each entity available in (Bernsohn et al., 2024).

Type        Number of documents
Training    568
Validation  142
Test        617

Table 1: Number of documents in the training, validation, and test splits.

3.2 Preprocessing

For this subtask, we configure the spaCy pipeline with an emphasis on tokenization and vector initialization. The tokenizer used is the standard spaCy tokenizer, which splits the text into tokens for downstream tasks. We utilize the spacy.Tokenizer.v1 configuration, which handles tokenization efficiently according to spaCy's standards. Next, we handle vector initialization: vectors map tokens to high-dimensional representations, which helps capture semantic meaning during training. We prepare the data by converting the text and its annotations into a format compatible with spaCy.
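The conversion step can be sketched as follows. This is a minimal illustration, assuming the data arrives as token lists with IOB-style entity tags; the helper to_docbin and the label strings shown are ours for illustration, not the exact shared-task schema.

```python
# Minimal sketch: convert (tokens, IOB tags) pairs into spaCy's binary
# DocBin training format. Label names here are illustrative only.
import spacy
from spacy.tokens import Doc, DocBin
from spacy.training import biluo_tags_to_spans, iob_to_biluo

nlp = spacy.blank("en")  # blank pipeline: standard spaCy tokenizer and vocab

def to_docbin(examples, path):
    """Serialize annotated examples so `spacy train` can consume them."""
    db = DocBin()
    for tokens, iob_tags in examples:
        doc = Doc(nlp.vocab, words=tokens)
        # spaCy stores entities as spans, so convert the IOB tags first.
        doc.ents = biluo_tags_to_spans(doc, iob_to_biluo(iob_tags))
        db.add(doc)
    db.to_disk(path)

train_examples = [
    (["Acme", "Corp", "violated", "the", "Privacy", "Act"],
     ["B-VIOLATED_BY", "I-VIOLATED_BY", "O", "O", "B-LAW", "I-LAW"]),
]
to_docbin(train_examples, "train.spacy")
```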
3.3 Model Training

Our training utilizes the spaCy pipeline configured with a transformer model and a transition-based parser for NER. The deberta-v3-base model was selected as the main transformer architecture, offering robust contextual embeddings for token-level classification (He et al., 2021). Hyperparameters for the training were optimized based on performance on the development set; the key hyperparameters are listed in Table 2.

Hyperparameter  Value
Learning Rate   5e-5
Batch Size      16
Maximum Steps   20,000
Dropout Rate    0.1
Optimizer       Adam

Table 2: Hyperparameters of the fine-tuned model for subtask A (L-NER).
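Concretely, this corresponds to a spaCy v3 training config with a "transformer" and an "ner" component, the transformer pointing at microsoft/deberta-v3-base. A sketch of launching training from Python, mirroring Table 2, might look as follows; the config.cfg and data paths are assumptions, the batch size of 16 is set via the batcher inside the config, and the learn-rate override assumes a constant (non-scheduled) Adam learning rate.

```python
# Sketch: train the spaCy NER pipeline with the Table 2 hyperparameters.
# Assumes config.cfg was generated with `python -m spacy init config` and
# edited to use pipeline = ["transformer", "ner"] with
# [components.transformer.model] name = "microsoft/deberta-v3-base".
from spacy.cli.train import train

train(
    "config.cfg",
    output_path="output",
    overrides={
        "paths.train": "train.spacy",           # DocBin files from the step above
        "paths.dev": "dev.spacy",
        "training.max_steps": 20000,            # Table 2: maximum steps
        "training.dropout": 0.1,                # Table 2: dropout rate
        "training.optimizer.learn_rate": 5e-5,  # Table 2: Adam learning rate
    },
)
```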
3.4 Results and Discussion

We found that models utilizing spaCy achieved better results than those without it. Additionally, BERT-based models outperformed other models in our experiments. However, we discovered that initializing embeddings from models on the Hugging Face leaderboard did not lead to improved results. Table 3 compares the F1-scores of various models on the labelled test data.

Model                                 F1-score
roberta-base                          52.55
nlpaueb/legal-bert-base-uncased       53.29
lexlms/legal-roberta-base             54.80
lexlms/legal-roberta-base
  (Alibaba-NLP/gte-large-en-v1.5)     62.69
roberta-base with spaCy               80.49
deberta-v3-base with spaCy            86.37

Table 3: Comparison of F1-scores of various models for the NER subtask.

3.5 Direct Comparison to Related Work

The organizers of the shared task provided a hidden test set, on which our model achieved an F1-score of 0.402, securing third place in the competition. The performance of the top teams is presented in Table 4.

Model            F1-score
Nowj             0.416
Flawless Lawgic  0.402
UOttawa          0.402
Baseline         0.381
Masala-chai      0.380
UMLaw&TechLab    0.321
Bonafide         0.305

Table 4: Results of the top teams on the hidden test set for the NER subtask, measured by F1-score (Hagag et al., 2024).
4 Subtask B: Legal Natural Language Inference (L-NLI)

The goal of this subtask is to automatically classify the relationships between different legal texts. Specifically, we aim to determine whether a legal premise, such as a summary of a legal complaint, entails, contradicts, or remains neutral with respect to a given hypothesis, such as an online review. This task, termed Legal Natural Language Inference (L-NLI), involves sentence-pair classification to assess these relationships. By creating an NLI corpus tailored for legal documents, we facilitate applications like legal case matching and automated legal reasoning. Detailed task definitions and datasets are provided in (Bernsohn et al., 2024) and related resources.

4.1 Dataset Details

The LegalLensNLI dataset, provided by the organizers of the shared task, is specifically designed to explore the connections between legal cases and the individuals affected by them, with a particular focus on class action complaints. The dataset contains 312 entries. A comprehensive description of the dataset collection process can be found in (Bernsohn et al., 2024). For this subtask, only the training set is included; the validation set is separated into four specific domains, as outlined in Table 6.

4.2 Preprocessing

This subtask has a different objective from Subtask A, so spaCy may not perform well here. We began by loading the ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli model (Nie et al., 2020) using the AutoTokenizer and AutoModel classes from the transformers library. The AutoTokenizer class was employed to tokenize the input sentences, converting them into a format suitable for the roberta-large model. The tokenization process involved splitting the text into tokens and converting them into numerical representations, which are then padded and truncated to a consistent length. This ensures that the input sequences are properly aligned when fed into the model.

Following tokenization, we implemented a method to encode the combined premise and hypothesis sentences for both the RoBERTa model and a CNN model. The CNN model required a different form of input preparation, in which the combined texts were tokenized and encoded so as to preserve the sequence structure for CNN processing. These tokenized datasets were then converted into PyTorch tensors and mapped accordingly, enabling their use in a combined model that integrates both the RoBERTa model and the CNN. Since Subtask B involves finding the relationship between the hypothesis and the premise, using a CNN model to highlight the keywords in sentences may help the combined model perform better.
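A minimal sketch of this dual encoding is shown below, assuming each entry provides premise and hypothesis fields; the choice of max_length=256 and the reuse of the same tokenizer for the CNN branch are our assumptions, not the shared task's specification.

```python
# Sketch: encode each premise/hypothesis pair twice, once as a sentence pair
# for the RoBERTa branch and once as one concatenated text for the CNN branch.
from datasets import Dataset
from transformers import AutoTokenizer

name = "ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli"
tokenizer = AutoTokenizer.from_pretrained(name)

def encode(example):
    # Sentence pair for RoBERTa, padded/truncated to one consistent length
    # so that batched input sequences are properly aligned.
    enc = tokenizer(example["premise"], example["hypothesis"],
                    padding="max_length", truncation=True, max_length=256)
    # Concatenated text for the CNN branch, preserving the sequence structure.
    cnn = tokenizer(example["premise"] + " " + example["hypothesis"],
                    padding="max_length", truncation=True, max_length=256)
    enc["cnn_input_ids"] = cnn["input_ids"]
    return enc

dataset = Dataset.from_dict({
    "premise": ["The complaint alleges systematic unpaid overtime."],
    "hypothesis": ["I was never paid for the extra hours I worked."],
})
encoded = dataset.map(encode)
# Map the encoded columns to PyTorch tensors for the combined model.
encoded.set_format("torch",
                   columns=["input_ids", "attention_mask", "cnn_input_ids"])
```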
4.3 Model Training

Our combined model architecture integrates the ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli model with a custom-built CNN model for keyword detection. The RoBERTa model is responsible for capturing contextual information from the text, while the CNN model detects important keyword patterns within the input text. The outputs of both models are concatenated and passed through a fully connected layer (with softmax) to produce the final classification decision; the architecture of the model is shown in Figure 1. In more detail, the RoBERTa model consists of one embedding layer and 24 Transformer encoder layers, while the CNN model includes one embedding layer and three convolutional layers with filter sizes of 2, 3, and 4, followed by a fully connected layer. In total, the combined model has 31 layers.

Figure 1: Diagram of the combined model (RoBERTa and CNN): the input text feeds both the RoBERTa model and the CNN model; their outputs are concatenated and passed through a fully connected layer to produce the output.
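The architecture can be sketched in PyTorch as follows. The embedding dimension, filter count, and the use of the <s> token representation are our assumptions; only the overall structure (RoBERTa branch, three convolutions with filter sizes 2, 3, and 4, concatenation, fully connected output layer) follows the description above.

```python
# Sketch of the combined RoBERTa + CNN classifier from Figure 1.
import torch
import torch.nn as nn
from transformers import AutoModel

class RobertaCnnClassifier(nn.Module):
    def __init__(self,
                 name="ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli",
                 vocab_size=50265, emb_dim=128, n_filters=100, num_labels=3):
        super().__init__()
        self.roberta = AutoModel.from_pretrained(name)  # 1 embedding + 24 encoder layers
        self.cnn_emb = nn.Embedding(vocab_size, emb_dim)
        # Three convolutional layers with filter sizes 2, 3, and 4.
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, kernel_size=k) for k in (2, 3, 4))
        hidden = self.roberta.config.hidden_size  # 1024 for roberta-large
        self.classifier = nn.Linear(hidden + 3 * n_filters, num_labels)

    def forward(self, input_ids, attention_mask, cnn_input_ids):
        # Contextual representation of the pair from RoBERTa's <s> token.
        ctx = self.roberta(input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        # Keyword-pattern features: convolve embeddings, max-pool over time.
        emb = self.cnn_emb(cnn_input_ids).transpose(1, 2)  # (B, emb_dim, T)
        feats = [conv(emb).relu().max(dim=2).values for conv in self.convs]
        # Concatenate both branches and classify (softmax applied in the loss).
        return self.classifier(torch.cat([ctx] + feats, dim=1))
```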
Training was conducted using the Trainer class from the transformers library, which facilitated the fine-tuning of the model. The hyperparameters we defined are listed in Table 5. The model was evaluated at the end of each epoch, with the best checkpoint saved based on the F1-score. The training process also included early stopping and warmup steps to optimize performance.

Hyperparameter               Value
Learning Rate                2e-5
Batch Size (train and eval)  4
Number of Epochs             20
Weight Decay                 0.01

Table 5: Hyperparameters of the fine-tuned model for subtask B (L-NLI).
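A sketch of this training setup, mirroring Table 5, is shown below. The warmup and patience values are assumptions, and the model is assumed to return a loss when labels are passed (otherwise Trainer.compute_loss must be overridden).

```python
# Sketch: fine-tune the combined model with the Table 5 hyperparameters.
import numpy as np
from sklearn.metrics import f1_score
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

def compute_metrics(eval_pred):
    # Report the F1-score used to select the best checkpoint.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}

args = TrainingArguments(
    output_dir="nli-output",
    learning_rate=2e-5,                # Table 5
    per_device_train_batch_size=4,     # Table 5 (train and eval)
    per_device_eval_batch_size=4,
    num_train_epochs=20,               # Table 5
    weight_decay=0.01,                 # Table 5
    warmup_steps=100,                  # assumed value for the warmup strategy
    evaluation_strategy="epoch",       # evaluate at the end of each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,       # keep the checkpoint with the best...
    metric_for_best_model="f1",        # ...F1-score
)

trainer = Trainer(
    model=model,                       # the combined model sketched above
    args=args,
    train_dataset=train_dataset,       # encoded datasets from Section 4.2
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```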
This approach combines the strengths of both the RoBERTa model and the CNN, allowing for a more comprehensive analysis of the text data. The fine-tuning process ensures that the model is well suited to the specific task of classifying legal text as 'Entailed', 'Neutral', or 'Contradict'.

4.4 Results and Discussion

We found that pre-trained NLI models can perform significantly better than vanilla models and LLMs. Falcon 7B and RoBERTa base are the best-performing models among the LLMs and vanilla models, respectively, as shown in Table 6. The validation set was selected to be domain-specific, based on the legal_act field.

Model         CP     Privacy  TCPA   Wage  Avg
Falcon 7B     87.2   84.5     83.9   68.5  81.02
without cnn   87.23  85.48    83.88  90.6  86.77
roberta-base  82.9   62.0     69.5   69.7  71.02
Our model     84.4   90       84     96    88.6

Table 6: Comparison of F1-scores on the validation set for various models on the NLI task, per legal domain (Consumer Protection, Privacy, TCPA, and Wage).

4.5 Direct Comparison to Related Work

The shared task organizers evaluated the models using a hidden test set, on which our model attained an F1-score of 0.724, placing fifth in the competition. The results of the top five teams are detailed in Table 7.

Model               F1-score
1-800-Shared-Tasks  0.853
Baseline            0.807
Semantists          0.785
Nowj                0.746
UOttawa             0.724

Table 7: Performance of the leading five teams on the hidden test set in the NLI subtask, measured by F1-score (Hagag et al., 2024).
5 Conclusion and Future Work

Our experiments demonstrated the success of transformer models, such as RoBERTa and DeBERTa, in handling complex legal tasks, including violation detection and inference. In Subtask A (L-NER), incorporating DeBERTa into the spaCy pipeline yielded strong results for legal named entity recognition. In Subtask B (L-NLI), combining RoBERTa with a CNN for keyword detection boosted classification accuracy.

However, despite using robust models, generalizing to unseen cases proved challenging, particularly with nuanced legal language. While the CNN improved phrase detection, more advanced methods, such as attention mechanisms, may further enhance performance.

Future work should explore architectures fine-tuned on legal texts or combine transformers with graph models to capture legal relationships. Additionally, leveraging LLMs such as GPT-4 could improve legal reasoning.
References

Hidelberg O. Albuquerque, Ellen Souza, Adriano L. I. Oliveira, David Macêdo, Cleber Zanchettin, Douglas Vitório, Nádia F. F. da Silva, and André C. P. L. F. de Carvalho. 2023. On the assessment of deep learning models for named entity recognition of Brazilian legal documents. In EPIA Conference on Artificial Intelligence, pages 93–104. Springer.

Dor Bernsohn, Gil Semo, Yaron Vazana, Gila Hayat, Ben Hagag, Joel Niklaus, Rohit Saha, and Kyryl Truskovskyi. 2024. LegalLens: Leveraging LLMs for legal violation identification in unstructured text. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2129–2145, St. Julian's, Malta. Association for Computational Linguistics.

Luiz Henrique Bonifacio, Paulo Arantes Vilela, Gustavo Rocha Lobato, and Eraldo Rezende Fernandes. 2020. A study on the impact of intradomain finetuning of deep language models for legal named entity recognition in Portuguese. In Intelligent Systems: 9th Brazilian Conference, BRACIS 2020, Rio Grande, Brazil, October 20–23, 2020, Proceedings, Part I, pages 648–662. Springer.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Can Çetindağ, Berkay Yazıcıoğlu, and Aykut Koç. 2023. Named-entity recognition in Turkish legal texts. Natural Language Engineering, 29(3):615–642.

Harshil Darji, Jelena Mitrović, and Michael Granitzer. 2023. German BERT model for legal named entity recognition. arXiv preprint arXiv:2303.05388.

Reto Gubelmann, Aikaterini-Lida Kalouli, Christina Niklaus, and Siegfried Handschuh. 2023. When truth matters: Addressing pragmatic categories in natural language inference (NLI) by large language models (LLMs). In Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), pages 24–39.

Ben Hagag, Liav Harpaz, Gil Semo, Dor Bernsohn, Rohit Saha, Pashootan Vaezipoor, Kyryl Truskovskyi, and Gerasimos Spanakis. 2024. LegalLens shared task 2024: Legal violation identification in unstructured text. Preprint, arXiv:2410.12064.

Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. Preprint, arXiv:2111.09543.

Mi-Young Kim, Juliano Rabelo, Housam Khalifa Bashier Babiker, Md Abed Rahman, and Randy Goebel. 2024. Legal information retrieval and entailment using transformer-based approaches. The Review of Socionetwork Strategies, 18(1):101–121.

Shao-Man Lee, Yu-Hsiang Tan, and Han-Ting Yu. 2023. LeArNER: Few-shot legal argument named entity recognition. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law, pages 422–426.

Elena Leitner, Georg Rehm, and Julian Moreno-Schneider. 2019. Fine-grained named entity recognition in legal documents. In International Conference on Semantic Systems, pages 272–287. Springer.

Pedro Henrique Luz de Araujo, Teófilo E. de Campos, Renato R. R. de Oliveira, Matheus Stauffer, Samuel Couto, and Paulo Bermejo. 2018. LeNER-Br: A dataset for named entity recognition in Brazilian legal text. In Computational Processing of the Portuguese Language: 13th International Conference, PROPOR 2018, Canela, Brazil, September 24–26, 2018, Proceedings, pages 313–323. Springer.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Riccardo Pozzi, Riccardo Rubini, Christian Bernasconi, and Matteo Palmonari. 2023. Named entity recognition and linking for entity extraction from Italian civil judgements. In International Conference of the Italian Association for Artificial Intelligence, pages 187–201. Springer.

Yiu Kei Tang. 2023. Natural language inference transfer learning in a multi-task contract dataset: In the case of ContractNLI, a document information extraction system.

Marco Valentino and André Freitas. 2024. On the nature of explanation: An epistemological-linguistic perspective for explanation-based natural language inference. Philosophy & Technology, 37(3):88.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Xinrui Zhang, Xudong Luo, and Jiaye Wu. 2023. A RoBERTa-GlobalPointer-based method for named entity recognition of legal documents. In 2023 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE.