Shayan Bali
2026
Detecting Subtle Biases: An Ethical Lens on Underexplored Areas in AI Language Models Biases
Shayan Bali | Farhan Farsi | Mohammad Hosseini | Adel Khorramrouz | Ehsaneddin Asgari
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Shayan Bali | Farhan Farsi | Mohammad Hosseini | Adel Khorramrouz | Ehsaneddin Asgari
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) are increasingly embedded in the daily lives of individuals across diverse social classes. This widespread integration raises urgent concerns about the subtle, implicit biases these models may contain. In this work, we investigate such biases through the lens of ethical reasoning, analyzing model responses to scenarios in a new dataset we propose comprising 1,016 scenarios, systematically categorized into ethical, unethical, and neutral types. Our study focuses on dimensions that are socially influential but less explored, including (i) residency status, (ii) political ideology, (iii) Fitness Status, (iv) educational attainment, and (v) attitudes toward AI. To assess LLMs’ behavior, we propose a baseline and employ one statistical test and one metric: a permutation test that reveals the presence of bias by comparing the probability distributions of ethical/unethical scenarios with the probability distribution of neutral scenarios on each demographic group, and a tendency measurement that captures the magnitude of bias with respect to the relative difference between probability distribution of ethical and unethical scenarios. Our evaluations of 12 prominent LLMs reveal persistent and nuanced biases across all four attributes, and Llama models exhibited the most pronounced biases. These findings highlight the need for refined ethical benchmarks and bias-mitigation tools in LLMs.
APARSIN: A Multi-Variety Sentiment and Translation Benchmark for Iranic Languages
Sadegh Jafari | Tara Azin | Farhad Roodi | Zahra Dehghani Tafti | Mehrdad Ghadrdan | Elham Vatankhahan Esfahani | Aylin Naebzadeh | Mohammadhadi Shahhosseini | Ghafoor Khan | Kazem Forghani | Danial Namazi | Seyed Mohammad Hossein Hashemi | Farhan Farsi | Mohammad Osoolian | Maede Mohammadi | Mohammad Erfan Zare | Muhammad Hasnain Khan | Muhammad Hussain | Nooreen Zaki | Joma Mohammadi | Shayan Bali | Mohammad Javad Ranjbar | Els Lefever | Veronique Hoste
The Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family
Sadegh Jafari | Tara Azin | Farhad Roodi | Zahra Dehghani Tafti | Mehrdad Ghadrdan | Elham Vatankhahan Esfahani | Aylin Naebzadeh | Mohammadhadi Shahhosseini | Ghafoor Khan | Kazem Forghani | Danial Namazi | Seyed Mohammad Hossein Hashemi | Farhan Farsi | Mohammad Osoolian | Maede Mohammadi | Mohammad Erfan Zare | Muhammad Hasnain Khan | Muhammad Hussain | Nooreen Zaki | Joma Mohammadi | Shayan Bali | Mohammad Javad Ranjbar | Els Lefever | Veronique Hoste
The Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family
The Iranic language family includes many underrepresented languages and dialects that remain largely unexplored in modern NLP research. We introduce APARSIN, a multi-variety benchmark covering 14 Iranic languages, dialects, and accents, designed for sentiment analysis and machine translation. The dataset includes both high and low-resource varieties, several of which are endangered, capturing linguistic variation across them. We evaluate a set of instruction-tuned Large Language Models (LLMs) on these tasks and analyze their performance across the varieties. Our results highlight substantial performance gaps between standard Persian and other Iranic languages and dialects, demonstrating the need for more inclusive multilingual and dialectally diverse NLP benchmarks.
2025
Persian in a Court: Benchmarking VLMs In Persian Multi-Modal Tasks
Farhan Farsi | Shahriar Shariati Motlagh | Shayan Bali | Sadra Sabouri | Saeedeh Momtazi
Proceedings of the First Workshop of Evaluation of Multi-Modal Generation
Farhan Farsi | Shahriar Shariati Motlagh | Shayan Bali | Sadra Sabouri | Saeedeh Momtazi
Proceedings of the First Workshop of Evaluation of Multi-Modal Generation
This study introduces a novel framework for evaluating Large Language Models (LLMs) and Vision-Language Models (VLMs) in Persian, a low-resource language. We develop comprehensive datasets to assess reasoning, linguistic understanding, and multimodal capabilities. Our datasets include Persian-OCR-QA for optical character recognition, Persian-VQA for visual question answering, Persian world-image puzzle for multimodal integration, Visual-Abstraction-Reasoning for abstract reasoning, and Iran-places for visual knowledge of Iranian figures and locations. We evaluate models like GPT-4o, Claude 3.5 Sonnet, and Llama 3.2 90B Vision, revealing their strengths and weaknesses in processing Persian. This research contributes to inclusive language processing by addressing the unique challenges of low-resource language evaluation.
MELAC: Massive Evaluation of Large Language Models with Alignment of Culture in Persian Language
Farhan Farsi | Farnaz Aghababaloo | Shahriar Shariati Motlagh | Parsa Ghofrani | MohammadAli SadraeiJavaheri | Shayan Bali | Amir Hossein Shabani | Farbod Bijary | Ghazal Zamaninejad | AmirMohammad Salehoof | Saeedeh Momtazi
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Farhan Farsi | Farnaz Aghababaloo | Shahriar Shariati Motlagh | Parsa Ghofrani | MohammadAli SadraeiJavaheri | Shayan Bali | Amir Hossein Shabani | Farbod Bijary | Ghazal Zamaninejad | AmirMohammad Salehoof | Saeedeh Momtazi
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
As large language models (LLMs) become increasingly embedded in our daily lives, evaluating their quality and reliability across diverse contexts has become essential. While comprehensive benchmarks exist for assessing LLM performance in English, there remains a significant gap in evaluation resources for other languages. Moreover, because most LLMs are trained primarily on data rooted in European and American cultures, they often lack familiarity with non-Western cultural contexts. To address this limitation, our study focuses on the Persian language and Iranian culture. We introduce 19 new evaluation datasets specifically designed to assess LLMs on topics such as Iranian law, Persian grammar, Persian idioms, and university entrance exams. Using these datasets, we benchmarked 41 prominent LLMs, aiming to bridge the existing cultural and linguistic evaluation gap in the field. The evaluation results are publicly available on our live leaderboard: https://huggingface.co/spaces/opll-org/Open-Persian-LLM-Leaderboard
2024
RFBES at SemEval-2024 Task 8: Investigating Syntactic and Semantic Features for Distinguishing AI-Generated and Human-Written Texts
Mohammad Heydari Rad | Farhan Farsi | Shayan Bali | Romina Etezadi | Mehrnoush Shamsfard
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Mohammad Heydari Rad | Farhan Farsi | Shayan Bali | Romina Etezadi | Mehrnoush Shamsfard
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
Nowadays, the usage of Large Language Models (LLMs) has increased, and LLMs have been used to generate texts in different languages and for different tasks. Additionally, due to the participation of remarkable companies such as Google and OpenAI, LLMs are now more accessible, and people can easily use them. However, an important issue is how we can detect AI-generated texts from human-written ones. In this article, we have investigated the problem of AI-generated text detection from two different aspects: semantics and syntax. Finally, we presented an AI model that can distinguish AI-generated texts from human-written ones with high accuracy on both multilingual and monolingual tasks using the M4 dataset. According to our results, using a semantic approach would be more helpful for detection. However, there is a lot of room for improvement in the syntactic approach, and it would be a good approach for future work.
Search
Fix author
Co-authors
- Farhan Farsi 5
- Saeedeh Momtazi 2
- Shahriar Shariati Motlagh 2
- Farnaz Aghababaloo 1
- Ehsaneddin Asgari 1
- Tara Azin 1
- Farbod Bijary 1
- Elham Vatankhahan Esfahani 1
- Romina Etezadi 1
- Kazem Forghani 1
- Mehrdad Ghadrdan 1
- Parsa Ghofrani 1
- Seyed Mohammad Hossein Hashemi 1
- Mohammad Heydari Rad 1
- Mohammad Hosseini 1
- Veronique Hoste 1
- Muhammad Hussain 1
- Sadegh Jafari 1
- Ghafoor Khan 1
- Muhammad Hasnain Khan 1
- Adel Khorramrouz 1
- Els Lefever 1
- Maede Mohammadi 1
- Joma Mohammadi 1
- Aylin Naebzadeh 1
- Danial Namazi 1
- Mohammad Osoolian 1
- Mohammad Javad Ranjbar 1
- Farhad Roodi 1
- Sadra Sabouri 1
- MohammadAli SadraeiJavaheri 1
- Amirmohammad Salehoof 1
- Amir Hossein Shabani 1
- Mohammadhadi Shahhosseini 1
- Mehrnoush Shamsfard 1
- Zahra Dehghani Tafti 1
- Nooreen Zaki 1
- Ghazal Zamaninejad 1
- Mohammad Erfan Zare 1