Rui Sousa-Silva is Assistant Professor at the Faculty of Arts and Humanities, and researcher and Scientific Coordinator of the Centre for Linguistics (CLUP) of the University of Porto, where he conducts research in Forensic Linguistics, notably authorship analysis, plagiarism detection and analysis, and cybercrime. He is also a member of the Scientific Committee of the Master's in Translation and Language Services and Coordinator of the Specialisation Course in Forensic Linguistics. He has a first degree in Translation and a Master's in Terminology and Translation, both awarded by the Faculty of Arts and Humanities of the University of Porto, and a PhD in Applied Linguistics from Aston University (Birmingham, UK), where he submitted his thesis on Forensic Linguistics. He is co-editor of the journal Language and Law and co-editor of the Routledge Handbook of Forensic Linguistics. He was LITHME WG1-Computational Linguistics Chair (2020-2024).
EDUCATION
PhD in Applied Linguistics - Forensic Linguistics, Aston University, Birmingham, UK
RESEARCH, TEACHING, or OTHER INTERESTS
Linguistics and Language, Arts and Humanities
24
Scopus Publications
"The facts speak for themselves": dismantling conspiracy theories as disinformation Rui Sousa-Silva Linguistics Vanguard, 2026 Disinformation has been commonly approached as fake news, i.e., news that does not comply with the principles of factuality, objectivity, and neutrality. However, not all pieces of disinformation are damaging (e.g., satire) or rely on lack of factuality. Rather, it is the combination of lack of factuality and intention to deceive that embodies the most serious form of disinformation. By taking advantage of echo chambers and filter bubbles, disinformers use distorted facts to disseminate alternative forms of disinformation and manipulate readers’ views. This article discusses the relevance of establishing a sociolect of conspiracy theories (CTs) as alternative sources of disinformation. It builds on a small corpus of CTs published in Portuguese to explore the use of forensic linguistics methods to assist the detection of disinformation. A holistic linguistics approach is adopted, which operates by scrutinizing metadata, structure, and discourse to identify which linguistic features depart from mainstream sources and which ones overlap, and thus understanding the linguistic materializations of CTs. The findings reveal promising results in detecting CTs as alternative sources of disinformation, with the additional advantage of substantiating judgements of disinformation with linguistic evidence. This article concludes with a discussion of the limitations of this exploratory research.
Function words as possible style markers: an application to the forensic authorship analysis of Getúlio Vargas’ suicide note (carta-testamento) Viviane Costa, Rui Sousa-Silva Delta Documentacao De Estudos Em Linguistica Teorica E Aplicada, 2025 Abstract In August 1954, the then president of Brazil, Getúlio Vargas, took his own life with a gunshot to the chest. A letter was found beside the body, which became known as the carta-testamento and whose authorship was contested at the time. People close to Vargas attributed the authorship of the letter to his close friend and speechwriter, José Soares Maciel Filho. Vargas' suicide and the message left in his last letter changed the course of Brazilian history, and we therefore believe it is important to analyse his suicide letter from the perspective of forensic authorship analysis. Our analysis focuses, however, only on function words. To this end, we also analyse other texts of known authorship, by both Getúlio Vargas and Maciel Filho. The results, part of a broader research project, show that there are considerable differences between the carta-testamento and the other suicide letters written by Vargas. In addition, the results of the analysis suggest that the use of certain words appears to be linked more closely to the relationship between speaker and addressee than to the genre and topics of the texts.
Leveraging Loanword Constraints for Improving Machine Translation in a Low-Resource Multilingual Context Felermino D. M. A. Ali, Henrique Lopes Cardoso, Rui Sousa-Silva Emnlp 2025 2025 Conference on Empirical Methods in Natural Language Processing Proceedings of the Conference, 2025 This research investigates how to improve machine translation systems for low-resource languages by integrating loanword constraints as external linguistic knowledge. Focusing on the Portuguese-Emakhuwa language pair, which exhibits significant lexical borrowing, we address the challenge of effectively adapting loanwords during the translation process. To tackle this, we propose a novel approach that augments source sentences with loanword constraints, explicitly linking source-language loanwords to their target-language equivalents. Then, we perform supervised fine-tuning on multilingual neural machine translation models and multiple Large Language Models of different sizes. Our results demonstrate that incorporating loanword constraints leads to significant improvements in translation quality as well as in handling loanword adaptation correctly in target languages, as measured by different machine translation metrics. This approach offers a promising direction for improving machine translation performance in low-resource settings characterized by frequent lexical borrowing.
Evaluating WMT 2025 Metrics Shared Task Submissions on the SSA-MTE African Challenge Set Senyu Li, Felermino Dario Mario Ali, Jiayi Wang, Rui Sousa-Silva, Henrique Lopes Cardoso, Pontus Stenetorp, Colin Cherry, David Ifeoluwa Adelani Conference on Machine Translation Proceedings, 2025 Senyu Li, Felermino Dario Mario Ali, Jiayi Wang, Rui Sousa-Silva, Henrique Lopes Cardoso, Pontus Stenetorp, Colin Cherry, David Ifeoluwa Adelani. Proceedings of the Tenth Conference on Machine Translation. 2025.
SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages? Senyu Li, Jiayi Wang, Felermino D. M. A. Ali, Colin Cherry, Daniel Deutsch, Eleftheria Briakou, Rui Sousa-Silva, Henrique Lopes Cardoso, Pontus Stenetorp, David Ifeoluwa Adelani Emnlp 2025 2025 Conference on Empirical Methods in Natural Language Processing Proceedings of the Conference, 2025 Senyu Li, Jiayi Wang, Felermino D. M. A. Ali, Colin Cherry, Daniel Deutsch, Eleftheria Briakou, Rui Sousa-Silva, Henrique Lopes Cardoso, Pontus Stenetorp, David Ifeoluwa Adelani. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025.
‘We Attempted to Deliver Your Package’: Forensic Translation in the Fight Against Cross-Border Cybercrime Rui Sousa-Silva International Journal for the Semiotics of Law, 2024 Cybercrime has increased significantly, recently, as a result of both individual and group criminal practice, and is now a threat to individuals, organisations, and democratic systems worldwide. However, cybercrime raises two main challenges for legal systems: firstly, because cybercriminals operate online, cybercrime spans beyond the boundaries of specific jurisdictions, which constrains the operation of the police and, subsequently, the conviction of the perpetrators; secondly, since cybercriminals can operate from anywhere in the world, law enforcement agencies struggle to identify the origin of the communications, especially when obfuscation strategies are used, e.g. dark web fora. Nevertheless, cybercriminals inherently use language to communicate, so the linguistic analysis of suspect communications is particularly helpful in deterring cybercriminal practice. This article reports the potential of forensic translation in the fight against cybercrime. Although the term ‘forensic translation’ is typically understood as a synonym of ‘legal translation’, it is argued that the implications of forensic translation span beyond those of legal translation, to include analyses of language rights, of the right to interpretation and translation in legal procedures (in the EU), or even investigative and intelligence practices. Translation is a pervasive activity that is conducted, not only by professional translators, but also by lay speakers of language, often using machine translation systems. The ease of use of the latter makes it particularly suitable for cross-border criminal (e.g. extortion or fraud) and cybercriminal communications (e.g. cybertrespass, cyberfraud, cyberpiracy, cyberporn or child online porn, cyberviolence or cyberstalking). 
This article presents the results of the analysis of cybercriminal communications from a forensic translation perspective. It demonstrates that translation is frequently used to spread cybercriminal communications, and that reverse-engineering the translational procedure will assist law enforcement agencies in narrowing down their pool of suspects and, consequently, deter cybercriminal threats.
Building Resources for Emakhuwa: Machine Translation and News Classification Benchmarks Felermino D. M. A. Ali, Henrique Lopes Cardoso, Rui Sousa-Silva Emnlp 2024 2024 Conference on Empirical Methods in Natural Language Processing Proceedings of the Conference, 2024 This paper introduces a comprehensive collection of NLP resources for Emakhuwa, Mozambique's most widely spoken language. The resources include the first manually translated news bitext corpus between Portuguese and Emakhuwa, news topic classification datasets, and monolingual data. We detail the process and challenges of acquiring this data and present benchmark results for machine translation and news topic classification tasks. Our evaluation examines the impact of different data types (originally clean text, post-corrected OCR, and back-translated data) and the effects of fine-tuning from pre-trained models, including those focused on African languages. Our benchmarks demonstrate good performance in news topic classification and promising results in machine translation. We fine-tuned multilingual encoder-decoder models using real and synthetic data and evaluated them on our test set and the FLORES evaluation sets. The results highlight the importance of incorporating more data and potential for future improvements. All models, code, and datasets are available in the
Expanding FLORES+ Benchmark for more Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation Conference on Machine Translation Proceedings, 2024
Detecting Loanwords in Emakhuwa: An Extremely Low-Resource Bantu Language Exhibiting Significant Borrowing From Portuguese 2024 Joint International Conference on Computational Linguistics Language Resources and Evaluation Lrec Coling 2024 Main Conference Proceedings, 2024
Introduction: Understanding Language in the Human-Machine Era Proceedings of the 1st Luhme Workshop, 2024
Network-based Approach for Stopwords Detection Proceedings of the 16th International Conference on Computational Processing of the Portuguese Language Propor 2024, 2024
Annotating Arguments in a Corpus of Opinion Articles 2022 Language Resources and Evaluation Conference Lrec 2022, 2022
Predicting Argument Density from Multiple Annotations Gil Rocha, Bernardo Leite, Luís Trigo, Henrique Lopes Cardoso, Rui Sousa-Silva, Paula Carvalho, Bruno Martins, Miguel Won Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2022
Knowledge Organization in the New Era Using DIY Corpora as Writing Assistants Advances in Knowledge Organization, 2020
Biased Language Detection in Court Decisions Alexandra Guedes Pinto, Henrique Lopes Cardoso, Isabel Margarida Duarte, Catarina Vaz Warrot, Rui Sousa-Silva Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2020