Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering

Helena Bonaldi, Greta Damo, Nicolás Benjamín Ocampo, Elena Cabrio, Serena Villata, and Marco Guerini. 2024. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: EMNLP 2024. Miami, Florida. (To appear).
Conference CORE Rank: A*.

In this paper, we first test whether the presence of safety guardrails hinders the quality of counterspeech generation. Second, we assess whether attacking a specific component of hate speech results in a more effective argumentative strategy to fight online hate.

PEACE: Providing Explanations and Analysis for Combating Hate Expressions

Greta Damo, Nicolás Benjamín Ocampo†, Elena Cabrio, and Serena Villata. 2024. In Proceedings of the 27th European Conference on Artificial Intelligence: ECAI 2024. Santiago de Compostela, Spain. (To appear).
† Equal contribution.
Conference CORE Rank: A.

Unveiling the Hate: Generating Faithful and Plausible Explanations for Implicit and Subtle Hate Speech Detection

Greta Damo, Nicolás Benjamín Ocampo†, Elena Cabrio, and Serena Villata. 2024. In Proceedings of the 29th International Conference on Natural Language & Information Systems: NLDB 2024. Turin, Italy. (To appear).
† Equal contribution.
Conference CORE Rank: C.

In this paper, we propose a comprehensive approach combining prompt construction, free-text generation, few-shot learning, and fine-tuning to generate explanations for hate speech classification. The goal is to give content moderators more context for uncovering the actual nature of a message on social media.
Code and paper will be available soon.

Unmasking the Hidden Meaning: Bridging Implicit and Explicit Hate Speech Embedding Representations

Nicolás Benjamín Ocampo, Elena Cabrio, and Serena Villata. 2023. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore.
Conference CORE Rank: A*.

This research addresses the challenge of detecting implicit hate speech (HS) in user-generated content. It makes four contributions: a comparative analysis of transformer-based models on datasets with implicit HS, an examination of embedding representations for veiled cases, a comparison linking explicit and implicit HS through their targets to improve embeddings, and a demonstration of enhanced performance in borderline HS classification cases.
URL: https://aclanthology.org/2023.findings-emnlp.441/
CODE: https://github.com/benjaminocampo/bridging_ie_hs_embs

Playing the Part of the Sharp Bully: Generating Adversarial Examples for Implicit Hate Speech Detection

Nicolás Benjamín Ocampo, Elena Cabrio, and Serena Villata. 2023. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada.
Conference CORE Rank: A*.

This paper introduces a framework for generating adversarial implicit hate speech (HS) messages using auto-regressive language models, categorizing them into EASY, MEDIUM, and HARD complexity levels. It also presents a "build it, break it, fix it" training approach, demonstrating that retraining state-of-the-art models with HARD messages significantly improves their performance on implicit HS detection.
URL: https://aclanthology.org/2023.findings-acl.173/
CODE: https://github.com/benjaminocampo/implicit_generator

An In-depth Analysis of Implicit and Subtle Hate Speech Messages

Nicolás Benjamín Ocampo, Elena Cabrio, and Serena Villata. 2023. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. Dubrovnik, Croatia.
Conference CORE Rank: A.

The study explores the difficulty of detecting subtle and implicit hate speech (HS) on social media, which is harder to identify than explicit HS. It shows that advanced neural network models are effective at identifying explicit HS but struggle with subtle and implicit forms, indicating the need for further research in this area.
URL: https://aclanthology.org/2023.eacl-main.147
CODE: https://github.com/benjaminocampo/ISHate