Exploiting Explainability to Design Adversarial Attacks and Evaluate Attack Resilience in Hate-Speech Detection Models

Paper · arXiv 2305.18585 · Published May 29, 2023

The advent of social media has given rise to numerous ethical challenges, with hate speech among the most significant concerns. Researchers are attempting to tackle this problem by leveraging hate-speech detection and employing language models to automatically moderate content and promote civil discourse. Unfortunately, recent studies have revealed that hate-speech detection systems can be misled by adversarial attacks, raising concerns about their resilience. While previous research has separately addressed the robustness of these models under adversarial attacks and their interpretability, there has been no comprehensive study exploring their intersection. The novelty of our work lies in combining these two critical aspects, leveraging interpretability to identify potential vulnerabilities and enabling the design of targeted adversarial attacks. We present a comprehensive and comparative analysis of adversarial robustness exhibited by various hate-speech detection models. Our study evaluates the resilience of these models against adversarial attacks using explainability techniques. To gain insights into the models’ decision-making processes, we employ the Local Interpretable Model-agnostic Explanations (LIME) framework.
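To make the idea of a local, perturbation-based explanation concrete, the following is a minimal sketch rather than the paper's implementation. It does not call the LIME library: it estimates token importance by randomly masking tokens and comparing mean scores with and without each token, a simplification of LIME, which instead fits a weighted linear surrogate. The classifier `toy_score`, its lexicon `TOXIC_WEIGHTS`, and the function `perturbation_importance` are all hypothetical stand-ins introduced for illustration.

```python
import random

# Hypothetical toy lexicon standing in for a trained hate-speech classifier.
TOXIC_WEIGHTS = {"hate": 0.6, "stupid": 0.3}


def toy_score(tokens):
    """Return a pseudo-probability in [0, 1] that the tokens are hateful."""
    return min(1.0, sum(TOXIC_WEIGHTS.get(t.lower(), 0.0) for t in tokens))


def perturbation_importance(tokens, score_fn, n_samples=500, seed=0):
    """Estimate per-token importance from randomly masked copies of the input.

    For each token, compare the mean score of samples that keep it with the
    mean score of samples that drop it. This is a simplification of LIME,
    not the LIME algorithm itself.
    """
    rng = random.Random(seed)
    n = len(tokens)
    kept_sum, kept_n = [0.0] * n, [0] * n
    drop_sum, drop_n = [0.0] * n, [0] * n
    for _ in range(n_samples):
        # Keep each token independently with probability 0.5.
        mask = [rng.random() < 0.5 for _ in tokens]
        score = score_fn([t for t, keep in zip(tokens, mask) if keep])
        for i, keep in enumerate(mask):
            if keep:
                kept_sum[i] += score
                kept_n[i] += 1
            else:
                drop_sum[i] += score
                drop_n[i] += 1
    return [
        kept_sum[i] / max(kept_n[i], 1) - drop_sum[i] / max(drop_n[i], 1)
        for i in range(n)
    ]


tokens = "I hate those stupid people".split()
importance = perturbation_importance(tokens, toy_score)
# Tokens sorted from most to least influential on the toy score.
ranked = sorted(zip(tokens, importance), key=lambda p: p[1], reverse=True)
```

In this toy setting the two lexicon words rank highest, which is the kind of signal an attacker could use to decide which tokens to perturb.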

Introduction. Due to the expanding influence of social media, it has become increasingly important to understand the nature of online exchanges and address discussions that contain offensive or hateful content. While prior work has focused on the efficacy of content moderation (Srinivasan et al. 2019), recent attention has shifted to automated mediation for promoting civil discourse instead of merely removing posts that include offensive language. Steps in this direction are only in the nascent stages (Kirk et al. 2022). A crucial prerequisite for ensuring the effectiveness of approaches to achieving this goal is the accurate detection of hate speech. An interaction may be offensive if it touches upon sensitive elements such as race, color, gender, ethnicity, religion, etc. The problem of discerning whether a sentence includes hate speech has been explored using models such as BERT (Devlin et al. 2018), LSTM (Hochreiter and Schmidhuber 1997) and CNN (Kim 2014).

Discussion / Conclusion. This paper introduces a unique investigation into the interplay between explainability and adversarial robustness in hate-speech detection models. Informed by this investigation, we adopt an approach whereby attacks are conducted on widely used hate-speech models, with a focus on exploiting explainable features to reveal vulnerabilities. Our findings provide empirical support for our initial hypothesis, underscoring the potential tradeoff between enhancing model explainability and inadvertently increasing its vulnerability to perturbations, thus compromising adversarial robustness. Moreover, our study yields compelling evidence of a proportional relationship between explainability and adversarial robustness in hate-speech detection models. This discovery offers valuable insights for the fine-tuning of such models, enabling the establishment of an optimal balance between explainability and adversarial robustness. By leveraging this understanding, we can ensure that the models perform well even under adversarial attacks.
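The general shape of an explanation-guided attack can be sketched as follows, under stated assumptions. This is an illustration, not the attack evaluated in the paper: the scorer `toy_score`, the leave-one-out attribution (used here in place of LIME), and the `obfuscate` perturbation that masks a vowel are all hypothetical choices made to keep the example self-contained.

```python
# Hypothetical toy lexicon standing in for a trained hate-speech classifier.
TOXIC_WEIGHTS = {"hate": 0.6, "stupid": 0.3}


def toy_score(tokens):
    """Return a pseudo-probability in [0, 1] that the tokens are hateful."""
    return min(1.0, sum(TOXIC_WEIGHTS.get(t.lower(), 0.0) for t in tokens))


def leave_one_out_importance(tokens, score_fn):
    """Importance of a token = drop in score when that token is removed."""
    base = score_fn(tokens)
    return [
        base - score_fn(tokens[:i] + tokens[i + 1:])
        for i in range(len(tokens))
    ]


def obfuscate(token):
    """Replace the first vowel with '*' (an illustrative perturbation)."""
    for i, ch in enumerate(token):
        if ch.lower() in "aeiou":
            return token[:i] + "*" + token[i + 1:]
    return token


def explanation_guided_attack(tokens, score_fn, budget=1):
    """Perturb only the `budget` tokens the explanation ranks highest."""
    importance = leave_one_out_importance(tokens, score_fn)
    order = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    targets = set(order[:budget])
    return [obfuscate(t) if i in targets else t for i, t in enumerate(tokens)]


tokens = "I hate those stupid people".split()
original = toy_score(tokens)
adversarial_tokens = explanation_guided_attack(tokens, toy_score, budget=1)
attacked = toy_score(adversarial_tokens)
# The gap between `original` and `attacked` is a crude robustness measure.
```

Comparing `original` with `attacked` across a dataset and a range of `budget` values gives one simple way to quantify how much a model's score depends on the few tokens its explanation highlights.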

Synthesis notes from this paper's topics