LLM-based Rewriting of Inappropriate Argumentation using Reinforcement Learning from Machine Feedback

Paper · arXiv 2406.03363 · Published June 5, 2024
Argumentation and Persuasion · Sentiment, Semantics, and Toxicity Detection · Natural Language Inference · Social Media and AI

Ensuring that online discussions are civil and productive is a major challenge for social media platforms. Such platforms usually rely both on users and on automated detection tools to flag inappropriate arguments of other users, which moderators then review. However, this kind of post-hoc moderation is expensive and time-consuming, and moderators are often overwhelmed by the amount and severity of flagged content. Instead, a promising alternative is to prevent negative behavior during content creation. This paper studies how inappropriate language in arguments can be computationally mitigated. We propose a reinforcement learning-based rewriting approach that balances content preservation and appropriateness based on existing classifiers, prompting an instruction-finetuned large language model (LLM) as our initial policy. Unlike related style transfer tasks, rewriting inappropriate arguments allows deleting and adding content permanently. It is therefore tackled on document level rather than sentence level. We evaluate different weighting schemes for the reward function in both absolute and relative human assessment studies.
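
To make the idea of a weighted reward concrete, the following is a minimal sketch, not the paper's implementation. It assumes the reward is a linear interpolation between an appropriateness score and a content-preservation score, controlled by a hypothetical weight `alpha`; the exact functional form is not given in this excerpt. The two scorers (`jaccard_similarity` and `toy_appropriateness`, with its `TOY_BLOCKLIST`) are illustrative stand-ins for the existing classifiers the abstract mentions, chosen only so the example runs without external dependencies.

```python
from typing import Callable

# Hypothetical scorer types; both are assumed to return values in [0, 1].
SimilarityScorer = Callable[[str, str], float]   # (original, rewrite) -> score
AppropriatenessScorer = Callable[[str], float]   # (rewrite) -> score


def combined_reward(
    original: str,
    rewrite: str,
    similarity: SimilarityScorer,
    appropriateness: AppropriatenessScorer,
    alpha: float = 0.5,
) -> float:
    """Weighted reward (assumed linear form).

    alpha = 1.0 rewards appropriateness only;
    alpha = 0.0 rewards content preservation only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (
        alpha * appropriateness(rewrite)
        + (1.0 - alpha) * similarity(original, rewrite)
    )


def jaccard_similarity(a: str, b: str) -> float:
    """Toy stand-in for a semantic similarity model: token-set overlap."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Toy stand-in for an appropriateness classifier (illustrative word list only).
TOY_BLOCKLIST = {"stupid", "idiot"}


def toy_appropriateness(text: str) -> float:
    """Return 1 minus the fraction of tokens found in the toy blocklist."""
    tokens = text.lower().split()
    if not tokens:
        return 1.0
    flagged = sum(token.strip(".,!?") in TOY_BLOCKLIST for token in tokens)
    return 1.0 - flagged / len(tokens)


if __name__ == "__main__":
    original = "Only a stupid idiot would oppose this tax."
    rewrite = "I disagree with opposing this tax."
    # Compare a few weighting schemes for the same candidate rewrite.
    for alpha in (0.0, 0.5, 1.0):
        score = combined_reward(
            original, rewrite, jaccard_similarity, toy_appropriateness, alpha
        )
        print(f"alpha={alpha:.1f} reward={score:.3f}")
```

Under this assumed form, an unchanged argument scores perfectly on similarity but is penalized on appropriateness, while a heavily rewritten one shows the opposite pattern, which is why the choice of weight matters when the reward is used to train a rewriting policy.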

Introduction. Creating trusted and safe online spaces where people with different backgrounds and opinions can discuss controversial issues is a major challenge for social media platforms (Salminen et al., 2018). The diversity in opinions, emotional attachments, and the anonymity of the web easily lead to heated discussions, which can quickly turn into toxic environments, even if only one participant behaves inappropriately (Habernal et al., 2018). Avoiding this is a challenging task, often supported by platform

Discussion / Conclusion. In this paper, we have studied how to mitigate inappropriate language in arguments through rewriting. To this end, we have proposed an approach based on reinforcement learning from human feedback (RLHF), which balances the semantic similarity of arguments with a target style (here, with appropriateness). Our approach resorts to machine feedback instead of human feedback, though, thus enabling full automation. Our experiments have demonstrated that prompting an instruction-finetuned large language model, combined with a single style classifier and an unlabeled dataset, is sufficient to train a policy that outperforms competitive baselines in terms of appropriateness and semantic similarity. Through manual annotation studies, we have provided evidence that our approach can mitigate the inappropriateness of arguments while preserving their content to a wide extent. Intriguingly, our human annotators prefer approaches that prioritize appropriateness over semantic similarity. Our results suggest that a careful design of the reward function is crucial for the success of RLHF-like approaches, if trained solely in an offline fashion.