- [2024/06] Preference Tuning For Toxicity Mitigation Generalizes Across Languages
- [2024/06] Supporting Human Raters with the Detection of Harmful Content using Large Language Models
- [2024/05] ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users
- [2024/05] Mitigating Text Toxicity with Counterfactual Generation
- [2024/05] PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models
- [2024/05] UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images
- [2024/04] SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models
- [2024/03] Moderating Illicit Online Image Promotion for Unsafe User-Generated Content Games Using Large Vision-Language Models
- [2024/03] MemeCraft: Contextual and Stance-Driven Multimodal Meme Generation
- [2024/03] Risk and Response in Large Language Models: Evaluating Key Threat Categories
- [2024/03] From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards
- [2024/03] Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation
- [2024/03] Helpful or Harmful? Exploring the Efficacy of Large Language Models for Online Grooming Prevention
- [2024/03] Harnessing Artificial Intelligence to Combat Online Hate: Exploring the Challenges and Opportunities of Large Language Models in Hate Speech Detection
- [2024/03] From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models
- [2024/03] DPP-Based Adversarial Prompt Searching for Language Models
- [2024/03] LLMGuard: Guarding Against Unsafe LLM Behavior
- [2024/02] GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection?
- [2024/02] Beyond Hate Speech: NLP's Challenges and Opportunities in Uncovering Dehumanizing Language
- [2024/02] Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content
- [2024/02] Zero shot VLMs for hate meme detection: Are we there yet?
- [2024/02] Universal Prompt Optimizer for Safe Text-to-Image Generation
- [2024/02] Can LLMs Recognize Toxicity? Structured Toxicity Investigation Framework and Semantic-Based Metric
- [2024/02] Bryndza at ClimateActivism 2024: Stance, Target and Hate Event Detection via Retrieval-Augmented GPT-4 and LLaMA
- [2024/01] Using LLMs to Discover Emerging Coded Antisemitic Hate-Speech in Extremist Social Media
- [2024/01] MetaHate: A Dataset for Unifying Efforts on Hate Speech Detection
- [2024/01] A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
- [2023/12] Moderating New Waves of Online Hate with Chain-of-Thought Reasoning in Large Language Models
- [2023/12] Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models
- [2023/12] GTA: Gated Toxicity Avoidance for LM Performance Preservation
- [2023/12] Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
- [2023/11] Unveiling the Implicit Toxicity in Large Language Models
- [2023/10] All Languages Matter: On the Multilingual Safety of Large Language Models
- [2023/10] On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
- [2023/09] (InThe)WildChat: 570K ChatGPT Interaction Logs In The Wild
- [2023/09] Controlled Text Generation via Language Model Arithmetic
- [2023/09] Curiosity-driven Red-teaming for Large Language Models
- [2023/09] RealChat-1M: A Large-Scale Real-World LLM Conversation Dataset
- [2023/09] Understanding Catastrophic Forgetting in Language Models via Implicit Inference
- [2023/09] Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models
- [2023/09] What's In My Big Data?
- [2023/08] Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
- [2023/08] You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content
- [2023/05] Evaluating ChatGPT's Performance for Multilingual and Emoji-based Hate Speech Detection
- [2023/05] Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
- [2023/04] Toxicity in ChatGPT: Analyzing Persona-assigned Language Models
- [2023/02] Adding Instructions during Pretraining: Effective Way of Controlling Toxicity in Language Models
- [2023/02] Is ChatGPT better than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech
- [2022/12] Constitutional AI: Harmlessness from AI Feedback
- [2022/12] On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning
- [2022/10] Unified Detoxifying and Debiasing in Language Generation via Inference-time Adaptive Optimization
- [2022/05] Toxicity Detection with Generative Prompt-based Inference
- [2022/04] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- [2022/03] ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
- [2020/09] RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models