| Date | Institute | Publication | Paper | Keywords |
| --- | --- | --- | --- | --- |
| 20.09 | University of Washington | EMNLP 2020 (Findings) | RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models | Toxicity |
| 21.09 | University of Oxford | ACL 2022 | TruthfulQA: Measuring How Models Mimic Human Falsehoods | Truthfulness |
| 22.03 | MIT | ACL 2022 | ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection | Toxicity |
| 23.07 | Zhejiang University; School of Engineering, Westlake University | arXiv | Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models | Text Safety&Benchmark&Jailbreaking |
| 23.07 | Stevens Institute of Technology | NAACL 2024 (Findings) | HateModerate: Testing Hate Speech Detectors against Content Moderation Policies | Hate Speech Detection&Content Moderation&Machine Learning |
| 23.08 | Meta Reality Labs | NAACL 2024 | Head-to-Tail: How Knowledgeable are Large Language Models (LLMs)? | Large Language Models&Knowledge Graphs&Question Answering |
| 23.08 | Bocconi University | NAACL 2024 | XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models | Large Language Models&Safety Behaviours&Test Suite |
| 23.09 | LibrAI, MBZUAI, The University of Melbourne | arXiv | Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs | Safety Evaluation&Safeguards |
| 23.10 | University of Edinburgh, Huawei Technologies Co., Ltd. | NAACL 2024 | Assessing the Reliability of Large Language Model Knowledge | Large Language Models&Factual Knowledge&Knowledge Probing |
| 23.10 | University of Pennsylvania | NAACL 2024 (Findings) | Automatic Hallucination Assessment for Aligned Large Language Models via Transferable Adversarial Attacks | Hallucination Assessment&Adversarial Attacks&Large Language Models |
| 23.11 | Fudan University | arXiv | JADE: A Linguistic-based Safety Evaluation Platform for LLM | Safety Benchmarks |
| 23.11 | UNC-Chapel Hill | arXiv | Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | Hallucination&Benchmark&Multimodal |
| 23.11 | IBM Research AI | EMNLP 2023 (GEM Workshop) | Unveiling Safety Vulnerabilities of Large Language Models | Adversarial Examples&Clustering&Automatic Identification |
| 23.11 | The Hong Kong University of Science and Technology | arXiv | P-Bench: A Multi-level Privacy Evaluation Benchmark for Language Models | Differential Privacy&Privacy Evaluation |
| 23.11 | UC Berkeley | arXiv | Can LLMs Follow Simple Rules? | Evaluation&Attack Strategies |
| 23.11 | University of Central Florida | arXiv | THOS: A Benchmark Dataset for Targeted Hate and Offensive Speech | Hate Speech&Offensive Speech&Dataset |
| 23.11 | Beijing Jiaotong University; DAMO Academy, Alibaba Group; Peng Cheng Lab | arXiv | AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | Multi-modal Large Language Models&Hallucination&Benchmark |
| 23.11 | Patronus AI, University of Oxford, Bocconi University | arXiv | SimpleSafetyTests: A Test Suite for Identifying Critical Safety Risks in Large Language Models | Safety Risks&Test Suite&Evaluation |
| 23.11 | University of Southern California, University of Pennsylvania, University of California Davis | arXiv | Deceiving Semantic Shortcuts on Reasoning Chains: How Far Can Models Go without Hallucination? | Hallucinations&Semantic Associations&Benchmark |
| 23.11 | Seoul National University, Chung-Ang University, NAVER AI Lab, NAVER Cloud, University of Richmond | arXiv | LifeTox: Unveiling Implicit Toxicity in Life Advice | LifeTox Dataset&Toxicity Detection&Social Media Analysis |
| 23.11 | School of Information, Renmin University of China | arXiv | UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation | Hallucination&Evaluation Benchmark |
| 23.11 | UC Santa Cruz, UNC-Chapel Hill | arXiv | How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | Vision Large Language Models&Safety Evaluation&Adversarial Robustness |
| 23.11 | Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Baidu Inc. | arXiv | FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality Fairness Toxicity | Harmlessness Evaluation |
| 23.11 | Fudan University, Shanghai Artificial Intelligence Laboratory | NAACL 2024 | Fake Alignment: Are LLMs Really Aligned Well? | Large Language Models&Safety Evaluation&Fake Alignment |
| 23.11 | Kahlert School of Computing, University of Utah | NAACL 2024 | Whispers of Doubt Amidst Echoes of Triumph in NLP Robustness | NLP Robustness&Out-of-Domain Evaluation&Adversarial Evaluation |
| 23.11 | Shanghai Jiao Tong University | NAACL 2024 (Findings) | CLEAN-EVAL: Clean Evaluation on Contaminated Large Language Models | Clean Evaluation&Data Contamination&Large Language Models |
| 23.12 | Meta | arXiv | Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models | Safety&Cybersecurity&Code Security Benchmark |
| 23.12 | University of Illinois Chicago, Bosch Research North America & Bosch Center for Artificial Intelligence (BCAI), UNC-Chapel Hill | arXiv | DELUCIONQA: Detecting Hallucinations in Domain-specific Question Answering | Hallucination Detection&Domain-specific QA&Retrieval-augmented LLMs |
| 23.12 | University of Science and Technology of China, Hong Kong University of Science and Technology, Microsoft | arXiv | Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models | Indirect Prompt Injection Attacks&BIPIA Benchmark&Defense |
| 24.01 | NewsBreak, University of Illinois Urbana-Champaign | arXiv | RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models | Retrieval-Augmented Generation&Hallucination Detection&Dataset |
| 24.01 | University of Notre Dame, Lehigh University, Illinois Institute of Technology, Institut Polytechnique de Paris, William & Mary, Texas A&M University, Samsung Research America, Stanford University | ICML 2024 | TrustLLM: Trustworthiness in Large Language Models | Trustworthiness&Benchmark Evaluation |
| 24.01 | University College London | arXiv | Hallucination Benchmark in Medical Visual Question Answering | Medical Visual Question Answering&Hallucination Benchmark |
| 24.01 | Carnegie Mellon University | arXiv | TOFU: A Task of Fictitious Unlearning for LLMs | Data Privacy&Ethical Concerns&Unlearning |
| 24.01 | IRLab, CITIC Research Centre, Universidade da Coruña | arXiv | MetaHate: A Dataset for Unifying Efforts on Hate Speech Detection | Hate Speech Detection&Social Media |
| 24.01 | Northwestern University, New York University, University of Liverpool, Rutgers University | arXiv | AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models | Jailbreak Attack&Evaluation Frameworks&Ground Truth Dataset |
| 24.01 | Shanghai Jiao Tong University | arXiv | R-Judge: Benchmarking Safety Risk Awareness for LLM Agents | LLM Agents&Safety Risk Awareness&Benchmark |
| 24.02 | University of Illinois Urbana-Champaign, Center for AI Safety, Carnegie Mellon University, UC Berkeley, Microsoft | arXiv | HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal | Automated Red Teaming&Robust Refusal |
| 24.02 | Shanghai Artificial Intelligence Laboratory, Harbin Institute of Technology, Beijing Institute of Technology, Chinese University of Hong Kong | arXiv | SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Safety Benchmark&Safety Evaluation&Hierarchical Taxonomy |
| 24.02 | Middle East Technical University | arXiv | HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs | Hallucination&Benchmarking Dataset |
| 24.02 | Indian Institute of Technology Kharagpur | arXiv | How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries | Instruction-centric Responses&Ethical Vulnerabilities |
| 24.03 | East China Normal University | arXiv | DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models | Dialogue-level Hallucination&Benchmarking&Human-machine Interaction |
| 24.03 | Tianjin University, Zhengzhou University, China Academy of Information and Communications Technology | arXiv | OpenEval: Benchmarking Chinese LLMs across Capability, Alignment, and Safety | Chinese LLMs&Benchmarking&Safety |
| 24.04 | University of Pennsylvania, ETH Zurich, EPFL, Sony AI | arXiv | JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models | Jailbreaking Attacks&Robustness Benchmark |
| 24.04 | Vector Institute for Artificial Intelligence, University of Limerick | arXiv | Developing Safe and Responsible Large Language Models - A Comprehensive Framework | Responsible AI&AI Safety&Generative AI |
| 24.04 | LMU Munich, University of Oxford, Siemens AG, Munich Center for Machine Learning (MCML), Wuhan University | arXiv | Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? | Jailbreak Attacks&GPT-4V&Evaluation Benchmark&Robustness |
| 24.04 | Bocconi University, University of Oxford | arXiv | SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety | LLM Safety&Open Datasets&Systematic Review |
| 24.04 | University of Alberta, The University of Tokyo | arXiv | Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward | LLM Safety&Online Safety Analysis&Benchmark |
| 24.04 | Technion – Israel Institute of Technology, Google Research | arXiv | Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs | Hallucinations&Benchmarks |
| 24.05 | Carnegie Mellon University | arXiv | PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models | Multilingual Evaluation&Datasets |
| 24.05 | Paul G. Allen School of Computer Science & Engineering, University of Washington | arXiv | MASSIVE Multilingual Abstract Meaning Representation: A Dataset and Baselines for Hallucination Detection | Hallucination Detection&Multilingual AMR&Dataset |
| 24.05 | University of California, Riverside | arXiv | Cross-Task Defense: Instruction-Tuning LLMs for Content Safety | Instruction-Tuning&LLM Safety&Content Safety |
| 24.06 | University of Waterloo | arXiv | TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability | Truthfulness&Reliability |
| 24.06 | Rutgers University | arXiv | MoralBench: Moral Evaluation of LLMs | Moral Evaluation&MoralBench |
| 24.06 | Tsinghua University | arXiv | Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study | Trustworthiness&MLLMs&Benchmark |
| 24.06 | Beijing Academy of Artificial Intelligence | arXiv | HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation | Hallucination Evaluation&Dialogue-Level&HalluDial |
| 24.06 | Sichuan University | arXiv | LEGEND: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets | Safety Margin&Preference Datasets&Representation Engineering |
| 24.06 | The Hong Kong University of Science and Technology (Guangzhou) | arXiv | Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs | Jailbreak Attacks&Benchmarking |
| 24.06 | AI Innovation Center, China Unicom | arXiv | CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models | Chinese Hierarchical Safety Benchmark&Large Language Models&Automatic Evaluation |
| 24.06 | Google | arXiv | Supporting Human Raters with the Detection of Harmful Content using Large Language Models | Harmful Content Detection&Hate Speech |
| 24.06 | South China University of Technology, Pazhou Laboratory, University of Maryland, Baltimore County | arXiv | GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models | Gender Bias Mitigation&Alignment Dataset&Bias Categories |
| 24.06 | Center for AI Safety and Governance, Institute for AI, Peking University | arXiv | SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset | Safety Alignment&Text2Video Generation |
| 24.06 | Fudan University | arXiv | Cross-Modality Safety Alignment | Multimodal Safety&Large Vision-Language Models&SIUO Benchmark |
| 24.06 | KAIST | arXiv | CSRT: Evaluation and Analysis of LLMs using Code-Switching Red-Teaming Dataset | Code-Switching&Red-Teaming&Multilingualism |
| 24.06 | University College London | arXiv | JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models | Gender Bias&Hiring Bias&Benchmarking |
| 24.06 | Peking University | arXiv | PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models | Safety Alignment&Preference Dataset |
| 24.06 | University of California, Los Angeles | arXiv | MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries? | Multimodal Language Models&Oversensitivity&Safety Mechanisms |
| 24.06 | Allen Institute for AI | arXiv | WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs | Safety Moderation&Jailbreak Attacks&Moderation Tools |
| 24.06 | University of Washington | arXiv | WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models | Jailbreaking&Safety Training&Adversarial Attacks |
| 24.07 | Beijing Jiaotong University | arXiv | KG-FPQ: Evaluating Factuality Hallucination in LLMs with Knowledge Graph-based False Premise Questions | Factuality Hallucination&Knowledge Graph&False Premise Questions |
| 24.07 | Chinese Academy of Sciences | arXiv | T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models | Text-to-Video Generation&Safety Evaluation&Generative Models |
| 24.07 | Patronus AI | arXiv | Lynx: An Open Source Hallucination Evaluation Model | Hallucination Detection&RAG&Evaluation Model |
| 24.07 | Virginia Tech | arXiv | AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies | AI Safety&Regulations&Policies&Risk Categories |
| 24.07 | Columbia University | ECCV 2024 | HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Hallucination&Vision-Language Models&Datasets |
| 24.07 | Center for AI Safety | arXiv | Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? | AI Safety&Benchmarks |
| 24.08 | Walled AI Labs | arXiv | WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models | AI Safety&Prompt Injection |
| 24.08 | ShanghaiTech University | arXiv | MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models | Jailbreak Attacks&Vision-Language Models&Security |
| 24.08 | Stanford University | arXiv | Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models | Cybersecurity&Capture the Flag |
| 24.08 | Zhejiang University | arXiv | Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks | Jailbreak Attacks&LLM Reliability&Evaluation Framework |
| 24.08 | Enkrypt AI | arXiv | SAGE-RT: Synthetic Alignment Data Generation for Safety Evaluation and Red Teaming | Synthetic Data Generation&Safety Evaluation&Red Teaming |
| 24.08 | Tianjin University | ACL 2024 (Findings) | CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models | Moral Evaluation&Moral Dilemma |
| 24.08 | University of Surrey | IJCAI 2024 | CodeMirage: Hallucinations in Code Generated by Large Language Models | Code Hallucinations&CodeMirage Dataset |
| 24.08 | Chalmers University of Technology | arXiv | LLMSecCode: Evaluating Large Language Models for Secure Coding | Secure Coding&Evaluation Framework |
| 24.09 | The Chinese University of Hong Kong | arXiv | Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness | Correctness&Non-Toxicity&Fairness |
| 24.09 | KAIST | arXiv | Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering | Image Hallucination&Text-to-Image Generation&Question-Answering |
| 24.09 | Zhejiang University | arXiv | GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks | Prompt Injection&LLM Safety&Benchmarking |
| 24.10 | Zhejiang University | arXiv | Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents | LLM-based Agents&Security Benchmarks&Adversarial Attacks |
| 24.10 | Zhejiang University, Duke University | arXiv | SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks | Safety Alignment&Scientific Tasks |
| 24.10 | The Chinese University of Hong Kong, Tencent AI Lab | arXiv | Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step | Chain-of-Jailbreak&Image Generation Models&Safety |
| 24.10 | University of California, Santa Cruz; University of California, Berkeley | arXiv | Multimodal Situational Safety: A Benchmark for Large Language Models | Multimodal Situational Safety&MLLMs&Safety Benchmark |
| 24.10 | IBM Research | arXiv | ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents | Web Agents&Safety&Trustworthiness |
| 24.10 | Renmin University of China, Anthropic, University of Oxford, University of Edinburgh, Mila, Tangentic | arXiv | PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning | Data Poisoning&LLM Vulnerability&Preference Learning |
| 24.10 | Gray Swan AI, UK AI Safety Institute | arXiv | AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents | Jailbreaking&LLM Agents&Harmful Agent Tasks |
| 24.10 | Purdue University | arXiv | Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code | Code Hallucinations&Code Generation&Automated Program Repair |
| 24.10 | The Hong Kong University of Science and Technology (Guangzhou), University of Birmingham, Baidu Inc. | arXiv | JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework | Jailbreak Judge&Multi-agent Framework |
| 24.10 | University of Notre Dame, IBM Research | arXiv | BenchmarkCards: Large Language Model and Risk Reporting | BenchmarkCards&Bias&Fairness |
| 24.10 | Vectara, Inc., Iowa State University, University of Southern California, Entropy Technologies, University of Waterloo, Funix.io, University of Wisconsin-Madison | arXiv | FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs | Hallucination Detection&Human-annotated Benchmark&Faithfulness |
| 24.10 | Southern University of Science and Technology | arXiv | ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language Models | ChineseSafe&Content Safety&LLM Evaluation |
| 24.10 | Beihang University | arXiv | SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models | Multimodal Large Language Models&Safety Evaluation Framework&Risk Assessment |
| 24.10 | University of Washington-Madison | arXiv | CFSafety: Comprehensive Fine-grained Safety Assessment for LLMs | Safety Assessment&LLM Evaluation&Instruction Attacks |
| 24.10 | University of Pennsylvania | arXiv | Benchmarking LLM Guardrails in Handling Multilingual Toxicity | Multilingual Toxicity Detection&Guardrails&Jailbreaking Attacks |
| 24.10 | University of Wisconsin-Madison | arXiv | InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models | Prompt Injection Defense&Over-defense Detection&Guardrail Models |
| 24.10 | National Engineering Research Center for Software Engineering, Peking University | NeurIPS 2024 | SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types | LLM Safety&Prompt Engineering&Jailbreak Attacks |
| 24.11 | Fudan University | arXiv | LongSafetyBench: Long-Context LLMs Struggle with Safety Issues | Long-Context Models&Safety Evaluation&Benchmarking |
| 24.11 | Anthropic | arXiv | Rapid Response: Mitigating LLM Jailbreaks with a Few Examples | Jailbreak Defense&Rapid Response |
| 24.11 | Texas A&M University | arXiv | Responsible AI in Construction Safety: Systematic Evaluation of Large Language Models and Prompt Engineering | Construction Safety&Prompt Engineering&LLM Evaluation |
| 24.11 | IBM Research Europe | NeurIPS 2024 SafeGenAI Workshop | HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment | Jailbreaking Techniques&LLM Vulnerability&Quantization Impact |
| 24.11 | Peking University | arXiv | ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain | LLM Safety&Chemistry Domain&Benchmarking |
| 24.11 | New York University, JPMorgan Chase, Cornell Tech, Northeastern University | arXiv | Assessment of LLM Responses to End-user Security Questions | LLM Evaluation&End-user Security&Information Integrity |
| 24.11 | National Library of Medicine, NIH; University of Maryland; University of Virginia; Universidad de Chile | arXiv | Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine | Medical AI&LLM Safety&MedGuard Benchmark |
| 24.11 | European Commission Joint Research Centre | EMNLP 2024 | GuardBench: A Large-Scale Benchmark for Guardrail Models | Guardrail Models&Benchmark&Evaluation |
| 24.12 | Vizuara AI Labs | arXiv | CBEval: A Framework for Evaluating and Interpreting Cognitive Biases in LLMs | Cognitive Biases&LLM Evaluation&Reasoning Limitations |
| 24.12 | Beijing Institute of Technology, Beihang University | arXiv | ReFF: Reinforcing Format Faithfulness in Language Models across Varied Tasks | Format Faithfulness&Benchmark |
| 24.12 | UCLA, Salesforce AI Research | NeurIPS 2024 | SafeWorld: Geo-Diverse Safety Alignment | Geo-Diverse Alignment&Safety Evaluation&Legal Compliance |
| 24.12 | Shanghai Jiao Tong University | arXiv | SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents | Safety-Aware Task Planning&Embodied LLM Agents&Hazard Mitigation |
| 24.12 | Tsinghua University | arXiv | Agent-SafetyBench: Evaluating the Safety of LLM Agents | Agent Safety&Risk Awareness&Interactive Evaluation |
| 24.12 | TU Darmstadt | arXiv | LLMs Lost in Translation: M-ALERT Uncovers Cross-Linguistic Safety Gaps | Cross-Linguistic Safety&Multilingual Benchmark&LLM Alignment |
| 24.12 | Alibaba, China Academy of Information and Communications Technology | arXiv | Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models | Safety Benchmark&Factuality Evaluation |
| 24.12 | University of Warwick, Cranfield University | arXiv | MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models | Medical Hallucinations&Benchmark&RLHF |
| 24.12 | The Hong Kong Polytechnic University | arXiv | SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | Cybersecurity Benchmark&Large Language Models&Dataset Evaluation |
| 25.01 | KTH Royal Institute of Technology | arXiv | CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Cybersecurity Benchmark&Jailbreaking&Prompt Dataset |
| 25.01 | Shahjalal University of Science and Technology | arXiv | From Scarcity to Capability: Empowering Fake News Detection in Low-Resource Languages with LLMs | Fake News Detection&Bangla&Low-Resource Languages |
| 25.01 | NVIDIA | arXiv | AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails | AI Safety&Content Moderation Dataset&LLM Risk Taxonomy |
| 25.01 | Georgia Institute of Technology | arXiv | On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena | Cultural Bias in LLMs&Cross-Linguistic Analysis&Arabic-English Benchmarks |
| 25.01 | Bocconi University | arXiv | MSTS: A Multimodal Safety Test Suite for Vision-Language Models | Multimodal Safety&Vision-Language Models |
| 25.01 | Fudan University | arXiv | You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense | Jailbreak Defense&LLM Performance&USEBench |
| 25.01 | McGill University | arXiv | OnionEval: A Unified Evaluation of Fact-conflicting Hallucination for Small-Large Language Models | Fact-conflicting Hallucination&Small-Large Language Models (SLLMs)&Benchmark |
| 25.01 | HKUST | arXiv | Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak | Audio Language Models&Jailbreak Vulnerabilities&Audio Modality Edits |