diff --git a/_contents/S0-L15.md b/_contents/S0-L15.md
index e89d7e11..006507fc 100755
--- a/_contents/S0-L15.md
+++ b/_contents/S0-L15.md
@@ -43,3 +43,949 @@ In this session, our readings cover:

### Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

+ https://arxiv.org/abs/2308.05374

# LLM Hallucination

## A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

### Brief introduction to LLM Hallucinations

- The current definition characterizes hallucinations as generated content that is nonsensical or unfaithful to the provided source content.
- These hallucinations are further categorized as intrinsic or extrinsic, depending on whether they contradict the source content.
- In LLMs, the scope of hallucination encompasses a broader and more comprehensive concept, primarily centering on factual errors.
- Given the evolution of the LLM era, the existing hallucination taxonomy needs to be adjusted to enhance its applicability and adaptability.
### Types of Hallucinations

- **Factuality Hallucination**: output is inconsistent with real-world facts or potentially misleading
  - Factual Inconsistency: the facts relate to real-world information but contradict it
  - Factual Fabrication: the claims are unverifiable against established real-world knowledge

- **Faithfulness Hallucination**: output is inconsistent with the user-provided instructions or contextual information
  - Instruction inconsistency: the output deviates from the user's instructions
  - Context inconsistency: the output is unfaithful to the provided contextual information
  - Logical inconsistency: the output exhibits internal logical contradictions

### Hallucination Causes

- Data
- Training
- Inference

**1. Hallucination from Data**

- Misinformation and Biases
  - Imitative Falsehoods: the model is trained on factually incorrect data
  - Duplication Bias: the model over-prioritizes the recall of duplicated data
  - Social Biases: e.g., gender and race
- Knowledge Boundary
  - Domain Knowledge Deficiency: a lack of proprietary data leads to less domain expertise
  - Outdated Factual Knowledge
- Inferior Data Utilization
  - Knowledge Shortcut: the model relies overly on co-occurrence statistics and relevant document counts
  - Knowledge Recall Failures
    - Long-tail Knowledge: rare, specialized, or highly specific information that is not widely known or discussed
    - Complex Scenarios: multi-hop reasoning and logical deduction

**2. Hallucination from Training**

- Hallucination from Pre-training
  - Architecture Flaws
    - Inadequate Unidirectional Representation: the model predicts the next token based solely on preceding tokens, left to right
    - Attention Glitches: limitations of soft attention, where attention is diluted across positions as sequence length increases
  - Exposure Bias: introduced by teacher forcing during training
- Hallucination from Alignment
  - Capability Misalignment: mismatch between LLMs' pre-trained capabilities and the expectations of the fine-tuning data
  - Belief Misalignment: the model prioritizes appeasing perceived user preferences over truthfulness

**3. Hallucination from Inference**

- Inherent Sampling Randomness
  - Stochastic Sampling: controlled randomness enhances creativity and diversity
  - Likelihood Trap: high-probability but low-quality text
- Imperfect Decoding Representation
  - Insufficient Context Attention: the model prioritizes recent or nearby tokens (over-confidence issue)
  - Softmax Bottleneck: inability to represent multi-modal output distributions, leading to irrelevant or inaccurate content

### Hallucination Detection and Benchmarks

As LLMs have garnered substantial attention, distinguishing accurate from hallucinated content has become a pivotal concern. The broad spectrum of hallucination mitigation spans two primary facets: detection mechanisms and evaluation benchmarks.

Traditional metrics fall short in differentiating the nuanced discrepancies between plausible and hallucinated content, which highlights the necessity of more sophisticated detection methods.

**1. Factuality Hallucination Detection**

- **Retrieve External Facts**

Compare the model-generated content against reliable knowledge sources.
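To make this concrete, below is a minimal, hypothetical sketch of retrieval-based fact checking: retrieve passages from a trusted source that relate to a claim in the model's output, then hand both to a verifier. The retriever here is a naive word-overlap ranker over a toy in-memory knowledge source, and `call_llm` is an assumed stand-in for a real LLM API, not any specific library.

```python
# Illustrative sketch of the "retrieve external facts" detection idea.
# A real system would use a search API or vector store as the knowledge
# source and an LLM or NLI model as the verifier.

KNOWLEDGE_SOURCE = [
    "Charles Lindbergh made the first solo nonstop transatlantic flight in 1927.",
    "The Wright brothers flew the first powered airplane in 1903.",
]

def retrieve(claim: str, top_k: int = 1) -> list[str]:
    """Rank trusted passages by naive word overlap with the claim."""
    words = set(claim.lower().split())
    ranked = sorted(KNOWLEDGE_SOURCE,
                    key=lambda p: len(words & set(p.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def build_verification_prompt(claim: str, evidence: list[str]) -> str:
    """Build a prompt asking an LLM judge to label the claim against the evidence."""
    return (
        "Evidence:\n" + "\n".join(evidence) + "\n\n"
        f"Claim: {claim}\n"
        "Does the evidence support the claim? Answer SUPPORTED or REFUTED."
    )

claim = "Charles Lindbergh made the first solo transatlantic flight in 1951."
prompt = build_verification_prompt(claim, retrieve(claim))
print(prompt)
# verdict = call_llm(prompt)  # hypothetical LLM call; REFUTED would flag a factuality hallucination
```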
The survey illustrates this with a worked example of detecting a factuality hallucination by retrieving external facts.

- **Uncertainty Estimation**

Premise: the origin of LLM hallucinations is inherently tied to the model's uncertainty. These methods operate in zero-resource settings and fall into two approaches:

1. LLM Internal States: operates under the assumption that one can access the model's internal states
2. LLM Behavior: relies solely on the model's observable behavior to infer its underlying uncertainty

**2. Faithfulness Hallucination Detection**

Focuses on ensuring that the generated content is aligned with the given context, sidestepping the potential pitfalls of extraneous or contradictory output.

- **Fact-based Metrics**: assess faithfulness by measuring the overlap of facts between the generated content and the source content
- **Classifier-based Metrics**: use trained classifiers to measure the level of entailment between the generated content and the source content
- **Question-Answering-based Metrics**: employ question-answering systems to validate the consistency of information between the source and the generated content
- **Uncertainty Estimation**: assesses faithfulness by measuring the model's confidence in its generated outputs
- **Prompting-based Metrics**: LLMs are induced to serve as evaluators, assessing the faithfulness of generated content through specific prompting strategies

Figure 5 of the survey illustrates these five families of detection methods for faithfulness hallucinations.

**3. Benchmarks**

- **Hallucination Evaluation Benchmarks**

Assess LLMs' proclivity to produce hallucinations, with particular emphasis on identifying factual inaccuracies and measuring deviations from original contexts.

- **Hallucination Detection Benchmarks**

Evaluate the performance of existing hallucination detection methods. These are primarily concentrated on task-specific hallucinations, such as abstractive summarization, data-to-text, and machine translation.
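Before turning to mitigation, here is a minimal sketch of the behavior-based uncertainty estimation idea above: sample the same question several times and measure how much the answers agree; low agreement signals uncertainty and possible hallucination. `call_llm` is a hypothetical stand-in for a real LLM API, and the agreement measure is deliberately simplistic.

```python
# Sketch of behavior-based uncertainty estimation (zero-resource setting).
# Idea: when an LLM is hallucinating, repeated samples tend to disagree.
from collections import Counter

def call_llm(prompt: str, temperature: float = 1.0) -> str:
    raise NotImplementedError("Plug in an actual LLM API call here.")

def normalize(answer: str) -> str:
    """Very rough normalization so trivially different answers still match."""
    return " ".join(answer.lower().strip().rstrip(".").split())

def consistency_score(question: str, n_samples: int = 5) -> float:
    """Fraction of sampled answers that agree with the most common answer.
    Low scores indicate high uncertainty, a signal of possible hallucination."""
    answers = [normalize(call_llm(question)) for _ in range(n_samples)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / n_samples

# Usage (with a real `call_llm`):
# score = consistency_score("Who wrote the novel 'Middlemarch'?")
# if score < 0.6:
#     print("Low self-consistency: treat the answer as potentially hallucinated.")
```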
### Mitigation Strategies

**4. Mitigating Data-related Hallucinations**

- **Mitigating Misinformation and Biases**
  - **Factuality Data Enhancement**: gathering high-quality data and up-sampling factual data during pre-training
  - **Duplication Bias**: removing exact duplicates and near-duplicates from the training corpus
  - **Societal Biases**: focusing on curated, diverse, balanced, and representative training corpora
- **Mitigating Knowledge Boundary**
  - **Knowledge Editing**: modifying model parameters (locate-then-edit methods, meta-learning methods) or preserving model parameters
  - **Retrieval Augmentation**: one-time retrieval, iterative retrieval, post-hoc retrieval
- **Mitigating Knowledge Shortcut**
  - Fine-tuning on a debiased dataset by excluding biased samples
- **Mitigating Knowledge Recall Failures**
  - Adding relevant information to questions to aid recall; encouraging LLMs to reason through intermediate steps to improve recall

**5. Mitigating Training-related Hallucination**

**Mitigating Pre-training-related Hallucination**

The majority of research emphasizes exploring novel model architectures and improving pre-training objectives.

- **Mitigating Flawed Model Architecture**
  - _Mitigating Unidirectional Representation_: BATGPT introduces a bidirectional autoregressive approach, enhancing context comprehension by considering both past and future contexts
  - _Mitigating Attention Glitches_: attention-sharpening regularizers promote sparsity in self-attention, reducing reasoning errors
- **Mitigating Suboptimal Pre-training Objectives**
  - _Training Objective_: incorporating factual contexts as a topic prefix to ensure accurate entity associations and reduce factual errors
  - _Exposure Bias_: techniques such as intermediate sequence supervision and Minimum Bayes Risk decoding reduce error accumulation and domain-shift hallucinations

**Mitigating Misalignment Hallucination**

- **Improving Human Preference Judgments**: enhancing the quality of human-annotated data and preference models to reduce the propensity for reward hacking and sycophantic responses
- **Modifying LLMs' Internal Activations**: fine-tuning with synthetic data whose truth claims are independent of user opinions to curb sycophantic tendencies

**6. Mitigating Inference-related Hallucination**

**Factuality Enhanced Decoding**

- **On Standalone Decoding**
  - **Factual-Nucleus Sampling**: adjusts the nucleus probability dynamically to balance factual accuracy and output diversity
  - **Inference-Time Intervention (ITI)**: uses directions in activation space associated with factually correct statements to steer the LLM toward accuracy during inference
- **Post-editing Decoding**
  - **Chain-of-Verification (CoVe)**: employs self-correction capabilities to refine generated content through a systematic verification and revision process
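To illustrate the factual-nucleus sampling idea above, here is a minimal sketch under simplifying assumptions: the nucleus size `p` decays as a sentence grows so later tokens are sampled more conservatively, and resets at sentence boundaries to preserve diversity. This follows the general intuition rather than the exact published algorithm, and assumes access to the model's next-token probability distribution.

```python
# Sketch of factual-nucleus (decaying top-p) sampling. Illustrative only;
# real use requires a model that exposes next-token probabilities.
import numpy as np

def nucleus_sample(probs: np.ndarray, p: float, rng: np.random.Generator) -> int:
    """Sample a token id from the smallest set of tokens whose mass exceeds p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))

def factual_nucleus_p(step_in_sentence: int, p_start: float = 0.9,
                      decay: float = 0.9, p_floor: float = 0.3) -> float:
    """Shrink the nucleus with position in the sentence, down to a lower bound."""
    return max(p_start * (decay ** step_in_sentence), p_floor)

# Usage sketch: at decoding step t with `probs` over the vocabulary,
#   p_t = factual_nucleus_p(tokens_since_sentence_start)
#   next_token = nucleus_sample(probs, p_t, rng)
# and reset `tokens_since_sentence_start` whenever a sentence-ending token is produced.
```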
**Faithfulness Enhanced Decoding**

- **Context Consistency**
  - **Context-Aware Decoding (CAD)**: adjusts the output distribution to sharpen the focus on contextual information, balancing diversity and attribution
- **Logical Consistency**
  - **Knowledge Distillation and Contrastive Decoding**: generates consistent rationales and fine-tunes with counterfactual reasoning to eliminate reasoning shortcuts, ensuring logical progression in multi-step reasoning

### Challenges and Open Questions

**Challenges in LLM Hallucination**

- **Hallucination in Long-form Text Generation**

There is an absence of manually annotated hallucination benchmarks in the domain of long-form text generation.

- **Hallucination in Retrieval-Augmented Generation**

Irrelevant evidence can be propagated into the generation phase, possibly tainting the output.

- **Hallucination in Large Vision-Language Models**

LVLMs sometimes mix up or miss parts of the visual context, and can fail to understand the temporal or logical connections between them.

**Open Questions in LLM Hallucination**

- **Can Self-Correction Mechanisms Help in Mitigating Reasoning Hallucinations?**

LLMs occasionally exhibit unfaithful reasoning, characterized by inconsistencies within the reasoning steps or by conclusions that do not logically follow from the reasoning chain.

- **Can We Accurately Capture LLM Knowledge Boundaries?**

LLMs still struggle to recognize their own knowledge boundaries. This shortfall leads to hallucinations in which LLMs confidently produce falsehoods without any awareness of their own knowledge limits.

- **How Can We Strike a Balance between Creativity and Factuality?**

Hallucinations can sometimes offer valuable perspectives, particularly in creative endeavors such as storytelling, brainstorming, and generating solutions that transcend conventional thinking.

## LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond

LLMs are used to summarize documents across different domains, and these summaries must be accurate and factual. However, LLMs have some issues as factual reasoners:

1. Not all LLMs can generate explanations that locate factual inaccuracies.
2. Many mislabeled samples of factual inconsistencies go undetected by annotators.

Laban et al. discuss LLMs as factual reasoners, propose a new protocol for creating inconsistency-detection benchmarks, and release SummEdits, which applies their protocol across 10 domains.

Laban et al. test different LLMs on the FactCC dataset to find which LLMs are potentially good factual reasoners. In-context learning and prompt engineering can be used to steer LLMs toward the desired output (a minimal prompt sketch follows the findings list below).

The authors evaluate the factual accuracy of many LLM and non-LLM models. Their experiment yields a few interesting findings for the binary classification test:

- The non-LLM specialized models outperform most LLMs.
- Few-shot prompting improves performance over zero-shot prompting (except for GPT-4 and PaLM2).
- Generate-with-Evidence prompting outperforms Chain-of-Thought.
- Persona-based prompting improves GPT3.5-turbo performance.
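As an illustration of the kind of zero-shot prompting used in these experiments, here is a minimal sketch of a binary factual-consistency prompt. The wording is illustrative rather than the paper's exact template, and `call_llm` is a hypothetical stand-in for a real LLM API.

```python
# Sketch of a zero-shot binary factual-consistency check in the style of the
# FactCC / SummEdits experiments.

ZERO_SHOT_TEMPLATE = """Decide if the summary is factually consistent with the document.

Document:
{document}

Summary:
{summary}

Answer with a single word: "consistent" or "inconsistent"."""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in an actual LLM API call here.")

def predict_consistency(document: str, summary: str) -> bool:
    """Return True if the LLM labels the summary as consistent."""
    prompt = ZERO_SHOT_TEMPLATE.format(document=document, summary=summary)
    answer = call_llm(prompt).strip().lower()
    return answer.startswith("consistent")

# Balanced accuracy over a labeled benchmark is then computed by comparing
# predict_consistency(...) against gold consistent/inconsistent labels.
```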
They also found that the models are mostly accurate when detecting positive (consistent) samples, but are poor at detecting factual inconsistencies, particularly pronoun swaps.

Through manual analysis of the LLM outputs, they found that the explanations given for challenging questions were either missing, irrelevant, or plausible but wrong.

The authors also conducted a fine-grained analysis, evaluating each document-sentence pair with respect to an individual error type while ignoring other types of errors. They recorded low precision but high recall, and the models were not able to distinguish between error types.

The authors also discuss the limitations of the existing AggreFact and DialSumEval crowd-sourced benchmarks. They filtered out all models that did not achieve a balanced accuracy above 60% on FactCC and used a single zero-shot (ZS) prompt for all LLM models on these benchmarks.

The authors conclude that these crowd-sourced benchmarks have low reliability. Further, the scale of these benchmarks limits their quality and interpretability.

The authors propose a new protocol for creating inconsistency-detection benchmarks and implement it in a 10-domain benchmark called **SummEdits**. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, with inter-annotator agreement estimated at about 0.9.

Based on their analysis of previous benchmarks, the authors set out several design principles that can help create higher-quality factual consistency benchmarks:

- P1: Binary Classification Task: a summary is either consistent or inconsistent
- P2: Focus on Factual Consistency: a summary should be flawless on attributes unrelated to consistency
- P3: Reproducibility: labels should be independent of the annotator
- P4: Benchmark Diversity: inconsistencies should represent a wide range of errors in real textual domains

They introduce a protocol designed to create challenging benchmarks while ensuring the reproducibility of the labels. The protocol involves manually verifying the consistency of a small set of seed summaries and subsequently generating numerous edited versions of these summaries. Further details, a visualization of the procedure, and example samples produced by the protocol are provided in the paper's figures.
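To make the protocol concrete, here is a rough sketch of its main loop: verify a small set of seed summaries, then use an LLM to produce many lightly edited variants of each verified seed and label the variants. The prompt wording is illustrative rather than the paper's, and `call_llm` and `human_verify` are hypothetical placeholders.

```python
# Rough sketch of a SummEdits-style benchmark-creation protocol:
# 1) manually verify a small set of seed summaries,
# 2) ask an LLM to generate many lightly edited variants of each verified seed,
# 3) label each variant consistent / inconsistent (borderline samples are discarded).

EDIT_PROMPT = """Produce {n} slightly edited versions of the summary below.
Some edits should keep the summary factually consistent with the document,
and some should introduce a subtle factual inconsistency
(e.g., an entity modification, antonym swap, inserted fact, or negation).

Document:
{document}

Summary:
{summary}"""

def call_llm(prompt: str) -> list[str]:
    raise NotImplementedError("Plug in an actual LLM API call here.")

def human_verify(document: str, summary: str) -> bool:
    raise NotImplementedError("Annotator confirms the seed summary is consistent.")

def build_samples(document: str, seed_summary: str, n_edits: int = 30) -> list[dict]:
    """Generate candidate benchmark samples from one verified seed summary."""
    if not human_verify(document, seed_summary):
        return []  # discard seeds that are not fully consistent
    edited_summaries = call_llm(
        EDIT_PROMPT.format(n=n_edits, document=document, summary=seed_summary)
    )
    return [
        {
            "document": document,
            "summary": edited,
            # Annotators then assign consistent / inconsistent / borderline labels;
            # borderline samples are removed from the final benchmark.
            "label": None,
        }
        for edited in edited_summaries
    ]
```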
The SummEdits benchmark was created by applying the protocol to ten diverse textual domains, including legal, dialogue, academic, financial, and sales text. Specifically, it contains:

- News: articles and summaries of Google News top events from February 2023
- Podcasts: 40 transcripts from the Spotify dataset, with automatic summaries
- BillSum: 40 US bills and their summaries
- SamSum: 40 dialogues and their summaries from a dialogue summarization dataset
- Shakespeare: 40 scenes, with automatic summaries
- SciTLDR: 40 research paper abstracts and their summaries
- QMSum: 40 documents and summaries from a query-based meeting summarization dataset
- ECTSum: 40 documents from a financial earnings call dataset, with automatic summaries
- Sales Call & Email: 40 fictional sales calls and emails generated along with their summaries

For the statistics of SummEdits, the authors report that:

- At least 20% of each domain's samples were annotated by multiple annotators.
- Cohen's Kappa varied between 0.72 and 0.90 across domains when considering all three labels, averaging 0.82.
  - After removing "borderline" samples, the average Kappa rose to 0.92, indicating high agreement.
- The total cost was $3,000 for 150 hours of annotator work.
  - The average cost per domain is $300.
- Using the processes of other benchmarks would have increased the cost roughly 20-fold.
  - For example, if each sample required 30 minutes of annotator time, as in the FRANK benchmark.

The following table reports the average performance of specialized models, LLMs with a zero-shot prompt, an oracle version of the LLM that has access to additional information, and an estimate of human performance computed on the subset of the benchmark annotated by multiple annotators.

From the table, we can see that:

- Performance is low overall; only GPT-4 comes within 10% of human performance.
- Only 4 LLMs outperform the non-LLM QAFactEval; most LLMs cannot reason about the consistency of facts out of the box.
- Specialized models performed best on News, probably because it is similar to their training data.
- BillSum and Shakespeare are particularly challenging.
- Oracle test: the model is given the document, the seed summary, and the edited summary.
  - This yields a large boost in performance, to within 2% of human performance.
  - This shows that high performance on the benchmark is indeed attainable.

To gain more specific insight into the types of edits present in SummEdits, the authors annotated each inconsistent sample in the benchmark with tags of the edit types that lead to factual inconsistency, covering four edit types:

- Entity Modification
- Antonym Swap
- Hallucinated Fact Insertion
- Negation Insertion

In SummEdits, 78% of inconsistent summaries contain an entity modification, 48% an antonym swap, 22% a hallucinated fact insertion, and 18% a negation insertion; this distribution is influenced by the LLM used to produce the edits.

Table 10 presents model performance across each of the edit types. Additionally, the authors grouped inconsistent summaries by the number of distinct edit types they contain (1 to 4) and computed model performance on each group, with results summarized in Table 11.
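Since inter-annotator agreement is central to the reproducibility claim, here is a small self-contained sketch of how Cohen's Kappa is computed for two annotators. The labels and data below are toy values for illustration, not the actual SummEdits annotations.

```python
# Cohen's Kappa for two annotators: observed agreement corrected for the
# agreement expected by chance.
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same label at random.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)

annotator_1 = ["consistent", "inconsistent", "consistent", "borderline", "consistent"]
annotator_2 = ["consistent", "inconsistent", "consistent", "consistent", "consistent"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.583 on this toy example
```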
In conclusion, the authors of this paper:

- simplified the annotation process for improved reproducibility;
- created the SummEdits benchmark, which spans 10 domains and is
  - highly reproducible and more cost-effective than previous benchmarks,
  - challenging for most current LLMs, and
  - a valuable tool for evaluating LLMs' ability to reason about facts and detect factual errors;
- encouraged LLM developers to report their performance on the benchmark.

## Survey of Hallucination in Natural Language Generation

Link:

Following previous work, the authors categorize hallucinations into two main types, namely intrinsic hallucination and extrinsic hallucination.

The authors also present a general overview of evaluation metrics and mitigation methods for the different NLG tasks, summarized in the paper.

## References

* Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., ... & Liu, T. (2023). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. _arXiv preprint arXiv:2311.05232_.
* Laban, P., Kryściński, W., Agarwal, D., Fabbri, A. R., Xiong, C., Joty, S., & Wu, C. S. (2023). LLMs as factual reasoners: Insights from existing benchmarks and beyond. _arXiv preprint arXiv:2305.14540_.
* Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., ... & Fung, P. (2023). Survey of hallucination in natural language generation. _ACM Computing Surveys_, _55_(12), 1-38.
\ No newline at end of file