diff --git a/_contents/S0-L06.md b/_contents/S0-L06.md index 338a3045..4964a8dd 100755 --- a/_contents/S0-L06.md +++ b/_contents/S0-L06.md @@ -490,3 +490,97 @@ Finally, the paper also shows the carbon emission of training OLMo, with a sligh + +## Paper E. Llama 2: Open Foundation and Fine-Tuned Chat Models + + + +#### E.1 &nbsp; &nbsp; &nbsp;Pre-training methodology + +To create the new family of Llama 2 models, the authors used an optimized auto-regressive transformer but made several changes to improve performance. + +Specifically, they performed more robust data cleaning, updated data mixes, trained on 40% more total tokens, doubled the context length, and used grouped-query attention (GQA) to improve inference scalability for larger models. + + + +#### E.2 &nbsp; &nbsp; &nbsp;Training Details +1. Adopt most of the pretraining setting and model architecture from Llama 1: + + use the standard transformer architecture + + apply pre-normalization using RMSNorm + + use the SwiGLU activation function + + use rotary positional embeddings (RoPE) +2. Primary architectural differences: + + increased context length + + grouped-query attention (GQA) + +#### E.3 &nbsp; &nbsp; &nbsp;Llama 2: Rotary Positional Embeddings (RoPE) + +RoPE is an enhancement to the traditional position encoding used in transformer models. It dynamically encodes the position information by rotating the query and key vectors in the attention mechanism. + +*Problems in prior methods*: ++ Absolute positional encoding is simple, but may not generalize well to longer sequences. ++ Relative positional bias (T5) is not efficient. +Solution: ++ Apply a rotation to the word vector to encode its position. ++ Maintain both absolute and relative positional embeddings in an input sentence. ++ We do not need to train custom parameters. + + + +#### E.4 &nbsp; &nbsp; &nbsp;Llama 2: Grouped-query Attention (GQA) + ++ 34B and 70B models used GQA for improved inference scalability. 
+ + + +#### Pre-trained Results ++ After pretraining, results are not as good as those of proprietary, closed-source models (GPT-4 and PaLM-2-L). ++ Llama 2 is still very competitive, given that it is only a pre-trained model. + + + +#### E.5 &nbsp; &nbsp; &nbsp;Fine-tuning methodology + +#### Llama 2: Iterative Fine-Tuning ++ Rejection Sampling: Sample K outputs from the model and select the best candidate based on the reward model. ++ Can be combined with PPO. ++ Generating multiple samples in this manner can drastically increase the maximum reward of a sample. + + + +#### Llama 2: Ghost Attention (GAtt) + + + +#### Llama 2: Fine-Tuning Results +The paper reports the progress of its successive SFT and RLHF versions along both the Safety and Helpfulness axes, measured by the authors' in-house Safety and Helpfulness reward models. + + + + +#### E.6 &nbsp; &nbsp; &nbsp;Model Safety +#### Llama 2: Safety in Fine-Tuning: Adversarial Samples ++ Gather adversarial prompts and safe demonstrations in the SFT training set. ++ Essentially probes for edge cases. ++ Annotators write both the prompt and the response in adversarial samples. + + + +#### Llama 2: Safety in RLHF +RLHF safety measures: ++ The safety RM is trained on human preference data. ++ The adversarial prompts are reused when training the safety RM. + +Helpfulness remains intact after safety tuning with RLHF. + + + +#### Llama 2: Safety Evaluation +The fine-tuned versions of Llama 2-Chat show virtually zero toxicity across all groups. ++ This demonstrates the effectiveness of fine-tuning in mitigating model-generated toxicity. + + + + + + diff --git a/_contents/S0-L08.md b/_contents/S0-L08.md index f7d4b96e..6105da7e 100644 --- a/_contents/S0-L08.md +++ b/_contents/S0-L08.md @@ -62,8 +62,6 @@ https://aclanthology.org/2023.findings-acl.719/ 1. Foundation Models and Fair Use 2. Copyright Plug-in Market for The Text-to-Image Copyright Protection 3. Extracting Training Data from Diffusion Models -4. A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT -5. 
Llama 2: Open Foundation and Fine-Tuned Chat Models ## Paper A. Foundation Models and Fair Use @@ -528,9 +526,10 @@ Overall, diffusion models have higher membership inference leakage, e.g., diffus 3. Stronger diffusion models are less private than weaker diffusion models 4. Propose attack techniques to help estimate the privacy risks of trained models - ## Paper D. A Comprehensive Survey of AI-Generated Content (AIGC):A History of Generative AI from GAN to ChatGPT ++ A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT + ChatGPT and other Generative AI (GAI) techniques belong to the category of Artificial Intelligence Generated Content (AIGC), which involves the creation of digital content. The goal of AIGC is to make the content creation process more efficient and accessible, allowing for the production of high-quality content at a faster pace. @@ -609,12 +608,12 @@ Combines transformer-based encoders and decoders together for pre-training, Eg. *Normalizing Flows*: A Normalizing Flow is a distribution transformation from simple to complex by a sequence of invertible and differentiable mappings. -1. Coupling and autoregressive flows - + Multi-scale flows +1. Coupling and autoregressive flows + + Multi-scale flows 2. Convolutional and Residual Flows. - + ConvFlow - + RevNets - + iRevNets + + ConvFlow + + RevNets + + iRevNets *Diffusion Models*: The Generative Diffusion Model (GDM) is a cutting-edge class of generative models based on probability, which demonstrates state-of-the-art results in the field of computer vision. It works by progressively corrupting data with multiple-level noise perturbations and then learning to reverse this process for sample generation. @@ -626,8 +625,8 @@ Under the hood of Encoder-Decoder family architectures. 
The encoder is responsib #### Vision Language Encoders -+ Concatenated encoders: concatenating the embeddings from single encoders - ++ Concatenated encoders: concatenating the embeddings from single encoders + *Cross-aligned encoders*: learning contextualized representations is to look at pairwise interactions between modalities. @@ -635,9 +634,9 @@ Under the hood of Encoder-Decoder family architectures. The encoder is responsib #### Vision Language Decoders 1. To text decoders: Jointly- trained decoders, frozen decoders. 2. To image decoders: - + GAN-based, - + Diffusion-based:GLIDE, Imagen - + VAE-based: DALL-E + + GAN-based, + + Diffusion-based:GLIDE, Imagen + + VAE-based: DALL-E @@ -657,104 +656,14 @@ Under the hood of Encoder-Decoder family architectures. The encoder is responsib 2. Training efficiency: This covers factors that affect the speed and resource requirements of training a model, such as training time, memory footprint, and scalability across multiple #### D.8     Future Directions -+ High-stakes Applications -+ Specialization and Generalization -+ Continual Learning and Retraining ++ High-stakes Applications ++ Specialization and Generalization ++ Continual Learning and Retraining + Reasoning + Scaling up + Social issue -## Paper E. Llama 2: Open Foundation and Fine-Tuned Chat Models - - - -#### E.1     Pre-training methodology - -To create the new family of Llama 2 models, the authors used an optimized auto-regressive transformer but made several changes to improve performance. - -Specifically, they performed more robust data cleaning, updated data mixes, trained on 40% more total tokens, doubled the context length, and used grouped-query attention (GQA) to improve inference scalability for larger models. - - - -#### E.2     Training Details -1. 
Adopt most of the pretraining setting and model architecture from Llama 1: - + use the standard transformer architecture - + apply pre-normalization using RMSNorm - + use the SwiGLU activation function - + use rotary positional embeddings (RoPE) -2. Primary architectural differences: - + increased context length - + grouped-query attention (GQA) - -#### E.3     Llama 2: Rotary Positional Embeddings (RoPE) - -An enhancement to the traditional position encoding used in transformer models. RoPE dynamically encodes the position information by rotating the query and key vectors in the attention mechanism. - -*Problems in prior methods*: -+ Absolute positional encoding is simple, but may not generalize well in longer sequences. -+ Relative positional bias (T5) is not efficient. -Solution: -+ Apply rotation to word vector to encode rotation. -+ Maintain both absolute and relative positional embeddings in an input sentence. -+ We do not need to train custom parameters. - - - -#### E.4     Llama 2: Grouped-query Attention (GQA) - -+ 34B and 70B models used GQA for improved inference scalability. - - - -#### Pre-trained Results -+ After pretraining, results are not as good as other proprietary, closed-source models. (GPT-4 and PaLM-2-L.) -+ Llama-2 is still very competitive (only a pre-trained model) - - - -#### E.4     Fine-tuning methodology - -#### Llama 2: Iterative Fine-Tuning -+ Rejection Sampling: Sample K outputs from the model, select the best candidate based on the reward model -+ Can be combined with PPO -+ Generating multiple samples in this manner can drastically increase the maximum reward of a sample. - - - -#### Llama 2: Ghost Attention (GAtt) - - - -#### Llama 2: Fine-Tuning Results -Report the progress of our different SFT and then RLHF versions for both Safety and Helpfulness axes, measured by our in-house Safety and Helpfulness reward models. 
- - - - -#### E.5     Model Safety -#### Llama 2: Safety in Fine-Tuning: Adversarial Samples -+ Gather adversarial prompts and safe demonstrations in the SFT training set. -+ Essentially probes for edge cases. -+ Annotator writes both the prompt and the response in adversarial samples. - - - -#### Llama 2: Safety in RLHF -RLHF safety measures: -+ Safety RM uses human preference data to train. -+ Reuse the adversarial prompts when training safety RM. - -Helpfulness remains intact after safety tuning with RLHF. - - - -#### Llama 2: Safety Evaluation -The fine-tuned versions of LLama 2-Chat, show virtually zero toxicity across all groups. -+ The effectiveness of fine-tuning in mitigating model-generated toxicity. - - - ## References + https://arxiv.org/abs/2303.15715