- To specify which task the model should perform, we add a task-specific text prefix. We found that changing the exact wording of the prefix had limited impact. (IMO, it is even possible to use task numbers). For example, for MNLI benchmark, the input sequence becomes "mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity." and target text is "entailment".
- We use a simplified form of position embeddings where each "embedding" is simply a scalar that is added to the corresponding logit used for computing the attention weights. We also share the position embedding parameters across layers, though within a given layer each attention head uses a different learned position embedding. Typically, a fixed number of embeddings are learned, each corresponding to a range of possible key-query offsets. (IMO this is similar to adaptive attention span) We use 32 embeddings for all of our models, with ranges that increase in size logarithmically up to an offset of 128, beyond which we assign all relative positions to the same embedding (see the bucketing sketch after this list). We also remove the Layer Norm bias and apply Layer Norm to the input of each subcomponent.
- The model is trained with a maximum likelihood objective using "teacher forcing". At test time, we use greedy decoding.
- We design "replacing corrupted spans", a pre-training objective that randomly drops out 15% of tokens in the input sequence (fig. 2). All consecutive spans of dropped-out tokens are replaced by a single sentinel token. Each sentinel token is assigned a token ID that is unique to the sequence. The sentinel IDs are special tokens which are added to our vocabulary and do not correspond to any wordpiece. We predict only the dropped-out tokens, a choice made to reduce the computational cost of pre-training. The target corresponds to all of the dropped-out spans of tokens, delimited by the same sentinel tokens used in the input sequence, plus a final sentinel token to mark the end of the target sequence (see the corruption sketch after this list).
- For the STS-B regression task, the floating-point target values are mapped to strings like "2.6".
- For Winograd tasks we highlight the ambiguous pronoun in the text passage and ask the model to predict the noun that it refers to: "The city councilmen refused the demonstrators a permit because *they* feared violence." and the model would be trained to predict the target text "The city councilmen".
- For pretraining, we introduce the "Colossal Clean Crawled Corpus" (C4), a data set of clean English text that uses Common Crawl as a source of text scraped from the web. Since the majority of Common Crawl text is not natural language but gibberish or boiler-plate (menus, error messages, duplicate text), we apply several filtering stages.
- 2) Gradually unfreezing the encoder and decoder in parallel. This caused a minor degradation in performance across all tasks, though it did provide some speedup during fine-tuning.
- Multitask learning typically aims for a single model that performs many tasks at once. We instead explore training on multiple tasks at once but selecting a different checkpoint for each task. In our case, "multi-task learning" simply corresponds to mixing data sets together. We try several data balancing strategies. However, such multi-task fine-tuning is outperformed by fine-tuning on each task individually; this has previously been observed when the tasks are not very similar. We explore including the supervised tasks alongside the unsupervised objective during pre-training to give the model some beneficial early exposure to the downstream tasks, but this does not seem to help.
- If we have, say, 4× more compute, how should we use it? Our results suggest that increasing the training time and increasing the model size can be complementary means of improving performance. On some tasks, ensembling 4 completely separately trained models significantly outperformed every other scaling approach. Also, there was no clear winner between training for 4× as many steps and using a 4× larger batch size. However, on SuperGLUE neither ensembling approach significantly improved over the baseline.
- For SuperGLUE, we improved upon the SOTA by a large margin with T5-11B, combining insights from our experimental study with unprecedented scale, and nearly matching the human performance. Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we exceed human performance by a large margin, suggesting the evaluation metrics used for these tasks may be biased towards machine-made predictions.
- TODO add a paragraph about examples-proportional mixing scheme
- TODO add a paragraph about packing
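
A minimal sketch of the relative position bucketing described above (see the note on simplified position embeddings): 32 buckets, log-spaced ranges up to offset 128. The function name and exact bucket boundaries are my simplification, and the sign of the offset is ignored here, unlike in the actual T5 implementation.

```python
import math

def relative_position_bucket(relative_position: int,
                             num_buckets: int = 32,
                             max_distance: int = 128) -> int:
    """Map a key-query offset to one of `num_buckets` bucket ids.

    Small offsets each get their own bucket; larger offsets share buckets
    whose width grows logarithmically; offsets beyond `max_distance` all
    fall into the last bucket.
    """
    n = abs(relative_position)
    max_exact = num_buckets // 2            # first half: one bucket per offset
    if n < max_exact:
        return n
    if n >= max_distance:
        return num_buckets - 1
    # second half: logarithmically spaced buckets between max_exact and max_distance
    log_ratio = math.log(n / max_exact) / math.log(max_distance / max_exact)
    return min(num_buckets - 1, max_exact + int(log_ratio * (num_buckets - max_exact)))

# Each bucket id indexes a learned scalar (one per attention head, shared across
# layers) that is added to the attention logit of the corresponding query-key pair.
print([relative_position_bucket(d) for d in (0, 3, 15, 16, 40, 127, 500)])
# -> [0, 3, 15, 16, 23, 31, 31]
```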
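
And a toy illustration of the span-corruption objective ("replacing corrupted spans"): how consecutive dropped-out tokens become sentinels in the input and how the target is formed. The `<extra_id_N>` sentinel names and the whitespace tokenization are placeholders for illustration only.

```python
def span_corrupt(tokens, dropped_positions):
    """Build the corrupted input and the target for a set of dropped token positions."""
    sentinels = iter(f"<extra_id_{i}>" for i in range(100))
    inputs, targets, in_span = [], [], False
    for i, tok in enumerate(tokens):
        if i in dropped_positions:
            if not in_span:                      # start of a new dropped span
                sentinel = next(sentinels)
                inputs.append(sentinel)
                targets.append(sentinel)
                in_span = True
            targets.append(tok)
        else:
            in_span = False
            inputs.append(tok)
    targets.append(next(sentinels))              # final sentinel ends the target
    return " ".join(inputs), " ".join(targets)

tokens = "Thank you for inviting me to your party last week".split()
print(span_corrupt(tokens, dropped_positions={2, 3, 8}))
# ('Thank you <extra_id_0> me to your party <extra_id_1> week',
#  '<extra_id_0> for inviting <extra_id_1> last <extra_id_2>')
```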

@article{Sanh2021Oct,
author = {Sanh, Victor and Webson, Albert and Raffel, Colin and Bach, Stephen H. and Sutawika, Lintang and Alyafeai, Zaid and Chaffin, Antoine and Stiegler, Arnaud and Scao, Teven Le and Raja, Arun and Dey, Manan and Bari, M. Saiful and Xu, Canwen and Thakker, Urmish and Sharma, Shanya Sharma and Szczechla, Eliza and Kim, Taewoon and Chhablani, Gunjan and Nayak, Nihal and Datta, Debajyoti and Chang, Jonathan and Jiang, Mike Tian-Jian and Wang, Han and Manica, Matteo and Shen, Sheng and Yong, Zheng Xin and Pandey, Harshit and Bawden, Rachel and Wang, Thomas and Neeraj, Trishala and Rozen, Jos and Sharma, Abheesht and Santilli, Andrea and Fevry, Thibault and Fries, Jason Alan and Teehan, Ryan and Bers, Tali and Biderman, Stella and Gao, Leo and Wolf, Thomas and Rush, Alexander M.},
title = {{Multitask Prompted Training Enables Zero-Shot Task Generalization}},
journal = {arXiv},
year = {2021},
month = oct,
eprint = {2110.08207},
doi = {10.48550/arXiv.2110.08207}
}
- An influential hypothesis is that LLMs generalize to new tasks as a result of an implicit process of multitask learning: as a byproduct of learning to predict the next word, an LM is forced to learn from a mixture of implicit tasks. We also hypothesize that common NLP tasks can appear in an explicit form in pretraining corpora.
- We introduce the Public Pool of Prompts (P3), which converts a large set of natural language tasks into prompted form using a simple templating language (see examples in fig. 3, and the template sketch after this list). P3 contains 2073 crowdsourced prompts for 177 datasets. We also allow prompts that permute the original task for improved diversity (but they are not reported in evaluation). We use the PromptSource environment (see "PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts"). We provide all the prompts in Appendix G.
- We exclude non-English tasks, coding tasks, and tasks that require special domain knowledge. This yields 12 tasks and 62 datasets (fig. 2, yellow - training, green - held-out).
- We hold out all constituent datasets of four tasks (fig. 2, green): NLI, coreference resolution, sentence completion, and word sense disambiguation (humans also zero-shot generalize to such tasks). We also verify that data for these tasks is not leaked through the pretraining corpus. We also add 14 tasks from BIG-bench to the held-out set.
- We propose T0: a T5 11B fine-tuned on the described set of training tasks and prompts. Since T5 was trained to generate only the tokens that were removed from the input text, its output format differs from the natural text generation format of prompted datasets. Therefore, we use the LM-adapted T5 model (referred to as T5+LM, see "The power of scale for parameter-efficient prompt tuning"), produced by training T5 on 100B additional tokens from C4 with a standard language modeling objective.
- We perform early stopping on validation splits of the training datasets: this satisfies the true zero-shot setting as we do not use any examples from any of the held-out tasks to select the best checkpoint.
- We do not perform prompt selection by comparing the performance of different prompts on the validation split. In "True few-shot learning with language models" it was argued that such a strategy leaks information from the evaluation splits, which makes the evaluation not "true" zero-shot. (IMO not clear why)
- Our multi-task fine-tuning gives significant gains (fig. 4). We report the median performance and interquartile range across all prompts. T0 matches or exceeds the performance of all GPT-3 models on 9 out of 11 held-out datasets, without prompt cherry-picking.
- We conduct two ablations on the effects of the number of training prompts and datasets.
- 1) Even with just one prompt per dataset, performance on held-out tasks can improve substantially over the non-prompted baseline. Increasing the number of training prompts improves both the median and the spread. So, training on more prompts per dataset leads to better and more robust generalization to held-out tasks.
- 2) To study the effect of more training datasets, we train T0+, a variant of T0 but trained on a mixture that adds GPT-3’s evaluation datasets (hence it cannot be directly compared to GPT-3 in a zero-shot setting). We also train T0++ which further adds SuperGLUE to the training mixture (except RTE and CB) and leaves NLI and the BIG-bench tasks as the only held-out tasks. The results vary, but the average median performance (across all 5 held-out datasets) increases. It appears that increasing the number of training datasets does not consistently make the model more robust to the wording of prompts.
- There is a similar concurrent work "Finetuned language models are zero-shot learners", differences are discussed in sec. 7. Surprisingly, they perform an ablation with a model of comparable size (8B parameters) and find that performance on held-out tasks decreases after multitask prompted training. We identify two key differences between the models that could explain this discrepancy: First, we use an encoder-decoder model that was pretrained with a different objective (masked language modeling) before being trained as a standard language model and finally fine-tuned on the multitask mixture. Secondly, our prompts are qualitatively more diverse in terms of their length and creativity.
- IMO, the baseline T5+LM was trained on the language modeling objective, so this model is unaware that it should actually answer the question (rather than continue the question, write another question, etc.). Of course, fine-tuning to answer various questions may improve such a model, even on other questions, simply by unlocking the understanding that the answer should follow. This is a very practical way to adapt models after pretraining. But what if a model was already trained to answer questions, but for some reason needs fine-tuning (for example, to adapt to a specific language)? If we train on some set of tasks, will the performance on other tasks grow or decrease? Possibly the latter, especially if we use a small number of prompts during training, which reduces to multi-task training with the prompt serving as a "task label".
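
A toy approximation of how a P3-style prompt template turns a dataset example into an (input, target) text pair. PromptSource actually uses Jinja templates; the field names and prompt wording below are made up for illustration.

```python
# A template is just a pair of text patterns with named fields from the dataset.
template = {
    "input":  'Premise: "{premise}" Question: does this imply that "{hypothesis}"? Yes, no, or maybe?',
    "target": "{label_text}",
}

example = {
    "premise": "I hate pigeons.",
    "hypothesis": "My feelings towards pigeons are filled with animosity.",
    "label_text": "yes",
}

prompted_input = template["input"].format(**example)
prompted_target = template["target"].format(**example)
print(prompted_input)   # Premise: "I hate pigeons." Question: does this imply that "..."? Yes, no, or maybe?
print(prompted_target)  # yes
```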

@article{Wei2021Sep,
author = {Wei, Jason and Bosma, Maarten and Zhao, Vincent Y. and Guu, Kelvin and Yu, Adams Wei and Lester, Brian and Du, Nan and Dai, Andrew M. and Le, Quoc V.},
title = {{Finetuned Language Models Are Zero-Shot Learners}},
journal = {arXiv},
year = {2021},
month = sep,
eprint = {2109.01652},
doi = {10.48550/arXiv.2109.01652}
}
- We propose instruction tuning: fine-tuning a pretrained LM on a mixture of >60 NLP datasets expressed via natural language instructions, to improve zero-shot performance. The motivation is to improve the ability to respond to NLP instructions.
- In comparison, T5 prompts are mostly just a tag for the dataset, which would not work in the zero-shot setting. In contrast, the prompts that we use for FLAN are similar to what would be used to ask a human to perform the task (see appendix E).
- We refer to the resulting decoder-only 137B model as FLAN (Finetuned Language Net), fine-tuned from the LaMDA-PT checkpoint (pretrained on a collection of web documents) for 30k gradient steps with a batch size of 8,192 tokens.
- We transform 62 existing datasets into an instructional format (fig. 3). Each dataset is categorized into one of twelve task clusters. We hold out each cluster for zero-shot evaluation while instruction tuning on all other clusters. For each dataset, we manually compose ten unique prompt templates (fig. 4), including up to three templates that solve the reversed task (e.g., for sentiment classification we include templates asking to generate a movie review).
- For classification, GPT-3 used a rank classification approach where, for example, only two outputs ("yes" and "no") are considered and the higher-probability one is taken as the model’s prediction. However, a large number of alternative ways of saying "yes" may lower the probability mass assigned to "yes". Therefore, we include an options suffix, in which we append the token OPTIONS to the end of a classification task along with a list of the output classes for that task. This makes the model aware of which choices are desired (see the rank-classification sketch after this list).
- Instruction tuning is very effective on tasks naturally verbalized as instructions, e.g., NLI, QA, translation, struct-to-text.
- Adding additional task clusters to instruction tuning improves zero-shot performance on held-out task clusters (fig. 6).
- Instruction tuning does not improve performance for many language modeling tasks (e.g., commonsense reasoning or coreference resolution tasks formulated as sentence completions), where instructions are largely redundant. On these tasks FLAN only outperforms LaMDA-PT on 3 of 7 tasks.
- On 8B and smaller models, however, instruction tuning actually hurts performance on held-out tasks (fig. 7). One potential explanation for this result could be that for small-scale models, learning the ∼40 tasks used during instruction tuning fills the entire model capacity, causing these models to perform worse on new tasks.
- This is a concurrent work with "Multitask Prompted Training Enables Zero-Shot Task Generalization".
- We explore other fine-tuning setups. In a dataset name setup, each input is prepended with the name of the task and dataset (e.g., "[Translation: WMT’14 to French] The dog runs."). This performs substantially worse on held-out tasks (47% vs 55% zero-shot performance).
- IMO, it could be interesting to see how the performance on held-out tasks changes as we perform more fine-tuning steps, either in the instruction tuning setup or in the dataset name setup. This could help to understand the effect of catastrophic forgetting of how to perform other tasks when fine-tuning on a limited number of tasks and prompts.
- We study how instruction tuning can be used when few-shot exemplars (up to 16) are available at inference time, giving the instruction format "instruct(x1) y1 ... instruct(xk) yk instruct(x)" (see the prompt-building sketch after this list). This improves the performance on all task clusters and reduces sensitivity to prompt engineering. (IMO it could be interesting to use few-shot exemplars at fine-tuning time, so that less tuning will be required (as the initial performance becomes better), which reduces a possible negative impact of instruction tuning caused by catastrophic forgetting.)
- Prompt tuning works better with FLAN than LaMDA-PT: instruction-tuned models respond better to continuous inputs from prompt tuning. So, instruction tuning can result in a checkpoint that is more desirable for performing NLP tasks.
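
A minimal sketch of rank classification over the listed options (see the OPTIONS note above): score each answer option by its log-likelihood under the LM given the prompt and predict the highest-scoring one. The model (`gpt2` as a small stand-in, not FLAN) and the prompt wording are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def option_logprob(prompt: str, option: str) -> float:
    """Sum of log P(option tokens | prompt) under the causal LM."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-probability of every token given the tokens before it
    log_probs = logits[:, :-1].log_softmax(-1).gather(2, full_ids[:, 1:, None]).squeeze(-1)
    # keep only the positions belonging to the option continuation
    return log_probs[0, prompt_ids.shape[1] - 1:].sum().item()

prompt = ('Premise: "I hate pigeons." Does this imply "I like pigeons"?\n'
          "OPTIONS:\n- yes\n- no\nAnswer:")
options = [" yes", " no"]   # leading space keeps tokenization consistent with the prompt
print(max(options, key=lambda opt: option_logprob(prompt, opt)).strip())
```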
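
And a toy illustration of the few-shot exemplar format "instruct(x1) y1 ... instruct(xk) yk instruct(x)"; the instruction wording is a made-up placeholder, not an actual FLAN template.

```python
def instruct(x: str) -> str:
    """Placeholder instruction template for a single input."""
    return f"Translate to French: {x}\n"

exemplars = [("The dog runs.", "Le chien court."),
             ("I like tea.", "J'aime le thé.")]

def few_shot_prompt(query: str) -> str:
    # Concatenate instructed exemplars with their answers, then the instructed query.
    shots = "".join(instruct(x) + y + "\n\n" for x, y in exemplars)
    return shots + instruct(query)

print(few_shot_prompt("The cat sleeps."))
```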

@article{Schick2020Jan,
author = {Schick, Timo and Sch{\ifmmode\ddot{u}\else\"{u}\fi}tze, Hinrich},
title = {{Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference}},
journal = {arXiv},
year = {2020},
month = jan,
eprint = {2001.07676},
doi = {10.48550/arXiv.2001.07676}
}
- Solving a task from only a few examples becomes much easier when we also have a task description.
- We propose Pattern-Exploiting Training (PET).
- Given a masked language model and some dataset of pairs (X, Y), we define a pattern-verbalizer pair (PVP). First, we design a pattern that converts X into a cloze question with exactly one mask token. Second, we define a verbalizer that converts Y to a single word token from the model's vocabulary, to substitute it into the cloze question.
- For example, consider the task of identifying whether two sentences contradict or agree with each other. We may choose the pattern A? <mask> B, where <mask> can be either "Yes" or "No" (see the sketch after this list).
- We assume access to a small training set of pairs (X, Y) and a much larger set of unlabeled samples X. Using our pattern-verbalizer pair, we can fine-tune a masked language model on the training set. To mitigate catastrophic forgetting, we also employ language modeling on the unlabeled set (after applying our pattern) as an auxiliary task.
- We usually have several patterns, and we fine-tune a separate LM on each pattern with the described procedure. We use the ensemble of finetuned models to annotate examples from the unlabeled set.
- Finally, we fine-tune a language model on both the original and pseudo-labeled sets with the described procedure. The fine-tuned model serves as a classifier.
- As some patterns perform (possibly much) worse than others, the training set for our final model may contain many mislabeled examples. To compensate for this, we devise iPET, an iterative variant of PET. In iPET, we train several generations of models on datasets of increasing size.
- When the initial amount of training data is limited, PET gives large improvements over standard supervised training and strong semi-supervised approaches.
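
A minimal sketch of classifying with a pattern-verbalizer pair, as in the example above: the pattern "A? <mask> B" is filled by an off-the-shelf masked LM (no PET fine-tuning here), and the verbalizer maps the predicted word back to a label. The model choice, pattern wording, and verbalizer words are assumptions.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

VERBALIZER = {"agree": " Yes", "contradict": " No"}   # label -> single word token

def pvp_predict(sentence_a: str, sentence_b: str) -> str:
    # Pattern "A? <mask> B": exactly one mask token, to be filled by a verbalizer word.
    text = f"{sentence_a}? {tok.mask_token} {sentence_b}"
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits
    mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
    # score each label by the logit of its verbalizer word at the mask position
    scores = {
        label: logits[0, mask_pos, tok.encode(word, add_special_tokens=False)[0]].item()
        for label, word in VERBALIZER.items()
    }
    return max(scores, key=scores.get)

print(pvp_predict("The weather is great", "I want to stay home"))
```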
