This repository has been archived by the owner on Feb 12, 2024. It is now read-only.
Open Phi Collaboration Meeting #1 ‐ 10‐09‐2023
emrgnt-cmplxty edited this page Oct 10, 2023
1. Model Training:
- Model Training Stages:
- Pretraining a general-purpose LLM.
- Fine-tuning on specialized corpus.
- End task classification or RLHF.
- Jeremy's Call to Action:
- Proposal for a continuous one-step training process, progressively refining document quality.
- Emphasis on considering the creation of a general-purpose classifier.
- Nate at Carper has fine-tuned a 1B LLM to select the best texts for this purpose.
- Autometa mentioned potentially using Lilac to discriminate among filtered data.
2. More on Real Web Data and Classifier Model:
- Highlighting the importance of real web data.
- Strategy: Use vast datasets with web data, instructions, and synthetic textbooks.
- Differentiation based on quality, accuracy, and utility.
3. Phi Model Issues and Replication:
- Challenges with generating 10B high-quality, diverse tokens.
- Emphasis on replicating the Phi model and scaling to 3B or 7B models.
- Nate from Carper AI and his replication work are of significant relevance.
4. Data Annotation, Generation, and Control:
- Autometa suggests Lilac as a data annotation platform.
- Prioritize creating differentiated data and intelligent sampling.
- Potential to use an LLM for YouTube subtitle augmentation, improving coherence and quality.
5. Open Source Texts and Collaboration:
- Exploration of piloting open source texts.
- Potential collaboration with OpenSyllabus discussed, given their 24M course syllabi dataset.
6. Semantic Web and Wikidata:
- Examining the connection between LLMs and the semantic web.
- Wikidata's potential and challenges in foreign languages were highlighted.
7. Feedback Loop and Dataset Quality:
- Importance of monitoring dataset quality and using perplexity as a learning proxy.
- Consider a feedback loop mechanism and optimize weighting across datasets.
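As a point of reference for the perplexity proxy mentioned above, here is a minimal sketch of how perplexity is computed from per-token log-probabilities. The function name and the sample numbers are illustrative assumptions, not something agreed in the meeting; in practice the log-probabilities would come from whichever model is being evaluated.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the mean negative log-likelihood.

    token_log_probs: natural-log probabilities the model assigned to each
    observed token (hypothetical input; produced by a real model in practice).
    """
    if not token_log_probs:
        raise ValueError("token_log_probs must not be empty")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Illustrative numbers only: a model that gives every token probability 0.25
# has a perplexity of approximately 4.
sample = [math.log(0.25)] * 4
print(perplexity(sample))
```

Lower perplexity on a held-out set suggests the model finds that text more predictable, which is why it can serve as a rough signal when comparing datasets.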
8. Phi Library and Data Types:
- Debate on what proportion of the datasets should resemble textbooks versus instructions.
- Ultimate goal: Train models for expert-level engagement, especially in chat form.
10. Future Directions:
- Full replication of phi-1 and phi-1.5 remains a primary goal.
- Jeremy's call to action emphasizes the importance of a general-purpose classifier in this work.
- Jeremy's suggestion of further pre-training models on synthetic datasets is a potential extension of this work.
- Owen's suggestion of using DoReMi with synthetic data to weight datasets for dynamic curriculum generation is a separate extension.
- Opportunities exist for collaborating on experiment design, data generation, and model fine-tuning.
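To make the dataset-weighting idea raised under Future Directions more concrete, here is a simplified sketch loosely modeled on the DoReMi-style domain-weight update: domains where a proxy model's loss exceeds a reference model's loss are upweighted, then the weights are renormalized and mixed with a uniform distribution. The function name, domain names, loss values, and hyperparameters are all assumptions for illustration, not the group's method or the DoReMi reference implementation.

```python
import math


def update_domain_weights(weights, proxy_losses, reference_losses,
                          step_size=1.0, smoothing=1e-3):
    """One simplified DoReMi-style step (hypothetical helper).

    weights, proxy_losses, reference_losses: dicts keyed by domain name.
    Returns new weights that sum to 1.
    """
    domains = list(weights)
    # Excess loss: how much worse the proxy is than the reference, floored at 0.
    excess = {d: max(proxy_losses[d] - reference_losses[d], 0.0) for d in domains}
    # Multiplicative update, then renormalize.
    raw = {d: weights[d] * math.exp(step_size * excess[d]) for d in domains}
    total = sum(raw.values())
    # Mix with a uniform distribution so no domain's weight collapses to zero.
    uniform = 1.0 / len(domains)
    return {d: (1.0 - smoothing) * raw[d] / total + smoothing * uniform
            for d in domains}


# Made-up losses for three made-up domains, for illustration only.
start = {"web": 1 / 3, "instructions": 1 / 3, "textbooks": 1 / 3}
proxy = {"web": 2.0, "instructions": 1.5, "textbooks": 3.0}
reference = {"web": 2.0, "instructions": 1.6, "textbooks": 2.5}
print(update_domain_weights(start, proxy, reference))
```

In this made-up example only the "textbooks" domain has positive excess loss, so it ends up with the largest weight; repeating the step over training would give a curriculum that shifts as the losses change.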