This repository has been archived by the owner on Feb 12, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Simplify book geneation pipeline (#72)
* Simplify book geneation pipeline
* fix gen pipeline
* Add deprecated

Co-authored-by: EC2 Default User <[email protected]>
1 parent 29a969b, commit bc836fe
Showing 60 changed files with 1,249 additions and 78,400 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,14 +1,19 @@ | ||
# Local cruft and other | ||
poetry.lock | ||
.env | ||
.vscode | ||
*.out | ||
*.bak | ||
*.log | ||
*.sh | ||
**/__pycache__/** | ||
**/.DS_Store | ||
storeage/ | ||
|
||
textbooks/ | ||
# Local sandbox environments | ||
playground/ | ||
outputs/ | ||
dump/ | ||
|
||
# Scraped data | ||
sciphi/examples/library_of_phi/raw_data/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,41 +4,28 @@ | |
<img width="716" alt="Screenshot 2023-10-01 at 10 45 12 AM" src="https://github.com/emrgnt-cmplxty/sciphi/assets/68796651/c4192288-b5af-4ef8-9774-82b3bb5c8251"> | ||
</p> | ||
|
||
<!-- ## **Overview** --> | ||
|
||
**SciPhi** is a configurable Python framework designed to tackle the challenges of efficiently training LLM (Large Language Model) through synthetic data. At its core, SciPhi offers: | ||
**SciPhi** is a configurable Python framework designed to tackle the challenges of efficiently training powerful LLMs (Large Language Models) through synthetic data. At its core, SciPhi offers: | ||
|
||
- **Configurable Data Generation**: Efficiently produce LLM-mediated synthetic training and tuning datasets tailored to your specific needs. | ||
|
||
<!-- - Seamless Model Evaluation: Streamline the process of evaluating model outputs using our integrated LLM-mediated tools _(under construction)_. --> | ||
|
||
- The Library of Phi: An initiative to leverage AI-driven techniques to craft high-quality open source textbooks. | ||
- **The Library of Phi**: An initiative to leverage AI-driven techniques to craft high-quality open source textbooks. | ||
|
||
## **Getting Started & Support** | ||
|
||
<!-- - New to SciPhi? Kickstart your journey with our comprehensive [tutorial](https://substack.com/inbox/post/137197905). --> | ||
|
||
- Engage with our active [Discord community](https://discord.gg/j9GxfbxqAe) for discussions, troubleshooting, and collaboration. | ||
|
||
- For specialized support or collaboration inquiries, feel free to [reach out directly](mailto:[email protected]). | ||
|
||
## **Library of Phi Generation** | ||
|
||
**Introduction:** | ||
The Library of Phi is an initiative sponsored by SciPhi. Its primary goal is to democratize access to high-quality textbooks. The project utilizes AI-driven techniques to generate textbooks by processing information from the MIT OCW course webpages. | ||
" | ||
**Workflow:** | ||
The workflow encompasses data scraping, data processing, YAML configuration creation, and [RAG](https://research.ibm.com/blog/retrieval-augmented-generation-RAG) over all of Wikipedia, with intermittent work done by LLMs. | ||
|
||
1. Scrape MIT OCW Course Webpages. | ||
2. Extract Syllabi. | ||
3. Formulate Table of Contents. | ||
4. Craft Textbooks. | ||
The Library of Phi is an initiative sponsored by SciPhi. Its primary goal is to democratize access to high-quality textbooks. The project uses AI-driven techniques to combine raw information (such as tables of contents) with unstructured data (such as vector databases) to generate high-quality, factually grounded textbooks. | ||
|
||
#### **Generating the default Textbook:** | ||
|
||
```bash | ||
poetry run python sciphi/examples/library_of_phi/generate_textbook.py run --do-wiki=False --textbook=Aerodynamics_of_Viscous_Fluids --log-level=DEBUG | ||
# Note, rather than passing arguments in the command line, you can modify the default settings in config/generation_settings/book_draft_settings.yml | ||
poetry run python sciphi/examples/library_of_phi/generate_textbook.py run --llm-provider=openai --llm_model_name=gpt-3.5-turbo --do-rag=False --textbook=Aerodynamics_of_Viscous_Fluids --filter_existing_books=False --log-level=debug | ||
``` | ||
|
||
_[See the example output here](sciphi/data/library_of_phi/sample/Aerodynamics_of_Viscous_Fluids.md)_ | ||
|
@@ -47,15 +34,15 @@ _[See the example output here](sciphi/data/library_of_phi/sample/Aerodynamics_of | |
|
||
1. Draft a table of contents and save as `textbook_name.yaml`. | ||
2. Place it in `[Your Working Directory]/sciphi/data/library_of_phi/table_of_contents`. | ||
3. Format similarly to `Aerodynamics_of_Viscous_Fluids.yaml`. | ||
3. Format identically to `Aerodynamics_of_Viscous_Fluids.yaml`. | ||
|
||
#### **Incorporating RAG over Wikipedia:** | ||
#### **Incorporating RAG:** | ||
|
||
1. Enable the `--do-wiki` flag: `True`. | ||
1. Enable the `--do-rag` flag: `True`. | ||
2. In `.env`, set: | ||
- `WIKI_SERVER_URL` | ||
- `WIKI_SERVER_USERNAME` | ||
- `WIKI_SERVER_PASSWORD` | ||
- `RAG_SERVER_URL` | ||
- `RAG_SERVER_USERNAME` | ||
- `RAG_SERVER_PASSWORD` | ||
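
The three `RAG_*` variables listed above have to reach the process environment before generation starts. A minimal sketch of how they might be validated is shown here; the function name `load_rag_settings` and the example values are hypothetical and not taken from the repository, and the sketch assumes the contents of `.env` have already been exported into the environment (for example by a dotenv-style loader).

```python
import os
from typing import Dict, Mapping, Optional

# Variable names as listed in the README hunk above.
REQUIRED_RAG_VARS = ("RAG_SERVER_URL", "RAG_SERVER_USERNAME", "RAG_SERVER_PASSWORD")


def load_rag_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect the RAG connection settings, failing early if any is unset or empty.

    Hypothetical helper for illustration; not part of the SciPhi code base.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_RAG_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing RAG settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_RAG_VARS}


# Example with an explicit mapping, so the sketch runs without a real .env file.
example_env = {
    "RAG_SERVER_URL": "http://localhost:8000",  # placeholder value
    "RAG_SERVER_USERNAME": "demo-user",         # placeholder value
    "RAG_SERVER_PASSWORD": "demo-password",     # placeholder value
}
rag_settings = load_rag_settings(example_env)
```

Failing early on a missing variable keeps a long generation run from starting with `--do-rag=True` and no reachable server.
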
|
||
**Output**: | ||
Generated textbooks reside in: | ||
|
@@ -124,38 +111,6 @@ The long-term view of the SciPhi framework is to provide a training-feedback loo | |
|
||
See arguments and their default values in the README. Notable ones include `--provider`, `--model_name`, and `--temperature`. | ||
|
||
### **Replicating Full Table of Contents Generation** | ||
|
||
**Step 0**: Scrape MIT OCW for course details. | ||
|
||
```bash | ||
poetry run python sciphi/examples/library_of_phi/raw_data/ocw_scraper.py scrape | ||
``` | ||
|
||
**Step 1**: Convert scraped data into 'draft' syllabi YAMLs. | ||
|
||
```bash | ||
poetry run python sciphi/examples/library_of_phi/gen_step_1_draft_syllabi.py run | ||
``` | ||
|
||
**Step 2**: Refine the draft YAML into the finalized syllabi. | ||
|
||
```bash | ||
poetry run python sciphi/examples/library_of_phi/gen_step_2_clean_syllabi.py run | ||
``` | ||
|
||
**Step 3**: Transition the syllabi to a 'draft' table of contents. | ||
|
||
```bash | ||
poetry run python sciphi/examples/library_of_phi/gen_step_3_draft_table_of_contents.py run | ||
``` | ||
|
||
**Step 4**: Produce clean table of contents YAML files. | ||
|
||
```bash | ||
poetry run python sciphi/examples/library_of_phi/gen_step_4_clean_table_of_contents.py run | ||
``` | ||
|
||
### License | ||
|
||
Licensed under the Apache-2.0 License. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
# null represents Python's None. Will be replaced with default in the code. | ||
# basic_config: | ||
data_dir: null | ||
toc_dir: table_of_contents | ||
output_dir: textbooks | ||
textbook: null | ||
log_level: INFO | ||
num_threads_per_proc: null | ||
num_processes: 1 | ||
process_num: 0 | ||
filter_existing_books: true | ||
|
||
# llm config | ||
llm_provider: openai | ||
llm_model_name: gpt-4-0613 | ||
temperature: 0.1 | ||
max_tokens_to_sample: 8192 | ||
top_k: 100 | ||
|
||
# sampling config | ||
max_related_context_to_sample: 2000 | ||
max_prev_snippet_to_sample: 2000 | ||
|
||
# rag config | ||
do_rag: true | ||
rag_server_url: null | ||
rag_username: null | ||
rag_password: null | ||
|
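
The header comment of this settings file says that `null` values are replaced with defaults in the code. A minimal sketch of that pattern follows, assuming the YAML has already been parsed into a dictionary (for instance with PyYAML's `yaml.safe_load`). The function name `resolve_settings` and the fallback values are hypothetical; the repository's actual defaults are not shown in this diff.

```python
import os
from typing import Any, Dict

# Hypothetical fallbacks used only for illustration.
FALLBACKS: Dict[str, Any] = {
    "data_dir": os.path.join(os.getcwd(), "data"),
    "num_threads_per_proc": 1,
}


def resolve_settings(parsed: Dict[str, Any], fallbacks: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `parsed` where None values are replaced by a fallback, if one exists."""
    resolved = dict(parsed)
    for key, value in parsed.items():
        if value is None and key in fallbacks:
            resolved[key] = fallbacks[key]
    return resolved


# A subset of the keys from the settings file; YAML null parses to Python None.
parsed = {
    "data_dir": None,
    "toc_dir": "table_of_contents",
    "textbook": None,
    "num_threads_per_proc": None,
    "num_processes": 1,
}
settings = resolve_settings(parsed, FALLBACKS)
```

Keys without a fallback, such as `textbook` in this sketch, stay `None`, so the caller can still tell "not provided" apart from an explicit value.
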
File renamed without changes.