This repository has been archived by the owner on Feb 12, 2024. It is now read-only.

Commit

Simplify book generation pipeline (#72)
* Simplify book generation pipeline

* fix gen pipeline

* Add deprecated

---------

Co-authored-by: EC2 Default User <[email protected]>
emrgnt-cmplxty and EC2 Default User authored Oct 13, 2023
1 parent 29a969b commit bc836fe
Showing 60 changed files with 1,249 additions and 78,400 deletions.
6 changes: 3 additions & 3 deletions .env.example
@@ -5,6 +5,6 @@ CHROMA_REMOTE_ADDR=your_chroma_db_addr
CHROMA_REMOTE_PORT="8000" # default
CHROMA_TOKEN=your_chroma_db_token
CHROMA_AUTH_PROVIDER="chromadb.auth.token.TokenAuthClientProvider"
WIKI_SERVER_URL="your_wiki_server"
WIKI_USER_NAME="your_wiki_user_name"
WIKI_USER_PASSWORD="your_wiki_user_password"
RAG_SERVER_URL="your_rag_server"
RAG_SERVER_NAME="your_rag_auth_user_name"
RAG_SERVER_PASSWORD="your_rag_auth_user_password"
7 changes: 6 additions & 1 deletion .gitignore
@@ -1,14 +1,19 @@
# Local cruft and other
poetry.lock
.env
.vscode
*.out
*.bak
*.log
*.sh
**/__pycache__/**
**/.DS_Store
storeage/

textbooks/
# Local sandbox environments
playground/
outputs/
dump/

# Scraped data
sciphi/examples/library_of_phi/raw_data/
67 changes: 11 additions & 56 deletions README.md
@@ -4,41 +4,28 @@
<img width="716" alt="Screenshot 2023-10-01 at 10 45 12 AM" src="https://github.com/emrgnt-cmplxty/sciphi/assets/68796651/c4192288-b5af-4ef8-9774-82b3bb5c8251">
</p>

<!-- ## **Overview** -->

**SciPhi** is a configurable Python framework designed to tackle the challenges of efficiently training LLM (Large Language Model) through synthetic data. At its core, SciPhi offers:
**SciPhi** is a configurable Python framework designed to tackle the challenges of efficiently training powerful LLMs (Large Language Models) through synthetic data. At its core, SciPhi offers:

- **Configurable Data Generation**: Efficiently produce LLM-mediated synthetic training and tuning datasets tailored to your specific needs.

<!-- - Seamless Model Evaluation: Streamline the process of evaluating model outputs using our integrated LLM-mediated tools _(under construction)_. -->

- The Library of Phi: An initiative to leverage AI-driven techniques to craft high-quality open source textbooks.
- **The Library of Phi**: An initiative to leverage AI-driven techniques to craft high-quality open source textbooks.

## **Getting Started & Support**

<!-- - New to SciPhi? Kickstart your journey with our comprehensive [tutorial](https://substack.com/inbox/post/137197905). -->

- Engage with our active [Discord community](https://discord.gg/j9GxfbxqAe) for discussions, troubleshooting, and collaboration.

- For specialized support or collaboration inquiries, feel free to [reach out directly](mailto:[email protected]).

## **Library of Phi Generation**

**Introduction:**
The Library of Phi is an initiative sponsored by SciPhi. Its primary goal is to democratize access to high-quality textbooks. The project utilizes AI-driven techniques to generate textbooks by processing information from the MIT OCW course webpages.
"
**Workflow:**
The workflow encompasses data scraping, data processing, YAML configuration creation, and [RAG](https://research.ibm.com/blog/retrieval-augmented-generation-RAG) over all of Wikipedia, with intermittent work done by LLMs.

1. Scrape MIT OCW Course Webpages.
2. Extract Syllabi.
3. Formulate Table of Contents.
4. Craft Textbooks.
The Library of Phi is an initiative sponsored by SciPhi. Its primary goal is to democratize access to high-quality textbooks. The project uses AI-driven techniques to generate high-quality, factually grounded textbooks by combining raw information (such as a table of contents) with unstructured data (such as vector databases).

#### **Generating the default Textbook:**

```bash
poetry run python sciphi/examples/library_of_phi/generate_textbook.py run --do-wiki=False --textbook=Aerodynamics_of_Viscous_Fluids --log-level=DEBUG
# Note, rather than passing arguments in the command line, you can modify the default settings in config/generation_settings/book_draft_settings.yml
poetry run python sciphi/examples/library_of_phi/generate_textbook.py run --llm-provider=openai --llm_model_name=gpt-3.5-turbo --do-rag=False --textbook=Aerodynamics_of_Viscous_Fluids --filter_existing_books=False --log-level=debug
```

_[See the example output here](sciphi/data/library_of_phi/sample/Aerodynamics_of_Viscous_Fluids.md)_
@@ -47,15 +34,15 @@ _[See the example output here](sciphi/data/library_of_phi/sample/Aerodynamics_of

1. Draft a table of contents and save as `textbook_name.yaml`.
2. Place it in `[Your Working Directory]/sciphi/data/library_of_phi/table_of_contents`.
3. Format similarly to `Aerodynamics_of_Viscous_Fluids.yaml`.
3. Format identically to `Aerodynamics_of_Viscous_Fluids.yaml`.

#### **Incorporating RAG over Wikipedia:**
#### **Incorporating RAG:**

1. Set the `--do-wiki` flag to `True`.
1. Set the `--do-rag` flag to `True`.
2. In `.env`, set:
- `WIKI_SERVER_URL`
- `WIKI_SERVER_USERNAME`
- `WIKI_SERVER_PASSWORD`
- `RAG_SERVER_URL`
- `RAG_SERVER_USERNAME`
- `RAG_SERVER_PASSWORD`
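
As a minimal sketch of how these settings might be read at runtime (an illustration, not the project's actual loader): the helper name `read_rag_settings` is hypothetical. Because `.env.example` in this commit names the user variable `RAG_SERVER_NAME` while the README lists `RAG_SERVER_USERNAME`, the sketch accepts either.

```python
import os
from typing import Mapping, Optional


def read_rag_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect RAG connection settings from environment-style variables.

    Hypothetical helper for illustration; not part of SciPhi.
    """
    env = os.environ if env is None else env
    settings = {
        "url": env.get("RAG_SERVER_URL"),
        # The README and .env.example disagree on this name, so try both.
        "username": env.get("RAG_SERVER_USERNAME")
        or env.get("RAG_SERVER_NAME"),
        "password": env.get("RAG_SERVER_PASSWORD"),
    }
    missing = [key for key, value in settings.items() if not value]
    if missing:
        raise ValueError(f"Missing RAG settings: {', '.join(missing)}")
    return settings


# Placeholder values taken from .env.example.
example_env = {
    "RAG_SERVER_URL": "your_rag_server",
    "RAG_SERVER_NAME": "your_rag_auth_user_name",
    "RAG_SERVER_PASSWORD": "your_rag_auth_user_password",
}
print(read_rag_settings(example_env)["username"])
```

Failing early on a missing value keeps a misconfigured `.env` from surfacing later as an authentication error in the middle of a generation run.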

**Output**:
Generated textbooks reside in:
@@ -124,38 +111,6 @@ The long-term view of the SciPhi framework is to provide a training-feedback loo

See arguments and their default values in the README. Notable ones include `--provider`, `--model_name`, and `--temperature`.

### **Replicating Full Table of Contents Generation**

**Step 0**: Scrape MIT OCW for course details.

```bash
poetry run python sciphi/examples/library_of_phi/raw_data/ocw_scraper.py scrape
```

**Step 1**: Convert scraped data into 'draft' syllabi YAMLs.

```bash
poetry run python sciphi/examples/library_of_phi/gen_step_1_draft_syllabi.py run
```

**Step 2**: Refine the draft YAML into the finalized syllabi.

```bash
poetry run python sciphi/examples/library_of_phi/gen_step_2_clean_syllabi.py run
```

**Step 3**: Transition the syllabi to a 'draft' table of contents.

```bash
poetry run python sciphi/examples/library_of_phi/gen_step_3_draft_table_of_contents.py run
```

**Step 4**: Produce clean table of contents YAML files.

```bash
poetry run python sciphi/examples/library_of_phi/gen_step_4_clean_table_of_contents.py run
```

### License

Licensed under the Apache-2.0 License.
7 changes: 4 additions & 3 deletions pyproject.toml
@@ -22,7 +22,6 @@ anthropic = { version = "^0.3.10", optional = true }
# hf
accelerate = { version = "^0.23.0", optional = true }
datasets = { version = "^2.14.5", optional = true }
torch = { version = "^2.0.1", optional = true }
transformers = { version = "^4.33.1", optional = true }
# openai
openai = { version = "0.27.8", optional = true }
@@ -31,7 +30,7 @@ plotly = {version = "^5.17.0", optional = true}
scipy = {version = "^1.11.2", optional = true}
scikit-learn = {version = "^1.3.1", optional = true}
# vllm
vllm = { version = "0.1.7", optional = true }
vllm = { version = "0.2.0", optional = true }
# llama-index
llama-index = { version = "^0.8.29.post1", optional = true }
# chroma
@@ -40,6 +39,8 @@ retrying = "^1.3.4"
fire = "^0.5.0"
tiktoken = "^0.5.1"
bs4 = "^0.0.1"
sentencepiece = "^0.1.99"
torch = "^2.1.0"

[tool.poetry.extras]
anthropic_support = ["anthropic"]
@@ -69,7 +70,7 @@ line-length = 79

[tool.mypy]
ignore_missing_imports = true
exclude = 'playground/.*'
exclude = 'playground/.*|deprecated/*|dump/*'

[[tool.mypy.overrides]]
module = "yaml"
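
The new `exclude` value is a regular expression rather than a glob, so `deprecated/*` reads as `deprecated` followed by zero or more slashes. Assuming mypy applies the pattern as a search over each candidate path, the sketch below shows what that implies. The file paths are invented for illustration.

```python
import re

# Pattern copied from the updated [tool.mypy] section.
EXCLUDE = r"playground/.*|deprecated/*|dump/*"

# Hypothetical paths, used only to illustrate the matching behaviour.
paths = [
    "playground/scratch.py",
    "deprecated/old_pipeline.py",
    "dump/output.py",
    "sciphi/deprecated_helpers.py",  # no slash after "deprecated"
    "sciphi/core/utils.py",
]

# Assumption: a path is excluded when the pattern matches anywhere in it.
excluded = {path: bool(re.search(EXCLUDE, path)) for path in paths}
for path, is_excluded in excluded.items():
    print(f"{path}: {'excluded' if is_excluded else 'checked'}")
```

Under that assumption, a file such as `sciphi/deprecated_helpers.py` would also be skipped. If only directory contents are meant to be excluded, a pattern like `deprecated/.*` would require the slash.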
29 changes: 29 additions & 0 deletions sciphi/config/generation_settings/book_draft_settings.yml
@@ -0,0 +1,29 @@
# null represents Python's None. Will be replaced with default in the code.
# basic_config:
data_dir: null
toc_dir: table_of_contents
output_dir: textbooks
textbook: null
log_level: INFO
num_threads_per_proc: null
num_processes: 1
process_num: 0
filter_existing_books: true

# llm config
llm_provider: openai
llm_model_name: gpt-4-0613
temperature: 0.1
max_tokens_to_sample: 8192
top_k: 100

# sampling config
max_related_context_to_sample: 2000
max_prev_snippet_to_sample: 2000

# rag config
do_rag: true
rag_server_url: null
rag_username: null
rag_password: null
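
A minimal sketch of the "null means use the code default" convention described in the first comment of this file. It assumes the YAML has already been parsed into a dictionary (for example by a YAML library). The `CODE_DEFAULTS` values and the `resolve_settings` helper are hypothetical and do not reflect SciPhi's actual defaults.

```python
from typing import Any, Dict, Optional

# Subset of book_draft_settings.yml as it would look after parsing;
# YAML null becomes Python None.
parsed_settings: Dict[str, Any] = {
    "data_dir": None,
    "toc_dir": "table_of_contents",
    "output_dir": "textbooks",
    "num_threads_per_proc": None,
    "num_processes": 1,
    "do_rag": True,
    "rag_server_url": None,
}

# Hypothetical fallbacks, chosen only for this example.
CODE_DEFAULTS: Dict[str, Any] = {
    "data_dir": "sciphi/data/library_of_phi",
    "num_threads_per_proc": 4,
}


def resolve_settings(
    settings: Dict[str, Any],
    defaults: Dict[str, Any],
    overrides: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Fill None entries from defaults, then apply non-None overrides."""
    resolved = {
        key: defaults.get(key) if value is None else value
        for key, value in settings.items()
    }
    for key, value in (overrides or {}).items():
        if value is not None:
            resolved[key] = value
    return resolved


# Overrides stand in for command-line arguments such as --do-rag=False.
resolved = resolve_settings(
    parsed_settings, CODE_DEFAULTS, overrides={"do_rag": False}
)
print(resolved["data_dir"], resolved["do_rag"], resolved["rag_server_url"])
```

Keys with no fallback, such as `rag_server_url` here, stay `None`, so the caller can still decide whether a missing value is an error.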

50 changes: 48 additions & 2 deletions sciphi/core/utils.py
@@ -55,13 +55,59 @@ def get_data_dir() -> str:
    return os.path.join(script_dir, "..", "data")


def get_data_config_dir() -> str:
def get_config_dir() -> str:
    """Get the path to the root of the config directory."""
    script_dir = os.path.dirname(os.path.realpath(__file__))
    return os.path.join(script_dir, "..", "data", "stock_config")
    return os.path.join(script_dir, "..", "config")


def get_data_raw_dir() -> str:
    """Get the path to the root of the raw data directory."""
    script_dir = os.path.dirname(os.path.realpath(__file__))
    return os.path.join(script_dir, "..", "data", "stock_raw")


class SciPhiConfig:
    """Configuration class for SciPhi."""

    def __init__(self, dictionary):
        for key, value in dictionary.items():
            if isinstance(value, dict):
                value = SciPhiConfig(
                    value
                )  # Recursively convert nested dictionaries
            else:
                value = self._cast_to_appropriate_type(
                    value
                )  # Cast value to its appropriate type
            setattr(self, key, value)

    @staticmethod
    def _cast_to_appropriate_type(value):
        """Automatically cast a value to its appropriate type."""
        # If value is a string representation of an integer
        if isinstance(value, str) and value.isdigit():
            return int(value)
        return value

    def _update_from_dict(self, dictionary):
        """Update fields using a dictionary."""
        for key, value in dictionary.items():
            if isinstance(value, dict):
                existing_value = getattr(self, key, None)
                if existing_value and isinstance(existing_value, SciPhiConfig):
                    existing_value.update(value)
                else:
                    setattr(self, key, SciPhiConfig(value))
            else:
                setattr(
                    self, key, self._cast_to_appropriate_type(value)
                )  # Cast value to its appropriate type

    def add_field(self, key, value):
        """Add a field to the configuration."""
        setattr(self, key, value)

    def update(self, new_config_dict):
        """Update fields using a dictionary."""
        self._update_from_dict(new_config_dict)
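
A short usage sketch of the behaviour this class appears to provide. So that the example runs on its own, it repeats a condensed copy of `SciPhiConfig` rather than importing it; the keys and values (including the nested `rag` section) are made up for illustration and are not taken from the project's settings files.

```python
class SciPhiConfig:
    """Condensed copy of the class in this diff, for a standalone example."""

    def __init__(self, dictionary):
        for key, value in dictionary.items():
            if isinstance(value, dict):
                value = SciPhiConfig(value)
            else:
                value = self._cast_to_appropriate_type(value)
            setattr(self, key, value)

    @staticmethod
    def _cast_to_appropriate_type(value):
        # Only strings made purely of digits are converted to int.
        if isinstance(value, str) and value.isdigit():
            return int(value)
        return value

    def update(self, new_config_dict):
        for key, value in new_config_dict.items():
            existing = getattr(self, key, None)
            if isinstance(value, dict) and isinstance(existing, SciPhiConfig):
                existing.update(value)  # Merge into the nested config
            elif isinstance(value, dict):
                setattr(self, key, SciPhiConfig(value))
            else:
                setattr(self, key, self._cast_to_appropriate_type(value))


# Hypothetical settings; string values mimic command-line input.
config = SciPhiConfig(
    {"num_processes": "1", "temperature": "0.1", "rag": {"port": "8000"}}
)
config.update({"rag": {"url": "your_rag_server"}, "process_num": "-1"})

print(type(config.num_processes).__name__, type(config.temperature).__name__)
print(config.rag.port, config.rag.url, repr(config.process_num))
```

Because the cast relies on `str.isdigit()`, values such as `"0.1"` and `"-1"` remain strings, so callers that pass floats or negative integers as text would need to convert them separately.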