This repository has been archived by the owner on Feb 12, 2024. It is now read-only.

Commit

Feature/fix model setting bug (#47)
* Add work in progress yaml, in case anyone would like to try more books

* fix type error

* Fix model setting bug

* Cleanup
emrgnt-cmplxty authored Oct 1, 2023
1 parent fe585ab commit 946f4f2
Showing 7 changed files with 60 additions and 70 deletions.
50 changes: 21 additions & 29 deletions README.md
@@ -6,10 +6,6 @@ SciPhi is a Python package offering:
- Configurable generation of LLM-mediated synthetic training/tuning data.
- Seamless LLM-mediated evaluation of model output.

<p align="center">
<img width="524" alt="Screenshot 2023-09-18 at 9 53 55 AM" src="https://github.com/emrgnt-cmplxty/SciPhi/assets/68796651/9731f891-1d99-432a-aaec-37916bc6362f">
</p>

## **Questions?**

- Join our [Discord community](https://discord.gg/j9GxfbxqAe).
@@ -62,7 +58,6 @@ Options include:
**Overview:**
The Library of Phi is an initiative sponsored by SciPhi. Its primary goal is to democratize access to high-quality textbooks. The project utilizes AI-driven techniques to generate textbooks by processing information from the MIT OCW course webpages.


**Workflow:**
The workflow encompasses data scraping, data processing, YAML configuration creation, and RAG execution over Wikipedia, with intermittent work done by LLMs.

@@ -77,6 +72,8 @@ The workflow encompasses data scraping, data processing, YAML configuration crea
poetry run python sciphi/examples/library_of_phi/generate_textbook.py run --do-wiki=False --textbook=Introduction_to_Deep_Learning
```

__[See the example output here](sciphi/data/library_of_phi/Introduction_to_Deep_Learning.md)__

#### **Using a Custom Table of Contents:**

1. Draft a table of contents and save as `textbook_name.yaml`.
@@ -99,7 +96,24 @@ Generated textbooks reside in:

---

### Replicating Full Table of Contents Generation
### **Customizable Runner Script**

For flexible applications, execute the relevant `runner.py` with various command-line arguments.

```bash
poetry run python sciphi/examples/basic_data_gen/runner.py --provider_name=openai --model_name=gpt-4 --log_level=INFO --batch_size=1 --num_samples=1 --output_file_name=example_output.jsonl --example_config=textbooks_are_all_you_need_basic_split
```

The above command generates a single sample from GPT-4 using the `textbooks_are_all_you_need_basic_split` configuration and saves the output to `example_output.jsonl`. The long-term vision is for this framework to function as pictured below:
<p align="center">
<img width="524" alt="Screenshot 2023-09-18 at 9 53 55 AM" src="https://github.com/emrgnt-cmplxty/SciPhi/assets/68796651/9731f891-1d99-432a-aaec-37916bc6362f">
</p>

#### **Command-Line Arguments**

See arguments and their default values in the README. Notable ones include `--provider`, `--model_name`, and `--temperature`.

### **Replicating Full Table of Contents Generation**

**Step 0**: Scrape MIT OCW for course details.

Expand All @@ -125,33 +139,11 @@ poetry run python sciphi/examples/library_of_phi/gen_step_2_clean_syllabi.py run
poetry run python sciphi/examples/library_of_phi/gen_step_3_table_of_contents.py run
```

### Customizable Runner Script

For flexible applications, execute the relevant `runner.py` with various command-line arguments.

```bash
poetry run python sciphi/examples/basic_data_gen/runner.py --provider_name=openai --model_name=gpt-4 --log_level=INFO --batch_size=1 --num_samples=1 --output_file_name=example_output.jsonl --example_config=textbooks_are_all_you_need_basic_split
```

### Command-Line Arguments

See arguments and their default values in the README. Notable ones include `--provider`, `--model_name`, and `--temperature`.

### Example Generated Data

<p align="center">
<img width="776" alt="Screenshot 2023-09-17 at 11 11 18 PM" src="https://github.com/emrgnt-cmplxty/SciPhi/assets/68796651/8f1ef11d-cd37-4fc7-a7a0-a1e0159ba4a3">
</p>

## Development

Use SciPhi to craft synthetic data for a given LLM provider. Check the provided code for an example.

### License

Licensed under the Apache-2.0 License.

### Referenced Datasets
### Created Datasets

1. [Python Synthetic Textbooks](https://huggingface.co/datasets/emrgnt-cmplxty/sciphi-python-textbook/viewer/default/train)
2. [Textbooks are all you need](https://huggingface.co/datasets/emrgnt-cmplxty/sciphi-textbooks-are-all-you-need)
5 changes: 2 additions & 3 deletions sciphi/data/library_of_phi/Introduction_to_Deep_Learning.md
@@ -1,4 +1,4 @@
# NOTE - THIS TEXTBOOK WAS GENERATED WITH AI.
# NOTE - THIS TEXTBOOK WAS AI GENERATED

# Table of Contents

@@ -2041,5 +2041,4 @@ It is worth noting that while manifold alignment can produce accurate alignments

To illustrate the concept further, let's consider an example from the field of speech recognition. Speaker adaptation, an essential technology for fine-tuning speech models, often encounters inter-speaker variation as a mismatch between training and testing speakers. Kernel eigenvoice (KEV) is a non-linear adaptation technique that incorporates kernel principal component analysis to capture higher-order correlations and enhance recognition performance. By applying KEV, it becomes possible to adapt the speaker models based on prior knowledge of training speakers, even with limited adaptation data. This demonstrates the efficacy of feature-level adaptation in addressing domain-specific challenges.

In summary, feature-level adaptation, particularly through techniques like manifold alignment, plays a crucial role in domain adaptation. By aligning the feature representations of the source and target domains, feature-level adaptation enables the transfer of knowledge from a source domain to a target domain with a different data distribution. This technique is valuable in various real-world applications and facilitates transfer learning, where knowledge from one domain is leveraged to improve performance in related domains.

In summary, feature-level adaptation, particularly through techniques like manifold alignment, plays a crucial role in domain adaptation. By aligning the feature representations of the source and target domains, feature-level adaptation enables the transfer of knowledge from a source domain to a target domain with a different data distribution. This technique is valuable in various real-world applications and facilitates transfer learning, where knowledge from one domain is leveraged to improve performance in related domains.
19 changes: 17 additions & 2 deletions sciphi/examples/helpers.py
@@ -8,7 +8,9 @@
import yaml
from requests.auth import HTTPBasicAuth

from sciphi.interface import ProviderName
from sciphi.interface import InterfaceManager, ProviderName
from sciphi.interface.base import LLMInterface
from sciphi.llm import LLMConfigManager


def gen_llm_config(args: argparse.Namespace) -> dict:
@@ -308,7 +310,7 @@ def format_yaml_line(line: str, index: int, split_lines: list[str]) -> str:
line = (
line[:first_non_blank_char]
+ '"'
+ line[first_non_blank_char:-1]
+ line[first_non_blank_char:]
+ '":'
)
return line
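The off-by-one fix in the hunk above can be illustrated with a self-contained sketch (simplified names, not the actual sciphi helper): slicing with `[first_non_blank_char:-1]` silently dropped the final character of every quoted key, while the corrected slice runs to the end of the line.

```python
def quote_yaml_key(line: str) -> str:
    # Wrap a YAML line's content in quotes, preserving leading indentation.
    # Buggy version: line[first_non_blank:-1] dropped the last character.
    # Fixed version: slice through to the end of the line.
    first_non_blank = len(line) - len(line.lstrip())
    return line[:first_non_blank] + '"' + line[first_non_blank:] + '":'

print(quote_yaml_key("  Chapter 1"))  # →   "Chapter 1":
```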
@@ -397,3 +399,16 @@ def wiki_search_api(
else:
response.raise_for_status() # Raise an HTTPError if the HTTP request returned an unsuccessful status code
raise ValueError("Unexpected response from API")


def get_default_settings_provider(
provider: str, model_name: str, max_tokens_to_sample=None
) -> LLMInterface:
"""Get the default LLM config and provider for the given provider and model name."""

provider_name = ProviderName(provider)
llm_config = LLMConfigManager.get_config_for_provider(
provider_name
).create(max_tokens_to_sample=max_tokens_to_sample, model_name=model_name)

return InterfaceManager.get_provider(provider_name, model_name, llm_config)
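The new `get_default_settings_provider` helper collapses the three-step setup (`ProviderName` → `LLMConfigManager` → `InterfaceManager`) that was previously repeated in each runner script into one call. A minimal, self-contained mimic of that factory pattern (all names below are illustrative, not the sciphi API):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Provider(Enum):
    OPENAI = "openai"

@dataclass
class LLMSettings:
    provider: Provider
    model_name: str
    max_tokens_to_sample: Optional[int] = None

class LLMClient:
    """Stand-in for the provider interface handed back to callers."""
    def __init__(self, settings: LLMSettings):
        self.settings = settings

def get_default_client(provider: str, model_name: str,
                       max_tokens_to_sample: Optional[int] = None) -> LLMClient:
    # Resolve the provider string to an enum, build its config, and
    # return a ready-to-use interface -- one call instead of three.
    settings = LLMSettings(Provider(provider), model_name, max_tokens_to_sample)
    return LLMClient(settings)

client = get_default_client("openai", "gpt-4-0613")
print(client.settings.model_name)  # → gpt-4-0613
```

Centralizing the setup this way is also what fixes the model-setting bug the commit title refers to: every runner now passes `model_name` through the same code path instead of hand-rolling its own (sometimes inconsistent) wiring.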
11 changes: 3 additions & 8 deletions sciphi/examples/library_of_phi/gen_step_1_draft_syllabi.py
@@ -48,9 +48,8 @@
import fire
import yaml

from sciphi.examples.helpers import get_default_settings_provider
from sciphi.examples.library_of_phi.prompts import SYLLABI_CREATION_PROMPT
from sciphi.interface import InterfaceManager, ProviderName
from sciphi.llm import LLMConfigManager


def extract_data_from_record(record: dict[str, str]) -> tuple[dict, str]:
@@ -137,12 +136,8 @@ def run(self) -> None:
"""Run the draft YAML generation process."""
yaml.add_representer(str, quoted_presenter)

provider_name = ProviderName(self.provider)
llm_config = LLMConfigManager.get_config_for_provider(
provider_name
).create(max_tokens_to_sample=None)
llm_provider = InterfaceManager.get_provider(
provider_name, self.model_name, llm_config
llm_provider = get_default_settings_provider(
provider=self.provider, model_name=self.model_name
)
if not self.data_directory:
file_path = os.path.dirname(os.path.abspath(__file__))
14 changes: 4 additions & 10 deletions sciphi/examples/library_of_phi/gen_step_3_table_of_contents.py
@@ -36,9 +36,8 @@

import fire

from sciphi.examples.helpers import get_default_settings_provider
from sciphi.examples.library_of_phi.prompts import TABLE_OF_CONTENTS_PROMPT
from sciphi.interface import InterfaceManager, ProviderName
from sciphi.llm import LLMConfigManager


class TableOfContentsRunner:
@@ -47,7 +46,7 @@ class TableOfContentsRunner:
def __init__(
self,
input_rel_dir: str = "output_step_2",
output_rel_dir: str = "table_of_contents",
output_rel_dir: str = "output_step_3",
data_directory=None,
provider: str = "openai",
model_name: str = "gpt-4-0613",
@@ -70,14 +69,9 @@ def run(self):
)

# Build an LLM and provider interface
provider_name = ProviderName(self.provider)
llm_config = LLMConfigManager.get_config_for_provider(
provider_name
).create(max_tokens_to_sample=None)
llm_provider = InterfaceManager.get_provider(
provider_name, self.model_name, llm_config
llm_provider = get_default_settings_provider(
provider=self.provider, model_name=self.model_name
)

input_dir = os.path.join(self.data_directory, self.input_rel_dir)
output_dir = os.path.join(self.data_directory, self.output_rel_dir)

29 changes: 12 additions & 17 deletions sciphi/examples/library_of_phi/gen_step_4_draft_book.py
@@ -16,8 +16,8 @@
Parameters:
provider (str):
The provider to use. Default is 'openai'.
model (str):
The model name to use. Default is 'gpt-3.5-turbo-instruct'.
model_name (str):
The model_name to use. Default is 'gpt-3.5-turbo-instruct'.
parsed_dir (str):
Directory containing parsed data. Default is 'raw_data'.
toc_dir (str):
@@ -48,15 +48,17 @@

import fire

from sciphi.examples.helpers import load_yaml_file, wiki_search_api
from sciphi.examples.helpers import (
get_default_settings_provider,
load_yaml_file,
wiki_search_api,
)
from sciphi.examples.library_of_phi.prompts import (
BOOK_BULK_PROMPT,
BOOK_CHAPTER_INTRODUCTION_PROMPT,
BOOK_CHAPTER_SUMMARY_PROMPT,
BOOK_FOREWARD_PROMPT,
)
from sciphi.interface import InterfaceManager, ProviderName
from sciphi.llm import LLMConfigManager
from sciphi.writers import RawDataWriter

logger = logging.getLogger(__name__)
@@ -102,7 +104,7 @@ class TextbookContentGenerator:
def __init__(
self,
provider="openai",
model="gpt-4-0613",
model_name="gpt-4-0613",
parsed_dir="raw_data",
toc_dir="table_of_contents",
output_dir="output_step_4",
@@ -116,7 +118,7 @@ def __init__(
log_level="INFO",
):
self.provider = provider
self.model = model
self.model_name = model_name
self.parsed_dir = parsed_dir
self.toc_dir = toc_dir
self.output_dir = output_dir
@@ -146,22 +148,15 @@ def run(self):
)
yml_config = load_yaml_file(yml_file_path, do_prep=True)

# Build an LLM and provider interface
provider_name = ProviderName(self.provider)
llm_config = LLMConfigManager.get_config_for_provider(
provider_name
).create(max_tokens_to_sample=None)
llm_provider = InterfaceManager.get_provider(
provider_name, self.model, llm_config
)

# Create an instance of the generator
traversal_generator = traverse_config(yml_config)

output_path = os.path.join(
local_pwd, self.parsed_dir, self.output_dir, f"{self.textbook}.md"
)

llm_provider = get_default_settings_provider(
provider=self.provider, model_name=self.model_name
)
if not os.path.exists(os.path.dirname(output_path)):
os.makedirs(os.path.dirname(output_path))
logger.info(f"Saving textbook to {output_path}")
2 changes: 1 addition & 1 deletion sciphi/llm/openai_llm.py
@@ -16,7 +16,7 @@ class OpenAIConfig(LLMConfig):
# Base
provider_name: ProviderName = ProviderName.OPENAI
model_name: str = "gpt-3.5-turbo"
temperature: float = 0.7
temperature: float = 0.1
top_p: float = 1.0

# OpenAI Extras
