This repository has been archived by the owner on Feb 12, 2024. It is now read-only.

Commit

Feature/fix model setting bug (#47)
* Add work in progress yaml, in case anyone would like to try more books

* fix type error

* Fix model setting bug

* Cleanup
emrgnt-cmplxty authored Oct 1, 2023
1 parent fe585ab commit 946f4f2
Showing 7 changed files with 60 additions and 70 deletions.
50 changes: 21 additions & 29 deletions README.md
@@ -6,10 +6,6 @@ SciPhi is a Python package offering:
- Configurable generation of LLM-mediated synthetic training/tuning data.
- Seamless LLM-mediated evaluation of model output.

<p align="center">
<img width="524" alt="Screenshot 2023-09-18 at 9 53 55 AM" src="https://github.com/emrgnt-cmplxty/SciPhi/assets/68796651/9731f891-1d99-432a-aaec-37916bc6362f">
</p>

## **Questions?**

- Join our [Discord community](https://discord.gg/j9GxfbxqAe).
@@ -62,7 +58,6 @@ Options include:
**Overview:**
The Library of Phi is an initiative sponsored by SciPhi. Its primary goal is to democratize access to high-quality textbooks. The project utilizes AI-driven techniques to generate textbooks by processing information from the MIT OCW course webpages.


**Workflow:**
The workflow encompasses data scraping, data processing, YAML configuration creation, and RAG execution over Wikipedia, with intermittent work done by LLMs.

@@ -77,6 +72,8 @@ The workflow encompasses data scraping, data processing, YAML configuration crea
poetry run python sciphi/examples/library_of_phi/generate_textbook.py run --do-wiki=False --textbook=Introduction_to_Deep_Learning
```

__[See the example output here](sciphi/data/library_of_phi/Introduction_to_Deep_Learning.md)__

#### **Using a Custom Table of Contents:**

1. Draft a table of contents and save as `textbook_name.yaml`.
@@ -99,7 +96,24 @@ Generated textbooks reside in:

---

### Replicating Full Table of Contents Generation
### **Customizable Runner Script**

For flexible applications, execute the relevant `runner.py` with various command-line arguments.

```bash
poetry run python sciphi/examples/basic_data_gen/runner.py --provider_name=openai --model_name=gpt-4 --log_level=INFO --batch_size=1 --num_samples=1 --output_file_name=example_output.jsonl --example_config=textbooks_are_all_you_need_basic_split
```

The above command generates a single sample from GPT-4 using the `textbooks_are_all_you_need_basic_split` configuration and saves the output to `example_output.jsonl`. The long-term vision is for this framework to function as pictured below:
<p align="center">
<img width="524" alt="Screenshot 2023-09-18 at 9 53 55 AM" src="https://github.com/emrgnt-cmplxty/SciPhi/assets/68796651/9731f891-1d99-432a-aaec-37916bc6362f">
</p>

#### **Command-Line Arguments**

See arguments and their default values in the README. Notable ones include `--provider`, `--model_name`, and `--temperature`.

### **Replicating Full Table of Contents Generation**

**Step 0**: Scrape MIT OCW for course details.

Expand All @@ -125,33 +139,11 @@ poetry run python sciphi/examples/library_of_phi/gen_step_2_clean_syllabi.py run
poetry run python sciphi/examples/library_of_phi/gen_step_3_table_of_contents.py run
```

### Customizable Runner Script

For flexible applications, execute the relevant `runner.py` with various command-line arguments.

```bash
poetry run python sciphi/examples/basic_data_gen/runner.py --provider_name=openai --model_name=gpt-4 --log_level=INFO --batch_size=1 --num_samples=1 --output_file_name=example_output.jsonl --example_config=textbooks_are_all_you_need_basic_split
```

### Command-Line Arguments

See arguments and their default values in the README. Notable ones include `--provider`, `--model_name`, and `--temperature`.

### Example Generated Data

<p align="center">
<img width="776" alt="Screenshot 2023-09-17 at 11 11 18 PM" src="https://github.com/emrgnt-cmplxty/SciPhi/assets/68796651/8f1ef11d-cd37-4fc7-a7a0-a1e0159ba4a3">
</p>

## Development

Use SciPhi to craft synthetic data for a given LLM provider. Check the provided code for an example.

### License

Licensed under the Apache-2.0 License.

### Referenced Datasets
### Created Datasets

1. [Python Synthetic Textbooks](https://huggingface.co/datasets/emrgnt-cmplxty/sciphi-python-textbook/viewer/default/train)
2. [Textbooks are all you need](https://huggingface.co/datasets/emrgnt-cmplxty/sciphi-textbooks-are-all-you-need)
5 changes: 2 additions & 3 deletions sciphi/data/library_of_phi/Introduction_to_Deep_Learning.md
@@ -1,4 +1,4 @@
# NOTE - THIS TEXTBOOK WAS GENERATED WITH AI.
# NOTE - THIS TEXTBOOK WAS AI GENERATED

# Table of Contents

@@ -2041,5 +2041,4 @@ It is worth noting that while manifold alignment can produce accurate alignments

To illustrate the concept further, let's consider an example from the field of speech recognition. Speaker adaptation, an essential technology for fine-tuning speech models, often encounters inter-speaker variation as a mismatch between training and testing speakers. Kernel eigenvoice (KEV) is a non-linear adaptation technique that incorporates kernel principal component analysis to capture higher-order correlations and enhance recognition performance. By applying KEV, it becomes possible to adapt the speaker models based on prior knowledge of training speakers, even with limited adaptation data. This demonstrates the efficacy of feature-level adaptation in addressing domain-specific challenges.

In summary, feature-level adaptation, particularly through techniques like manifold alignment, plays a crucial role in domain adaptation. By aligning the feature representations of the source and target domains, feature-level adaptation enables the transfer of knowledge from a source domain to a target domain with a different data distribution. This technique is valuable in various real-world applications and facilitates transfer learning, where knowledge from one domain is leveraged to improve performance in related domains.

In summary, feature-level adaptation, particularly through techniques like manifold alignment, plays a crucial role in domain adaptation. By aligning the feature representations of the source and target domains, feature-level adaptation enables the transfer of knowledge from a source domain to a target domain with a different data distribution. This technique is valuable in various real-world applications and facilitates transfer learning, where knowledge from one domain is leveraged to improve performance in related domains.
19 changes: 17 additions & 2 deletions sciphi/examples/helpers.py
@@ -8,7 +8,9 @@
import yaml
from requests.auth import HTTPBasicAuth

from sciphi.interface import ProviderName
from sciphi.interface import InterfaceManager, ProviderName
from sciphi.interface.base import LLMInterface
from sciphi.llm import LLMConfigManager


def gen_llm_config(args: argparse.Namespace) -> dict:
@@ -308,7 +310,7 @@ def format_yaml_line(line: str, index: int, split_lines: list[str]) -> str:
line = (
line[:first_non_blank_char]
+ '"'
+ line[first_non_blank_char:-1]
+ line[first_non_blank_char:]
+ '":'
)
return line
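The off-by-one fix in the hunk above can be illustrated with a self-contained sketch (simplified names, not the actual sciphi helper): slicing with `[first_non_blank_char:-1]` silently dropped the final character of every quoted key, while the corrected slice runs to the end of the line.

```python
def quote_yaml_key(line: str) -> str:
    # Wrap a YAML line's content in quotes, preserving leading indentation.
    # Buggy version: line[first_non_blank:-1] dropped the last character.
    # Fixed version: slice through to the end of the line.
    first_non_blank = len(line) - len(line.lstrip())
    return line[:first_non_blank] + '"' + line[first_non_blank:] + '":'

print(quote_yaml_key("  Chapter 1"))  # →   "Chapter 1":
```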
@@ -397,3 +399,16 @@ def wiki_search_api(
else:
response.raise_for_status() # Raise an HTTPError if the HTTP request returned an unsuccessful status code
raise ValueError("Unexpected response from API")


def get_default_settings_provider(
provider: str, model_name: str, max_tokens_to_sample=None
) -> LLMInterface:
"""Get the default LLM config and provider for the given provider and model name."""

provider_name = ProviderName(provider)
llm_config = LLMConfigManager.get_config_for_provider(
provider_name
).create(max_tokens_to_sample=max_tokens_to_sample, model_name=model_name)

return InterfaceManager.get_provider(provider_name, model_name, llm_config)
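The new `get_default_settings_provider` helper collapses the three-step setup (`ProviderName` → `LLMConfigManager` → `InterfaceManager`) that was previously repeated in each runner script into one call. A minimal, self-contained mimic of that factory pattern (all names below are illustrative, not the sciphi API):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Provider(Enum):
    OPENAI = "openai"

@dataclass
class LLMSettings:
    provider: Provider
    model_name: str
    max_tokens_to_sample: Optional[int] = None

class LLMClient:
    """Stand-in for the provider interface handed back to callers."""
    def __init__(self, settings: LLMSettings):
        self.settings = settings

def get_default_client(provider: str, model_name: str,
                       max_tokens_to_sample: Optional[int] = None) -> LLMClient:
    # Resolve the provider string to an enum, build its config, and
    # return a ready-to-use interface -- one call instead of three.
    settings = LLMSettings(Provider(provider), model_name, max_tokens_to_sample)
    return LLMClient(settings)

client = get_default_client("openai", "gpt-4-0613")
print(client.settings.model_name)  # → gpt-4-0613
```

Centralizing the setup this way is also what fixes the model-setting bug the commit title refers to: every runner now passes `model_name` through the same code path instead of hand-rolling its own (sometimes inconsistent) wiring.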
11 changes: 3 additions & 8 deletions sciphi/examples/library_of_phi/gen_step_1_draft_syllabi.py
@@ -48,9 +48,8 @@
import fire
import yaml

from sciphi.examples.helpers import get_default_settings_provider
from sciphi.examples.library_of_phi.prompts import SYLLABI_CREATION_PROMPT
from sciphi.interface import InterfaceManager, ProviderName
from sciphi.llm import LLMConfigManager


def extract_data_from_record(record: dict[str, str]) -> tuple[dict, str]:
@@ -137,12 +136,8 @@ def run(self) -> None:
"""Run the draft YAML generation process."""
yaml.add_representer(str, quoted_presenter)

provider_name = ProviderName(self.provider)
llm_config = LLMConfigManager.get_config_for_provider(
provider_name
).create(max_tokens_to_sample=None)
llm_provider = InterfaceManager.get_provider(
provider_name, self.model_name, llm_config
llm_provider = get_default_settings_provider(
provider=self.provider, model_name=self.model_name
)
if not self.data_directory:
file_path = os.path.dirname(os.path.abspath(__file__))
14 changes: 4 additions & 10 deletions sciphi/examples/library_of_phi/gen_step_3_table_of_contents.py
@@ -36,9 +36,8 @@

import fire

from sciphi.examples.helpers import get_default_settings_provider
from sciphi.examples.library_of_phi.prompts import TABLE_OF_CONTENTS_PROMPT
from sciphi.interface import InterfaceManager, ProviderName
from sciphi.llm import LLMConfigManager


class TableOfContentsRunner:
@@ -47,7 +46,7 @@ class TableOfContentsRunner:
def __init__(
self,
input_rel_dir: str = "output_step_2",
output_rel_dir: str = "table_of_contents",
output_rel_dir: str = "output_step_3",
data_directory=None,
provider: str = "openai",
model_name: str = "gpt-4-0613",
@@ -70,14 +69,9 @@ def run(self):
)

# Build an LLM and provider interface
provider_name = ProviderName(self.provider)
llm_config = LLMConfigManager.get_config_for_provider(
provider_name
).create(max_tokens_to_sample=None)
llm_provider = InterfaceManager.get_provider(
provider_name, self.model_name, llm_config
llm_provider = get_default_settings_provider(
provider=self.provider, model_name=self.model_name
)

input_dir = os.path.join(self.data_directory, self.input_rel_dir)
output_dir = os.path.join(self.data_directory, self.output_rel_dir)

29 changes: 12 additions & 17 deletions sciphi/examples/library_of_phi/gen_step_4_draft_book.py
@@ -16,8 +16,8 @@
Parameters:
provider (str):
The provider to use. Default is 'openai'.
model (str):
The model name to use. Default is 'gpt-3.5-turbo-instruct'.
model_name (str):
The model_name to use. Default is 'gpt-3.5-turbo-instruct'.
parsed_dir (str):
Directory containing parsed data. Default is 'raw_data'.
toc_dir (str):
@@ -48,15 +48,17 @@

import fire

from sciphi.examples.helpers import load_yaml_file, wiki_search_api
from sciphi.examples.helpers import (
get_default_settings_provider,
load_yaml_file,
wiki_search_api,
)
from sciphi.examples.library_of_phi.prompts import (
BOOK_BULK_PROMPT,
BOOK_CHAPTER_INTRODUCTION_PROMPT,
BOOK_CHAPTER_SUMMARY_PROMPT,
BOOK_FOREWARD_PROMPT,
)
from sciphi.interface import InterfaceManager, ProviderName
from sciphi.llm import LLMConfigManager
from sciphi.writers import RawDataWriter

logger = logging.getLogger(__name__)
@@ -102,7 +104,7 @@ class TextbookContentGenerator:
def __init__(
self,
provider="openai",
model="gpt-4-0613",
model_name="gpt-4-0613",
parsed_dir="raw_data",
toc_dir="table_of_contents",
output_dir="output_step_4",
@@ -116,7 +118,7 @@ def __init__(
log_level="INFO",
):
self.provider = provider
self.model = model
self.model_name = model_name
self.parsed_dir = parsed_dir
self.toc_dir = toc_dir
self.output_dir = output_dir
@@ -146,22 +148,15 @@ def run(self):
)
yml_config = load_yaml_file(yml_file_path, do_prep=True)

# Build an LLM and provider interface
provider_name = ProviderName(self.provider)
llm_config = LLMConfigManager.get_config_for_provider(
provider_name
).create(max_tokens_to_sample=None)
llm_provider = InterfaceManager.get_provider(
provider_name, self.model, llm_config
)

# Create an instance of the generator
traversal_generator = traverse_config(yml_config)

output_path = os.path.join(
local_pwd, self.parsed_dir, self.output_dir, f"{self.textbook}.md"
)

llm_provider = get_default_settings_provider(
provider=self.provider, model_name=self.model_name
)
if not os.path.exists(os.path.dirname(output_path)):
os.makedirs(os.path.dirname(output_path))
logger.info(f"Saving textbook to {output_path}")
2 changes: 1 addition & 1 deletion sciphi/llm/openai_llm.py
@@ -16,7 +16,7 @@ class OpenAIConfig(LLMConfig):
# Base
provider_name: ProviderName = ProviderName.OPENAI
model_name: str = "gpt-3.5-turbo"
temperature: float = 0.7
temperature: float = 0.1
top_p: float = 1.0

# OpenAI Extras
