diff --git a/README.md b/README.md
index 5e776c8..ebf8cb7 100644
--- a/README.md
+++ b/README.md
@@ -6,10 +6,6 @@ SciPhi is a Python package offering:
 - Configurable generation of LLM-mediated synthetic training/tuning data.
 - Seamless LLM-mediated evaluation of model output.
-
-
-
-
 ## **Questions?**
 
 Join our [Discord community](https://discord.gg/j9GxfbxqAe).
 
@@ -62,7 +58,6 @@ Options include:
 
 **Overview:**
 The Library of Phi is an initiative sponsored by SciPhi. Its primary goal is to democratize access to high-quality textbooks. The project utilizes AI-driven techniques to generate textbooks by processing information from the MIT OCW course webpages.
-
 **Workflow:**
 The workflow encompasses data scraping, data processing, YAML configuration creation, and RAG execution over Wikipedia, with intermittent work done by LLMs.
 
@@ -77,6 +72,8 @@ The workflow encompasses data scraping, data processing, YAML configuration crea
 poetry run python sciphi/examples/library_of_phi/generate_textbook.py run --do-wiki=False --textbook=Introduction_to_Deep_Learning
 ```
 
+__[See the example output here](sciphi/data/library_of_phi/Introduction_to_Deep_Learning.md)__
+
 #### **Using a Custom Table of Contents:**
 
 1. Draft a table of contents and save as `textbook_name.yaml`.
@@ -99,7 +96,24 @@ Generated textbooks reside in:
 
 ---
 
-### Replicating Full Table of Contents Generation
+### **Customizable Runner Script**
+
+For more flexible applications, execute the relevant `runner.py` with custom command-line arguments:
+
+```bash
+poetry run python sciphi/examples/basic_data_gen/runner.py --provider_name=openai --model_name=gpt-4 --log_level=INFO --batch_size=1 --num_samples=1 --output_file_name=example_output.jsonl --example_config=textbooks_are_all_you_need_basic_split
+```
+
+The above command generates a single sample from GPT-4 using the `textbooks_are_all_you_need_basic_split` configuration and saves the output to `example_output.jsonl`. The long-term vision is for this framework to function as pictured below:
+
+
+
+#### **Command-Line Arguments**
+
+See `sciphi/examples/basic_data_gen/runner.py` for the full list of arguments and their default values. Notable ones include `--provider_name`, `--model_name`, and `--temperature`.
+
+### **Replicating Full Table of Contents Generation**
 
 **Step 0**: Scrape MIT OCW for course details.
 
@@ -125,33 +139,11 @@ poetry run python sciphi/examples/library_of_phi/gen_step_2_clean_syllabi.py run
 poetry run python sciphi/examples/library_of_phi/gen_step_3_table_of_contents.py run
 ```
 
-### Customizable Runner Script
-
-For flexible applications, execute the relevant `runner.py` with various command-line arguments.
-
-```bash
-poetry run python sciphi/examples/basic_data_gen/runner.py --provider_name=openai --model_name=gpt-4 --log_level=INFO --batch_size=1 --num_samples=1 --output_file_name=example_output.jsonl --example_config=textbooks_are_all_you_need_basic_split
-```
-
-### Command-Line Arguments
-
-See arguments and their default values in the README. Notable ones include `--provider`, `--model_name`, and `--temperature`.
-
-### Example Generated Data
-
-
-
-## Development
-
-Use SciPhi to craft synthetic data for a given LLM provider. Check the provided code for an example.
-
 ### License
 
 Licensed under the Apache-2.0 License.
 
-### Referenced Datasets
+### Created Datasets
 
 1. [Python Synthetic Textbooks](https://huggingface.co/datasets/emrgnt-cmplxty/sciphi-python-textbook/viewer/default/train)
 2. [Textbooks are all you need](https://huggingface.co/datasets/emrgnt-cmplxty/sciphi-textbooks-are-all-you-need)
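The custom table-of-contents path documented above ultimately feeds a hand-written YAML file into the step-4 book generator. A minimal sketch of inspecting such a file the same way `gen_step_4_draft_book.py` loads it (the file name is the README's placeholder, and it is assumed here that `load_yaml_file` returns the parsed YAML mapping):

```python
from sciphi.examples.helpers import load_yaml_file

# "textbook_name.yaml" is the placeholder file name from the README steps above.
toc = load_yaml_file("textbook_name.yaml", do_prep=True)

# Assuming the parsed result mirrors the YAML nesting: top-level keys are
# textbook titles, and their values hold the chapter/section structure.
for title, chapters in toc.items():
    print(title, "->", list(chapters))
```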
diff --git a/sciphi/data/library_of_phi/Introduction_to_Deep_Learning.md b/sciphi/data/library_of_phi/Introduction_to_Deep_Learning.md
index 4e34c32..fcab286 100644
--- a/sciphi/data/library_of_phi/Introduction_to_Deep_Learning.md
+++ b/sciphi/data/library_of_phi/Introduction_to_Deep_Learning.md
@@ -1,4 +1,4 @@
-# NOTE - THIS TEXTBOOK WAS GENERATED WITH AI.
+# NOTE - THIS TEXTBOOK WAS AI-GENERATED
 
 # Table of Contents
 
@@ -2041,5 +2041,4 @@ It is worth noting that while manifold alignment can produce accurate alignments
 
 To illustrate the concept further, let's consider an example from the field of speech recognition. Speaker adaptation, an essential technology for fine-tuning speech models, often encounters inter-speaker variation as a mismatch between training and testing speakers. Kernel eigenvoice (KEV) is a non-linear adaptation technique that incorporates kernel principal component analysis to capture higher-order correlations and enhance recognition performance. By applying KEV, it becomes possible to adapt the speaker models based on prior knowledge of training speakers, even with limited adaptation data. This demonstrates the efficacy of feature-level adaptation in addressing domain-specific challenges.
 
-In summary, feature-level adaptation, particularly through techniques like manifold alignment, plays a crucial role in domain adaptation. By aligning the feature representations of the source and target domains, feature-level adaptation enables the transfer of knowledge from a source domain to a target domain with a different data distribution. This technique is valuable in various real-world applications and facilitates transfer learning, where knowledge from one domain is leveraged to improve performance in related domains.
-
+In summary, feature-level adaptation, particularly through techniques like manifold alignment, plays a crucial role in domain adaptation. By aligning the feature representations of the source and target domains, feature-level adaptation enables the transfer of knowledge from a source domain to a target domain with a different data distribution. This technique is valuable in various real-world applications and facilitates transfer learning, where knowledge from one domain is leveraged to improve performance in related domains.
\ No newline at end of file
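The `helpers.py` diff just below introduces `get_default_settings_provider`, and the step-1, step-3, and step-4 runners that follow replace their duplicated `ProviderName`/`LLMConfigManager`/`InterfaceManager` boilerplate with a single call to it. A minimal sketch of the resulting call pattern (the prompt string is illustrative, and the `get_completion` call is an assumption about the `LLMInterface` API):

```python
from sciphi.examples.helpers import get_default_settings_provider

# One call now wraps the ProviderName lookup, the provider-specific LLM
# config creation, and the interface construction.
llm_provider = get_default_settings_provider(
    provider="openai",        # coerced to ProviderName("openai")
    model_name="gpt-4-0613",  # forwarded to both the config and the interface
)

# Assumed interface method; the prompt is illustrative only.
print(llm_provider.get_completion("Draft a one-sentence course description."))
```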
diff --git a/sciphi/examples/helpers.py b/sciphi/examples/helpers.py
index 56fb09e..41f0eaf 100644
--- a/sciphi/examples/helpers.py
+++ b/sciphi/examples/helpers.py
@@ -8,7 +8,9 @@ import yaml
 
 from requests.auth import HTTPBasicAuth
 
-from sciphi.interface import ProviderName
+from sciphi.interface import InterfaceManager, ProviderName
+from sciphi.interface.base import LLMInterface
+from sciphi.llm import LLMConfigManager
 
 
 def gen_llm_config(args: argparse.Namespace) -> dict:
@@ -308,7 +310,7 @@ def format_yaml_line(line: str, index: int, split_lines: list[str]) -> str:
         line = (
             line[:first_non_blank_char]
             + '"'
-            + line[first_non_blank_char:-1]
+            + line[first_non_blank_char:]
             + '":'
         )
     return line
@@ -397,3 +399,16 @@ def wiki_search_api(
     else:
         response.raise_for_status()  # Raise an HTTPError if the HTTP request returned an unsuccessful status code
         raise ValueError("Unexpected response from API")
+
+
+def get_default_settings_provider(
+    provider: str, model_name: str, max_tokens_to_sample=None
+) -> LLMInterface:
+    """Get the default LLM config and provider for the given provider and model name."""
+
+    provider_name = ProviderName(provider)
+    llm_config = LLMConfigManager.get_config_for_provider(
+        provider_name
+    ).create(max_tokens_to_sample=max_tokens_to_sample, model_name=model_name)
+
+    return InterfaceManager.get_provider(provider_name, model_name, llm_config)
diff --git a/sciphi/examples/library_of_phi/gen_step_1_draft_syllabi.py b/sciphi/examples/library_of_phi/gen_step_1_draft_syllabi.py
index 2226242..7438eca 100644
--- a/sciphi/examples/library_of_phi/gen_step_1_draft_syllabi.py
+++ b/sciphi/examples/library_of_phi/gen_step_1_draft_syllabi.py
@@ -48,9 +48,8 @@
 import fire
 import yaml
 
+from sciphi.examples.helpers import get_default_settings_provider
 from sciphi.examples.library_of_phi.prompts import SYLLABI_CREATION_PROMPT
-from sciphi.interface import InterfaceManager, ProviderName
-from sciphi.llm import LLMConfigManager
 
 
 def extract_data_from_record(record: dict[str, str]) -> tuple[dict, str]:
@@ -137,12 +136,8 @@ def run(self) -> None:
         """Run the draft YAML generation process."""
         yaml.add_representer(str, quoted_presenter)
 
-        provider_name = ProviderName(self.provider)
-        llm_config = LLMConfigManager.get_config_for_provider(
-            provider_name
-        ).create(max_tokens_to_sample=None)
-        llm_provider = InterfaceManager.get_provider(
-            provider_name, self.model_name, llm_config
+        llm_provider = get_default_settings_provider(
+            provider=self.provider, model_name=self.model_name
         )
         if not self.data_directory:
             file_path = os.path.dirname(os.path.abspath(__file__))
diff --git a/sciphi/examples/library_of_phi/gen_step_3_table_of_contents.py b/sciphi/examples/library_of_phi/gen_step_3_table_of_contents.py
index d14afe1..01a9446 100644
--- a/sciphi/examples/library_of_phi/gen_step_3_table_of_contents.py
+++ b/sciphi/examples/library_of_phi/gen_step_3_table_of_contents.py
@@ -36,9 +36,8 @@
 
 import fire
 
+from sciphi.examples.helpers import get_default_settings_provider
 from sciphi.examples.library_of_phi.prompts import TABLE_OF_CONTENTS_PROMPT
-from sciphi.interface import InterfaceManager, ProviderName
-from sciphi.llm import LLMConfigManager
 
 
 class TableOfContentsRunner:
@@ -47,7 +46,7 @@ class TableOfContentsRunner:
     def __init__(
         self,
         input_rel_dir: str = "output_step_2",
-        output_rel_dir: str = "table_of_contents",
+        output_rel_dir: str = "output_step_3",
         data_directory=None,
         provider: str = "openai",
         model_name: str = "gpt-4-0613",
@@ -70,14 +69,9 @@ def run(self):
         )
 
         # Build an LLM and provider interface
-        provider_name = ProviderName(self.provider)
-        llm_config = LLMConfigManager.get_config_for_provider(
-            provider_name
-        ).create(max_tokens_to_sample=None)
-        llm_provider = InterfaceManager.get_provider(
-            provider_name, self.model_name, llm_config
+        llm_provider = get_default_settings_provider(
+            provider=self.provider, model_name=self.model_name
         )
-
         input_dir = os.path.join(self.data_directory, self.input_rel_dir)
         output_dir = os.path.join(self.data_directory, self.output_rel_dir)
diff --git a/sciphi/examples/library_of_phi/gen_step_4_draft_book.py b/sciphi/examples/library_of_phi/gen_step_4_draft_book.py
index 86d6fea..613048e 100644
--- a/sciphi/examples/library_of_phi/gen_step_4_draft_book.py
+++ b/sciphi/examples/library_of_phi/gen_step_4_draft_book.py
@@ -16,8 +16,8 @@
 Parameters:
     provider (str):
         The provider to use. Default is 'openai'.
-    model (str):
-        The model name to use. Default is 'gpt-3.5-turbo-instruct'.
+    model_name (str):
+        The model name to use. Default is 'gpt-3.5-turbo-instruct'.
     parsed_dir (str):
         Directory containing parsed data. Default is 'raw_data'.
     toc_dir (str):
@@ -48,15 +48,17 @@
 
 import fire
 
-from sciphi.examples.helpers import load_yaml_file, wiki_search_api
+from sciphi.examples.helpers import (
+    get_default_settings_provider,
+    load_yaml_file,
+    wiki_search_api,
+)
 from sciphi.examples.library_of_phi.prompts import (
     BOOK_BULK_PROMPT,
     BOOK_CHAPTER_INTRODUCTION_PROMPT,
     BOOK_CHAPTER_SUMMARY_PROMPT,
     BOOK_FOREWARD_PROMPT,
 )
-from sciphi.interface import InterfaceManager, ProviderName
-from sciphi.llm import LLMConfigManager
 from sciphi.writers import RawDataWriter
 
 logger = logging.getLogger(__name__)
@@ -102,7 +104,7 @@ class TextbookContentGenerator:
     def __init__(
         self,
         provider="openai",
-        model="gpt-4-0613",
+        model_name="gpt-4-0613",
         parsed_dir="raw_data",
         toc_dir="table_of_contents",
         output_dir="output_step_4",
@@ -116,7 +118,7 @@ def __init__(
         log_level="INFO",
     ):
         self.provider = provider
-        self.model = model
+        self.model_name = model_name
         self.parsed_dir = parsed_dir
         self.toc_dir = toc_dir
         self.output_dir = output_dir
@@ -146,22 +148,15 @@ def run(self):
         )
         yml_config = load_yaml_file(yml_file_path, do_prep=True)
 
-        # Build an LLM and provider interface
-        provider_name = ProviderName(self.provider)
-        llm_config = LLMConfigManager.get_config_for_provider(
-            provider_name
-        ).create(max_tokens_to_sample=None)
-        llm_provider = InterfaceManager.get_provider(
-            provider_name, self.model, llm_config
-        )
-
         # Create an instance of the generator
         traversal_generator = traverse_config(yml_config)
 
         output_path = os.path.join(
             local_pwd, self.parsed_dir, self.output_dir, f"{self.textbook}.md"
         )
-
+        llm_provider = get_default_settings_provider(
+            provider=self.provider, model_name=self.model_name
+        )
         if not os.path.exists(os.path.dirname(output_path)):
             os.makedirs(os.path.dirname(output_path))
         logger.info(f"Saving textbook to {output_path}")
diff --git a/sciphi/llm/openai_llm.py b/sciphi/llm/openai_llm.py
index e26e209..91ada0a 100644
--- a/sciphi/llm/openai_llm.py
+++ b/sciphi/llm/openai_llm.py
@@ -16,7 +16,7 @@ class OpenAIConfig(LLMConfig):
     # Base
     provider_name: ProviderName = ProviderName.OPENAI
     model_name: str = "gpt-3.5-turbo"
-    temperature: float = 0.7
+    temperature: float = 0.1
     top_p: float = 1.0
 
     # OpenAI Extras
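The final hunk drops the default OpenAI sampling temperature from 0.7 to 0.1, trading output variety for more repeatable generations across pipeline runs. A small sketch of overriding it per use, assuming `OpenAIConfig` is a dataclass-style config whose class-level defaults can be overridden at construction:

```python
from sciphi.llm.openai_llm import OpenAIConfig

# New default favors focused, repeatable completions.
focused_config = OpenAIConfig()  # temperature=0.1

# Assumed dataclass-style override to restore the old, more varied sampling.
exploratory_config = OpenAIConfig(temperature=0.7)

print(focused_config.temperature, exploratory_config.temperature)  # 0.1 0.7
```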