feat: Knowledge Graph Retriever trainer using PEFT (#39)
* started adding training scripts initial commit

* modified training scripts and added QLoRA training scripts

* added arg-parse functionality to run training scripts

* formatted the scripts using black

* added doc strings for the training files

* added a markdown file demonstrating my work on property graphs and relik for relation extraction, which has further scope of being integrated into the KG builder

* updated README.MD file for training scripts
debrupf2946 authored Aug 25, 2024
1 parent 1f1ee9f commit 86252ac
Showing 7 changed files with 679 additions and 0 deletions.
143 changes: 143 additions & 0 deletions graph_rag/graph_builder/Example/build_with_relic.MD
@@ -0,0 +1,143 @@
# Knowledge Graph with Relik and Llama-Index

This markdown file demonstrates an experiment in building a knowledge graph using `Relik` and `Llama-Index` Property Graphs. The steps include coreference resolution with `Spacy`, relation extraction with `Relik`, and knowledge graph construction with `llama-index PropertyGraphs`, with the result stored in `neo4j`.

## Import Necessary Libraries

Import the essential libraries required for the experiment. These include NLP tools (`Spacy`, `coreferee`), document readers, large language models (LLMs), embeddings, and Neo4j for graph storage.

```python
import spacy
import coreferee  # registers the "coreferee" spaCy pipeline component
import nest_asyncio
from llama_index.core import SimpleDirectoryReader, PropertyGraphIndex, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.extractors.relik.base import RelikPathExtractor
from llama_index.graph_stores.neo4j import Neo4jPGStore

nest_asyncio.apply()  # allow nested event loops (needed when running inside a notebook)
```

## Coreference Resolution Function

Sets up a function to resolve coreferences in a text. This is crucial for ensuring that references to entities like "she" or "it" are correctly linked back to their antecedents, which prevents duplicate nodes in the knowledge graph.

```python
coref_nlp = spacy.load('en_core_web_lg')
coref_nlp.add_pipe('coreferee')

def coref_text(text):
    coref_doc = coref_nlp(text)
    resolved_text = ""

    for token in coref_doc:
        repres = coref_doc._.coref_chains.resolve(token)
        if repres:
            resolved_text += " " + " and ".join(
                [
                    t.text
                    if t.ent_type_ == ""
                    else [e.text for e in coref_doc.ents if t in e][0]
                    for t in repres
                ]
            )
        else:
            resolved_text += " " + token.text

    return resolved_text
```

### Example Usage of Coreference Resolution

An example is provided to demonstrate how the `coref_text` function resolves references in the text.

```python
coref_text("alice is great. she can study for long hours and remember")
# Output: alice is great. alice can study for long hours and remember
```

## Load and Process Documents

The documents are loaded from a specified directory and processed with the coreference resolution function to prepare them for knowledge graph construction.

```python
documents = SimpleDirectoryReader(input_dir='/content/data').load_data()
len(documents)

for doc in documents:
    doc.text = coref_text(doc.text)
```

## Initialize Relik Path Extractor

Here, the `RelikPathExtractor` is initialized, which will be used to extract relationships between entities from the processed documents.

```python
relik = RelikPathExtractor(
    model="relik-ie/relik-relation-extraction-small",
    model_config={"skip_metadata": True},
)
```

## Set Up Language Model and Embeddings

This section configures the LLM (`Ollama`) and the embedding model (`HuggingFaceEmbedding`) to be used for generating embeddings for the knowledge graph.

```python
llm = Ollama(base_url="http://localhost:11434", model="llama3.1")
embed_model = HuggingFaceEmbedding(model_name="microsoft/codebert-base")
Settings.llm = llm
```

## Configure Neo4j Graph Store

Sets up the connection to a Neo4j database, where the knowledge graph will be stored. Be sure to replace the password placeholder with your actual Neo4j password.

```python
username = "neo4j"
password = "*****************************"
url = "neo4j+s://45256b03.databases.neo4j.io"

graph_store = Neo4jPGStore(
    username=username,
    password=password,
    url=url,
    refresh_schema=False,
)
```

## Build the Knowledge Graph

Here, the knowledge graph is constructed from the processed documents using the configured tools: `Relik`, `Ollama`, `HuggingFaceEmbedding`, and `Neo4j`.

```python
index = PropertyGraphIndex.from_documents(
    documents,
    kg_extractors=[relik],
    llm=llm,
    embed_model=embed_model,
    property_graph_store=graph_store,
    show_progress=True,
)
```
![Knowledge graph visualisation](random/visualisation.png)


## Query the Knowledge Graph

Finally, a query engine is created, allowing you to query the knowledge graph. Example queries and their expected outputs are provided.

```python
query_engine = index.as_query_engine(include_text=True)

response = query_engine.query("what is keras nlp?")
print(str(response))

# Output: Keras NLP provides a simple way to fine-tune pre-trained language models for various natural language processing tasks...
```

```python
response = query_engine.query("format for citing keras nlp")
print(str(response))

# Output: To cite Keras NLP, you can refer to the following format: KerasNLP. (n.d.). Retrieved from <https://keras-nlp.github.io/>...
```
33 changes: 33 additions & 0 deletions graph_rag/graph_retrieval/README.MD
@@ -41,4 +41,37 @@ from graph_rag.graph_retrieval.graph_retrieval import graph_query
response = graph_query("Your query here", query_engine)
print(response)
```
## Advanced Training with QLoRA and P-Tuning

> Fine-tuning LLMs on your data (masked-language or next-token prediction) for a few epochs may result in better retrieval and responses.

### 1. Setup

To use QLoRA and P-Tuning, ensure your environment is set up with the required libraries and that your model and dataset configurations are defined in a `config.yaml` file.
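
A minimal sketch of how such a `config.yaml` might be read at the start of a training script (assuming PyYAML is installed; the key names follow the sample config included in this commit):

```python
import yaml

# Load the training configuration; the path is supplied via --config at runtime
with open("path/to/config.yaml") as f:
    cfg = yaml.safe_load(f)

model_name = cfg["MODEL"]["MODEL"]  # e.g. "codellama/CodeLlama-7b-Instruct-hf"
lora_r = cfg["LORA"]["LORA_R"]      # e.g. 8
```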

### 2. Fine-Tuning with QLoRA

Use the QLoRA method for efficient fine-tuning by passing the appropriate configurations in your `config.yaml`. This method is ideal when working with large models on limited hardware. Execute the training script with the `--config` argument to specify your configuration file:

```bash
python qlora_adapter.py --config path/to/config.yaml
```
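
As an illustrative sketch (not the exact contents of `qlora_adapter.py`), a QLoRA-style setup with `peft` and `bitsandbytes` might look like the following, using the quantization and LoRA values from the sample config in this commit:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization settings (BNB_CONFIG block of the sample config)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,         # USE_NESTED_QUANT
    bnb_4bit_compute_dtype=torch.bfloat16,  # BNB_4BIT_COMPUTE_DTYPE
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",   # MODEL in the sample config
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter settings (LORA block of the sample config)
lora_config = LoraConfig(
    r=8,               # LORA_R
    lora_alpha=32,     # LORA_ALPHA
    lora_dropout=0.0,  # LORA_DROPOUT
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```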

### 3. Fine-Tuning with P-Tuning

P-Tuning allows for parameter-efficient prompt-based fine-tuning. Adjust the number of virtual tokens and other related parameters in the `config.yaml` to customize the training process.

Execute the training script with the `--config` argument to specify your configuration file:

```bash
python p_tuning.py --config path/to/config.yaml
```
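
For reference, a minimal P-Tuning setup with `peft` might look like the sketch below (illustrative only, not the exact contents of `p_tuning.py`; `num_virtual_tokens` and `encoder_hidden_size` are placeholder values you would read from `config.yaml`):

```python
from transformers import AutoModelForCausalLM
from peft import PromptEncoderConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")

# Prompt-encoder (P-Tuning) settings; only the virtual prompt parameters are trained
ptuning_config = PromptEncoderConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,    # placeholder: set from config.yaml
    encoder_hidden_size=128,  # placeholder: set from config.yaml
)
model = get_peft_model(model, ptuning_config)
model.print_trainable_parameters()
```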

This will start the training process using the specified method (QLoRA or P-Tuning) and configuration.


@@ -0,0 +1,35 @@
MODEL:
MODEL: "codellama/CodeLlama-7b-Instruct-hf"
SEQ_LENGTH: 2048
LOAD_IN_8BIT: False

DATA:
REPO_PATH: '/content/keras-io/templates'
SEED: 0
EXTENSIONS: [ 'md' ]
OUTPUT_FILE: 'merged_output.txt'  # File that the merged document contents are written to

TRAINING_ARGUMENTS:
BATCH_SIZE: 64
GR_ACC_STEPS: 1
LR: 5e-4
LR_SCHEDULER_TYPE: "cosine"
WEIGHT_DECAY: 0.01
NUM_WARMUP_STEPS: 30
EVAL_FREQ: 100
SAVE_FREQ: 100
LOG_FREQ: 10
OUTPUT_DIR:
BF16: True
FP16: False

LORA:
LORA_R: 8
LORA_ALPHA: 32
LORA_DROPOUT: 0.0
LORA_TARGET_MODULES:

BNB_CONFIG:
USE_NESTED_QUANT: True
BNB_4BIT_COMPUTE_DTYPE: "bfloat16"

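For reference, a hedged sketch of how the `TRAINING_ARGUMENTS` block above might be mapped onto Hugging Face `TrainingArguments` (the output directory is a placeholder, since `OUTPUT_DIR` is left blank in the sample config):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="checkpoints",        # OUTPUT_DIR (placeholder)
    per_device_train_batch_size=64,  # BATCH_SIZE
    gradient_accumulation_steps=1,   # GR_ACC_STEPS
    learning_rate=5e-4,              # LR
    lr_scheduler_type="cosine",      # LR_SCHEDULER_TYPE
    weight_decay=0.01,               # WEIGHT_DECAY
    warmup_steps=30,                 # NUM_WARMUP_STEPS
    eval_steps=100,                  # EVAL_FREQ
    save_steps=100,                  # SAVE_FREQ
    logging_steps=10,                # LOG_FREQ
    bf16=True,                       # BF16
)
```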
