feat: Knowledge Graph Retriever trainer using PEFT (#39)
* started adding training scripts initial commit

* modified training scripts and added QLoRA training scripts

* added arg-parse functionality to run training scripts

* formatted the scripts using black

* added doc strings for the training files

* added a markdown file demonstrating my work on property graphs and relik for relation extraction, which has further scope of being integrated into the KG builder

* updated README.MD file for training scripts
debrupf2946 authored Aug 25, 2024
1 parent 1f1ee9f commit 86252ac
Showing 7 changed files with 679 additions and 0 deletions.
143 changes: 143 additions & 0 deletions graph_rag/graph_builder/Example/build_with_relic.MD
@@ -0,0 +1,143 @@
# Knowledge Graph with Relik and Llama-Index

This markdown file demonstrates an experiment in building a knowledge graph using `Relik` and `Llama-Index` Property Graphs. The steps include coreference resolution with `Spacy`, relation extraction with `Relik`, and knowledge graph construction with `llama-index PropertyGraphs`, with the result stored in `neo4j`.

## Import Necessary Libraries

Import the essential libraries required for the experiment. These include NLP tools (`Spacy`, `coreferee`), document readers, large language models (LLMs), embeddings, and Neo4j for graph storage.

```python
import spacy
import coreferee  # registers the "coreferee" spaCy pipeline component
import nest_asyncio
from llama_index.core import SimpleDirectoryReader, PropertyGraphIndex, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.extractors.relik.base import RelikPathExtractor
from llama_index.graph_stores.neo4j import Neo4jPGStore

nest_asyncio.apply()  # allow nested event loops (needed when running inside a notebook)
```

## Coreference Resolution Function

Sets up a function to resolve coreferences in a text. This is crucial for ensuring that references to entities like "she" or "it" are correctly linked back to their antecedents, which prevents duplicate nodes in the knowledge graph.

```python
coref_nlp = spacy.load('en_core_web_lg')
coref_nlp.add_pipe('coreferee')

def coref_text(text):
    coref_doc = coref_nlp(text)
    resolved_text = ""

    for token in coref_doc:
        repres = coref_doc._.coref_chains.resolve(token)
        if repres:
            resolved_text += " " + " and ".join(
                [
                    t.text
                    if t.ent_type_ == ""
                    else [e.text for e in coref_doc.ents if t in e][0]
                    for t in repres
                ]
            )
        else:
            resolved_text += " " + token.text

    return resolved_text
```

### Example Usage of Coreference Resolution

An example is provided to demonstrate how the `coref_text` function resolves references in the text.

```python
coref_text("alice is great. she can study for long hours and remember")
# Output: alice is great. alice can study for long hours and remember
```

## Load and Process Documents

The documents are loaded from a specified directory and processed with the coreference resolution function to prepare them for knowledge graph construction.

```python
documents = SimpleDirectoryReader(input_dir='/content/data').load_data()
len(documents)

for doc in documents:
    doc.text = coref_text(doc.text)
```

## Initialize Relik Path Extractor

Here, the `RelikPathExtractor` is initialized, which will be used to extract relationships between entities from the processed documents.

```python
relik = RelikPathExtractor(
    model="relik-ie/relik-relation-extraction-small",
    model_config={"skip_metadata": True},
)
```

## Set Up Language Model and Embeddings

This section configures the LLM (`Ollama`) and the embedding model (`HuggingFaceEmbedding`) to be used for generating embeddings for the knowledge graph.

```python
llm = Ollama(base_url="http://localhost:11434", model="llama3.1")
embed_model = HuggingFaceEmbedding(model_name="microsoft/codebert-base")
Settings.llm = llm
```

## Configure Neo4j Graph Store

Sets up the connection to a Neo4j database, where the knowledge graph will be stored. Be sure to replace the password placeholder with your actual Neo4j password.

```python
username = "neo4j"
password = "*****************************"
url = "neo4j+s://45256b03.databases.neo4j.io"

graph_store = Neo4jPGStore(
    username=username,
    password=password,
    url=url,
    refresh_schema=False,
)
```

## Build the Knowledge Graph

Here, the knowledge graph is constructed from the processed documents using the configured tools: `Relik`, `Ollama`, `HuggingFaceEmbedding`, and `Neo4j`.

```python
index = PropertyGraphIndex.from_documents(
    documents,
    kg_extractors=[relik],
    llm=llm,
    embed_model=embed_model,
    property_graph_store=graph_store,
    show_progress=True,
)
```
![Knowledge graph visualisation](random/visualisation.png)


## Query the Knowledge Graph

Finally, a query engine is created, allowing you to query the knowledge graph. Example queries and their expected outputs are provided.

```python
query_engine = index.as_query_engine(include_text=True)

response = query_engine.query("what is keras nlp?")
print(str(response))

# Output: Keras NLP provides a simple way to fine-tune pre-trained language models for various natural language processing tasks...
```

```python
response = query_engine.query("format for citing keras nlp")
print(str(response))

# Output: To cite Keras NLP, you can refer to the following format: KerasNLP. (n.d.). Retrieved from <https://keras-nlp.github.io/>...
```
33 changes: 33 additions & 0 deletions graph_rag/graph_retrieval/README.MD
@@ -41,4 +41,37 @@ from graph_rag.graph_retrieval.graph_retrieval import graph_query
response = graph_query("Your query here", query_engine)
print(response)
```
## Advanced Training with QLoRA and P-Tuning

> Fine-tuning LLMs on your data (masked-language or next-token prediction) for a few epochs may result in better retrieval and responses.

### 1. Setup

To use QLoRA and P-Tuning, ensure your environment is set up with the required libraries and that your model and dataset configurations are defined in a `config.yaml` file.
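
A minimal sketch of how such a `config.yaml` might be read at the start of a training script (assuming PyYAML is installed; the key names follow the sample config included in this commit):

```python
import yaml

# Load the training configuration; the path is supplied via --config at runtime
with open("path/to/config.yaml") as f:
    cfg = yaml.safe_load(f)

model_name = cfg["MODEL"]["MODEL"]  # e.g. "codellama/CodeLlama-7b-Instruct-hf"
lora_r = cfg["LORA"]["LORA_R"]      # e.g. 8
```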

### 2. Fine-Tuning with QLoRA

Use the QLoRA method for efficient fine-tuning by passing the appropriate configurations in your `config.yaml`. This method is ideal when working with large models on limited hardware. Execute the training script with the `--config` argument to specify your configuration file:

```bash
python qlora_adapter.py --config path/to/config.yaml
```
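
As an illustrative sketch (not the exact contents of `qlora_adapter.py`), a QLoRA-style setup with `peft` and `bitsandbytes` might look like the following, using the quantization and LoRA values from the sample config in this commit:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization settings (BNB_CONFIG block of the sample config)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,         # USE_NESTED_QUANT
    bnb_4bit_compute_dtype=torch.bfloat16,  # BNB_4BIT_COMPUTE_DTYPE
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",   # MODEL in the sample config
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter settings (LORA block of the sample config)
lora_config = LoraConfig(
    r=8,               # LORA_R
    lora_alpha=32,     # LORA_ALPHA
    lora_dropout=0.0,  # LORA_DROPOUT
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```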

### 3. Fine-Tuning with P-Tuning

P-Tuning allows for parameter-efficient prompt-based fine-tuning. Adjust the number of virtual tokens and other related parameters in the `config.yaml` to customize the training process.

Execute the training script with the `--config` argument to specify your configuration file:

```bash
python p_tuning.py --config path/to/config.yaml
```
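
For reference, a minimal P-Tuning setup with `peft` might look like the sketch below (illustrative only, not the exact contents of `p_tuning.py`; `num_virtual_tokens` and `encoder_hidden_size` are placeholder values you would read from `config.yaml`):

```python
from transformers import AutoModelForCausalLM
from peft import PromptEncoderConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")

# Prompt-encoder (P-Tuning) settings; only the virtual prompt parameters are trained
ptuning_config = PromptEncoderConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,    # placeholder: set from config.yaml
    encoder_hidden_size=128,  # placeholder: set from config.yaml
)
model = get_peft_model(model, ptuning_config)
model.print_trainable_parameters()
```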

This will start the training process using the specified method (QLoRA or P-Tuning) and configuration.


@@ -0,0 +1,35 @@
MODEL:
MODEL: "codellama/CodeLlama-7b-Instruct-hf"
SEQ_LENGTH: 2048
LOAD_IN_8BIT: False

DATA:
REPO_PATH: '/content/keras-io/templates'
SEED: 0
EXTENSIONS: [ 'md' ]
OUTPUT_FILE: 'merged_output.txt'  # File that the merged document contents are written to

TRAINING_ARGUMENTS:
BATCH_SIZE: 64
GR_ACC_STEPS: 1
LR: 5e-4
LR_SCHEDULER_TYPE: "cosine"
WEIGHT_DECAY: 0.01
NUM_WARMUP_STEPS: 30
EVAL_FREQ: 100
SAVE_FREQ: 100
LOG_FREQ: 10
OUTPUT_DIR:
BF16: True
FP16: False

LORA:
LORA_R: 8
LORA_ALPHA: 32
LORA_DROPOUT: 0.0
LORA_TARGET_MODULES:

BNB_CONFIG:
USE_NESTED_QUANT: True
BNB_4BIT_COMPUTE_DTYPE: "bfloat16"

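For reference, a hedged sketch of how the `TRAINING_ARGUMENTS` block above might be mapped onto Hugging Face `TrainingArguments` (the output directory is a placeholder, since `OUTPUT_DIR` is left blank in the sample config):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="checkpoints",        # OUTPUT_DIR (placeholder)
    per_device_train_batch_size=64,  # BATCH_SIZE
    gradient_accumulation_steps=1,   # GR_ACC_STEPS
    learning_rate=5e-4,              # LR
    lr_scheduler_type="cosine",      # LR_SCHEDULER_TYPE
    weight_decay=0.01,               # WEIGHT_DECAY
    warmup_steps=30,                 # NUM_WARMUP_STEPS
    eval_steps=100,                  # EVAL_FREQ
    save_steps=100,                  # SAVE_FREQ
    logging_steps=10,                # LOG_FREQ
    bf16=True,                       # BF16
)
```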
