EDGAR Q&A

Overview

About this template

This AI Starter Kit is an example of a semantic search workflow that can be built using the SambaNova platform to answer questions about organizations using their 10-K annual reports. It includes:

  • A configurable SambaStudio connector for running inference against a model deployed there.
  • A configurable integration with a third-party vector database.
  • An implementation of the semantic search workflow and prompt construction strategies.

This sample is ready-to-use. We provide two options to help you run this demo by following a few simple steps described in the Getting Started section. It also serves as a starting point for customization to your organization's needs, which you can learn more about in the Customizing the Template section.

Workflow

This AI Starter Kit implements two distinct workflows, each of which pipelines a series of operations.

Ingestion

This workflow is an example of downloading and indexing data for subsequent Q&A. The steps are:

  1. Download data: This workflow begins with pulling 10-K reports from the EDGAR dataset to be chunked, indexed, and stored for future retrieval. EDGAR data is downloaded using the SEC DATA DOWNLOADER, which retrieves the data as text.
  2. Split data: Once the data has been downloaded, we need to split it into chunks of text to be embedded and stored in a vector database. The size of each chunk depends on the context (sequence) length offered by the model. Generally, larger context lengths result in better performance. The method used to split text also has an impact on performance (for instance, avoiding breaks in the middle of words or sentences). The downloaded data is split using RecursiveCharacterTextSplitter.
  3. Embed data: For each chunk of text from the previous step, we use an embeddings model to create a vector representation of it. These embeddings are used in the storage and retrieval of the most relevant content based on the user's query. The split text is embedded using HuggingFaceInstructEmbeddings.
  4. Store embeddings: Embeddings for each chunk, along with content and relevant metadata (such as source documents), are stored in a vector database. The embedding acts as the index in the database. In this template, we store information with each entry, which can be modified to suit your needs. There are several vector database options available, each with its own pros and cons. This AI template is set up to use Chroma (chromadb) or Qdrant as the vector database, but can easily be updated to use any other. A rough sketch of these ingestion steps follows this list.
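
The sketch below illustrates steps 2-4 with LangChain. The file path, chunk sizes, metadata, and persist directory are assumptions for illustration only, not the kit's actual configuration.

# Rough ingestion sketch: split -> embed -> store (paths and parameters are assumptions).
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Assume the downloaded 10-K text is already on disk (illustrative path).
with open("data/tsla_10k.txt") as f:
    report_text = f.read()

# Split the report into overlapping chunks sized for the model's context window.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(report_text)

# Embed each chunk with the same model used later at query time, then store the
# chunks, embeddings, and source metadata in a local Chroma collection.
embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large")
vectordb = Chroma.from_texts(
    texts=chunks,
    embedding=embeddings,
    metadatas=[{"source": "tsla_10k"} for _ in chunks],
    persist_directory="chroma_db",
)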

Retrieval

This workflow is an example of leveraging data stored in a vector database along with a large language model to enable retrieval-based Q&A off your data. The steps are:

  1. Embed query: Given a user-submitted query, the first step is to convert it into a common representation (an embedding) for subsequent use in identifying the most relevant stored content. Because of this, it is recommended to use the same embedding model that was used during ingestion. In this sample, the query text is embedded using HuggingFaceInstructEmbeddings, the same model used in the ingestion workflow (see the sketch after this list).
  2. Retrieve relevant content: Next, we use the embeddings representation of the query to make a retrieval request from the vector database, which returns relevant entries (content). The vector database therefore also acts as a retriever for fetching relevant information from the database.
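
A rough sketch of these two retrieval steps is shown below; the persist directory and top-k value are assumptions carried over from the ingestion sketch above.

# Rough retrieval sketch: embed the query with the same model used at ingestion
# and fetch the closest stored chunks (persist_directory and k are assumptions).
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import Chroma

embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large")
vectordb = Chroma(persist_directory="chroma_db", embedding_function=embeddings)

query = "What was the increase in R&D expenses in 2022 compared to 2021?"
relevant_docs = vectordb.similarity_search(query, k=4)  # top-4 most similar chunks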

Answering

  1. SambaNova Large language model (LLM): Once the relevant information is retrieved, the content is sent to a SambaNova LLM to generate the final response to the user query.
    • Prompt engineering: The user's query is combined with the retrieved content along with instructions to form the prompt before being sent to the LLM. This process involves prompt engineering, and is an important part of ensuring quality output. In this AI template, customized prompts are provided to the LLM to improve the quality of response for this use case.
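
Putting retrieval and answering together, a rough sketch is shown below. FakeListLLM is only a runnable stand-in for the kit's SambaStudio LLM wrapper, and the prompt, paths, and parameters are illustrative assumptions; the kit's actual prompt and chain setup live in edgar_sec_qa.py.

# Rough answering sketch: retrieved chunks and the user query are combined into a
# prompt and sent to an LLM (FakeListLLM stands in for the SambaStudio LLM wrapper).
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.llms.fake import FakeListLLM
from langchain.prompts import PromptTemplate
from langchain.vectorstores import Chroma

embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large")
vectordb = Chroma(persist_directory="chroma_db", embedding_function=embeddings)

prompt = PromptTemplate(
    template="Use the context to answer the question.\n{context}\n\nQuestion: {question}\nHelpful Answer:",
    input_variables=["context", "question"],
)

llm = FakeListLLM(responses=["<placeholder answer>"])  # replace with the SambaStudio LLM

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",                    # stuff retrieved chunks into a single prompt
    retriever=vectordb.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)
answer = qa_chain.run("What was the increase in R&D expenses in 2022 compared to 2021?")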

Third-party tools and data sources

All the packages/tools are listed in the requirements.txt file in the project directory. Some of the main packages are listed below:

  • streamlit (version 1.25.0)
  • llama-hub (version 0.0.25)
  • langchain (version 0.0.266)
  • llama-index (version 0.8.20)
  • sentence_transformers (version 2.2.2)
  • instructorembedding (version 1.0.1)
  • beautifulsoup4 (version 4.12.2)
  • chromadb (version 0.4.8)
  • qdrant-client (version 1.5.2)
  • fastapi (version 0.99.1)
  • unstructured (version 0.8.1)

Getting started

1. Deploy your model in SambaStudio

Begin by deploying your LLM of choice to an endpoint for inference in SambaStudio either through the GUI or CLI. Refer to the SambaStudio endpoint documentation for help on deploying endpoints.

2. Integrate your model in the starter kit

Integrate your LLM deployed on SambaStudio with this AI starter kit in two simple steps:

  1. Clone this repo.
  git clone https://github.com/sambanova/ai-starter-kit.git
  2. Update the API information for the SambaNova LLM and, optionally, the vector database. These are represented as configurable variables in the export.env file in the project directory. The variable names are listed below as an example.
BASE_URL="http://...."
PROJECT_ID=""
ENDPOINT_ID=""
API_KEY=""
VECTOR_DB_URL=http://host.docker.internal:6333
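
For reference, these variables can be read at runtime roughly as sketched below. This assumes the python-dotenv package and an illustrative path; the kit itself may load them differently.

# Minimal sketch of loading export.env (python-dotenv assumed; path is illustrative).
import os
from dotenv import load_dotenv

load_dotenv("edgar_qna/export.env")

base_url = os.getenv("BASE_URL")
project_id = os.getenv("PROJECT_ID")
endpoint_id = os.getenv("ENDPOINT_ID")
api_key = os.getenv("API_KEY")
vector_db_url = os.getenv("VECTOR_DB_URL")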

3. Deploy the starter kit

Option 1: Run through local install

Running through a local install is the simplest option and includes a simple Streamlit-based UI for quick experimentation. It is not recommended for POC implementation.

Important: When running through a local install, no 10-Ks for organizations are pre-indexed; instead, 10-Ks are pulled and indexed on demand. The workflow to do this has been implemented in this Starter Kit. To pull the latest 10-K from EDGAR, simply specify the company ticker in the sample UI and click Submit. This results in a one-time fetch of the latest 10-K from EDGAR, which is embedded and indexed before being made available for Q&A. As a result, it takes some time for the data to be available the first time you ask a question for a new company ticker. Because this is a one-time operation per company ticker, all subsequent Q&A off that company ticker is much faster, as this process does not need to be repeated.

Begin by installing dependencies. It is recommended to use a virtual environment (venv or conda) for installation.

cd edgar_qna/edgar_qna_streamlit
python3 -m venv edgar_demo
source edgar_demo/bin/activate
pip install -r requirements.txt

To run the demo through local install, run the following commands:

sh run.sh

This will open the demo in your default browser at port 8501.

Option 2: Run via Docker

Running through Docker is the most scalable approach for running this AI Starter Kit, and one that provides a path to production deployment.

Important: This approach does not include a UI, but allows inference via API to be connected to custom user experiences. Setting up to run via Docker includes pulling and indexing the latest 10-Ks for organizations in the S&P 500. While the setup may take a bit more time, it leads to a better Q&A experience as there are no delays due to fetching and indexing data on-the-fly.

Install dependencies to upload data to the Qdrant server

cd edgar_qna/edgar_qna_server
python3 -m venv edgar_server_demo
source edgar_server_demo/bin/activate
pip install -r requirements.txt

Setting up the Qdrant vector database

Make sure Docker is installed and running.

Pull the Qdrant Docker image

docker pull qdrant/qdrant

Run Qdrant on localhost

docker run -p 6333:6333 \
-v $(pwd)/qdrant_storage:/qdrant/storage:z \
qdrant/qdrant

Upload data to the database

Note: If you already have vectorized data, you can download a snapshot and upload it to any new database instance.

Upload a snapshot to the Qdrant server
unzip edgar_tsla_sample.snapshot.zip
curl -X POST 'http://localhost:6333/collections/edgar/snapshots/upload' \
-H 'Content-Type:multipart/form-data' \
-F 'snapshot=@edgar_tsla_sample.snapshot'

Download a created snapshot from the Qdrant server (optional)
curl -O http://localhost:6333/collections/edgar/snapshots/<snapshot name>

Download and embed data for other organizations (optional)

For Q&A over other companies, one can download and vectorize the EDGAR dataset using the files get_snp500_data.py and vectorize_and_load.py.

The commands below will download and embed the data for other organizations listed in ticker_to_download.csv.

python3 get_snp500_data.py
python3 vectorize_and_load.py

Setting up the Q&A API

Build the AI starter kit Docker image

cd ../..
docker build -t edgar_assistant -f edgar_qna/edgar_qna_server/Dockerfile .

Run the AI starter kit container from the image

docker run -p 80:80 --env-file edgar_qna/export.env edgar_assistant

Make a request to the API

The app can be accessed at

http://127.0.0.1/docs#

Click the drop-down menu next to POST and click the Try it out button.

Edit the following JSON input for the ticker and query.

{
  "ticker": "string",
  "query": "string"
}

For example, below is an edited version for a query regarding Tesla.

{
  "ticker": "tsla",
  "query": "What was the increase in R&D expenses in 2022 compared to 2021?"
}

Once you click the Execute button, the LLM response is populated in the Response body. Here is an example for the query above.

{
  "message": " R&D expenses increased $482 million, or 19% in the year ended December 31, 2022 as compared to the year ended December 31, 2021. The increase was primarily due to a $175 million increase in employee and labor related expenses, a $132 million increase in facilities, outside services, freight and depreciation expense, a $101 million increase in R&D expensed materials and an $87 million increase in stock-based compensation expense. These increases were to support our expanding product roadmap and \n"
}
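
The same request can also be made programmatically instead of through the Swagger UI. A minimal sketch with Python requests is shown below; the "/ask" path is an assumption for illustration, so use the route actually listed at http://127.0.0.1/docs#.

# Minimal sketch of calling the Q&A API directly (the "/ask" path is an assumption;
# check the FastAPI docs page for the actual route).
import requests

payload = {
    "ticker": "tsla",
    "query": "What was the increase in R&D expenses in 2022 compared to 2021?",
}
response = requests.post("http://127.0.0.1/ask", json=payload)
print(response.json()["message"])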

Customizing the template

The provided example template can be further customized based on the use case.

Import Data

Depending on the format of input data files (e.g., .pdf, .docx, .rtf), different packages can be used to convert them into plain text files.

PDF Format:

This modification can be done in the following location:

file: edgar_sec_qa.py
function: vector_db_sec_docs
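
As an example, PDF filings could be converted into LangChain documents with a loader backed by the unstructured package (already listed in requirements.txt); the file path below is illustrative.

# Illustrative PDF ingestion using the unstructured-backed LangChain loader.
from langchain.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader("data/annual_report.pdf")  # path is illustrative
documents = loader.load()  # list of Document objects ready for splitting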

Split Data

You can experiment with different ways of splitting the data, such as splitting by tokens or using context-aware splitting for code or markdown files. LangChain provides several examples of different kinds of splitting here.

This modification can be done in the following location:

file: edgar_sec_qa.py
function: vector_db_sec_docs

The RecursiveCharacterTextSplitter, which is used for this template, can be further customized using the chunk_size and chunk_overlap parameters. For LLMs with a long sequence length, a larger value of chunk_size could be used to provide the LLM with broader context and improve performance. The chunk_overlap parameter is used to maintain continuity between different chunks.

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)
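
As an alternative to the character-based splitter above, a token-based splitter (one of the options mentioned earlier) could be swapped in at the same location; the chunk sizes below are illustrative, and TokenTextSplitter requires the tiktoken package.

# Illustrative token-based splitting (requires the tiktoken package).
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=50)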

This modification can be done in the following location:

file: edgar_sec_qa.py
function: vector_db_sec_docs

Embed data

There are several open-source embedding models available on HuggingFace. This leaderboard ranks these models based on the Massive Text Embedding Benchmark (MTEB). A number of these models are available on SambaStudio and can be further fine-tuned on specific datasets to improve performance.
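
For example, swapping in a different HuggingFace embedding model is a small change; the model below is only an illustration, not a recommendation.

# Illustrative embedding model swap (model choice is an example only).
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")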

This modification can be done in the following location:

file: edgar_sec_qa.py
function: vector_db_sec_docs

Store embeddings

The template can be customized to use different vector databases to store the embeddings generated by the embedding model. The LangChain vector stores documentation provides a broad collection of vector stores that can be easily integrated.
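
For example, the embeddings could be written to the Qdrant instance started in the Docker setup above instead of Chroma; the document, collection name, and URL below are assumptions for illustration.

# Illustrative swap to Qdrant as the vector store (collection name and URL are assumptions).
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.schema import Document
from langchain.vectorstores import Qdrant

documents = [Document(page_content="example chunk", metadata={"source": "tsla_10k"})]
embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large")

vectordb = Qdrant.from_documents(
    documents,
    embeddings,
    url="http://localhost:6333",
    collection_name="edgar",
)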

This modification can be done in the following location:

file: edgar_sec_qa.py
function: vector_db_sec_docs

Retrieval

Similar to the vector stores, a wide collection of retriever options is available depending on the use case. In this template, the vector store was used as a retriever, but it can be enhanced and customized, as shown in some of the examples here.
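
For instance, the retriever returned by the vector store can be tuned without changing anything else in the chain; the search type and k below are example values, and the persist directory is carried over from the earlier sketches.

# Illustrative retriever customization (search type and k are example values).
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import Chroma

embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large")
vectordb = Chroma(persist_directory="chroma_db", embedding_function=embeddings)

retriever = vectordb.as_retriever(
    search_type="mmr",          # maximal marginal relevance instead of plain similarity
    search_kwargs={"k": 5},
)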

This modification can be done in the following location:

file: edgar_sec_qa.py
function: retreival_qa_chain

Large language model (LLM)

Fine-tuning

The template uses the SN LLM model, which can be further fine-tuned to improve response quality. To train a model in SambaStudio, learn how to prepare your training data, import your dataset into SambaStudio, and run a training job.

Prompt engineering

Finally, prompting has a significant effect on the quality of LLM responses. Prompts can be further customized to improve the overall quality of the responses from the LLMs. For example, in the given template, the following prompt was used to generate a response from the LLM, where question is the user query and context is the set of documents retrieved by the retriever.

custom_prompt_template = """Use the following pieces of context about company annual/quarterly report filing to answer the question at the end. If the answer to the question cant be extracted from given CONTEXT than say I do not have information regarding this.
{context}




Question: {question}
Helpful Answer:"""
CUSTOMPROMPT = PromptTemplate(
    template=custom_prompt_template, input_variables=["context", "question"]
)

This modification can be done in the following location:

file: edgar_sec_qa.py
function: retreival_qa_chain