unesp-ppgcc-2024-02-ap-chatbot/unesp-ppgcc-chatbot

UNESP (PPGCC) Chatbot 🚀

Project Overview 🎓

Welcome! This is a chatbot project developed as part of the Deep Learning course in the Postgraduate Program in Computer Science (PPGCC) at UNESP! 🤖💬

This chatbot uses context-based learning with RAG (Retrieval-Augmented Generation) to answer questions about the PPGCC program. We use data from the PPGCC website to create an intelligent and responsive assistant for program-related queries.

The system uses hybrid search: semantic search with the BAAI/bge-m3 embeddings model runs alongside BM25 keyword search, and RRF (Reciprocal Rank Fusion) then merges the two result lists into a single ranking 🔍🧠
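To make the fusion step concrete, here is a minimal pure-Python sketch of Reciprocal Rank Fusion. It is an illustration only, not the project's code: the function name, the document IDs, and the constant k=60 (a value commonly used for RRF) are assumptions, and the fusion setting used in this project is not stated here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists: one from the dense (bge-m3) search, one from BM25.
dense_hits = ["admissions", "faculty", "courses"]
sparse_hits = ["courses", "admissions", "deadlines"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print([doc_id for doc_id, _ in fused])
# → ['admissions', 'courses', 'faculty', 'deadlines']
```

Documents that rank well in both lists rise to the top, while a document found by only one search still appears in the fused ranking.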

Course: Deep Learning

Professor

Prof. Denis Henrique Pinheiro Salvadeo

Project Team 👥

André da Fonseca Schuck

Gabriel de Souza Lima

Wagner Costa Santos

Technologies Used 💻

  • Qdrant: Vector search engine for efficient similar information retrieval.
  • Next.js: React framework for building modern and optimized web applications.
  • AI SDK (Vercel): Development kit for AI integration in web applications.
  • Llama 3.2 11B: Open Source language model used for experiments.
  • GPT-4o mini: OpenAI language model used in the chatbot.
  • BAAI/bge-m3: Embeddings model for advanced semantic search.

Getting Started

The search engine is based on the Qdrant vector database. For details on how to scrape the data, see the scraping directory.

Prerequisites

  • Python 3.11 or later

Installation

Create a virtual environment:

python3.11 -m venv .venv

Activate the virtual environment:

# Linux
source .venv/bin/activate

# Windows
.venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Note: To run the scraper, you also need to install the dependencies listed in the 01-scraper/requirements.txt file.

Scraping

See the scraping directory.

Global environment variables

Create a .env file in the root directory with the following variables:

QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=someapikey
ROOT_PROJECT_FOLDER=/home/someuser/unesp-chatbot # The root folder of the project
QDRANT_COLLECTION_NAME=UNESP_CHATBOT_PPGCC
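As a rough sketch of how a script could read these variables, the snippet below parses a .env file with the standard library only and checks that the four variables above are present. The helper names (`parse_env`, `load_settings`) are hypothetical, and the handling of trailing `# ...` comments is an assumption about the file format; the project scripts may well use a dedicated library instead.

```python
import os
from pathlib import Path

REQUIRED = (
    "QDRANT_URL",
    "QDRANT_API_KEY",
    "ROOT_PROJECT_FOLDER",
    "QDRANT_COLLECTION_NAME",
)


def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comment lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Assumption: " #" starts a trailing inline comment, as in the sample.
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def load_settings(env_path=".env"):
    """Merge the .env file with real environment variables (which win)."""
    path = Path(env_path)
    values = parse_env(path.read_text(encoding="utf-8")) if path.exists() else {}
    values.update({k: os.environ[k] for k in REQUIRED if k in os.environ})
    missing = [k for k in REQUIRED if k not in values]
    if missing:
        raise RuntimeError(f"Missing settings: {', '.join(missing)}")
    return values


sample = "ROOT_PROJECT_FOLDER=/home/someuser/unesp-chatbot # The root folder of the project"
print(parse_env(sample))
# → {'ROOT_PROJECT_FOLDER': '/home/someuser/unesp-chatbot'}
```

Failing early with a list of missing names is a design choice of this sketch; it makes a misconfigured .env file obvious before any connection to Qdrant is attempted.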

Preprocessing

python 03-preprocess-create-chunks/01-create-metadata.py

This script will create a metadata file:

02-preprocessed-data/content_metadata.json

Chunking content

  • Chunk creation also relies on two external files taken from other projects:
02-preprocessed-data/utils/page_titles.json
02-preprocessed-data/utils/remove-list.json

The page_titles.json file supplies the page titles (LLM-generated). The remove-list.json file lists pages to ignore because they are not relevant to the chatbot (empty pages or pages that contain only links). Pages with outdated information are removed as well.

  • Preprocess the data and create the chunks:
python 03-preprocess-create-chunks/02-create-chunks.py

Result:

02-preprocessed-data/01-chunks_data.json
02-preprocessed-data/chunks_stats.json # Stats about the chunks
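The chunking step described above can be pictured with the following sketch. Everything specific in it is an assumption made for illustration: the word-based window, the `max_words` and `overlap` values, the dictionary shapes for pages, titles, and the remove list, and the function names. The actual logic lives in 02-create-chunks.py and may differ.

```python
def chunk_words(text, max_words=200, overlap=40):
    """Split text into word windows that overlap by `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def build_chunks(pages, titles, remove_list, max_words=200, overlap=40):
    """Skip pages on the remove list and attach a title to every chunk."""
    skip = set(remove_list)
    records = []
    for url, text in pages.items():
        if url in skip:
            continue
        for index, chunk in enumerate(chunk_words(text, max_words, overlap)):
            records.append(
                {"url": url, "title": titles.get(url, ""), "chunk_index": index, "text": chunk}
            )
    return records


# Hypothetical inputs standing in for the scraped pages and the two helper files.
pages = {"/admission": "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9", "/old-news": "outdated text"}
titles = {"/admission": "Admission process"}
remove_list = ["/old-news"]

for record in build_chunks(pages, titles, remove_list, max_words=4, overlap=1):
    print(record["chunk_index"], record["text"])
# → 0 w0 w1 w2 w3
# → 1 w3 w4 w5 w6
# → 2 w6 w7 w8 w9
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighboring chunk, at the cost of storing some words twice.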

Upload to vector database (Qdrant)

  • Using Qdrant locally:
docker-compose up -d
  • Upload the data to Qdrant:
python 04-load-vector-db/main.py

We use the "BAAI/bge-m3" model to generate the embeddings, and we also use BM25 ("Qdrant/bm25") to generate the keyword-based scores.
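For readers unfamiliar with BM25, the sketch below implements the classic Okapi BM25 formula in plain Python. It only illustrates the scoring idea; it is not how the "Qdrant/bm25" model or the Qdrant server computes scores internally. The whitespace tokenizer, the parameters k1=1.5 and b=0.75, and the sample documents are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for a whitespace-tokenized query."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long documents need more matches to score well.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "master degree admission deadlines",
    "faculty research areas",
    "admission exam and admission documents",
]
scores = bm25_scores("admission deadlines", docs)
print(max(range(len(docs)), key=scores.__getitem__))  # index of the best match
```

Unlike the dense embeddings, this score rewards exact term overlap, which is why combining both signals tends to help with names, acronyms, and other literal matches.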

API

The API is a FastAPI application that uses the Qdrant database to search for answers (05-search-api).

To run the API:

python 05-search-api/service.py

You can access the API at http://localhost:8055/docs.

Chatbot

The chatbot is a Next.js application that uses the AI SDK to interact with the API (06-chatbot).

Environment variables:

Create a .env.local file in the 06-chatbot directory with the following variables:

# Get your OpenAI API Key here: https://platform.openai.com/account/api-keys
OPENAI_API_KEY=************
FIREWORKS_API_KEY=************

API_SEARCH_SERVER=************
AUTH_TRUST_HOST=true

# Generate a random secret: https://generate-secret.vercel.app/32 or `openssl rand -base64 32`
AUTH_SECRET=************

POSTGRES_URL=************

AUTH_GOOGLE_ID=************
AUTH_GOOGLE_SECRET=************

Install dependencies:

pnpm install

Run the chatbot:

pnpm dev

Infrastructure

Diagram (pricing in USD, November 2024)

Infrastructure Diagram

Monthly Costs

  • Digital Ocean droplet: $56.00 (R$336.00)

Important: These costs reflect November 2024 prices and usage by a small number of users. They will change with usage volume and the number of users.

Cost per query

  • Estimated tokens per query (average): 1700 input tokens, 300 output tokens
  1. GPT-4o-mini

    • Cost (input tokens): 1700 × ($0.15 / 1,000,000) = 1700 × $0.00000015 = $0.000255
    • Cost (output tokens): 300 × ($0.60 / 1,000,000) = 300 × $0.0000006 = $0.00018
    • Total cost: $0.000435
    • Total cost (BRL, at R$6 per USD): $0.000435 × 6 = R$0.00261
  2. Fireworks - LLaMA 3.2 11b

    • Cost (input tokens): 1700 × ($0.20 / 1,000,000) = 1700 × $0.0000002 = $0.00034
    • Cost (output tokens): 300 × ($0.20 / 1,000,000) = 300 × $0.0000002 = $0.00006
    • Total cost: $0.0004
    • Total cost (BRL, at R$6 per USD): $0.0004 × 6 = R$0.0024
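The arithmetic above can be reproduced with a few lines of Python. The function name and its parameters are hypothetical; the token counts, the per-million-token prices, and the exchange rate of 6 BRL per USD are the ones used in the estimates above.

```python
def cost_per_query(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output, brl_per_usd=6.0):
    """Return (cost in USD, cost in BRL) for one query.

    Prices are given in USD per one million tokens.
    """
    usd = (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000
    return usd, usd * brl_per_usd


gpt_usd, gpt_brl = cost_per_query(1700, 300, 0.15, 0.60)
llama_usd, llama_brl = cost_per_query(1700, 300, 0.20, 0.20)

print(f"GPT-4o-mini: ${gpt_usd:.6f} (R${gpt_brl:.5f})")
# → GPT-4o-mini: $0.000435 (R$0.00261)
print(f"LLaMA 3.2 11b: ${llama_usd:.6f} (R${llama_brl:.5f})")
# → LLaMA 3.2 11b: $0.000400 (R$0.00240)
```

Changing the token counts or prices in the calls gives a quick estimate for other models or longer prompts.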

Cost Summary

| Service | Cost (USD) | Cost (BRL) | Cost type |
| --- | --- | --- | --- |
| GPT-4o-mini | $0.000435 | R$0.00261 | Per query |
| Fireworks - LLaMA 3.2 11b | $0.0004 | R$0.0024 | Per query |
| Digital Ocean droplet | $56.00 | R$336.00 | Monthly |

Embeddings

The embeddings run on the Qdrant server locally using CPU. The embeddings are generated using the "BAAI/bge-m3" model. The Qdrant server also handles the BM25 scoring.
