Retrieval-Augmented Generation (RAG) Focused on Farsi Language

Overview

This project implements Retrieval-Augmented Generation (RAG) to improve the accuracy of information retrieval and response generation in Farsi. It combines a retrieval step with a generation model, so that answers are grounded in retrieved text rather than in the model's parameters alone.

Technologies Used

Models: microsoft/Phi-3-mini-4k-instruct
Data Processing: Data read and extracted from files, then cleaned and normalized with Hazm and Dadmatools (Farsi NLP toolkits)
Embeddings: Text embeddings generated as input for the deep learning models
Paraphrasing: Utilized paraphrasing techniques to enhance generated responses
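
To make the normalization step concrete, here is a minimal sketch of the kind of cleanup a Farsi normalizer performs. It is a simplified, standard-library stand-in written for illustration and is not how Hazm or Dadmatools are implemented; the function name normalize_farsi and the choice of rules are assumptions.

```python
import re
import unicodedata

# Arabic code points that are commonly mapped to their Persian equivalents.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian kaf
})


def normalize_farsi(text: str) -> str:
    """Simplified Farsi normalization (hypothetical helper, not Hazm's API)."""
    # Unify Arabic and Persian variants of the same letter.
    text = text.translate(ARABIC_TO_PERSIAN)
    # Drop combining marks such as fatha, kasra, and tanwin.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Collapse runs of whitespace; the zero-width non-joiner is left untouched.
    return re.sub(r"\s+", " ", text).strip()
```

A dedicated toolkit handles many more cases (spacing around affixes, digits, punctuation), so this sketch only illustrates why normalization matters before text is embedded or compared.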

Project Steps

Install Dependencies: Installed necessary libraries and dependencies to work with the tools and models required for the project.
Load and Extract Dataset: Loaded the dataset and extracted relevant questions and answers from files to make data usable.
Upload Data: Uploaded data to Google Drive to avoid re-uploading in each session.
Read Data: Addressed challenges related to various data formats and extracted information into a usable format.
Data Splitting: Split data into training and testing sets in an 80-20 ratio for model evaluation.
Data Cleaning and Normalization: Tried several tools for Farsi text normalization; Dadmatools (A Python NLP Library for Farsi) and Hazm (Farsi NLP Toolkit) gave the best results.
Create Embeddings: Generated embeddings to prepare the data for deep learning models.
Load Pre-trained Model: Utilized the microsoft/Phi-3-mini-4k-instruct model for the RAG implementation.
Question Answering: Configured the model to generate responses and used it to answer questions.
Evaluation: Assessed the quality of responses using cosine similarity and sentence transformer models.
Paraphrasing: Applied paraphrasing techniques to further refine generated answers.
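
The steps above can be sketched end to end in a few functions. This is a minimal illustration under stated assumptions, not the project's code: the bag-of-words embed function is a toy stand-in for a learned embedding model, the prompt layout is invented for the example, and every function name here is hypothetical.

```python
import math
import random
from collections import Counter


def split_80_20(pairs, seed=0):
    """Shuffle (question, answer) pairs and split them 80/20 into train and test."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, pairs, k=2):
    """Return the k stored pairs whose question is most similar to the query."""
    q_vec = embed(question)
    ranked = sorted(pairs, key=lambda p: cosine(q_vec, embed(p[0])), reverse=True)
    return ranked[:k]


def build_prompt(question, retrieved):
    """Place the retrieved pairs ahead of the new question (assumed layout)."""
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

In a setup like this, the string returned by build_prompt would be passed to the generation model, and the held-out 20% of pairs would supply the questions and reference answers for evaluation.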

Results

Adding retrieval to the generation model improved the accuracy and relevance of responses. Paraphrasing the generated answers further improved response quality.
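
The cosine-similarity evaluation listed under Project Steps could be computed along the following lines. This sketch assumes that each generated answer and each reference answer has already been turned into a dense vector by a sentence-embedding model, and that the two are compared pairwise; both assumptions, and the function names, are illustrative rather than taken from the project.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_similarity(generated_vecs, reference_vecs):
    """Average pairwise similarity over a test set (hypothetical helper)."""
    scores = [cosine_similarity(g, r) for g, r in zip(generated_vecs, reference_vecs)]
    return sum(scores) / len(scores) if scores else 0.0
```

A score near 1 means the generated answer points in nearly the same direction as the reference in embedding space. It measures semantic closeness, not factual correctness.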

Challenges

Library Compatibility: Faced version conflicts between libraries, particularly with PyTorch (torch), which were resolved by switching to a compatible version.
Normalization Tools: Encountered difficulties finding effective normalization tools for Farsi, with Dadmatools and Hazm ultimately being the most effective.
Data Handling: Manual data handling and custom data loading presented challenges.
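
One way to catch version conflicts early is to print the installed versions at the start of a session. The sketch below uses only the standard library; the helper name installed_versions and the package list in the usage note are assumptions for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found
```

For example, installed_versions(["torch", "hazm", "dadmatools"]) reports which of the three are present in the current environment and at which version.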

Learnings and Future Work

This project provided valuable insights into RAG technology and its application to Farsi language processing. Future work may include exploring more accurate models and improving data processing techniques. Additionally, optimizing performance and expanding the system's capabilities for other languages could be considered.

Contributing

Contributions to improve the project are welcome. Please follow the standard GitHub workflow for contributions, including forks, pull requests, and issue tracking.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Repository: Bahareh0281/RAG-Text-Generation
