Merge pull request #13 from axioned/ns/derive-on-base
Updated Readme & gitignore
snirajaxioned authored Oct 11, 2023
2 parents 47c4b77 + 639f81d commit af9d172
Showing 2 changed files with 41 additions and 1 deletion.
3 changes: 2 additions & 1 deletion .gitignore
```diff
@@ -1,4 +1,5 @@
 venv
 .env
 __pycache__
-.DS_Store
+.DS_Store
+2-private-GPT/models/*
```
39 changes: 39 additions & 0 deletions README.md
@@ -0,0 +1,39 @@
# Why build it?

- The Kiwix team needs a GPT-like platform that can be used in remote areas

- It should be trained on custom data (flat files / ZIM files)

- Knowledge / tokenized data should be available offline and transferable from one system to another (see the sketch after this list)

- No GPU should be required to run the GPT (CPU only, on mid-size hardware)
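
As a rough illustration of the offline-transfer requirement, the sketch below saves a model and its tokenizer as plain files in a directory that can be copied between machines. It assumes Hugging Face transformers; the model name and paths are illustrative placeholders, not choices made in this repo.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Download once on a machine with internet access
# (the model name here is only an example).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Save everything as flat files so the directory can be copied
# (USB drive, rsync, etc.) to an offline system.
tokenizer.save_pretrained("./offline-model")
model.save_pretrained("./offline-model")

# On the offline system, load from the copied directory; no network needed.
tokenizer = AutoTokenizer.from_pretrained("./offline-model")
model = AutoModelForMaskedLM.from_pretrained("./offline-model")
```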

# What to build?

- Build a GPT-like platform that can perform Q&A offline

- It should be trained on specific data (it can extend already existing knowledge)

- Training can use flat files / ZIM files / JSON (Q/K/V pattern), etc.

- It should be deployable standalone on a small-to-mid-level system (see the sketch after this list)
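
A minimal sketch of the offline Q&A goal, assuming Hugging Face transformers and an extractive question-answering model already stored locally; the directory path and the sample context are hypothetical placeholders.

```python
from transformers import pipeline

# Load a question-answering model from a local directory that was
# copied onto this machine beforehand (path is an example assumption).
qa = pipeline("question-answering", model="./offline-qa-model")

# In the real system, the context would come from the custom data
# (flat files / ZIM dumps); this string is just a stand-in.
context = "Kiwix provides offline access to Wikipedia and other web content."
result = qa(question="What does Kiwix provide?", context=context)
print(result["answer"])
```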

# How to build it?

- Source ticket: https://github.com/kiwix/overview/issues/93

The planned pipeline (sketched in code after this list):

1. Use a pre-built model for the tokenizer
2. Read the flat file(s)
3. Tokenize the data
4. Create a config for an MLM / NSP model for training
5. Train
6. Store the knowledge / tokens in flat files (easy to share)
7. Test
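
A minimal sketch of steps 1-6 with Hugging Face transformers and datasets, assuming a plain-text flat file at data.txt and BERT-style MLM pretraining; every name, path, and hyperparameter here is an illustrative assumption rather than a settled choice.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer, BertConfig, BertForMaskedLM,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# 1. Pre-built tokenizer (model name is an example choice).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# 2. Read a flat text file (path is a placeholder).
dataset = load_dataset("text", data_files={"train": "data.txt"})

# 3. Tokenize the data.
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# 4. Config for an MLM model, kept small so CPU-only training stays feasible.
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256, num_hidden_layers=4, num_attention_heads=4,
)
model = BertForMaskedLM(config)

# The collator randomly masks tokens; predicting them is the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

# 5. Train (runs on CPU when no GPU is present).
args = TrainingArguments(
    output_dir="./knowledge",
    num_train_epochs=1,
    per_device_train_batch_size=8,
)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized, data_collator=collator)
trainer.train()

# 6. Store the trained weights and tokenizer as flat files (easy to share).
trainer.save_model("./knowledge")
tokenizer.save_pretrained("./knowledge")
```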

# Resources

- https://saturncloud.io/blog/training-a-bert-model-from-scratch-with-hugging-face-a-comprehensive-guide/
- https://huggingface.co/datasets/yelp_review_full
- https://huggingface.co/docs/transformers/training
