Merge pull request #13 from axioned/ns/derive-on-base

Updated Readme & gitignore
axioned · Oct 11, 2023 · af9d172
2 parents 47c4b77 + 639f81d
commit af9d172

Showing 2 changed files with 41 additions and 1 deletion.
diff --git a/.gitignore b/.gitignore
@@ -1,4 +1,5 @@
 venv
 .env
 __pycache__
-.DS_Store
+.DS_Store
+2-private-GPT/models/*
diff --git a/README.md b/README.md
@@ -0,0 +1,39 @@
+# Why build?
+
+- The Kiwix team needs a GPT-like platform that can be used in remote areas
+
+- It should be trained on custom data (flat files / ZIM files)
+
+- Knowledge / tokenized data should be available offline and transferable from one system to another
+
+- Using the GPT should not require a GPU (a mid-range CPU only)
+
+# What to build?
+
+- Build a GPT-like platform that can perform Q/A offline
+
+- It should be trained on specific data (it can extend already existing knowledge)
+
+- Training can use flat files / ZIM files / JSON (Q/K/V pattern), etc.
+
+- It should be deployable standalone on a small to mid-level system and still return results
+
+# How to build?
+
+- Source ticket: https://github.com/kiwix/overview/issues/93
+
+1. Use a pre-built model for the tokenizer
+2. Read the flat file
+3. Tokenize the data
+4. Create the config for the MLM (masked language modeling) / NSP (next sentence prediction) model for training
+5. Train
+6. Store knowledge / tokens in flat files (easy to share)
+7. Test
+
+# Resources
+
+https://saturncloud.io/blog/training-a-bert-model-from-scratch-with-hugging-face-a-comprehensive-guide/
+
+https://huggingface.co/datasets/yelp_review_full
+
+https://huggingface.co/docs/transformers/training
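
The numbered steps in the README can be sketched roughly as below, using only the Python standard library. This is a minimal illustration under stated assumptions, not the project's implementation: `tokenize` is a hypothetical whitespace stand-in for the pre-built tokenizer, `mask_tokens` is a simplified version of MLM masking (it only swaps tokens for `[MASK]`), and `save_tokens` / `load_tokens` are hypothetical helpers that store tokens in a JSON flat file. Training itself is not shown.

```python
import json
import random
import tempfile
from pathlib import Path

MASK = "[MASK]"  # placeholder token, as used by BERT-style models


def tokenize(text):
    """Hypothetical stand-in tokenizer: lowercase, split on whitespace."""
    return text.lower().split()


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Simplified MLM masking: replace each token with MASK with
    probability mask_prob. Returns (masked_tokens, labels), where a
    label is the original token at masked positions and None elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def save_tokens(tokens, path):
    """Store tokens in a JSON flat file so they are easy to share."""
    Path(path).write_text(json.dumps(tokens), encoding="utf-8")


def load_tokens(path):
    """Read tokens back from a JSON flat file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Read a flat file (created here only for the demo).
        corpus = Path(tmp) / "corpus.txt"
        corpus.write_text("Kiwix serves offline content", encoding="utf-8")
        tokens = tokenize(corpus.read_text(encoding="utf-8"))

        # Store the tokens and read them back on "another system".
        out = Path(tmp) / "tokens.json"
        save_tokens(tokens, out)
        restored = load_tokens(out)

        # Prepare masked inputs and labels for MLM-style training.
        masked, labels = mask_tokens(restored, mask_prob=0.5)
        print(masked)
        print(labels)
```

In a real setup the stand-in tokenizer and masking would likely be replaced by a pre-built tokenizer and a library data collator, as described in the linked Hugging Face training guide; the JSON file is used here only because it is easy to copy between machines.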