Extending GPT-2 Context Length via RoPE Scaling

Note: I chose GPT-2 because it was the only small model I could find that I could fine-tune easily without running out of memory, and that is old enough not to have RoPE pre-implemented. (QLoRA was making things harder and throwing too many errors, so I skipped it given the time constraints.)

Training Runs

Demo

  • Try the model here: GPT-2 Long Demo
  • Try giving it an input of more than 1k or 2k tokens: Demo

Evaluation

Approach

  • Use the rotary positional embedding implementation by lucidrains here
  • Change the model to use RoPE positional embeddings (a minimal sketch of this patch is shown after this list)
  • Save and upload the patched model to Hugging Face (to avoid OOM); the model can be found here
  • Load it and fine-tune separately on LongAlpaca-12k
  • These steps can be seen in notebook and notebook
  • For logs and other findings or docs, check logs and this
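
As a rough illustration of the patching step, here is a minimal sketch (not the exact code from the notebooks) that monkey-patches GPT2Attention to rotate queries and keys with lucidrains' rotary-embedding-torch. It assumes a transformers version in which GPT2Attention still exposes an `_attn(query, key, value, ...)` method; the function name `patch_gpt2_with_rope` is illustrative.

```python
import torch
from transformers import GPT2LMHeadModel
from transformers.models.gpt2.modeling_gpt2 import GPT2Attention
from rotary_embedding_torch import RotaryEmbedding


def patch_gpt2_with_rope(model: GPT2LMHeadModel) -> GPT2LMHeadModel:
    head_dim = model.config.n_embd // model.config.n_head
    rotary = RotaryEmbedding(dim=head_dim)
    # Register the rotary module on the model so model.to(device) also moves
    # its frequency buffer.
    model.rotary_emb = rotary

    original_attn = GPT2Attention._attn

    def rope_attn(self, query, key, value, attention_mask=None, head_mask=None):
        # query/key arrive here as (batch, heads, seq_len, head_dim), the
        # layout rotate_queries_or_keys expects. Position offsets for the KV
        # cache during generation are omitted in this sketch.
        query = rotary.rotate_queries_or_keys(query)
        key = rotary.rotate_queries_or_keys(key)
        return original_attn(self, query, key, value, attention_mask, head_mask)

    GPT2Attention._attn = rope_attn

    # GPT-2's learned absolute position embeddings would otherwise be added on
    # top of RoPE; zeroing them lets rotary embeddings carry all position info.
    with torch.no_grad():
        model.transformer.wpe.weight.zero_()
    return model


model = patch_gpt2_with_rope(GPT2LMHeadModel.from_pretrained("gpt2"))
```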

Note:

  • I get that the ideal way to apply patches to models would be something like this kaiokendev impl, but this was my first time doing this and time was limited, so I just used whatever worked. A rough sketch of the position-interpolation idea behind that approach is shown below.
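
For reference, the core idea behind that kind of patch (kaiokendev-style RoPE scaling, i.e. position interpolation) is to divide the position indices by a scale factor before computing the rotary angles, so longer sequences map back into the position range seen during pretraining. A standalone sketch, not taken from this repo's code:

```python
import torch

def scaled_rope_angles(seq_len: int, head_dim: int, scale: float = 2.0,
                       base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies, one per pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Compress positions by `scale`: with scale=2, 2048 tokens span the same
    # angle range that 1024 tokens did originally.
    positions = torch.arange(seq_len).float() / scale
    return torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2) angles
```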