training #3

Open · armoreal opened this issue Feb 24, 2019 · 10 comments
@armoreal

Is there any way to train GPT-2 using my own text corpus?

@graykode
Owner

@armoreal Which language do you want? Is it English?

@armoreal
Author

In Russian.

@graykode
Owner

graykode commented Feb 24, 2019

@armoreal
First, the existing GPT-2 models only support English. openai/gpt-2#31
If you want to train on your own language, I recommend reading the original GPT and GPT-2 papers.
Please see Improving Language Understanding by Generative Pre-Training, section 3.1 (Unsupervised pre-training) and section 3.2 (Supervised fine-tuning)!
You can also find the GPT-2 WebText dataset here: https://github.com/eukaryote31/openwebtext
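
As a rough illustration of what section 3.1 means in practice, a minimal next-token-prediction training step in PyTorch might look like the sketch below. This is only an assumption-laden sketch: `model` stands for any causal LM that returns logits of shape (batch, seq_len, vocab), and `token_ids` is an already-tokenized batch; neither name comes from this repository or from OpenAI's code.

```python
import torch
import torch.nn.functional as F

def lm_loss(model, token_ids):
    # token_ids: LongTensor (batch, seq_len) of already-tokenized text (assumed)
    inputs = token_ids[:, :-1]    # u_1 .. u_{n-1}
    targets = token_ids[:, 1:]    # u_2 .. u_n, shifted by one position
    logits = model(inputs)        # assumed output shape: (batch, seq_len - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

# Training loop sketch (optimizer choice and learning rate are assumptions):
# optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
# for batch in dataloader:
#     loss = lm_loss(model, batch)
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
```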

@armoreal
Author

Thanks for your reply.
As far as I understand, GPT-2 was trained on English and that's the reason it doesn't support other languages, but I'd like to try training it on other languages using my own dataset. OpenAI replied about training in openai/gpt-2#19, so it's possible, but they aren't planning to release the training code yet.

@graykode
Owner

graykode commented Feb 24, 2019

@armoreal I think this repository could be used for training: https://github.com/openai/finetune-transformer-lm
However, there is no dataset for your language, and compute resources will be a problem, I think.
In the GPT-2 paper, they explain how GPT-2 differs from GPT.
The main issues for training will be the dataset (including how they pre-process it) and compute power.
[image]

@graykode
Owner

graykode commented Feb 24, 2019

@armoreal
See the code and the paper for more detail:
[image]

  1. Text prediction, here: https://github.com/openai/finetune-transformer-lm/blob/master/train.py#L176 (section 3.1, Unsupervised pre-training)
  2. Task classification, here: https://github.com/openai/finetune-transformer-lm/blob/master/train.py#L193 (section 3.2, Supervised fine-tuning)

L3(C) = L2(C) + λ∗L1(C)
https://github.com/openai/finetune-transformer-lm/blob/master/train.py#L205
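
As a hedged sketch of how L3(C) = L2(C) + λ∗L1(C) combines the two losses (roughly what that train.py line computes, but not code copied from it): the fine-tuning objective adds the task-classification loss L2 and the auxiliary language-modeling loss L1 weighted by λ. The tensor names below are assumptions for illustration only.

```python
import torch.nn.functional as F

lm_coef = 0.5  # the lambda weight; the GPT paper reports using 0.5

def combined_loss(clf_logits, labels, lm_logits, token_ids):
    # L2(C): supervised task-classification loss
    clf_loss = F.cross_entropy(clf_logits, labels)
    # L1(C): auxiliary language-modeling (next-token) loss on the same tokens
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        token_ids[:, 1:].reshape(-1),
    )
    # L3(C) = L2(C) + lambda * L1(C)
    return clf_loss + lm_coef * lm_loss
```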

@graykode
Owner

graykode commented Feb 24, 2019

Overall, there is code related to training, so you can train.
But the dataset and compute power may be a problem :(

Please do not close this issue, for everyone's benefit!

@guotong1988

Same question. Thank you.

@robertmacyiii

Is there a way to fine-tune this GPT-2 implementation on my own English corpus?

@radiodee1

radiodee1 commented Jun 1, 2019

I would like to fine-tune the PyTorch GPT-2 on an English corpus. Is the OpenAI code PyTorch or TF? Are there examples online in PyTorch?
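
For reference, OpenAI's released GPT/GPT-2 code is TensorFlow, while this repository is a PyTorch implementation. A fine-tuning loop in PyTorch could look roughly like the sketch below. It is only a sketch under assumptions: `model` stands for any causal LM returning logits of shape (batch, seq_len, vocab), `token_ids` is the whole corpus already encoded as a 1-D tensor, and the block size, batch size, and learning rate are placeholder values, not settings from this repository.

```python
import torch
import torch.nn.functional as F

BLOCK_SIZE = 512  # tokens of context per training example (assumed value)

def make_batches(token_ids, batch_size=4):
    # token_ids: 1-D LongTensor holding the entire encoded corpus (assumed)
    n_blocks = token_ids.size(0) // BLOCK_SIZE
    blocks = token_ids[: n_blocks * BLOCK_SIZE].view(n_blocks, BLOCK_SIZE)
    for i in range(0, n_blocks, batch_size):
        yield blocks[i : i + batch_size]

def finetune(model, token_ids, epochs=1, lr=5e-5, device="cpu"):
    model.to(device)
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in make_batches(token_ids):
            batch = batch.to(device)
            logits = model(batch[:, :-1])   # assumed output: (B, T - 1, vocab)
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                batch[:, 1:].reshape(-1),
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```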
