Feeding the model separate examples instead of one continuous block of text #17

CupOfGeo · 2021-10-26T20:18:31Z

Hello, I'm interested in adding this feature: a function in text2csv.py that takes a folder of texts, and then in run_clm.py padding and truncating them instead of applying the group_text function.

CupOfGeo · 2021-10-28T19:26:57Z

I'm using songs as my data. The line and newline spacing is important, and I would like the songs kept separate during fine-tuning so that the end of one song isn't the start of another.
I have it create the CSVs so that each row is a song, but when group_text is applied it concatenates them all and makes blocks of 1024 tokens. I'm looking into adding DataCollatorWithPadding, but I'm not having much luck at the moment.
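To make the idea concrete, here is a minimal sketch of what "pad and truncate instead of group_text" could look like on already-tokenized songs. It is plain Python rather than the actual run_clm.py code: the function name `pad_and_truncate`, the toy token IDs, and the choice of pad ID are all assumptions for illustration. The `-100` label value is the default index that PyTorch's cross-entropy loss ignores, so padded positions would not contribute to the loss.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def pad_and_truncate(
    examples: List[List[int]],
    block_size: int,
    pad_token_id: int,
) -> Dict[str, List[List[int]]]:
    """Turn each tokenized example into its own fixed-length row.

    Hypothetical helper: nothing is concatenated across examples, so one
    song never runs into the next.
    """
    batch: Dict[str, List[List[int]]] = {
        "input_ids": [],
        "attention_mask": [],
        "labels": [],
    }
    for ids in examples:
        ids = ids[:block_size]          # truncate examples that are too long
        n_pad = block_size - len(ids)   # padding needed for short examples
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Padded positions get IGNORE_INDEX so they are left out of the loss.
        batch["labels"].append(ids + [IGNORE_INDEX] * n_pad)
    return batch


# Toy usage: two "songs" with made-up token IDs and a tiny block size.
songs = [[11, 12, 13], [21, 22, 23, 24, 25, 26]]
batch = pad_and_truncate(songs, block_size=4, pad_token_id=50256)
```

With these toy inputs the short song is padded to four positions and the long one is cut to its first four tokens. Using 50256 as the pad ID here is only an assumption; the point is that the labels and attention mask hide the padding either way.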

I also noticed that it's using <|endoftext|> as both bos_token and eos_token. I'm wondering how that affects things, whether what I'm doing is even needed, or whether I should just put these tokens between my examples.
From the config.json in the model:
"bos_token_id": 50256,
"embed_dropout": 0,
"eos_token_id": 50256,
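For comparison, the other option mentioned above (keeping the concatenation but putting the end-of-text token between examples) might look roughly like this. Again this is a plain-Python sketch, not the repository's group_text: the name `concat_with_eos` and the toy token IDs are made up, and dropping the trailing remainder is an assumption about how the grouping step behaves. Only the ID 50256 comes from the config above.

```python
from typing import List

EOS_TOKEN_ID = 50256  # eos_token_id from the config.json above


def concat_with_eos(
    examples: List[List[int]],
    block_size: int,
    eos_token_id: int = EOS_TOKEN_ID,
) -> List[List[int]]:
    """Concatenate examples with an EOS separator, then cut fixed-size blocks."""
    stream: List[int] = []
    for ids in examples:
        stream.extend(ids)
        stream.append(eos_token_id)  # mark where this example ends
    # Assumption: the leftover tokens that do not fill a whole block are dropped.
    total = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, total, block_size)]


# Toy usage: three "songs" with made-up token IDs.
blocks = concat_with_eos([[1, 2], [3, 4, 5], [6]], block_size=4)
```

Here a block can still contain the end of one song and the start of the next, but the two are separated by 50256, which is the boundary signal the model would see. Whether that is enough, or whether fully separate padded examples work better for songs, is the open question in this issue.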

CupOfGeo mentioned this issue Oct 27, 2021

separate examples for finetuning #18

Closed
