Questioning the Cost of Data translation with ChatGPT Turbo #3
Comments
I would like to know this too. Right now I am at 4% with about €0.25 in costs, so the total could even be around €6. I will report back once I have finished the German translation. Right now it is really slow, almost as if it were stuck at 4%. EDIT: It looks like I hit the rate limit. After some experiments I am now down to 25 parallel calls. This is very slow, but it seems to work.
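A rough sketch of the approach described above: cap the number of parallel calls and retry with exponential backoff when the rate limit is hit. The names `RateLimited`, `call_with_backoff`, and `translate_all` are hypothetical, and `translate` stands in for whatever function actually calls the API; none of this is taken from the project's script.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimited(Exception):
    """Hypothetical stand-in for the rate-limit error your API client raises."""


def call_with_backoff(func, arg, max_retries=5, base_delay=1.0):
    """Call func(arg), sleeping with exponential backoff after each RateLimited error."""
    for attempt in range(max_retries):
        try:
            return func(arg)
        except RateLimited:
            if attempt == max_retries - 1:
                raise  # give up after the last allowed attempt
            # Delay doubles on each attempt, plus random jitter to spread out retries.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


def translate_all(translate, texts, max_workers=25, base_delay=1.0):
    """Translate texts with at most max_workers calls in flight; keeps input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda t: call_with_backoff(translate, t, base_delay=base_delay), texts)
        )
```

Lowering `max_workers` trades throughput for fewer rate-limit errors; the right value depends on the limits of the account in use.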
I have no idea how they did it for US$8. My cost was close to US$25. I did not translate to Portuguese, though.
I am trying this with Hindi. The generation results don't seem very good.
I highly recommend translating the cleaned dataset: https://github.com/gururise/AlpacaDataCleaned I will try to translate it into German in a few weeks, when the cleaning has progressed further.
If you look closely in
Only a chunk of the original instruction set is translated. You need to repeat this process by changing the
Translating the whole alpaca-lora/alpaca_data_cleaned_archive.json is somewhere around I'm curious: was the selected chunk 40000-55000 chosen for translation in the project because of its quality, or is it just random?
I translated the complete
Ah, it seems I miscalculated the mapping between JSON rows and instructions. Thank you for the correction. I'll just run the whole translation, but I think the larger dataset will take a lot more time to fine-tune.
Dropping tqdm in favour of simply counting, via a callback, how many futures have completed seems to double the overall speed of the threading job. There seem to be underlying issues with this library, which is used in many machine learning projects.
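A minimal sketch of what counting completed futures via a callback could look like, using only the standard library. The function name `run_with_progress` and its parameters are hypothetical, not the commenter's actual code, and the sketch makes no claim about the speed-up reported above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_with_progress(func, items, max_workers=25, report_every=100):
    """Run func over items in a thread pool, counting finished futures in a callback."""
    total = len(items)
    done_count = 0
    lock = threading.Lock()

    def on_done(_future):
        # Called once per future when it finishes (or immediately if already finished).
        nonlocal done_count
        with lock:
            done_count += 1
            if done_count % report_every == 0 or done_count == total:
                print(f"{done_count}/{total} completed")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(func, item) for item in items]
        for future in futures:
            future.add_done_callback(on_done)
    # Leaving the with-block waits for all tasks; results keep submission order.
    return [future.result() for future in futures]
```

The lock keeps the counter consistent when several worker threads finish at the same time.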
Nice. Didn't think of trying that before.
"We translated the alpaca_data.json to portuguese using ChatGPT. ....We paid around US$ 8.00 to translate the full dataset to portuguese."
The initial size of the data is approximately 20 million, and the cost of processing it with ChatGPT Turbo is $0.002 per 1,000 tokens. I am curious as to why the total cost is not close to $40.
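The arithmetic behind the $40 figure can be sketched as follows. The post does not say whether "20 million" counts tokens or characters, so both readings are shown; the 4-characters-per-token ratio is only a rough rule of thumb for English text and is an assumption here, and the function name is hypothetical. The sketch also ignores any tokens billed for the translated output.

```python
def estimate_cost_usd(total_tokens: int, usd_per_1k_tokens: float = 0.002) -> float:
    """Return the estimated API cost in US dollars for a given number of tokens."""
    return total_tokens / 1000 * usd_per_1k_tokens


DATA_SIZE = 20_000_000  # "approximately 20 million" from the post; unit not stated

# Reading 1: the figure is already a token count.
as_tokens = estimate_cost_usd(DATA_SIZE)

# Reading 2: the figure is a character count.
CHARS_PER_TOKEN = 4  # ASSUMPTION: rough rule of thumb, not a measured value
as_chars = estimate_cost_usd(DATA_SIZE // CHARS_PER_TOKEN)

print(f"if tokens:     ${as_tokens:.2f}")
print(f"if characters: ${as_chars:.2f}")
```

Counting tokens with an actual tokenizer on the dataset would give a firmer estimate than either reading.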
By the way, I appreciate you sharing the excellent suggestion for fine-tuning.