comparing array_record's ArrayRecordDataSource + (grain data loader) to huggingface #143
ArrayRecord works best when using its ParallelRead methods, which use its internal threadpool. In Python, these methods are exposed by supplying a list of indices or a range to read, or by calling `read_all`.
You should see a performance boost after switching to these methods.
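To make the advice concrete, here is a minimal sketch of the two access patterns being contrasted. `FakeReader` is a hypothetical stand-in (the real `ArrayRecordReader` API may differ in names and signatures); the point is that a batched call pays one Python call's overhead for many records, while per-record indexing pays it once per record.

```python
# Hypothetical stand-in for a record reader; the real ArrayRecord reader
# services a batched read from an internal C++ threadpool.
class FakeReader:
    def __init__(self, records):
        self._records = records

    def read(self, indices):
        # Batched read: one call returns many records.
        return [self._records[i] for i in indices]

    def read_all(self):
        # Read every record in one call.
        return list(self._records)

reader = FakeReader([f"rec{i}".encode() for i in range(1000)])

# Slow pattern: one Python call (and, on disk, one seek) per record.
per_record = [reader.read([i])[0] for i in range(1000)]

# Fast pattern: hand the reader a chunk of indices at a time.
def read_in_chunks(reader, n, chunk=256):
    out = []
    for start in range(0, n, chunk):
        out.extend(reader.read(list(range(start, min(start + chunk, n)))))
    return out

chunked = read_in_chunks(reader, 1000)
assert per_record == chunked == reader.read_all()
```

With a real on-disk reader the batched path additionally overlaps I/O across threads, which is where the bulk of the speedup would come from.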
Yes, you are right. I should have looked into the function definition in more detail. Probably grain is not calling this either, hence the slowdown.
array_record/python/array_record_data_source.py, Lines 338 to 340 in 1f72e6d
grain uses array_record's data source:

```python
class ArrayRecordDataSource(array_record.ArrayRecordDataSource):
  """Data source for ArrayRecord files."""

  def __init__(self, paths: ArrayRecordDataSourcePaths):
    """Creates a new ArrayRecordDataSource object.

    See `array_record.ArrayRecordDataSource` for more details.

    Args:
      paths: A single path/FileInstruction or list of paths/FileInstructions.
    """
    super().__init__(paths)
    _api_usage_counter.Increment("ArrayRecordDataSource")

  def __getitem__(self, record_key: SupportsIndex) -> bytes:
    data = super().__getitem__(record_key)
    _bytes_read_counter.IncrementBy(len(data), "ArrayRecordDataSource")
    return data
```

I am curious what you think about grain. I can't use it as it is right now. I know this issue belongs to array_record, but still, iterating over it using the grain data loader takes too much time. I don't understand how the MaxText processing is even acceptable. Even iterating using it is this slow.
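The per-record `__getitem__` path above means a pure-Python call per record. One generic way around that, sketched below with toy classes (not the real grain or array_record API), is a wrapper that iterates a source in index batches, so each underlying call fetches many records at once:

```python
# Toy stand-in for ArrayRecordDataSource; the real class reads from disk
# and exposes batched reads through its reader.
class ToySource:
    def __init__(self, records):
        self._records = records

    def __len__(self):
        return len(self._records)

    def read(self, indices):  # assumed batched-read entry point
        return [self._records[i] for i in indices]

# Hypothetical wrapper: one batched call per `batch` records instead of
# one __getitem__ call per record.
class BatchedIter:
    def __init__(self, source, batch=128):
        self._source = source
        self._batch = batch

    def __iter__(self):
        n = len(self._source)
        for start in range(0, n, self._batch):
            indices = list(range(start, min(start + self._batch, n)))
            yield from self._source.read(indices)

src = ToySource([i.to_bytes(2, "big") for i in range(300)])
records = list(BatchedIter(src, batch=128))
assert records == src.read(list(range(300)))
```

Whether grain can be configured to drive the source this way is a separate question; the sketch only illustrates why batching the reads changes the constant factor of iteration.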
Hello,

I was comparing Hugging Face Datasets to array_record and found some unexpected results. I was wondering if I am using the library in a wrong way or if this is expected.

I created array_record data. Now let's try just iterating over it using grain: this takes around 6 hours to iterate over.

Okay, now let's just use array_record and exclude grain: this takes around 27 hours.

Let's see what happens if I use a Hugging Face dataset: 1 minute.

With default array_record parameters, compression is approximately 1.8x-2x better than with a Hugging Face dataset. The most optimized open-source library for training LLMs, MaxText, uses grain and array_record for its training. I was wondering whether, because large model forward passes take much more time than iterating, this is not a performance issue in practice, or am I doing something terribly wrong here?