comparing array_record's ArrayRecordDataSource + (grain data loader) to huggingface #143
ArrayRecord works best when using its ParallelRead methods, which use its internal threadpool. In Python, these methods are exposed by supplying a list of indices or a range to read, or by calling `read_all`.
You should see a performance boost after switching to these methods.
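To make the advice concrete, here is a minimal sketch of the two access patterns being contrasted. `FakeReader` is a hypothetical stand-in (the real `ArrayRecordReader` API may differ in names and signatures); the point is that a batched call pays one Python call's overhead for many records, while per-record indexing pays it once per record.

```python
# Hypothetical stand-in for a record reader; the real ArrayRecord reader
# services a batched read from an internal C++ threadpool.
class FakeReader:
    def __init__(self, records):
        self._records = records

    def read(self, indices):
        # Batched read: one call returns many records.
        return [self._records[i] for i in indices]

    def read_all(self):
        # Read every record in one call.
        return list(self._records)

reader = FakeReader([f"rec{i}".encode() for i in range(1000)])

# Slow pattern: one Python call (and, on disk, one seek) per record.
per_record = [reader.read([i])[0] for i in range(1000)]

# Fast pattern: hand the reader a chunk of indices at a time.
def read_in_chunks(reader, n, chunk=256):
    out = []
    for start in range(0, n, chunk):
        out.extend(reader.read(list(range(start, min(start + chunk, n)))))
    return out

chunked = read_in_chunks(reader, 1000)
assert per_record == chunked == reader.read_all()
```

With a real on-disk reader the batched path additionally overlaps I/O across threads, which is where the bulk of the speedup would come from.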
Yes, you are right. I should have looked into the function definition in more detail. Probably grain is not calling this either, hence the slowdown.
array_record/python/array_record_data_source.py, Lines 338 to 340 in 1f72e6d
grain uses array_record's data source:

```python
class ArrayRecordDataSource(array_record.ArrayRecordDataSource):
  """Data source for ArrayRecord files."""

  def __init__(self, paths: ArrayRecordDataSourcePaths):
    """Creates a new ArrayRecordDataSource object.

    See `array_record.ArrayRecordDataSource` for more details.

    Args:
      paths: A single path/FileInstruction or list of paths/FileInstructions.
    """
    super().__init__(paths)
    _api_usage_counter.Increment("ArrayRecordDataSource")

  def __getitem__(self, record_key: SupportsIndex) -> bytes:
    data = super().__getitem__(record_key)
    _bytes_read_counter.IncrementBy(len(data), "ArrayRecordDataSource")
    return data
```

I am curious what you think about grain. I can't use it as it is right now. I know this issue belongs to array_record, but still, iterating over it using the grain data loader takes too much time. I don't understand how the MaxText processing is even acceptable. Even iterating using it is this slow.
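The per-record `__getitem__` path above means a pure-Python call per record. One generic way around that, sketched below with toy classes (not the real grain or array_record API), is a wrapper that iterates a source in index batches, so each underlying call fetches many records at once:

```python
# Toy stand-in for ArrayRecordDataSource; the real class reads from disk
# and exposes batched reads through its reader.
class ToySource:
    def __init__(self, records):
        self._records = records

    def __len__(self):
        return len(self._records)

    def read(self, indices):  # assumed batched-read entry point
        return [self._records[i] for i in indices]

# Hypothetical wrapper: one batched call per `batch` records instead of
# one __getitem__ call per record.
class BatchedIter:
    def __init__(self, source, batch=128):
        self._source = source
        self._batch = batch

    def __iter__(self):
        n = len(self._source)
        for start in range(0, n, self._batch):
            indices = list(range(start, min(start + self._batch, n)))
            yield from self._source.read(indices)

src = ToySource([i.to_bytes(2, "big") for i in range(300)])
records = list(BatchedIter(src, batch=128))
assert records == src.read(list(range(300)))
```

Whether grain can be configured to drive the source this way is a separate question; the sketch only illustrates why batching the reads changes the constant factor of iteration.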
Hello,

I was comparing Hugging Face Datasets to array_record and found some unexpected results. I was wondering if I am using the library in a wrong way or if this is expected.

I created array_record data. Now let's try just iterating over it using grain: this takes around 6 hours to iterate over.

Okay, now let's just use array_record and exclude grain: this takes around 27 hours.

Let's see what happens if I use a Hugging Face dataset: 1 minute.

With default array_record parameters, compression is approximately 1.8x-2x better than with a Hugging Face dataset. The most optimized open-source library for training LLMs, MaxText, uses grain and array_record for its training. I was wondering whether, because large model forward passes take much more time than iterating, this is not a performance issue in practice, or am I doing something terribly wrong here?