
refactor code to calculate records per shard using n_volumes and number of shards #328

Open
hvgazula opened this issue Apr 20, 2024 · 2 comments

@hvgazula
Contributor

import tensorflow as tf

# dataset, compression_type, parse_fn, and num_parallel_calls come from
# the surrounding code.
first_shard = (
    dataset.take(1)  # take the first shard file
    .flat_map(
        lambda x: tf.data.TFRecordDataset(x, compression_type=compression_type)
    )
    .map(map_func=parse_fn, num_parallel_calls=num_parallel_calls)
)
# Count the records by iterating over the entire first shard.
block_length = len([0 for _ in first_shard])

If the number of volumes in the shard is large, this snippet can be time-consuming because it iterates over every record in the shard just to count them. Alternatives are

  • use a combination of n_volumes and the number of files matching file_pattern to calculate len(first_shard) (see the sketch after this list)
  • store metadata alongside the shards: the number of volumes in each shard as well as the total number of volumes in the dataset
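
A minimal sketch of the first alternative, assuming the writer fills every shard equally except possibly the last; records_per_shard is a hypothetical helper, and file_pattern and n_volumes are assumed to be available from the surrounding code:

import math

import tensorflow as tf

def records_per_shard(file_pattern: str, n_volumes: int) -> int:
    # Hypothetical helper, not from the codebase: infer the per-shard
    # record count from the total volume count and the number of shard
    # files, instead of iterating over the first shard.
    num_shards = len(tf.io.gfile.glob(file_pattern))
    return math.ceil(n_volumes / num_shards)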
@hvgazula hvgazula self-assigned this Apr 20, 2024
@hvgazula
Contributor Author

hvgazula commented May 11, 2024

Ideally, if the TFRecords are created using the API, the aforementioned change guarantees the same number of records in every shard except the last. If n_volumes is not specified, it can then be calculated as num_records_first_shard * (num_shards - 1) + num_records_in_last_shard (see the sketch below).
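
A minimal sketch of that fallback; count_records and infer_n_volumes are hypothetical names, not part of the API:

import tensorflow as tf

def count_records(path: str, compression_type=None) -> int:
    # Count the records in one TFRecord shard by iterating over it once.
    ds = tf.data.TFRecordDataset(path, compression_type=compression_type)
    return sum(1 for _ in ds)

def infer_n_volumes(shard_paths, compression_type=None) -> int:
    # Assumes every shard except the last holds the same number of records,
    # so only the first and last shards need to be scanned.
    first = count_records(shard_paths[0], compression_type)
    last = count_records(shard_paths[-1], compression_type)
    return first * (len(shard_paths) - 1) + last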
