Add the ability to read directly from hdf5 files (for large datasets) as well as numpy arrays. #12

kryczko · 2018-12-05T16:42:45Z

Dear author,

I have found your module quite useful, and I think my changes, which allow reading directly from hdf5 files, make it much more impactful for deep learning applications with larger datasets.

With only numpy arrays, you're restricted to loading everything in memory. With hdf5 and Keras, this is not the case.

Please let me know if there are tests that I should run. I have already tested some of my code locally and have successfully read directly from hdf5 files with multiple workers concurrently.

Thanks,
Kevin Ryczko
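
The memory point in the comment above can be illustrated with a minimal sketch. It assumes the h5py package and a hypothetical file layout with datasets named `x` and `y`; neither the names nor the helper functions come from this PR. Slicing an h5py dataset reads only the requested rows from disk, so batches can be produced without holding the whole array in memory.

```python
import os
import tempfile

import h5py
import numpy as np


def write_demo_file(path, n_samples=10, n_features=4):
    """Write a small demo dataset ('x', 'y' are hypothetical names) to an HDF5 file."""
    x = np.arange(n_samples * n_features, dtype="float32").reshape(n_samples, n_features)
    y = np.arange(n_samples, dtype="int64")
    with h5py.File(path, "w") as f:
        f.create_dataset("x", data=x)
        f.create_dataset("y", data=y)


def iter_batches(path, batch_size):
    """Yield (x, y) batches, reading one contiguous slice from disk at a time."""
    with h5py.File(path, "r") as f:
        n = f["x"].shape[0]
        for start in range(0, n, batch_size):
            stop = min(start + batch_size, n)
            # Slicing an h5py dataset returns a numpy array holding only that slice.
            yield f["x"][start:stop], f["y"][start:stop]


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
    write_demo_file(demo_path)
    for xb, yb in iter_batches(demo_path, batch_size=4):
        print(xb.shape, yb.tolist())
```

Only one batch lives in memory at a time here, which is the property that makes hdf5 attractive for datasets larger than RAM.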

Avsecz · 2018-12-05T16:53:14Z

Awesome! I didn't know about the shuffle='batch' trick - very neat!
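
For context, Keras's `fit` accepts `shuffle='batch'`, which shuffles in batch-sized chunks so that each batch remains a contiguous slice of an hdf5 dataset. The sketch below only illustrates that idea with the standard library; `batch_shuffle` here is a hypothetical helper written for this example, not the code from Keras or from this PR.

```python
import random


def batch_shuffle(n_samples, batch_size, seed=None):
    """Return sample indices with the batch order shuffled but each batch kept contiguous.

    A trailing partial batch, if any, stays at the end.
    """
    rng = random.Random(seed)
    n_full = n_samples // batch_size
    # Build one contiguous index block per full batch.
    batches = [list(range(i * batch_size, (i + 1) * batch_size)) for i in range(n_full)]
    # Shuffle the order of the blocks, not the indices inside them.
    rng.shuffle(batches)
    remainder = list(range(n_full * batch_size, n_samples))
    return [i for batch in batches for i in batch] + remainder


if __name__ == "__main__":
    print(batch_shuffle(10, 3, seed=0))
```

Because every block is a run of consecutive indices, each batch can be read from disk with a single slice, which is what hdf5 handles efficiently.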

> successfully read directly from hdf5 files with multiple workers concurrently.

In the current PR, the data are not loaded using multiple workers, are they?

Regarding the tests, can you:

- store a small hdf5 dataset in tests/data (use all or part of the output of the current test data() function)
- write a function data_hdf5 in https://github.com/Avsecz/kopt/blob/master/tests/data.py that loads/prepares the hdf5 dataset
- refactor the code in https://github.com/Avsecz/kopt/blob/master/tests/test_hyopt.py so that the data function (or its name) is passed as an argument to the tests, letting multiple data functions be tested at once:

Old:

def test_compilefn_train_test_split(tmpdir):
    fn = CompileFN(db_name, exp_name,
                   data_fn=data.data,
                   model_fn=model.build_model,
                    ....)

New:

import pytest

@pytest.mark.parametrize("data_fn", [data.data, data.data_hdf5])
def test_compilefn_train_test_split(data_fn, tmpdir):
    fn = CompileFN(db_name, exp_name,
                   data_fn=data_fn,
                   model_fn=model.build_model,
                    ....)

kryczko · 2018-12-05T16:59:21Z

My apologies, by multiple workers I meant using MongoDB with multiple workers through the KMongoTrials function.

I'll generate some sample data, and add your suggested changes.

kryczko · 2018-12-20T15:36:45Z

Okay, so I finally got around to this. I could not use the dataset currently used in the tests because of its data format, so I used the cifar10 dataset from keras and wrote part of it to disk. This also made it difficult to incorporate the tests properly, so please have a look and see whether things are okay.

Avsecz · 2019-01-03T18:55:54Z

@kryczko do the tests work for you locally? It seems that they fail when checking fn_test. Can you fix them so they work with hdf5?

kryczko · 2019-01-03T19:01:48Z

How should I run them locally? Kevin

Avsecz · 2019-01-03T19:03:54Z

Install pytest and run $ pytest from the repository root.

Kevin Ryczko added 2 commits December 5, 2018 11:33

modified to be able to handle training directly from h5 files with keras

d3e35ad

added another comment and indenting to keep same style

3a10e2b

pushing up changes and small dataset

843a9f1

Avsecz added 2 commits January 1, 2019 23:51

Update test_hyopt.py

a156482

Update data.py

d100ca9
