Add the ability to read directly from hdf5 files (for large datasets) as well as numpy arrays. #12

Open
wants to merge 5 commits into
base: master

@kryczko kryczko commented Dec 5, 2018

Dear author,

I have found your module quite useful, and I think my changes, which allow reading directly from hdf5 files, make it much more useful for deep learning applications with larger datasets.

With only numpy arrays, you're restricted to loading everything in memory. With hdf5 and Keras, this is not the case.
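To illustrate the difference, here is a minimal sketch of reading an hdf5 file one batch at a time with h5py, so that only the current slice is held in memory. It is not the code from this PR: the dataset names `x` and `y`, the helper names `write_sample_hdf5` and `iter_batches`, and the random sample data are all assumptions made for the example.

```python
import os
import tempfile

import h5py
import numpy as np


def write_sample_hdf5(path, n_samples=100, n_features=8, seed=0):
    """Write a small random dataset to `path` (a stand-in for a large file)."""
    rng = np.random.RandomState(seed)
    with h5py.File(path, "w") as f:
        f.create_dataset("x", data=rng.rand(n_samples, n_features).astype("float32"))
        f.create_dataset("y", data=rng.randint(0, 2, size=n_samples))


def iter_batches(path, batch_size=32):
    """Yield (x, y) batches; each slice is read from disk only when requested."""
    with h5py.File(path, "r") as f:
        x, y = f["x"], f["y"]  # Dataset handles; no array data is loaded yet
        n_samples = x.shape[0]
        for start in range(0, n_samples, batch_size):
            stop = min(start + batch_size, n_samples)
            # Slicing a Dataset returns a NumPy array holding just this range.
            yield x[start:stop], y[start:stop]


# Hypothetical usage: write a sample file, then stream it back in batches.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.h5")
write_sample_hdf5(sample_path)
batch_shapes = [xb.shape for xb, _ in iter_batches(sample_path)]
```

With 100 samples and a batch size of 32, the loop yields three full batches and one final batch of 4 rows.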

Please let me know if there are tests that I should run. I have already tested some of my code locally and have successfully read directly from hdf5 files with multiple workers concurrently.

Thanks,
Kevin Ryczko

Owner

Avsecz commented Dec 5, 2018

Awesome! I didn't know about the shuffle='batch' trick - very neat!
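For context, Keras documents `shuffle='batch'` in `fit` as a special option for HDF5 data that shuffles in batch-sized chunks, so each batch is still a contiguous range on disk. The sketch below only illustrates that idea in plain Python. It is not Keras' implementation, and the function name `batch_shuffle` and its signature are assumptions made for this example.

```python
import random


def batch_shuffle(n_samples, batch_size, seed=None):
    """Return sample indices with whole batches shuffled, not single samples.

    Indices inside each batch stay contiguous, which keeps every read from
    an HDF5 dataset a simple range slice.
    """
    rng = random.Random(seed)
    n_full = n_samples // batch_size
    # One list of contiguous indices per full batch.
    batches = [
        list(range(i * batch_size, (i + 1) * batch_size)) for i in range(n_full)
    ]
    rng.shuffle(batches)  # reorder the batches themselves
    order = [index for batch in batches for index in batch]
    # Any leftover partial batch is appended unshuffled at the end.
    order.extend(range(n_full * batch_size, n_samples))
    return order


order = batch_shuffle(n_samples=10, batch_size=4, seed=0)
```

Every index appears exactly once, the two full batches may swap places, and the leftover indices 8 and 9 always come last.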

> successfully read directly from hdf5 files with multiple workers concurrently.

In the current PR, the data are not loaded using multiple workers, are they?

Regarding the tests: can you parametrize the existing tests over the data function, along these lines?

Old:

def test_compilefn_train_test_split(tmpdir):
    fn = CompileFN(db_name, exp_name,
                   data_fn=data.data,
                   model_fn=model.build_model,
                    ....)

New:

import pytest

@pytest.mark.parametrize("data_fn", [data.data, data.data_hdf5])
def test_compilefn_train_test_split(data_fn, tmpdir):
    fn = CompileFN(db_name, exp_name,
                   data_fn=data_fn,
                   model_fn=model.build_model,
                    ....)

Author

kryczko commented Dec 5, 2018

My apologies. By multiple workers, I meant running multiple MongoDB workers through KMongoTrials.

I'll generate some sample data, and add your suggested changes.

Author

kryczko commented Dec 20, 2018

Okay, I finally got around to this. I could not use the dataset that the tests currently use because of its data format, so I took the cifar10 dataset from keras and wrote part of it to disk. This also made it difficult to incorporate the tests properly, so please have a look and see if things are okay.

Owner

Avsecz commented Jan 3, 2019

@kryczko do the tests work for you locally? It seems that they fail when checking fn_test. Can you fix them so that they work with hdf5?

Author

kryczko commented Jan 3, 2019 via email

Owner

Avsecz commented Jan 3, 2019

Install pytest and run $ pytest from the repository root.
