Save jpeg in fuel/h5py #360

Open
markusnagel opened this issue Aug 4, 2016 · 2 comments

@markusnagel

Hello,

I am currently working with image datasets of about 1 million images. Saving them as NumPy arrays (with dtype uint8) would result in a dataset file of over 1TB, which is far too large.
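
For scale, the totals quoted in this post can be turned into rough per-image figures. This is only a back-of-the-envelope sketch: it uses decimal units and assumes 3-channel images at one byte per channel, which the post does not state.

```python
# Back-of-the-envelope figures implied by the numbers in the post (decimal units).
NUM_IMAGES = 1_000_000
RAW_TOTAL_BYTES = 1e12      # "over 1TB" when stored as uint8 arrays
JPEG_TOTAL_BYTES = 100e9    # "less than 100GB" when stored as JPEG

raw_per_image = RAW_TOTAL_BYTES / NUM_IMAGES      # bytes per decoded image
jpeg_per_image = JPEG_TOTAL_BYTES / NUM_IMAGES    # bytes per JPEG file
pixels_per_image = raw_per_image / 3              # assumption: RGB, 1 byte per channel
ratio = RAW_TOTAL_BYTES / JPEG_TOTAL_BYTES        # size reduction from keeping JPEG

print(f"raw:  {raw_per_image / 1e6:.1f} MB/image (~{pixels_per_image:,.0f} RGB pixels)")
print(f"jpeg: {jpeg_per_image / 1e3:.0f} kB/image, about {ratio:.0f}x smaller")
```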

My solution is to save the images as JPEG (their original format), which should reduce the dataset size to less than 100GB. I was very enthusiastic when I saw that there is a Transformer called ImageFromBytes. It does exactly what I want: load the image with PIL from binary and then transform it to NumPy.
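
The decoding step described above (bytes, then PIL, then NumPy) can be sketched as follows. This is not Fuel's transformer, and it may differ from it in details such as axis order; `decode_image` is a hypothetical helper name, and the tiny in-memory JPEG only stands in for a real file.

```python
import io

import numpy as np
from PIL import Image


def decode_image(raw_bytes):
    """Decode encoded image bytes (e.g. JPEG) into a uint8 array of shape (H, W, 3)."""
    with Image.open(io.BytesIO(raw_bytes)) as img:
        return np.asarray(img.convert("RGB"))


# Build a tiny JPEG in memory so the example needs no file on disk.
buffer = io.BytesIO()
Image.new("RGB", (4, 3), color=(255, 0, 0)).save(buffer, format="JPEG")
jpeg_bytes = buffer.getvalue()

pixels = decode_image(jpeg_bytes)
print(pixels.shape, pixels.dtype)
```

Note that PIL takes sizes as (width, height), while the resulting array is indexed as (height, width, channels).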

So my only problem now is how to get the JPEG images into a Fuel-compatible HDF5 file. Reading the binary is easy:

fin = open(filename, 'rb')
binary_data = fin.read()

But I cannot manage to save it in h5py. If I do:

f = h5py.File('foo.hdf5')
dt = h5py.special_dtype(vlen='str')
dset = f.create_dataset('binary', (100, ), dtype=dt)
dset[0] = binary_data

I get the following error:

tid = h5t.py_create(dtype, logical=1)
...
ValueError: Size must be positive (Size must be positive)

If I change the dtype line to dt = h5py.special_dtype(vlen=bytes), then I get ValueError: VLEN strings do not support embedded NULLs, as described in http://docs.h5py.org/en/latest/strings.html#variable-length-ascii. But avoiding NULL bytes is not possible in a binary string from JPEG images.
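
To illustrate why NULL bytes are unavoidable, here is a small sketch using the first bytes of a typical JFIF-style JPEG file. Not every JPEG starts with exactly this header, so treat it as a representative example rather than a rule.

```python
# First bytes of a typical JFIF-style JPEG: SOI marker (FF D8), APP0 marker (FF E0),
# a two-byte segment length (00 10), then the identifier "JFIF" followed by a NUL.
jfif_header = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"

has_nul = b"\x00" in jfif_header
first_nul = jfif_header.index(b"\x00")

# A NUL-terminated string type would keep only the bytes before the first NUL.
truncated = jfif_header.split(b"\x00")[0]

print(has_nul, first_nul, len(truncated), len(jfif_header))
```

Any storage type that treats NULL as a terminator would therefore cut the data off within the first few bytes of the file.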

Changing the last line to dset[0] = np.void(binary_data) does not seem to help either (this was suggested here: http://docs.h5py.org/en/latest/strings.html#how-to-store-raw-binary-data).

In general, I tried many different things and combinations from these three tutorials:
http://docs.h5py.org/en/latest/special.html
http://docs.h5py.org/en/latest/strings.html
http://fuel.readthedocs.io/en/latest/h5py_dataset.html#variable-length-data

It would be great if anyone could point me to the correct solution.

Thanks a lot,
Markus

@dwf
Contributor

dwf commented Aug 4, 2016

I'd recommend looking at the implementation of the ilsvrc2010 converter; somehow that thing works.

Also, I've found the throughput when working with JPEG-in-HDF5 to be...
disappointing. I haven't had time to really dig into it that much.

@markusnagel
Author

Thanks for the hint, David. I should have thought to look there myself ;)
I have now found the solution: they save each image as a NumPy array of 'uint8', which does not have the problem with NULL bytes (I assume the length is stored with the data, rather than NULL marking the end of the string).
So the working code is now:

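import numpy as np
import h5py

filename = 'test.jpg'
# Read the raw JPEG bytes; the with-block closes the file afterwards
with open(filename, 'rb') as fin:
    binary_data = fin.read()

# 'w' creates the file, overwriting any existing foo.hdf5
with h5py.File('foo.hdf5', 'w') as f:
    # Variable-length uint8: each element is a 1-D byte array of its own length
    dt = h5py.special_dtype(vlen=np.dtype('uint8'))
    dset = f.create_dataset('binary_data', (100, ), dtype=dt)

    # Save the bytes as a uint8 array (np.frombuffer replaces the deprecated np.fromstring)
    dset[0] = np.frombuffer(binary_data, dtype='uint8')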
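
As a quick check of this approach, the following sketch writes a few byte strings containing NULL bytes into a variable-length uint8 dataset and reads them back unchanged. The byte strings, file name, and dataset name are made up for illustration; they are not real JPEG data, and the example does not set up anything Fuel-specific.

```python
import h5py
import numpy as np

# Hypothetical stand-ins for JPEG files: byte strings of different lengths,
# each containing NULL bytes.
blobs = [b"\xff\xd8\x00\x01\x02", b"\x00\x00\xff\xd9", b"\xff\x00"]

# driver="core" with backing_store=False keeps the file in memory only.
with h5py.File("roundtrip_example.hdf5", "w", driver="core", backing_store=False) as f:
    dt = h5py.special_dtype(vlen=np.dtype("uint8"))
    dset = f.create_dataset("encoded_images", (len(blobs),), dtype=dt)
    for i, blob in enumerate(blobs):
        dset[i] = np.frombuffer(blob, dtype="uint8")

    # Each element comes back as a 1-D uint8 array; tobytes() restores the bytes.
    restored = [dset[i].tobytes() for i in range(len(blobs))]

print(restored == blobs)
```

The bytes recovered this way can then be passed to an image decoder such as PIL.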
