Reading dataset of compound type containing uint8 #21

Open
kchaliki opened this issue Feb 4, 2015 · 2 comments

kchaliki commented Feb 4, 2015

Hello all, a question/remark regarding compound types.

If I create a compound type from h5py that looks like, say:

total size: 13 bytes

timestamp: NATIVE_UINT64
market: NATIVE_UINT8
price: NATIVE_FLOAT

and create a dataset of this type and write a few values out (from a numpy recarray), I expect those records to be stored on disk in the packed layout described by the datatype, i.e. using 13 bytes each.
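
(Just to spell out "packed": here is a small standalone Go snippet, independent of go-hdf5 and of the repro below, that decodes one such 13-byte record by hand, assuming the data is little-endian on disk, which is what the NATIVE_* types give on x86.)

package main

import (
    "encoding/binary"
    "fmt"
    "math"
)

// decodeTick interprets one packed 13-byte record laid out as
// timestamp at offset 0 (8 bytes), market at 8 (1 byte), price at 9 (4 bytes).
func decodeTick(rec []byte) (timestamp uint64, market uint8, price float32) {
    timestamp = binary.LittleEndian.Uint64(rec[0:8])
    market = rec[8]
    price = math.Float32frombits(binary.LittleEndian.Uint32(rec[9:13]))
    return
}

func main() {
    // Build one packed record by hand: timestamp=1234567890, market=1, price=1.45.
    rec := make([]byte, 13)
    binary.LittleEndian.PutUint64(rec[0:8], 1234567890)
    rec[8] = 1
    binary.LittleEndian.PutUint32(rec[9:13], math.Float32bits(1.45))

    ts, mkt, px := decodeTick(rec)
    fmt.Println(ts, mkt, px) // 1234567890 1 1.45
}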

If I then try to read this back into Go using go-hdf5, I can declare a struct that looks like

type tick struct {
    timestamp uint64
    market    uint8
    price     float32
}

which generally will not be 13 bytes but a few more, say 16, because of alignment of the members. So if I am going to read back into such a struct, I need to create a compound datatype with the same offsets/sizes as the struct in memory, so that HDF5 knows how to map the values from disk to memory.
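
To make the mismatch concrete, here is a small standalone snippet (plain Go, no go-hdf5) that just prints the compiler-chosen layout of the struct; on a typical 64-bit platform it reports a size of 16 with price at offset 12, whereas the file layout above has price at offset 9 and a total size of 13:

package main

import (
    "fmt"
    "unsafe"
)

type tick struct {
    timestamp uint64
    market    uint8
    price     float32
}

func main() {
    var t tick
    // The compiler inserts 3 bytes of padding after market to align price.
    fmt.Println("sizeof(tick):    ", unsafe.Sizeof(t))            // 16, not 13
    fmt.Println("offset timestamp:", unsafe.Offsetof(t.timestamp)) // 0
    fmt.Println("offset market:   ", unsafe.Offsetof(t.market))    // 8
    fmt.Println("offset price:    ", unsafe.Offsetof(t.price))     // 12 (the file has 9)
}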

However, the API retrieves the datatype of the on-file dataset and passes that into the read call along with the memory address of the beginning of my slice of structs. This ends up mapping the values incorrectly.

I can modify the h5py dataset-creation side to use, say, only 4- or 8-byte datatypes, and with matching types on the Go side the read works again, but only because the two layouts happen to coincide.

Is my understanding wrong or does the API need some refining?

Thanks!!

sbinet commented Feb 6, 2015

(apologies for the belated answer)

go-hdf5 should definitely handle this better.

could you attach a little h5py test script as a reproducer?

sbinet self-assigned this Feb 6, 2015

kchaliki commented Feb 6, 2015

Hello, thanks for coming back. Here is a repro: the Python script that creates the HDF5 file and the Go program that reads it. Obviously you need to change the path if you are not on Windows.

python

import h5py
import pandas as pd

data = [
    {
        'timestamp': 1234567890,
        'market': 1,
        'price': 1.45
    },
    {
        'timestamp': 1234567891,
        'market': 2,
        'price': 1.55
    },
]

h5_file = h5py.File('C:\\temp\\repro.h5', 'w')

# Packed compound type: 13 bytes total, no padding between members.
tick_type = h5py.h5t.create(h5py.h5t.COMPOUND, 13)
tick_type.insert('timestamp', 0, h5py.h5t.NATIVE_UINT64)
tick_type.insert('market', 8, h5py.h5t.NATIVE_UINT8)
tick_type.insert('price', 9, h5py.h5t.NATIVE_FLOAT)

records = pd.DataFrame(data).to_records()
_dataset = h5_file.create_dataset('foo', None, tick_type, records)
h5_file.close()

golang

package main

import (
    "fmt"
    "unsafe"

    "github.com/sbinet/go-hdf5"
)

type tick struct {
    timestamp uint64
    market    uint8
    price     float32
}

func main() {
    fname := "C:/temp/repro.h5"
    f, err := hdf5.OpenFile(fname, hdf5.F_ACC_RDONLY)
    if err != nil {
        fmt.Printf("could not open data file %s\n", fname)
        panic(err)
    }
    defer f.Close()

    dataset, err := f.OpenDataset("foo")
    if err != nil {
        panic(err)
    }

    numTicks := dataset.Space().SimpleExtentNPoints()
    fmt.Printf("Reading %d ticks into struct of size %d\n", numTicks, unsafe.Sizeof(tick{}))

    ticks := make([]tick, numTicks)
    if err := dataset.Read(&ticks); err != nil {
        panic(err)
    }

    // display the fields
    fmt.Printf("%+v\n", ticks)
}
