
Allow for appending to existing hsds "files" using hsload #86

Merged: 7 commits into HDFGroup:master on Jun 16, 2020

Conversation

@MRossol (Contributor) commented Apr 21, 2020

- Change mode of fin from "x" to "a"
- In obj_creation_handler, check if obj exists before creating
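
A minimal sketch of what these two changes amount to (illustrative only, not the actual hsload/utillib.py code; the domain path is hypothetical):

import h5pyd

# "x" fails if the target domain already exists; "a" opens an existing
# domain (or creates it), which is what makes appending possible.
fout = h5pyd.File("/shared/combined.h5", mode="a")  # hypothetical path

def object_create_helper(name, obj):
    # Skip creation when a previous hsload run already created the object.
    if name in fout:
        return
    # ... create the group/dataset in fout as before ...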

@jreadey (Member) commented Apr 21, 2020

Interesting -- I hadn't considered an append version of hsload!

I'd suggest adding an "--append" command line flag so that users don't inadvertently append to a domain when they didn't mean to.

What about root attributes (see: line 626 in utillib.py)? If there's a collision with an existing attribute, then you should skip and output a warning message like you do with links.
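
A rough sketch of both suggestions (hsload parses sys.argv directly, so the flag handling here is only illustrative, and fin/fout stand for the usual source file and target domain):

import sys
import logging

def parse_append_flag():
    # Hypothetical parsing of the suggested flag, in the spirit of
    # hsload's own sys.argv handling:
    if "--append" in sys.argv:
        sys.argv.remove("--append")
        return True
    return False

def copy_root_attrs(fin, fout):
    # Skip attribute collisions with a warning, mirroring how link
    # collisions are already handled:
    for name in fin.attrs:
        if name in fout.attrs:
            logging.warning("attribute %s already exists, skipping", name)
            continue
        fout.attrs[name] = fin.attrs[name]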

@MRossol (Contributor, Author) commented Apr 21, 2020

I have my own version of this for some of the WTK data. It's stored across multiple (11) .h5 files. Each file has a set of unique datasets and then redundant meta and time_index datasets. In HSDS I just loaded them into a single "file", which is very useful.

Great comments, will do on the --append flag.

I did a pretty significant refactor locally that I can push up that fixes some potential pitfalls, like re-copying existing data.

@jreadey (Member) commented Apr 21, 2020

Ok - I just checked in some minor changes so you might want to merge those first.

Commits pushed:
- load_file -> check if obj exists before creating
- consolidate object helpers into a single function to allow skipping existing objects
@MRossol (Contributor, Author) commented May 27, 2020

Added the append option (I'm used to using click, so please check that I integrated it correctly with sys.argv...) and changed the object_helper approach to do a single pass through the objects, skipping existing objects (needed during append).
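
Roughly, the single-pass shape is this (a sketch only; it assumes the surrounding utillib.py context of fin, fout, dataload, logging, and the per-object helpers, whose names are illustrative):

def object_helper(name, obj):
    # One pass per object: create it, copy attributes, create links,
    # and copy data as needed.
    if name in fout:
        logging.info("%s already exists, skipping creation", name)
    else:
        object_create_helper(name, obj)    # create group/dataset in fout
    object_attribute_helper(name, obj)     # copy attributes (skip collisions)
    object_link_helper(name, obj)          # create soft/external links
    if dataload and isinstance(obj, h5py.Dataset):
        object_copy_helper(name, obj)      # copy dataset data

fin.visititems(object_helper)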

@jreadey (Member) commented May 27, 2020

I had forgotten about this! I'll take a look tomorrow.

@MRossol (Contributor, Author) commented May 27, 2020

It kept on getting pushed down my todo list... so no rush!

@jreadey (Member) commented May 29, 2020

There are a few edge cases where creating the links and copying the objects would cause problems, e.g. when you have multiple links pointing to the same object.

Would it cause problems to go back to the multi-pass approach?
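
A small h5py repro of that edge case (file name and layout are hypothetical):

import h5py

with h5py.File("multilink.h5", "w") as f:
    dset = f.create_dataset("dset1", data=[1, 2, 3])
    f["dset1_alias"] = dset  # second hard link to the same object

    # visititems() visits each *object* only once, even when several
    # links point to it, so a single-pass copy would see only one of
    # the two paths and silently drop the other link:
    f.visititems(lambda name, obj: print(name))  # prints "dset1" only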

@MRossol (Contributor, Author) commented May 29, 2020

What's the multi-pass approach?

@jreadey (Member) commented May 29, 2020

It's like this in the current code:

# build a rough map of the file using the internal function above
logging.info("creating target objects")
fin.visititems(object_create_helper)
# copy over any attributes
logging.info("creating target attributes")
fin.visititems(object_attribute_helper)
# create soft/external links (and hardlinks not already created)
create_links(fin, fout, ctx)  # create root soft/external links
fin.visititems(object_link_helper)
if dataload:
    # copy dataset data
    logging.info("copying dataset data")
    fin.visititems(object_copy_helper)
else:
    logging.info("skipping dataset data copy (dataload is None)")

@MRossol (Contributor, Author) commented May 29, 2020

I guess I still don't follow. I made two structural changes:

  1. Instead of iterating through all of the .h5 objects multiple times (create, add attributes, add links, copy), I just iterate once and create, add attributes, add links, and copy as needed.
  2. I skip creation if the obj already exists in fout. This is essential for my use case because the file/dataset structure looks like this:

    file_1.h5:
    - time_index
    - meta
    - ds1
    - ds2
    file_2.h5:
    - time_index
    - meta
    - ds3
    - ds4

where time_index and meta are redundant and would break your code, since they already exist and can't be re-created.
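
With the --append flag the intended workflow is something like (paths hypothetical):

$ hsload file_1.h5 /shared/combined.h5            # creates the domain
$ hsload --append file_2.h5 /shared/combined.h5   # reuses time_index/meta, adds ds3 and ds4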

@jreadey (Member) commented May 29, 2020

I'll need to do some testing to convince myself the approaches are equivalent.
And there's another bug with importing files with multilinks (two links to the same object) that I need to fix.
I'll try to get to this next week, but in the meantime you shouldn't be blocked from using your version.

@jreadey (Member) commented Jun 12, 2020

Hey @MRossol - could you merge in the latest changes from master and re-submit?
Also, please verify the non-append case works. I got some errors testing this, but it may have just been my bad conflict resolution.

@jreadey (Member) commented Jun 15, 2020

@MRossol - Do you get this error when doing a regular (non-append) hsload?

$ hsload ~/tall.h5 /home/test_user1/test/tall2.h5
Traceback (most recent call last):
  File "/opt/anaconda3/envs/py38/bin/hsload", line 11, in <module>
    load_entry_point('h5pyd==0.7.2', 'console_scripts', 'hsload')()
  File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/h5pyd-0.7.2-py3.8.egg/h5pyd/_apps/hsload.py", line 353, in main
    load_file(fin, fout, verbose=verbose, dataload=dataload, s3path=s3path, deflate=deflate,)
  File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/h5pyd-0.7.2-py3.8.egg/h5pyd/_apps/utillib.py", line 686, in load_file
    fin.visititems(object_helper)
  File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/h5py/_hl/group.py", line 565, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 355, in h5py.h5o.visit
  File "h5py/defs.pyx", line 1641, in h5py.defs.H5Ovisit_by_name
  File "h5py/h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/h5py/_hl/group.py", line 564, in proxy
    return func(name, self[name])
  File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/h5pyd-0.7.2-py3.8.egg/h5pyd/_apps/utillib.py", line 639, in object_helper
    if name in fout:
UnboundLocalError: local variable 'fout' referenced before assignment
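
This looks like the classic Python closure pitfall: assigning to fout anywhere inside object_helper makes it local to that function, so the earlier read fails even though load_file defines it. A minimal repro of the pattern (illustrative only, not the utillib.py code; open_target is a hypothetical stand-in):

def load_file(fin, fout):
    def object_helper(name, obj):
        if name in fout:       # UnboundLocalError: fout is local here...
            return
        fout = open_target()   # ...because this assignment (anywhere in
                               # object_helper) makes fout local to it
    fin.visititems(object_helper)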

@MRossol (Contributor, Author) commented Jun 16, 2020

@jreadey All bugs fixed:

(hsds) [mrossol@el3 h5pyd]$ hsload --link ${S3_FILE} ${HSDS_FILE}
(hsds) [mrossol@el3 h5pyd]$ echo $S3_FILE
s3://nrel-pds-wtk/vietnam/wtk_vietnam_2016.h5
(hsds) [mrossol@el3 h5pyd]$ echo $HSDS_FILE
/nrel/wtk/vietnam/wtk_vietnam_2016.h5
(hsload) [mrossol@el3 rechunk]$ hsls /nrel/wtk/vietnam/wtk_vietnam_2016.h5
/ Group
coordinates Dataset {1037610, 2}
inversemoninobukhovlength_2m Dataset {8784, 1037610}
meta Table {1037610}
precipitationrate_0m Dataset {8784, 1037610}
pressure_0m Dataset {8784, 1037610}
pressure_100m Dataset {8784, 1037610}
pressure_200m Dataset {8784, 1037610}
relativehumidity_2m Dataset {8784, 1037610}
temperature_100m Dataset {8784, 1037610}
temperature_10m Dataset {8784, 1037610}
temperature_120m Dataset {8784, 1037610}
temperature_140m Dataset {8784, 1037610}
temperature_160m Dataset {8784, 1037610}
temperature_200m Dataset {8784, 1037610}
temperature_2m Dataset {8784, 1037610}
temperature_40m Dataset {8784, 1037610}
temperature_60m Dataset {8784, 1037610}
temperature_80m Dataset {8784, 1037610}
time_index Dataset {8784}
winddirection_100m Dataset {8784, 1037610}
winddirection_10m Dataset {8784, 1037610}
winddirection_120m Dataset {8784, 1037610}
winddirection_140m Dataset {8784, 1037610}
winddirection_160m Dataset {8784, 1037610}
winddirection_200m Dataset {8784, 1037610}
winddirection_40m Dataset {8784, 1037610}
winddirection_60m Dataset {8784, 1037610}
winddirection_80m Dataset {8784, 1037610}
windspeed_100m Dataset {8784, 1037610}
windspeed_10m Dataset {8784, 1037610}
windspeed_120m Dataset {8784, 1037610}
windspeed_140m Dataset {8784, 1037610}
windspeed_160m Dataset {8784, 1037610}
windspeed_200m Dataset {8784, 1037610}
windspeed_40m Dataset {8784, 1037610}
windspeed_60m Dataset {8784, 1037610}
windspeed_80m Dataset {8784, 1037610}

jreadey merged commit 4d0f45a into HDFGroup:master on Jun 16, 2020
@jreadey (Member) commented Jun 16, 2020

Thanks @MRossol! Changes merged to master.

@jreadey (Member) commented Jul 9, 2020

I've belatedly recalled why the multi-pass approach is needed. Say you have a dataset, dset1, with attribute a1 that contains a reference to another dataset, dset2. If dset1 is created before dset2, the creation of a1 will fail because dset2 doesn't exist yet.

Sounds bizarre, but this type of structure is commonly used for dimension lists...
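
A small h5py illustration of the ordering constraint (file and dataset names are just for the example):

import h5py

with h5py.File("refs.h5", "w") as f:
    dset1 = f.create_dataset("dset1", data=[1, 2, 3])
    dset2 = f.create_dataset("dset2", data=[4, 5, 6])
    # The reference stored in a1 can only be constructed once dset2
    # exists, which is why attributes need their own pass after all
    # objects have been created:
    ref_dtype = h5py.special_dtype(ref=h5py.Reference)
    dset1.attrs.create("a1", dset2.ref, dtype=ref_dtype)

Dimension scales (dset.dims[0].attach_scale(...)) set up exactly this kind of cross-object reference, which is why the pattern is so common.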
