
Update parallel compression feature to support multi-dataset I/O #3591

Merged (11 commits) on Oct 10, 2023
51 changes: 36 additions & 15 deletions doc/parallel-compression.md
@@ -61,8 +61,8 @@ H5Pset_dxpl_mpio(dxpl_id, H5FD_MPIO_COLLECTIVE);
H5Dwrite(..., dxpl_id, ...);
```

-The following are two simple examples of using the parallel compression
-feature:
+The following are two simple examples of using the parallel
+compression feature:

[ph5_filtered_writes.c](https://github.com/HDFGroup/hdf5/blob/develop/examples/ph5_filtered_writes.c)

@@ -76,9 +76,30 @@ Remember that the feature requires these writes to use collective
I/O, so the MPI ranks which have nothing to contribute must still
participate in the collective write call.
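
For illustration, a minimal sketch of this pattern (the `rank_has_data` flag and the `dset_id`, `mem_space_id`, `file_space_id`, `start`, `count`, and `buf` handles are placeholders assumed to be set up elsewhere):

```
hid_t dxpl_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl_id, H5FD_MPIO_COLLECTIVE);

if (rank_has_data)
    H5Sselect_hyperslab(file_space_id, H5S_SELECT_SET, start, NULL, count, NULL);
else {
    /* Nothing to contribute on this rank: select nothing, but still take
     * part in the collective write below */
    H5Sselect_none(file_space_id);
    H5Sselect_none(mem_space_id);
}

H5Dwrite(dset_id, H5T_NATIVE_INT, mem_space_id, file_space_id, dxpl_id, buf);
```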

## Multi-dataset I/O support

The parallel compression feature is supported when using the
multi-dataset I/O API routines ([H5Dwrite_multi](https://hdfgroup.github.io/hdf5/group___h5_d.html#gaf6213bf3a876c1741810037ff2bb85d8)/[H5Dread_multi](https://hdfgroup.github.io/hdf5/group___h5_d.html#ga8eb1c838aff79a17de385d0707709915)), but the
following should be kept in mind:

- Parallel writes to filtered datasets **must** still be collective,
even when using the multi-dataset I/O API routines

- When the multi-dataset I/O API routines are passed a mixture of
filtered and unfiltered datasets, the library currently has to
perform I/O on them separately, in two phases. Because of the
extra complexity involved, it may be best (depending on the
number of datasets, the number of selected chunks, the mix of
filtered and unfiltered datasets, etc.) to make two separate
multi-dataset I/O calls, one for the filtered datasets and one
for the unfiltered datasets. For writes, this also allows
independent write access to the unfiltered datasets if desired,
while still performing collective writes to the filtered
datasets; a sketch of this split appears below.
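
A minimal sketch of this split (the dataset-ID, datatype, dataspace, and buffer arrays for each group, and the `n_filtered`/`n_unfiltered` counts, are placeholders assumed to be built by the application):

```
/* Filtered datasets: writes must remain collective */
hid_t coll_dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(coll_dxpl, H5FD_MPIO_COLLECTIVE);
H5Dwrite_multi(n_filtered, filt_dsets, filt_mem_types, filt_mem_spaces,
               filt_file_spaces, coll_dxpl, filt_bufs);

/* Unfiltered datasets: may be written independently if desired */
hid_t ind_dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(ind_dxpl, H5FD_MPIO_INDEPENDENT);
H5Dwrite_multi(n_unfiltered, unfilt_dsets, unfilt_mem_types, unfilt_mem_spaces,
               unfilt_file_spaces, ind_dxpl, unfilt_bufs);
```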

## Incremental file space allocation support

-HDF5's [file space allocation time](https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALLOC_TIME)
+HDF5's [file space allocation time](https://hdfgroup.github.io/hdf5/group___d_c_p_l.html#ga85faefca58387bba409b65c470d7d851)
is a dataset creation property that can have significant effects
on application performance, especially if the application uses
parallel HDF5. In a serial HDF5 application, the default file space
@@ -97,15 +118,15 @@ While this strategy has worked in the past, it has some noticeable
drawbacks. For one, the larger the chunked dataset being created,
the more noticeable overhead there will be during dataset creation
as all of the data chunks are being allocated in the HDF5 file.
-Further, these data chunks will, by default, be [filled](https://portal.hdfgroup.org/display/HDF5/H5P_SET_FILL_VALUE)
+Further, these data chunks will, by default, be [filled](https://hdfgroup.github.io/hdf5/group___d_c_p_l.html#ga4335bb45b35386daa837b4ff1b9cd4a4)
with HDF5's default fill data value, leading to extraordinary
dataset creation overhead and resulting in pre-filling large
portions of a dataset that the application might have been planning
to overwrite anyway. Even worse, there will be more initial overhead
from compressing that fill data before writing it out, only to have
it read back in, unfiltered and modified the first time a chunk is
written to. In the past, it was typically suggested that parallel
-HDF5 applications should use [H5Pset_fill_time](https://portal.hdfgroup.org/display/HDF5/H5P_SET_FILL_TIME)
+HDF5 applications should use [H5Pset_fill_time](https://hdfgroup.github.io/hdf5/group___d_c_p_l.html#ga6bd822266b31f86551a9a1d79601b6a2)
with a value of `H5D_FILL_TIME_NEVER` in order to disable writing of
the fill value to dataset chunks, but this isn't ideal if the
application actually wishes to make use of fill values.
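
As a rough sketch of the dataset creation properties discussed in this section (the chunk dimensions and compression level are placeholders, and the incremental setting is assumed to be `H5Pset_alloc_time` with `H5D_ALLOC_TIME_INCR`):

```
hid_t dcpl_id = H5Pcreate(H5P_DATASET_CREATE);

H5Pset_chunk(dcpl_id, 2, chunk_dims);
H5Pset_deflate(dcpl_id, 6);

/* Allocate chunks as they are first written to, rather than at creation time */
H5Pset_alloc_time(dcpl_id, H5D_ALLOC_TIME_INCR);

/* Optionally, if fill values are genuinely not needed, skip writing them */
H5Pset_fill_time(dcpl_id, H5D_FILL_TIME_NEVER);
```
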
@@ -199,14 +220,14 @@ chunks to end up at addresses in the file that do not align
well with the underlying file system, possibly leading to
poor performance. As an example, Lustre performance is generally
good when writes are aligned with the chosen stripe size.
-The HDF5 application can use [H5Pset_alignment](https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALIGNMENT)
+The HDF5 application can use [H5Pset_alignment](https://hdfgroup.github.io/hdf5/group___f_a_p_l.html#gab99d5af749aeb3896fd9e3ceb273677a)
to have a bit more control over where objects in the HDF5
file end up. However, do note that setting the alignment
of objects generally wastes space in the file and has the
potential to dramatically increase its resulting size, so
caution should be used when choosing the alignment parameters.
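
For example, a sketch of aligning all objects of at least 1 MiB on 1 MiB boundaries (the values are illustrative and should be tuned, e.g. to a Lustre stripe size):

```
hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);

H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL);

/* threshold = 1 MiB, alignment = 1 MiB */
H5Pset_alignment(fapl_id, 1048576, 1048576);

hid_t file_id = H5Fcreate("file.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);
```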

-[H5Pset_alignment](https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALIGNMENT)
+[H5Pset_alignment](https://hdfgroup.github.io/hdf5/group___f_a_p_l.html#gab99d5af749aeb3896fd9e3ceb273677a)
has two parameters that control the alignment of objects in
the HDF5 file, the "threshold" value and the alignment
value. The threshold value specifies that any object greater
@@ -243,19 +264,19 @@ in a file, this can create significant amounts of free space
in the file over its lifetime and eventually cause performance
issues.

-An HDF5 application can use [H5Pset_file_space_strategy](http://portal.hdfgroup.org/display/HDF5/H5P_SET_FILE_SPACE_STRATEGY)
+An HDF5 application can use [H5Pset_file_space_strategy](https://hdfgroup.github.io/hdf5/group___f_c_p_l.html#ga167ff65f392ca3b7f1933b1cee1b9f70)
with a value of `H5F_FSPACE_STRATEGY_PAGE` to enable the paged
aggregation feature, which can accumulate metadata and raw
data for dataset data chunks into well-aligned, configurably
sized "pages" for better performance. However, note that using
the paged aggregation feature will cause any setting from
-[H5Pset_alignment](https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALIGNMENT)
+[H5Pset_alignment](https://hdfgroup.github.io/hdf5/group___f_a_p_l.html#gab99d5af749aeb3896fd9e3ceb273677a)
to be ignored. While an application should be able to get
-comparable performance effects by [setting the size of these pages](http://portal.hdfgroup.org/display/HDF5/H5P_SET_FILE_SPACE_PAGE_SIZE) to be equal to the value that
-would have been set for [H5Pset_alignment](https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALIGNMENT),
+comparable performance effects by [setting the size of these pages](https://hdfgroup.github.io/hdf5/group___f_c_p_l.html#gad012d7f3c2f1e1999eb1770aae3a4963) to be equal to the value that
+would have been set for [H5Pset_alignment](https://hdfgroup.github.io/hdf5/group___f_a_p_l.html#gab99d5af749aeb3896fd9e3ceb273677a),
this may not necessarily be the case and should be studied.
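
A sketch of this combination (the 1 MiB page size and the `persist`/`threshold` arguments are illustrative, and `fapl_id` is assumed to be the MPI-I/O file access property list from earlier):

```
hid_t fcpl_id = H5Pcreate(H5P_FILE_CREATE);

/* Enable paged aggregation */
H5Pset_file_space_strategy(fcpl_id, H5F_FSPACE_STRATEGY_PAGE, false, 1);

/* Page size standing in for the alignment value that would otherwise be used */
H5Pset_file_space_page_size(fcpl_id, 1048576);

hid_t file_id = H5Fcreate("file.h5", H5F_ACC_TRUNC, fcpl_id, fapl_id);
```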

-Note that [H5Pset_file_space_strategy](http://portal.hdfgroup.org/display/HDF5/H5P_SET_FILE_SPACE_STRATEGY)
+Note that [H5Pset_file_space_strategy](https://hdfgroup.github.io/hdf5/group___f_c_p_l.html#ga167ff65f392ca3b7f1933b1cee1b9f70)
has a `persist` parameter. This determines whether or not the
file free space manager should include extra metadata in the
HDF5 file about free space sections in the file. If this
@@ -279,12 +300,12 @@ hid_t file_id = H5Fcreate("file.h5", H5F_ACC_TRUNC, fcpl_id, fapl_id);

While the parallel compression feature requires that the HDF5
application set and maintain collective I/O at the application
-interface level (via [H5Pset_dxpl_mpio](https://portal.hdfgroup.org/display/HDF5/H5P_SET_DXPL_MPIO)),
+interface level (via [H5Pset_dxpl_mpio](https://hdfgroup.github.io/hdf5/group___d_x_p_l.html#ga001a22b64f60b815abf5de8b4776f09e)),
it does not require that the actual MPI I/O that occurs at
the lowest layers of HDF5 be collective; independent I/O may
perform better depending on the application I/O patterns and
parallel file system performance, among other factors. The
-application may use [H5Pset_dxpl_mpio_collective_opt](https://portal.hdfgroup.org/display/HDF5/H5P_SET_DXPL_MPIO_COLLECTIVE_OPT)
+application may use [H5Pset_dxpl_mpio_collective_opt](https://hdfgroup.github.io/hdf5/group___d_x_p_l.html#gacb30d14d1791ec7ff9ee73aa148a51a3)
to control this setting and see which I/O method provides the
best performance.
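
For example, a sketch of keeping the interface-level requirement collective while requesting independent low-level MPI I/O:

```
hid_t dxpl_id = H5Pcreate(H5P_DATASET_XFER);

/* Required by the parallel compression feature */
H5Pset_dxpl_mpio(dxpl_id, H5FD_MPIO_COLLECTIVE);

/* Low-level MPI I/O mode; compare against H5FD_MPIO_COLLECTIVE_IO */
H5Pset_dxpl_mpio_collective_opt(dxpl_id, H5FD_MPIO_INDIVIDUAL_IO);

H5Dwrite(..., dxpl_id, ...);
```
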

@@ -297,7 +318,7 @@ H5Dwrite(..., dxpl_id, ...);

### Runtime HDF5 Library version

-An HDF5 application can use the [H5Pset_libver_bounds](http://portal.hdfgroup.org/display/HDF5/H5P_SET_LIBVER_BOUNDS)
+An HDF5 application can use the [H5Pset_libver_bounds](https://hdfgroup.github.io/hdf5/group___f_a_p_l.html#gacbe1724e7f70cd17ed687417a1d2a910)
routine to set the upper and lower bounds on library versions
to use when creating HDF5 objects. For parallel compression
specifically, setting the library version to the latest available
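
A minimal sketch of that setting, assuming an MPI-I/O file access property list:

```
hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);

H5Pset_fapl_mpio(fapl_id, MPI_COMM_WORLD, MPI_INFO_NULL);

/* Request the latest file format, which allows newer chunk index structures */
H5Pset_libver_bounds(fapl_id, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
```
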
11 changes: 10 additions & 1 deletion release_docs/RELEASE.txt
@@ -235,7 +235,16 @@ New Features

Parallel Library:
-----------------
-
- Added optimized support for the parallel compression feature when
using the multi-dataset I/O API routines collectively

Previously, calling H5Dwrite_multi/H5Dread_multi collectively in parallel
with a list containing one or more filtered datasets would cause HDF5 to
break out of the optimized multi-dataset I/O mode and instead perform I/O
by looping over each dataset in the I/O request. The library has now been
updated to perform I/O in a more optimized manner in this case by first
performing I/O on all the filtered datasets at once and then performing
I/O on all the unfiltered datasets at once.


Fortran Library:
51 changes: 51 additions & 0 deletions src/H5Dchunk.c
@@ -1114,6 +1114,31 @@ H5D__chunk_io_init(H5D_io_info_t *io_info, H5D_dset_io_info_t *dinfo)
}
}

#ifdef H5_HAVE_PARALLEL
/*
* If collective metadata reads are enabled, ensure all ranks
* have the dataset's chunk index open (if it was created) to
* prevent possible metadata inconsistency issues or unintentional
* independent metadata reads later on.
*/
if (H5F_SHARED_HAS_FEATURE(io_info->f_sh, H5FD_FEAT_HAS_MPI) &&
H5F_shared_get_coll_metadata_reads(io_info->f_sh) &&
H5D__chunk_is_space_alloc(&dataset->shared->layout.storage)) {
H5D_chunk_ud_t udata;
hsize_t scaled[H5O_LAYOUT_NDIMS] = {0};

/*
* TODO: Until the dataset chunk index callback structure has
* callbacks for checking if an index is opened and also for
* directly opening the index, the following fake chunk lookup
* serves the purpose of forcing a chunk index open operation
* on all ranks
*/
if (H5D__chunk_lookup(dataset, scaled, &udata) < 0)
HGOTO_ERROR(H5E_DATASET, H5E_CANTINIT, FAIL, "unable to collectively open dataset chunk index");
Author review comment: I plan to address this properly when I address #2658, but for now this is about the only way to ensure that every rank has the dataset's chunk index opened and read in correctly.

}
#endif

done:
if (file_space_normalized == true)
if (H5S_hyper_denormalize_offset(dinfo->file_space, old_offset) < 0)
@@ -1556,6 +1581,9 @@ H5D__create_piece_map_single(H5D_dset_io_info_t *di, H5D_io_info_t *io_info)
piece_info->in_place_tconv = false;
piece_info->buf_off = 0;

/* Check if chunk is in a dataset with filters applied */
piece_info->filtered_dset = di->dset->shared->dcpl_cache.pline.nused > 0;

/* make connection to related dset info from this piece_info */
piece_info->dset_info = di;

@@ -1591,6 +1619,7 @@ H5D__create_piece_file_map_all(H5D_dset_io_info_t *di, H5D_io_info_t *io_info)
hsize_t curr_partial_clip[H5S_MAX_RANK]; /* Current partial dimension sizes to clip against */
hsize_t partial_dim_size[H5S_MAX_RANK]; /* Size of a partial dimension */
bool is_partial_dim[H5S_MAX_RANK]; /* Whether a dimension is currently a partial chunk */
bool filtered_dataset; /* Whether the dataset in question has filters applied */
unsigned num_partial_dims; /* Current number of partial dimensions */
unsigned u; /* Local index variable */
herr_t ret_value = SUCCEED; /* Return value */
@@ -1640,6 +1669,9 @@ H5D__create_piece_file_map_all(H5D_dset_io_info_t *di, H5D_io_info_t *io_info)
/* Set the index of this chunk */
chunk_index = 0;

/* Check whether dataset has filters applied */
filtered_dataset = di->dset->shared->dcpl_cache.pline.nused > 0;

/* Create "temporary" chunk for selection operations (copy file space) */
if (NULL == (tmp_fchunk = H5S_create_simple(fm->f_ndims, fm->chunk_dim, NULL)))
HGOTO_ERROR(H5E_DATASET, H5E_CANTCREATE, FAIL, "unable to create dataspace for chunk");
@@ -1686,6 +1718,8 @@ H5D__create_piece_file_map_all(H5D_dset_io_info_t *di, H5D_io_info_t *io_info)
new_piece_info->in_place_tconv = false;
new_piece_info->buf_off = 0;

new_piece_info->filtered_dset = filtered_dataset;

/* Insert the new chunk into the skip list */
if (H5SL_insert(fm->dset_sel_pieces, new_piece_info, &new_piece_info->index) < 0) {
H5D__free_piece_info(new_piece_info, NULL, NULL);
@@ -1798,6 +1832,7 @@ H5D__create_piece_file_map_hyper(H5D_dset_io_info_t *dinfo, H5D_io_info_t *io_in
hsize_t chunk_index; /* Index of chunk */
hsize_t start_scaled[H5S_MAX_RANK]; /* Starting scaled coordinates of selection */
hsize_t scaled[H5S_MAX_RANK]; /* Scaled coordinates for this chunk */
bool filtered_dataset; /* Whether the dataset in question has filters applied */
int curr_dim; /* Current dimension to increment */
unsigned u; /* Local index variable */
herr_t ret_value = SUCCEED; /* Return value */
@@ -1831,6 +1866,9 @@ H5D__create_piece_file_map_hyper(H5D_dset_io_info_t *dinfo, H5D_io_info_t *io_in
/* Calculate the index of this chunk */
chunk_index = H5VM_array_offset_pre(fm->f_ndims, dinfo->layout->u.chunk.down_chunks, scaled);

/* Check whether dataset has filters applied */
filtered_dataset = dinfo->dset->shared->dcpl_cache.pline.nused > 0;

/* Iterate through each chunk in the dataset */
while (sel_points) {
/* Check for intersection of current chunk and file selection */
@@ -1885,6 +1923,8 @@ H5D__create_piece_file_map_hyper(H5D_dset_io_info_t *dinfo, H5D_io_info_t *io_in
new_piece_info->in_place_tconv = false;
new_piece_info->buf_off = 0;

new_piece_info->filtered_dset = filtered_dataset;

/* Add piece to global piece_count */
io_info->piece_count++;

@@ -2257,6 +2297,8 @@ H5D__piece_file_cb(void H5_ATTR_UNUSED *elem, const H5T_t H5_ATTR_UNUSED *type,
piece_info->in_place_tconv = false;
piece_info->buf_off = 0;

piece_info->filtered_dset = dinfo->dset->shared->dcpl_cache.pline.nused > 0;

/* Make connection to related dset info from this piece_info */
piece_info->dset_info = dinfo;

@@ -2417,6 +2459,9 @@ H5D__chunk_mdio_init(H5D_io_info_t *io_info, H5D_dset_io_info_t *dinfo)

/* Add to sel_pieces and update pieces_added */
io_info->sel_pieces[io_info->pieces_added++] = piece_info;

if (piece_info->filtered_dset)
io_info->filtered_pieces_added++;
}

/* Advance to next skip list node */
@@ -2728,6 +2773,9 @@ H5D__chunk_read(H5D_io_info_t *io_info, H5D_dset_io_info_t *dset_info)
if (io_info->sel_pieces)
io_info->sel_pieces[io_info->pieces_added] = chunk_info;
io_info->pieces_added++;

if (io_info->sel_pieces && chunk_info->filtered_dset)
io_info->filtered_pieces_added++;
}
} /* end if */
else if (!skip_missing_chunks) {
@@ -3142,6 +3190,9 @@ H5D__chunk_write(H5D_io_info_t *io_info, H5D_dset_io_info_t *dset_info)
if (io_info->sel_pieces)
io_info->sel_pieces[io_info->pieces_added] = chunk_info;
io_info->pieces_added++;

if (io_info->sel_pieces && chunk_info->filtered_dset)
io_info->filtered_pieces_added++;
}
} /* end else */

2 changes: 2 additions & 0 deletions src/H5Dcontig.c
@@ -644,6 +644,8 @@ H5D__contig_io_init(H5D_io_info_t *io_info, H5D_dset_io_info_t *dinfo)
new_piece_info->in_place_tconv = false;
new_piece_info->buf_off = 0;

new_piece_info->filtered_dset = dinfo->dset->shared->dcpl_cache.pline.nused > 0;

/* Calculate type conversion buffer size and check for in-place conversion if necessary. Currently
* only implemented for selection I/O. */
if (io_info->use_select_io != H5D_SELECTION_IO_MODE_OFF &&