Reduce total number of files when converting to OME-zarr? #116
Comments
Firstly, yes, --tile_width, --tile_height as well as --tile_depth control the Zarr chunk [1] size. For the two most common Zarr storage [2] implementations (file system and object storage) each chunk is a single file (or object) where the filename (or key) is the chunk's index within the array separated by the dimension separator [3]. There is a more visual description of this available here:

- https://ngff.openmicroscopy.org/latest/#on-disk

However, multiple chunks are not colocated in the same file. So if compression is employed, which it is in the example you gave, even if you were to set a 5792x5792x2 (width, height, bytes per pixel; ~64MiB) chunk size, a chunk may compress very well, perhaps because it's full of zeros or completely white, and consequently could easily be 1KiB or smaller. Chunk colocation within the same file (also sometimes referred to in the community as sharding) is being discussed [4] but I am not aware of any current Zarr implementation. TileDB [5] addresses some of these concerns with a journaled approach, but that is not without its own downsides, such as reconciliation. The Zarr layout is simple by design, and adding complexity to the chunk format will require significant discussion and strong community backing. You can read more about the design decisions and perspectives of simple (Zarr) vs. complex layouts (TileDB, for example) on this issue if you so desire:

- zarr-developers/zarr-python#515

There is also a fairly detailed discussion around the precomputed, sharded, chunk-based format that Neuroglancer uses available here:

- https://github.com/google/neuroglancer/blob/master/src/neuroglancer/datasource/precomputed/sharded.md

A sharded format would, however, not necessarily relieve the high volume of small write operations that you are noticing when, I assume, writing directly to S3, as the unit of work for bioformats2raw is a chunk. Latency per write is going to be very similar, and the same number of writes still needs to take place regardless of whether they are happening to one sharded object or many unsharded chunks. Obviously you could approach this by buffering colocated chunks locally first and transferring the shard only when all chunks are processed. This is just one of a plethora of optimizations one might consider; however, each comes with substantial implementation and maintenance burden as well as the potential for deep coupling of bioformats2raw to storage subsystem architectural design.

Furthermore, I would strongly caution against going beyond 1024 chunking in the Y and X dimensions in the pursuit of better write performance and a smaller number of larger chunks. This may improve write performance but will substantially impact read performance and first-byte latency for streaming viewers. Projects such as the aforementioned Neuroglancer or webKnossos go as far as to have *tiny* 3D chunk sizes (32^3) to combat this. The source data in your example (the .svs file) will also be chunked (tiled in TIFF parlance) and compressed. Selection of output chunk sizes that are not aligned can result in substantial read slowdowns, as the source data has to be rechunked and repeatedly decompressed in order to conform to a desired output chunk size.

In short, the behavior you are seeing is expected and I don't think a 64MiB object size is either practical or reasonably achievable at present. Hope this helps.

1. https://zarr.readthedocs.io/en/stable/spec/v2.html#chunks
2. https://zarr.readthedocs.io/en/stable/spec/v2.html#storage
3. https://zarr.readthedocs.io/en/stable/spec/v2.html#arrays
4. https://forum.image.sc/t/sharding-support-in-ome-zarr/55409
5. https://docs.tiledb.com/main/solutions/tiledb-embedded/internal-mechanics/architecture
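To make the chunk-per-file layout and the widely varying compressed chunk sizes concrete, here is a minimal sketch assuming zarr-python 2.x with numcodecs; the array shape, chunk size, and store path are arbitrary illustration values:

```python
import os

import numpy as np
import zarr
from numcodecs import Zlib

# Every chunk becomes its own file, named by its index within the array
# (dimension separator "." by default in Zarr v2), and compressed sizes
# depend entirely on chunk content.
z = zarr.open(
    "demo.zarr",
    mode="w",
    shape=(4096, 4096),
    chunks=(1024, 1024),
    dtype="uint16",
    compressor=Zlib(level=9),
)

z[0:1024, 0:1024] = 1  # constant chunk: compresses to a few KiB
z[1024:2048, 0:1024] = np.random.randint(
    0, 2**16, (1024, 1024), dtype="uint16"
)  # noisy chunk: stays close to its raw 2 MiB

for name in sorted(os.listdir("demo.zarr")):
    size = os.path.getsize(os.path.join("demo.zarr", name))
    print(f"{name}: {size} bytes")
# Typical listing: ".zarray" (metadata), "0.0" (a few KiB), "1.0" (~2 MiB).
# Chunks that were never written (still the fill value) have no file at all.
```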
|
OK. Thanks for sharing a detailed perspective. These points are very different from my understanding:

On Thu, Sep 16, 2021, Chris Allan wrote:
"Furthermore, I would strongly caution against going beyond 1024 chunking in the Y and X dimensions in the pursuit of better write performance and a smaller number of larger chunks. This may improve write performance but will substantially impact read performance and first-byte latency for streaming viewers."
The other practical problems I'm seeing are:
1) I get a 10X file size increase from an SVS to an OME Zarr using zlib=9 (and blosc seems worse?)
2) Working with 10+K objects means even deleting a single dataset from S3 takes minutes instead of seconds.
Is the size expansion roughly in line with what you see?
Are other users complaining about having to deal with 10+K objects for a single image dataset?
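On the deletion point: S3 has no recursive delete, so removing a converted dataset normally means one request per chunk object unless deletes are batched. A rough boto3 sketch of batching with DeleteObjects (the bucket and prefix names are placeholders) cuts the round trips to roughly one per thousand objects, although listing and bookkeeping for 10+K keys still takes noticeable time:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"             # placeholder: your bucket
prefix = "slides/example.zarr/"  # placeholder: prefix of the converted dataset

# list_objects_v2 returns at most 1000 keys per page, which matches the
# DeleteObjects limit, so each page can be removed in a single request.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
    if keys:
        s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
```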
|
@davidekhub If the SVS you are converting is 24-bit RGB, it is likely stored with lossy compression (JPEG, JPEG-2000), and that is the reason for the difference in file size. zlib and Blosc are lossless compression algorithms, so they will never achieve the same compression ratios (although the pixel values will be exactly the same between the SVS and Zarr data). There may be a way to encode with something like JPEG using b2r, but bear in mind that there is an accumulation effect in compression errors. The large number of objects is ideal for some scenarios, like web visualization and very fast conversion, but it has downsides that need to be weighed.
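To put rough numbers on the lossy-vs-lossless gap described above, here is a small illustration using a stock photographic RGB test image rather than an actual SVS; the exact ratio depends on image content and on the JPEG quality the scanner used:

```python
import io
import zlib

from PIL import Image
from skimage import data

rgb = data.astronaut()  # 512x512x3 uint8 test photograph, 786,432 raw bytes

# Lossy JPEG at a typical quality setting.
buf = io.BytesIO()
Image.fromarray(rgb).save(buf, format="JPEG", quality=80)
jpeg_bytes = buf.getbuffer().nbytes

# Lossless zlib at maximum effort, i.e. roughly what zlib level=9 can achieve.
zlib_bytes = len(zlib.compress(rgb.tobytes(), 9))

print(f"raw:  {rgb.nbytes:>7} bytes")
print(f"jpeg: {jpeg_bytes:>7} bytes (lossy)")
print(f"zlib: {zlib_bytes:>7} bytes (lossless, level 9)")
print(f"zlib is {zlib_bytes / jpeg_bytes:.1f}x larger than jpeg")
# On photographic content this typically shows zlib an order of magnitude
# (or more) larger than JPEG, consistent with the ~10X expansion reported.
```
|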
Hmm. I'm going to have to forward this on to the experts on our side. If we're seeing 10X expansion with max compression because the originals are lossy (which I need to verify), that makes me reconsider the value of OME-zarr (it's already an operational burden compared to working with the original SVS, and not any faster for vis in our case).
…On Tue, Dec 21, 2021 at 11:54 AM Heath Patterson wrote:
|
I am converting SVS files (~500+MB each) to OME-zarr. My command line looks like this:
${params.cmd} --max_workers=${params.max_workers} --compression=zlib --compression-properties level=9 --resolutions 6 ${slide_file} ${slide_file}.zarr
I end up with buckets filled with OME-zarr data containing thousands of files, many extremely small (~2 KB) and the largest around 1 MB. This is really too small for object storage (it leads to many small operations that take a long time), so I'd like my files to average around 64 MB or so. But the documentation doesn't say which flags affect this, so I'm curious: is it --tile_height and --tile_width?
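A quick way to quantify the "thousands of tiny files" observation locally before uploading, assuming the output directory name below is replaced with your own:

```python
from pathlib import Path

root = Path("example.svs.zarr")  # placeholder: your bioformats2raw output
sizes = sorted(p.stat().st_size for p in root.rglob("*") if p.is_file())
print(f"{len(sizes)} files, total {sum(sizes) / 2**20:.1f} MiB")
print(f"median {sizes[len(sizes) // 2] / 1024:.1f} KiB, "
      f"largest {sizes[-1] / 2**20:.2f} MiB")
```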