Currently, stackstac is built around each STAC Asset being its own chunk in the dask array—the time and band dimensions always have a chunksize of 1.
However, there are cases where you might want to load multiple Assets in one chunk of the array. Most commonly, you'd do this when you have a huge graph, need to cut down on tasks, and can give up some granularity. In particular, you might be happy to combine the time dimension into fewer chunks if you know you're doing a composite right away anyway. See microsoft/PlanetaryComputer#12 (comment) for a motivating example.
So let's support extending the chunksize= argument to stackstac.stack to take up to 4-tuples (time, band, y, x), so you can specify the chunking along all dimensions.
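As a sketch of how a per-dimension chunksize tuple could be expanded into dask-style chunks, here's a minimal, self-contained helper. The function `normalize_chunks` and the tuple form shown are assumptions about the proposed API, not existing stackstac code (dask has its own `dask.array.core.normalize_chunks` that the real implementation would likely reuse):

```python
def normalize_chunks(chunksize, shape):
    """Expand a per-dimension chunksize tuple into dask-style chunks:
    one tuple of block sizes per dimension.

    Hypothetical helper illustrating the proposed 4-tuple chunksize=
    form; not part of stackstac's actual API.
    """
    chunks = []
    for size, dim in zip(chunksize, shape):
        full, rem = divmod(dim, size)
        # `full` whole blocks of `size`, plus one trailing partial block
        chunks.append((size,) * full + ((rem,) if rem else ()))
    return tuple(chunks)

# e.g. 10 times, 3 bands, 2048x2048 pixels, chunked (4, 2, 1024, 1024)
print(normalize_chunks((4, 2, 1024, 1024), (10, 3, 2048, 2048)))
# -> ((4, 4, 2), (2, 1), (1024, 1024), (1024, 1024))
```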
Note that this isn't #66 (though that could be a follow-on): we're not talking about flattening/pre-mosaicing the data. We'd still load every asset as usual, it's just that the chunks of the dask array might be (4, 2, Y, X) instead of always (1, 1, Y, X).
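To make the task-count savings concrete, here's the arithmetic with some illustrative dimensions (100 times, 4 bands, a 4x4 grid of 1024-pixel spatial chunks — made-up numbers, not from any real dataset):

```python
import math

def n_chunks(shape, chunksize):
    # Number of dask chunks (and hence read tasks) for a given chunking.
    return math.prod(math.ceil(d / c) for d, c in zip(shape, chunksize))

shape = (100, 4, 4096, 4096)  # times, bands, y, x (illustrative)
per_asset = n_chunks(shape, (1, 1, 1024, 1024))  # current behavior
grouped = n_chunks(shape, (4, 2, 1024, 1024))    # proposed
print(per_asset, grouped)
# -> 6400 800  (an 8x reduction in tasks)
```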
When a chunk contains multiple assets, should they be loaded serially, or in parallel? We could create our own internal threadpool, since most of the IO is not CPU-bound. However, because we have to duplicate the GDAL Dataset and file-descriptor per-thread, that might be expensive on memory. I suppose the runtime of T threads reading N assets is the same as T threads reading N / C assets, where each read takes C times longer. So probably in serial. Sure would be nice to just have an aiocogeo Reader for this 😁
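A minimal sketch of the serial option: fill one (time, band, y, x) chunk by reading its assets one after another on the calling thread, so no per-thread GDAL Dataset duplication is needed. The `readers` mapping and stub read callables are hypothetical stand-ins for stackstac's real per-asset COG reads:

```python
import numpy as np

def read_chunk_serial(readers, out_shape, dtype="float64"):
    """Fill one (time, band, y, x) chunk by reading its assets serially.

    `readers` is a hypothetical {(t, b): callable} mapping; each callable
    returns a (y, x) array for that asset. Reading in serial avoids
    duplicating GDAL Datasets/file descriptors across threads.
    """
    out = np.full(out_shape, np.nan, dtype=dtype)
    for (t, b), read in readers.items():
        out[t, b] = read()  # one asset at a time, no internal threadpool
    return out

# Stub readers standing in for per-asset COG reads
readers = {(t, b): (lambda v=t * 10 + b: np.full((2, 2), v))
           for t in range(4) for b in range(2)}
chunk = read_chunk_serial(readers, (4, 2, 2, 2))
print(chunk.shape, chunk[3, 1, 0, 0])
# -> (4, 2, 2, 2) 31.0
```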
This should be done/considered as a part of #105.