Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single-pass stats, pyramids, and histogram #116

Merged
merged 85 commits into from
Sep 27, 2024

Conversation

neilflood
Copy link
Member

Add single-pass calculation for pyramid layers (i.e. overviews), basic statistics, and histograms, on output image files.

Purpose

In the past, RIOS always used GDAL's own functions to add pyramid layers, statistics and histograms to all output files. These were done after the user processing had completed and the output files were already written. Each of these three components needed a separate call to GDAL, operating on the completed file, and so each required a separate pass through the already-written output rasters.

This pull request adds the ability to do all of these in the same single pass as the processing and writing of each block. This gives a useful saving in time spent on these tasks. For smaller images (< 8,000 x 8,000 pixels), the time spent was never very large anyway, but for much larger images, there was considerable time spent, and so the time saved by doing these with a single pass is also considerable.

The default behaviour is to attempt to use single-pass for all three components, wherever possible, and to fall back on using GDAL when it is not.

Each of the three components is now handled much more independently of the others. This includes new controls methods - setOmitBasicStats and setOmitHistogram, which match with the previous setOmitPyramids. These provide more granular control than the previous setCalcStats method, although the latter still works as always.

There are also three new methods to force the use of single-pass or GDAL - setSinglePassPyramids, setSinglePassBasicStats, and setSinglePassHistogram.

The user should not need to make any changes to existing code, and will gain useful speed-ups in writing output files.

Driver Support

As this code was developed, it became apparent that the KEA driver did not support the direct writing of pyramid layers. Further investigation fixed this problem (OSGeo/gdal#10616), and it should be available in GDAL 3.9.3. To guard against this, the single-pass code tests for driver support (in any driver), and will fall back to at-end GDAL pyramid layers if required.

Timings

The timings object on the ApplierReturn now includes some new timers. Originally, the closing timer included the time spent on flushing the final blocks of output, and also all of the time spent in writing pyramids, statistics, and histogram. The final flush is now included in the writing timer (which is thus more accurate), and the closing timer has been replaced with separate timers for pyramids, basicstats, and histogram, which accumulate the time spent both calculating and writing the corresponding item.

Pyramid Layer Aggregation Type

The default aggregation type used for creating pyramid layers with GDAL used to be 'AVERAGE', unless the file had been explicitly set as 'thematic', in which case it was 'NEAREST'. However, it is quite common that outputs are never set to be either thematic or athematic, and so 'AVERAGE' would end up being used more commonly than is appropriate. So, the default is now 'NEAREST' in all cases, which is more than adequate for most purposes.

Furthermore, the use of single-pass pyramids only supports 'NEAREST', so if anything else is explicitly selected, then single-pass is not possible and it will fall back to GDAL.

Centering of low-resolution pixels

When GDAL's BuildOverviews() function uses 'NEAREST' to subsample the raster, it selects the value from the top-left corner of the low resolution pixel. This is probably a minor bug in GDAL. When single-pass pyramids are used in RIOS, they will use the value at the centre of the low resolution pixel, which removes any sub-pixel shift. This is not a major issue either way, but it seemed good to get it right.

Test Suite

A lot of tests have been added for all these statistics calculations, checking various different combinations. As a result, the TESTSTATS set of tests now takes noticeably longer, but my paranoia is satisfied. The whole suite is still under 30 seconds, so not too serious.

Histogram Minimum

The original code would always set the histogram minimum value to be zero for thematic layers, and all 8-bit layers, and use the actual minimum for all other cases. I suspect it was unnecessary, and possibly a bad idea. However, it does appear to have been relied upon in other situations, most notably when using the RAT ReadAsArray() and WriteArray() functions to access the Histogram column. Because of this, I have preserved the old behaviour.

…emove attempt at flushing pyramid block from Python, as this works, but slows everything down. Will have to rely on the KEA C++ fix
…EST', which it probably should have been all along
…ode with single-pass calculations. More to do, but so far seems to work
… use in imagewriter. Can now manage each of the three actions independently
…d imagename= parameter for setApproxStats. Add check on use of approx stats on thematic outputs
@neilflood neilflood requested a review from gillins September 25, 2024 23:44
@neilflood neilflood merged commit 51740f2 into ubarsc:master Sep 27, 2024
6 checks passed
@neilflood neilflood deleted the single-pass-stats branch September 27, 2024 23:47
@tonykgill
Copy link
Contributor

Hi @neilflood. These single pass calculations are a great idea. I have noted that NEAREST is the only supported pyramid layer aggregation type. For continuous biophysical variables I tend to use AVERAGE. Do you think it's feasible to add support for the AVERAGE aggregation method in a single-pass pyramid layer creation?

@neilflood
Copy link
Member Author

Hi @tonykgill

Doing NEAREST is easy using just numpy slice operations. However, doing AVERAGE requires a bit more complexity. There are two possible approaches. One is to require the presence of numba, and write some jit-ted code to do it. The other is to use slicing to pull out subset arrays with 4 different offsets and then average these together. I was going to do the latter until I realised that there was a lot of complexity needed for the edge blocks where the number of pixels can potentially change between the 4 offset values. So, I didn't. If you explicitly select AVERAGE, then it will fall back to using GDAL at the end, and quietly avoid single-pass for pyramids (although it will still apply to stats & histogram).

I also used to generally use AVERAGE for continuous variables, but have often thought that it does not matter much. In general I never use the pyramids for calculations, only for viewing convenience, and so I don't think that the use of NEAREST is really any disadvantage.

If you think it might be important for you, please speak up. I am quite prepared to look into either of the above approaches (or any other) if there is a real need. I just thought it probably didn't matter, so took the simplest path.

@tonykgill
Copy link
Contributor

Thanks Neil. Happy to leave it. Having single pass on the stats and histograms is still a step forward.

Just FYI, we use average resampling for products that are viewed by users in our web maps. We have found that average creates a nicer product, and better user experience, than nearest. It reduces much of the speckle in a product that is relatively spatiallly noisy at full resolution. I think if the product wasn't noisy, then nearest would probably be ok.

@neilflood
Copy link
Member Author

OK, that's interesting to know. That does sound like there are situations where it matters, so I will keep that in mind, and look for a good solution.

@tonykgill
Copy link
Contributor

That's very generous of you Neil. Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants