Single-pass stats, pyramids, and histogram (#116)
* Bare bones of single-pass stats support, not yet complete

* Somewhat neater management of singlePassInfo object

* Closer to a working version

* Seems to be roughly working, but needs lots more testing to be sure

* Get maxval right

* Improve aggregationType handling. More to think about here. Fix array sub-sampling for pyramids (the sub-sampling idea is sketched after this list)

* Simplify aggType handling. Commented out band_ov.FlushCache(), as it seems to go Very Slowly

* Make a proper dummy njit decorator for when numba is not available (a minimal sketch appears after this list). Remove attempt at flushing pyramid block from Python, as this works, but slows everything down. Will have to rely on the KEA C++ fix

* Change the default value of DEFAULT_OVERVIEWAGGREGRATIONTYPE to 'NEAREST', which it probably should have been all along

* Much simpler dummy njit

* singlePassMgr instead of singlePassInfo

* Check each output driver for pyramid writing support

* Add separate timings for pyramids/stats/histogram

* Cope with completely null output raster

* Don't write stats when whole raster is null

* Bit more explanation in SinglePassManager docstring

* Put 'closing' timer after the other things

* Remove 'closing' as a known timer name

* Re-factoring of addStatistics() routine, to allow better sharing of code with single-pass calculations. More to do, but so far seems to work

* Mostly completed re-factoring of old addStatistics() routine, and its use in imagewriter. Can now manage each of the three actions independently

* Include cache flush in 'writing' timer

* Simplify field names on HistogramParams class

* Improved docstrings for setApproxStats and setSinglePassHistogram. Add imagename= parameter for setApproxStats. Add check on use of approx stats on thematic outputs

* Improved docstrings. Fix backward compatibility of addStatistics()

* Fix teststats.py so it runs stand-alone

* Add workinggrid and singlePassMgr to ApplierReturn object

* Typo

* Beginnings of a non-numba histogram method. Does not yet work, but shows considerable promise

* Reverse arguments to updateHist. Fix calcMin for thematic GDAL histogram

* Remove remnants of numba-based histogram

* Better handling of counts for negative pixel values. Still not entirely right.

* Fix hist min/max for case with both positive and negative pixels

* Guard against attempting histogram when raster is all nulls

* Minor tidying

* For athematic large-int rasters, collapse the single-pass 'direct' histogram to a smaller 'linear' one, which matches how we have always done it with GDAL's GetHistogram() (the binning and collapse are sketched after this list)

* Better name for int type sets

* Check on pyramids and approx stats

* More explicit name directPyramidsSupported

* Minor changes to make sure the new tests work

* New teststats.py, more comprehensive, and better aligned with single-pass stats calculation

* Only use RFC40 to write Histogram column if the layer is thematic, and when doing so, make allowance for HISTOMIN to be something other than zero (a RAT-based sketch appears after this list)

* Only use KEA for ratstats test, since RFC40 is not properly supported in HFA. Be more explicit about the null value, and set thematic before doing calcStats

* Add tests for whether single-pass happened when requested

* Match the old behaviour for thematic layers, forcing HISTOMIN to be zero. Add exception if there are negative values in a thematic layer

* Use int64 for counts, not uint64, to avoid overflow when subtracting totals (illustrated after this list). Fix minor discrepancy in linearHistFromDirect

* Add a more rigorous test on histogram individual counts, but so far only for 'direct' binned case

* Always write binFunction, even when writing histogram with RAT column

* Also apply hist test to athematic direct case

* Remove my paranoid test on linear hist total count. There is now no way it can fail

* Comments explaining the limits on using RAT to write histogram

* Update docstrings to remove reference to numba, and further explain default behaviour of single-pass stuff

* Typo

* Move the omit test outside the format/datatype loops

* Ensure omit check always happens

* Implement a test of histogram counts for linear-binned case

* Typos

* Remove old histLimits method, left over from earlier implementation

* Remove unused variable

* Use a more reliable check of whether to use RAT to read histogram back

* Use numpy.histogram for both direct and linear histograms. Seems fine, and simplifies the code

* Add a test with negative pixel values

* Fix test for null at end of counts

* Test with a null value other than zero

* Include a test with no null value

* Note in docstrings about assuming the null value(s) have already been set

* Allow genRampImageFile to set numRows/numCols

* Added explicit test of pyramid layers

* Cope with the case of a block of all nulls

* Add a test for an output of all nulls

* Simplify default aggregation type code

* Very old typo in docstring

* Tidying up a few docstrings

* Remove now-redundant variable

* Don't use KEA for the extra one-off tests, as it may not be present

* Cope with old GDAL that doesn't have Int64

* Revert to HFA for rat stats test, as KEA may not be present

* Include libgdal-kea in the linux-conda tests

* Close the test output, so that Windows will be able to delete it

* Put file close further down, after all possible uses, but before delete

* Use the more robust test for GDAL 64-bit support

* Another place where test outfile has to be closed before Windows tries to delete it

* Include a print statement so I can track test failure on github

* More debug printing

* Correct checks on 64-bit type support

* Remove debug print statements
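
Two of the messages above refer to a dummy njit decorator used when numba is not installed (later messages remove the numba-based histogram, so the final code may no longer need it). The real code is in calcstats.py, whose diff is not rendered below; the following is only a minimal sketch of the idea, with illustrative names:

```python
# Minimal sketch only, not the RIOS implementation: a no-op stand-in for
# numba.njit when numba is absent, usable both as @njit and as @njit(...).
try:
    from numba import njit
except ImportError:
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            # Used directly as @njit
            return args[0]

        # Used with arguments, e.g. @njit(cache=True)
        def decorator(func):
            return func
        return decorator
```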
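
Single-pass pyramids work by sub-sampling each block as it is written, rather than running a separate full-resolution pass when the file is closed. The real logic lives in calcstats.py (not rendered here) and handles multiple overview levels, driver support checks and aggregation types; this is only a rough sketch of the nearest-neighbour case, using hypothetical names:

```python
def writeBlockToOverview(band_ov, arr, left, top, factor):
    """
    Hypothetical helper (not the RIOS function): write a nearest-neighbour
    sub-sample of a just-written block into an existing overview band.

    arr is the 2-d block written at (left, top) in the full-resolution band,
    and factor is the overview reduction factor. Assumes block origins are
    multiples of factor.
    """
    # Offset so the sampled pixel sits near the centre of each
    # factor x factor cell (see the offset check in testpyramids.py below)
    o = factor // 2
    sub = arr[o::factor, o::factor]
    band_ov.WriteArray(sub, left // factor, top // factor)
```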
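
The switch from uint64 to int64 histogram counts (mentioned above) avoids wrap-around when totals are subtracted. A small numpy illustration of the failure mode:

```python
import numpy

counts = numpy.array([3], dtype=numpy.uint64)
total = numpy.array([5], dtype=numpy.uint64)
print(counts - total)      # [18446744073709551614], i.e. wrapped around

counts = counts.astype(numpy.int64)
total = total.astype(numpy.int64)
print(counts - total)      # [-2], as intended
```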
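
Several messages above distinguish 'direct' binning (one bin per integer value) from 'linear' binning (a fixed number of equal-width bins), both computed with numpy.histogram, and describe collapsing a direct histogram to a linear one for large-integer athematic rasters, matching GDAL's GetHistogram behaviour. The sketch below shows the general idea only; the function names are illustrative, and the real routines in calcstats.py handle null values and bin edges more carefully:

```python
import numpy

def directHistogram(values, minval, maxval):
    """One bin per integer value, from minval to maxval inclusive."""
    nbins = int(maxval) - int(minval) + 1
    counts, _ = numpy.histogram(values, bins=nbins,
                                range=(minval - 0.5, maxval + 0.5))
    return counts


def linearHistogram(values, minval, maxval, nbins=256):
    """A fixed number of equal-width bins across [minval, maxval]."""
    counts, edges = numpy.histogram(values, bins=nbins, range=(minval, maxval))
    return counts, edges


def linearFromDirect(directCounts, nbins=256):
    """
    Collapse a direct histogram into nbins linear bins by summing runs of
    consecutive integer bins. Assumes len(directCounts) >= nbins.
    """
    edges = numpy.linspace(0, len(directCounts), nbins + 1).astype(int)
    return numpy.add.reduceat(directCounts, edges[:-1])
```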
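
The messages about writing the histogram as a RAT column (thematic layers only, allowing a non-zero HISTOMIN, and always recording the bin function) correspond to code in calcstats.py that is not rendered here. A hedged sketch of how that could be done with GDAL's RAT API, not the actual RIOS helper:

```python
import numpy
from osgeo import gdal


def writeHistogramAsRatColumn(band, counts, histMin):
    """
    Illustrative only: store histogram counts as a 'Histogram' RAT column,
    allowing HISTOMIN to be non-zero, and always writing the bin function.
    """
    rat = band.GetDefaultRAT()
    if rat is None:
        rat = gdal.RasterAttributeTable()
    rat.SetRowCount(len(counts))

    colNdx = rat.GetColOfUsage(gdal.GFU_PixelCount)
    if colNdx == -1:
        rat.CreateColumn('Histogram', gdal.GFT_Real, gdal.GFU_PixelCount)
        colNdx = rat.GetColumnCount() - 1
    rat.WriteArray(counts.astype(numpy.float64), colNdx)
    band.SetDefaultRAT(rat)

    band.SetMetadataItem('STATISTICS_HISTOMIN', repr(int(histMin)))
    band.SetMetadataItem('STATISTICS_HISTOBINFUNCTION', 'direct')
```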
neilflood authored Sep 27, 2024
1 parent 1678a16 commit 51740f2
Showing 10 changed files with 1,568 additions and 348 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/python-app.yml
@@ -29,7 +29,7 @@ jobs:
miniforge-variant: Mambaforge
- name: Install dependencies
shell: bash -l {0}
run: conda install gdal cloudpickle scipy
run: conda install gdal libgdal-kea cloudpickle scipy
- name: Test with testrios
shell: bash -l {0}
run: |
276 changes: 230 additions & 46 deletions rios/applier.py

Large diffs are not rendered by default.

1,026 changes: 816 additions & 210 deletions rios/calcstats.py

Large diffs are not rendered by default.

77 changes: 60 additions & 17 deletions rios/imagewriter.py
@@ -107,25 +107,33 @@ def setDefaultDriver():


def writeBlock(gdalOutObjCache, blockDefn, outfiles, outputs, controls,
workinggrid):
workinggrid, singlePassMgr, timings):
"""
Write the given block to the files given in outfiles
"""
for (symbolicName, seqNum, filename) in outfiles:
arr = outputs[symbolicName, seqNum]
# Trim the margin
m = controls.getOptionForImagename('overlap', symbolicName)
if m > 0:
arr = arr[:, m:-m, m:-m]

key = (symbolicName, seqNum)
if key not in gdalOutObjCache:
ds = openOutfile(symbolicName, filename, controls, arr,
workinggrid)
gdalOutObjCache[symbolicName, seqNum] = ds
singlePassMgr.initFor(ds, symbolicName, seqNum, arr)

ds = gdalOutObjCache[symbolicName, seqNum]

# Trim the margin
m = controls.getOptionForImagename('overlap', symbolicName)
if m > 0:
arr = arr[:, m:-m, m:-m]
ds.WriteArray(arr, blockDefn.left, blockDefn.top)
with timings.interval('writing'):
# Write the base raster data
ds.WriteArray(arr, blockDefn.left, blockDefn.top)

# If appropriate, do single-pass actions for this block
calcstats.handleSinglePassActions(ds, arr, singlePassMgr,
symbolicName, seqNum, blockDefn.left, blockDefn.top, timings)


def openOutfile(symbolicName, filename, controls, arr, workinggrid):
@@ -179,17 +187,17 @@ def openOutfile(symbolicName, filename, controls, arr, workinggrid):
return ds


def closeOutfiles(gdalOutObjCache, outfiles, controls):
def closeOutfiles(gdalOutObjCache, outfiles, controls, singlePassMgr, timings):
"""
Close all the output files
"""
# getOpt is just a little local shortcut
getOpt = controls.getOptionForImagename

for (symbolicName, seqNum, filename) in outfiles:
doStats = getOpt('calcStats', symbolicName)
statsIgnore = getOpt('statsIgnore', symbolicName)
omitPyramids = getOpt('omitPyramids', symbolicName)
omitBasicStats = getOpt('omitBasicStats', symbolicName)
omitHistogram = getOpt('omitHistogram', symbolicName)
overviewLevels = getOpt('overviewLevels', symbolicName)
overviewMinDim = getOpt('overviewMinDim', symbolicName)
overviewAggType = getOpt('overviewAggType', symbolicName)
@@ -201,18 +209,36 @@ def closeOutfiles(gdalOutObjCache, outfiles, controls):
progress = SilentProgress()

ds = gdalOutObjCache[symbolicName, seqNum]
if doStats:
if not omitPyramids:
with timings.interval('writing'):
# Ensure that all data has been written
ds.FlushCache()

if (not singlePassMgr.doSinglePassPyramids(symbolicName) and
not omitPyramids):
# Pyramids have not been done single-pass, and are not being
# omitted, so do them on closing (i.e. the old way)
with timings.interval('pyramids'):
calcstats.addPyramid(ds, progress, levels=overviewLevels,
minoverviewdim=overviewMinDim,
aggregationType=overviewAggType)

# Note that statsIgnore is passed in here. This is a historical
# anomaly, from when calcStats was the only time that the null
# value was set. In the current version, it is set when the file
# is created.
calcstats.addStatistics(ds, progress, statsIgnore,
approx_ok=approxStats)
if singlePassMgr.doSinglePassStatistics(symbolicName):
with timings.interval('basicstats'):
calcstats.finishSinglePassStats(ds, singlePassMgr,
symbolicName, seqNum)
# Make the minMaxList from values already on singlePassMgr
minMaxList = makeMinMaxList(singlePassMgr, symbolicName, seqNum)
elif not omitBasicStats:
with timings.interval('basicstats'):
minMaxList = calcstats.addBasicStatsGDAL(ds, approxStats)

if singlePassMgr.doSinglePassHistogram(symbolicName):
with timings.interval('histogram'):
calcstats.finishSinglePassHistogram(ds, singlePassMgr,
symbolicName, seqNum)
elif not omitHistogram:
with timings.interval('histogram'):
calcstats.addHistogramsGDAL(ds, minMaxList, approxStats)

# This is doing everything I can to ensure the file gets fully closed
# at this point.
@@ -226,6 +252,23 @@ def closeOutfiles(gdalOutObjCache, outfiles, controls):
addAutoColorTable(filename, autoColorTableType)


def makeMinMaxList(singlePassMgr, symbolicName, seqNum):
"""
Make a list of min/max values per band, for the nominated output file,
from values already present on singlePassMgr.
    Mimics the list returned by addBasicStatsGDAL, for use with
addHistogramsGDAL.
"""
accumList = singlePassMgr.accumulators[symbolicName, seqNum]
minMaxList = []
for i in range(len(accumList)):
accum = accumList[i]
(minval, maxval) = (accum.minval, accum.maxval)
minMaxList.append((minval, maxval))
return minMaxList


def deleteIfExisting(filename):
"""
Delete the filename if it already exists. If possible, use the
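
The imagewriter.py changes above wrap the raster write, pyramids, basic stats and histogram work in `timings.interval(...)` blocks, matching the 'Add separate timings' commit message. The timings object itself is defined elsewhere in RIOS; a minimal sketch of how such an accumulating timer could behave, purely for illustration:

```python
import time
from contextlib import contextmanager


class Timings:
    """Illustrative accumulator of named timing intervals (not the RIOS class)."""
    def __init__(self):
        self.totals = {}

    @contextmanager
    def interval(self, name):
        start = time.time()
        try:
            yield
        finally:
            elapsed = time.time() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed


# Usage sketch
timings = Timings()
with timings.interval('writing'):
    time.sleep(0.01)    # stands in for ds.WriteArray(...)
print(timings.totals)
```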
4 changes: 4 additions & 0 deletions rios/rioserrors.py
@@ -148,6 +148,10 @@ class WorkerExceptionError(RiosError):
"A worker thread or process has raised an exception"


class SinglePassActionsError(RiosError):
"An error in processing single-pass actions"


deprecationAlreadyWarned = set()


15 changes: 11 additions & 4 deletions rios/riostests/riostestutils.py
@@ -88,12 +88,14 @@ def genRampArray(nRows=DEFAULT_ROWS, nCols=DEFAULT_COLS):


def genRampImageFile(filename, reverse=False, xLeft=DEFAULT_XLEFT,
yTop=DEFAULT_YTOP, nullVal=None):
yTop=DEFAULT_YTOP, nullVal=None,
numRows=DEFAULT_ROWS, numCols=DEFAULT_COLS):
"""
Generate a test image of a simple 2-d linear ramp.
"""
ds = createTestFile(filename, xLeft=xLeft, yTop=yTop)
ramp = genRampArray()
ds = createTestFile(filename, xLeft=xLeft, yTop=yTop, numRows=numRows,
numCols=numCols)
ramp = genRampArray(nRows=numRows, nCols=numCols)
if reverse:
# Flip left-to-right
ramp = ramp[:, ::-1]
@@ -270,7 +272,12 @@ def testAll():
ok = teststats.run()
if not ok:
failureCount += 1


from . import testpyramids
ok = testpyramids.run()
if not ok:
failureCount += 1

from . import testrat
ok = testrat.run()
if not ok:
122 changes: 122 additions & 0 deletions rios/riostests/testpyramids.py
@@ -0,0 +1,122 @@
"""
Test that pyramid layers (i.e. overviews) are written correctly, in both the
single-pass and GDAL cases
"""
# This file is part of RIOS - Raster I/O Simplification
# Copyright (C) 2012 Sam Gillingham, Neil Flood
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
import os

import numpy
from osgeo import gdal

from rios import applier

from rios.riostests import riostestutils

TESTNAME = 'TESTPYRAMIDS'


def run():
"""
Run tests of pyramid layers (i.e. overviews)
"""
riostestutils.reportStart(TESTNAME)

allOK = True

# Create a test input file
rampInfile = 'ramp.img'
# Set a raster size which will result in exactly one pyramid layer
nRows = nCols = 1024
riostestutils.genRampImageFile(rampInfile, numRows=nRows, numCols=nCols)

infiles = applier.FilenameAssociations()
outfiles = applier.FilenameAssociations()
controls = applier.ApplierControls()

infiles.inimg = rampInfile
outfiles.outimg = "ramp.tif"
controls.setOutputDriverName("GTiff")

for singlePass in [True, False]:
controls.setSinglePassPyramids(singlePass)
applier.apply(doit, infiles, outfiles, controls=controls)
ok = checkPyramids(outfiles.outimg, singlePass)
allOK = allOK and ok

for filename in [rampInfile, outfiles.outimg]:
if os.path.exists(filename):
riostestutils.removeRasterFile(filename)

if allOK:
riostestutils.report(TESTNAME, "Passed")

return allOK


def doit(info, inputs, outputs):
outputs.outimg = inputs.inimg


def checkPyramids(filename, singlePass):
"""
Check that the pyramids as written correspond to the base raster
    in the file. If singlePass is True, then we should have centred the low-res
    pixels; if it is False, then GDAL calculated them, and it appears not to
centre the low-res pixels (which I believe to be a bug).
"""
ok = True

ds = gdal.Open(filename)
band = ds.GetRasterBand(1)
arr = band.ReadAsArray()
numOverviews = band.GetOverviewCount()
if numOverviews != 1:
msg = "Incorrect overview count: {}".format(numOverviews)
riostestutils.report(TESTNAME, msg)
ok = False

band_ov = band.GetOverview(0)
arr_ov = band_ov.ReadAsArray()

# Work out which offset (o) to use
factor = int(round(arr.shape[0] / arr_ov.shape[0]))
if singlePass:
o = int(round(factor / 2))
else:
# GDAL doesn't use an offset (sigh....)
o = 0

# The true sub-sampled overview array
true_arr_ov = arr[o::factor, o::factor]
if true_arr_ov.shape != arr_ov.shape:
msg = "Overview shape mis-match: {} != {}".format(true_arr_ov.shape,
arr_ov.shape)
riostestutils.report(TESTNAME, msg)
ok = False
else:
mismatch = (arr_ov != true_arr_ov)
numMismatch = numpy.count_nonzero(mismatch)
if numMismatch > 0:
msg = "Pyramid layer pixel mis-match for {} pixels".format(numMismatch)
riostestutils.report(TESTNAME, msg)
ok = False

return ok


if __name__ == "__main__":
run()
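
The offset logic in checkPyramids above is easiest to see on a tiny array: with a reduction factor of 4, the single-pass pyramids sample the pixel near the centre of each 4x4 cell (offset 2), whereas GDAL's nearest-neighbour overviews take the top-left pixel (offset 0). A small illustration:

```python
import numpy

arr = numpy.arange(8 * 8).reshape((8, 8))
factor = 4

centred = arr[2::4, 2::4]   # offset = factor // 2, as for single-pass pyramids
corner = arr[0::4, 0::4]    # offset = 0, matching GDAL's overviews

print(centred)   # [[18 22]
                 #  [50 54]]
print(corner)    # [[ 0  4]
                 #  [32 36]]
```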
8 changes: 4 additions & 4 deletions rios/riostests/testratstats.py
@@ -24,7 +24,7 @@
from rios import fileinfo
from rios import rat
from rios import cuiprogress
from . import riostestutils
from rios.riostests import riostestutils

TESTNAME = 'TESTRATSTATS'

Expand All @@ -48,24 +48,24 @@ def run():
band.WriteArray(imgArray)

nullDN = 0
band.SetMetadataItem('LAYER_TYPE', 'thematic')
calcstats.calcStats(ds, ignore=nullDN, progress=cuiprogress.SilentProgress())
columnName = 'Value'
# Note that the RAT has a row for lots of values which have no corresponding pixel
ratValues = (numpy.mgrid[0:nRows]**2).astype(numpy.float64)
ratValues[0] = 500
rat.writeColumnToBand(band, columnName, ratValues)
band.SetMetadataItem('LAYER_TYPE', 'thematic')
del ds

ratStats = fileinfo.RatStats(imgfile, columnlist=[columnName])

# Construct an image of the values, by translating pixel values into RAT values
ratValImg = numpy.zeros(imgArray.shape, dtype=numpy.float64)
for dn in numpy.unique(imgArray):
if dn != 0:
if dn != nullDN:
mask = (imgArray == dn)
ratValImg[mask] = ratValues[dn]
ratValImgNonNull = ratValImg[imgArray!=0]
ratValImgNonNull = ratValImg[imgArray!=nullDN]

# Now find the "true" values of the various stats for this image (i.e. this is the
# histogramweighted=True case, which I think will be the most common one)