Single-pass stats, pyramids, and histogram (#116)
* Bare bones of single-pass stats support, not yet complete

* Somewhat neater management of singlePassInfo object

* Closer to a working version

* Seems to be roughly working, but needs lots more testing to be sure

* Get maxval right

* Improve aggregationType handling. More to think about here. Fix array sub-sampling for pyramids (the sub-sampling idea is sketched after this list)

* Simplify aggType handling. Commented out band_ov.FlushCache(), as it seems to go Very Slowly

* Make a proper dummy njit decorator for when numba is not available (a minimal sketch appears after this list). Remove attempt at flushing pyramid block from Python, as this works, but slows everything down. Will have to rely on the KEA C++ fix

* Change the default value of DEFAULT_OVERVIEWAGGREGRATIONTYPE to 'NEAREST', which it probably should have been all along

* Much simpler dummy njit

* singlePassMgr instead of singlePassInfo

* Check each output driver for pyramid writing support

* Add separate timings for pyramids/stats/histogram

* Cope with completely null output raster

* Don't write stats when whole raster is null

* Bit more explanation in SinglePassManager docstring

* Put 'closing' timer after the other things

* Remove 'closing' as a known timer name

* Re-factoring of addStatistics() routine, to allow better sharing of code with single-pass calculations. More to do, but so far seems to work

* Mostly completed re-factoring of old addStatistics() routine, and its use in imagewriter. Can now manage each of the three actions independently

* Include cache flush in 'writing' timer

* Simplify field names on HistogramParams class

* Improved docstrings for setApproxStats and setSinglePassHistogram. Add imagename= parameter for setApproxStats. Add check on use of approx stats on thematic outputs

* Improved docstrings. Fix backward compatibility of addStatistics()

* Fix teststats.py so it runs stand-alone

* Add workinggrid and singlePassMgr to ApplierReturn object

* Typo

* Beginnings of a non-numba histogram method. Does not yet work, but shows considerable promise

* Reverse arguments to updateHist. Fix calcMin for thematic GDAL histogram

* Remove remnants of numba-based histogram

* Better handling of counts for negative pixel values. Still not entirely right.

* Fix hist min/max for case with both positive and negative pixels

* Guard against attempting histogram when raster is all nulls

* Minor tidying

* For athematic large-int rasters, collapse the single-pass 'direct' histogram to a smaller 'linear' one, which matches how we have always done it with GDAL's GetHistogram() (the binning and collapse are sketched after this list)

* Better name for int type sets

* Check on pyramids and approx stats

* More explicit name directPyramidsSupported

* Minor changes to make sure the new tests work

* New teststats.py, more comprehensive, and better aligned with single-pass stats calculation

* Only use RFC40 to write Histogram column if the layer is thematic, and when doing so, make allowance for HISTOMIN to be something other than zero (a RAT-based sketch appears after this list)

* Only use KEA for ratstats test, since RFC40 is not properly supported in HFA. Be more explicit about the null value, and set thematic before doing calcStats

* Add tests for whether single-pass happened when requested

* Match the old behaviour for thematic layers, forcing HISTOMIN to be zero. Add exception if there are negative values in a thematic layer

* Use int64 for counts, not uint64, to avoid overflow when subtracting totals (illustrated after this list). Fix minor discrepancy in linearHistFromDirect

* Add a more rigorous test on histogram individual counts, but so far only for 'direct' binned case

* Always write binFunction, even when writing histogram with RAT column

* Also apply hist test to athematic direct case

* Remove my paranoid test on linear hist total count. There is now no way it can fail

* Comments explaining the limits on using RAT to write histogram

* Update docstrings to remove reference to numba, and further explain default behaviour of single-pass stuff

* Typo

* Move the omit test outside the format/datatype loops

* Ensure omit check always happens

* Implement a test of histogram counts for linear-binned case

* Typos

* Remove old histLimits method, left over from earlier implementation

* Remove unused variable

* Use a more reliable check of whether to use RAT to read histogram back

* Use numpy.histogram for both direct and linear histograms. Seems fine, and simplifies the code

* Add a test with negative pixel values

* Fix test for null at end of counts

* Test with a null value other than zero

* Include a test with no null value

* Note in docstrings about assuming the null value(s) have already been set

* Allow genRampImageFile to set numRows/numCols

* Added explicit test of pyramid layers

* Cope with the case of a block of all nulls

* Add a test for an output of all nulls

* Simplify default aggregation type code

* Very old typo in docstring

* Tidying up a few docstrings

* Remove now-redundant variable

* Don't use KEA for the extra one-off tests, as it may not be present

* Cope with old GDAL that doesn't have Int64

* Revert to HFA for rat stats test, as KEA may not be present

* Include libgdal-kea in the linux-conda tests

* Close the test output, so that Windows will be able to delete it

* Put file close further down, after all possible uses, but before delete

* Use the more robust test for GDAL 64-bit support

* Another place where test outfile has to be closed before Windows tries to delete it

* Include a print statement so I can track test failure on github

* More debug printing

* Correct checks on 64-bit type support

* Remove debug print statements
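
Two of the messages above refer to a dummy njit decorator used when numba is not installed (later messages remove the numba-based histogram, so the final code may no longer need it). The real code is in calcstats.py, whose diff is not rendered below; the following is only a minimal sketch of the idea, with illustrative names:

```python
# Minimal sketch only, not the RIOS implementation: a no-op stand-in for
# numba.njit when numba is absent, usable both as @njit and as @njit(...).
try:
    from numba import njit
except ImportError:
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            # Used directly as @njit
            return args[0]

        # Used with arguments, e.g. @njit(cache=True)
        def decorator(func):
            return func
        return decorator
```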
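
Single-pass pyramids work by sub-sampling each block as it is written, rather than running a separate full-resolution pass when the file is closed. The real logic lives in calcstats.py (not rendered here) and handles multiple overview levels, driver support checks and aggregation types; this is only a rough sketch of the nearest-neighbour case, using hypothetical names:

```python
def writeBlockToOverview(band_ov, arr, left, top, factor):
    """
    Hypothetical helper (not the RIOS function): write a nearest-neighbour
    sub-sample of a just-written block into an existing overview band.

    arr is the 2-d block written at (left, top) in the full-resolution band,
    and factor is the overview reduction factor. Assumes block origins are
    multiples of factor.
    """
    # Offset so the sampled pixel sits near the centre of each
    # factor x factor cell (see the offset check in testpyramids.py below)
    o = factor // 2
    sub = arr[o::factor, o::factor]
    band_ov.WriteArray(sub, left // factor, top // factor)
```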
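
The switch from uint64 to int64 histogram counts (mentioned above) avoids wrap-around when totals are subtracted. A small numpy illustration of the failure mode:

```python
import numpy

counts = numpy.array([3], dtype=numpy.uint64)
total = numpy.array([5], dtype=numpy.uint64)
print(counts - total)      # [18446744073709551614], i.e. wrapped around

counts = counts.astype(numpy.int64)
total = total.astype(numpy.int64)
print(counts - total)      # [-2], as intended
```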
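
Several messages above distinguish 'direct' binning (one bin per integer value) from 'linear' binning (a fixed number of equal-width bins), both computed with numpy.histogram, and describe collapsing a direct histogram to a linear one for large-integer athematic rasters, matching GDAL's GetHistogram behaviour. The sketch below shows the general idea only; the function names are illustrative, and the real routines in calcstats.py handle null values and bin edges more carefully:

```python
import numpy

def directHistogram(values, minval, maxval):
    """One bin per integer value, from minval to maxval inclusive."""
    nbins = int(maxval) - int(minval) + 1
    counts, _ = numpy.histogram(values, bins=nbins,
                                range=(minval - 0.5, maxval + 0.5))
    return counts


def linearHistogram(values, minval, maxval, nbins=256):
    """A fixed number of equal-width bins across [minval, maxval]."""
    counts, edges = numpy.histogram(values, bins=nbins, range=(minval, maxval))
    return counts, edges


def linearFromDirect(directCounts, nbins=256):
    """
    Collapse a direct histogram into nbins linear bins by summing runs of
    consecutive integer bins. Assumes len(directCounts) >= nbins.
    """
    edges = numpy.linspace(0, len(directCounts), nbins + 1).astype(int)
    return numpy.add.reduceat(directCounts, edges[:-1])
```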
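
The messages about writing the histogram as a RAT column (thematic layers only, allowing a non-zero HISTOMIN, and always recording the bin function) correspond to code in calcstats.py that is not rendered here. A hedged sketch of how that could be done with GDAL's RAT API, not the actual RIOS helper:

```python
import numpy
from osgeo import gdal


def writeHistogramAsRatColumn(band, counts, histMin):
    """
    Illustrative only: store histogram counts as a 'Histogram' RAT column,
    allowing HISTOMIN to be non-zero, and always writing the bin function.
    """
    rat = band.GetDefaultRAT()
    if rat is None:
        rat = gdal.RasterAttributeTable()
    rat.SetRowCount(len(counts))

    colNdx = rat.GetColOfUsage(gdal.GFU_PixelCount)
    if colNdx == -1:
        rat.CreateColumn('Histogram', gdal.GFT_Real, gdal.GFU_PixelCount)
        colNdx = rat.GetColumnCount() - 1
    rat.WriteArray(counts.astype(numpy.float64), colNdx)
    band.SetDefaultRAT(rat)

    band.SetMetadataItem('STATISTICS_HISTOMIN', repr(int(histMin)))
    band.SetMetadataItem('STATISTICS_HISTOBINFUNCTION', 'direct')
```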
neilflood authored Sep 27, 2024
1 parent 1678a16 commit 51740f2
Showing 10 changed files with 1,568 additions and 348 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/python-app.yml
@@ -29,7 +29,7 @@ jobs:
miniforge-variant: Mambaforge
- name: Install dependencies
shell: bash -l {0}
run: conda install gdal cloudpickle scipy
run: conda install gdal libgdal-kea cloudpickle scipy
- name: Test with testrios
shell: bash -l {0}
run: |
276 changes: 230 additions & 46 deletions rios/applier.py

Large diffs are not rendered by default.

1,026 changes: 816 additions & 210 deletions rios/calcstats.py

Large diffs are not rendered by default.

77 changes: 60 additions & 17 deletions rios/imagewriter.py
@@ -107,25 +107,33 @@ def setDefaultDriver():


def writeBlock(gdalOutObjCache, blockDefn, outfiles, outputs, controls,
workinggrid):
workinggrid, singlePassMgr, timings):
"""
Write the given block to the files given in outfiles
"""
for (symbolicName, seqNum, filename) in outfiles:
arr = outputs[symbolicName, seqNum]
# Trim the margin
m = controls.getOptionForImagename('overlap', symbolicName)
if m > 0:
arr = arr[:, m:-m, m:-m]

key = (symbolicName, seqNum)
if key not in gdalOutObjCache:
ds = openOutfile(symbolicName, filename, controls, arr,
workinggrid)
gdalOutObjCache[symbolicName, seqNum] = ds
singlePassMgr.initFor(ds, symbolicName, seqNum, arr)

ds = gdalOutObjCache[symbolicName, seqNum]

# Trim the margin
m = controls.getOptionForImagename('overlap', symbolicName)
if m > 0:
arr = arr[:, m:-m, m:-m]
ds.WriteArray(arr, blockDefn.left, blockDefn.top)
with timings.interval('writing'):
# Write the base raster data
ds.WriteArray(arr, blockDefn.left, blockDefn.top)

# If appropriate, do single-pass actions for this block
calcstats.handleSinglePassActions(ds, arr, singlePassMgr,
symbolicName, seqNum, blockDefn.left, blockDefn.top, timings)


def openOutfile(symbolicName, filename, controls, arr, workinggrid):
@@ -179,17 +187,17 @@ def openOutfile(symbolicName, filename, controls, arr, workinggrid):
return ds


def closeOutfiles(gdalOutObjCache, outfiles, controls):
def closeOutfiles(gdalOutObjCache, outfiles, controls, singlePassMgr, timings):
"""
Close all the output files
"""
# getOpt is just a little local shortcut
getOpt = controls.getOptionForImagename

for (symbolicName, seqNum, filename) in outfiles:
doStats = getOpt('calcStats', symbolicName)
statsIgnore = getOpt('statsIgnore', symbolicName)
omitPyramids = getOpt('omitPyramids', symbolicName)
omitBasicStats = getOpt('omitBasicStats', symbolicName)
omitHistogram = getOpt('omitHistogram', symbolicName)
overviewLevels = getOpt('overviewLevels', symbolicName)
overviewMinDim = getOpt('overviewMinDim', symbolicName)
overviewAggType = getOpt('overviewAggType', symbolicName)
@@ -201,18 +209,36 @@ def closeOutfiles(gdalOutObjCache, outfiles, controls):
progress = SilentProgress()

ds = gdalOutObjCache[symbolicName, seqNum]
if doStats:
if not omitPyramids:
with timings.interval('writing'):
# Ensure that all data has been written
ds.FlushCache()

if (not singlePassMgr.doSinglePassPyramids(symbolicName) and
not omitPyramids):
# Pyramids have not been done single-pass, and are not being
# omitted, so do them on closing (i.e. the old way)
with timings.interval('pyramids'):
calcstats.addPyramid(ds, progress, levels=overviewLevels,
minoverviewdim=overviewMinDim,
aggregationType=overviewAggType)

# Note that statsIgnore is passed in here. This is a historical
# anomaly, from when calcStats was the only time that the null
# value was set. In the current version, it is set when the file
# is created.
calcstats.addStatistics(ds, progress, statsIgnore,
approx_ok=approxStats)
if singlePassMgr.doSinglePassStatistics(symbolicName):
with timings.interval('basicstats'):
calcstats.finishSinglePassStats(ds, singlePassMgr,
symbolicName, seqNum)
# Make the minMaxList from values already on singlePassMgr
minMaxList = makeMinMaxList(singlePassMgr, symbolicName, seqNum)
elif not omitBasicStats:
with timings.interval('basicstats'):
minMaxList = calcstats.addBasicStatsGDAL(ds, approxStats)

if singlePassMgr.doSinglePassHistogram(symbolicName):
with timings.interval('histogram'):
calcstats.finishSinglePassHistogram(ds, singlePassMgr,
symbolicName, seqNum)
elif not omitHistogram:
with timings.interval('histogram'):
calcstats.addHistogramsGDAL(ds, minMaxList, approxStats)

# This is doing everything I can to ensure the file gets fully closed
# at this point.
@@ -226,6 +252,23 @@ def closeOutfiles(gdalOutObjCache, outfiles, controls):
addAutoColorTable(filename, autoColorTableType)


def makeMinMaxList(singlePassMgr, symbolicName, seqNum):
"""
Make a list of min/max values per band, for the nominated output file,
from values already present on singlePassMgr.
    Mimics the list returned by addBasicStatsGDAL, for use with
addHistogramsGDAL.
"""
accumList = singlePassMgr.accumulators[symbolicName, seqNum]
minMaxList = []
for i in range(len(accumList)):
accum = accumList[i]
(minval, maxval) = (accum.minval, accum.maxval)
minMaxList.append((minval, maxval))
return minMaxList


def deleteIfExisting(filename):
"""
Delete the filename if it already exists. If possible, use the
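
The imagewriter.py changes above wrap the raster write, pyramids, basic stats and histogram work in `timings.interval(...)` blocks, matching the 'Add separate timings' commit message. The timings object itself is defined elsewhere in RIOS; a minimal sketch of how such an accumulating timer could behave, purely for illustration:

```python
import time
from contextlib import contextmanager


class Timings:
    """Illustrative accumulator of named timing intervals (not the RIOS class)."""
    def __init__(self):
        self.totals = {}

    @contextmanager
    def interval(self, name):
        start = time.time()
        try:
            yield
        finally:
            elapsed = time.time() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed


# Usage sketch
timings = Timings()
with timings.interval('writing'):
    time.sleep(0.01)    # stands in for ds.WriteArray(...)
print(timings.totals)
```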
4 changes: 4 additions & 0 deletions rios/rioserrors.py
@@ -148,6 +148,10 @@ class WorkerExceptionError(RiosError):
"A worker thread or process has raised an exception"


class SinglePassActionsError(RiosError):
"An error in processing single-pass actions"


deprecationAlreadyWarned = set()


15 changes: 11 additions & 4 deletions rios/riostests/riostestutils.py
@@ -88,12 +88,14 @@ def genRampArray(nRows=DEFAULT_ROWS, nCols=DEFAULT_COLS):


def genRampImageFile(filename, reverse=False, xLeft=DEFAULT_XLEFT,
yTop=DEFAULT_YTOP, nullVal=None):
yTop=DEFAULT_YTOP, nullVal=None,
numRows=DEFAULT_ROWS, numCols=DEFAULT_COLS):
"""
Generate a test image of a simple 2-d linear ramp.
"""
ds = createTestFile(filename, xLeft=xLeft, yTop=yTop)
ramp = genRampArray()
ds = createTestFile(filename, xLeft=xLeft, yTop=yTop, numRows=numRows,
numCols=numCols)
ramp = genRampArray(nRows=numRows, nCols=numCols)
if reverse:
# Flip left-to-right
ramp = ramp[:, ::-1]
@@ -270,7 +272,12 @@ def testAll():
ok = teststats.run()
if not ok:
failureCount += 1


from . import testpyramids
ok = testpyramids.run()
if not ok:
failureCount += 1

from . import testrat
ok = testrat.run()
if not ok:
122 changes: 122 additions & 0 deletions rios/riostests/testpyramids.py
@@ -0,0 +1,122 @@
"""
Test that pyramid layers (i.e. overviews) are written correctly, in both the
single-pass and GDAL cases
"""
# This file is part of RIOS - Raster I/O Simplification
# Copyright (C) 2012 Sam Gillingham, Neil Flood
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
import os

import numpy
from osgeo import gdal

from rios import applier

from rios.riostests import riostestutils

TESTNAME = 'TESTPYRAMIDS'


def run():
"""
Run tests of pyramid layers (i.e. overviews)
"""
riostestutils.reportStart(TESTNAME)

allOK = True

# Create a test input file
rampInfile = 'ramp.img'
# Set a raster size which will result in exactly one pyramid layer
nRows = nCols = 1024
riostestutils.genRampImageFile(rampInfile, numRows=nRows, numCols=nCols)

infiles = applier.FilenameAssociations()
outfiles = applier.FilenameAssociations()
controls = applier.ApplierControls()

infiles.inimg = rampInfile
outfiles.outimg = "ramp.tif"
controls.setOutputDriverName("GTiff")

for singlePass in [True, False]:
controls.setSinglePassPyramids(singlePass)
applier.apply(doit, infiles, outfiles, controls=controls)
ok = checkPyramids(outfiles.outimg, singlePass)
allOK = allOK and ok

for filename in [rampInfile, outfiles.outimg]:
if os.path.exists(filename):
riostestutils.removeRasterFile(filename)

if allOK:
riostestutils.report(TESTNAME, "Passed")

return allOK


def doit(info, inputs, outputs):
outputs.outimg = inputs.inimg


def checkPyramids(filename, singlePass):
"""
Check that the pyramids as written correspond to the base raster
    in the file. If singlePass is True, then we should have centred the low-res
    pixels; if it is False, then GDAL calculated them, and it appears not to
centre the low-res pixels (which I believe to be a bug).
"""
ok = True

ds = gdal.Open(filename)
band = ds.GetRasterBand(1)
arr = band.ReadAsArray()
numOverviews = band.GetOverviewCount()
if numOverviews != 1:
msg = "Incorrect overview count: {}".format(numOverviews)
riostestutils.report(TESTNAME, msg)
ok = False

band_ov = band.GetOverview(0)
arr_ov = band_ov.ReadAsArray()

# Work out which offset (o) to use
factor = int(round(arr.shape[0] / arr_ov.shape[0]))
if singlePass:
o = int(round(factor / 2))
else:
# GDAL doesn't use an offset (sigh....)
o = 0

# The true sub-sampled overview array
true_arr_ov = arr[o::factor, o::factor]
if true_arr_ov.shape != arr_ov.shape:
msg = "Overview shape mis-match: {} != {}".format(true_arr_ov.shape,
arr_ov.shape)
riostestutils.report(TESTNAME, msg)
ok = False
else:
mismatch = (arr_ov != true_arr_ov)
numMismatch = numpy.count_nonzero(mismatch)
if numMismatch > 0:
msg = "Pyramid layer pixel mis-match for {} pixels".format(numMismatch)
riostestutils.report(TESTNAME, msg)
ok = False

return ok


if __name__ == "__main__":
run()
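
The offset logic in checkPyramids above is easiest to see on a tiny array: with a reduction factor of 4, the single-pass pyramids sample the pixel near the centre of each 4x4 cell (offset 2), whereas GDAL's nearest-neighbour overviews take the top-left pixel (offset 0). A small illustration:

```python
import numpy

arr = numpy.arange(8 * 8).reshape((8, 8))
factor = 4

centred = arr[2::4, 2::4]   # offset = factor // 2, as for single-pass pyramids
corner = arr[0::4, 0::4]    # offset = 0, matching GDAL's overviews

print(centred)   # [[18 22]
                 #  [50 54]]
print(corner)    # [[ 0  4]
                 #  [32 36]]
```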
8 changes: 4 additions & 4 deletions rios/riostests/testratstats.py
@@ -24,7 +24,7 @@
from rios import fileinfo
from rios import rat
from rios import cuiprogress
from . import riostestutils
from rios.riostests import riostestutils

TESTNAME = 'TESTRATSTATS'

Expand All @@ -48,24 +48,24 @@ def run():
band.WriteArray(imgArray)

nullDN = 0
band.SetMetadataItem('LAYER_TYPE', 'thematic')
calcstats.calcStats(ds, ignore=nullDN, progress=cuiprogress.SilentProgress())
columnName = 'Value'
# Note that the RAT has a row for lots of values which have no corresponding pixel
ratValues = (numpy.mgrid[0:nRows]**2).astype(numpy.float64)
ratValues[0] = 500
rat.writeColumnToBand(band, columnName, ratValues)
band.SetMetadataItem('LAYER_TYPE', 'thematic')
del ds

ratStats = fileinfo.RatStats(imgfile, columnlist=[columnName])

# Construct an image of the values, by translating pixel values into RAT values
ratValImg = numpy.zeros(imgArray.shape, dtype=numpy.float64)
for dn in numpy.unique(imgArray):
if dn != 0:
if dn != nullDN:
mask = (imgArray == dn)
ratValImg[mask] = ratValues[dn]
ratValImgNonNull = ratValImg[imgArray!=0]
ratValImgNonNull = ratValImg[imgArray!=nullDN]

# Now find the "true" values of the various stats for this image (i.e. this is the
# histogramweighted=True case, which I think will be the most common one)