Rasterization fails for relatively few staged files inconsistently #7
Made a branch to test a retry approach. First, a toy function exemplifying the pattern:

```python
# define function to exemplify retry approach
def test_func(x, y):
    for attempt in range(2):
        try:
            z = x / y
            print(f'Success: {x} / {y} = {z}')
            return z
        except Exception as e:
            print(f'Error dividing {x} by {y} so trying again.')
        else:
            break
    else:
        print(f'Error dividing {x} by {y}, ran out of retries.')
        return None
```
Apply the function with expected input types that won't error, then with unexpected input types that will error:
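The original output captures are not preserved in this thread; the following calls, with hypothetical inputs, illustrate what the function prints:

```python
test_func(10, 2)
# Success: 10 / 2 = 5.0

test_func(10, 0)
# Error dividing 10 by 0 so trying again.
# Error dividing 10 by 0 so trying again.
# Error dividing 10 by 0, ran out of retries.
```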
This retry feature is then applied to `rasterize_vector()`:

```python
def rasterize_vector(self, path, overwrite=True):
    """
    Given a path to an output file from the viz-staging step, create a
    GeoTIFF and save it to the configured dir_geotiff directory. By
    default, if the output geotiff already exists, it will be
    overwritten. To change this behaviour, set overwrite to False.
    During this process, the min and max values (and other summary
    stats) of the data arrays that comprise the GeoTIFFs for each band
    will be tracked.

    Parameters
    ----------
    path : str
        Path to the staged vector file to rasterize.
    overwrite : bool
        Optional, defaults to True. If set to False, then if there is
        an existing GeoTIFF tile at the output path, rasterization
        will be skipped.

    Returns
    -------
    morecantile.Tile or None
        The tile that was rasterized, or None if there was an error.
    """
    for attempt in range(2):
        try:
            # Get information about the tile from the path
            tile = self.tiles.tile_from_path(path)
            out_path = self.tiles.path_from_tile(tile, 'geotiff')
            if os.path.isfile(out_path) and not overwrite:
                logger.info(f'Skip rasterizing {path} for tile {tile}.'
                            ' Tile already exists.')
                return None
            bounds = self.tiles.get_bounding_box(tile)
            # Track and log the event
            id = self.__start_tracking('geotiffs_from_vectors')
            logger.info(f'Rasterizing {path} for tile {tile} to {out_path}.')
            gdf = gpd.read_file(path)
            # Check if deduplication should be performed first
            dedup_here = self.config.deduplicate_at('raster')
            dedup_method = self.config.get_deduplication_method()
            if dedup_here and dedup_method is not None:
                prop_duplicated = self.config.polygon_prop('duplicated')
                if prop_duplicated in gdf.columns:
                    gdf = gdf[~gdf[prop_duplicated]]
            # Get properties to pass to the rasterizer
            raster_opts = self.config.get_raster_config()
            # Rasterize
            raster = Raster.from_vector(
                vector=gdf, bounds=bounds, **raster_opts)
            raster.write(out_path)
            # Track and log the end of the event
            message = f'Rasterization for tile {tile} complete.'
            self.__end_tracking(id, raster=raster, tile=tile, message=message)
            logger.info(
                f'Complete rasterization of tile {tile} to {out_path}.')
            return tile
        except Exception as e:
            logger.info(f'Error rasterizing {path} for tile {tile} so trying again.')
        else:
            break
    else:
        message = f'Error rasterizing {path} for tile {tile}, ran out of retries.'
        self.__end_tracking(id, tile=tile, message=message)
        # note that the error=e argument is removed from __end_tracking(),
        # because e is no longer defined here: Python 3 deletes the
        # 'except ... as e' name when the except clause ends
        # TODO: find a way to keep e for the error message for a better workflow
        return None
```

We apply this within the virtual environment. Installed packages:

```
Package            Version    Editable project location
------------------ ---------- -------------------------
affine 2.3.1
asttokens 2.0.5
attrs 22.2.0
backcall 0.2.0
bcrypt 4.0.1
certifi 2022.12.7
cffi 1.15.1
charset-normalizer 2.1.1
click 8.1.3
click-plugins 1.1.1
cligj 0.7.2
coloraide 0.18.1
colormaps 0.3
contourpy 1.0.6
cryptography 38.0.4
cycler 0.11.0
debugpy 1.5.1
decorator 5.1.1
dill 0.3.6
entrypoints 0.4
executing 0.8.3
filelock 3.8.2
Fiona 1.8.22
fonttools 4.38.0
geopandas 0.12.2
globus-sdk 3.15.1
idna 3.4
ipykernel 6.15.2
ipython 8.7.0
jedi 0.18.1
jupyter_client 7.4.7
jupyter_core 4.11.2
kiwisolver 1.4.4
matplotlib 3.6.2
matplotlib-inline 0.1.6
morecantile 3.2.5
munch 2.5.0
nest-asyncio 1.5.5
numpy 1.24.0
packaging 22.0
pandas 1.5.2
paramiko 2.12.0
parsl 2022.12.19
parso 0.8.3
pdgraster 0.1.0 /home/jcohen/viz-raster
pdgstaging 0.1.0 /home/jcohen/viz-staging
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.3.0
pip 22.3.1
prompt-toolkit 3.0.20
psutil 5.9.4
ptyprocess 0.7.0
pure-eval 0.2.2
pycparser 2.21
pydantic 1.10.2
Pygments 2.11.2
PyJWT 2.6.0
PyNaCl 1.5.0
pyparsing 3.0.9
pyproj 3.4.1
python-dateutil 2.8.2
pytz 2022.7
pyzmq 24.0.1
rasterio 1.3.4
requests 2.28.1
Rtree 0.9.7
setproctitle 1.3.2
setuptools 65.5.0
shapely 2.0.0
six 1.16.0
snuggs 1.4.7
stack-data 0.2.0
tblib 1.7.0
tornado 6.2
traitlets 5.7.1
typeguard 2.13.3
typing_extensions 4.4.0
urllib3 1.26.13
wcwidth 0.2.5
wheel              0.37.1
```

The config used for this run:

```json
{
"dir_input": "/home/jcohen/viz-workflow/raster-retry/data_subsample",
"dir_geotiff": "raster-retry/OUTPUT_GEOTIFFS",
"dir_web_tiles": "raster-retry/OUTPUT_WEBTILE",
"dir_staged": "raster-retry/OUTPUT_STAGING_TILES",
"filename_staging_summary": "raster-retry/staging_summary.csv",
"filename_rasterization_events": "raster-retry/raster_events.csv",
"filename_rasters_summary": "raster-retry/raster_summary.csv",
"version": "DATE",
"ext_input": ".gpkg",
"simplify_tolerance": 0.0001,
"tms_id": "WGS1984Quad",
"z_range": [0, 11],
"statistics": [
{
"name": "polygon_count",
"weight_by": "count",
"property": "centroids_per_pixel",
"aggregation_method": "sum",
"resampling_method": "sum",
"val_range": [0, null],
"nodata_val": 0,
"nodata_color": "#ffffff00",
"palette": ["#d9c43f", "#d93fce"]
},
{
"name": "coverage",
"weight_by": "area",
"property": "area_per_pixel_area",
"aggregation_method": "sum",
"resampling_method": "average",
"val_range": [0, 1],
"nodata_val": 0,
"nodata_color": "#ffffff00",
"palette": ["#d9c43f", "#d93fce"]
}
],
"deduplicate_at": ["staging"],
"deduplicate_method": "neighbor",
"deduplicate_keep_rules": [["staging_filename", "larger"]],
"deduplicate_overlap_tolerance": 0.1,
"deduplicate_overlap_both": false,
"deduplicate_centroid_tolerance": null
}
```

The data sample is 3000 randomly sampled polygons from the lake size change dataset (1000 from each UTM zone). To ensure the retry code actually runs, we purposefully corrupt one of the GeoPackage files before rasterization.
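One way to corrupt a GeoPackage for this kind of test (a hypothetical sketch; the exact method used in this run is not shown above, and the path below is illustrative) is to clobber the file header so that `gpd.read_file()` raises:

```python
# Hypothetical sketch: deliberately corrupt one staged GeoPackage so that
# reading it fails and the retry branch is exercised.
path = 'raster-retry/OUTPUT_STAGING_TILES/WGS1984Quad/11/example.gpkg'  # illustrative
with open(path, 'r+b') as f:
    f.write(b'\x00' * 100)  # overwrite the SQLite/GeoPackage header bytes
```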
Executed a run with the same environment and config as above, but inserted the exception variable `e` into the retry log message. `rasterize_vector()` is otherwise identical to the version above (the closing comments about `__end_tracking()` were also dropped), so only the changed except clause is shown:

```python
        except Exception as e:
            logger.info(f'Error rasterizing {path} for tile {tile} due to error {e} so trying again.')
```

I'm curious if there is a benefit to printing the explicit error. For this run, which staged 3000 polygons, we corrupted 1 file before rasterization started:
No errors were reported in the log for this run. For the run with the original `rasterize_vector()` (without the explicit error in the log message), no errors were reported in that log either.
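One TODO from the first code block is keeping the exception available for `__end_tracking()` after the loop, since Python 3 deletes the `except ... as e` name when the except clause ends. A minimal sketch of one fix, using the toy function from above rather than repo code:

```python
def test_func(x, y):
    last_error = None
    for attempt in range(2):
        try:
            z = x / y
            print(f'Success: {x} / {y} = {z}')
            return z
        except Exception as e:
            last_error = e  # bind to a name that outlives the except clause
            print(f'Error dividing {x} by {y} so trying again.')
    # last_error is still defined here, so rasterize_vector() could pass it
    # along, e.g. __end_tracking(..., error=last_error)
    print(f'Error dividing {x} by {y}, ran out of retries: {last_error!r}')
    return None
```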
Just tested this with everything the same except the number of retries reduced to 1. I figure this is better because with large datasets such as IWP, even with only ~0.5% of the files initially failing, reducing the number of retries from 2 to 1 will save time and Delta credits. We don't want to waste resources retrying the same file an unnecessary number of times. I would ideally like to test this workflow on a larger volume of data now that the syntax and number of retries have proven successful. I could use all the lake change files Ingmar uploaded to the PDG Google Drive. I am not sure if Robyn already processed them, but I believe she has not. It would be interesting to run all those files with the current approach.
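If the retry count keeps changing between runs, it may be worth making it a parameter rather than a hard-coded `range(2)`. A generic sketch (hypothetical helper, not part of pdgraster):

```python
import logging

logger = logging.getLogger(__name__)

def with_retries(func, *args, max_attempts=1, **kwargs):
    """Call func(*args, **kwargs), retrying up to max_attempts times.

    Returns the function's result, or None if every attempt raised.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            last_error = e
            logger.info(f'Attempt {attempt}/{max_attempts} failed: {e}')
    logger.info(f'Ran out of retries; last error: {last_error!r}')
    return None
```

Usage could then look like `with_retries(rasterize_one, path, max_attempts=1)`, where `rasterize_one` is a hypothetical helper holding the body of the current try block.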
Rasterization step logs errors for relatively few staged files compared to the total number of staged files. This occurs for both the maximum z-level and parent z-levels, according to log.log. For the lake size change data sample, the GeoPackage files that errored during a certain parsl run are not obviously corrupt, and when rasterization is applied again (outside of the workflow, but still with parallelization) the files are successfully rasterized. See this documented here. This was also seen in the Ray workflow with the IWP dataset, as Robyn noted. It would be helpful to do more runs with various datasets to determine whether the failing files are ever consistent, and whether the errors are random (and can therefore be solved by just trying to rasterize these few staged files again with the same approach), or the files are actually corrupt, or it has to do with the files not transferring to the scratch dir as Robyn suggested.
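A sketch of the kind of out-of-workflow re-run described above. Here `rasterizer` is assumed to be an instance of the class whose `rasterize_vector()` method is shown earlier, and the paths are hypothetical placeholders for files collected from log.log or the rasterization events CSV:

```python
# Re-rasterize only the staged files that logged errors during the parsl run.
failed_paths = [
    'raster-retry/OUTPUT_STAGING_TILES/WGS1984Quad/11/a.gpkg',  # hypothetical
    'raster-retry/OUTPUT_STAGING_TILES/WGS1984Quad/11/b.gpkg',  # hypothetical
]
still_failing = []
for path in failed_paths:
    tile = rasterizer.rasterize_vector(path)
    if tile is None:  # rasterize_vector() returns None on error
        still_failing.append(path)
print(f'{len(still_failing)} of {len(failed_paths)} files failed again')
```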