Common input pipeline for single- and multi-band imagery and AOI object for input data (#309)

* update csv to expect 3 mandatory columns and one optional. See comments in issue #221

* use inference data for binary segmentation in tests/, not data/

* environment.yml: hardcode setuptools version because of pytorch bug

* environment.yml: set correct subversion to setuptools

* environment.yml: move setuptools from conda section to pip

* sampling_segmentation.py: implement AOI class
verifications.py: update assert_crs_match function, add validate functions for rasters and vector files

* remove support for AWS bucket via boto3

* finish draft of sampling with AOI objects (with basic validation), rather than from raw csv lines

* environment.yml: fix and update

* environment.yml: add issue link for setuptools

* environment.yml: add issue link for setuptools

* environment.yml: fix and update

* environment.yml: add issue link for setuptools

* environment.yml: add issue link for setuptools

* sampling_segmentation.py: implement AOI class
verifications.py: update assert_crs_match function, add validate functions for rasters and vector files

* finish draft of sampling with AOI objects (with basic validation), rather than from raw csv lines

* train_segmentation.py: add warning for debugging and skip save checkpoint if val loss is None

* tests/data/massachusetts: restore larger format to prevent val_loss=None

* tests/data/massachusetts...: switch back to smaller image
test_ci_segmentation_binary.yaml: tile images to 32, not 256
test_ci_segmentation_multiclass.yaml: idem
train_segmentation.py: raise ValueError for empty train or val dataloader

* aoi.py:
- create an AOI object with input validation. The AOI will be the core input for tiling, training and inference, though it is so far only implemented for tiling.
- add stac item support
geoutils.py:
- add utils: is_stac_item, stack_vrts() to create an artificial multi-band raster from single-band files
test_aoi.py: add first test for parsing raster input from 3 types to a single rasterio.RasterDataset object
default.yaml: activate debug functionality for logging
test_ci_segmentation_multiclass.yaml: replace 'modalities' with 'bands' key
test_ci_segmentation_binary.yaml: idem
sample_creation.py: delete
utils.py: remove validation from read_csv() function.

* inference_segmentation.py: remove read_modalities
README.md: start updating

* evaluate_segmentation.py: fix bug (remove read_modalities())
dataset/README.md: add documentation on input data configuration and csv format
README.md: update

* aoi.py:
- add write multiband function for demo and debugging
- move aois_from_csv from sampling_segmentation.py

* aoi.py: remove circular import automatically created by PyCharm

* fix typos and potential bugs introduced by PyCharm's automatic refactoring

* aoi.py: use pre-existing raster validation function
sampling_segmentation.py: move validation to aoi object
utils.py: finish removing AWS bucket feature
verifications.py:
- update all data validation functions

* inference_segmentation.py: remove bucket parameter in list_input_images

* test_aoi.py: use local stac item (prevent timeout error at CI)
remtav authored Jun 6, 2022
1 parent e3e6862 commit 682bdc7
Showing 20 changed files with 640 additions and 784 deletions.
3 changes: 2 additions & 1 deletion GDL.py
@@ -41,7 +41,8 @@ def run_gdl(cfg: DictConfig) -> None:
# check if the mode is chosen
if type(cfg.mode) is DictConfig:
msg = "You need to choose between those modes: {}"
raise logging.critical(msg.format(list(cfg.mode.keys())))
logging.critical(msg.format(list(cfg.mode.keys())))
raise ValueError()

# save all overwritten parameters
logging.info('\nOverwritten parameters in the config: \n' + cfg.general.config_override_dirname)
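For context, the hunk above works around a subtle bug: `logging.critical()` returns `None`, so the original `raise logging.critical(...)` actually raised `TypeError: exceptions must derive from BaseException` instead of a meaningful error. A minimal standalone sketch of the corrected pattern follows (the commit itself raises a bare `ValueError()`; passing the message to the exception, as below, is an optional refinement, and `check_mode` is a hypothetical stand-in for the code in `run_gdl`):

```python
import logging

def check_mode(mode_keys):
    # `mode_keys` stands in for `list(cfg.mode.keys())` in GDL.py.
    # Log first, then raise: logging.critical() returns None, so it
    # must never be the operand of a `raise` statement.
    msg = "You need to choose between those modes: {}"
    logging.critical(msg.format(mode_keys))
    raise ValueError(msg.format(mode_keys))
```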
50 changes: 12 additions & 38 deletions README.md
@@ -3,19 +3,16 @@

## **Overview**

The **geo-deep-learning** project stems from an initiative at NRCan's [CCMEO](https://www.nrcan.gc.ca/earth-sciences/geomatics/10776). Its aim is to allow using Convolutional Neural Networks (CNN) with georeferenced data sets.
The overall learning process comprises three broad stages.
The **geo-deep-learning** project stems from an initiative at NRCan's [CCMEO](https://www.nrcan.gc.ca/earth-sciences/geomatics/10776). Its aim is to allow using Convolutional Neural Networks (CNN) with georeferenced datasets.

### Data preparation
The data preparation phase (sampling) allows creating sub-images that will be used for either training, validation or testing.
The first phase of the process is to determine sub-images (samples) to be used for training, validation and, optionally, test.
Images to be used must be of the geotiff type.
Sample locations in each image must be stored in a GeoPackage.
In geo-deep-learning, the learning process comprises two broad stages: sampling and training, followed by inference, which makes use of a trained model to make new predictions on unseen imagery.

[comment]: <> (> Note: A data analysis module can be found [here]&#40;./utils/data_analysis.py&#41; and the documentation in [`docs/README.md`]&#40;./docs/README.md&#41;. Useful for balancing training data.)
### Data sampling (or [tiling](https://torchgeo.readthedocs.io/en/latest/user/glossary.html#term-tiling))
The data preparation phase creates [chips](https://torchgeo.readthedocs.io/en/latest/user/glossary.html#term-chip) (or patches) that will be used for either training, validation or testing.
The sampling step requires a csv as input with a list of rasters and labels to be used in the subsequent training phase. See [dataset documentation](dataset#input-data).

### Training, along with validation and testing
The training phase is where the neural network learn to use the data prepared in the previous phase to make all the predictions.
The training phase is where the neural network learns to use the data prepared in the previous phase to make all the predictions.
The crux of the learning process is the training phase.

- Samples labeled "*trn*" as per above are used to train the neural network.
@@ -38,18 +35,14 @@ This project comprises a set of commands to be run at a shell command prompt. E
> The system can be used on your workstation or cluster.
## **Installation**
Those steps are for your workstation on Ubuntu 18.04 using miniconda.
Set and activate your python environment with the following commands:
To execute scripts in this project, first create and activate your python environment with the following commands:
```shell
conda env create -f environment.yml
conda activate geo_deep_env
```
> For Windows OS:
> - Install rasterio, fiona and gdal first, before installing the rest. We've experienced some [installation issues](https://github.com/conda-forge/gdal-feedstock/issues/213), with those libraries.
> - Mlflow should be installed using pip rather than conda, as mentioned [here](https://github.com/mlflow/mlflow/issues/1951)
> Tested on Ubuntu 20.04 and Windows 10 using miniconda.
## **Running GDL**
This is an example of how to run GDL with hydra in simple steps with the _**massachusetts buildings**_ dataset in the `/data` folder, for segmentation on buildings:
This is an example of how to run GDL with hydra in simple steps with the _**massachusetts buildings**_ dataset in the `tests/data/` folder, for segmentation on buildings:

1. Clone this github repo.
```shell
@@ -67,15 +60,14 @@ python GDL.py mode=train
python GDL.py mode=inference
```

> This example is running with the default configuration `./config/gdl_config_template.yaml`, for further examples on running options see the [documentation](config/#Examples).
> You will also fund information on how to change the model or add a new one to GDL.
> This example runs with a default configuration `./config/gdl_config_template.yaml`. For further examples on configuration options see the [configuration documentation](config/#Examples).
> If you want to introduce a new task like object detection, you only need to add the code in the main folder and name it `sampling_object_detection.py`, for example.
> The principle is to name the code `{mode}_{task}.py`, as with the existing `sampling_segmentation.py` and `train_segmentation.py`, and `GDL.py` will deal with the rest.
> To run it, you will need to add a new parameter on the command line, `python GDL.py mode=sampling task=object_detection`, or change the parameters inside `./config/gdl_config_template.yaml`.
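The naming convention described above can be sketched as a small dispatch helper, assuming the `{mode}_{task}` ordering used by the scripts in this commit (`sampling_segmentation.py`, `train_segmentation.py`, `inference_segmentation.py`). The function names are hypothetical; `GDL.py`'s actual logic may differ:

```python
import importlib

def resolve_module_name(mode: str, task: str) -> str:
    # e.g. mode="sampling", task="segmentation" -> "sampling_segmentation"
    return f"{mode}_{task}"

def dispatch(mode: str, task: str):
    """Import the script matching the mode/task pair and return the module."""
    return importlib.import_module(resolve_module_name(mode, task))
```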
## **Folder Structure**
We suggest a high level structure to organize the images and the code.
We suggest the following high level structure to organize the images and the code.
```
├── {dataset_name}
└── data
@@ -128,24 +120,6 @@ _**Don't forget to change the path of the dataset in the config yaml.**_

[comment]: <> ( num_gpus: 2)

[comment]: <> ( BGR_to_RGB: False # <-- must be already in RGB)

[comment]: <> ( scale_data: [0,1])

[comment]: <> ( aux_vector_file:)

[comment]: <> ( aux_vector_attrib:)

[comment]: <> ( aux_vector_ids:)

[comment]: <> ( aux_vector_dist_maps:)

[comment]: <> ( aux_vector_dist_log:)

[comment]: <> ( aux_vector_scale:)

[comment]: <> ( debug_mode: True)

[comment]: <> ( # Module to include the NIR)

[comment]: <> ( modalities: RGBN # <-- must be add)
2 changes: 1 addition & 1 deletion config/dataset/test_ci_segmentation_binary.yaml
@@ -10,7 +10,7 @@ dataset:
raw_data_dir: ${general.raw_data_dir}

# imagery
modalities: RGB
bands: [R, G, B]

# ground truth
attribute_field: properties/class
2 changes: 1 addition & 1 deletion config/dataset/test_ci_segmentation_multiclass.yaml
@@ -10,7 +10,7 @@ dataset:
raw_data_dir: ${general.raw_data_dir}

# imagery
modalities: RGB
bands: [R, G, B]

# ground truth
attribute_field: properties/Quatreclasses
4 changes: 2 additions & 2 deletions config/gdl_config_template.yaml
@@ -29,9 +29,9 @@ general:
workspace: your_name
max_epochs: 2 # for train only
min_epochs: 1 # for train only
raw_data_dir: data
raw_data_dir: dataset
raw_data_csv: tests/sampling/sampling_segmentation_binary_ci.csv
sample_data_dir: data # where the hdf5 will be saved
sample_data_dir: dataset # where the hdf5 will be saved
save_weights_dir: saved_model/${general.project_name}

print_config: True # save the config in the log folder
1 change: 1 addition & 0 deletions config/hydra/default.yaml
@@ -4,6 +4,7 @@ run:
sweep:
dir: logs/multiruns/${now:%Y-%m-%d_%H-%M-%S}
subdir: ${hydra.job.num}
verbose: ${debug}

# you can set here environment variables that are universal for all users
# for system specific variables (like data paths) it's better to use .env file!
59 changes: 59 additions & 0 deletions dataset/README.md
@@ -0,0 +1,59 @@
# Input data
The sampling and inference steps require a csv referencing the input data. An example of such a csv can be found in [tests](tests/sampling/sampling_segmentation_binary_ci.csv).
Each row of this csv is considered, in geo-deep-learning terms, to be an [AOI](https://torchgeo.readthedocs.io/en/latest/user/glossary.html#term-area-of-interest-AOI).

| raster path | vector ground truth path | dataset split | aoi id (optional) |
|---------------------------|--------------------------|---------------|-------------------|
| my_dir/my_geoimagery1.tif | my_dir/my_geogt1.gpkg | trn | Ontario-1 |
| my_dir/my_geoimagery2.tif | my_dir/my_geogt2.gpkg | tst | NewBrunswick-23 |
| ... | ... | ... | ... |

> The use of the aoi id information will be implemented in the near future. It will serve, for example, to print a detailed report of sampling, training and evaluation, or to ease debugging.
The path to a custom csv must be entered in the [dataset configuration](https://github.com/NRCan/geo-deep-learning/blob/develop/config/dataset/test_ci_segmentation_binary.yaml#L9). See the [configuration documentation](config/README.md) for more information.
Also check the [suggested folder structure](https://github.com/NRCan/geo-deep-learning#folder-structure).
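As a rough illustration, the csv layout above could be parsed as follows. The names here are hypothetical; the project's actual parser (`aois_from_csv` in aoi.py) also validates each raster and label file, which is omitted in this sketch:

```python
import csv
from typing import List, NamedTuple, Optional

class AoiRow(NamedTuple):
    raster: str         # path to the imagery (multi-band, single-band template, or stac item)
    ground_truth: str   # path to the vector labels (e.g. a GeoPackage)
    split: str          # "trn", "tst" or "inference"
    aoi_id: Optional[str] = None  # optional fourth column

def read_aoi_csv(path: str) -> List[AoiRow]:
    rows = []
    with open(path, newline="") as f:
        for line in csv.reader(f):
            if not line:
                continue
            raster, gt, split = line[:3]                  # three mandatory columns
            aoi_id = line[3] if len(line) > 3 else None   # optional aoi id
            rows.append(AoiRow(raster, gt, split, aoi_id))
    return rows
```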

## Dataset splits
The split in the csv should be either "trn", "tst" or "inference". The validation split is automatically created during sampling. Its proportion is set by the [dataset config](https://github.com/NRCan/geo-deep-learning/blob/develop/config/dataset/test_ci_segmentation_binary.yaml#L8).

## Raster and vector file compatibility
Rasters to be used must be in a format compatible with [rasterio](https://rasterio.readthedocs.io/en/latest/quickstart.html?highlight=supported%20raster%20format#opening-a-dataset-in-reading-mode)/[GDAL](https://gdal.org/drivers/raster/index.html) (ex.: GeoTiff). Similarly, labels (aka annotations) for each image must be stored as polygons in a [Geopandas compatible vector file](https://geopandas.org/en/stable/docs/user_guide/io.html#reading-spatial-data) (ex.: GeoPackage).

## Single-band vs multi-band imagery

To support both single-band and multi-band imagery, the path in the first column of an input csv can be in **one of three formats**:

### 1. Path to a multi-band image file:
`my_dir/my_multiband_geofile.tif`

### 2. Path to single-band image files, using only a common string
A path to a list of single-band rasters can be inserted in the csv using only the string common to all the single-band files.
The band-specific part of the file name must be written in a [hydra-like interpolation format](https://hydra.cc/docs/1.0/advanced/override_grammar/basic/#primitives), i.e. with the `${...}` notation. The interpolation string is completed during execution by a dataset parameter listing the desired band identifiers, which resolves the single-band filenames.

#### Example:

In [dataset config](../config/dataset/test_ci_segmentation_binary.yaml):

`bands: [R, G, B]`

In [input csv](../tests/sampling/sampling_segmentation_binary_ci.csv):

| raster path | ground truth path | dataset split |
|------------------------------------------------------------|-------------------|---------------|
| my_dir/my_singleband_geofile_band_**${dataset.bands}**.tif | gt.gpkg | trn |

During execution, this would result in using, **in the same order as bands appear in dataset config**, the following files:
`my_dir/my_singleband_geofile_band_R.tif`
`my_dir/my_singleband_geofile_band_G.tif`
`my_dir/my_singleband_geofile_band_B.tif`

> To simplify the use of both single-band and multi-band rasters through a unique input pipeline, single-band files are artificially merged as a [virtual raster](https://gdal.org/drivers/raster/vrt.html).
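A minimal sketch of this interpolation step (the function name is hypothetical; the project's own resolution logic, together with `stack_vrts()` in geoutils.py for the VRT merge, may differ):

```python
def resolve_single_band_paths(template: str, bands: list) -> list:
    """Replace the ${dataset.bands} placeholder with each band identifier,
    preserving the band order given in the dataset config."""
    placeholder = "${dataset.bands}"
    if placeholder not in template:
        # Multi-band file or stac item: nothing to interpolate.
        return [template]
    return [template.replace(placeholder, band) for band in bands]
```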
### 3. Path to a Stac Item
> Only Stac Items referencing **single-band assets** are supported currently. See [our Worldview-2 example](https://datacube-stage.services.geo.ca/api/collections/spacenet-samples/items/SpaceNet_AOI_2_Las_Vegas-056155973080_01_P001-WV03).
Bands must be selected by [common name](https://github.com/stac-extensions/eo/#common-band-names) in dataset config:
`bands: ["red", "green", "blue"]`

> Order matters: `["red", "green", "blue"]` is not equal to `["blue", "green", "red"]` !
