Commit

Merge branch 'develop' of https://github.com/NRCan/geo-deep-learning
mpelchat04 committed Oct 7, 2022
2 parents 092d06b + cdea293 commit da034c9
Showing 32 changed files with 339 additions and 192 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -53,7 +53,7 @@ cd geo-deep-learning
2. Run the desired script (for segmentation).
```shell
# Creating the hdf5 from the raw data
-python GDL.py mode=sampling
+python GDL.py mode=tiling
# Training the neural network
python GDL.py mode=train
# Inference on the data
11 changes: 5 additions & 6 deletions config/README.md
@@ -65,7 +65,7 @@ The **_tracker section_** is set to `None` by default, but will still log the in
If you want to set a tracker you can change the value in the config file or add the tracker parameter at execution time via the command line `python GDL.py tracker=mlflow mode=train`.

The **_inference section_** contains the information to execute the inference job (more options will follow soon).
-This part doesn't need to be filled if you want to launch sampling, train or hyperparameters search mode only.
+This part doesn't need to be filled if you want to launch tiling, train or hyperparameters search mode only.

The **_task section_** manages the executing task. `Segmentation` is the default task since it's the primary task of GDL.
However, the goal will be to add tasks as need be. The `GDL.py` code simply executes the main function from the `task_mode.py` in the main folder of GDL.
@@ -83,7 +83,7 @@ general:
max_epochs: 2 # for train only
min_epochs: 1 # for train only
raw_data_dir: data
-raw_data_csv: tests/sampling/sampling_segmentation_binary_ci.csv
+raw_data_csv: tests/tiling/tiling_segmentation_binary_ci.csv
sample_data_dir: data # where the hdf5 will be saved
state_dict_path:
save_weights_dir: saved_model/${general.project_name}
@@ -95,10 +95,10 @@ If `True`, will save the config in the log folder.

#### Mode Section
```YAML
-mode: {sampling, train, inference, evaluate, hyperparameters_search}
+mode: {tiling, train, inference, evaluate, hyperparameters_search}
```
-**GDL** has five modes: sampling, train, evaluate, inference and hyperparameters search.
+**GDL** has five modes: tiling, train, evaluate, inference and hyperparameters search.
-- *sampling*, generates `hdf5` files from a folder containing folders for each individual image with their ground truth.
+- *tiling*, generates .geotiff and .geojson [chips](https://torchgeo.readthedocs.io/en/latest/user/glossary.html#term-chip) from each source aoi (image & ground truth).
- *train*, will train the model specified with all the parameters in `training`, `trainer`, `optimizer`, `callbacks` and `scheduler`. The outcome will be `.pth` weights.
- *evaluate*, this function needs to be filled with images, their ground truth and a weight for the model. At the end of the evaluation you will obtain statistics on those images.
- *inference*, unlike the evaluation, the inference doesn't need a ground truth. The inference will produce a prediction on the content of the images fed to the model. Depending on the task, the outcome file will differ.
@@ -148,4 +148,3 @@ new:
$ python GDL.py --config-name=/path/to/new/gdl_config.yaml mode=train
```


4 changes: 2 additions & 2 deletions config/dataset/README.md
@@ -5,7 +5,7 @@
### Input dimensions and overlap

These parameters respectively set the width and length of a single sample and stride from one sample to another as
-outputted by sampling_segmentation.py. Default to 256 and 0, respectively.
+outputted by tiling_segmentation.py. Default to 256 and 0, respectively.
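As a rough illustration of how chip size and overlap interact, here is a minimal sketch of how many chips one raster axis yields under a sliding window. This is not GDL's actual implementation; the handling of partial chips at raster borders may differ.

```python
def n_chips(raster_px: int, chip_size: int = 256, overlap: int = 0) -> int:
    """Number of full chips along one axis for a hypothetical sliding window."""
    stride = chip_size - overlap  # step between consecutive chip origins
    return max(0, (raster_px - chip_size) // stride + 1)

print(n_chips(1024))            # 4 chips along one axis with the defaults (256, 0)
print(n_chips(1024, 256, 128))  # 7 chips with 50% overlap
```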

### Train/validation percentage

@@ -31,7 +31,7 @@ For more information on the concept of stratified sampling, see [this Medium art

### Modalities

-Bands to be selected during the sampling process. Order matters (ie "BGR" is not equal to "RGB").
+Bands to be selected during the tiling process. Order matters (ie "BGR" is not equal to "RGB").
The use of this feature for band selection is a work in progress. It currently serves to indicate how many bands are in
source imagery.

9 changes: 2 additions & 7 deletions config/dataset/test_ci_segmentation_binary.yaml
@@ -2,26 +2,21 @@
dataset:
# dataset-wide
name:
input_dim: 32
overlap:
use_stratification: False
train_val_percent: {'trn':0.7, 'val':0.3, 'tst':0}
raw_data_csv: ${general.raw_data_csv}
raw_data_dir: ${general.raw_data_dir}
download_data: False

# imagery
-bands: [R, G, B]
+bands: [1,2,3]

# ground truth
attribute_field: properties/class
attribute_values: [1]
min_annotated_percent:
class_name: # will follow in the next version
classes_dict: {'BUIL':1}
class_weights:
ignore_index: -1

# outputs
-sample_data_dir: ${general.sample_data_dir}
+tiling_data_dir: ${general.tiling_data_dir}

9 changes: 2 additions & 7 deletions config/dataset/test_ci_segmentation_binary_stac.yaml
@@ -2,11 +2,7 @@
dataset:
# dataset-wide
name:
input_dim: 32
overlap:
use_stratification: False
train_val_percent: {'trn':0.7, 'val':0.3, 'tst':0}
-raw_data_csv: tests/sampling/sampling_segmentation_binary-stac_ci.csv
+raw_data_csv: tests/tiling/tiling_segmentation_binary-stac_ci.csv
raw_data_dir: ${general.raw_data_dir}
download_data: False

@@ -16,12 +12,11 @@ dataset:
# ground truth
attribute_field:
attribute_values:
min_annotated_percent:
class_name: # will follow in the next version
classes_dict: {'BUIL':1}
class_weights:
ignore_index: -1

# outputs
-sample_data_dir: ${general.sample_data_dir}
+tiling_data_dir: ${general.tiling_data_dir}

11 changes: 3 additions & 8 deletions config/dataset/test_ci_segmentation_multiclass.yaml
@@ -2,26 +2,21 @@
dataset:
# dataset-wide
name:
input_dim: 32
overlap:
use_stratification: False
train_val_percent: {'trn':0.7, 'val':0.3, 'tst':0}
-raw_data_csv: tests/sampling/sampling_segmentation_multiclass_ci.csv
+raw_data_csv: tests/tiling/tiling_segmentation_multiclass_ci.csv
raw_data_dir: ${general.raw_data_dir}
download_data: False

# imagery
-bands: [R, G, B]
+bands: [1,2,3]

# ground truth
attribute_field: properties/Quatreclasses
attribute_values: [1,2,3,4]
min_annotated_percent:
class_name: # will follow in the next version
classes_dict: {'WAER':1, 'FORE':2, 'ROAI':3, 'BUIL':4}
class_weights:
ignore_index: 255

# outputs
-sample_data_dir: ${general.sample_data_dir}
+tiling_data_dir: ${general.tiling_data_dir}

7 changes: 4 additions & 3 deletions config/gdl_config_template.yaml
@@ -1,6 +1,7 @@
defaults:
- model: gdl_unet
- verify: default_verify
- tiling: default_tiling
- training: default_training
- loss: binary/softbce
- optimizer: adamw
Expand Down Expand Up @@ -31,10 +32,10 @@ general:
max_epochs: 2 # for train only
min_epochs: 1 # for train only
raw_data_dir: dataset
-raw_data_csv: tests/sampling/sampling_segmentation_binary_ci.csv
-sample_data_dir: dataset # where the hdf5 will be saved
+raw_data_csv: tests/tiling/tiling_segmentation_binary_ci.csv
+tiling_data_dir: dataset # where the hdf5 will be saved
save_weights_dir: saved_model/${general.project_name}

print_config: True # save the config in the log folder
-mode: {verify, sampling, train, inference, evaluate}
+mode: {verify, tiling, train, inference, evaluate}
debug: True #False # will print the complete yaml config plus run a validation test
8 changes: 8 additions & 0 deletions config/tiling/default_tiling.yaml
@@ -0,0 +1,8 @@
# @package _global_
tiling:
tiling_data_dir: ${general.tiling_data_dir}
train_val_percent: {'trn':0.7, 'val':0.3, 'tst':0}
chip_size: 32
overlap_size:
min_annot_perc: 1
use_stratification: False
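The `train_val_percent` mapping above distributes tiled chips between the `trn`, `val` and `tst` datasets. Below is a hedged, stdlib-only sketch of what such a split amounts to; GDL's actual assignment logic (including the `use_stratification` option) lives in the tiling code and may differ.

```python
import random

def split_chips(chip_ids, percents, seed=0):
    """Randomly assign chip ids to named splits according to fractional percents."""
    rng = random.Random(seed)
    shuffled = rng.sample(chip_ids, k=len(chip_ids))
    out, start = {}, 0
    for name, frac in percents.items():
        end = start + round(frac * len(shuffled))
        out[name] = shuffled[start:end]
        start = end
    return out

splits = split_chips(list(range(10)), {'trn': 0.7, 'val': 0.3, 'tst': 0})
print({k: len(v) for k, v in splits.items()})  # {'trn': 7, 'val': 3, 'tst': 0}
```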
10 changes: 10 additions & 0 deletions dataset/README.md
@@ -27,6 +27,16 @@ To support both single-band and multi-band imagery, the path in the first column
### 1. Path to a multi-band image file:
`my_dir/my_multiband_geofile.tif`

A particular order or subset of bands in a multi-band file can be selected by setting a list of band indices:

#### Example:

`bands: [3, 2, 1]`

Here, if the original multi-band raster had BGR bands, geo-deep-learning will reorder these bands to RGB order.

The `bands` parameter is set in the [dataset config](../config/dataset/test_ci_segmentation_multiclass.yaml).
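A toy sketch of the reordering semantics, assuming 1-based band indices as rasterio uses them (the real subsetting is delegated to `subset_multiband_vrt` in `dataset/aoi.py`):

```python
# Hypothetical in-memory "raster": one entry per band, stored in BGR order.
raster_bands = {1: "blue_data", 2: "green_data", 3: "red_data"}  # 1-based indices

bands = [3, 2, 1]  # as in the dataset config: bands: [3, 2, 1]
reordered = [raster_bands[i] for i in bands]
print(reordered)  # ['red_data', 'green_data', 'blue_data'] -> RGB order
```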

### 2. Path to single-band image files, using only a common string
A path to a list of single-band rasters can be inserted in the csv, but only the string common to all single-band files should be used.
The "band specific" string in the file name must be in a [hydra-like interpolation format](https://hydra.cc/docs/1.0/advanced/override_grammar/basic/#primitives), with `${...}` notation. The interpolation string is completed during execution by a dataset parameter containing the list of desired band identifiers, which resolves the single-band filenames.
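Concretely, the resolution boils down to one string replacement per requested band, mirroring the replace loop in `dataset/aoi.py::parse_input_raster` (the filename below is hypothetical):

```python
csv_raster_str = "my_dir/my_singleband_geofile_band_${dataset.bands}.tif"  # hypothetical csv entry
bands_requested = ["R", "G", "B"]  # dataset parameter listing band identifiers

# Same pattern as the replace loop in parse_input_raster
rasters = [csv_raster_str.replace("${dataset.bands}", band) for band in bands_requested]
print(rasters[0])  # my_dir/my_singleband_geofile_band_R.tif
```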
45 changes: 31 additions & 14 deletions dataset/aoi.py
@@ -18,7 +18,7 @@
from torchvision.datasets.utils import download_url
from tqdm import tqdm

from utils.geoutils import stack_singlebands_vrt, is_stac_item, create_new_raster_from_base
from utils.geoutils import stack_singlebands_vrt, is_stac_item, create_new_raster_from_base, subset_multiband_vrt
from utils.logger import get_logger
from utils.utils import read_csv
from utils.verifications import assert_crs_match, validate_raster, \
@@ -51,8 +51,8 @@ def __init__(
self.item = item
self._assets_by_common_name = None

-if bands_requested is not None and len(bands_requested) == 0:
-logging.warning(f"At least one band should be chosen if assets need to be reached")
+if not bands_requested:
+raise ValueError(f"At least one band should be chosen if assets need to be reached")

# Create band inventory (all available bands)
self.bands_all = [band for band in self.asset_by_common_name.keys()]
@@ -183,7 +183,10 @@ def __init__(self, raster: Union[Path, str],
self.raster_stac_item = None

# If parsed result has more than a single file, then we're dealing with single-band files
-self.raster_src_is_multiband = True if len(raster_parsed) == 1 else False
+if len(raster_parsed) == 1 and rasterio.open(raster_parsed[0]).count > 1:
+self.raster_src_is_multiband = True
+else:
+self.raster_src_is_multiband = False

# Download assets if desired
self.download_data = download_data
@@ -203,8 +206,8 @@ def __init__(self, raster: Union[Path, str],
self.raster_parsed = raster_parsed

# if single band assets, build multiband VRT
-self.raster_to_multiband(virtual=True)
-self.raster_read()
+self.src_raster_to_dest_multiband(virtual=True)
+self.raster_open()
self.raster_meta = self.raster.meta
self.raster_meta['name'] = self.raster.name
if self.raster_src_is_multiband:
@@ -297,8 +300,8 @@ def __init__(self, raster: Union[Path, str],
)
if len(self.label_gdf_filtered) == 0:
logging.warning(f"\nNo features found for ground truth \"{self.label}\","
f"\nfiltered by attribute field \"{self.attr_field_filter}\""
f"\nwith values \"{self.attr_values_filter}\"")
f"\nfiltered by attribute field \"{self.attr_field_filter}\""
f"\nwith values \"{self.attr_values_filter}\"")
else:
self.label_gdf_filtered = None

@@ -347,7 +350,6 @@ def from_dict(cls,
)
return new_aoi

# TODO: is this necessary if to_dict() is good enough?
def __str__(self):
return (
f"\nAOI ID: {self.aoi_id}"
@@ -359,16 +361,29 @@ def __str__(self):
f"\n\tAttribute values filter: {self.attr_values_filter}"
)

-def raster_to_multiband(self, virtual=True):
+def src_raster_to_dest_multiband(self, virtual=True):
"""
Outputs a multiband raster from multiple sources of input raster
E.g.: multiple singleband files, single multiband file with undesired bands, etc.
"""
if not self.raster_src_is_multiband:
if virtual:
self.raster_multiband = stack_singlebands_vrt(self.raster_parsed)
else:
self.raster_multiband = self.write_multiband_from_singleband_rasters_as_vrt()
elif self.raster_src_is_multiband and self.raster_bands_request:
if not all([isinstance(band, int) for band in self.raster_bands_request]):
raise ValueError(f"Use only a list of integers to select bands from a multiband raster.\n"
f"Got {self.raster_bands_request}")
if len(self.raster_bands_request) > rasterio.open(self.raster_raw_input).count:
raise ValueError(f"Trying to subset more bands than actual number in source raster.\n"
f"Requested: {self.raster_bands_request}\n"
f"Available: {rasterio.open(self.raster_raw_input).count}")
self.raster_multiband = subset_multiband_vrt(self.raster_parsed[0], band_request=self.raster_bands_request)
else:
self.raster_multiband = self.raster_parsed[0]

-def raster_read(self):
+def raster_open(self):
self.raster = _check_rasterio_im_load(self.raster_multiband)

def to_dict(self, extended=True):
@@ -509,8 +524,10 @@ def parse_input_raster(
raster = [value['meta'].href for value in item.bands_requested.values()]
return raster
elif "${dataset.bands}" in csv_raster_str:
-if not isinstance(raster_bands_requested, (List, ListConfig, tuple)) or len(raster_bands_requested) == 0:
-raise TypeError(f"\nRequested bands should a list of bands. "
+if not raster_bands_requested \
+or not isinstance(raster_bands_requested, (List, ListConfig, tuple)) \
+or len(raster_bands_requested) == 0:
+raise TypeError(f"\nRequested bands should be a list of bands. "
f"\nGot {raster_bands_requested} of type {type(raster_bands_requested)}")
raster = [csv_raster_str.replace("${dataset.bands}", band) for band in raster_bands_requested]
return raster
@@ -593,7 +610,7 @@ def aois_from_csv(
@param csv_path:
path to csv file containing list of input data. See README for details on expected structure of csv.
@param bands_requested:
-List of bands to select from inputted imagery. Applies only to single-band input imagery.
+List of bands to select from inputted imagery
@param attr_values_filter:
Attribute filed to filter features from
@param attr_field_filter:
2 changes: 1 addition & 1 deletion environment.yml
@@ -7,7 +7,7 @@ dependencies:
- docker-py>=4.4.4
- geopandas>=0.10.2
- h5py>=3.7
-  - hydra-core>=1.1.0
+  - hydra-core>=1.2.0
- pip
- pystac>=0.3.0
- pytest>=7.1
5 changes: 0 additions & 5 deletions inference_segmentation.py
@@ -138,7 +138,6 @@ def segmentation(param,
chunk_size: int,
device,
scale: List,
-BGR_to_RGB: bool,
tp_mem,
debug=False,
):
@@ -152,7 +151,6 @@ def segmentation(param,
chunk_size: image tile size
device: cuda/cpu device
scale: scale range
-BGR_to_RGB: True/False
tp_mem: memory temp file for saving numpy array to disk
debug: True/False
@@ -192,7 +190,6 @@ def segmentation(param,
sample['metadata'] = image_metadata
totensor_transform = augmentation.compose_transforms(param,
dataset="tst",
-input_space=BGR_to_RGB,
scale=scale,
aug_type='totensor',
print_log=print_log)
@@ -341,7 +338,6 @@ def main(params: Union[DictConfig, dict]) -> None:
# Default input directory based on default output directory
raw_data_csv = get_key_def('raw_data_csv', params['inference'], default=working_folder,
expected_type=str, to_path=True, validate_path_exists=True)
-BGR_to_RGB = get_key_def('BGR_to_RGB', params['dataset'], expected_type=bool)

# LOGGING PARAMETERS
exper_name = get_key_def('project_name', params['general'], default='gdl-training')
@@ -403,7 +399,6 @@ def main(params: Union[DictConfig, dict]) -> None:
chunk_size=chunk_size,
device=device,
scale=scale,
-BGR_to_RGB=BGR_to_RGB,
tp_mem=temp_file,
debug=debug)

