Common input pipeline for single- and multi-band imagery and AOI object for input data (#309)

* update csv to expect 3 mandatory columns and one optional. See comments in issue #221

* use inference data for binary segmentation in tests/, not data/

* environment.yml: hardcode setuptools version because of pytorch bug

* environment.yml: set correct subversion to setuptools

* environment.yml: move setuptools from conda section to pip

* sampling_segmentation.py: implement AOI class
verifications.py: update assert_crs_match function, add validate functions for rasters and vector files

* remove support for AWS bucket via boto3

* finish draft of sampling with AOI objects (with basic validation), rather than from raw csv lines

* environment.yml: fix and update

* environment.yml: add issue link for setuptools

* environment.yml: add issue link for setuptools

* environment.yml: fix and update

* environment.yml: add issue link for setuptools

* environment.yml: add issue link for setuptools

* sampling_segmentation.py: implement AOI class
verifications.py: update assert_crs_match function, add validate functions for rasters and vector files

* finish draft of sampling with AOI objects (with basic validation), rather than from raw csv lines

* train_segmentation.py: add warning for debugging and skip save checkpoint if val loss is None

* tests/data/massachusetts: restore larger format to prevent val_loss=None

* tests/data/massachusetts...: switch back to smaller image
test_ci_segmentation_binary.yaml: tile images to 32, not 256
test_ci_segmentation_multiclass.yaml: idem
train_segmentation.py: raise ValueError for empty train or val dataloader

* aoi.py:
- create an AOI object with input validation. The AOI will be the core input for tiling, training and inference, though it is so far only implemented for tiling.
- add stac item support
geoutils.py:
- add utils: is_stac_item, stack_vrts() to create an artificial multi-band raster from single-band files
test_aoi.py: add first test for parsing raster input from 3 types to a single rasterio.RasterDataset object
default.yaml: activate debug functionality for logging
test_ci_segmentation_multiclass.yaml: replace 'modalities' with 'bands' key
test_ci_segmentation_binary.yaml: idem
sample_creation.py: delete
utils.py: remove validation from read_csv() function.

* inference_segmentation.py: remove read_modalities
README.md: start updating

* evaluate_segmentation.py: fix bug (remove read_modalities())
dataset/README.md: add documentation on input data configuration and csv format
README.md: update

* aoi.py:
- add write multiband function for demo and debugging
- move aois_from_csv from sampling_segmentation.py

* aoi.py: remove circular import automatically created by PyCharm

* fix typos and potential bugs introduced by PyCharm's automatic refactoring

* aoi.py: use pre-existing raster validation function
sampling_segmentation.py: move validation to aoi object
utils.py: finish removing AWS bucket feature
verifications.py:
- update all data validation functions

* inference_segmentation.py: remove bucket parameter in list_input_images

* test_aoi.py: use local stac item (prevent timeout error at CI)
remtav authored Jun 6, 2022
1 parent e3e6862 commit 682bdc7
Showing 20 changed files with 640 additions and 784 deletions.
3 changes: 2 additions & 1 deletion GDL.py
@@ -41,7 +41,8 @@ def run_gdl(cfg: DictConfig) -> None:
# check if the mode is chosen
if type(cfg.mode) is DictConfig:
msg = "You need to choose between those modes: {}"
raise logging.critical(msg.format(list(cfg.mode.keys())))
logging.critical(msg.format(list(cfg.mode.keys())))
raise ValueError()

# save all overwritten parameters
logging.info('\nOverwritten parameters in the config: \n' + cfg.general.config_override_dirname)
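For context, the hunk above works around a subtle bug: `logging.critical()` returns `None`, so the original `raise logging.critical(...)` actually raised `TypeError: exceptions must derive from BaseException` instead of a meaningful error. A minimal standalone sketch of the corrected pattern follows (the commit itself raises a bare `ValueError()`; passing the message to the exception, as below, is an optional refinement, and `check_mode` is a hypothetical stand-in for the code in `run_gdl`):

```python
import logging

def check_mode(mode_keys):
    # `mode_keys` stands in for `list(cfg.mode.keys())` in GDL.py.
    # Log first, then raise: logging.critical() returns None, so it
    # must never be the operand of a `raise` statement.
    msg = "You need to choose between those modes: {}"
    logging.critical(msg.format(mode_keys))
    raise ValueError(msg.format(mode_keys))
```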
50 changes: 12 additions & 38 deletions README.md
@@ -3,19 +3,16 @@

## **Overview**

The **geo-deep-learning** project stems from an initiative at NRCan's [CCMEO](https://www.nrcan.gc.ca/earth-sciences/geomatics/10776). Its aim is to allow using Convolutional Neural Networks (CNN) with georeferenced data sets.
The overall learning process comprises three broad stages.
The **geo-deep-learning** project stems from an initiative at NRCan's [CCMEO](https://www.nrcan.gc.ca/earth-sciences/geomatics/10776). Its aim is to allow using Convolutional Neural Networks (CNN) with georeferenced datasets.

### Data preparation
The data preparation phase (sampling) allows creating sub-images that will be used for either training, validation or testing.
The first phase of the process is to determine sub-images (samples) to be used for training, validation and, optionally, test.
Images to be used must be of the geotiff type.
Sample locations in each image must be stored in a GeoPackage.
In geo-deep-learning, the learning process comprises two broad stages: sampling and training, followed by inference, which makes use of a trained model to make new predictions on unseen imagery.

[comment]: <> (> Note: A data analysis module can be found [here]&#40;./utils/data_analysis.py&#41; and the documentation in [`docs/README.md`]&#40;./docs/README.md&#41;. Useful for balancing training data.)
### Data sampling (or [tiling](https://torchgeo.readthedocs.io/en/latest/user/glossary.html#term-tiling))
The data preparation phase creates [chips](https://torchgeo.readthedocs.io/en/latest/user/glossary.html#term-chip) (or patches) that will be used for either training, validation or testing.
The sampling step requires a csv as input with a list of rasters and labels to be used in the subsequent training phase. See [dataset documentation](dataset#input-data).

### Training, along with validation and testing
The training phase is where the neural network learn to use the data prepared in the previous phase to make all the predictions.
The training phase is where the neural network learns to use the data prepared in the previous phase to make all the predictions.
The crux of the learning process is the training phase.

- Samples labeled "*trn*" as per above are used to train the neural network.
@@ -38,18 +35,14 @@ This project comprises a set of commands to be run at a shell command prompt. E
> The system can be used on your workstation or cluster.
## **Installation**
Those steps are for your workstation on Ubuntu 18.04 using miniconda.
Set and activate your python environment with the following commands:
To execute scripts in this project, first create and activate your python environment with the following commands:
```shell
conda env create -f environment.yml
conda activate geo_deep_env
```
> For Windows OS:
> - Install rasterio, fiona and gdal first, before installing the rest. We've experienced some [installation issues](https://github.com/conda-forge/gdal-feedstock/issues/213), with those libraries.
> - Mlflow should be installed using pip rather than conda, as mentioned [here](https://github.com/mlflow/mlflow/issues/1951)
> Tested on Ubuntu 20.04 and Windows 10 using miniconda.
## **Running GDL**
This is an example of how to run GDL with hydra in simple steps with the _**massachusetts buildings**_ dataset in the `/data` folder, for segmentation on buildings:
This is an example of how to run GDL with hydra in simple steps with the _**massachusetts buildings**_ dataset in the `tests/data/` folder, for segmentation on buildings:

1. Clone this github repo.
```shell
@@ -67,15 +60,14 @@ python GDL.py mode=train
python GDL.py mode=inference
```

> This example is running with the default configuration `./config/gdl_config_template.yaml`, for further examples on running options see the [documentation](config/#Examples).
> You will also fund information on how to change the model or add a new one to GDL.
> This example runs with a default configuration `./config/gdl_config_template.yaml`. For further examples on configuration options see the [configuration documentation](config/#Examples).
> If you want to introduce a new task like object detection, you only need to add the code in the main folder and name it `sampling_object_detection.py`, for example.
> The principle is to name the code `{mode}_{task}.py`, as with the existing `sampling_segmentation.py` and `train_segmentation.py`, and `GDL.py` will deal with the rest.
> To run it, you will need to add a new parameter on the command line, `python GDL.py mode=sampling task=object_detection`, or change the parameters inside `./config/gdl_config_template.yaml`.
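The naming convention described above can be sketched as a small dispatch helper, assuming the `{mode}_{task}` ordering used by the scripts in this commit (`sampling_segmentation.py`, `train_segmentation.py`, `inference_segmentation.py`). The function names are hypothetical; `GDL.py`'s actual logic may differ:

```python
import importlib

def resolve_module_name(mode: str, task: str) -> str:
    # e.g. mode="sampling", task="segmentation" -> "sampling_segmentation"
    return f"{mode}_{task}"

def dispatch(mode: str, task: str):
    """Import the script matching the mode/task pair and return the module."""
    return importlib.import_module(resolve_module_name(mode, task))
```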
## **Folder Structure**
We suggest a high level structure to organize the images and the code.
We suggest the following high level structure to organize the images and the code.
```
├── {dataset_name}
└── data
@@ -128,24 +120,6 @@ _**Don't forget to change the path of the dataset in the config yaml.**_

[comment]: <> ( num_gpus: 2)

[comment]: <> ( BGR_to_RGB: False # <-- must be already in RGB)

[comment]: <> ( scale_data: [0,1])

[comment]: <> ( aux_vector_file:)

[comment]: <> ( aux_vector_attrib:)

[comment]: <> ( aux_vector_ids:)

[comment]: <> ( aux_vector_dist_maps:)

[comment]: <> ( aux_vector_dist_log:)

[comment]: <> ( aux_vector_scale:)

[comment]: <> ( debug_mode: True)

[comment]: <> ( # Module to include the NIR)

[comment]: <> ( modalities: RGBN # <-- must be add)
2 changes: 1 addition & 1 deletion config/dataset/test_ci_segmentation_binary.yaml
@@ -10,7 +10,7 @@ dataset:
raw_data_dir: ${general.raw_data_dir}

# imagery
modalities: RGB
bands: [R, G, B]

# ground truth
attribute_field: properties/class
2 changes: 1 addition & 1 deletion config/dataset/test_ci_segmentation_multiclass.yaml
@@ -10,7 +10,7 @@ dataset:
raw_data_dir: ${general.raw_data_dir}

# imagery
modalities: RGB
bands: [R, G, B]

# ground truth
attribute_field: properties/Quatreclasses
4 changes: 2 additions & 2 deletions config/gdl_config_template.yaml
@@ -29,9 +29,9 @@ general:
workspace: your_name
max_epochs: 2 # for train only
min_epochs: 1 # for train only
raw_data_dir: data
raw_data_dir: dataset
raw_data_csv: tests/sampling/sampling_segmentation_binary_ci.csv
sample_data_dir: data # where the hdf5 will be saved
sample_data_dir: dataset # where the hdf5 will be saved
save_weights_dir: saved_model/${general.project_name}

print_config: True # save the config in the log folder
1 change: 1 addition & 0 deletions config/hydra/default.yaml
@@ -4,6 +4,7 @@ run:
sweep:
dir: logs/multiruns/${now:%Y-%m-%d_%H-%M-%S}
subdir: ${hydra.job.num}
verbose: ${debug}

# you can set here environment variables that are universal for all users
# for system specific variables (like data paths) it's better to use .env file!
59 changes: 59 additions & 0 deletions dataset/README.md
@@ -0,0 +1,59 @@
# Input data
The sampling and inference steps require a csv referencing the input data. An example of such a csv can be found in [tests](tests/sampling/sampling_segmentation_binary_ci.csv).
Each row of this csv is considered, in geo-deep-learning terms, to be an [AOI](https://torchgeo.readthedocs.io/en/latest/user/glossary.html#term-area-of-interest-AOI).

| raster path | vector ground truth path | dataset split | aoi id (optional) |
|---------------------------|--------------------------|---------------|-------------------|
| my_dir/my_geoimagery1.tif | my_dir/my_geogt1.gpkg | trn | Ontario-1 |
| my_dir/my_geoimagery2.tif | my_dir/my_geogt2.gpkg | tst | NewBrunswick-23 |
| ... | ... | ... | ... |

> The use of the aoi id information will be implemented in the near future. It will serve, for example, to print a detailed report of sampling, training and evaluation, or to ease debugging.
The path to a custom csv must be entered in the [dataset configuration](https://github.com/NRCan/geo-deep-learning/blob/develop/config/dataset/test_ci_segmentation_binary.yaml#L9). See the [configuration documentation](config/README.md) for more information.
Also check the [suggested folder structure](https://github.com/NRCan/geo-deep-learning#folder-structure).
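As a rough illustration, the csv layout above could be parsed as follows. The names here are hypothetical; the project's actual parser (`aois_from_csv` in aoi.py) also validates each raster and label file, which is omitted in this sketch:

```python
import csv
from typing import List, NamedTuple, Optional

class AoiRow(NamedTuple):
    raster: str         # path to the imagery (multi-band, single-band template, or stac item)
    ground_truth: str   # path to the vector labels (e.g. a GeoPackage)
    split: str          # "trn", "tst" or "inference"
    aoi_id: Optional[str] = None  # optional fourth column

def read_aoi_csv(path: str) -> List[AoiRow]:
    rows = []
    with open(path, newline="") as f:
        for line in csv.reader(f):
            if not line:
                continue
            raster, gt, split = line[:3]                  # three mandatory columns
            aoi_id = line[3] if len(line) > 3 else None   # optional aoi id
            rows.append(AoiRow(raster, gt, split, aoi_id))
    return rows
```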

## Dataset splits
The split in the csv should be either "trn", "tst" or "inference". The validation split is automatically created during sampling. Its proportion is set by the [dataset config](https://github.com/NRCan/geo-deep-learning/blob/develop/config/dataset/test_ci_segmentation_binary.yaml#L8).

## Raster and vector file compatibility
Rasters to be used must be in a format compatible with [rasterio](https://rasterio.readthedocs.io/en/latest/quickstart.html?highlight=supported%20raster%20format#opening-a-dataset-in-reading-mode)/[GDAL](https://gdal.org/drivers/raster/index.html) (ex.: GeoTiff). Similarly, labels (aka annotations) for each image must be stored as polygons in a [Geopandas compatible vector file](https://geopandas.org/en/stable/docs/user_guide/io.html#reading-spatial-data) (ex.: GeoPackage).

## Single-band vs multi-band imagery

To support both single-band and multi-band imagery, the path in the first column of an input csv can be in **one of three formats**:

### 1. Path to a multi-band image file:
`my_dir/my_multiband_geofile.tif`

### 2. Path to single-band image files, using only a common string
A path to a list of single-band rasters can be inserted in the csv using only the string common to all the single-band files.
The band-specific part of the file name must be written in a [hydra-like interpolation format](https://hydra.cc/docs/1.0/advanced/override_grammar/basic/#primitives), i.e. with the `${...}` notation. The interpolation string is completed during execution by a dataset parameter listing the desired band identifiers, which resolves the single-band filenames.

#### Example:

In [dataset config](../config/dataset/test_ci_segmentation_binary.yaml):

`bands: [R, G, B]`

In [input csv](../tests/sampling/sampling_segmentation_binary_ci.csv):

| raster path | ground truth path | dataset split |
|------------------------------------------------------------|-------------------|---------------|
| my_dir/my_singleband_geofile_band_**${dataset.bands}**.tif | gt.gpkg | trn |

During execution, this would result in using, **in the same order as bands appear in dataset config**, the following files:
`my_dir/my_singleband_geofile_band_R.tif`
`my_dir/my_singleband_geofile_band_G.tif`
`my_dir/my_singleband_geofile_band_B.tif`

> To simplify the use of both single-band and multi-band rasters through a unique input pipeline, single-band files are artificially merged as a [virtual raster](https://gdal.org/drivers/raster/vrt.html).
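A minimal sketch of this interpolation step (the function name is hypothetical; the project's own resolution logic, together with `stack_vrts()` in geoutils.py for the VRT merge, may differ):

```python
def resolve_single_band_paths(template: str, bands: list) -> list:
    """Replace the ${dataset.bands} placeholder with each band identifier,
    preserving the band order given in the dataset config."""
    placeholder = "${dataset.bands}"
    if placeholder not in template:
        # Multi-band file or stac item: nothing to interpolate.
        return [template]
    return [template.replace(placeholder, band) for band in bands]
```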
### 3. Path to a Stac Item
> Only Stac Items referencing **single-band assets** are supported currently. See [our Worldview-2 example](https://datacube-stage.services.geo.ca/api/collections/spacenet-samples/items/SpaceNet_AOI_2_Las_Vegas-056155973080_01_P001-WV03).
Bands must be selected by [common name](https://github.com/stac-extensions/eo/#common-band-names) in dataset config:
`bands: ["red", "green", "blue"]`

> Order matters: `["red", "green", "blue"]` is not equal to `["blue", "green", "red"]` !
