MOSAIKS is a Python package that performs parallelized encoding of satellite imagery into easy-to-use features using the MOSAIKS algorithm and Dask for parallel processing. This package enables users to generate feature vectors based on satellite images by providing a list of latitudes and longitudes and Microsoft's Planetary Computer API key. It supports various satellites, image sizes, time periods, and parallelization options.
We implement the MOSAIKS algorithm based on work by Rolf et al., 2021 based on random convolutional features. The authors of this paper make a global cross-section of pre-computed features using Planet imagery from 2019 available at mosaiks.org, along with tutorials and related research.
This package extends the functionality of the original MOSAIKS implementation in the following ways:
- Flexibility in choice of satellite (tested for Landsat-8 and Sentinel-2)
- Ability to select the timeframe from which to pull imagery
- Flexibility in choosing the size of the images centred on points of interest, image bands used, etc.
- Flexible upload and download of data
- Parallel processing with Dask to speed up fetching images and creating features.
- Once installed, the package can be run on any machine (with the API key).
The package has been tested via Microsoft Planetary Computer on Landsat-8 or Sentinel-2 imagery. Please note this package has only been tested on Python 3.10 and 3.11. Using other versions of Python are expected to raise errors due to dependency conflicts.
For more detailed information on this package and how to use it, please see this blog post. For information on preview and potential use cases for this package, please see this blog post. For more information on MOSAIKS and previous use cases, see the MOSAIKS website here.
Users of this package should acknowledge IDinsight and reference the MOSAIKS RCF algorithm as Rolf, Esther, et al. "A generalizable and accessible approach to machine learning with global satellite imagery." Nature communications 12.1 (2021): 4392.
This section highlights a demo to help you get features ASAP.
Ensure you have all requirements set up:
-
Install Python 3.10 or 3.11.
-
Install the MOSAIKS package -
pip install git+https://github.com/IDinsight/mosaiks
or
pip install mosaiks
-
Acquire the Planetary Computer API key from Microsoft Planetary Computer (MPC). We provide detailed instructions for getting an API key in the FAQs section of this README.
In your terminal, run the following and fill in the API key prompt -
planetarycomputer configure
The quickest way to test the package is to run it in a notebook. Open up a notebook in the relevant environment (where Step 1 was executed) and run the following (the code is present in the "README_DEMO.ipynb" notebook):
-
Import dependencies
import os # Resolves a conflict in Geopandas. Improves speed. os.environ["USE_PYGEOS"] = "0"
-
Import test data. In this case, we are creating random GPS coordinates
# Example: Select 5 coordinates in Uttar Pradesh, India lats = [26.51268717, 26.55187804, 26.54949092, 26.54105597, 26.54843896] lons = [80.51489844, 80.54864309, 80.57813289, 80.51412136, 80.52254959]
-
Execute a run of the
get_features
function:from mosaiks import get_features df_featurised = get_features( lats, lons, datetime="2017", # or ["2013-01-01", "2013-12-31"] or ... image_width=1000, ) df_featurised
The above code executes a default run of the get_features function which executes the featurisation.
-
Run get_features with Dask parallelization
To run the code with the built-in Dask parallelization, set
parallelize
toTrue
anddask_chunksize
to a suitable integer given the size of your dataset.df_featurised = get_features( lats, lons, datetime="2017", # or ["2013-01-01", "2013-12-31"] or ... image_width=1000, parallelize=True, dask_chunksize=2, # set this to 200+ to see benefits from parallization ) df_featurised
Check out
get_features
' docs for parameters to control the in-built parallelization scheme.
The high-level flow of our featurisation pipeline is the following:
- The User feeds 'lat' and 'lon' lists for points they want to featurize
- The user also adds relevant parameters to the function (see docstrings and FAQs)
- For each GPS coordinate, the function fetches STAC references to satellite images
- Once found, the function fetches the images (either all or only the least cloudy depending on the
image_composite_method
parameter) - Function converts each image into features using the MOSAIKS algorithm
- Lastly, the function returns a dataframe with the features alongside the STAC references to the image(s) used to create the features from.
├── mosaiks
│ ├── fetch -- fetching images
│ ├── featurize -- converting images to MOSAIKS features
│ └── pipeline -- get_features() is here.
├── tests -- pytests (need to install requirements_test to run)
├── README.md -- No but actually, read this.
├── README_DEMO.ipynb
├── requirements.txt
├── requirements_dev.txt
├── requirements_test.txt
└── LICENSE
If you are running mosaiks locally or on a non-MPC server, then you need an access token for the satellite image database.
-
If you do not have an MPC account, go here. You should see a “Request Access” button in the top right corner.
It opens up a form which you should fill in. NB: Use your personal email ID rather than an institutional one. If you already have a Microsoft account, use the email ID (non-institutional) associated with it. Otherwise, you also have the additional step of creating a Microsoft account for the email ID you want to use for the MPC Hub.
Once you submit the form, you should receive an email within a week granting you access to the hub.
-
To get the token, go here. Request and copy the new token, and save it.
-
On your local/virtual machine. Run
pip install planetary-compute
andplanetarycomputer configure
from the console, and paste in the API key you generated. -
You only need to do this once and should be able to access the database smoothly every time you run the pipeline after this.
-
More information is available here.
Below are all the parameters and defaults that get_features
uses.
NOTE: You'll probably want to leave most of these as defaults - see the .yaml
file example that follows for the subset of parameters you most likely want to change.
def get_features(
latitudes: List[float],
longitudes: List[float],
datetime: str or List[str] or callable,
satellite_name: str = "landsat-8-c2-l2", # or "sentinel-2-l2a"
image_resolution: int = 30, # or 10 for Sentinel
image_bands: List[str] = ["SR_B2", "SR_B3", "SR_B4", "SR_B5", "SR_B6", "SR_B7"], # For options, see FAQs below
image_width: int = 3000,
min_image_edge: int = 30,
image_composite_method: str = "least_cloudy", # or "all" to create a multi-image median composite before featurisation
image_dtype: str = "int16", # or "int32" or "float". "int8" not supported.
stac_api_name: str = "planetary-compute", # or "earth-search"
n_mosaiks_features: int = 4000,
mosaiks_kernel_size: int = 3,
mosaiks_random_seed_for_filters: int = 768,
model_device: str = "cpu", # or "cuda" if NVIDIA GPU available
parallelize: bool = False,
dask_chunksize: int = 500,
dask_client: Optional[Client] = None, # Provide to override the default per-run LocalCluster creation
dask_n_workers: Optional[int] = None, # Set to None to auto-select maximum
dask_threads_per_worker: Optional[int] = None, # Set to None to auto-select maximum
dask_n_concurrent_tasks: Optional[int] = None, # Set to None to set equal to number of threads
dask_sort_points_by_hilbert_distance: bool = True,
setup_rasterio_env: bool = True,
) -> pd.DataFrame
You can also set these parameters in a .yml
file, read the file, and then input the parameters as **kwargs. Here is an example of a .yml
file with some common project-specific parameters set, leaving everything else as default:
datetime: "2017",
satellite_name: "landsat-8-c2-l2",
image_resolution: 30,
image_bands:
- "SR_B2"
- "SR_B3"
- "SR_B4"
- "SR_B5"
- "SR_B6"
- "SR_B7"
image_width: 3000,
image_composite_method: "least_cloudy",
n_mosaiks_features: 4000,
model_device: "cpu", # or "gpu" if NVIDIA GPU available
parallelize: True,
dask_chunksize: 500,
We have tested this package for 2 satellites: Sentinel-2 and Landsat-8.
Sentinel-2 images are available starting from 23. June 2015 (relevant for datetime
) at 10m resolution (image_resolution
) for 13 spectral bands (image_bands
).
Landsat-8 images are available starting 11th February 2013, at 30m resolution and for 11 spectral bands.
You can explore Microsoft Planetary Computer's data catalog to learn more -- it includes information about the satellites and links for further reading. You can also find information on the best image bands to use for images from the Landsat and Sentinel satellites.
To contribute to this repository, you can make a feature branch and raise a PR (after making sure that the code works and relevant tests pass).
To set up your dev environment, you can go through the following steps:
- Clone the mosaiks repository.
- Run
pip install -e .
in the repo's root folder to install a live local copy of the repository. This can be used in Python as import mosaiks. - pip install the two requirements files
requirements_dev.txt
andrequirements_test.txt
. - Start contributing!
We are happy to receive feedback on the package. Please do submit an issue, or if you know how to fix it, make a feature branch and raise a PR!