Documentation for EO newbies #95

Open
dks4-hw opened this issue Jul 12, 2022 · 5 comments
Labels
question Further information is requested

Comments

@dks4-hw

dks4-hw commented Jul 12, 2022

This looks like a terrific ML resource with a powerful API. But your documentation is a bit lean, especially for EO newbies. The map in README.md suggests there is terrific image coverage in the dataset of Europe and North America, but the example code is limited to Togo, with benchmarks for Kenya & Brazil.
Can we use cropharvest to feed data for Europe or North America to ML models? I am guessing we need to supplement the features data download with those features in geographies we want to perform ML on. How do we use cropharvest to do that? It is not obvious.
Forgive me if the dataset is intended only for Kenya/Brazil/Togo and I have misunderstood. As EO professionals you will be familiar with the sentinelsat library, whose documentation is brilliant for EO newbies but which does not produce ML-ready products. Could you produce something as explanatory but with an ML-ready output?

@dks4-hw
Author

dks4-hw commented Jul 13, 2022

Meant to tag @gabrieltseng

@gabrieltseng
Collaborator

gabrieltseng commented Jul 15, 2022

Hi @dks4-hw ,

Thanks for the feedback! I'll work on adding some better documentation. In the meantime, to help you get started:

All the data is accessible through the cropharvest.datasets.CropHarvest object. The main parameters which you might be interested in manipulating are controllable through a cropharvest.datasets.Task, which takes as input the following parameters:

  • A bounding box, which defines the spatial boundaries of the labels retrieved
  • A target label, which defines the class of the positive labels (if this is left as None, then the positive class will be crops and the negative class will be non-crops)
  • A boolean defining whether or not to balance the crops and non-crops in the negative class
  • A test_identifier string, which tells the dataset whether or not to retrieve a file from the test_features folder and return it as the test data.

So if I wanted to use this to train a model to identify crop vs. non-crop in France, I might do it like this:

from sklearn.ensemble import RandomForestClassifier

from cropharvest.datasets import Task, CropHarvest
from cropharvest.countries import get_country_bbox

my_dataset = CropHarvest(
    # the first argument to the dataset is the (already existing)
    # folder into which the data will be downloaded / already exists
    "data",
    Task(
        # get_country_bbox returns a list of bounding boxes.
        # the one representing Metropolitan France is the
        # 2nd box
        bounding_box=get_country_bbox("France")[1],
        normalize=True
    )   
)
X, y = my_dataset.as_array(flatten_x=True)
model = RandomForestClassifier(random_state=0)
model.fit(X, y)
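
To make the bounding_box parameter concrete: a bounding box simply restricts the labels to a latitude/longitude rectangle. The sketch below is not cropharvest code. `BBoxSketch` and its `contains` method are hypothetical stand-ins, with field names borrowed from the `BBox` repr shown further down this thread, and the two test points are approximate coordinates for Ajaccio and Paris.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBoxSketch:
    """Hypothetical stand-in for a lat/lon bounding box (not the cropharvest class)."""

    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float
    name: str = ""

    def contains(self, lat: float, lon: float) -> bool:
        # A point is inside when both coordinates fall within the rectangle.
        return (
            self.min_lat <= lat <= self.max_lat
            and self.min_lon <= lon <= self.max_lon
        )


# Values rounded from the France_0 box printed further down this thread.
france_0 = BBoxSketch(41.38, 43.02, 8.57, 9.56, name="France_0")

# Approximate (lat, lon) pairs, used only for illustration.
points = {"Ajaccio": (41.93, 8.74), "Paris": (48.86, 2.35)}
inside = [name for name, (lat, lon) in points.items() if france_0.contains(lat, lon)]
print(inside)
```

Only the point inside the rectangle is kept, which is why the index into get_country_bbox("France") matters: the first box appears to cover Corsica rather than the mainland.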

I hope this helps to get started; in the meantime, I'll write up some more thorough documentation.

@gabrieltseng gabrieltseng self-assigned this Jul 15, 2022
@gabrieltseng gabrieltseng added the question Further information is requested label Oct 3, 2022
@kolrocket

kolrocket commented Oct 25, 2022

Hello, I'm trying to run this exact example. But after

my_dataset = CropHarvest( 
    Task(
        # get_country_bbox returns a list of bounding boxes
        bounding_box=get_country_bbox("France")[0],
        normalize=True
    )   
)

it returns

Traceback (most recent call last):

  File "C:\Users\leand\AppData\Local\Temp\ipykernel_19248\3361455196.py", line 1, in <module>
    my_dataset = CropHarvest(

  File "C:\Users\leand\anaconda3\envs\crop\lib\site-packages\cropharvest\datasets.py", line 203, in __init__
    super().__init__(root, download, filenames=(FEATURES_DIR, TEST_FEATURES_DIR))

  File "C:\Users\leand\anaconda3\envs\crop\lib\site-packages\cropharvest\datasets.py", line 60, in __init__
    self.root = Path(root)

  File "C:\Users\leand\anaconda3\envs\crop\lib\pathlib.py", line 1042, in __new__
    self = cls._from_parts(args, init=False)

  File "C:\Users\leand\anaconda3\envs\crop\lib\pathlib.py", line 683, in _from_parts
    drv, root, parts = self._parse_args(args)

  File "C:\Users\leand\anaconda3\envs\crop\lib\pathlib.py", line 667, in _parse_args
    a = os.fspath(a)

TypeError: expected str, bytes or os.PathLike object, not Task

Any ideas of what is wrong? Thank you very much.

EDIT: running Task(...) instead of CropHarvest(Task()) works and returns:

Task(bounding_box=BBox(min_lat=41.384912109374994, max_lat=43.021484375, min_lon=8.565625000000011, max_lon=9.556445312500017, name='France_0'), target_label='crop', balance_negative_crops=False, test_identifier=None, normalize=True)

but then for the next part 'Task' object has no attribute 'as_array' .

@gabrieltseng
Collaborator

Hi @kolrocket ; apologies. There was a bug in the example above, which is now fixed. I've confirmed the code runs:

>>> from sklearn.ensemble import RandomForestClassifier
>>> from cropharvest.datasets import Task, CropHarvest
>>> from cropharvest.countries import get_country_bbox
>>> my_dataset = CropHarvest("data", Task(bounding_box=get_country_bbox("France")[1], normalize=True))
>>> X, y = my_dataset.as_array(flatten_x=True)
>>> X.shape, y.shape
((6603, 216), (6603,))
>>> model = RandomForestClassifier(random_state=0)
>>> model.fit(X, y)
RandomForestClassifier(random_state=0)
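
For anyone hitting the same TypeError: per the traceback, the first positional argument ends up in `Path(root)`, so passing a `Task` in that position fails. Here is a minimal reproduction with stand-in classes. `FakeTask` and `FakeDataset` are hypothetical and only mimic the argument order seen in the traceback, not the real cropharvest classes.

```python
from pathlib import Path


class FakeTask:
    """Hypothetical placeholder for a task object."""


class FakeDataset:
    """Hypothetical stand-in: the first positional argument is the data folder."""

    def __init__(self, root, task=None):
        # Path() raises TypeError if root is not a str or os.PathLike object.
        self.root = Path(root)
        self.task = task


try:
    FakeDataset(FakeTask())  # folder omitted, so the task lands in `root`
    error = None
except TypeError as exc:
    error = exc

ok = FakeDataset("data", FakeTask())  # folder first, then the task
print(type(error).__name__, ok.root)
```

The fix in the real call is the same shape: pass the data folder as the first argument and the Task as the second.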

@kolrocket

Thank you again!
