diff --git a/README.md b/README.md
index 75b1493..7220d08 100644
--- a/README.md
+++ b/README.md
@@ -4,9 +4,9 @@ CropHarvest is an open source remote sensing dataset for agriculture with benchm
-The dataset consists of **90,480** datapoints, of which **30,899** (34.2%) have multiclass labels. All other datapoints only have binary crop / non-crop labels.
+The dataset consists of **95,186** datapoints, of which **33,205** (35%) have multiclass labels. All other datapoints only have binary crop / non-crop labels.
-**65,690** (73%) of these labels are paired with remote sensing and climatology data, specifically [Sentinel-2](https://sentinel.esa.int/web/sentinel/missions/sentinel-2), [Sentinel-1](https://sentinel.esa.int/web/sentinel/missions/sentinel-1/), the [SRTM Digital Elevation Model](https://cgiarcsi.community/data/srtm-90m-digital-elevation-database-v4-1/) and [ERA 5 climatology data](https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5).
+**70,213** (74%) of these labels are paired with remote sensing and climatology data, specifically [Sentinel-2](https://sentinel.esa.int/web/sentinel/missions/sentinel-2), [Sentinel-1](https://sentinel.esa.int/web/sentinel/missions/sentinel-1/), the [SRTM Digital Elevation Model](https://cgiarcsi.community/data/srtm-90m-digital-elevation-database-v4-1/) and [ERA 5 climatology data](https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5).
21 datasets are aggregated into CropHarvest - these are documented [here](https://github.com/nasaharvest/cropharvest/blob/main/datasets.md).
@@ -40,13 +40,105 @@ pip install cropharvest
```
### Getting started [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nasaharvest/cropharvest/blob/main/demo.ipynb)
-See the [`demo.ipynb`](https://github.com/nasaharvest/cropharvest/blob/main/demo.ipynb) notebook for an example on how to download the data from [Zenodo](https://zenodo.org/record/5828893) and train a random forest against this data.
+See the [`demo.ipynb`](https://github.com/nasaharvest/cropharvest/blob/main/demo.ipynb) notebook for an example on how to download the data from [Zenodo](https://zenodo.org/record/7257688) and train a random forest against this data.
For more examples of models trained against this dataset, see the [benchmarks](https://github.com/nasaharvest/cropharvest/blob/main/benchmarks).
### Contributing
If you would like to contribute a dataset, please see the [contributing readme](https://github.com/nasaharvest/cropharvest/blob/main/contributing.md).
+### ~~FAQ~~ Questions asked at least once
+
+
+How do I use CropHarvest for a specific geography?
+
+All the data is accessible through the `cropharvest.datasets.CropHarvest` object. The main parameters which you might be interested in manipulating are controllable through a `cropharvest.datasets.Task`, which takes as input the following parameters:
+- A bounding box, which defines the spatial boundaries of the labels retrieves
+- A target label, which defines the class of the positive labels (if this is left to `None`, then the positive class will be crops and the negative class will be non-crops)
+- A boolean defining whether or not to balance the crops and non-crops in the negative class
+- A test_identifier string, which tells the dataset whether or not to retrieve a file from the `test_features` folder and return it as the test data.
+
+So if I wanted to use this to train a model to identify crop vs. non crop in France, I might do it like this:
+
+```python
+from sklearn.ensemble import RandomForestClassifier
+
+from cropharvest.datasets import Task, CropHarvest
+from cropharvest.countries import get_country_bbox
+
+# folder into which the data will be downloaded / already exists
+data_dir = "data"
+
+# get_country_bbox returns a list of bounding boxes.
+# the one representing Metropolitan France is the 2nd box
+metropolitan_france_bbox = get_country_bbox("France")[1]
+
+task = Task(bounding_box=metropolitan_france_bbox, normalize=True)
+
+my_dataset = CropHarvest(data_dir, task)
+
+X, y = my_dataset.as_array(flatten_x=True)
+model = RandomForestClassifier(random_state=0)
+model.fit(X, y)
+```
+
+
+
+How do I load a specific pixel timeseries?
+
+The specific use case here is to retrieve NDVI values for a specific row in the `labels.geojson`. Here is how you might go about doing that:
+
+Firstly, I will load the geosjon. I'll do it through a `CropHarvestLabels` object, which is just a wrapper around the geojson but provides some nice utility functions.
+```python
+>>> from cropharvest.datasets import CropHarvestLabels
+>>>
+>>> labels = CropHarvestLabels("cropharvest/data")
+>>> labels_geojson = labels.as_geojson()
+>>> labels_geojson.head()
+ harvest_date planting_date ... is_test geometry
+0 None None ... False POLYGON ((37.08252 10.71274, 37.08348 10.71291...
+1 None None ... False POLYGON ((37.08721 10.72398, 37.08714 10.72429...
+2 None None ... False POLYGON ((37.08498 10.71371, 37.08481 10.71393...
+3 None None ... False POLYGON ((37.09021 10.71320, 37.09014 10.71341...
+4 None None ... False POLYGON ((37.08307 10.72160, 37.08281 10.72197...
+
+[5 rows x 13 columns]
+```
+
+Now, I can use the `labels` object to retrieve the filepath of the processed satellite data for each row in the labels geojson:
+```python
+>>> path_to_file = labels._path_from_row(labels_geojson.iloc[0])
+```
+This processed satellite data is stored as `h5py` files, so I can load it up as follows:
+```python
+>>> import h5py
+>>> h5py_file = h5py.File(path_to_file, "r")
+>>> x = h5py_file.get("array")[:]
+>>> x.shape
+(12, 18)
+```
+The shape of `x` represents 12 timesteps and 18 bands. To retrieve the band I am interested in:
+```python
+>>> from cropharvest.bands import BANDS
+>>> x[:, BANDS.index("NDVI")]
+array([0.28992072, 0.28838343, 0.26833579, 0.22577633, 0.27138986,
+ 0.06584114, 0.498998 , 0.50147203, 0.50437743, 0.44326343,
+ 0.33735849, 0.28375967])
+```
+These are 12 NDVI values, corresponding to the 12 months captured in this timeseries. To find out exactly which month each timestep represents, I can do
+```python
+>>> labels_geojson.iloc[0].export_end_date
+'2021-02-01T00:00:00'
+```
+Wich tells me that the last timestep represents January 2021. I can work backwards from there.
+
+
+
+
+What is the data format?
+The structure of the different data files is now described in depth in the data folder's [Readme](https://github.com/nasaharvest/cropharvest/blob/main/data/README.md)
+
+
### License
CropHarvest has a [Creative Commons Attribution-ShareAlike 4.0 International](https://github.com/nasaharvest/cropharvest/blob/main/LICENSE.txt) license.
diff --git a/cropharvest/config.py b/cropharvest/config.py
index 203e1e7..e5d1f61 100644
--- a/cropharvest/config.py
+++ b/cropharvest/config.py
@@ -16,7 +16,7 @@
EXPORT_END_MONTH = 2
EXPORT_END_DAY = 1
-DATASET_VERSION_ID = 6985649
+DATASET_VERSION_ID = 7257688
DATASET_URL = f"https://zenodo.org/record/{DATASET_VERSION_ID}"
LABELS_FILENAME = "labels.geojson"
FEATURES_DIR = "features"
diff --git a/setup.py b/setup.py
index 1f87a54..2235426 100644
--- a/setup.py
+++ b/setup.py
@@ -15,7 +15,7 @@
author="Gabriel Tseng",
author_email="gabrieltseng95@gmail.com",
url="https://github.com/nasaharvest/cropharvest",
- version="0.5.0",
+ version="0.6.0",
classifiers=[
"Programming Language :: Python :: 3",
"License :: Other/Proprietary License",