[82,22] S3 download/upload support for xlsx,xlsm and geojson #88

j-gillam · 2023-10-26T16:29:32Z

S3 download/upload support for xlsx,xlsm and geojson

This PR closes issues #82 and #22.

Description

S3 download/upload support for xlsx,xlsm and geojson.
Added:

Updates to S3.py file for uploading/downloading xlsx,xlsm and geojson.
Updated setup with new package requirements.
Added tests for uploading and downloading the files.
Added dummy xlsx,xlsm and geojson files.

Checklist:

I have followed Contributor Guidelines
I have updated the documentation to include any new functionality I have added or modified
I have written unit tests for any new functionality I have added
I have run and passed all tests locally with >> pytest
I have checked that all tests ran successfully on GitHub after I pushed

sqr00t · 2023-10-26T17:51:42Z

setup.cfg

+  openpyxl==3.0.9
+  shapely==2.0.2
+  geopandas==0.13.2


It may be best to add a gis environment to include shapely and geopandas.
I'd suggest adding openpyxl to another environment, io_extras, to reduce bloat, or adding it to install_requires.
Another possibility is adding [all] to pandas in install_requires, i.e. pandas[all]==1.5.1.
We could also follow how pandas adds each engine/extra and enable it downstream in this package.

Suggested change

-  openpyxl==3.0.9
-  shapely==2.0.2
-  geopandas==0.13.2
+gis =
+  geopandas==0.13.2
+io_extras =
+  openpyxl==3.0.9

Could we just add pandas[excel]==1.5.1? And actually, thinking about it, geopandas has shapely as a dependency, so we can just remove it, right?
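If the optional packages end up in separate extras such as gis and io_extras, the library code would need to cope with them being absent. The following is a minimal sketch of one way to do that, checking at call time whether a package is importable. The helper names `has_optional_dependency` and `require_optional_dependency` and the error wording are hypothetical and not part of nesta_ds_utils.

```python
from importlib.util import find_spec


def has_optional_dependency(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return find_spec(module_name) is not None


def require_optional_dependency(module_name: str, extra: str) -> None:
    """Raise a descriptive error when an optional extra is not installed."""
    if not has_optional_dependency(module_name):
        raise ModuleNotFoundError(
            f"'{module_name}' is required for this feature; "
            f"install the '{extra}' extra to enable it."
        )
```

A geojson code path could, for example, call `require_optional_dependency("geopandas", "gis")` before importing geopandas, so users without the extra get a clear message rather than a bare import failure.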

emily-bicks

Looks great to me! My only comment is: do we now want the "download_as" argument to distinguish between a pandas dataframe and a geopandas dataframe? And is there any reason why a user would want to return the entire geojson as a dictionary rather than just loading the features attribute into a geopandas df?

emily-bicks · 2023-10-26T18:31:46Z

nesta_ds_utils/loading_saving/S3.py

    else:
        raise Exception(
-            "Uploading dataframe currently supported only for 'csv' and 'parquet'."
+            "Uploading dataframe currently supported only for 'csv', 'parquet', 'xlsx', 'xlsm' and 'geojson'."


May be worth specifying this as a separate function, given it's a geopandas df rather than a pandas df?

I don't mind either way! I think a geopandas df still counts as a pandas df. The only difference being an extra geometry column that has a particular typing and that the package itself has extra functionality for geojsons etc. But happy to separate into another function, if that makes readability better?

I think maybe? On the geojson question, I've never needed it but someone might! So will add it in!

The latest geopandas docs indicate that a geopandas df is a separate object type, even though the library is supposedly built on top of pandas.

I think you'd need to add an instance type check for gpd.GeoDataFrame to each fnmatch branch, and raise an appropriate error when the file type is not implemented for GeoDataFrames. Or just handle GeoDataFrame types above the fnmatch branches (or move the handling to a separate function, then call your new functions conditionally).

Something like:

if isinstance(df_data, gpd.GeoDataFrame):
    your_new_geo_functions(params)

and also import Union from typing and modify the function's parameter type hints to:

def _df_to_fileobj(df_data: Union[pd.DataFrame, gpd.GeoDataFrame], ...

Linking below some references to the supported IO operations for GeoPandas:

User Guide

IO methods

GeoPandas.GeoDataFrame instance methods

Note, there is some overlap between the IO methods and "instance methods" sections, but there are a few extra to_something methods in the latter.

Ah fab! So isinstance(geodf, pd.DataFrame) returns True for a geopandas dataframe, but it doesn't work in the other direction. So it must count as pandas somewhere in it. I'll just make a separate function!

I'm slightly confused, did you do instance checking on a gpd.GeoDataFrame or pd.DataFrame?

Sorry! I think I've confused things, whoops. I had just checked for pd.DataFrame because it was neater and geodataframes are inherently pandas dataframes (at least according to isinstance), but I will separate them out as it's more robust and this has caused confusion, haha.

Just checked the geopandas source code as well:

You could also try:

import geopandas as gpd
from geopandas.base import GeoPandasBase

issubclass(gpd.GeoDataFrame, GeoPandasBase)

since GeoDataFrame inherits from both GeoPandasBase and the pandas DataFrame.

Ooo I like that! Thanks so much for checking everything :)
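The behaviour discussed in this thread follows from subclassing: an instance of a subclass passes an isinstance check against its parent, but not the other way round, so the more specific type has to be tested first. Below is a minimal sketch of that ordering. It uses stand-in classes rather than the real pandas and geopandas types so that it runs without either library; `DataFrame`, `GeoPandasBase`, `GeoDataFrame` and `describe_frame` are hypothetical placeholders, not the code in this PR.

```python
class DataFrame:
    """Stand-in for pd.DataFrame."""


class GeoPandasBase:
    """Stand-in for geopandas.base.GeoPandasBase."""


class GeoDataFrame(GeoPandasBase, DataFrame):
    """Stand-in for gpd.GeoDataFrame, which inherits from both parents."""


def describe_frame(df_data) -> str:
    """Return which upload path a frame would take in this sketch."""
    # Test the more specific geo type first: a GeoDataFrame would also
    # satisfy the plain DataFrame check below.
    if isinstance(df_data, GeoPandasBase):
        return "geodataframe"
    if isinstance(df_data, DataFrame):
        return "dataframe"
    raise TypeError(f"Unsupported type: {type(df_data).__name__}")
```

If the two checks were swapped, every GeoDataFrame would be routed down the plain dataframe path, which is the confusion raised above.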

j-gillam · 2023-10-30T17:48:19Z

Hello both! I have updated the files based on your comments :)

Things to note:

@sqr00t pandas 1.5.1 doesn't allow extras, so I went back to your initial idea!
Separated the upload/download for geodataframes into separate functions.
Noticed a few spelling errors, so fixed those!
Removed the "from xmlrpc.client import Boolean" import from S3.py, as I couldn't see where it was used.

For adding geojson upload/download from dictionaries, I couldn't find a way to check typing for geojson specifically. From my understanding of the docs, it is JSON with specific formatting, so what I did was add a check that it has a "type" member and that the member is one of the accepted types, based on section 3 of the specification. If anyone knows of a better way, I am happy to update!
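The check described above might look roughly like the following. This is only a sketch, not the code added in the PR: the function name `looks_like_geojson` and the constant `GEOJSON_TYPES` are hypothetical, and the nine type names are taken from the GeoJSON specification (RFC 7946). It only inspects the top-level "type" member and does not validate coordinates or nested features.

```python
# The nine GeoJSON object types defined by RFC 7946.
GEOJSON_TYPES = frozenset(
    {
        "FeatureCollection",
        "Feature",
        "Point",
        "MultiPoint",
        "LineString",
        "MultiLineString",
        "Polygon",
        "MultiPolygon",
        "GeometryCollection",
    }
)


def looks_like_geojson(obj) -> bool:
    """Return True if obj is a dict whose "type" member is a GeoJSON type."""
    if not isinstance(obj, dict):
        return False
    type_member = obj.get("type")
    # Guard against non-string values before the set lookup.
    return isinstance(type_member, str) and type_member in GEOJSON_TYPES
```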

sqr00t

I've added some smaller changes:

Removed fnmatch and xmlrpc imports from file_ops (likely a prior refactor artefact, great spot @j-gillam)
Specified error types raised
Refined instance checking to gpd.base.GeoPandasBase

sqr00t · 2023-11-01T10:42:32Z

nesta_ds_utils/loading_saving/S3.py

-                "for 'csv' and 'parquet'."
+                "for 'csv','parquet','xlsx' and 'xlsm'."
+            )
+    elif download_as == "geodataframe":


Suggested change

-    elif download_as == "geodataframe":
+    elif download_as.lower() in ["geodataframe", "gdf", "geodf", "geo_df"]:

geodataframe may be lengthy for end users

I think it's probably simpler to just have one option? We could go with geodf?

sqr00t · 2023-11-01T10:58:34Z

nesta_ds_utils/loading_saving/S3.py

    elif download_as == "np.array":
        if path_from.endswith(tuple([".csv", ".parquet"])):
            return _fileobj_to_np_array(fileobj, path_from, **kwargs_reading)
        else:
-            raise Exception(
+            raise NotImplementedError(
                "Download as numpy array currently supported only "
                "for 'csv' and 'parquet'."
            )
    elif not download_as:


Suggested change

-    elif not download_as:
+    if not download_as:

Move this to the top statement.
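To illustrate the suggestion, here is a minimal sketch of a dispatch that handles the empty download_as case as a guard clause before any format-specific branch, and raises NotImplementedError for unsupported extensions. The function name `choose_download_branch`, the returned labels, and the assumption that an empty download_as means a plain file download are all hypothetical; the real function in S3.py has more branches and different return values.

```python
from typing import Optional


def choose_download_branch(path_from: str, download_as: Optional[str] = None) -> str:
    """Return a label for the branch a download request would take."""
    # Guard clause first: no target type requested (assumed here to mean
    # a plain file download).
    if not download_as:
        return "file"
    if download_as == "np.array":
        if path_from.endswith((".csv", ".parquet")):
            return "np.array"
        raise NotImplementedError(
            "Download as numpy array currently supported only "
            "for 'csv' and 'parquet'."
        )
    raise ValueError(f"Unknown download_as value: {download_as!r}")
```

Putting the guard first means the later branches can assume download_as is a non-empty string, which also makes a call such as download_as.lower() safe.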

j-gillam added 2 commits October 26, 2023 17:24

[82,22] S3 download/upload support for xlsx,xlsm and geojson

010b21e

[82,22] S3 download/upload support for xlsx,xlsm and geojson

d876dc4

j-gillam self-assigned this Oct 26, 2023

j-gillam requested a review from emily-bicks October 26, 2023 16:35

sqr00t reviewed Oct 26, 2023

sqr00t reviewed Oct 26, 2023

emily-bicks approved these changes Oct 26, 2023

j-gillam and others added 2 commits October 30, 2023 09:38

Update setup.cfg

5c34618

Removing shapely from test. Co-authored-by: Solomon Yu

adding updates based on comments, support for geojson to dictionary

fc7cd54

j-gillam requested review from sqr00t and emily-bicks October 30, 2023 17:49

sqr00t added 4 commits October 31, 2023 10:22

chore: remove unused imports from file_ops, use bool instead of xmlrp…

f69e116

…c Boolean

refactor: improve error types raised

8bb5380

refactor: improve error types raised

1c37636

refactor: refine geodataframe instance type check

683eb7f

sqr00t approved these changes Oct 31, 2023

sqr00t reviewed Nov 1, 2023

j-gillam added 3 commits November 1, 2023 11:24

simplifying geodataframe call and moved None condition to top

c0f77d1

simplifying geodataframe call and moved None condition to top

06426ba

simplifying geodataframe call and moved None condition to top

b842917

j-gillam merged commit 046fbae into dev Nov 1, 2023
3 checks passed

j-gillam deleted the 82_adding_support_excel_geojson branch November 1, 2023 11:38

This was linked to issues Nov 1, 2023

[Feature]: Add xlsx support to S3.loading_saving() #82

Closed

Add geospatial functionality #22

Open
