[82,22] S3 download/upload support for xlsx,xlsm and geojson #88

Merged
merged 11 commits into dev from 82_adding_support_excel_geojson
Nov 1, 2023

Conversation

j-gillam
Contributor

@j-gillam j-gillam commented Oct 26, 2023

S3 download/upload support for xlsx,xlsm and geojson

This PR closes issues 82 and 22.

Description

S3 download/upload support for xlsx, xlsm and geojson.
Changes:

  • Updated the S3.py file to support uploading/downloading xlsx, xlsm and geojson.
  • Updated setup with the new package requirements.
  • Added tests for uploading and downloading the files.
  • Added dummy xlsx, xlsm and geojson files.

Checklist:

  • I have followed Contributor Guidelines
  • I have updated the documentation to include any new functionality I have added or modified
  • I have written unit tests for any new functionality I have added
  • I have run and passed all tests locally with >> pytest
  • I have checked that all tests ran successfully on GitHub after I pushed

@j-gillam j-gillam self-assigned this Oct 26, 2023
@j-gillam j-gillam requested a review from emily-bicks October 26, 2023 16:35
setup.cfg Outdated
Comment on lines 23 to 25
openpyxl==3.0.9
shapely==2.0.2
geopandas==0.13.2
Contributor

@sqr00t sqr00t Oct 26, 2023

It may be best to add a gis environment to include shapely and geopandas.
I'd suggest adding openpyxl to another environment, io_extras, to reduce bloat, or adding it to install_requires.
Another possibility is adding [all] to pandas in install_requires, i.e. pandas[all]==1.5.1.
We could also follow how pandas adds each engine/extra and enable it downstream in this package.

Suggested change
- openpyxl==3.0.9
- shapely==2.0.2
- geopandas==0.13.2
+ gis =
+     geopandas==0.13.2
+ io_extras =
+     openpyxl==3.0.9

Contributor Author

Could we just add pandas[excel]==1.5.1? And actually, thinking about it, geopandas has shapely as a dependency, so we can just remove it, right?

Contributor

Yep!
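
When dependencies are split into optional groups like this, one common pattern is a guarded import that names the missing group in its error. The sketch below is illustrative only: `require_optional` is a hypothetical helper, not code from this PR, and the group names `gis` and `io_extras` are simply taken from the suggestion above.

```python
import importlib
import importlib.util


def require_optional(module_name: str, extra: str):
    """Import an optional top-level module, or raise an error naming the extra.

    Hypothetical helper: not part of the package under review.
    """
    # find_spec returns None when a top-level module is not installed.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it via the '{extra}' extra, e.g. pip install <package>[{extra}]."
        )
    return importlib.import_module(module_name)


# Usage sketch: resolve the dependency only where it is actually needed.
json_module = require_optional("json", "io_extras")
```

A geojson code path could then call `require_optional("geopandas", "gis")` at the point of use, so users who only need csv or parquet support never have to install the geospatial stack.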

Contributor

@emily-bicks emily-bicks left a comment

Looks great to me! My only comment: do we want the "download_as" argument to distinguish between a pandas dataframe and a geopandas dataframe? And is there any reason why a user would want the entire geojson returned as a dictionary rather than just loading the features attribute into a geopandas df?

  else:
      raise Exception(
-         "Uploading dataframe currently supported only for 'csv' and 'parquet'."
+         "Uploading dataframe currently supported only for 'csv', 'parquet', 'xlsx', 'xlsm' and 'geojson'."
Contributor

It may be worth specifying this as a separate function, given it's a geopandas df rather than a pandas df?

Contributor Author

I don't mind either way! I think a geopandas df still counts as a pandas df. The only differences are an extra geometry column with a particular type, and that the package itself has extra functionality for geojson files etc. But I'm happy to separate it into another function if that improves readability.

Contributor Author

I think maybe? On the geojson question, I've never needed it but someone might! So will add it in!

Contributor

@sqr00t sqr00t Oct 30, 2023

The latest geopandas docs indicate that a geopandas df is a separate object type, even though the library is supposedly built on top of pandas.

I think you'd need to add an instance type check for gpd.GeoDataFrame in each fnmatch branch, and raise an appropriate error message depending on whether the filetype is implemented for GeoDataFrames. Alternatively, handle GeoDataFrame types above the fnmatch branches (or move the handling into separate functions and call them conditionally).

Something like:

if isinstance(df_data, gpd.GeoDataFrame):
    your_new_geo_functions(params)

Also, import Union from typing and modify the function's parameter type hints to:
def _df_to_fileobj(df_data: Union[pd.DataFrame, gpd.GeoDataFrame], ...

Linking below some reference to supported IO operations for GeoPandas:

Contributor Author

Ah fab! So isinstance(geodf, pd.DataFrame) returns True for a geopandas dataframe, but it doesn't work in the other direction. So it must count as pandas somewhere in it. I'll just make a separate function!

Contributor

I'm slightly confused, did you do instance checking on a gpd.GeoDataFrame or pd.DataFrame?

Contributor Author

@j-gillam j-gillam Oct 30, 2023

Sorry! I think I've confused things, whoops. I had just checked for pd.DataFrame because it was neater and geodataframes are inherently pandas dataframes (at least according to isinstance), but I will separate them out as it's more robust and this has caused confusion, ahah.

Contributor

@sqr00t sqr00t Oct 30, 2023

Just checked the geopandas source code as well:

You could also try:

from geopandas.base import GeoPandasBase
issubclass(gpd.GeoDataFrame, GeoPandasBase)

since GeoDataFrame inherits from both GeoPandasBase and pandas DataFrame.

Contributor Author

Ooo I like that! Thanks so much for checking everything :)
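
The asymmetry discussed in this thread follows from ordinary subclassing. The sketch below uses stand-in classes instead of the real libraries, so it runs without pandas or geopandas installed; `DataFrame`, `GeoPandasBase`, `GeoDataFrame` and `pick_writer` are hypothetical placeholders that only mirror the inheritance described above. It illustrates why the more specific check has to come before the general one.

```python
class DataFrame:
    """Stand-in for pd.DataFrame (assumption for illustration)."""


class GeoPandasBase:
    """Stand-in for geopandas.base.GeoPandasBase (assumption for illustration)."""


class GeoDataFrame(GeoPandasBase, DataFrame):
    """Stand-in for gpd.GeoDataFrame, inheriting from both bases."""


def pick_writer(df_data) -> str:
    """Return a label for the handler that should process df_data."""
    # Test the subclass-specific base first: a GeoDataFrame would also
    # pass the plain DataFrame check and be routed to the wrong handler.
    if isinstance(df_data, GeoPandasBase):
        return "geo"
    if isinstance(df_data, DataFrame):
        return "tabular"
    raise TypeError(f"Unsupported data type: {type(df_data).__name__}")


geo_df = GeoDataFrame()
plain_df = DataFrame()

print(isinstance(geo_df, DataFrame))       # subclass instance passes the parent check
print(isinstance(plain_df, GeoDataFrame))  # parent instance fails the subclass check
print(pick_writer(geo_df), pick_writer(plain_df))
```

Reversing the two `isinstance` branches would silently send every geodataframe down the tabular path, which is the confusion this thread was working through.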

@j-gillam
Contributor Author

Hello both! I have updated the files based on your comments :)

Things to note:

  • @sqr00t pandas 1.5.1 doesn't allow extras, so I went back to your initial idea!
  • Separated the upload/download for geodataframes into separate functions.
  • Noticed a few spelling errors, so fixed those!
  • Removed from xmlrpc.client import Boolean from S3.py, as I couldn't see where it was used.

For adding geojson upload/download from dictionaries, I couldn't find a way to check typing for geojson specifically, as from my understanding of the docs it is a JSON with specific formatting. So what I did was add a check that it has a "type" member and that the member is one of the accepted types based on section 3 of the specification. If anyone knows of a better way, I am happy to update!
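
A shallow check along the lines described could look like the sketch below. It assumes the specification referred to is RFC 7946, whose section 3 lists nine GeoJSON object types; `looks_like_geojson` and `GEOJSON_TYPES` are hypothetical names, not the ones used in this PR. It does not validate coordinates or nested members.

```python
from typing import Any

# The nine GeoJSON object types (assumed source: RFC 7946, section 3).
GEOJSON_TYPES = frozenset(
    {
        "Point",
        "MultiPoint",
        "LineString",
        "MultiLineString",
        "Polygon",
        "MultiPolygon",
        "GeometryCollection",
        "Feature",
        "FeatureCollection",
    }
)


def looks_like_geojson(obj: Any) -> bool:
    """Shallow test: a dict whose "type" member is a known GeoJSON type."""
    if not isinstance(obj, dict):
        return False
    type_member = obj.get("type")
    # Guard against non-string (possibly unhashable) values before the lookup.
    return isinstance(type_member, str) and type_member in GEOJSON_TYPES


feature = {"type": "Feature", "geometry": None, "properties": {}}
print(looks_like_geojson(feature))
print(looks_like_geojson({"type": "Table"}))
```

Type names are case-sensitive here, so `"feature"` would be rejected; whether to be stricter or more lenient than that is a design choice for the package.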

Contributor

@sqr00t sqr00t left a comment

I've added some smaller changes:

  • Removed fnmatch and xmlrpc imports from file_ops (likely a prior refactor artefact, great spot @j-gillam)
  • Specified error types raised
  • Refined instance checking to gpd.base.GeoPandasBase

-             "for 'csv' and 'parquet'."
+             "for 'csv','parquet','xlsx' and 'xlsm'."
          )
  elif download_as == "geodataframe":
Contributor

Suggested change
- elif download_as == "geodataframe":
+ elif download_as.lower() in ["geodataframe", "gdf", "geodf", "geo_df"]:

"geodataframe" may be lengthy for end users.

Contributor Author

I think it's probably simpler to just have one option? We could go with geodf?
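
If several spellings were accepted after all, one option would be to normalise them to a single canonical value before branching, so the rest of the code only ever compares against one name. This is a minimal sketch under that assumption: `normalise_download_as` and the alias table are hypothetical, and choosing `geodf` as the canonical value simply follows the reply above.

```python
from typing import Optional

# Hypothetical alias table: every accepted spelling maps to one canonical name.
_DOWNLOAD_AS_ALIASES = {
    "geodataframe": "geodf",
    "gdf": "geodf",
    "geo_df": "geodf",
    "geodf": "geodf",
}


def normalise_download_as(download_as: Optional[str]) -> Optional[str]:
    """Map user-supplied spellings of download_as onto a canonical value."""
    if not download_as:
        # None or an empty string means no conversion was requested.
        return None
    key = download_as.strip().lower()
    # Unknown values pass through unchanged for the caller to validate.
    return _DOWNLOAD_AS_ALIASES.get(key, key)


print(normalise_download_as("GeoDataFrame"))
print(normalise_download_as("dataframe"))
```

The trade-off is the one raised in the thread: aliases are friendlier to type, but a single documented option is simpler to maintain and test.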

  elif download_as == "np.array":
      if path_from.endswith(tuple([".csv", ".parquet"])):
          return _fileobj_to_np_array(fileobj, path_from, **kwargs_reading)
      else:
-         raise Exception(
+         raise NotImplementedError(
              "Download as numpy array currently supported only "
              "for 'csv' and 'parquet'."
          )
  elif not download_as:
Contributor

Suggested change
- elif not download_as:
+ if not download_as:

Move this check to the top of the if/elif chain.
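
The two review points in this part of the diff (checking the empty case first, and raising a specific error type) can be sketched together as follows. `choose_reader` is a hypothetical stand-in that returns a label rather than reading anything, and treating a falsy `download_as` as "return the raw file object" is an assumption about the intended behaviour, not something stated in this PR.

```python
from typing import Optional


def choose_reader(path_from: str, download_as: Optional[str] = None) -> str:
    """Return a label for the reader that would handle this download."""
    # Guard clause first: with no target type requested there is nothing
    # to convert, so none of the later branches need to consider None.
    if not download_as:
        return "fileobj"

    if download_as == "np.array":
        if path_from.endswith((".csv", ".parquet")):
            return "np.array"
        # NotImplementedError tells callers the request is valid in
        # principle but unsupported for this file type.
        raise NotImplementedError(
            "Download as numpy array currently supported only "
            "for 'csv' and 'parquet'."
        )

    raise ValueError(f"Unrecognised download_as value: {download_as!r}")


print(choose_reader("data/table.csv"))
print(choose_reader("data/table.csv", "np.array"))
```

Callers can then catch `NotImplementedError` separately from genuine input mistakes, which a bare `Exception` would not allow.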

@j-gillam j-gillam merged commit 046fbae into dev Nov 1, 2023
3 checks passed
@j-gillam j-gillam deleted the 82_adding_support_excel_geojson branch November 1, 2023 11:38
Development

Successfully merging this pull request may close these issues.

Add geospatial functionality
[Feature]: Add xlsx support to S3.loading_saving()
3 participants