-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[82,22] S3 download/upload support for xlsx,xlsm and geojson #88
Conversation
setup.cfg
Outdated
openpyxl==3.0.9 | ||
shapely==2.0.2 | ||
geopandas==0.13.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be best to add a gis
environment to include shapely
and geopandas
I'd suggest adding openpyxl
to another environment io_extras
to reduce bloat, or add to install_requires
Another possibility is adding [all] to pandas in install_requires i.e. pandas[all]==1.5.1
We could also follow how pandas adds each engine/ extra and enable it downstream in this package
openpyxl==3.0.9 | |
shapely==2.0.2 | |
geopandas==0.13.2 | |
gis = | |
geopandas==0.13.2 | |
io_extras = | |
openpyxl==3.0.9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we just add pandas[excel]==1.5.1 ? And actually thinking about it, geopandas has shapely as a dependency, so can just remove it right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me! My only comment is do we want to make the "download_as" argument now be specific as pandas dataframe vs geopandas dataframe? And is there any reason why a user would want to return the entire geojson as a dictionary rather than just loading the features attribute into a geopandas df?
nesta_ds_utils/loading_saving/S3.py
Outdated
else: | ||
raise Exception( | ||
"Uploading dataframe currently supported only for 'csv' and 'parquet'." | ||
"Uploading dataframe currently supported only for 'csv', 'parquet', 'xlsx', xlsm' and 'geojson'." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be worth specifying this as a separate function given it's a geopandas df rather than a pandas df?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind either way! I think a geopandas df still counts as a pandas df. The only difference being an extra geometry column that has a particular typing and that the package itself has extra functionality for geojsons etc. But happy to separate into another function, if that makes readability better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think maybe? On the geojson question, I've never needed it but someone might! So will add it in!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest geopandas docs indicates that geopandas df is a separate object type even though the library is supposedly built on top of pandas.
I think you'd need to add an instance type check condition for gpd.GeoDataFrame
for each fnmatch
branch, and raise an appropriate error message whether the filetype is implemented for GeoDataFrames. Or just handle GeoDataFrame
types above the fnmatch
branches (or move the handling to another function separately, then calling your new functions conditionally).
something like:
if isinstance(df_data, gpd.GeoDataFrame):
your_new_geo_functions(params)
and also, importing Union
from typing
, and modify the function parameters type hints to:
def _df_to_fileobj(df_data: Union[pd.DataFrame, gpd.GeoDataFrame], ...
Linking below some reference to supported IO operations for GeoPandas:
- User Guide
- IO methods
- GeoPandas.GeoDataFrame instance methods
- Note, there is some overlap between the IO methods and "instance methods" sections, but there's a few extra
to_something
methods in the latter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah fab! So the isinstance(geodf, pd.DataFrame ) returns true for geopandas dataframe, but it doesnt work in the other direction. So it must count as pandas somewhere in it. I'll just make a separate function!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm slightly confused, did you do instance checking on a gpd.GeoDataFrame or pd.DataFrame?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry! I think I've confused things woops. I had currently just checked for pd.DataFrame because it was neater and geodataframes are inherently pandas dataframes (at least according to the isinstance), but I will separate them out as its more robust and has caused confusion ahah.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checked the geopandas source code as well:
You could also try:
from geopandas.base import GeoPandasBase
issubclass(gpd.GeoDataFrame, GeoPandasBase)
Since GeoDataFrame inherits from GeoPandasBase and pandas DataFrame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooo I like that! Thanks so much for checking everything :)
Removing shapely from test. Co-authored-by: Solomon Yu <[email protected]>
Hello both! I have updated the files based on your comments :) Things to note:
For adding geojson upload/download from dictionaries, I couldn't find a way to check typing for geojson specifically. As from my understanding of the docs, it is a json with specific formatting. So what I did was add a check to see if it has a "type" member and that member is one of the accepted types based on the specification in section 3. If anyone knows of a better way I am happy to update! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added some smaller changes:
- Removed
fnmatch
andxmlrpc
imports fromfile_ops
(likely a prior refactor artefact, great spot @j-gillam) - Specified error types raised
- Refined instance checking to gpd.base.GeoPandasBase
nesta_ds_utils/loading_saving/S3.py
Outdated
"for 'csv' and 'parquet'." | ||
"for 'csv','parquet','xlsx' and 'xlsm'." | ||
) | ||
elif download_as == "geodataframe": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elif download_as == "geodataframe": | |
elif download_as.lower() in ["geodataframe", "gdf", "geodf", "geo_df"]: |
geodataframe may be lengthly for end users
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its probably simpler to just have one option? We could go with geodf?
nesta_ds_utils/loading_saving/S3.py
Outdated
elif download_as == "np.array": | ||
if path_from.endswith(tuple([".csv", ".parquet"])): | ||
return _fileobj_to_np_array(fileobj, path_from, **kwargs_reading) | ||
else: | ||
raise Exception( | ||
raise NotImplementedError( | ||
"Download as numpy array currently supported only " | ||
"for 'csv' and 'parquet'." | ||
) | ||
elif not download_as: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elif not download_as: | |
if not download_as: |
move this to the top statement
S3 download/upload support for xlsx,xlsm and geojson
This PR closes the issues: 82 and 22.
Description
S3 download/upload support for xlsx,xlsm and geojson.
Added:
Checklist:
>> pytest