Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/geoparquet storage #204

Open
wants to merge 14 commits into
base: master
Choose a base branch
from
Open

Bugfix/geoparquet storage #204

wants to merge 14 commits into from

Conversation

tgrandje
Copy link
Collaborator

@tgrandje tgrandje commented Aug 6, 2024

  • Should handle the geoparquet format for geodataframe
  • Replace multiprocessing.Pool by pebble.ThreadPool

* Switch the storing/loading of geoparquets from pandas to geopandas
* black formatting
* switch from multiprocessing.Pool to pebble.ThreadPool
* black formatting
@tgrandje
Copy link
Collaborator Author

tgrandje commented Aug 6, 2024

2 tests failed on my machine:

FAILED tests/download/test_pynsee_download.py::MyTests::test_download_file_all - AssertionError: False is not true
FAILED tests/geodata/test_pynsee_geodata.py::TestFunction::test_get_geodata_short - ValueError: !!! geometry column is missing !!!

Second one might be linked to a proxy configuration (there is a requests Session built on the fly among the test): let's see what is happening on github.

First failure shouldn't be linked to this PR (I'll wait for the results on github before further inspection if I can)

@tgrandje
Copy link
Collaborator Author

tgrandje commented Aug 7, 2024

Ok, I think I've fixed the failing tests on my machine.

I'll also try a patch to requests using requests-cache: I'm not sure if github is reusing the same machine everytime it starts a bunch of tests on each python version. If that's the case, we should have fastest results for the whole benchmark (as requests will find previous results from disk instead of performing http(s) requests). We'll see what the results are.
Note: I have to rewrite some of the code, as the session object should be created outside multithreading/multiprocessing to avoid crashing the kernel.

By the way, I seem to recall a discussion validating the testing of only lower and upper python version. Wasn't it ever implemented? Should I do that at the same time?

@tgrandje
Copy link
Collaborator Author

tgrandje commented Aug 7, 2024

Am I right to assume the tests are failing because they are checked against the workflows yml from the master branch?
(The latest failed test has no step named "Add requests-cache", though you find it in the linked yaml file from the result's page...).

Any idea on how to test the impact of requests-cache on multiple tests? Maybe a temporary adding to the requirements-extra.txt file if that's ok with you?

(On my machine, without clearing the cache : first run takes ~30min. Next one: ~5min. The sqlite cache is 1.3 Go though, so not cached as an artifact right now.)

@tfardet
Copy link
Collaborator

tfardet commented Aug 7, 2024

By the way, I seem to recall a discussion validating the testing of only lower and upper python version. Wasn't it ever implemented? Should I do that at the same time?

Yes, please! (3.8 and 3.12)
But this and the caching should be in a separate PR so we're sure that no bugs can be overlooked.

Am I right to assume the tests are failing because they are checked against the workflows yml from the master branch?

Yes, I don't think there is any way to avoid this since otherwise any random malicious PR could leak secrets.

Regarding the geoparquet issue, do I understand correctly that the only reason this fails is because of the GeoFrDataFrame?

@tgrandje
Copy link
Collaborator Author

tgrandje commented Aug 8, 2024

But this and the caching should be in a separate PR so we're sure that no bugs can be overlooked.

Yes, I got carried away... (Started this as resolving a pure geodataframe/arrow issue, then bunked into the mutiprocessing issue, then to 6 hours tests...).
If that's ok with you, I'll let the multithreading fix into this one, as that prevented me to carry the tests on my laptop.
I'd rather also let the rewriting on session outside of multithreading scope : this is not necessary for classical purposes, but patching requests on the test resulting on crashing the kernel during multithreading. So I'd say that's a 50/50 share between the 2 PR...

I'll revert the caching and python test version into another PR.

@tgrandje
Copy link
Collaborator Author

tgrandje commented Aug 8, 2024

Regarding the geoparquet issue, do I understand correctly that the only reason this fails is because of the GeoFrDataFrame?

No, there was a unresolved bug still on my first commits. Once I remove the caching and workflow alterations, this should work fine.

@tfardet
Copy link
Collaborator

tfardet commented Aug 8, 2024

No, there was a unresolved bug still on my first commits. Once I remove the caching and workflow alterations, this should work fine.

I meant the original issue of not being able to write the parquet file

@tgrandje
Copy link
Collaborator Author

tgrandje commented Aug 8, 2024

Oh, sorry :-)

So yes, I suppose that's one way of thinking: if the GeoFrDataFrame inherits from pd.DataFrame and not gpd.GeoDataFrame, then the .write_parquet method is not the good one.
Like I said in the issue, I'm in favor of adding geopandas as a dependency, but I can understand @hadrilec 's reserves. The compromise to move geopandas into the optional dependencies seems a good one, and that's what i dit in the latests commits.

@tfardet
Copy link
Collaborator

tfardet commented Aug 8, 2024

I'm in favor of adding geopandas as a dependency, but I can understand @hadrilec 's reserves. The compromise to move geopandas into the optional dependencies seems a good one, and that's what i dit in the latests commits.

I'm not a fan of that compromise as it adds complexity to a code that, as we've seen, is already quite elaborate.
I did not see anyone complaining about geopandas installation over the past year, except one person stuck with python 3.7 on a windows where they could not update anything, and this was just related to not being able to use the version they wanted (we need 0.8+ so this is a non-issue for us).

@tgrandje
Copy link
Collaborator Author

tgrandje commented Aug 8, 2024

I agree with you, but I'm also in favor of a quick release of the bugfix (it is heavily slowing some algorithms of mine by downloading 100Mo dataset over and over...).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants