Bugfix/geoparquet storage #204

tgrandje · 2024-08-06T16:01:50Z

Should handle the geoparquet format for geodataframe
Replace multiprocessing.Pool by pebble.ThreadPool

* Switch the storing/loading of geoparquets from pandas to geopandas * black formatting

* switch from multiprocessing.Pool to pebble.ThreadPool * black formatting

tgrandje · 2024-08-06T16:05:25Z

2 tests failed on my machine:

FAILED tests/download/test_pynsee_download.py::MyTests::test_download_file_all - AssertionError: False is not true
FAILED tests/geodata/test_pynsee_geodata.py::TestFunction::test_get_geodata_short - ValueError: !!! geometry column is missing !!!

Second one might be linked to a proxy configuration (there is a requests Session built on the fly among the test): let's see what is happening on github.

First failure shouldn't be linked to this PR (I'll wait for the results on github before further inspection if I can)

tgrandje · 2024-08-07T10:02:05Z

Ok, I think I've fixed the failing tests on my machine.

I'll also try a patch to requests using requests-cache: I'm not sure if github is reusing the same machine everytime it starts a bunch of tests on each python version. If that's the case, we should have fastest results for the whole benchmark (as requests will find previous results from disk instead of performing http(s) requests). We'll see what the results are.
Note: I have to rewrite some of the code, as the session object should be created outside multithreading/multiprocessing to avoid crashing the kernel.

By the way, I seem to recall a discussion validating the testing of only lower and upper python version. Wasn't it ever implemented? Should I do that at the same time?

tgrandje · 2024-08-07T14:53:09Z

Am I right to assume the tests are failing because they are checked against the workflows yml from the master branch?
(The latest failed test has no step named "Add requests-cache", though you find it in the linked yaml file from the result's page...).

Any idea on how to test the impact of requests-cache on multiple tests? Maybe a temporary adding to the requirements-extra.txt file if that's ok with you?

(On my machine, without clearing the cache : first run takes ~30min. Next one: ~5min. The sqlite cache is 1.3 Go though, so not cached as an artifact right now.)

tfardet · 2024-08-07T21:02:55Z

By the way, I seem to recall a discussion validating the testing of only lower and upper python version. Wasn't it ever implemented? Should I do that at the same time?

Yes, please! (3.8 and 3.12)
But this and the caching should be in a separate PR so we're sure that no bugs can be overlooked.

Am I right to assume the tests are failing because they are checked against the workflows yml from the master branch?

Yes, I don't think there is any way to avoid this since otherwise any random malicious PR could leak secrets.

Regarding the geoparquet issue, do I understand correctly that the only reason this fails is because of the GeoFrDataFrame?

tgrandje · 2024-08-08T07:59:05Z

But this and the caching should be in a separate PR so we're sure that no bugs can be overlooked.

Yes, I got carried away... (Started this as resolving a pure geodataframe/arrow issue, then bunked into the mutiprocessing issue, then to 6 hours tests...).
If that's ok with you, I'll let the multithreading fix into this one, as that prevented me to carry the tests on my laptop.
I'd rather also let the rewriting on session outside of multithreading scope : this is not necessary for classical purposes, but patching requests on the test resulting on crashing the kernel during multithreading. So I'd say that's a 50/50 share between the 2 PR...

I'll revert the caching and python test version into another PR.

tgrandje · 2024-08-08T08:04:11Z

Regarding the geoparquet issue, do I understand correctly that the only reason this fails is because of the GeoFrDataFrame?

No, there was a unresolved bug still on my first commits. Once I remove the caching and workflow alterations, this should work fine.

tfardet · 2024-08-08T08:21:40Z

No, there was a unresolved bug still on my first commits. Once I remove the caching and workflow alterations, this should work fine.

I meant the original issue of not being able to write the parquet file

tgrandje · 2024-08-08T10:01:14Z

Oh, sorry :-)

So yes, I suppose that's one way of thinking: if the GeoFrDataFrame inherits from pd.DataFrame and not gpd.GeoDataFrame, then the .write_parquet method is not the good one.
Like I said in the issue, I'm in favor of adding geopandas as a dependency, but I can understand @hadrilec 's reserves. The compromise to move geopandas into the optional dependencies seems a good one, and that's what i dit in the latests commits.

tfardet · 2024-08-08T11:24:30Z

I'm in favor of adding geopandas as a dependency, but I can understand @hadrilec 's reserves. The compromise to move geopandas into the optional dependencies seems a good one, and that's what i dit in the latests commits.

I'm not a fan of that compromise as it adds complexity to a code that, as we've seen, is already quite elaborate.
I did not see anyone complaining about geopandas installation over the past year, except one person stuck with python 3.7 on a windows where they could not update anything, and this was just related to not being able to use the version they wanted (we need 0.8+ so this is a non-issue for us).

tgrandje · 2024-08-08T11:27:53Z

I agree with you, but I'm also in favor of a quick release of the bugfix (it is heavily slowing some algorithms of mine by downloading 100Mo dataset over and over...).

tgrandje added 4 commits August 6, 2024 16:22

Update requirements.txt

a4ae31a

Use geopandas to handle geoparquet

7db0561

* Switch the storing/loading of geoparquets from pandas to geopandas * black formatting

Remove unused warnings catchment

0a8d102

Switch from multiprocessing to multithreading

279d466

* switch from multiprocessing.Pool to pebble.ThreadPool * black formatting

bugfix on save_df

dfd2ecf

tgrandje added 8 commits August 7, 2024 15:08

Switch to geopandas optional dependency

c44c6d6

Add requests-cache to env for tests only

e30af86

Patch requests at pytest session start

f53b726

Create session outside multithreading/multiprocessing

411b3cc

Update workflows

902c79a

Fix workflows(?)

fbb0787

Dummy commit to refresh workflows

2fc678e

Move requests-cache install to different job step

daab004

Remove tests' requests' patching from current PR

7381628

tfardet mentioned this pull request Aug 9, 2024

Add GeoPandas as dependency? #169

Open

tgrandje mentioned this pull request Aug 31, 2024

Enhancement/switch multiprocessing to multithreading #216

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bugfix/geoparquet storage #204

Bugfix/geoparquet storage #204

tgrandje commented Aug 6, 2024

tgrandje commented Aug 6, 2024 •

edited

Loading

tgrandje commented Aug 7, 2024 •

edited

Loading

tgrandje commented Aug 7, 2024

tfardet commented Aug 7, 2024

tgrandje commented Aug 8, 2024

tgrandje commented Aug 8, 2024

tfardet commented Aug 8, 2024

tgrandje commented Aug 8, 2024

tfardet commented Aug 8, 2024

tgrandje commented Aug 8, 2024

Bugfix/geoparquet storage #204

Are you sure you want to change the base?

Bugfix/geoparquet storage #204

Conversation

tgrandje commented Aug 6, 2024

tgrandje commented Aug 6, 2024 • edited Loading

tgrandje commented Aug 7, 2024 • edited Loading

tgrandje commented Aug 7, 2024

tfardet commented Aug 7, 2024

tgrandje commented Aug 8, 2024

tgrandje commented Aug 8, 2024

tfardet commented Aug 8, 2024

tgrandje commented Aug 8, 2024

tfardet commented Aug 8, 2024

tgrandje commented Aug 8, 2024

tgrandje commented Aug 6, 2024 •

edited

Loading

tgrandje commented Aug 7, 2024 •

edited

Loading