Fix read from multiple s3 regions #1453

Open · wants to merge 13 commits into main
Conversation

@jiakai-li (Contributor Author) commented Dec 20, 2024

Hey @kevinjqliu, I hope you're ready for Christmas time :-)

After some investigation, I noticed the PyArrowFileIO._initialize_fs function doesn't take the netloc parameter into account when initializing the S3FileSystem. Instead, it always uses the region found in properties. I modified the precedence order to make netloc take higher priority compared to the region saved in properties.

Would be keen to get some quick feedback, and I will add more unit tests if this sounds like a fix on the right track. Thanks!

@jiakai-li marked this pull request as draft December 20, 2024 01:59
@kevinjqliu (Contributor)

@jiakai-li Thanks for working on this! And happy holidays :)

I noticed the PyArrowFileIO._initialize_fs function doesn't take the netloc parameter into account when initializing the S3FileSystem

Looking through the usage of _initialize_fs, it doesn't look like netloc is used at all.

it always uses the region found in properties

I think that's one of the problems we need to tackle. The current S3 configuration requires a specific "region" to be set. This assumes that all data and metadata files are in the same region as the one specified. But what if I have some files in one region and some in another?

I think a potential solution might be to omit the "region" property and allow the S3FileSystem to determine the proper region using resolve_s3_region. This is recommended in the S3FileSystem docs for region.

Another potential issue is the way we cache fs: it assumes that there's only one fs per scheme. With the region approach above, we break this assumption.

@kevinjqliu (Contributor)

BTW there's a similar issue in #1041

@jiakai-li (Contributor Author)

Thank you @kevinjqliu, just trying to clear my head a little bit.

I think a potential solution might be to omit the "region" property and allow the S3FileSystem to determine the proper region using resolve_s3_region. This is recommended in the S3FileSystem docs for region.

Is the change I made in accordance with this option? What I've done, essentially, is use the netloc to determine the bucket region. Only when, for some reason, the region cannot be determined do we fall back to the properties configuration.

Another potential issue is the way we cache fs: it assumes that there's only one fs per scheme. With the region approach above, we break this assumption.

Please correct me if I'm missing something about how the fs cache works, but here is my understanding:

I see we use lru_cache, so it should cache one fs for each different bucket, since each bucket has a different netloc and thus a different key in the cache. Previously, it looks like we effectively had only one cached fs. That seems related to the netloc not being used: netloc was not connected to the client_kwargs["region"] configuration. In that case, even if two cache keys point to two fs instances, the two instances are still configured for the same region (the one set in properties).

I think solving the netloc issue will also resolve the cache issue, as the lru_cache key is now linked to the region and will return the correct instance.

@jiakai-li (Contributor Author)

BTW there's a similar issue in #1041

Can I take on this issue as well if there is no one working on it?

@kevinjqliu (Contributor)

Is the change I made in accordance with this option? What I've done, essentially, is use the netloc to determine the bucket region. Only when, for some reason, the region cannot be determined do we fall back to the properties configuration.

I don't think netloc can be used to determine the region. The S3 URI scheme doesn't use netloc, only the S3 URL does.
For example, here's how fs_by_scheme is typically used:

    scheme, netloc, path = self.parse_location(location)
    return PyArrowFile(
        fs=self.fs_by_scheme(scheme, netloc),

And running an example S3 URI:

    from pyiceberg.io.pyarrow import PyArrowFileIO

    scheme, netloc, path = PyArrowFileIO.parse_location("s3://a/b/c/1.txt")
    # returns ('s3', 'a', 'a/b/c/1.txt')

In order to support multiple regions, we might need to call resolve_s3_region first and pass the region to fs_by_scheme. If you look at it from S3FileSystem's perspective, we need a new S3FileSystem object per region. This relates to how the FileSystem is cached.

BTW a good test scenario could be a table whose metadata files are stored in one bucket while its data files are stored in another. We might be able to construct this test case by modifying the minio settings to create buckets in different regions; I haven't tested this yet.

@kevinjqliu (Contributor)

Can I take on this issue as well if there is no one working on it?

I don't think anyone's working on it right now; feel free to pick it up.

@jiakai-li (Contributor Author) commented Dec 20, 2024

Thank you @kevinjqliu, can I have some more guidance on this please?

I don't think netloc can be used to determine the region. The S3 URI scheme doesn't use netloc, only the S3 URL does.

I did some searching, and it seems that for the s3 scheme the format is s3://&lt;bucket-name&gt;/&lt;key-name&gt;. The netloc parsed by urlparse (and essentially passed to the _initialize_fs call) then points to the bucket-name.

In the example below, I would expect 'a' to be the netloc and also the bucket-name. Is there an exception that doesn't follow this format?

    from pyiceberg.io.pyarrow import PyArrowFileIO

    scheme, netloc, path = PyArrowFileIO.parse_location("s3://a/b/c/1.txt")
    # returns ('s3', 'a', 'a/b/c/1.txt')

BTW a good test scenario could be a table whose metadata files are stored in one bucket while its data files are stored in another. We might be able to construct this test case by modifying the minio settings to create buckets in different regions; I haven't tested this yet.

Yep, I tested the change using a similar scenario locally with my own handcrafted s3 files, but I will add more proper test cases as I make progress. Thanks again!

@kevinjqliu (Contributor)

I did some searching, and it seems that for the s3 scheme the format is s3://&lt;bucket-name&gt;/&lt;key-name&gt;. The netloc parsed by urlparse (and essentially passed to the _initialize_fs call) then points to the bucket-name.

Ah yes, you're right. Sorry for the confusion, I was thinking of something else.
Using netloc, we're essentially mapping an S3 bucket to its region. And this is fine because each bucket belongs to a specific region; perhaps we can cache this mapping.
This makes sense to me!

@kevinjqliu (Contributor)

BTW there are 2 FileIO implementations, one for pyarrow, another for fsspec.

We might want to do the same for fsspec:

    def _s3(properties: Properties) -> AbstractFileSystem:
        from s3fs import S3FileSystem

        client_kwargs = {
            "endpoint_url": properties.get(S3_ENDPOINT),
            "aws_access_key_id": get_first_property_value(properties, S3_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
            "aws_secret_access_key": get_first_property_value(properties, S3_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
            "aws_session_token": get_first_property_value(properties, S3_SESSION_TOKEN, AWS_SESSION_TOKEN),
            "region_name": get_first_property_value(properties, S3_REGION, AWS_REGION),

@jiakai-li (Contributor Author)

Sweet, I'll go ahead with this approach then. Thanks very much @kevinjqliu!

@jiakai-li (Contributor Author) commented Dec 23, 2024

BTW there are 2 FileIO implementations, one for pyarrow, another for fsspec.

We might want to do the same for fsspec:

    def _s3(properties: Properties) -> AbstractFileSystem:
        from s3fs import S3FileSystem

        client_kwargs = {
            "endpoint_url": properties.get(S3_ENDPOINT),
            "aws_access_key_id": get_first_property_value(properties, S3_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
            "aws_secret_access_key": get_first_property_value(properties, S3_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
            "aws_session_token": get_first_property_value(properties, S3_SESSION_TOKEN, AWS_SESSION_TOKEN),
            "region_name": get_first_property_value(properties, S3_REGION, AWS_REGION),

Hi @kevinjqliu, regarding the above concern, I tested it locally and also did some investigation. According to what I found here, it seems fsspec doesn't have the same issue as pyarrow. So I guess we can leave it?

@kevinjqliu (Contributor)

According to what I found here, it seems fsspec doesn't have the same issue as pyarrow. So I guess we can leave it?

Wow, that's interesting, I didn't know about that. I like that solution :) Hopefully pyarrow fs will have this feature one day.

Whenever a new bucket is used, it will first find out which region it belongs to and then use the client for that region.

@kevinjqliu (Contributor) left a comment

Some more comments, thanks for working on this!

@jiakai-li (Contributor Author) commented Dec 24, 2024

This PR is ready for review now. Thanks very much, and Merry Christmas! Please let me know if any further changes are required.

@jiakai-li marked this pull request as ready for review December 24, 2024 02:23