
Race condition in botocore's use of JSONFileCache for credential loading #3213

Open
mikelambert opened this issue Jun 27, 2024 · 8 comments
Labels
bug This issue is a confirmed bug. credentials needs-review This issue or pull request needs review from a core team member. p2 This is a standard priority issue

Comments

@mikelambert

Describe the bug

I am getting a rare crash when using botocore to load credentials.

It's a JSONDecodeError: Expecting value: line 1 column 1 (char 0) raised from JSONFileCache.__getitem__, which implies the JSON file is empty (no value found at char 0). Someone pointed out that it might be a race condition in JSONFileCache.__setitem__, which does roughly:

            f.truncate()
            f.write(file_content)

We have multiple concurrent processes starting on a box, each using botocore, so this may simply be a race condition where one of them happens to read the file after the truncate but before the write. I'm not sure whether a flock, a write-then-rename, or something else would be the proper solution here.
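
To illustrate, here is a minimal single-process sketch of that window (not botocore code; the path is just for the example). A reader that opens the cache file after the truncate but before the write sees an empty file and gets exactly this error:

import json

path = "/tmp/jsonfilecache-demo.json"  # illustrative path, not botocore's real cache dir

with open(path, "w") as writer:        # opening with "w" truncates, like f.truncate()
    try:
        with open(path) as reader:     # a reader arriving before writer.write()
            json.load(reader)
    except json.JSONDecodeError as e:
        print(e)                       # Expecting value: line 1 column 1 (char 0)
    writer.write('{"key": 10}')        # the write lands too late for that reader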

Expected Behavior

It shouldn't crash

Current Behavior

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
  File "botocore/utils.py", line 3518, in __getitem__
    return json.load(f)
  File "__init__.py", line 293, in load
    return loads(fp.read(),
  File "__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None


  File "botocore/client.py", line 553, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "botocore/client.py", line 989, in _make_api_call
    http, parsed_response = self._make_request(
  File "botocore/client.py", line 1015, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "botocore/endpoint.py", line 198, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "botocore/endpoint.py", line 134, in create_request
    self._event_emitter.emit(
  File "botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "botocore/signers.py", line 105, in handler
    return self.sign(operation_name, request)
  File "botocore/signers.py", line 186, in sign
    auth = self.get_auth_instance(**kwargs)
  File "botocore/signers.py", line 301, in get_auth_instance
    frozen_credentials = credentials.get_frozen_credentials()
  File "botocore/credentials.py", line 634, in get_frozen_credentials
    self._refresh()
  File "botocore/credentials.py", line 522, in _refresh
    self._protected_refresh(is_mandatory=is_mandatory_refresh)
  File "botocore/credentials.py", line 538, in _protected_refresh
    metadata = self._refresh_using()
  File "botocore/credentials.py", line 685, in fetch_credentials
    return self._get_cached_credentials()
  File "botocore/credentials.py", line 693, in _get_cached_credentials
    response = self._load_from_cache()
  File "botocore/credentials.py", line 711, in _load_from_cache
    creds = deepcopy(self._cache[self._cache_key])
  File "botocore/utils.py", line 3520, in __getitem__
    raise KeyError(cache_key)

Reproduction Steps

Hard to reproduce, and I haven't tried myself. I assume thousands of processes recreating the botocore cached credentials file would do it.

Possible Solution

Perhaps a flock, or a write-to-temp-file-then-rename-to-destination, instead of truncate-then-write?
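
For example, a write-to-temp-then-rename could look roughly like this (a sketch only, not botocore's actual code; os.replace is atomic on POSIX, so readers see either the old complete file or the new one, never a partial write):

import json
import os
import tempfile


def atomic_json_write(path, value):
    # Write to a temp file in the same directory, then atomically swap it in.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(value, f)
        os.replace(tmp_path, path)  # readers never observe a half-written file
    except Exception:
        os.unlink(tmp_path)
        raise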

Additional Information/Context

No response

SDK version used

1.34.42

Environment details (OS name and version, etc.)

AWS instance running x86_64 GNU/Linux

@mikelambert mikelambert added bug This issue is a confirmed bug. needs-triage This issue or PR still needs to be triaged. labels Jun 27, 2024
@Laurent-Dx

Laurent-Dx commented Jun 27, 2024

The following does reproduce the issue instantly:

from threading import Thread
from botocore.utils import JSONFileCache

cache = JSONFileCache()


def f():
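    # Hammer the same cache entry from two threads; a read eventually lands
    # between truncate() and write() and hits an empty or partial file.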
    for i in range(100000):
        cache["key"] = 10
        cache["key"]


threads = []
for i in range(2):
    thread = Thread(target=f, name=f"Thread-{i}")
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

Adding a lock in each of __getitem__, __delitem__, __setitem__ as you suggested does appear to solve the issue.
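
Roughly like this (a sketch only, with illustrative names; note it only serializes threads within one process, not separate processes):

import threading

from botocore.utils import JSONFileCache


class LockedJSONFileCache(JSONFileCache):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._lock = threading.Lock()

    def __getitem__(self, cache_key):
        with self._lock:
            return super().__getitem__(cache_key)

    def __delitem__(self, cache_key):
        with self._lock:
            super().__delitem__(cache_key)

    def __setitem__(self, cache_key, value):
        with self._lock:
            super().__setitem__(cache_key, value)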

@tim-finnigan tim-finnigan self-assigned this Jun 28, 2024
@tim-finnigan
Contributor

Thanks for reaching out. The error you referenced was also reported here: #3106. I was advised that an error message like this could help clarify the behavior here: https://github.com/boto/botocore/pull/3183/files. Does deleting the cache file fix this for you?

@tim-finnigan tim-finnigan added response-requested Waiting on additional info and feedback. and removed needs-triage This issue or PR still needs to be triaged. labels Jun 28, 2024
@mikelambert
Author

Yeah, wrapping this in a retry in our app code does work (and is what I did in parallel with filing this bug). Apologies for not realizing it was previously reported. And thank you, Laurent, for the multi-threaded repro! (Our problematic setup is multi-process, which is why I didn't propose in-process locks.)
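
For reference, the retry is roughly shaped like this (a sketch with illustrative names, not our exact code; which exception surfaces depends on the call path, so it catches both the JSONDecodeError and the KeyError that botocore's __getitem__ converts it into):

import time
from json import JSONDecodeError

import boto3


def call_with_retry(fn, attempts=3, delay=0.2):
    # Retry a call that occasionally fails because another process is
    # mid-write on the shared credential cache file.
    for attempt in range(attempts):
        try:
            return fn()
        except (JSONDecodeError, KeyError):
            if attempt == attempts - 1:
                raise
            time.sleep(delay)


s3 = boto3.client("s3")
buckets = call_with_retry(lambda: s3.list_buckets())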

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. label Jun 29, 2024
@tim-finnigan
Contributor

Thanks for following up and confirming. I'll go ahead and close this as a duplicate, and we'll continue tracking the issue in #3106.

@tim-finnigan tim-finnigan closed this as not planned Jul 2, 2024
@tim-finnigan tim-finnigan added the duplicate This issue is a duplicate. label Jul 2, 2024

github-actions bot commented Jul 2, 2024

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.

@tim-finnigan tim-finnigan reopened this Jul 11, 2024
@tim-finnigan tim-finnigan removed their assignment Jul 11, 2024
@tim-finnigan tim-finnigan added credentials needs-review This issue or pull request needs review from a core team member. and removed duplicate This issue is a duplicate. labels Jul 11, 2024
@sfc-gh-abalik

I'm hitting this same issue. In my case I'm spawning 10 aws s3 cp processes concurrently from Go code using goroutines, and a few of them fail with the JSONDecodeError while the others succeed.

From what I can tell, this only happens when the cache file doesn't already exist, so I'm working around it by making a dummy AWS CLI call like aws sts get-caller-identity to force the cache file to be written to disk before I spawn the concurrent commands.
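
In Python terms the warm-up amounts to something like this (a sketch with placeholder bucket and object names; my actual code does the equivalent from Go):

import subprocess
from concurrent.futures import ThreadPoolExecutor

# One serial call up front forces the CLI/botocore to create the credential
# cache file, so the concurrent copies only read it instead of racing to write it.
subprocess.run(["aws", "sts", "get-caller-identity"], check=True, capture_output=True)

copies = [
    ["aws", "s3", "cp", f"s3://example-bucket/object-{i}", f"/tmp/object-{i}"]
    for i in range(10)
]
with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(lambda cmd: subprocess.run(cmd, check=True), copies))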

@tim-finnigan tim-finnigan added the p2 This is a standard priority issue label Jul 31, 2024

@cg505

cg505 commented Dec 11, 2024

This happens to me frequently (more than once a week) while using skypilot, which makes heavy use of the botocore library for the AWS API, often in parallel across multiple processes. It is not a temporary problem: the cache JSON becomes malformed and all future attempts crash until I manually fix or delete the malformed file.

The nature of the broken file is like this:

{"startUrl": "https://<redacted>.awsapps.com/start", "region": "us-west-2", "accessToken": "<redacted>", "expiresAt": "2024-12-12T23:02:38Z", "clientId": "<redacted>", "clientSecret": "<redacted>", "registrationExpiresAt": "2025-01-20T23:08:06Z", "refreshToken": "<redacted>"}"}

There are always one or two extra characters at the end of the file. I believe this is likely due to the race condition mentioned in the OP (see the sketch after this list):

  • process 1 calls f.truncate()
  • process 2 calls f.truncate()
  • process 1 calls f.write()
  • process 2 calls f.write() (contents are 1-2 chars shorter than what process 1 wrote)
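
A standalone sketch of that interleaving (not botocore code; the values are placeholders) reproduces the trailing-garbage shape:

import json
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "cache-race-demo.json")

longer = json.dumps({"accessToken": "AAAA", "expiresAt": "2024-12-12T23:02:38Z"})
shorter = json.dumps({"accessToken": "AA"})

f1 = open(path, "w")   # process 1: open + truncate
f2 = open(path, "w")   # process 2: open + truncate
f1.write(longer)
f1.flush()             # process 1's longer content hits the file first
f2.write(shorter)
f2.flush()             # process 2 overwrites from offset 0 with shorter content
f1.close()
f2.close()

print(open(path).read())  # shorter JSON followed by the tail of the longer write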
