Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise exception when open with write mode in call stack #140

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

jayqi
Copy link
Member

@jayqi jayqi commented Apr 3, 2021

A pass at trying to detect when __fspath__ is called by open with a write mode. IMO it doesn't work all that well. It appears open is a weird function that doesn't have its own frame in the call stack. It jumps from __fspath__'s frame directly to the frame that calls open. I don't think there's a straightforward way to get object references from the frames (it's bytecode or source), so also hard to reliably detect if open is being called on the cloud path instance, or to pick up on what mode open is being called with.

@github-actions
Copy link
Contributor

github-actions bot commented Apr 3, 2021

@codecov
Copy link

codecov bot commented Apr 3, 2021

Codecov Report

Merging #140 (e6c2bae) into master (ecf787e) will decrease coverage by 0.0%.
The diff coverage is 91.8%.

@@           Coverage Diff            @@
##           master    #140     +/-   ##
========================================
- Coverage    93.5%   93.5%   -0.1%     
========================================
  Files          21      21             
  Lines        1119    1154     +35     
========================================
+ Hits         1047    1079     +32     
- Misses         72      75      +3     
Impacted Files Coverage Δ
cloudpathlib/cloudpath.py 90.6% <91.1%> (+<0.1%) ⬆️
cloudpathlib/client.py 87.5% <100.0%> (ø)
cloudpathlib/exceptions.py 100.0% <100.0%> (ø)
cloudpathlib/s3/s3client.py 92.8% <0.0%> (ø)

# p,
# "w",
# ) as f:
# pass
Copy link
Member Author

@jayqi jayqi Apr 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case fails because the test frame's code_context looks like this:

['        with open(\n']

Not sure what to do about it.

@pjbull
Copy link
Member

pjbull commented Apr 3, 2021

Ok @jayqi, this really sent me down a 🐰 🕳️ ...

Here's a notebook showing a few more options:
https://gist.github.com/pjbull/1b8f92b84a40cd6fe2033468f98594a4

Thoughts on those?

@jayqi
Copy link
Member Author

jayqi commented Apr 3, 2021

Wow, well done. 👏 It looks to me like you're capturing the main cases. I think grabbing the entire source of the frame and parsing the AST is probably the only reliable way to do this. One concern I have is how much overhead this adds to each call, and how it scales with the frame's size.

@pjbull
Copy link
Member

pjbull commented Apr 3, 2021

Yeah, that's exactly my concern too. I'll see if I can add some benchmarking. No clue how long those things take....

@pjbull
Copy link
Member

pjbull commented Apr 4, 2021

OK, I updated that gist above with some benchmarking. Here are my takeaways:

  • Adding the check makes the __fspath__ take ~10x the time of having no check
  • On my machine, the slowdown is in the same ballpark as writing a 48KB file or reading a 1.6MB on. As file sizes grow larger than those, the slowdown will start to be overwhelmed by the cost of reading/writing.
  • inspect.getsource is the most expensive call. I didn't track down a faster call for this in some poking around.
  • Neither AST parsing nor regex change execution time appreciably, even for reasonably large code contexts, so choosing one of those should not be based on performance.

So, I think if we go with Option 1 as outlined in #128 we should:
(1) add the AST version based on getcurrentframe which is fast and robust as compared to the other options, and (2) implement an environment variable to allow opting out of the performance hit if you are sure you don't use open with cloudpaths.

For me, options 2-4 as outlined in #128 are also still on the table given this research.

@jayqi
Copy link
Member Author

jayqi commented Apr 4, 2021

Do you think inspect.getsource might have a different speed if it's getting the source from an installed version of cloudpathlib? (Editable and normal)

6 ms wall time per run isn't untenably slow but I guess not trivial. Adds a minute to 10,000 files.

@pjbull
Copy link
Member

pjbull commented Apr 4, 2021

Do you think inspect.getsource might have a different speed if it's getting the source from an installed version of cloudpathlib? (Editable and normal)

I think since the frame we want to look at is from the caller's code, it will depend on how their code is installed so not sure that we can speed it up.

I did just poke at bytecode disassembly with dis as well, and that gave me the same order of magnitude as getsource (~5ms).

Agreed that 6ms isn't large enough to rule this approach out.

I think to improve the user friendliness of working with paths, I'm inclined towards this change.

@pjbull
Copy link
Member

pjbull commented Apr 12, 2021

@jayqi Check out the latest commits. This seems to work, but it does add a ton of complex code for a narrow-ish use case.

Copy link
Member Author

@jayqi jayqi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Pull request authors can't request changes on their own pull request." 😂 🙄

cloudpathlib/cloudpath.py Outdated Show resolved Hide resolved


# entire function is passed as source, so check separately
# that all of the built in open write modes fail
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is interesting. I thought it was a feature of my previous version of the test that the function would have both safe and unsafe opens, and only fail for the unsafe one. Is it the case that if we had a safe open first and unsafeopen second in the same function, it would fail on the first one because the checking code would see the second open?

Anyways, I think some additional things we want to test:

  • open(pathlib_path, "w") and open(cloud_path, "w")` in the same function
  • monkeypatching that constant to False disables the check

cloudpathlib/cloudpath.py Outdated Show resolved Hide resolved
@pjbull
Copy link
Member

pjbull commented Apr 14, 2021

Another rev here. I thought that with the AST we lost the line numbers for matching writeable opens we detected and the entry point from calls, but it turns out that with a little bit of tracking, we can use both the AST approach and raise the error on the right line.

For example, here is the right trace in a notebook:
image

Tests path locally on 3.8 on linux as well, so there is likely some fix needed for Python<3.8

Small bonus fix: silence warnings in tests by checking for attributes in __del__ methods

@jayqi
Copy link
Member Author

jayqi commented Apr 14, 2021

Uh oh, did we end up using a Python 3.8-only feature? It looks like the 3.8 build passed but 3.6 and 3.7 failed.

@pjbull
Copy link
Member

pjbull commented Apr 14, 2021

Yeah, see above:

Tests path locally on 3.8 on linux as well, so there is likely some fix needed for Python<3.8

I'm seeing differences in what line inspect reports for a given frame across versions. Need to dig in more to find a fix.

@pjbull
Copy link
Member

pjbull commented May 14, 2021

OK, I think this is ready for review. It is passing all the tests.

I do think that this is a pretty substantial amount of complexity to add for maitenance, so we may want to turn this off by default if we get lots of bug reports that come up in this code block.

I wouldn't be surprised if there are cases where the inspect calls don't do what we expect given all the different ways one can run Python.

Performance update


Here's info on timing for this implementation:

Script:

from cloudpathlib import CloudPath
import statistics
import time


if __name__ == "__main__":
    p = CloudPath('s3://cloudpathlib-test-bucket/5bmg9iTs74wmewasLTP44W-test_cloudpath_file_io-test_file_discovery/dir_0/file0_0.txt')

    it = 100
    res = []

    for _ in range(it):
        start = time.time()

        with open(p, "r"):
            pass

        res.append(time.time() - start)

    print(f"Median time {it} iterations: {statistics.median(res)}")

Results:

Around +0.05s on average for each call to open

❯ CLOUDPATHLIB_CHECK_UNSAFE_OPEN=True python timing.py
Median time 100 iterations: 0.2683759927749634

❯ CLOUDPATHLIB_CHECK_UNSAFE_OPEN=False python timing.py
Median time 100 iterations: 0.2952016592025757

@pjbull pjbull marked this pull request as ready for review May 14, 2021 20:47
@pjbull pjbull changed the title [WIP] Raise exception when open with write mode in call stack Raise exception when open with write mode in call stack May 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants