Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass include/exclude args to semgrep run #80

Closed
wants to merge 8 commits into from
Closed

Conversation

clavedeluna
Copy link
Contributor

@clavedeluna clavedeluna commented Oct 17, 2023

Overview

Like codemodder, semgrep also has include/exclude cli args that we are now going to pass along to improve performance

Description

  • I'll immediately point out that I was eager to add unit tests, but as soon as I added the flags (initially incorrectly) all tests exploded until I correctly added them. So I think as far as testing we're good since semgrep is used in every single test run.
  • semgrep can take multiple of these patterns as --include=foo.* --include=bar.*', just like we do with --config

new scope
while working on this PR new scope was added and some ticketed:

  • we decided that the correct behavior is that incluide/exclude should be relative to project directory, not global
  1. this means if we call codemodder tests/samples then a correct includes path should be --path-include=unverified_request.py NOT --path-include=tests/samples/unverified_request.py, for example
  2. this also meant passing the parent path to semgrep. I discovered some really weird discrepancies which led me to realize we should only pass parent path to exclude, not to include. It appears that semgrep's internals have some intersecting behavior between include/excludes, which I could not pin point but could at least determine what we should not be passing to it to get the behavor we want.
  3. To correctly handle fnmatch behavior, if a pattern is like "**.py" we concat without a "/"

Follow up work:

  1. We had unit tests that mimic running codemodder . ...args. We don't correctly handle this type of relative path right now so I will ticket it.
  2. We are currently passing all include/exclude patterns to semgrep, but in fact we should be handling patterns with line numbers differently. This behavior was clarified here Clarify the behavior of path pattern matching codemodder-specs#10 and we will work on it soon.

@clavedeluna clavedeluna changed the title Semgrep inc exl Pass include/exclude args to semgrep run Oct 17, 2023
@clavedeluna clavedeluna marked this pull request as ready for review October 17, 2023 19:36
@codecov
Copy link

codecov bot commented Oct 17, 2023

Codecov Report

Merging #80 (791a81b) into main (09b0795) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #80      +/-   ##
==========================================
+ Coverage   95.70%   95.72%   +0.01%     
==========================================
  Files          48       48              
  Lines        1957     1964       +7     
==========================================
+ Hits         1873     1880       +7     
  Misses         84       84              
Files Coverage Δ
src/codemodder/code_directory.py 100.00% <100.00%> (ø)
src/codemodder/codemodder.py 98.00% <100.00%> (-0.02%) ⬇️
src/codemodder/context.py 97.01% <100.00%> (+0.13%) ⬆️
src/codemodder/semgrep.py 95.65% <100.00%> (+0.91%) ⬆️

Copy link
Contributor

@andrecsilva andrecsilva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but just as a sanity check how does it behave when you pass a line include/exclude with a line number (e.g. path/to/code.py:42)

@clavedeluna
Copy link
Contributor Author

good catch, for some reason I thought they were separate flags, need to test that out.

if not pat.startswith("*")
else parent_path + pat
for pat in patterns
]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to combine and avoid doing multiple list comp operations but it was honestly easier to iterate and easier to understand separating it. We can refactor later on with a generator or something.

Also, I don't feel super confident with this if not pat.startswith("*") logic, but tests show it works right now. happy to revisit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion:
I think you can use a combination of glob and pathlib to solve issues with globs and relative paths

full_path = (Path(parent_path) / Path(pat)).resolve()
files = glob.glob(str(full_path))

There are two caveats: (1) resolve will also make it absolute, we may need to handle the current directory carefully, (2) glob will return actual files in the filesystem which is slower than fnmatch

files = match_files(
dir_structure, exclude_paths=["**/samples/**", "**/more_samples/**"]
)
self._assert_expected(files, expected)

def test_include_test_overridden_by_default_excludes(self, mocker):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are using parent_dir as "." which does not work as expected. We will revisit

@clavedeluna
Copy link
Contributor Author

@andrecsilva @drdavella fyi I've request re-review

Copy link
Member

@drdavella drdavella left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like I'm just becoming more confused by this. I thought that each pattern should just be joined to the parent path and used that way consistently both internally and when passed to semgrep. If that's not working for some reason, I think we need to revisit the design.


# TODO: handle case when parent path is "."
patterns = [
str(Path(parent_path) / Path(pat))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably faster with os.path.join

# TODO: handle case when parent path is "."
patterns = [
str(Path(parent_path) / Path(pat))
if not pat.startswith("*")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition doesn't make sense to me. Do tests fail without it? This means that the behavior is going to be entirely dependent on the presence or absence of a trailing slash in parent_path, which I don't think is what we want. If my pattern is *.py I want it to mean /path/to/foo/*.py and not /path/to/foo*.py.

I'm wondering whether removing this also correctly handles the . case, although that would also be fixed by normalizing all of the given paths using os.path.abspath.

command.extend(
itertools.chain.from_iterable(
map(
lambda f: ["--exclude", f"{execution_context.directory}{f}"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like there should be a path separator here.

)
)
if execution_context.path_include:
# Note: parent path is not passed with --include
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this should be the case.

@drdavella
Copy link
Member

@clavedeluna I appreciate your work on this and am sorry that it became a bit of a tar bit. However, I think we're going to pursue a slightly different optimization here so I'm closing this PR (for now).

@drdavella drdavella closed this Oct 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants