Pass include/exclude args to semgrep run #80

clavedeluna · 2023-10-17T19:29:46Z

Overview

Like codemodder, semgrep has include/exclude CLI args. We are now going to pass ours along to semgrep to improve performance.

Description

I'll point out up front that I was eager to add unit tests, but as soon as I added the flags (initially incorrectly) all tests exploded until I added them correctly. So I think we're good as far as testing goes, since semgrep is used in every single test run.
semgrep can take multiple of these patterns, as in --include=foo.* --include=bar.*, just like we do with --config.
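
As a rough illustration of repeating the flags, here is a minimal sketch of assembling such a command. The function name `build_semgrep_command` and its parameters are hypothetical and not taken from this PR. The sketch assumes the `semgrep scan` subcommand and passes each pattern as a separate `--include`/`--exclude` argument, the same way repeated `--config` values are passed.

```python
from __future__ import annotations


def build_semgrep_command(
    target: str,
    configs: list[str],
    include: list[str],
    exclude: list[str],
) -> list[str]:
    """Build a semgrep argv list, repeating each flag once per value."""
    command = ["semgrep", "scan"]
    for config in configs:
        command.extend(["--config", config])
    for pat in include:
        command.extend(["--include", pat])
    for pat in exclude:
        command.extend(["--exclude", pat])
    # The directory to scan goes last.
    command.append(target)
    return command


cmd = build_semgrep_command(
    "tests/samples", ["rules.yaml"], ["foo.*", "bar.*"], ["*_test.py"]
)
print(cmd)
```

The resulting list can be handed to `subprocess.run` as-is, which avoids any shell quoting of the glob characters.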

New scope
While working on this PR, new scope was added and some of it was ticketed:

We decided that the correct behavior is that include/exclude patterns should be relative to the project directory, not global.

This means that if we call codemodder tests/samples, then a correct include path is --path-include=unverified_request.py, NOT --path-include=tests/samples/unverified_request.py, for example.
This also meant passing the parent path to semgrep. I discovered some really weird discrepancies, which led me to realize we should only pass the parent path to exclude, not to include. It appears that semgrep's internals have some intersecting behavior between includes and excludes, which I could not pinpoint, but I could at least determine what we should not pass to it to get the behavior we want.
To correctly handle fnmatch behavior, if a pattern is like "**.py" we concat without a "/"
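
The joining rule described above can be sketched roughly as follows. This is an illustration of the rule as stated in this description, not the code in the PR: `anchor_patterns` is a hypothetical name, and `PurePosixPath` is used only to keep the example's output the same on every platform.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def anchor_patterns(parent_path: str, patterns: list[str]) -> list[str]:
    """Prefix each project-relative pattern with the parent path."""
    return [
        # Patterns that start with "*" are concatenated without a "/".
        parent_path + pat
        if pat.startswith("*")
        # Everything else is joined as a regular path component.
        else str(PurePosixPath(parent_path) / pat)
        for pat in patterns
    ]


anchored = anchor_patterns("tests/samples", ["unverified_request.py", "**.py"])
print(anchored)

# fnmatch's "*" also matches "/", so the concatenated pattern still
# matches files nested under the parent path.
print(fnmatchcase("tests/samples/sub/mod.py", anchored[1]))
```

Note that the concatenated form depends on whether `parent_path` carries a trailing slash, which is the concern raised later in review.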

Follow up work:

We had unit tests that mimic running codemodder . ...args. We don't correctly handle this type of relative path right now, so I will ticket it.
We are currently passing all include/exclude patterns to semgrep, but in fact we should be handling patterns with line numbers differently. This behavior was clarified in "Clarify the behavior of path pattern matching" (codemodder-specs#10), and we will work on it soon.
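
To make the line-number follow-up concrete, here is a small sketch of separating a trailing line number from a pattern before anything is passed to semgrep. The helper `split_line_number` is hypothetical. How the line number should then be used is defined by the spec issue mentioned above and is not covered here.

```python
from __future__ import annotations


def split_line_number(pattern: str) -> tuple[str, int | None]:
    """Split "path/to/code.py:42" into ("path/to/code.py", 42).

    Patterns without a numeric ":<line>" suffix are returned unchanged,
    with None in place of the line number.
    """
    path, sep, suffix = pattern.rpartition(":")
    if sep and suffix.isdigit():
        return path, int(suffix)
    return pattern, None


print(split_line_number("path/to/code.py:42"))
print(split_line_number("**/samples/**"))
```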

codecov · 2023-10-17T19:45:09Z

Codecov Report

Merging #80 (791a81b) into main (09b0795) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main      #80      +/-   ##
==========================================
+ Coverage   95.70%   95.72%   +0.01%     
==========================================
  Files          48       48              
  Lines        1957     1964       +7     
==========================================
+ Hits         1873     1880       +7     
  Misses         84       84

Files	Coverage Δ
src/codemodder/code_directory.py	`100.00% <100.00%> (ø)`
src/codemodder/codemodder.py	`98.00% <100.00%> (-0.02%)`	⬇️
src/codemodder/context.py	`97.01% <100.00%> (+0.13%)`	⬆️
src/codemodder/semgrep.py	`95.65% <100.00%> (+0.91%)`	⬆️

andrecsilva

Looks good, but just as a sanity check: how does it behave when you pass an include/exclude pattern with a line number (e.g. path/to/code.py:42)?

clavedeluna · 2023-10-18T10:40:58Z

Good catch. For some reason I thought they were separate flags; I need to test that out.

clavedeluna · 2023-10-24T11:38:51Z

src/codemodder/code_directory.py

+        if not pat.startswith("*")
+        else parent_path + pat
+        for pat in patterns
+    ]


I'd like to combine these and avoid doing multiple list comprehensions, but it was honestly easier to iterate on and easier to understand when separated. We can refactor later on with a generator or something.

Also, I don't feel super confident about this if not pat.startswith("*") logic, but tests show it works right now. Happy to revisit.

Suggestion:
I think you can use a combination of glob and pathlib to solve issues with globs and relative paths:

    full_path = (Path(parent_path) / Path(pat)).resolve()
    files = glob.glob(str(full_path))

There are two caveats: (1) resolve will also make the path absolute, so we may need to handle the current directory carefully; (2) glob returns actual files from the filesystem, which is slower than fnmatch.
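
A self-contained version of this suggestion might look like the sketch below. The function name `expand_with_glob` and the temporary directory contents are made up for illustration. The example also shows the first caveat: the returned paths are absolute because of `resolve()`.

```python
from __future__ import annotations

import glob
import tempfile
from pathlib import Path


def expand_with_glob(parent_path: str, pat: str) -> list[str]:
    """Expand a project-relative glob into the files that exist on disk."""
    full_path = (Path(parent_path) / Path(pat)).resolve()
    return sorted(glob.glob(str(full_path)))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical project layout: one Python file, one text file.
    (Path(tmp) / "a.py").write_text("")
    (Path(tmp) / "b.txt").write_text("")

    found = expand_with_glob(tmp, "*.py")
    # resolve() makes the results absolute, so compare against the
    # resolved temporary directory rather than the raw tmp string.
    expected = [str(Path(tmp).resolve() / "a.py")]
    print(found == expected)
```

Because `glob.glob` walks the filesystem, this only reports files that actually exist, whereas an fnmatch-based check works on any list of candidate path strings.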

clavedeluna · 2023-10-24T11:40:07Z

tests/test_code_directory.py

        files = match_files(
            dir_structure, exclude_paths=["**/samples/**", "**/more_samples/**"]
        )
        self._assert_expected(files, expected)

-    def test_include_test_overridden_by_default_excludes(self, mocker):


These tests use "." as parent_dir, which does not work as expected. We will revisit.

clavedeluna · 2023-10-24T13:46:14Z

@andrecsilva @drdavella FYI, I've requested re-review.

drdavella

I feel like I'm just becoming more confused by this. I thought that each pattern should just be joined to the parent path and used that way consistently both internally and when passed to semgrep. If that's not working for some reason, I think we need to revisit the design.

drdavella · 2023-10-24T13:40:08Z

src/codemodder/code_directory.py

+
+    # TODO: handle case when parent path is "."
+    patterns = [
+        str(Path(parent_path) / Path(pat))


This is probably faster with os.path.join

drdavella · 2023-10-24T13:41:48Z

src/codemodder/code_directory.py

+    # TODO: handle case when parent path is "."
+    patterns = [
+        str(Path(parent_path) / Path(pat))
+        if not pat.startswith("*")


This condition doesn't make sense to me. Do tests fail without it? This means that the behavior is going to be entirely dependent on the presence or absence of a trailing slash in parent_path, which I don't think is what we want. If my pattern is *.py I want it to mean /path/to/foo/*.py and not /path/to/foo*.py.

I'm wondering whether removing this also correctly handles the . case, although that would also be fixed by normalizing all of the given paths using os.path.abspath.
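
A minimal sketch of the normalization being suggested here, assuming every pattern is always joined to the parent path. `anchor_pattern` is a hypothetical helper, not code from this PR.

```python
import os.path


def anchor_pattern(parent_path: str, pat: str) -> str:
    """Join a pattern to the normalized, absolute parent path."""
    # abspath drops any trailing slash and resolves "." against the
    # current working directory, so the result no longer depends on
    # how parent_path was spelled.
    return os.path.join(os.path.abspath(parent_path), pat)


# With and without a trailing slash, the anchored pattern is the same.
print(anchor_pattern("/path/to/foo", "*.py") == anchor_pattern("/path/to/foo/", "*.py"))

# Plain string concatenation is what produces the unwanted form.
print("/path/to/foo" + "*.py")

# "." is anchored to the current working directory.
print(anchor_pattern(".", "*.py"))
```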

drdavella · 2023-10-24T13:45:15Z

src/codemodder/semgrep.py

+            command.extend(
+                itertools.chain.from_iterable(
+                    map(
+                        lambda f: ["--exclude", f"{execution_context.directory}{f}"],


It feels like there should be a path separator here.
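
One way to get the separator, sketched under the assumption that each exclude pattern is simply joined to the scan directory. `exclude_args` is a hypothetical stand-in for the `map`/`lambda` expression in the diff, and `directory` stands in for `execution_context.directory`.

```python
import itertools
import os.path


def exclude_args(directory: str, patterns: list[str]) -> list[str]:
    """Return a flat ["--exclude", path, ...] list with joined paths."""
    return list(
        itertools.chain.from_iterable(
            # os.path.join inserts the separator only when it is missing.
            ("--exclude", os.path.join(directory, pat))
            for pat in patterns
        )
    )


print(exclude_args("proj/", ["a.py"]))
```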

drdavella · 2023-10-24T13:45:41Z

src/codemodder/semgrep.py

+                )
+            )
+        if execution_context.path_include:
+            # Note: parent path is not passed with --include


I don't understand why this should be the case.

drdavella · 2023-10-30T01:02:43Z

@clavedeluna I appreciate your work on this and am sorry that it became a bit of a tar pit. However, I think we're going to pursue a slightly different optimization here, so I'm closing this PR (for now).

clavedeluna changed the title ~~Semgrep inc exl~~ Pass include/exclude args to semgrep run Oct 17, 2023

clavedeluna marked this pull request as ready for review October 17, 2023 19:36

clavedeluna requested review from drdavella and andrecsilva as code owners October 17, 2023 19:36

drdavella approved these changes Oct 18, 2023

andrecsilva approved these changes Oct 18, 2023

clavedeluna added 8 commits October 24, 2023 08:17

context gets all args

482c5ec

add flags to semgrep

6279bc0

remove import

2b8c7c5

typehint for lists

71d9d69

unit test with defaults

396ee70

filter files with parent path

3ce7a3b

semgrep exclude pats get parent dir

1817467

handle * v just path

791a81b

clavedeluna force-pushed the semgrep-inc-exl branch from cc783bb to 791a81b Compare October 24, 2023 11:20

clavedeluna requested review from drdavella and andrecsilva October 24, 2023 11:37

clavedeluna commented Oct 24, 2023

drdavella requested changes Oct 24, 2023

clavedeluna added the BLOCKED label Oct 26, 2023

drdavella closed this Oct 30, 2023
