Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SBOM cataloger #1029

Merged
merged 9 commits into from
Nov 16, 2022
Merged

SBOM cataloger #1029

merged 9 commits into from
Nov 16, 2022

Conversation

patrikbeno
Copy link
Contributor

Supports ad hoc binary application layouts:

CI saves CycloneDX SBOM JSON in docker image or application deployment/installation folder.

During image/directory scanning, Syft does not need to rely on specific layouts supported by specific catalogers.

Instead, application may use whatever layout and format it needs, and provides SBOM JSON artifact for integration with Syft or other tools.

Typical scenario is a node/react/webpack build folder with static content for web server. This particular use case can be supported by providing npm/yarn lock file and enabling javascript-lock-cataloger (#1022).

SBOM cataloger provides more generic approach to this problem.

@sambhav
Copy link
Contributor

sambhav commented Jun 5, 2022

Related #737

We should catalog sboms using known file extensions.

Currently each format that syft supports has a registered file name/extension.

spdx - *.spdx.json, *.spdx.xml, *.spdx
CycloneDX - *.cdx.json, *.cdx.xml and sbom.json/sbom.xml
syft - *.syft.json

We should use the above file patterns for the sbom cataloger.

@sambhav
Copy link
Contributor

sambhav commented Jun 5, 2022

A larger question re:design is that we have now introduced a dependency on our internal formatting library on the cataloger. The way that syft is currently structured, we have kept the formatters as a separate layer of logic that build on top of the syft sbom model. This is probably something we might want to think about as we introduce dependencies between the formatters and the catalogers.

The PR also currently only parses cyclonedx SBOMs, we should probably extend it to the other formats syft supports if we agree with the proposed design.

One of the other sticking points we have discussed in the past when we have talked about cataloging sboms is providing links back to the original sbom as evidence. Currently syft is able to provide details as to 'why' it included a specific component in the output. One of the blockers to implementing #737 has been figuring out an appropriate way to represent that syft itself didn't discover the cataloged components, rather it pulled them from an sbom present in the image.

cc: @wagoodman, @luhring

// NewSBOMCataloger returns a new SBOM cataloger object loaded from saved SBOM JSON.
func NewSBOMCataloger() *common.GenericCataloger {
globParsers := map[string]common.ParserFn{
"**/deps.json": parseSBOM,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot find this pattern in the recognized file patterns for cyclonedx. (See https://cyclonedx.org/specification/overview/#recognized-file-patterns)

Where does this come from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just not aware of CycloneDX recognized file patterns. My bad. Will fix.

Copy link
Contributor

@sambhav sambhav Jun 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! Functionally it looks great. We still have a few other open questions about functional and non-functional requirements around this feature that we will probably need more input from @wagoodman / @luhring and team. (#1029 (comment))

// NewSBOMCataloger returns a new SBOM cataloger object loaded from saved SBOM JSON.
func NewSBOMCataloger() *common.GenericCataloger {
globParsers := map[string]common.ParserFn{
"**/syft.json": parseSyftJson,
Copy link
Contributor

@sambhav sambhav Jun 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is not a valid syft SBOM file pattern. The IANA registered type for syft is *.syft.json per https://www.iana.org/assignments/media-types/application/vnd.syft+json

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleaned up

@wurstbrot
Copy link

@patrikbeno I just thought about merging SBOMs or filtering it with a tool like Dependency Track.
Very nice that you created a PR, already.

@patrikbeno
Copy link
Contributor Author

rebased

@tofay
Copy link
Contributor

tofay commented Jun 23, 2022

Currently syft is able to provide details as to 'why' it included a specific component in the output. One of the blockers to implementing #737 has been figuring out an appropriate way to represent that syft itself didn't discover the cataloged components, rather it pulled them from an sbom present in the image.

I've just had a play with this branch to look at this, as I'm keen to see this function get added.

It is determinable from the SBOMs that syft discovered this component from an SBOM on disk:

  • for cyclonedx format, the syft:package:foundBy package property has a value of sbom-cataloger
  • for syft json format, foundBy has a value of sbom-cataloger
  • for spdx format, the source information field of packages indicates that it came from an SBOM

Also, thanks to #1038, users can disable the SBOM cataloger should they desire.

Is this still a blocking issue?

Copy link
Contributor

@spiffcs spiffcs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the submission @patrikbeno! I've gone through and labeled some concerns regarding the detection behavior as it stands for the current PR.

Another issue I see here is making sure that the list of packages returned by the found SBOM are deduplicated or in some way updated to say that they were discovered as part of an SBOM cataloger and NOT their original discovery point. The current implementation would make some incorrect assertions around having discovered the package on disk when actually it was discovered only as part of a declared SBOM that was parsed and returned to the package list.

In short - we really appreciate the contribution and would like some time to work on the branch to update integration/unit tests surrounding this new cataloger while also covering some of the usability concerns highlighted above.

// NewSBOMCataloger returns a new SBOM cataloger object loaded from saved SBOM JSON.
func NewSBOMCataloger() *common.GenericCataloger {
globParsers := map[string]common.ParserFn{
"**/*.syft.json": parseSyftJson,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we want these to be always on by default and so broad. First, syft already makes a pretty large distinction where, when image scanning, we don't try to merge to list packages that are declared. Rather, we generally only try and put packages in the SBOM when we have validated 100% yes these are installed.

If we're going to extend image scanning to include declared packages I think the first iteration should be where a user can configure which paths they would like to check for declared packages (in SBOM format) and this behavior should by default be off so we're secure out of the box.

@tofay
Copy link
Contributor

tofay commented Jun 23, 2022

The current implementation would make some incorrect assertions around having discovered the package on disk when actually it was discovered only as part of a declared SBOM that was parsed and returned to the package list.

Looking at the SBOMs producing using this cataloger, the sourceInfo in SPDX SBOMs is not correct:

  {
   "SPDXID": "SPDXRef-5ddd6a37123ac5cc",
   "name": "AWSSDK.Core",
   "licenseConcluded": "NONE",
   "downloadLocation": "NOASSERTION",
   "externalRefs": [
<snip>
   ],
   "filesAnalyzed": false,
   "licenseDeclared": "NONE",
   "sourceInfo": "acquired package info from dotnet project assets file: /golang.spdx.json",
   "versionInfo": "3.7.10.6"
  },

Another example where a JAR is in an SBOM gives "sourceInfo": "acquired package info from installed java archive: /golang.spdx.json",.

Were you expecting any other inaccuracies that need resolving? I can't see anything incorrect in the cyclonedx or syft that misleads regarding detection (and per #1029 (comment) in those cases it's trivial to identify packages identified by the sbom-cataloguer).

@patrikbeno
Copy link
Contributor Author

rebased

@spiffcs spiffcs added the blocked Progress is being stopped by something label Jul 5, 2022
@wurstbrot
Copy link

Hi!
I am wondering about the status of this PR.
@patrikbeno does your latest force-push includes a fix for the mentioned bug of wrong spdx reported by @tofay ?

Kind regards
Timo

@patrikbeno
Copy link
Contributor Author

rebased

@patrikbeno
Copy link
Contributor Author

patrikbeno commented Jul 19, 2022

Hi! I am wondering about the status of this PR. @patrikbeno does your latest force-push includes a fix for the mentioned bug of wrong spdx reported by @tofay ?

Kind regards Timo

sort of, I did my best but it needs some improvements
cb10e39#diff-c7a11830ac100436d2a97b022ed008c98df328cb13aff4f8108200305fbc0f34L52

@patrikbeno
Copy link
Contributor Author

Hi! I am wondering about the status of this PR. @patrikbeno does your latest force-push includes a fix for the mentioned bug of wrong spdx reported by @tofay ?
Kind regards Timo

sort of, I did my best but it needs some improvements cb10e39#diff-c7a11830ac100436d2a97b022ed008c98df328cb13aff4f8108200305fbc0f34L52

improvement?
3016aef#diff-c7a11830ac100436d2a97b022ed008c98df328cb13aff4f8108200305fbc0f34R50

@tofay
Copy link
Contributor

tofay commented Jul 27, 2022

Another issue I see here is making sure that the list of packages returned by the found SBOM are deduplicated or in some way updated to say that they were discovered as part of an SBOM cataloger and NOT their original discovery point.

I've played with this branch to investigate this concern. There is no deduplication, that's consistent with other package types, e.g both the python cataloger and OS package scanners will detect python packages that are in OS packages. But the SBOM cataloger doesn't obscure the other catalogers results either.

@spiffcs what are the outstanding issues for this PR, and can the community help resolve them? I think it's 1) adding integration tests 2) making this cataloger off by default. Is there anything else?

@spiffcs
Copy link
Contributor

spiffcs commented Jul 27, 2022

@tofay I think that's correct

  • Integration tests that prove no collision and assert on packages being resolved as imported or detected as part of an SBOM and not asserted on disk
  • disable by default
  • cc @wagoodman if he had other ideas on additions or updates to this PR

@patrikbeno
Copy link
Contributor Author

rebased

@tgerla tgerla self-assigned this Aug 11, 2022
@tgerla
Copy link
Contributor

tgerla commented Aug 11, 2022

Hi @patrikbeno and @tofay, hope you are doing well. We were discussing this pull request as a team today, and we realized it would be helpful to have a live conversation about this feature. Would you be able to join one of our upcoming community meetings? They are held every two weeks on Thursdays, at 1200 EST. The next one is August 18: https://github.com/anchore/grype#join-our-community-meetings -- thanks in advance!

Tim

@abderrahim
Copy link

I tried this on a project I hope to convert to syft. It has an SPDX fragment (in tag value format) for each dependency. I noticed two issues (with regards to the SPDX fragments already there):

  • syft complains about missing SPDXID. While it's probably correct in doing that, the output SPDX has new ids anyway, so maybe it would be nice to relax this requirement.
  • It doesn't recognize DataLicense

I guess this isn't this PR's fault but from the syft parser, but I wanted to put them out there for discussion. And potentially get an idea of the amount of effort needed to implement in a followup.

@wagoodman wagoodman self-assigned this Nov 14, 2022
@wagoodman wagoodman force-pushed the sbom-cataloger branch 3 times, most recently from e407663 to d2be4b7 Compare November 14, 2022 21:01
@wagoodman
Copy link
Contributor

wagoodman commented Nov 14, 2022

@patrikbeno We recently came to a conclusion internally that SBOMs with multiple sources aren't anywhere on the near-term horizon, which brought the finish line much closer to the current state of this PR. I wanted to try and get this to the finish line.

I've rebased this PR and made a few additions based on multiple conversations in the community meeting and other threads:

  • The pkg.Package.Locations array is overwritten with a single location: the path to the SBOM that describes this package. Why make this change? The Locations field is meant to be the paths where the existence of this path is evident. In this case the cataloger only considers SBOM paths, thus, the only evidence paths should be the SBOM itself and not what is contained within the SBOM. I'm not 100% sure about this addition, but wanted to stick to the existing semantics of pkg.Package.Locations by default.

  • For every package the SBOM cataloger finds a relationship to the SBOM that describes that package is additionally returned.

What do you think about these additions? Happy to chat through them or remove them if that's not what you're looking for.

Questions for future PRs to consider:

  • What about non-package [node] elements that are in a SBOM (e.g. files)? Do we additionally bring these nodes into the SBOM being crafted?
  • What about relationships (edges) that are in the SBOM? Do we additionally merge related relationships found in the nested SBOM into the SBOM being crafted (and thus, also any related nodes as well)?

Comment on lines 98 to 103
func decodeName(group string, name string) string {
if group != "" {
return group + "/" + name
}
return name
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems as though it might be causing an issue in the encode-decode-encode test, should this change be separate from this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I do think this should land in another PR, I'll follow up

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

broken into a separate PR #1345

Comment on lines 21 to 25
WithParserByGlobs(makeParser(syftjson.Format()), "**/*.syft.json").
WithParserByGlobs(makeParser(cyclonedxjson.Format()), "**/bom.json", "**/*.cdx.json").
WithParserByGlobs(makeParser(cyclonedxxml.Format()), "**/bom.xml", "**/*.cdx.xml").
WithParserByGlobs(makeParser(spdx22json.Format()), "**/*.spdx.json").
WithParserByGlobs(makeParser(spdx22tagvalue.Format()), "**/*.spdx", "**/*.spdx.tv")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this all seems accurate, would it be better to just call the generic decode without a specific format in case people name files incorrectly -- like JSON with .spdx suffix?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I agree. I'll update

@wagoodman
Copy link
Contributor

While adjusting for #1029 (comment) to generalize the SBOM parsing, it became apparent that high-level API functions were needed. All of the functions under the top-level API at syft.Format* were exposed when all format implementations were under internal. Once internal/formats was moved to syft/format to expose it on the API, a similar migration of the syft.Format* functions should have been migrated to that package. In order to use these functions without an import cycle this PR migrates these functions. The original syft.Format* functions still remain as empty shells that call back to the migrated function under format to not break folks out in the world. That being said, this should be removed at syft v1.0.

@wagoodman wagoodman merged commit 0c4b99c into anchore:main Nov 16, 2022
@spiffcs
Copy link
Contributor

spiffcs commented Nov 16, 2022

Huge thanks to everyone who contributed thoughts, comments, and guidance on this PR!

Special shoutout to @patrikbeno for getting the ball rolling and starting the discussion. Congrats to @wagoodman for getting it finalized and @kzantow for the review needed to get something like this merged into syft.

@spiffcs spiffcs removed the blocked Progress is being stopped by something label Nov 16, 2022
@wagoodman
Copy link
Contributor

Thanks @spiffcs 🙌 -- @patrikbeno let me know if there is anything about the changes mentioned in #1029 (comment) that don't sit right or you have questions on and we can follow up with changes in another PR.

@errordeveloper
Copy link

errordeveloper commented Nov 17, 2022 via email

GijsCalis pushed a commit to GijsCalis/syft that referenced this pull request Feb 19, 2024
* SBOM cataloger

Signed-off-by: Patrik Beno <[email protected]>

* sbom-cataloger: turn off by default

and add integration test

Signed-off-by: Patrik Beno <[email protected]>

* SBOM cataloger

Signed-off-by: Patrik Beno <[email protected]>

* SBOM cataloger (optimize)

Signed-off-by: Patrik Beno <[email protected]>

* SBOM cataloger (fix)

Signed-off-by: Patrik Beno <[email protected]>

* SBOM cataloger (fix imports anchore#1172)

Signed-off-by: Patrik Beno <[email protected]>

* SBOM cataloger (fix: support group attribute in CDX SBOMs)

Signed-off-by: Patrik Beno <[email protected]>

* port to generic cataloger and add relationship to original file

Signed-off-by: Alex Goodman <[email protected]>

* generalize parser for all format globs

Signed-off-by: Alex Goodman <[email protected]>

Signed-off-by: Patrik Beno <[email protected]>
Signed-off-by: Alex Goodman <[email protected]>
Co-authored-by: Tom Fay <[email protected]>
Co-authored-by: Alex Goodman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.