Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: include inspector package urls as part of the malicious metadata facts for pypi packages #935

Open
wants to merge 7 commits into
base: staging
Choose a base branch
from

Conversation

art1f1c3R
Copy link
Member

To allow for retention of packages when they are taken off PyPI, this new feature includes the inspector.pypi.io URL for the distribution files as part of the MaliciousMetadataFacts detail_information field. This has been done by modifying the wheel_absence.py heuristic to instead of returning filenames, return the python hosted URL and corresponding pypi inspector URL. The unit test for this, test_wheel_absence.py has been updated to reflect these changes accordingly.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Dec 3, 2024
@art1f1c3R art1f1c3R self-assigned this Dec 3, 2024
@art1f1c3R
Copy link
Member Author

art1f1c3R commented Dec 3, 2024

One change I have made that I am unsure about is, in wheel_absence.py:91 if _valid_url returns False, then currently I add an empty string. I am not sure what action should be taken if the constructed inspector link is not accessible.

The reason the check for a 200 OK response exists is due to the way I have constructed the pypi inspector link. I observe that the format of the pypi inspector link was the same as files.pythonhosted.org except replacing this prefix with the inspector URL, followed by the package name and then version. Everything else after was the same (using blake2b_256 and the filename), only this was only an observation and I haven't managed to find this documented anywhere, so I added in that check.

if send_head_http_raw(inspector_link) is None:
inspector_link = ""

release_files.append(release_metadata["url"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add separate fields in a dict for the inspector and file_server instead of adding all to a list?

Also, we could add a separate field for version instead of using the version as the key for the return value of this heuristic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done three fields, version, file_server_links, and inspector_links. I agree that makes much more sense, and allows me to resolve an invalid pypi inspector URL much easier.

@behnazh-w
Copy link
Member

One change I have made that I am unsure about is, in wheel_absence.py:91 if _valid_url returns False, then currently I add an empty string. I am not sure what action should be taken if the constructed inspector link is not accessible.

The reason the check for a 200 OK response exists is due to the way I have constructed the pypi inspector link. I observe that the format of the pypi inspector link was the same as files.pythonhosted.org except replacing this prefix with the inspector URL, followed by the package name and then version. Everything else after was the same (using blake2b_256 and the filename), only this was only an observation and I haven't managed to find this documented anywhere, so I added in that check.

I have added a suggestion in this comment to improve the return value. If the inspector HEAD request fails, the value corresponding to the inspector could be None.

# include the pypi inspector link, which uses the same suffix of
# packages/{blake2b_256}/file_name
inspector_prefix = f"{self.INSPECTOR_PREFIX}{name.lower()}/{version}/"
inspector_link = release_metadata["url"].replace(self.PYPI_PREFIX, inspector_prefix)
Copy link
Member

@behnazh-w behnazh-w Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the code correctly, if a project doesn't publish the wheel or tarball, we won't send a corresponding HEAD request to check the inspector link, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, since we iterate over the release data for that version, each item in that is for a separate way of distributing it (.tar.gz, .whl) so if one of them doesn't exist it won't exist to iterate over, so the HEAD request won't get sent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCA Verified All contributors have signed the Oracle Contributor Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants