Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] tool.setuptools.license-files results in invalid metadata #4759

Open
dnicolodi opened this issue Nov 26, 2024 · 17 comments
Open

[BUG] tool.setuptools.license-files results in invalid metadata #4759

dnicolodi opened this issue Nov 26, 2024 · 17 comments
Labels
bug Needs Triage Issues that need to be evaluated for severity and status.

Comments

@dnicolodi
Copy link
Contributor

setuptools version

setuptools==74.1.2

Python version

Python 3.13

OS

any

Additional environment information

No response

Description

If any of the the glob patterns specified intool.setuptools.license-files matches a file in the package, setuptools generates invalid metadata: it includes a License-File field while specifying Metadata-Version to be 2.1. This is invalid and packaging raises an exception while parsing the metadata. This likely results in the resulting distributions to not be accepted by PyPI.

Because tool.setuptools.license-files has a default value of ['LICEN[CS]E*', 'COPYING*', 'NOTICE*', 'AUTHORS*'] the problem can be encountered also in packages that do not explicitly set this field in pyproject.toml but happen to have a file matching the default glob pattern.

Expected behavior

Do not emit the License-File field or do it and specify Metadata-Version: 2.4 as per PEP 639.

How to Reproduce

Here is a short reproducer:

$ cat >pyproject.toml <<EOF
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"
[project]
name = "foo"
version = "1.2.3"
[tool.setuptools]
license-files = ["LICENSE"]
EOF
$ touch LICENSE
$ uv build .
$ tar xf dist/foo-1.2.3.tar.gz
$ python
>>> import pathlib
>>> import packaging.metadata
>>> packaging.metadata.Metadata.from_email(pathlib.Path('dist/foo-1.2.3/PKG-INFO').read_text())

Output

  + Exception Group Traceback (most recent call last):
  |   File "<python-input-3>", line 1, in <module>
  |     packaging.metadata.Metadata.from_email(pathlib.Path('dist/foo-1.2.3/PKG-INFO').read_text())
  |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/Users/daniele/src/twine/.venv/lib/python3.13/site-packages/packaging/metadata.py", line 781, in from_email
  |     raise ExceptionGroup(
  |         "invalid or unparsed metadata", exc_group.exceptions
  |     ) from None
  | ExceptionGroup: invalid or unparsed metadata (1 sub-exception)
  +-+---------------- 1 ----------------
    | packaging.metadata.InvalidMetadata: license-file introduced in metadata version 2.4, not 2.1
    +------------------------------------
>>>
@dnicolodi dnicolodi added bug Needs Triage Issues that need to be evaluated for severity and status. labels Nov 26, 2024
@abravalheri
Copy link
Contributor

abravalheri commented Nov 26, 2024

This is a well-known case of an early implementation of a PEP 639 draft. Right now tools are equipped to accept this variation.

In time it will be fixed. But not immediately, due to effort constraints and release scheduling (we are in the process of implementing previous versions of metadata first).

Probably we can close this as a "kind of duplicate" of the request to implement PEP 639.

@dnicolodi
Copy link
Contributor Author

This is a well-known case of an early implementation of a PEP 639 draft. Right now tools are equipped to accept this variation.

What I am trying to say is that tools are not equipped to accept this variation. packaging is the PyPA library for parsing distribution metadata and, as shown above, it does not handle this variation. Which other tools dealing with packages metadata that handle this correctly do you have in mind?

@cdce8p
Copy link
Contributor

cdce8p commented Nov 27, 2024

@dnicolodi Is right here. The License-File metadata generated by setuptools causes an error during validation with packaging==24.2 since the metadata version isn't >=2.4. This hasn't been an issue so far as twine doesn't support version 2.4 and thus License-File isn't included in the form fields during upload.

What I'm uncertain about is the solution. Sure, we can remove the License-File field from the generated metadata, and probably do so anyway as it isn't spec compliant even with 2.4. However that wouldn't resolve the issue for all packages which pin older setuptool versions. As it was added here quite a while ago, I suspect the impact will be quite large. This might be something which needs to be handled upstream (twine / warehouse) too, e.g. only push the License-File field if the metadata version is >=2.4 and ignore it for older versions.

dnicolodi added a commit to dnicolodi/twine that referenced this issue Nov 27, 2024
@abravalheri
Copy link
Contributor

abravalheri commented Nov 27, 2024

Which other tools dealing with packages metadata that handle this correctly do you have in mind?

I was thinking that twine/PyPI/pip have been accepting License-File with metadata version 2.2 for a while.

Just to emphasize that we are going to tackle this problem in time (btw, thanks @cdce8p for the PRs). But before that we will implement metadata version 2.3 (I just got a review recently on one of the PRs necessary for 2.3, but this week I don't have the time to delve into it).

Meanwhile, my strong preference is to not change anything that has been in place for the last 2 or 3 years.

If you need the process core metadata using packaging before we proceed with the updates in setuptools from 2.2, thourgh 2.3 and all the way to 2.4, maybe you can consider deleting/ignoring the License-File field before the validation?

@dnicolodi
Copy link
Contributor Author

I was thinking that twine/PyPI/pip have been accepting License-File with metadata version 2.2 for a while.

I am not very familiar with all the consumers of package metadata. I have the impression that there are several tools using different approaches for metadata parsing and validation. Only relatively recently packaging implemented metadata validation. I hope there is going to be a convergence toward more strict parsing of metadata, either via packaging or via equivalent tools.

Indeed, I bumped into the issue with License-File because I would like to move twine from using a semi-maintained library for extracting metadata from distribution files to a solution based on packaging pypa/twine#1180 This comes with stricter metadata validation, which IMHO is a good thing. I have now implemented a "work-around" for the metadata emitted by existing setuptools versions as there is nothing we can do about those.

I was under the impression that PyPI implements validation of the distribution files metadata, but if it does, the validation is not very strict. What is validated strictly is the form data that is sent alongside the distribution files. That can be tweaked as needed to support the metadata emitted by existing setuptools releases.

pip is most likely the most permissive consumer of metadata, thus I don't expect it to do any validation.

I don't see any reason to change setuptools to fix this issue straight away. I filed this issue to make sure that you are aware of it. Because, as I wrote above, metadata parsing has not been very strict so far, it could have been that you are not aware of it.

@abravalheri
Copy link
Contributor

Thank you very much @dnicolodi.
Hopefully we can have it soon and start to converge towards packaging.

One thing that would help a lot is if packaging itself could provide functionally to emit metadata.
This way we use a single and consistent library for both emitting and parsing metadata.
It is probably been already tracked by pypa/packaging#570.

@dnicolodi
Copy link
Contributor Author

meson-python and scikit-core (and possibly other build backends) use pytproject-metadata https://github.com/pypa/pyproject-metadata/ for translating pyproject.toml into an RFC822 metadata file (I am one of the maintainers of meson-python).

There is some work toward incorporating pyproject-metadata into packaging pypa/packaging#846 pypa/packaging#847

On the other hand, I like that there are more that one implementation of the standard: that makes it easier to ensure that the only implementation does not diverge from the standard and avoids such implementation, bugs included, to become the de-facto standard. We were in that situation before PEP 517 and PEP 621, and we are still cleaning up the mess... 🙂

@cdce8p
Copy link
Contributor

cdce8p commented Nov 27, 2024

I was thinking that twine/PyPI/pip have been accepting License-File with metadata version 2.2 for a while.

PyPI uses packaging.metadata to validate the submitted form fields. Unknown fields are just ignored. With 24.2 license_files was defined as being added in 2.4 thus it would start to fail now when submitted with older version.

Saw the commit from @dnicolodi on the packaging PR: dnicolodi/twine@ab3bf7d. All considered, that's probably the most practical solution here.

Just to emphasize that we are going to tackle this problem in time (btw, thanks @cdce8p for the PRs). But before that we will implement metadata version 2.3.

Meanwhile, my strong preference is to not change anything that has been in place for the last 2 or 3 years.

Sorry in advance if I'm a bit annoying here. I can understand your position, but wouldn't fully agree with it. We can do more even before we implement metadata version 2.3.

Why am I pushing for these changes? For Home Assistant we use a script to try to validate the licenses of all requirements and tbh it's just a mess. Some packages use the outdated classifiers, some custom license strings and others the full license in the metadata. I'm prepared to open PRs for some of these dependencies but it only really makes sense if I can use the final pyproject.toml metadata. Otherwise I'd have to do that all over again once setuptools is able to emit valid 2.4 metadata. With the changes above, it only needs a new release with an updated setuptools version and everything would be fine.

@dnicolodi
Copy link
Contributor Author

PyPI uses packaging.metadata to validate the submitted form fields. Unknown fields are just ignored. With 24.2 license_files was defined as being added in 2.4 thus it would start to fail now when submitted with older version.

AFAIU warehouse (the software powering PyPI) does not ignore unknown fields. This should be the validation code: https://github.com/pypi/warehouse/blob/d57082ee37327bc1e8a28f96470b00ed226c0f87/warehouse/forklift/metadata.py#L263-L348

However, twine submits only fields it known about, and it does not know yet about license_files (which should appear as multiple license_file in the form data, but this is a minor technical detail).

@dnicolodi
Copy link
Contributor Author

hatchling and poetry [...] are also blocked on the twine update and can't yet move to 2.4.

Can you point me to more information regarding this? I was under the impression that hatchling users use the upload mechanism provided by the hatch framework, and same for poetry. Isn't this the case? One of the PyPI maintainers on Discourse said that there are PyPI uploads conforming to metadata 2.4. I thought those were hatchling projects.

@cdce8p
Copy link
Contributor

cdce8p commented Nov 27, 2024

PyPI uses packaging.metadata to validate the submitted form fields. Unknown fields are just ignored. With 24.2 license_files was defined as being added in 2.4 thus it would start to fail now when submitted with older version.

AFAIU warehouse (the software powering PyPI) does not ignore unknown fields. This should be the validation code: https://github.com/pypi/warehouse/blob/d57082ee37327bc1e8a28f96470b00ed226c0f87/warehouse/forklift/metadata.py#L263-L348

You're right. I only checked the packaging call Metadata.from_raw(...) which ignores unknown fields. Didn't know warehouse implemented additional checks.

hatchling and poetry [...] are also blocked on the twine update and can't yet move to 2.4.

Can you point me to more information regarding this? I was under the impression that hatchling users use the upload mechanism provided by the hatch framework, and same for poetry. Isn't this the case?

For hatchling see pypa/hatch#1828 (comment). The same would likely apply to poetry as well. Although they haven't yet started implementing PEP 639 to begin with python-poetry/poetry#9670. The comment for poetry was more referencing the fact that they already allow project.license = MIT in pyproject.toml and just back-populate it to the License field instead of License-Expression.

One of the PyPI maintainers on Discourse said that there are PyPI uploads conforming to metadata 2.4. I thought those were hatchling projects.

AFAIK you can overwrite the Metadata-Version with hatchling and at some point (before PyPI added official support for it I believe) this was even the default. It has since been reverted though.

The last comment on discuss I saw was

No further uploads have occurred since that first attempt.

@abravalheri
Copy link
Contributor

Hi guys, I understand the urgency of the topic, but I am also very conscious that we need to move very carefully to avoid breaking the ecosystem all at once. In the past we had a lot of bad experiences with botched releases, so I am trying to take the most conservative approach possible (and even with that it is possible there will be problems).

The good thing is that we have most of the pieces already in place (thanks again for the PRs). We now need to coordinate to release them, collect feedback and fix if things are broken.

My preference is to do things step by step and wait one week or so between steps to receive feedback of early adopters on edge cases (of course respecting the holiday season coming ahead and the times of all collaborations, so probably longer than that). I suggest the following:

  1. Setuptools to finish testing Metadata 2.2 implementation (followed by release + feedback + bug fixing)
  2. Setuptools to bump Metadata 2.3 - should be trivial - (followed by release + feedback + bug fixing)
  3. Setuptools to implement PEP 639 and Metadata 2.4 - (followed by release + feedback + bug fixing)
  4. Twine and other downstream consumers to support strict Metadata 2.4 1

I think that step 2 is going to be trivial to support, once we concede that we don't have to worry to much about the pypa/packaging#845 and treat it as a temporary bug.

Footnotes

  1. I am not sure twine will be able to fully drop lenient parsing for Metadata < 2.4 (many users do pin the version of their build dependencies, so there may be some backlash form the community). But this is a decision that the twine team can decide to take.

@cdce8p
Copy link
Contributor

cdce8p commented Nov 27, 2024

One last post from me and then I'll shut up and respect your decision :)

we need to move very carefully to avoid breaking the ecosystem all at once. In the past we had a lot of bad experiences with botched releases, so I am trying to take the most conservative approach possible (and even with that it is possible there will be problems).

Absolutely! The only thing our opinions differ I believe is in the risk these PRs actually pose. What I'm trying to say is it's quite small and we can safely do it now. Let me explain

Doing these now would provide valuable feedback way before we'd consider moving to 2.4. That's the only thing I would change about your proposed timeline.

@dnicolodi
Copy link
Contributor Author

I don't want to influence the pace of introduction of new features in any way (egoistically I would like support for metadata 2.4 to land as soon as possible to be able to use it in my projects, but it would only be something nice to have) so this should not be read as supporting one position or the other, but I would like to point out that pyproject-metadata and thus meson-python have taken the approach of emitting metadata where the metadata version is set to what is required to represent faithfully the user input, namely the content of pyproject.toml.

meson-python does not support any dynamic metadata fields, thus for it this means either metadata version 2.2 or 2.4. The latter is used only when pyproject.toml contains a PEP 639 style license declaration, namely a string value for project.license or a non empty project.license-files.

I think this is the only way to proceed as pre PEP 639 and post PEP 639 license declaration formats are not compatible with each other, thus emitting metadata version 2.4 with a pyproject.toml using an older metadata format is at minimum very tricky.

These PRs have more details pypa/pyproject-metadata#132 pypa/pyproject-metadata#206

@cdce8p
Copy link
Contributor

cdce8p commented Nov 27, 2024

I think this is the only way to proceed as pre PEP 639 and post PEP 639 license declaration formats are not compatible with each other, thus emitting metadata version 2.4 with a pyproject.toml using an older metadata format is at minimum very tricky.

I'm proposing the reverse: Emitting the current core metadata 2.1 with a post PEP 639 pyproject.toml. Even if that means some fields wouldn't be written even though the data in the project table is there (namely License-Expression).

That approach works fine for hatchling. I've also opened PRs to add the same to flit.

@dnicolodi
Copy link
Contributor Author

I'm proposing the reverse: Emitting the current core metadata 2.1 with a post PEP 639 pyproject.toml.

I don't understand what the advantage of doing this is. IIUC, your goal is to move as fast as possible to have packages with PEP 639 metadata. However, with this approach, having packages with PEP 639 license metadata will require two releases of the packages involved: one that updated the metadata fields in pyproject.toml to the ones specified in PEP 639 and one to rebuild the package metadata shipped in the distribution files with a future setuptools release.

The cost of doing this is not only the two releases, but also that between the two releases the involved packages will not have clear license information displayed on PyPI. IIRC, PEP 639 forbids having classifiers indicating the package license while using the License-Expression metadata field. Because currently PyPI takes the license information displayed on the package pages from the classifiers, the packages will do not show clear license information.

@cdce8p
Copy link
Contributor

cdce8p commented Nov 29, 2024

However, with this approach, having packages with PEP 639 license metadata will require two releases of the packages involved: one that updated the metadata fields in pyproject.toml to the ones specified in PEP 639 and one to rebuild the package metadata shipped in the distribution files with a future setuptools release.

Correct.

I don't understand what the advantage of doing this is. IIUC, your goal is to move as fast as possible to have packages with PEP 639 metadata.

Updating packages always takes time. What can be optimized though is developer time. I can either fix / convert all wrong licenses to SPDX expressions now and do a second pass to convert {text = "..."} to the new syntax, or just to the update once knowing that eventually a new release will just use the License-Expression instead. Given that it often takes >1 month for smaller projects to merge PRs, this would just be exhausting. So for now, I've paused updating the license strings until the new syntax in whatever form is supported.

To give an example, poetry already supports the license = "MIT" syntax while still missing PEP 639 support. So these aren't an issue.

Just for Home Assistant, I currently track over 650 packages with inaccurate license data, no SPDX expression in either the License-Expression or the License field.

The cost of doing this is not only the two releases, but also that between the two releases the involved packages will not have clear license information displayed on PyPI.

Not entirely. PyPI can handle all case. This is with both classifier and (old) license metadata 1

Screenshot 2024-11-29 at 20 13 23

and here an example with only the (old) license metadata, but still a valid SPDX expression. 2

Screenshot 2024-11-29 at 20 14 27

IIRC, PEP 639 forbids having classifiers indicating the package license while using the License-Expression metadata field. Because currently PyPI takes the license information displayed on the package pages from the classifiers, the packages will do not show clear license information.

No, build tools MAY raise an error if a license classifier is present.3 PyPI must only reject uploads with both License and License-Expression fields but those don't collide here.4

Footnotes

  1. https://pypi.org/project/homeassistant/

  2. https://pypi.org/project/AEMET-OpenData/

  3. https://peps.python.org/pep-0639/#deprecate-license-classifiers

  4. https://peps.python.org/pep-0639/#deprecate-license-field

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Needs Triage Issues that need to be evaluated for severity and status.
Projects
None yet
Development

No branches or pull requests

3 participants