-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-108580: Structure for importlib metadata idents #108585
Conversation
orbisvicis
commented
Aug 28, 2023
•
edited
Loading
edited
- Issue: Structure for importlib metadata identities #108580
9385c3b
to
229c51f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this looks good. I'd like to see more test coverage and have a few comments on the approach, but in general this looks acceptable, with the biggest concern being the possible backward incompatible change to the protocol.
This change will need to be backported to importlib_metadata if it's not contributed to there first. It would be preferable to contribute it there first as the test suite there is more robust, covering different Python implementations, performance checks, coverage reports, and exercising doctests (among other things). It's also easier to develop and faster to iterate and experiment there. And changes there get released to the public quickly for rapid feedback. It's by no means a requirement, but there for your consideration. Anything that's contributed there gets merged into CPython usually in short order.
Lib/importlib/metadata/_meta.py
Outdated
@property | ||
def authors(self) -> set[Ident]: | ||
""" | ||
Minimal parsing for "Author" and "Author-email" fields. | ||
""" | ||
|
||
@property | ||
def maintainers(self) -> set[Ident]: | ||
""" | ||
Minimal parsing for "Maintainer" and "Maintainer-email" fields. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When these are introduced, they will invalidate the compliance with the Protocol for any implementations except the built-in one. I suspect other libraries are allowed to implement their own PackageMetadata
. It may be okay to introduce this change, or it may be a backward-incompatibility that needs to be considered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's not OK, what are the alternatives? I can leave the methods on the implementation but remove them from the protocol. Or I can move the new methods to PackageMetadataV2(PackageMetadata)
, a child protocol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know the best approach. I'm not sure how rigorous checks for protocol compliance are. Apparently this protocol isn't decorated with runtime_checkable, suggesting that it mainly serves as documentation and not runtime behavior.
Or I can move the new methods to
PackageMetadataV2(PackageMetadata)
, a child protocol.
Eww. I don't like that at all. That can't possibly be the best way to evolve a protocol. I remember from C APIs that they were very struct about evolution, meaning that public APIs like the Win32 API would evolve the API once, from CallName to CallNameEx, meaning that after a year or two, the standard was for everyone to use the Ex
calls.
As I think about it more, the V2
concept isn't horrible. It does actually model the evolution of the protocol.
A part of me wishes there was an abstraction between the underlying versions of the protocol and the current recommended protocol, e.g. PackageMetadataLatest
except without the Latest
wart.
I guess that's what bothers me the most is the name is now responsible for two concerns, what the protocol is attempting to model and what iteration it is.
Here's what I recommend - let's get this change released as drafted to importlib_metadata
and CPython 3.13 alpha and see if there are any adverse effects and learn/adapt. I may be overthinking the concern.
2c3b72b
to
9f46019
Compare
I've started backporting the changes to |
@@ -0,0 +1,404 @@ | |||
import re |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one file increases the size of the test suite for importlib.metadata by ~50%. That's a lot of complexity for a very small aspect of the code. Since it's just the tests, I'm not too concerned, but it makes me wonder how much of this module should be factored out into another library.
At the very least, I'd like one or two docstrings explaining what this module does and an overview of its purpose/function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On further consideration, I think it was a bad choice to add hypothesis tests late in the PR. It's unnecessary to the implementation and possibly overkill and should be considered separately. Please split the hypothesis tests into a separate review with a separate justification. Is it possible to get the "example" tests to work without all of this infrastructure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hypothesis strategy is really clunky and complex, isn't it? But it did identify some edge cases that required me to rewrite both regexes, if you look at my commit history. I think that's justification enough. And believe it or not, the regexes are written to be simple and straightforward to understand. There's a lot of possible optimizations, such as loop unrolling, for which having this hypothesis strategy might eventually prove more useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I guess that's a good justification. I do appreciate that the tests are noticeably enhanced by the adoption of hypothesis. Whether or not we split out the hypothesis concern, I'd definitely want to be sure it can be adopted upstream in importlib_metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll separate out the Hypothesis identities strategy as a separate PR for importlib_metadata.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
I have made the requested changes; please review again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👋 just dropping in to offer some comments on the Hypothesis strategy here, since it's one of the more ambitious I've seen and has led to several issues, stack overflow questions, and a recent Hypothesis release. I hope the review will be useful whether or not the property-based tests are ultimately included in this PR!
@Zac-HD Thanks for the feedback 👍. I've implemented everything but I'm still using function objects as categories just to reduce boilerplate. Unfortunately the tests now run more slowly, if I can trust that Hypothesis doesn't resume/shrink along the same slow path. total=0; fail=0; for i in {0..100}; do ((total++)); tox -- tests/test_api.py::MetadataAPITests; status="$?"; echo -e "\n"; if ((status != 0)); then ((fail++)); fi; done; printf "Fail: %s\nTotal: %s\n" "$fail" "$total" Failure rates:
Summary of changes:
|
Add `PackageMetadata.authors` and `PackageMetadata.maintainers` to the `importlib.metadata` module. These unify and provide minimal parsing for the respective core metadata fields ("Author", "Author-email"), and ("Maintainer", "Maintainer-email").
Correct the parsing of name/email entries in the 'Author-email' and 'Maintainer-email' fields of the core metadata specification.
A hypothesis strategy for generating structured core metadata and equivalent unstructured text. Ensures that parsing the text using PackageMetadata results in the same structure - a roundtrip test.
A `note` stub which does nothing.
Correct the generation of identity entries in the core metadata specification.
* Fix doctest string escapes * Ignore Ruff E501 (line-too-long) for a long multiline string used as test argument.
* Manipulate strategies rather than data - functions now return strategies instead of data. * Remove nesting by moving all functions to module scope. * Pass state via function parameters rather than using nonlocal variables. * Draw directly from 'st.lists' rather than drawing list length from 'st.integers'. * Switch to 'draw(st.booleans())'. * Remove 'merge_char_opts', which was over-engineered.
Required for: * 'blacklist_categories' -> 'exclude_categories' * 'blacklist_characters' -> 'exclude_characters'
777efc2
to
80e05c6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect that much of the remaining failure rate is due to the modification of already-drawn structures. If you take that out, it might be faster? I'm definitely just guessing here though and would be pretty close to grabbing a profiler myself.
# if simple_chars, smaller search space and easier visualization | ||
exclude_categories=("C",) if simple_chars else ("Cs", "Co", "Cn"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd only support the simple case, since having two distinct options makes the database less effective and I don't expect control-character-specific bugs to be common, or for that matter worth fixing. It's a lot of state
-management code for very little return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like having the option to test every possible combination, but I'll make simple_chars
the default.
I'm not sure about that since I forgot to draw from Anyway, the profile doesn't show anything interesting though it does slow down the test enough that I can't get any successes. At this point I'm ready to chalk this up to "my slow computer". Besides CPython runs Hypothesis in a profile that suppresses
|
After further consideration, I realized that maybe this feature shouldn't be implemented in importlib.metadata at all. In prior decisions (e.g. python/importlib_metadata#429 (comment)), the team has decided to make importlib metadata a low-level reflector of the metadata in a fairly raw form and defer interpretation and more sophisticated objects to a third-party library (namely packaging). Another example of this strategy is in how If we wish to remain consistent in our approach, we should probably consider contributing this behavior (interpretation of Author/Maintainer contacts) to
I regret not having made this connection earlier, as I know the contributor has invested a lot in this change. I'll be honest, I don't love this strategy because of how it requires the callers to draw in the dependency and separately wrap the calls to arrive at high-level objects. It does, however, have the advantage of keeping the low-level implementation simple and decoupling the more sophisticated objects into a single, third-party library. In light of these concerns and past decisions, can we make the case that idents/contacts should get special treatment in this regard? |
Why does |
I have roughly the same objections as @jaraco, but I also don't want to be stuck between a refactoring of Class Class Like I said, I'm not pleased with third-part dependencies, but the workflow is slightly more convoluted: With this PR: from importlib.metadata import packages_distributions, metadata
# The module name is more stable than the package name
pkg_metadata = metadata(packages_distributions()["some-importname"][0])
pkg_authors = pkg_metadata.authors As properties on from importlib.metadata import packages_distributions, metadata
from packaging.metadata import Metadata
# The module name is more stable than the package name
pkg_metadata = Metadata.from_email(
metadata(packages_distributions()["some-importname"][0])
)
authors = pkg_metadata.authors Which would hook into a new module on from importlib.metadata import packages_distributions, metadata
from packaging.identity import authors
# The module name is more stable than the package name
pkg_metadata = metadata(packages_distributions()["some-importname"][0])
pkg_authors = authors(pkg_metadata)
That's a question for the team, but I'm noting that python/importlib_metadata#429, which became pypa/packaging#701, is still open. |
To be clear, this is wheel metadata, not Python's own metadata (which is primarily reflected in the
Because
Correct.
Also correct.
We're happy with the current design, which went through multiple discussions over several months, as it focuses more on a simple API and validation than a heavily ergonomic, opinionated view of the metadata (i.e., it's meant to very much map directly to the underlying core metadata format). So I think providing some higher-level API would be best kept in a 3rd-party package.
It's done, it just hasn't been released yet (should be happening in the near future). |
@jaraco, do you want it as a new module in importlib.metadata? Otherwise I'll put it in one of my packages. But I think there's some benefit to minimizing the number of packages.
The core metadata specification is sufficiently ambiguous that interpretation is required. Following my proposal at the Python packaging discourse, Structure for importlib metadata identities, I've tried to make as few assumptions as possible to maximize the space of inputs that can be correctly parsed.
Aren't we striving for a world where the two are indistinguishable? |
Ugh. What a mess we've gotten into. We've got two low-level metadata providers (importlib metadata and packaging) and one high-level provider (packaging), yet packaging isn't high-level enough to provide any ergonomic features, so we need yet another layer of third party package just for the ergonomics? To be sure, packaging does provide some high-level ergonomics, like interpreting the meaning of a Version or Requirement or Specification, just not ergonomics that cross two fields. Now that I see that
I'm unconcerned about the specific organization within importlib metadata. I do still want to be sure that this is contributed to
I don't think there's any plans to converge these two as they're unrelated domains of concern. Python's metadata doesn't have anything to do with packages (even in the stdlib). To be sure, |
There's an equivalent, flattened, PR already waiting to run at importlib_metadata: python/importlib_metadata#471. I might have to replace usage of the parameterized package (in this PR) if we're not committing the Hypothesis tests to CPython, but aside from that it's largely ready.
I've noticed that package metadata might not be a solved problem, what with the multiple interfaces, multiple packages, and multiple PRs, but I think I have a interesting proposal that preserves Then there's the opposite approach which isolates my PR as much as possible in a separate module. I'm not a fan of this approach but however it happens I do need the functionality for my own tooling. from importlib.metadata import packages_distributions, metadata
from importlib.metadata.structure import authors
pkg_metadata = metadata(packages_distributions()["some-importname"][0])
pkg_authors = authors(pkg_metadata.authors) |
Then let's work towards fixing the ambiguity in the specs instead of trying to come up with our own interpretation via code. That's what leads to conventions being assumed to be standardized. |
I'm not sure that's a good idea. As I see it, the core metadata has several shortcomings:
If it were up to me I'd replace the custom format with TOML or JSON or some more structured format and deduplicate keys such as "Author", "Author-email" and "Maintainer", "Maintainer-email". As I understand it, there have been various PRs to enhance the ergonomics of the metadata API, so a large part of this project would be to collect feedback from the community. In 2021 there was a discussion [1] about the encoding of non-ascii names and emails in which the core metadata identity fields "{name} <{email}> [were described as] intended to only be a guide, not that it should be the exact thing to use when formatting the field" and which led to a PR of the specification [2] active to this day describing the core metadata as "exactly which email RFC applies to packaging metadata is not specified" ... "The need for tools to work with years worth of existing packages makes it difficult to shift to a new format." (referring to PEP 566 from 2017, which describes a JSON-compatible version of the core metadata). This prompted a push [3] to better specify the existing core metadata format and which pushed That said, I'd suggest the core metadata specification is replaced sooner rather than later, before the technical debt of all published Python packages becomes insurmountable. [1] https://discuss.python.org/t/core-metadata-email-fields-unicode/7421 |
FYI I'm unsubscribing as I'm not really seeing a need for any further input from me. |
* Document `PackageMetadata.authors` and `PackageMetadata.maintainers` in the usage guide. * Export `Ident` so it can be documented but also used to build custom parsing strategies.
I mentioned it on discuss.p.o, but for whatever it's worth, I think we should not be adding APIs to these core utilities like If the spec isn't good enough, then we should improve the spec, not paper over it in code. |
Properties `PackageMetadata.authors` and `PackageMetadata.maintainers` are now lists with repeated elements removed, rather than sets.
Since there seems to be widespread consensus that importlib.metadata should reflect the spec, I took a look at the spec for Author and Author-email, which gives these examples:
Assuming the latter two are combined as:
How would Based on the feedback above, if this proposed change provides a meaningful interpretation of what the spec prescribes, it's worth implementing, but it should avoid papering over ambiguity in the spec or otherwise creating a de facto spec by creating implicit constraints beyond what the spec provides. |
What about trying not to bite off so much and instead specify two fields, "Authors" and "Maintainers" with a specified structure that supports the desired use-cases. Or consider a "Contributor" field that contains any number of entities with their associated role (author, maintainer, etc). Is there anything about the email header format/syntax that would prohibit supplying all possible values? Still, I understand if you're reluctant to go down that path and I'm comfortable allowing importlib.metadata to facilitate access to the aspects already specified.
Have you thought any more about these questions and concerns? I feel like they're the primary blockers against accepting this change. |
In python/importlib_metadata#471 (comment), I'm diving into the details of this proposal and have identified some unexpected behaviors. |
The work on this has stalled out, so I'm closing this for now. We can revisit after addressing some crucial issues with the approach. Feel free to request a re-open or to open a new PR in the future (preferably in importlib_metadata first). |