Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] [NFC] Fix as_bytes() documentation #3864

Closed

Conversation

martinvuyk
Copy link
Contributor

Fix as_bytes() documentation

@martinvuyk martinvuyk requested a review from a team as a code owner December 11, 2024 17:05
return Span[Byte, StaticConstantOrigin](
ptr=self.unsafe_ptr(), length=self.byte_length()
)

@always_inline
fn as_bytes(ref self) -> Span[Byte, __origin_of(self)]:
"""Returns a contiguous slice of the bytes owned by this string.
"""Returns a contiguous slice of bytes.
Copy link
Contributor

@lsh lsh Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I see how this is better. The bytes are owned by the literal.

Returns a contiguous slice of bytes.

doesn't really describe what the bytes contain at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to keep the documentation consistent across owning and non-owning types.

doesn't really describe what the bytes contain at all.

"utf-8 bytes" sounds less confusing 4u?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point was that "owned by this string" was descriptive about ownership and this PR would make it more ambiguous.


Returns:
A contiguous slice pointing to the bytes owned by this string.
A contiguous slice pointing to bytes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems grammatically/contextually confusing. If bytes was a variable it would make sense, but in this case it doesn't read quite right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bytes is the plural of a byte (?), and a contiguous slice of bytes is [byte, byte, ...] (?). Not sure how that's confusing. I'll change it to "utf-8 bytes" if that solves it

Copy link
Contributor Author

@martinvuyk martinvuyk Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nvm, I think it will be more restraining on the API (AsBytes trait) to say explicitly UTF-8, these bytes could encode anything. Bytes is a type in Python, so this makes sense to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could change just the String, StringLiteral, and StringSlice docs to say UTF-8 bytes though WDYT?.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point was more:

A contiguous slice pointing to bytes.

What bytes? Who owns them? What do they contain? While this may seem obvious, your change removes that context.

Copy link
Contributor Author

@martinvuyk martinvuyk Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What bytes? [...] What do they contain?

UTF-8 bytes

Who owns them?

The origin that the Span carries


My main goal is to eventually abstract this API over all Stringlike types, owning and non-owning, once the compiler Origin issues are resolved (see PR #3677 files, it won't necessarily look like that but along those lines to be able to receive a T: Stringlike and use .as_btyes() instead of .unsafe_ptr())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not seeing how this is more descriptive as a documentation line. It seems less descriptive to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I'd hold off a little on string unification work until the team has time to discuss your proposal. We should definitely improve the general API surface, but we need to think as a team how we want to handle it. For example, implicit conversion into StringSlice will handle a lot of the general string cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I'm trying to make the documentation consistent across all Stringlike types, not just owning ones. And that means making it more generic. I can add the encoding (UTF-8), but even that information might not hold in the future if we implement UTF-16 & UTF-32 as a parametrized String, and since Python uses encode(string, encoding) -> bytes represents all possible encodings, that complicates things even further.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @lsh that removing "the bytes owned by this string" isn't an improvement. You haven't made a compelling case for why having less information here helps people using the type.

There's no requirement that different types implementing the same trait have identical API doc for its methods. And where those methods have differences—like how the data is owned—the API doc should note it.

@lsh
Copy link
Contributor

lsh commented Dec 12, 2024

I'm going to close this out for now since I don't think this is better than what we have already. I also am opposed to removing the note about the null terminator, which is important to know.

@lsh lsh closed this Dec 12, 2024
@martinvuyk
Copy link
Contributor Author

I never removed anything about null termination except a redundant code comment (because the docstrings have that info), besides Python doesn't enforce null terminated bytes.

@lsh
Copy link
Contributor

lsh commented Dec 13, 2024

StringLiteral does enforce null terminated bytes, even if Python does not. This makes StringLiteral directly compatible with FFI without reallocation.

@lsh lsh reopened this Dec 13, 2024
@lsh
Copy link
Contributor

lsh commented Dec 13, 2024

I'll talk more with the team about whether more generic and less descriptive docs are desirable for specific types. In general, I think a trait can have a more general description, but implementors should add specific information that may be relevant to the concrete usage

@martinvuyk
Copy link
Contributor Author

StringLiteral does enforce null terminated bytes, even if Python does not. This makes StringLiteral directly compatible with FFI without reallocation.

I know that, but it also has the misleading docs:

struct StringLiteral(...):
    """This type represents a string literal.

    String literals are all null-terminated for compatibility with C APIs, but
    this is subject to change. String literals store their length as an integer,
    and this does not include the null terminator.
    """
    ...

so whenever I'm doing things with all Stringlike types I just build a span with an unsafe pointer and their .byte_length(). I just want to make those operations more generic in functions. In PR #3861 you can see another attempt at going generic over owning or non-owning stringlike types, that is now blocked by UnsafePointer carrying an origin. There are many APIs which would benefit from being generic over T: Stringlike and avoid allocations or implicit casting

@martinvuyk
Copy link
Contributor Author

I'll talk more with the team about whether more generic and less descriptive docs are desirable for specific types. In general, I think a trait can have a more general description, but implementors should add specific information that may be relevant to the concrete usage

Yeah I also think we could be more specific and use "UTF-8 bytes that make up this string" or something like that, I'm mostly just standing by not calling those bytes owned

@martinvuyk martinvuyk closed this Dec 19, 2024
@martinvuyk martinvuyk deleted the fix-asbytes-documentation branch December 19, 2024 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants