Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Reintroduce Stringlike trait and use it for Stringlike.find() #3861

Draft
wants to merge 12 commits into
base: nightly
Choose a base branch
from

Conversation

martinvuyk
Copy link
Contributor

Reintroduce Stringlike trait and use it for Stringlike.find()

This allows some ergonomic code and not needing to allocate for many cases like String("123").find("1") which would previously cast "1" to a String.

@martinvuyk martinvuyk requested a review from a team as a code owner December 11, 2024 14:57
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
@martinvuyk
Copy link
Contributor Author

Hi @JoeLoser and @ConnorGray sorry if I'm drowning you in PRs I took your advice quite literally, they are mostly small and independent.

In this one I'm coming across the same problem as in #3677 where I tried to build a generic function that returns a Span for non-owning types as well as owning ones. Now that we have UnsafePointer carrying the origin of String, StringLiteral and StringSlice the same problem arises, which is we need an API that works for all types not just those that can return __origin_of(self).

One of the ideas I had was adding a method __origin__(self) -> Origin, but the compiler doesn't seem to like using dynamic values (this was one of the motivations for #3756)

We could make String, StringLiteral and StringSlice return a MutableAnyOrigin pointer for now, or try to use this PR as an introduction for whatever fix we can find. WDYT?

@ConnorGray
Copy link
Collaborator

Hey @martinvuyk, thanks for breaking down the PRs, they're looking great, no need to apologize at all😄

With regards to this PR specifically I think I see what you're aiming for in trying to build an abstraction over all "string like" collections (StringLiteral, String, StringSlice). Modeling that abstraction with a trait is one possible approach. But I wonder if using StringSlice as the "bottom" or "common denominator" type might work as well.

To illustrate what I mean, let's have a look at the contract of the Stringlike trait you're proposing:

trait Stringlike(CollectionElement, CollectionElementNew):
    alias is_mutable: Bool
    alias origin: Origin[is_mutable]

    fn byte_length(self) -> Int:
        ...

    fn unsafe_ptr(self) -> UnsafePointer[Byte, is_mutable=is_mutable, origin=origin]:
        ...

If we compare that to StringSlice:

struct StringSlice[
    mut: Bool,
    origin: Origin[mut]
]:
    fn byte_length(self) -> Int:
        ...

    fn unsafe_ptr(self) -> UnsafePointer[Byte, mut=mut, origin=origin]:
        ...

we see that they're essentially identical. I think that this means that a function accepting an arg: StringSlice is roughly an equivalently powerful abstraction as accepting an [T: Stringlike] (arg: T). Additionally, it means that any type that could implement Stringlike could just as easily convert itself into a concrete instance of StringSlice.

All else being equal then, I think the simpler design choice to use a concrete type (StringSlice) is likely easier to learn, easier to use, and easier to maintain. Additionally, it has the benefit of avoiding "bloat" in the compiled binary—instead of compiling separate versions of find() (one for StringSlice, one for StringLiteral, one for String, etc.) for each string-like type, we'll have just one (for StringSlice).

For an example, consider this these two different signatures:

    fn find[T: Stringlike, //](self, substr: T, start: Int = 0) -> Int:
    fn find(self, substr: StringSlice, start: Int = 0) -> Int:

For each signature, the function body can do essentially the same things: get the byte length, and work with a pointer to the string data. But the StringSlice version I would argue has all the advantages of the Stringlike abstraction, while also being a bit simpler since it can avoid depending on generics.

I'm curious to hear your thoughts on this as well 🙂 — this is just my first impression after seeing how this abstraction is planned to be used initially, but perhaps there's advantages I've missed, or a future state you're building towards that I might not see yet 😌

@martinvuyk
Copy link
Contributor Author

@ConnorGray yes I like the idea of using StringSlice as the "common denominator". But, I don't want us to use it for function signatures, IMO it removes much needed ergonomics and is quite frankly something that I find very annoying in other languages because it leads to e.g. some_var.to_some_type().map(some_func).to_another_type()....

This all relates to my goals for abstracting the as_bytes() method as well. I would like this code

"12345".as_bytes().find("123".as_bytes())

to become

Span[Byte]("12345").find("123")

by having the signatures receive T: AsBytes

In the case of Span I'd actually like us to eventually have a parametrized __span__(self) -> Span[Scalar[D], origin] trait for generic SerDe/processing purposes. And it should work for owning or non-owning types.

so going back to this specific PR, I'd like to achieve something like

trait Stringlike(CollectionElement, CollectionElementNew):
    alias mut: Bool
    alias origin: Origin[mut]

    fn as_string_slice(self) -> StringSlice[origin]:
        ...

What this enables is for example if someone decided to implement a very efficient ASCIIString and still wanted to be able to do

var custom_value = ASCIIString("123")
var stdlib_str = "123"
print(stdlib_str == custom_value)
print(String(stdlib_str) == custom_value)
print(StringSlice(stdlib_str) == custom_value)

without needing to implement the __eq__ for their type for every one of those types (and having to always put their type in the lhs), since our implementations would simply have fn __eq__[T: Stringlike](self, other: T) -> Bool. And also without having to write .as_string_slice() every two lines.

To avoid huge monomorphization steps we could make the implementations of the StringSlice methods always receive StringSlice arguments, and have String, StringSlice and StringLiteral do the transformations before sending the data to the implementation. WDYT?

@ConnorGray
Copy link
Collaborator

ConnorGray commented Dec 17, 2024

Hi Martin, thanks for elaborating on how you're intending to use this abstraction, that helps a lot in my understanding 🙂

Could we accomplish many of the same goals if we supported implicit conversions from all of the "string-like" types to StringSlice?

For instance, the example you provide:

"12345".as_bytes().find("123".as_bytes())

Span[Byte]("12345").find("123")

could be written as:

StringSlice("12345").find("123")

If we had a StringSlice from StringLiteral implicit conversion.

I think we could avoid many methods taking a T: Stringlike if we instead had a StringSlice
implicit conversion from T: Stringlike, e.g.:

trait Stringlike:
    # TODO: Not sure how to write the `origin` correctly today off the top of my head
    fn as_string_slice(self) -> StringSlice:
        ...

struct StringSlice:
    fn __init__[T: Stringlike](out self, stringlike: T):
        self = stringlike.as_string_slice()

Then, we could write many methods to take a concrete StringSlice, and any type that implements Stringlike could be passed as an argument. The only monomorphizations that would get generated would be for the StringSlice.__init__[T: Stringlike]() initializer.

To be clear, re. what you said here:

without needing to implement the __eq__ for their type for every one of those types (and having to always put their type in the lhs), since our implementations would simply have fn __eq__[T: Stringlike](self, other: T) -> Bool. And also without having to write .as_string_slice() every two lines.

We would instead have fn __eq__(self, other: StringSlice) method on String / StringLiteral / etc. And because StringSlice would be implicitly constructible from any Stringlike, there would be no need to write .as_string_slice() everywhere 🙂

What do you think of that approach?

The ASCIIString stuff I'll need to digest a bit more.

@ConnorGray
Copy link
Collaborator

I just wanted to add a quick note for context: Us adding an implicit conversion from String / StringLiteral / etc. to StringSlice is similar in purpose to what Rust calls "deref coercion" (though not the same semantics), whereby with minimal syntax you can "coerce" a Rust String type down to a &str just by referencing:

fn main() {
    let string = String::from("foo");
    
    let s: &str = &string;
}

@martinvuyk
Copy link
Contributor Author

martinvuyk commented Dec 17, 2024

trait Stringlike: # TODO: Not sure how to write the origin correctly today off the top of my head fn as_string_slice(self) -> StringSlice: ... struct StringSlice: fn init[T: Stringlike](out self, stringlike: T): self = stringlike.as_string_slice()

This is a good approach IMO. I actually already tried to implement something similar.

I implemented the current fn __init__[O: ImmutableOrigin, //](out self: StringSlice[O], ref [O]value: String):, then I tried to make the origin inferred fn __init__(out self: StringSlice[__origin_of(value)], ref value: String): but we still need the arguments to be ordered.

Problem is that StringLiteral is very funky at compile time. And every time I tried to use the generic Stringlike type it beautifully exploded unless I used _type_is_eq and sent it elsewhere.

So I guess for now we could just use .as_string_slice() as it currently is which is only for owning types, + a specialized constructor only for StringLiteral which we already have.

This is OK for now, but it still constraints the types to be owning (to be able to implement the trait and carry the correct origin), in the case of strings I don't think it's really a problem. But for .as_bytes() I think it will be in the future.

@martinvuyk
Copy link
Contributor Author

trait StringLike(CollectionElement, CollectionElementNew):
    """Trait intended to be used as a generic entrypoint for all String-like
    types."""
    ...

trait StringOwnerLike(StringLike):
    fn as_bytes(ref self) -> Span[Byte, __origin_of(self)]:
        """Returns a contiguous slice of the bytes owned by this string.

        Returns:
            A contiguous slice pointing to the bytes owned by this string.

        Notes:
            This does not include the trailing null terminator.
        """
        ...

    fn as_string_slice(ref self) -> StringSlice[__origin_of(self)]:
        """Returns a string slice of the data owned by this string.

        Returns:
            A string slice pointing to the data owned by this string.
        """
        ...

trait StringSliceLike(StringLike):
    alias mut: Bool
    """The mutability of the origin."""
    alias origin: Origin[mut]
    """The origin of the data."""

    fn as_bytes(self) -> Span[Byte, origin]:
        """Returns a contiguous slice of the bytes.

        Returns:
            A contiguous slice pointing to the bytes.

        Notes:
            This does not include the trailing null terminator.
        """
        ...

    fn as_string_slice(self) -> StringSlice[origin]:
        """Returns a string slice of the data.

        Returns:
            A string slice pointing to the data.
        """
        ...

trait StringLiteralLike(StringLike):
    fn as_bytes(self) -> Span[Byte, StaticConstantOrigin]:
        """Returns a contiguous slice of the bytes.

        Returns:
            A contiguous slice pointing to the bytes.

        Notes:
            This does not include the trailing null terminator.
        """
        ...

    fn as_string_slice(self) -> StringSlice[StaticConstantOrigin]:
        """Returns a string slice of the data.

        Returns:
            A string slice pointing to the data.
        """
        ...

@ConnorGray this isn't necessarily the end version, but this is what I mean by saying that we can't yet generalize the origin that the non-owning types return when calling methods like as_bytes() and as_string_slice(). We could also just have 3 traits and always 3 functions for cases where we need to generalize a lot until we have tools to check which subset of traits a type implements.

All 3 of these are types that someone else might implement.

  • StringOwnerLike/AsBytesOwnerLike: if somebody implements another string type specialized for their use case.
  • StringSliceLike/AsBytesSliceLike: I'm currently playing around with some json parsing and validation ideas, this applies to many other cases.
  • StringLiteralLike/AsBytesStaticLike: If we eventually allow embedding custom binary data into the Mojo binary itself, someone might have their own binary data which they wish to traverse in a certain way. It would have a StaticConstantOrigin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants