Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Fix String.split() and start fixing String.__len__() #2960

Closed
wants to merge 33 commits into from

Conversation

martinvuyk
Copy link
Contributor

@martinvuyk martinvuyk commented Jun 5, 2024

Fix string split to take in NoneType as input and its body to use isspace(self) method. Fixes #2880

Leave the fix for lstrip and rstrip to use String.isspace() as TODO once llvm intrinsics can be used at comp time

Added a method String.byte_length() but left the builtin String.__len__() alone for now since many other methods assume it returns byte length and it will require mayor refactoring of too many places for 1 PR. Added deprecation warning to _byte_length()

@martinvuyk martinvuyk changed the title start fixing isspace and str len [stdlib] Fix String.split() & String.isspace() and start fixing String.__len__() Jun 5, 2024
@martinvuyk
Copy link
Contributor Author

@laszlokindrat no idea what this error is.

When calling fn lstrip(self) -> String:

/Users/runner/work/mojo/mojo/stdlib/src/builtin/string.mojo:1874:8: note:                             failed to interpret function @stdlib::builtin::string::String::lstrip(stdlib::builtin::string::String)
    fn lstrip(self) -> String:
       ^
/Users/runner/work/mojo/mojo/stdlib/src/builtin/string.mojo:1881:18: note:                               failed to evaluate call
        for s in self:
                 ^
/Users/runner/work/mojo/mojo/stdlib/src/builtin/string.mojo:1194:8: note:                                 failed to interpret function @stdlib::builtin::string::String::__iter__[__mlir_type.i1,__mlir_type.!lit.lifetime<*(0,0)>](stdlib::builtin::string::String%)_REMOVED_ARG
    fn __iter__(ref [_]self) -> _StringIter[__lifetime_of(self)]:
       ^
/Users/runner/work/mojo/mojo/stdlib/src/builtin/string.mojo:1200:48: note:                                   failed to evaluate call
        return _StringIter[__lifetime_of(self)](
                                               ^
/Users/runner/work/mojo/mojo/stdlib/src/builtin/string.mojo:700:8: note:                                     failed to interpret function @stdlib::builtin::string::_StringIter::__init__(stdlib::builtin::string::_StringIter[$0, $1, $2]=&,stdlib::memory::unsafe_pointer::UnsafePointer[stdlib::builtin::simd::SIMD[{uint8}, {1}], {{0}}],stdlib::builtin::int::Int)_REMOVED_ARG,forward=1
    fn __init__(
       ^
/Users/runner/work/mojo/mojo/stdlib/src/builtin/string.mojo:708:31: note:                                       failed to evaluate call
            if _utf8_byte_type(unsafe_pointer[i]) == 1:
                              ^
/Users/runner/work/mojo/mojo/stdlib/src/builtin/string.mojo:[66](https://github.com/modularml/mojo/actions/runs/9390428176/job/25860270012?pr=2960#step:9:67)7:4: note:                                         failed to interpret function @stdlib::builtin::string::_utf8_byte_type(stdlib::builtin::simd::SIMD[{uint8}, {1}])
fn _utf8_byte_type(b: UInt8) -> UInt8:
   ^
/Users/runner/work/mojo/mojo/stdlib/src/sys/intrinsics.mojo:192:10: note:                                           failed to fold operation pop.call_llvm_intrinsic{fastmathFlags: #pop<fmf none>, hasSideEffects: false, intrin: "llvm.ctlz" : !kgen.string}(#pop<simd 223> : !pop.scalar<ui8>, false)
        ](arg0, arg1)
         ^
mojo: error: failed to run the pass manager```

Signed-off-by: martinvuyk <[email protected]>
@laszlokindrat
Copy link
Contributor

@martinvuyk Something (not sure where) is calling lstrip at compile time, but it can only work at runtime because its implementation somewhere uses an llvm intrinsic, and we don't support them at compile time yet. This is a known limitation that we are planning to lift.

@laszlokindrat laszlokindrat self-assigned this Jun 11, 2024
@martinvuyk martinvuyk marked this pull request as ready for review June 11, 2024 14:00
@martinvuyk martinvuyk requested a review from a team as a code owner June 11, 2024 14:00
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
@martinvuyk
Copy link
Contributor Author

@laszlokindrat ready for review :D

Copy link
Contributor

@laszlokindrat laszlokindrat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch! I have a couple questions, but I like the direction you're going! Could you please rebase on the latest nightly?

.gitignore Outdated Show resolved Hide resolved
stdlib/src/builtin/string.mojo Outdated Show resolved Hide resolved
stdlib/src/builtin/string.mojo Outdated Show resolved Hide resolved
stdlib/src/builtin/string.mojo Outdated Show resolved Hide resolved
stdlib/src/utils/string_slice.mojo Outdated Show resolved Hide resolved
@martinvuyk
Copy link
Contributor Author

Hey @laszlokindrat I ended up touching a lot more places than I thought I'd have to. Changed String._byte_length() to String.byte_length() since it is a method that is pretty commonly used before using pointers.

I used a @deprecated decorator to help me find all places in the stdlib that use String.__len__ assuming it's byte_length before doing pointer stuff and changed it to String.byte_length().

I also added some code in places where we should use unicode code point indexing and currently assume ASCII, but we still need issue #933 solved to be able to switch to that. And also the code is the first thing that came to me, not necessarily efficient.

@laszlokindrat
Copy link
Contributor

Thanks! Can we split up this patch then? The length related stuff will touch a lot more things internally, and it's also semantically separate.

@martinvuyk
Copy link
Contributor Author

I honestly don't know where to even split this since fn split() needs byte_length and so does _StringSliceIter instantiation. I didn't change any behavior regarding length methods except for StringSlice.__len__ that was already marked as returning unicode length

    fn __len__(self) -> Int:
        """Nominally returns the _length in Unicode codepoints_ (not bytes!).

        Returns:
            The length in Unicode codepoints.
        """
        # FIXME(MSTDL-160):
        #   Actually perform UTF-8 decoding here to count the codepoints.
        return len(self._slice)
    fn __len__(self) -> Int:
        """Nominally returns the _length in Unicode codepoints_ (not bytes!).

        Returns:
            The length in Unicode codepoints.
        """
        var unicode_length = self.byte_length()

        for i in range(unicode_length):
            if _utf8_byte_type(self._slice[i]) == 1:
                unicode_length -= 1

        return unicode_length

The length related stuff will touch a lot more things internally

That's what I was afraid of. For now the only breaking change is StringLiteral, StringSlice and String got their _byte_length() method renamed to byte_length() to be used before doing pointer stuff (using the internal method directly was a bad idea anyway). I could also just add a _byte_length() method with a deprecation warning to all WDYT?

@laszlokindrat
Copy link
Contributor

Let's start with the _byte_length -> byte_length renaming, as well as all the miscellaneous updates you did to move from len to byte_length. Then we can iterate again. Thanks for you patience with this, I promise it will eventually land :)

@martinvuyk martinvuyk changed the title [stdlib] Fix String.split() & String.isspace() and start fixing String.__len__() [stdlib] Fix String.split() and start fixing String.__len__() Jul 9, 2024
Signed-off-by: martinvuyk <[email protected]>
@martinvuyk martinvuyk requested a review from laszlokindrat July 9, 2024 01:50
Copy link
Contributor

@laszlokindrat laszlokindrat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Can you please add a changelog entry describing the changes?

stdlib/src/builtin/string.mojo Show resolved Hide resolved
Signed-off-by: martinvuyk <[email protected]>
Copy link
Contributor

@laszlokindrat laszlokindrat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@laszlokindrat
Copy link
Contributor

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Jul 9, 2024
@modularbot
Copy link
Collaborator

✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions, click here to learn more.

@modularbot modularbot added the merged-internally Indicates that this pull request has been merged internally label Jul 9, 2024
modularbot added a commit that referenced this pull request Jul 10, 2024
…en__()` (#43119)

Fix string split to take in NoneType as input and its body to use
`isspace(self)` method. Fixes #2880

Leave the fix for `lstrip` and `rstrip` to use `String.isspace()` as
`TODO` once llvm intrinsics can be used at comp time

Added a method `String.byte_length()` but left the builtin
`String.__len__()` alone for now since many other methods assume it
returns byte length and it will require mayor refactoring of too many
places for 1 PR. Added deprecation warning to `_byte_length()`

ORIGINAL_AUTHOR=martinvuyk
<[email protected]>
Closes #2960
MODULAR_ORIG_COMMIT_REV_ID: 8ed4f3418bc1681d45252d2373227d65eeb31e04
@modularbot modularbot added the merged-externally Merged externally in public mojo repo label Jul 10, 2024
@modularbot
Copy link
Collaborator

Landed in cd65e1b! Thank you for your contribution 🎉

@modularbot modularbot closed this Jul 10, 2024
@martinvuyk martinvuyk deleted the fix-string-split branch July 16, 2024 13:16
modularbot added a commit that referenced this pull request Sep 13, 2024
…en__()` (#43119)

Fix string split to take in NoneType as input and its body to use
`isspace(self)` method. Fixes #2880

Leave the fix for `lstrip` and `rstrip` to use `String.isspace()` as
`TODO` once llvm intrinsics can be used at comp time

Added a method `String.byte_length()` but left the builtin
`String.__len__()` alone for now since many other methods assume it
returns byte length and it will require mayor refactoring of too many
places for 1 PR. Added deprecation warning to `_byte_length()`

ORIGINAL_AUTHOR=martinvuyk
<[email protected]>
Closes #2960
MODULAR_ORIG_COMMIT_REV_ID: 8ed4f3418bc1681d45252d2373227d65eeb31e04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
imported-internally Signals that a given pull request has been imported internally. merged-externally Merged externally in public mojo repo merged-internally Indicates that this pull request has been merged internally
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants