Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Implement Int.from_bytes() and Int.as_bytes() #3795

Closed
wants to merge 31 commits into from

Conversation

msaelices
Copy link
Contributor

Similar to the Python's int.from_bytes() and int.to_bytes() one.

@@ -1194,6 +1195,64 @@ struct Int(

writer.write(self)

@staticmethod
fn from_bytes[
type: DType, big_endian: Bool = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of adding a from_bytes and to_bytes method, but I'm not a fan of using a boolean to represent the endianness here. If we wanted to be Pythonic, we could use StringLiteral in the parameter and constrain it to be "little" or "big", or we could make an Endian wrapper struct which we can parameterize other methods on in the future. WDYT @JoeLoser?

I also don't think we want to add a DType parameter here. We can just get the sizeof[Int](), or perhaps add a helper to get the DType equivalent to Int.

Copy link
Contributor Author

@msaelices msaelices Nov 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was adding these methods because they are very convenient for IO or networking protocols. For example I am needing these for a pure Mojo implementation of websockets.

If we don't have the DType parameter, and we get an array of 60 bytes from some stream, how can we do the following cases:

  1. I want to get 10 UInt64 integers (10x6=60)
  2. I want to get 10 Int64 integers
  3. I want to get 30 UInt16 integers (30x2=60)
  4. I want to get 30 Int16 integers

Python implementation of int.as_bytes() have the length and byteorder arguments, and int.from_bytes() assumes we always have a 4 bytes integers and only have the byteorder one, but as arguments (run-time) which is slower and in Mojo we can do it comptime.

If we want to implement something IO-related, like WebSockets, we already know the endianness of the target we want to read from or write to, so it's faster if we use parameters.

Copy link
Contributor

@lsh lsh Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anything, these comments make me think that from_bytes and to_bytes belongs on SIMD, not Int.

Python implementation of int.as_bytes() have the length and byteorder arguments, and int.from_bytes() assumes we always have a 4 bytes integers and only have the byteorder one, but as arguments (run-time) which is slower and in Mojo we can do it comptime.

I agree :) that's why I still said it should be a parameter, but my comments were questioning what the parameter should look like, because I'm not sold on Bool.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah if we follow Python's lead, bytes is binary Byte and can be encoded in UInt16 etc. We could also hold off on doing it like Python and let this function be inferred and have bytes: Span[Scalar[D]].

I think type shouldn't be used as prolifically as it is for parameters since it's the name of Python's type builtin function that I think we'll eventually have in Mojo as well.

Suggested change
type: DType, big_endian: Bool = False
D: DType, big_endian: Bool = False

questioning what the parameter should look like, because I'm not sold on Bool.

As for this, I personally never liked sys.byteorder returning "big" or "little" because there is no third option, so a boolean makes sense to me 🤷‍♂️. But compatibility may be a solid argument in favor of it.

Copy link
Contributor Author

@msaelices msaelices Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely agree with @martinvuyk . We are already diverging from the Python int.from_bytes() and int.to_bytes(). As a Python developer, I personally don't mind making a slight adaptation (e.g., "big" -> True and "small" -> False). The real issue would be if there is no equivalent implementation in Mojo's stdlib; you would need to implement the logic yourself.

@msaelices msaelices requested a review from lsh November 24, 2024 18:27
@lsh lsh self-assigned this Nov 25, 2024
Copy link
Contributor

@martinvuyk martinvuyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @msaelices, cool to know someone else is also pushing things forward for binary serialization.

I have two main opinions:

  • I think we should let the DType be inferred and have the Span be of that DType. I know Python treats everything like bytes but I think it's something that will bite us in the long term. We have a statically typed language and we can make use of that. So IMO we should hold off on making everything revolve around Byte and having to if-else on every binary interface until we are absolutely sure we want to go that way.
  • I think we can hold off on controlling endianness in the parameter until users need it.

@@ -1194,6 +1195,64 @@ struct Int(

writer.write(self)

@staticmethod
fn from_bytes[
type: DType, big_endian: Bool = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah if we follow Python's lead, bytes is binary Byte and can be encoded in UInt16 etc. We could also hold off on doing it like Python and let this function be inferred and have bytes: Span[Scalar[D]].

I think type shouldn't be used as prolifically as it is for parameters since it's the name of Python's type builtin function that I think we'll eventually have in Mojo as well.

Suggested change
type: DType, big_endian: Bool = False
D: DType, big_endian: Bool = False

questioning what the parameter should look like, because I'm not sold on Bool.

As for this, I personally never liked sys.byteorder returning "big" or "little" because there is no third option, so a boolean makes sense to me 🤷‍♂️. But compatibility may be a solid argument in favor of it.

Comment on lines 1234 to 1243
var ptr: UnsafePointer[Byte] = UnsafePointer.address_of(bytes[0])
var type_ptr: UnsafePointer[Scalar[type]] = ptr.bitcast[Scalar[type]]()
var value = type_ptr[]

@parameter
if is_big_endian() and not big_endian:
value = byte_swap(value)
elif not is_big_endian() and big_endian:
value = byte_swap(value)
return int(value)
Copy link
Contributor

@martinvuyk martinvuyk Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var ptr: UnsafePointer[Byte] = UnsafePointer.address_of(bytes[0])
var type_ptr: UnsafePointer[Scalar[type]] = ptr.bitcast[Scalar[type]]()
var value = type_ptr[]
@parameter
if is_big_endian() and not big_endian:
value = byte_swap(value)
elif not is_big_endian() and big_endian:
value = byte_swap(value)
return int(value)
var ptr = bytes.unsafe_ptr().bitcast[Scalar[type]]()
@parameter
if is_big_endian() and not big_endian:
return int(byte_swap(ptr[]))
elif not is_big_endian() and big_endian:
return int(byte_swap(ptr[]))
else:
return int(ptr[])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if is_big_endian() == big_endian:
  ...
else:
  ...

Comment on lines 1264 to 1271
var ptr: UnsafePointer[Scalar[type]] = UnsafePointer.address_of(value)
var byte_ptr: UnsafePointer[Byte] = ptr.bitcast[Byte]()
var list = List[Byte](capacity=type_len)

# TODO: Maybe this can be a List.extend(ptr, count) method
memcpy(list.unsafe_ptr(), byte_ptr, type_len)
list.size = type_len

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should add more UnsafePointer APIs than absolutely necessary

Suggested change
var ptr: UnsafePointer[Scalar[type]] = UnsafePointer.address_of(value)
var byte_ptr: UnsafePointer[Byte] = ptr.bitcast[Byte]()
var list = List[Byte](capacity=type_len)
# TODO: Maybe this can be a List.extend(ptr, count) method
memcpy(list.unsafe_ptr(), byte_ptr, type_len)
list.size = type_len
var ptr = UnsafePointer.address_of(value)
var list = List[Byte](capacity=type_len)
memcpy(list.unsafe_ptr(), ptr.bitcast[Byte](), type_len)
list.size = type_len

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Changed here: msaelices@30027df

@@ -19,6 +19,8 @@ from testing import assert_equal, assert_true, assert_false, assert_raises
from python import PythonObject
from memory import UnsafePointer

alias Bytes = List[Byte]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: This definition is not decided over yet. could you take this into the scope of your test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Done: msaelices@7766ea3

@msaelices
Copy link
Contributor Author

msaelices commented Nov 27, 2024

Hi @msaelices, cool to know someone else is also pushing things forward for binary serialization.

I have two main opinions:

  • I think we should let the DType be inferred and have the Span be of that DType. I know Python treats everything like bytes but I think it's something that will bite us in the long term. We have a statically typed language and we can make use of that. So IMO we should hold off on making everything revolve around Byte and having to if-else on every binary interface until we are absolutely sure we want to go that way.

Great suggestion. I did it here: msaelices@63fa5b4

Unfortunately, the tests do not work because this kind of errors:

/home/msaelices/src/mojo/stdlib/test/builtin/test_int.mojo:252:62: error: invalid call to 'from_bytes': failed to infer implicit parameter 'is_mutable' of argument 'bytes' type 'Span'            
    assert_equal(Int.from_bytes[DType.int16, big_endian=True](Bytes(0, 16)), 16)  

My concern is that we will have to complicate the currently complicated signature by explicitly pass the is_mutable or origin param. Do you have any idea?

  • I think we can hold off on controlling endianness in the parameter until users need it.

@martinvuyk
Copy link
Contributor

Unfortunately, the tests do not work because this kind of errors:

/home/msaelices/src/mojo/stdlib/test/builtin/test_int.mojo:252:62: error: invalid call to 'from_bytes': failed to infer implicit parameter 'is_mutable' of argument 'bytes' type 'Span'            
    assert_equal(Int.from_bytes[DType.int16, big_endian=True](Bytes(0, 16)), 16)  

My concern is that we will have to complicate the currently complicated signature by explicitly pass the is_mutable or origin param. Do you have any idea?

Hmm yet another place where origins get annoying 🙄. This seems like a complete bug, it should work:

struct Span:
    ...
    fn __init__(out self, ref [origin]list: List[T, *_]): ...

You could try something like forcing the casting of the origin:

assert_equal(Int.from_bytes[DType.int16, big_endian=True](Span(Bytes(0, 16)).get_immutable()), 16)  

but this should definitely not be necessary.

@msaelices
Copy link
Contributor Author

msaelices commented Nov 28, 2024

Unfortunately, the tests do not work because this kind of errors:

/home/msaelices/src/mojo/stdlib/test/builtin/test_int.mojo:252:62: error: invalid call to 'from_bytes': failed to infer implicit parameter 'is_mutable' of argument 'bytes' type 'Span'            
    assert_equal(Int.from_bytes[DType.int16, big_endian=True](Bytes(0, 16)), 16)  

My concern is that we will have to complicate the currently complicated signature by explicitly pass the is_mutable or origin param. Do you have any idea?

Hmm yet another place where origins get annoying 🙄. This seems like a complete bug, it should work:

struct Span:
    ...
    fn __init__(out self, ref [origin]list: List[T, *_]): ...

You could try something like forcing the casting of the origin:

assert_equal(Int.from_bytes[DType.int16, big_endian=True](Span(Bytes(0, 16)).get_immutable()), 16)  

but this should definitely not be necessary.

Maybe this change in the compiler will help here: 3c4f57c

Will check it probably today.

@msaelices
Copy link
Contributor Author

msaelices commented Nov 28, 2024

Unfortunately, the tests do not work because this kind of errors:

/home/msaelices/src/mojo/stdlib/test/builtin/test_int.mojo:252:62: error: invalid call to 'from_bytes': failed to infer implicit parameter 'is_mutable' of argument 'bytes' type 'Span'            
    assert_equal(Int.from_bytes[DType.int16, big_endian=True](Bytes(0, 16)), 16)  

My concern is that we will have to complicate the currently complicated signature by explicitly pass the is_mutable or origin param. Do you have any idea?

Hmm yet another place where origins get annoying 🙄. This seems like a complete bug, it should work:

struct Span:
    ...
    fn __init__(out self, ref [origin]list: List[T, *_]): ...

You could try something like forcing the casting of the origin:

assert_equal(Int.from_bytes[DType.int16, big_endian=True](Span(Bytes(0, 16)).get_immutable()), 16)  

but this should definitely not be necessary.

Maybe this change in the compiler will help here: 3c4f57c

Will check it probably today.

I could not get this working even after the support in the compiler of setting the implicit is_mutable parameter :(

I am afraid that we need to revert some of the changes from [6d395a1c691000ac06fa1bcbd96305d6359d1784] make it work. Done here: msaelices@b5e09c5

@martinvuyk we can always revisit this in the future and make it more general.

@msaelices msaelices requested a review from martinvuyk November 28, 2024 13:14
@@ -1212,6 +1217,65 @@ struct Int(

writer.write(self)

@staticmethod
fn from_bytes[
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think my comment before was ever properly addressed. I think this method should be on SIMD, not Int. That way the DType makes more sense. We can think more about what it would look like on Int later, and doing int(UInt64.from_bytes(span)) doesn't seem like too much of a hit to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed it. Done:
msaelices@776b4df
msaelices@265b648

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for all the change requests, but to be clear I am comfortable merging these changes on SIMD, but not on Int. I think the SIMD API makes a ton of sense, but the int API would need some more thought since taking a DType feels wrong.

Copy link
Contributor

@soraros soraros Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I think we should/could land the uncontroversial part (from/to_byte on SIMD) first and iterate. WDYT @msaelices?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, actually the whole point of this PR was to help Python developers who need to migrate Python code using int.from_bytes() and int.to_bytes(). So, to me, it's easier if we have it on the Int struct too, and iterate it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with having it on the Int struct too because we would either:

  • Break code users relied on
  • Never get around to revising it and have a substandard API

I'm also not compelled by the idea that the goal is to mirror a Python API considering your earlier argument in favor of breaking away from the Python API using a parameter. As I mentioned before, I think int(UInt64.from_bytes(my_span)) is fine for now

@msaelices msaelices requested review from lsh and soraros December 16, 2024 09:15
stdlib/src/builtin/simd.mojo Outdated Show resolved Hide resolved
@msaelices msaelices force-pushed the int-from-and-to-bytes branch from 047b471 to 8caad41 Compare December 17, 2024 11:28
@msaelices msaelices requested a review from lsh December 17, 2024 12:39
stdlib/src/builtin/int.mojo Outdated Show resolved Hide resolved
stdlib/src/builtin/simd.mojo Show resolved Hide resolved
stdlib/src/builtin/simd.mojo Outdated Show resolved Hide resolved
@msaelices msaelices requested review from jackos and a team as code owners December 22, 2024 00:00
@msaelices msaelices force-pushed the int-from-and-to-bytes branch 2 times, most recently from 2fcb115 to 98c29fc Compare December 22, 2024 00:19
@lsh
Copy link
Contributor

lsh commented Dec 22, 2024

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Dec 22, 2024
@lsh
Copy link
Contributor

lsh commented Jan 7, 2025

!sync

@modularbot
Copy link
Collaborator

✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions, click here to learn more.

@modularbot modularbot added the merged-internally Indicates that this pull request has been merged internally label Jan 7, 2025
@modularbot
Copy link
Collaborator

Landed in ab7bfa5! Thank you for your contribution 🎉

@modularbot modularbot added the merged-externally Merged externally in public mojo repo label Jan 7, 2025
modularbot pushed a commit that referenced this pull request Jan 7, 2025
… (#53303)

[External] [stdlib] Implement `Int.from_bytes()` and `Int.as_bytes()`

Similar to the Python's
[int.from_bytes()](https://docs.python.org/3/library/stdtypes.html#int.from_bytes)
and
[int.to_bytes()](https://docs.python.org/3/library/stdtypes.html#int.to_bytes)
one.

Co-authored-by: Manuel Saelices <[email protected]>
Co-authored-by: Lukas Hermann <[email protected]>
Closes #3795
MODULAR_ORIG_COMMIT_REV_ID: 1e78cc9dd643009a98c7f3de77c4c250e7c5ea3f
@modularbot modularbot closed this Jan 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
imported-internally Signals that a given pull request has been imported internally. merged-externally Merged externally in public mojo repo merged-internally Indicates that this pull request has been merged internally
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants