Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: add PortId::from_bytes and rework identifier validation #936

Closed
wants to merge 1 commit into from

Conversation

mina86
Copy link
Contributor

@mina86 mina86 commented Oct 26, 2023

The main motivating factor for this change is scenario in which one
has a slice of bytes and wants to parse it as PartId. Since PortId
implements FromStr and has a new method which takes a String such
slice must first be converted into a string. A parsing code might
look something like:

fn port_id_from_bytes(bytes: &[u8]) -> Result<PortId, Error> {
    let id = core::str::from_utf8(bytes)?;
    let id = PortId::from_str(id)?;
    Ok(id)
}

However, notice that in this situation the bytes are validated twice
(in fact three times, see below). First from_utf8 has to go through
it to check if the identifier is valid UTF-8; and then from_str has
to go through the bytes again. This by itself is wasteful but what’s
even worse is that Unicode strings are not valid identifiers so the
logic of checking if bytes are valid UTF-8 is unnecessary.

With PortId::from_bytes, the code checks whether the bytes includes
any invalid characters. If it doesn’t, than it knows the entire slice
is all ASCII bytes and thus can be converted to a string.

To handle error cases, introduce Error::InvalidUtf8 error which is
used in the bytes aren’t valid UTF-8.


With this change, validate_identifier_chars now works on slice of
bytes rather than on a str. This by itself is probably an
optimisation since iterating over bytes is easier than over characters
of a string. Since Unicode characters aren’t valid parts of
identifiers this doesn’t affect observable behaviour of the code.


Furthermore, this change also refactors the validation
code. Specifically, the old identifier validation code contained the
following:

if id.contains(PATH_SEPARATOR) {
    return Err(Error::ContainSeparator { id: id.into() });
}
if !id.chars().all(|c| /* ... */) {
    return Err(Error::InvalidCharacter { id: id.into() });
}

This means that all identifiers had to be scanned twice. First to
look for a slash and then to check if all characters are valid. After
all the refactoring the code is now equivalence of:

if !id.bytes().all(|c| /* ... */) {
    if id.as_bytes().contains(PATH_SEPARATOR) {
        return Err(Error::ContainSeparator { id: id.into() });
    } else {
        return Err(Error::InvalidCharacter { id: id.into() });
    }
}

With this, correct identifiers are scanned only once.

PR author checklist:

  • Added changelog entry, using unclog.
  • Added tests.
  • Linked to GitHub issue.
  • Updated code comments and documentation (e.g., docs/).
  • Tagged one reviewer who will be the one responsible for shepherding this PR.

Reviewer checklist:

  • Reviewed Files changed in the GitHub PR explorer.
  • Manually tested (in case integration/unit/mock tests are absent).

The main motivating factor for this change is scenario in which one
has a slice of bytes and wants to parse it as PartId.  Since PortId
implements FromStr and has a new method which takes a String such
slice must first be converted into a string.  A parsing code might
look something like:

    fn port_id_from_bytes(bytes: &[u8]) -> Result<PortId, Error> {
        let id = core::str::from_utf8(bytes)?;
        let id = PortId::from_str(id)?;
        Ok(id)
    }

However, notice that in this situation the bytes are validated twice
(in fact three times, see below).  First `from_utf8` has to go through
it to check if the identifier is valid UTF-8; and then `from_str` has
to go through the bytes again.  This by itself is wasteful but what’s
even worse is that Unicode strings are not valid identifiers so the
logic of checking if bytes are valid UTF-8 is unnecessary.

With PortId::from_bytes, the code checks whether the bytes includes
any invalid characters.  If it doesn’t, than it knows the entire slice
is all ASCII bytes and thus can be converted to a string.

To handle error cases, introduce Error::InvalidUtf8 error which is
used in the bytes aren’t valid UTF-8.

----

With this change, validate_identifier_chars now works on slice of
bytes rather than on a str.  This by itself is probably an
optimisation since iterating over bytes is easier than over characters
of a string.  Since Unicode characters aren’t valid parts of
identifiers this doesn’t affect observable behaviour of the code.

----

Furthermore, this change also refactors the validation code.
Specifically, the old identifier validation code contained the
following:

    if id.contains(PATH_SEPARATOR) {
        return Err(Error::ContainSeparator { id: id.into() });
    }
    if !id.chars().all(|c| /* ... */) {
        return Err(Error::InvalidCharacter { id: id.into() });
    }

This means that all identifiers had to be scanned twice.  First to
look for a slash and then to check if all characters are valid.  After
all the refactoring the code is now equivalence of:

    if !id.bytes().all(|c| /* ... */) {
        if id.as_bytes().contains(PATH_SEPARATOR) {
            return Err(Error::ContainSeparator { id: id.into() });
        } else {
            return Err(Error::InvalidCharacter { id: id.into() });
        }
    }

With this, correct identifiers are scanned only once.
@codecov
Copy link

codecov bot commented Oct 26, 2023

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (553bc79) 67.65% compared to head (475fc6f) 67.86%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #936      +/-   ##
==========================================
+ Coverage   67.65%   67.86%   +0.20%     
==========================================
  Files         130      130              
  Lines       16415    16463      +48     
==========================================
+ Hits        11106    11172      +66     
+ Misses       5309     5291      -18     
Files Coverage Δ
...tes/ibc/src/core/ics24_host/identifier/validate.rs 100.00% <100.00%> (ø)
crates/ibc/src/core/ics24_host/identifier.rs 73.73% <94.20%> (+7.50%) ⬆️

... and 7 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rnbguy
Copy link
Collaborator

rnbguy commented Oct 30, 2023

Thanks for the PR @mina86. I see the point of your PR but without any unsafe code block will be great.

Note that the validate_* methods are used to validate the String/&str/[u8]. They shouldn't call from_utf8* inside.

The conversion to String/&str should be after the validation checks. How about we call .map(|&b| b as char).collect() on [u8] to convert it to String? We can directly use it to construct PortId.

@rnbguy rnbguy changed the title feat: add PortId::from_bytes and rework identifier validation perf: add PortId::from_bytes and rework identifier validation Oct 30, 2023
@mina86
Copy link
Contributor Author

mina86 commented Oct 30, 2023

I’m not sure I understand your suggestion. Rather than map and collect, to not include unsafe block the simplest way is to change:

            Ok(unsafe { core::str::from_utf8_unchecked(id) })

into either one of:

            core::str::from_utf8(id).map_err(|_| Error::InvalidUtf8 { id: id.into() })
            Ok(core::str::from_utf8(id).except("id has ASCII characters only"))

Doing collect has a downside of forcing the allocation which this PR is avoiding (see new method).

However, whether it’s using from_utf8 (as above) or .map(|&b| b as char).collect() (as you’ve suggested), this defeats the point of the commit by reintroducing the unnecessary O(n) operation.

What’d be your opinion on adding ascii dependency? That would at least replace UTF-8 validation with ASCII validation which is potentially faster. (Of course this is also why I find forbidding of unsafe to be somewhat silly since you’re using dependencies with unsafe blocks so can I create an ascii-validation crate with the unsafe code and that’d be fine?)

@rnbguy
Copy link
Collaborator

rnbguy commented Oct 30, 2023

Ah. I understood that you're trying to avoid UTF-8 validations - that's why I suggested u8 as char or char::from. I am fine with core::str::from_utf8.

Note, the extra linear check doesn't change the total O( ) of PortId::from_bytes as it will iterate on every byte anyway. I do agree this adds a constant factor to the number of operations during validation.

Regarding collect forcing extra allocation, I suggested not calling core::str::from_utf8 in the validation steps and keeping the same function signature for validation methods e.g.

pub fn validate_port_identifier(id: &str) -> Result<(), Error>

and factor PortId methods as follows.

    pub fn new(id: String) -> Result<Self, IdentifierError> {
        validate_port_identifier(id.as_bytes())?;
        Ok(Self(id))
    }

    pub fn from_bytes(bytes: &[u8]) -> Result<Self, IdentifierError> {
        validate_port_identifier(bytes)?;
        Ok(Self(bytes.iter().map(char::from).collect()))
    }

As you are converting &str to String anyway.

@mina86
Copy link
Contributor Author

mina86 commented Oct 31, 2023

Ah. I understood that you're trying to avoid UTF-8 validations - that's why I suggested u8 as char or char::from. I am fine with core::str::from_utf8.

Yes, but .map(|b| *b as char).collect()is still a linear operation. collect` has to encode each character into UTF-8 and since length of the resulting string is not known to it I’m also not sure how it handles allocations of the underlying vector. So while UTF-8 validation is avoided, it’s replaced by

Regarding collect forcing extra allocation, I suggested not calling core::str::from_utf8 in the validation steps and keeping the same function signature for validation methods e.g.

pub fn validate_port_identifier(id: &str) -> Result<(), Error>

and factor PortId methods as follows.

    pub fn from_bytes(bytes: &[u8]) -> Result<Self, IdentifierError> {
        validate_port_identifier(bytes)?;
        Ok(Self(bytes.iter().map(char::from).collect()))
    }

In that case I cannot call validate_port_identifier with a slice of bytes. That’s why validate_port_identifier needs to accept bytes slice.

Nonetheless, let’s pause this PR for a bit. I’ll try to get a smaller changes in before to get this one smaller and easier to talk about.

@mina86 mina86 marked this pull request as draft October 31, 2023 05:44
@rnbguy
Copy link
Collaborator

rnbguy commented Oct 31, 2023

since length of the resulting string is not known to it I’m also not sure how it handles allocations of the underlying vector.

It allocates once for the vector string using iterator size hint.

In that case I cannot call validate_port_identifier with a slice of bytes. That’s why validate_port_identifier needs to accept bytes slice.

ah my mistake to overlook the input type. I am okay with taking &[u8].

@Farhad-Shabani
Copy link
Member

Nonetheless, let’s pause this PR for a bit. I’ll try to get a smaller changes in before to get this one smaller and easier to talk about.

@mina86 what's the scoop on this PR? Any changes you still want to work on? If it's not something pressing, maybe turn it into an issue and tackle it at a better time according to our priorities. WDYT?

@rnbguy
Copy link
Collaborator

rnbguy commented Feb 20, 2024

I benchmarked among checked and unchecked from_utf8 and char::from with collect. from_utf8_unchecked is indeed the fastest - but it's just a little bit faster than safe from_utf8. from_utf8_unchecked indeed doesn't take any time. but from_utf8 is already very fast.

image
image

char::from with collect is the slowest. We can forget about it.

I think it is all right if we go ahead with String::from_utf8 instead of String::from_utf8_unchecked.

@Farhad-Shabani
Copy link
Member

Closing this PR as stale. Feel free, @mina86, to open an issue if anything here might still be useful for your projects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants