perf: add PortId::from_bytes and rework identifier validation #936

mina86 · 2023-10-26T06:20:33Z

The main motivating factor for this change is scenario in which one
has a slice of bytes and wants to parse it as PartId. Since PortId
implements FromStr and has a new method which takes a String such
slice must first be converted into a string. A parsing code might
look something like:

fn port_id_from_bytes(bytes: &[u8]) -> Result<PortId, Error> {
    let id = core::str::from_utf8(bytes)?;
    let id = PortId::from_str(id)?;
    Ok(id)
}

However, notice that in this situation the bytes are validated twice
(in fact three times, see below). First from_utf8 has to go through
it to check if the identifier is valid UTF-8; and then from_str has
to go through the bytes again. This by itself is wasteful but what’s
even worse is that Unicode strings are not valid identifiers so the
logic of checking if bytes are valid UTF-8 is unnecessary.

With PortId::from_bytes, the code checks whether the bytes includes
any invalid characters. If it doesn’t, than it knows the entire slice
is all ASCII bytes and thus can be converted to a string.

To handle error cases, introduce Error::InvalidUtf8 error which is
used in the bytes aren’t valid UTF-8.

With this change, validate_identifier_chars now works on slice of
bytes rather than on a str. This by itself is probably an
optimisation since iterating over bytes is easier than over characters
of a string. Since Unicode characters aren’t valid parts of
identifiers this doesn’t affect observable behaviour of the code.

Furthermore, this change also refactors the validation
code. Specifically, the old identifier validation code contained the
following:

if id.contains(PATH_SEPARATOR) {
    return Err(Error::ContainSeparator { id: id.into() });
}
if !id.chars().all(|c| /* ... */) {
    return Err(Error::InvalidCharacter { id: id.into() });
}

This means that all identifiers had to be scanned twice. First to
look for a slash and then to check if all characters are valid. After
all the refactoring the code is now equivalence of:

if !id.bytes().all(|c| /* ... */) {
    if id.as_bytes().contains(PATH_SEPARATOR) {
        return Err(Error::ContainSeparator { id: id.into() });
    } else {
        return Err(Error::InvalidCharacter { id: id.into() });
    }
}

With this, correct identifiers are scanned only once.

PR author checklist:

Added changelog entry, using unclog.
Added tests.
Linked to GitHub issue.
Updated code comments and documentation (e.g., docs/).
Tagged one reviewer who will be the one responsible for shepherding this PR.

Reviewer checklist:

Reviewed Files changed in the GitHub PR explorer.
Manually tested (in case integration/unit/mock tests are absent).

The main motivating factor for this change is scenario in which one has a slice of bytes and wants to parse it as PartId. Since PortId implements FromStr and has a new method which takes a String such slice must first be converted into a string. A parsing code might look something like: fn port_id_from_bytes(bytes: &[u8]) -> Result<PortId, Error> { let id = core::str::from_utf8(bytes)?; let id = PortId::from_str(id)?; Ok(id) } However, notice that in this situation the bytes are validated twice (in fact three times, see below). First `from_utf8` has to go through it to check if the identifier is valid UTF-8; and then `from_str` has to go through the bytes again. This by itself is wasteful but what’s even worse is that Unicode strings are not valid identifiers so the logic of checking if bytes are valid UTF-8 is unnecessary. With PortId::from_bytes, the code checks whether the bytes includes any invalid characters. If it doesn’t, than it knows the entire slice is all ASCII bytes and thus can be converted to a string. To handle error cases, introduce Error::InvalidUtf8 error which is used in the bytes aren’t valid UTF-8. ---- With this change, validate_identifier_chars now works on slice of bytes rather than on a str. This by itself is probably an optimisation since iterating over bytes is easier than over characters of a string. Since Unicode characters aren’t valid parts of identifiers this doesn’t affect observable behaviour of the code. ---- Furthermore, this change also refactors the validation code. Specifically, the old identifier validation code contained the following: if id.contains(PATH_SEPARATOR) { return Err(Error::ContainSeparator { id: id.into() }); } if !id.chars().all(|c| /* ... */) { return Err(Error::InvalidCharacter { id: id.into() }); } This means that all identifiers had to be scanned twice. First to look for a slash and then to check if all characters are valid. After all the refactoring the code is now equivalence of: if !id.bytes().all(|c| /* ... */) { if id.as_bytes().contains(PATH_SEPARATOR) { return Err(Error::ContainSeparator { id: id.into() }); } else { return Err(Error::InvalidCharacter { id: id.into() }); } } With this, correct identifiers are scanned only once.

codecov · 2023-10-26T06:28:39Z

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (553bc79) 67.65% compared to head (475fc6f) 67.86%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #936      +/-   ##
==========================================
+ Coverage   67.65%   67.86%   +0.20%     
==========================================
  Files         130      130              
  Lines       16415    16463      +48     
==========================================
+ Hits        11106    11172      +66     
+ Misses       5309     5291      -18

Files	Coverage Δ
...tes/ibc/src/core/ics24_host/identifier/validate.rs	`100.00% <100.00%> (ø)`
crates/ibc/src/core/ics24_host/identifier.rs	`73.73% <94.20%> (+7.50%)`	⬆️

... and 7 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

rnbguy · 2023-10-30T14:59:33Z

Thanks for the PR @mina86. I see the point of your PR but without any unsafe code block will be great.

Note that the validate_* methods are used to validate the String/&str/[u8]. They shouldn't call from_utf8* inside.

The conversion to String/&str should be after the validation checks. How about we call .map(|&b| b as char).collect() on [u8] to convert it to String? We can directly use it to construct PortId.

mina86 · 2023-10-30T19:01:13Z

I’m not sure I understand your suggestion. Rather than map and collect, to not include unsafe block the simplest way is to change:

            Ok(unsafe { core::str::from_utf8_unchecked(id) })

into either one of:

            core::str::from_utf8(id).map_err(|_| Error::InvalidUtf8 { id: id.into() })
            Ok(core::str::from_utf8(id).except("id has ASCII characters only"))

Doing collect has a downside of forcing the allocation which this PR is avoiding (see new method).

However, whether it’s using from_utf8 (as above) or .map(|&b| b as char).collect() (as you’ve suggested), this defeats the point of the commit by reintroducing the unnecessary O(n) operation.

What’d be your opinion on adding ascii dependency? That would at least replace UTF-8 validation with ASCII validation which is potentially faster. (Of course this is also why I find forbidding of unsafe to be somewhat silly since you’re using dependencies with unsafe blocks so can I create an ascii-validation crate with the unsafe code and that’d be fine?)

rnbguy · 2023-10-30T19:18:46Z

Ah. I understood that you're trying to avoid UTF-8 validations - that's why I suggested u8 as char or char::from. I am fine with core::str::from_utf8.

Note, the extra linear check doesn't change the total O( ) of PortId::from_bytes as it will iterate on every byte anyway. I do agree this adds a constant factor to the number of operations during validation.

Regarding collect forcing extra allocation, I suggested not calling core::str::from_utf8 in the validation steps and keeping the same function signature for validation methods e.g.

pub fn validate_port_identifier(id: &str) -> Result<(), Error>

and factor PortId methods as follows.

    pub fn new(id: String) -> Result<Self, IdentifierError> {
        validate_port_identifier(id.as_bytes())?;
        Ok(Self(id))
    }

    pub fn from_bytes(bytes: &[u8]) -> Result<Self, IdentifierError> {
        validate_port_identifier(bytes)?;
        Ok(Self(bytes.iter().map(char::from).collect()))
    }

As you are converting &str to String anyway.

mina86 · 2023-10-31T05:44:20Z

Ah. I understood that you're trying to avoid UTF-8 validations - that's why I suggested u8 as char or char::from. I am fine with core::str::from_utf8.

Yes, but .map(|b| *b as char).collect()is still a linear operation. collect` has to encode each character into UTF-8 and since length of the resulting string is not known to it I’m also not sure how it handles allocations of the underlying vector. So while UTF-8 validation is avoided, it’s replaced by

Regarding collect forcing extra allocation, I suggested not calling core::str::from_utf8 in the validation steps and keeping the same function signature for validation methods e.g.
pub fn validate_port_identifier(id: &str) -> Result<(), Error>
and factor PortId methods as follows.
    pub fn from_bytes(bytes: &[u8]) -> Result<Self, IdentifierError> {
        validate_port_identifier(bytes)?;
        Ok(Self(bytes.iter().map(char::from).collect()))
    }

In that case I cannot call validate_port_identifier with a slice of bytes. That’s why validate_port_identifier needs to accept bytes slice.

Nonetheless, let’s pause this PR for a bit. I’ll try to get a smaller changes in before to get this one smaller and easier to talk about.

rnbguy · 2023-10-31T14:34:33Z

since length of the resulting string is not known to it I’m also not sure how it handles allocations of the underlying vector.

It allocates once for the ~~vector~~ string using iterator size hint.

In that case I cannot call validate_port_identifier with a slice of bytes. That’s why validate_port_identifier needs to accept bytes slice.

ah my mistake to overlook the input type. I am okay with taking &[u8].

Farhad-Shabani · 2024-02-01T20:39:10Z

Nonetheless, let’s pause this PR for a bit. I’ll try to get a smaller changes in before to get this one smaller and easier to talk about.

@mina86 what's the scoop on this PR? Any changes you still want to work on? If it's not something pressing, maybe turn it into an issue and tackle it at a better time according to our priorities. WDYT?

rnbguy · 2024-02-20T17:17:31Z

I benchmarked among checked and unchecked from_utf8 and char::from with collect. ~~from_utf8_unchecked is indeed the fastest - but it's just a little bit faster than safe from_utf8.~~ from_utf8_unchecked indeed doesn't take any time. but from_utf8 is already very fast.

char::from with collect is the slowest. We can forget about it.

I think it is all right if we go ahead with String::from_utf8 instead of String::from_utf8_unchecked.

Farhad-Shabani · 2024-03-28T13:27:44Z

Closing this PR as stale. Feel free, @mina86, to open an issue if anything here might still be useful for your projects.

mina86 force-pushed the c branch from 1c62501 to 475fc6f Compare October 26, 2023 06:22

rnbguy changed the title ~~feat: add PortId::from_bytes and rework identifier validation~~ perf: add PortId::from_bytes and rework identifier validation Oct 30, 2023

mina86 marked this pull request as draft October 31, 2023 05:44

Farhad-Shabani closed this Mar 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: add PortId::from_bytes and rework identifier validation #936

perf: add PortId::from_bytes and rework identifier validation #936

mina86 commented Oct 26, 2023 •

edited

Loading

codecov bot commented Oct 26, 2023

rnbguy commented Oct 30, 2023 •

edited

Loading

mina86 commented Oct 30, 2023

rnbguy commented Oct 30, 2023 •

edited

Loading

mina86 commented Oct 31, 2023

rnbguy commented Oct 31, 2023

Farhad-Shabani commented Feb 1, 2024

rnbguy commented Feb 20, 2024 •

edited

Loading

Farhad-Shabani commented Mar 28, 2024

perf: add PortId::from_bytes and rework identifier validation #936

perf: add PortId::from_bytes and rework identifier validation #936

Conversation

mina86 commented Oct 26, 2023 • edited Loading

PR author checklist:

Reviewer checklist:

codecov bot commented Oct 26, 2023

Codecov Report

rnbguy commented Oct 30, 2023 • edited Loading

mina86 commented Oct 30, 2023

rnbguy commented Oct 30, 2023 • edited Loading

mina86 commented Oct 31, 2023

rnbguy commented Oct 31, 2023

Farhad-Shabani commented Feb 1, 2024

rnbguy commented Feb 20, 2024 • edited Loading

Farhad-Shabani commented Mar 28, 2024

mina86 commented Oct 26, 2023 •

edited

Loading

rnbguy commented Oct 30, 2023 •

edited

Loading

rnbguy commented Oct 30, 2023 •

edited

Loading

rnbguy commented Feb 20, 2024 •

edited

Loading