perf: add PortId::from_bytes and rework identifier validation #936
The main motivating factor for this change is a scenario in which one has a slice of bytes and wants to parse it as a PortId. Since PortId implements FromStr and has a new method which takes a String, such a slice must first be converted into a string. The parsing code might look something like:

    fn port_id_from_bytes(bytes: &[u8]) -> Result<PortId, Error> {
        let id = core::str::from_utf8(bytes)?;
        let id = PortId::from_str(id)?;
        Ok(id)
    }

However, notice that in this situation the bytes are validated twice (in fact three times, see below). First `from_utf8` has to go through them to check if the identifier is valid UTF-8; and then `from_str` has to go through the bytes again. This by itself is wasteful, but what’s even worse is that identifiers containing non-ASCII characters are never valid, so the full UTF-8 validation logic is unnecessary.

With PortId::from_bytes, the code checks whether the bytes include any invalid characters. If they don’t, then it knows the entire slice is all ASCII bytes and thus can be converted to a string.

To handle error cases, introduce an Error::InvalidUtf8 error which is used if the bytes aren’t valid UTF-8.

----

With this change, validate_identifier_chars now works on a slice of bytes rather than on a str. This by itself is probably an optimisation, since iterating over bytes is cheaper than iterating over the characters of a string. Since non-ASCII characters aren’t valid parts of identifiers, this doesn’t affect observable behaviour of the code.

----

Furthermore, this change also refactors the validation code. Specifically, the old identifier validation code contained the following:

    if id.contains(PATH_SEPARATOR) {
        return Err(Error::ContainSeparator { id: id.into() });
    }
    if !id.chars().all(|c| /* ... */) {
        return Err(Error::InvalidCharacter { id: id.into() });
    }

This means that all identifiers had to be scanned twice: first to look for a slash and then to check if all characters are valid.

After all the refactoring the code is now the equivalent of:

    if !id.bytes().all(|c| /* ... */) {
        if id.as_bytes().contains(PATH_SEPARATOR) {
            return Err(Error::ContainSeparator { id: id.into() });
        } else {
            return Err(Error::InvalidCharacter { id: id.into() });
        }
    }

With this, correct identifiers are scanned only once.
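To make the description above concrete, here is a minimal, self-contained sketch of the single-scan idea. It is not the code from this PR: the `PortId` and `Error` types are simplified stand-ins, the `is_valid_byte` helper and its character set are assumptions made for illustration, and length checks are omitted.

```rust
/// Simplified stand-in for the real error type (variant shapes are assumptions).
#[derive(Debug, PartialEq)]
enum Error {
    ContainSeparator { id: String },
    InvalidCharacter { id: String },
    InvalidUtf8 { id: Vec<u8> },
}

/// Simplified stand-in for the real `PortId`.
#[derive(Debug, PartialEq)]
struct PortId(String);

const PATH_SEPARATOR: u8 = b'/';

/// Hypothetical helper; the allowed character set here is illustrative only.
fn is_valid_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b"._+-#[]<>".contains(&b)
}

impl PortId {
    fn from_bytes(bytes: &[u8]) -> Result<Self, Error> {
        // Happy path: a single scan over the bytes.
        if bytes.iter().all(|&b| is_valid_byte(b)) {
            // Every accepted byte is ASCII, so each one maps to exactly one
            // `char` and no separate UTF-8 validation is needed.
            return Ok(Self(bytes.iter().copied().map(char::from).collect()));
        }
        // Error path only: work out which error to report.
        match core::str::from_utf8(bytes) {
            Err(_) => Err(Error::InvalidUtf8 { id: bytes.to_vec() }),
            Ok(id) if bytes.contains(&PATH_SEPARATOR) => {
                Err(Error::ContainSeparator { id: id.into() })
            }
            Ok(id) => Err(Error::InvalidCharacter { id: id.into() }),
        }
    }
}
```

In this sketch the extra passes (`from_utf8` and the separator search) run only for inputs that are already known to be invalid, so well-formed identifiers pay for one scan plus the allocation of the resulting `String`.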
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## main #936 +/- ##
==========================================
+ Coverage 67.65% 67.86% +0.20%
==========================================
Files 130 130
Lines 16415 16463 +48
==========================================
+ Hits 11106 11172 +66
+ Misses 5309 5291 -18
☔ View full report in Codecov by Sentry. |
Thanks for the PR @mina86. I see the point of your PR, but it would be great to do this without any unsafe code block. Note that the The conversion to |
I’m not sure I understand your suggestion. Rather than map and collect, to not include
into either one of:
Doing However, whether it’s using What’d be your opinion on adding |
Ah. I understood that you're trying to avoid UTF-8 validations - that's why I suggested Note, the extra linear check doesn't change the total O( ) of Regarding pub fn validate_port_identifier(id: &str) -> Result<(), Error> and factor

    pub fn new(id: String) -> Result<Self, IdentifierError> {
        validate_port_identifier(id.as_bytes())?;
        Ok(Self(id))
    }

    pub fn from_bytes(bytes: &[u8]) -> Result<Self, IdentifierError> {
        validate_port_identifier(bytes)?;
        // `iter()` yields `&u8`, so copy each byte before converting it to a `char`.
        Ok(Self(bytes.iter().copied().map(char::from).collect()))
    }

As you are converting |
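For reference, the conversion options under discussion can be sketched as follows, assuming the bytes have already been validated as ASCII. The function names are hypothetical and none of this is taken from the PR itself.

```rust
/// Checked conversion: scans the bytes again to verify that they are UTF-8.
fn to_string_checked(bytes: &[u8]) -> Option<String> {
    String::from_utf8(bytes.to_vec()).ok()
}

/// Safe conversion without UTF-8 validation: each byte becomes one `char`.
/// This reproduces the input only for ASCII; a byte >= 0x80 is mapped to
/// the code point U+0080..=U+00FF, which takes two bytes in UTF-8.
fn to_string_mapped(bytes: &[u8]) -> String {
    bytes.iter().copied().map(char::from).collect()
}

/// Unchecked conversion: skips validation entirely.
///
/// # Safety
/// The caller must guarantee that `bytes` is valid UTF-8 (for example,
/// because it was just validated as ASCII).
unsafe fn to_string_unchecked(bytes: &[u8]) -> String {
    debug_assert!(bytes.is_ascii());
    // SAFETY: upheld by the caller, see the function-level contract.
    unsafe { String::from_utf8_unchecked(bytes.to_vec()) }
}
```

All three variants allocate once for the resulting `String`; they differ in whether the bytes are scanned again and in whether `unsafe` is required.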
Yes, but
In that case I cannot call Nonetheless, let’s pause this PR for a bit. I’ll try to get some smaller changes in first, to make this one smaller and easier to talk about. |
It allocates once for the
Ah, my mistake; I overlooked the input type. I am okay with taking |
@mina86 what's the scoop on this PR? Any changes you still want to work on? If it's not something pressing, maybe turn it into an issue and tackle it at a better time according to our priorities. WDYT? |
I benchmarked among checked and unchecked
I think it is all right if we go ahead with |
Closing this PR as stale. Feel free, @mina86, to open an issue if anything here might still be useful for your projects. |