-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose incremental utf8 decoding APIs #2408
base: master
Are you sure you want to change the base?
Conversation
core/src/Streamly/Unicode/Stream.hs
Outdated
@@ -75,18 +75,25 @@ | |||
-- | |||
module Streamly.Unicode.Stream | |||
( | |||
DecodeState |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before exposing:
- Check the naming
- Check good documentation
- Check naming consistency
- Optional: Benchmark and Test
6a20721
to
5fa6d57
Compare
528ba6e
to
4e308f1
Compare
4e308f1
to
6c754f4
Compare
-- For multi-byte characters, the decoding state indicates the number of bytes | ||
-- remaining to complete the character. It is usually initialized to a non-zero | ||
-- value corresponding to the number of bytes in the multi-byte character, e.g | ||
-- DecodeState will be 1 for 2-bytes char. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This documentation does not make much sense to the user of the library. Need to write which APIs use this, where does this come from, what values need to be supplied etc. The information is to be used to correctly understand how to use the APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha.
-- Calculate the code point value: Depending on the type of the leading byte, | ||
-- extract the significant bits from each byte of the sequence and combine them | ||
-- to form the complete code point value. The specific bit manipulations will | ||
-- differ based on the number of bytes used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not define a codepoint here. Need to say what it means in the context of the APIs and to be able to use the APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, gotcha.
d8470ed
to
3692740
Compare
-- ** Resumable UTF-8 Decoding | ||
, DecodeError(..) | ||
, DecodeState | ||
, CodePoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can return a more intelligent decode error:
data DecodeUTF8Error = DecodeUTF8Incomplete Word32 | DecodeUTF8NonStarter Word8 | DecodeUTF8Invalid
This should be enough to build a resumable decoder. When resuming we should supply the Word32
from DecodeUTF8Incomplete.
No description provided.