ustring() docs/questions #88
Comments
Yeah, I know that you can pass it position, but Sure, it would be nice to get rid of copying or optimize
I'm not sure I get what you mean, but I for sure need to check if the whole ASCII/Unicode stuff is done right, because I suspect I messed up badly with
Ah, right. I misremembered - I thought specifying an offset for Regarding If you really wanted support for binaries, you'd have to depart from strings and use So really, this is just about improving the documentation - I'm not really sure what you intend to use If (have you ever used it for anything?)
I think we should probably remove JS strings are encoded in UTF-16, and lengths/offsets are measured in code units, which are encoding dependent, so in this case two bytes each. Whereas code points are the actual Unicode character numbers, independent of encoding - these are usually represented as one code unit in JS, but some characters fall outside that range, and those can occupy two code units in a JS string.

Maybe this helps: That is, in JavaScript's UTF-16 encoding, this guy

You can find the definitions of code point and code unit here: https://www.unicode.org/glossary/

The most important thing to note is that JS strings cannot represent binary data - they can only represent valid UTF-16 code units. I think we should drop The offsets reported by

So the parser library is currently for UTF-16 strings, nothing else. (All of the string methods are generic, so, in principle, you could enhance the library to support binary parsers. That's not something I'm personally interested in investing in.)

What would you like to do? 🙂
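For readers following along, here is a minimal sketch of the code-unit versus code-point distinction described in the comment above. The sample string and the helper `codePointOffset` are illustrative assumptions, not something taken from the library or from this thread.

```javascript
// Illustrative sample: "a", U+1F600 (an emoji outside the BMP), "b".
const sample = "a\u{1F600}b";

// .length counts UTF-16 code units; the emoji takes two (a surrogate pair).
const codeUnits = sample.length;

// The string iterator walks code points, so spreading counts characters.
const codePoints = [...sample].length;

// charCodeAt reads single code units; codePointAt decodes the surrogate pair.
const highSurrogate = sample.charCodeAt(1);
const lowSurrogate = sample.charCodeAt(2);
const decoded = sample.codePointAt(1);

// Hypothetical helper: convert a code-unit offset into a code-point offset,
// e.g. for reporting a position in characters rather than code units.
// Assumes unitOffset does not fall in the middle of a surrogate pair.
function codePointOffset(text, unitOffset) {
  return Array.from(text.slice(0, unitOffset)).length;
}

console.log(codeUnits, codePoints, decoded.toString(16));
```

Here the offset of `b` is 3 in code units but 2 in code points, which is the kind of mismatch the thread is discussing.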
What is the `ustring` parser for? Assuming you load a valid unicode text file (which in this day and age is every text file) wouldn't it just match everything?

Best guess, this is for something like validating correct binary encoding of JSON files? But is it actually possible to load a non-valid unicode text file into a string in JavaScript?
I was expecting I'd use this for, say, keywords.
But then the "success" example in the documentation says:
String operations in JS generally operate in ranges of code points:
So these numbers aren't useful for error reporting, or any subsequent string operation in JS really.
Text editors usually measure positions in code points as well:
The documentation itself explains:
"hacky though performant", but it seems like this is doing a lot of unnecessary work to figure out a string position that isn't useful for most common use cases, like just matching a keyword or symbol, isn't it?
What I was expecting was a simpler parser that would use `String.prototype.includes`, which ought to be the fastest native way to check for a specific string at a specific offset, I think?

EDIT: oh, whoops, now I get it! I avoided `string`, because it specifically says this will match "ASCII", which is incorrect. It would in fact match whatever Unicode characters you put in the string. Looks like a documentation problem.

But I don't see any other parser for simple strings - and nothing relevant in the codebase calling `includes`.

EDIT: looks like maybe there is room for a small optimization here to avoid copying.

It's also difficult to think of a name for such a parser, now that `string` is taken. 😅

(I know I'm submitting a lot of feedback! I am already somewhat invested in this lovely library, and I do want to help out - if you want me to submit PRs for anything, let me know.)
EDIT: let me know if you'd like me to correct the documentation and/or try the minor optimization/simplification with `includes` in the `string` parser.
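As a rough sketch of the copy-free matching idea raised in this issue: the function `matchLiteral` and its result shape below are hypothetical and are not the library's API. One caveat worth noting is that `String.prototype.includes(search, position)` searches from `position` onward, so it can match later in the input. `String.prototype.startsWith(search, position)` is the built-in that tests for a match exactly at `position`, and it does so without creating a substring.

```javascript
// Hypothetical helper, not the library's actual implementation.
// Offsets are UTF-16 code units, the same unit as String.prototype.length.
function matchLiteral(input, offset, literal) {
  // startsWith compares in place at `offset`; no slice() copy is made.
  if (input.startsWith(literal, offset)) {
    return { ok: true, next: offset + literal.length };
  }
  return { ok: false, next: offset };
}

// A keyword at the start of the input matches and advances the offset.
const keyword = matchLiteral("let x = 1", 0, "let");

// "x" occurs later in the input, but not at offset 0, so this fails...
const notHere = matchLiteral("let x = 1", 0, "x");

// ...whereas includes() would report a match anywhere from offset 0 onward.
const includesAnywhere = "let x = 1".includes("x", 0);

// Non-ASCII literals work too; the offset advances by two code units here.
const emoji = matchLiteral("a\u{1F600}b", 1, "\u{1F600}");

console.log(keyword, notHere, includesAnywhere, emoji);
```

Whether this is measurably faster than slicing and comparing would need to be benchmarked. The sketch only shows that an exact-offset check is possible without copying.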