ustring() docs/questions #88

Open
mindplay-dk opened this issue Aug 13, 2023 · 3 comments
Labels
A-parsers (Area: Issues related to parsers), C-discussion (Category: Discussions or questions that don't represent real issues)

Comments

@mindplay-dk
Contributor

mindplay-dk commented Aug 13, 2023

What is the ustring parser for? Assuming you load a valid Unicode text file (which in this day and age is every text file), wouldn't it just match everything?

Best guess, this is for something like validating correct binary encoding of JSON files? But is it actually possible to load an invalid Unicode text file into a string in JavaScript?

I was expecting I'd use this for, say, keywords.

But then the "success" example in the documentation says:

Note that the index is 12, which is correct, since every hieroglyph here takes 3 bytes.

String operations in JS generally operate in ranges of code points:

image

So these numbers aren't useful for error reporting, or any subsequent string operation in JS really.
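To make the mismatch concrete, here is a minimal sketch. The documentation's actual input is not quoted in this thread, so the four-character string below is an assumption, chosen because each character takes 3 bytes in UTF-8. `utf8ByteLength` is a hypothetical helper, not part of the library.

```typescript
// Hypothetical helper (not part of the library): length of a string in UTF-8 bytes.
function utf8ByteLength(s: string): number {
  return new TextEncoder().encode(s).length;
}

// Assumed input: four BMP characters, each 3 bytes in UTF-8.
const matched = "語言處理";
const text = matched + " rest";

const byteOffset = utf8ByteLength(matched); // 12
const unitOffset = matched.length;          // 4

// String methods index by UTF-16 code units, so only unitOffset
// lands right after the match.
const restByUnits = text.slice(unitOffset); // " rest"
const restByBytes = text.slice(byteOffset); // "" (12 is past the end of the 9-unit string)
```

Feeding the byte offset back into `slice` silently skips the rest of the input, which is the kind of failure that makes such an index hard to use for error reporting.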

Text editors usually measure positions in code points as well:

image

The documentation itself explains:

This parser is very similar to the string parser, except it takes a bit hacky (though performant) approach, that is based on counting length of the given match string in bytes. It then subslices and compares string slice with that match string.

"hacky though performant", but it seems like this is doing a lot of unnecessary work to figure out a string position that isn't useful for most common use cases, like just matching a keyword or symbol, isn't it?

What I was expecting was a simpler parser that would use String.prototype.includes, which ought to be the fastest native way to check for a specific string at a specific offset, I think?

EDIT: oh, whoops, now I get it! I avoided string, because it specifically says this will match "ASCII", which is incorrect. It would in fact match whatever Unicode characters you put in the string. Looks like a documentation problem.

But I don't see any other parser for simple strings - and nothing relevant in the codebase calling includes.

EDIT: looks like maybe there is room for a small optimization here to avoid copying.

It's also difficult to think of a name for such a parser, now that string is taken. 😅

(I know I'm submitting a lot of feedback! I am already somewhat invested in this lovely library, and I do want to help out - if you want me to submit PRs for anything, let me know.)

EDIT: let me know if you'd like me to correct the documentation and/or try the minor optimization/simplification with includes in the string parser.

@norskeld
Owner

norskeld commented Aug 13, 2023

let me know if you'd like me to correct the documentation and/or try the minor optimization/simplification with includes in the string parser.

Yeah, I know that you can pass it position, but includes will search for a match starting from that position and sometimes will give false-positives, because the matching string can be somewhere further in the input string. We need a strict match at the specific position, hence using substring and comparing.
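A small sketch of the false positive described above. `matchAt` is a hypothetical stand-in for the substring-and-compare approach, not the library's actual implementation, and the input string is made up for illustration.

```typescript
// Hypothetical stand-in for "substring and compare": true only if `match`
// occurs exactly at `pos`.
function matchAt(input: string, match: string, pos: number): boolean {
  return input.substring(pos, pos + match.length) === match;
}

const input = "let x = let";

// includes() treats the position as the start of a search, so it finds
// the second "let" at index 8 even though nothing matches at index 1.
const foundByIncludes = input.includes("let", 1); // true (false positive)

// The anchored comparison only succeeds where the match really starts.
const anchoredAt1 = matchAt(input, "let", 1); // false
const anchoredAt8 = matchAt(input, "let", 8); // true
```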

Sure, it would be nice to get rid of copying or optimize string/ustring in some other ways, but I don't know how tbh.

I avoided string, because it specifically says this will match "ASCII", which is incorrect. It would in fact match whatever Unicode characters you put in the string. Looks like a documentation problem.

I'm not sure I get what you mean, but I for sure need to check if the whole ASCII/Unicode stuff is done right, because I suspect I messed up badly with ustring. :)

@norskeld norskeld added A-parsers Area: Issues related to parsers C-discussion Category: Discussions or questions that don't represent real issues labels Aug 13, 2023
@mindplay-dk
Contributor Author

We need a strict match at the specific position, hence using substring and comparing.

Ah, right. I misremembered - I thought specifying an offset for includes would make it match only at that location. But you're right, includes only uses the offset as the starting point of the search. Sort of odd. 😊
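As a side note that the thread itself doesn't raise: the standard `String.prototype.startsWith` method accepts an optional position and matches only at that position, without creating an intermediate substring. The sketch below only shows its behavior; whether it is faster than substring-and-compare in this library has not been measured here, and the input string is made up.

```typescript
const source = "let x = let";

// startsWith(search, position) tests for a match exactly at `position`.
const atStart = source.startsWith("let", 0);  // true
const atOne = source.startsWith("let", 1);    // false: "et " starts at index 1
const atEight = source.startsWith("let", 8);  // true

// Same answers as an anchored substring comparison.
const viaSubstring = source.substring(1, 1 + "let".length) === "let"; // false
```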

Regarding string, I'm just pointing out the docs describe it as working with ASCII strings, which is incorrect - it works with Unicode strings, which is all JS knows how to work with.

If you really wanted support for binaries, you'd have to depart from strings and use Uint8Array instead - as far as I know, there are binary sequences that cannot be represented as strings in JS at all. I'm pretty sure that's not what you'd want. 😅
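This also bears on the earlier question of whether an invalid file can end up in a string at all. A minimal sketch, assuming the bytes are decoded with the standard `TextDecoder` (how a particular file API decodes is outside this example): invalid UTF-8 is either replaced with U+FFFD or rejected, so the original bytes do not survive in the string.

```typescript
// 0xff can never appear in well-formed UTF-8; 0x61 and 0x62 are "a" and "b".
const bytes = new Uint8Array([0x61, 0xff, 0x62]);

// Default decoding substitutes U+FFFD (the replacement character) for the bad byte.
const lossy = new TextDecoder("utf-8").decode(bytes); // "a\uFFFDb"

// With fatal: true, decoding throws instead of substituting.
let threw = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(bytes);
} catch {
  threw = true;
}
```

Either way the string no longer carries the raw byte sequence, which is why byte-level parsing would need to work on a `Uint8Array` directly.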

So really, this is just about improving the documentation - I'm not really sure what you intend to use ustring for, but string will definitely work for Unicode strings (and nothing else) so it shouldn't say ASCII in the docs.

If ustring has any use (?) it might be for something like working with buffers in Node? But since other parsers are going to count code points rather than bytes, even then, the numbers you get aren't likely to make any sense. You can't add code point offsets and byte offsets - that's apples to oranges; those numbers aren't going to be useful for anything. 😅

(have you ever used it for anything?)

@mindplay-dk
Contributor Author

I'm not sure I get what you mean, but I for sure need to check if the whole ASCII/Unicode stuff is done right, because I suspect I messed up badly with ustring. :)

I think we should probably remove ustring, because it's confusing, and the numbers aren't going to be any use, since it's mixing different units.

JS strings are encoded in UTF-16, and lengths/offsets are measured in code units, which are encoding dependent, so in this case two bytes each. Whereas code points are the actual Unicode character numbers, independent of encoding - these are usually represented as one code unit in JS, but some characters fall outside that range, and those can occupy two code units in a JS string.

Maybe this helps:

image

That is, in JavaScript's UTF-16 encoding, this guy is a single code unit, whereas 🜲 is two code units - they are each one logical code point, but that doesn't matter to JS strings.
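The same distinction can be checked directly. In this sketch, 🜲 (U+1F732) is the character named above; the character from the image is not recoverable from the text, so é (U+00E9) is used here as an assumed stand-in for a character that fits in one code unit.

```typescript
const bmp = "\u00E9";       // é, U+00E9: assumed stand-in for a single-code-unit character
const astral = "\u{1F732}"; // 🜲, U+1F732: outside the Basic Multilingual Plane

// .length counts UTF-16 code units.
const bmpUnits = bmp.length;       // 1
const astralUnits = astral.length; // 2

// Array.from iterates by code point, so both are one logical character.
const bmpPoints = Array.from(bmp).length;       // 1
const astralPoints = Array.from(astral).length; // 1

// codePointAt reads the full code point; charCodeAt only sees the first code unit,
// which for an astral character is a high surrogate.
const fullCodePoint = astral.codePointAt(0); // 0x1F732
const firstUnit = astral.charCodeAt(0);
const firstUnitIsHighSurrogate = firstUnit >= 0xd800 && firstUnit <= 0xdbff;
```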

You can find the definitions of code point and code unit here:

https://www.unicode.org/glossary/

The most important thing to note is that JS strings cannot represent binary data - they can only represent valid UTF-16 code units.

I think we should drop ustring and update the documentation of string.

The offsets reported by ustring don't make any sense - they can't be used with any JS functions, and even if they could in theory be used with binary data buffers, strings cannot represent arbitrary binary data to begin with.

So the parser library is currently for UTF-16 strings, nothing else. (All of the string methods are generic, so, in principle, you could enhance the library to support binary parsers. That's not something I'm personally interested in investing in.)

What would you like to do? 🙂
