-
-
Notifications
You must be signed in to change notification settings - Fork 645
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Floating point parsing support for scanf family #924
Conversation
This buffer is currently used only when decoding strings. Place it at the top level so other code can also use it. Ensure it is initialized to NULL.
The data it points to will be accessed directly. Also remove casts which are now unnecessary.
Keeps track of current buffer position.
This is similar to the READ macro. READs a character and places it in the buffer. Automatically maintains the NUL terminator. If the end is reached, reallocate the buffer.
This condition indicates that the floating point parser exited early.
Holds the parsed floating point value.
Handle all the documented format specifiers. Exit with error if given a length modifier other than "l". Skip spaces, initialize the buffer and then begin parsing. References: https://en.cppreference.com/w/c/io/fscanfî1;129A
The parsing of floating point numbers is quite complex. Reuse available libraries to accomplish the task.
The vcscanf function attempts to recognize the floating point number and then uses the strtod function to calculate its numeric value. It begins by recognizing an optional plus or minus sign, followed by either hexadecimal notation or NaN and infinity constants. Then it looks for an optional integer part, a point or comma, a fractional part, and an optional exponent symbol. If it finds an exponent, it recognizes the optional sign and digits. Every time a character is accepted, it is also buffered. The accumulated buffer is passed to strtod when recognition is finished and the output is assigned to the output pointer of appropriate length. This only happens if not in discard mode. Finally, the buffer is freed and its state is reset. References: https://en.cppreference.com/w/c/string/byte/strtof
Essentially an expanded version of the basic test case posted in GitHub issue jart#456. Also includes examples from documentation. References: jart#456 https://en.cppreference.com/w/c/string/byte/strtof
The presence of a pointer in buf could result in double free bugs when execution reaches the end of the function and it is freed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sending this.
Somehow I ended up introducing mixed tabs and spaces into the file.
Makes the goto statement easier to read. Also consistent with the rest of the codebase.
Test that floating point numbers are correctly parsed from strings.
There seems to be some bug lurking about. Only shows up if I parse several numbers from the same string one after the other. Hunting it down now. By the way, please let me know if you'd like me to clean up the git history later! |
Makes the goto statement easier to read. Also consistent with the rest of the codebase.
I don't know why I thought this wasn't necessary.
For some reason I thought the do while (c = BUFFER ...) loop already did it for me. Apparently not!
Alright! All tests pass. |
I fixed a bug somewhat related to this. Make sure it doesn't come back.
Place it after the GotFloatingPointNumber label where it belongs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thank you so much for doing this!
It appears the test segfaulted during the continuous integration build process.
Any indication as to why? Tests pass on my machine but I did not test the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After checking the change out locally and taking a closer look, could you please refactor all usages of the buf
variable? free(buf)
is crashing later in the program. I think the issue is that buf
is supposed to be used for the growing array, but it's also being assigned in two other places for seemingly unrelated purposes.
Create a dedicated buffer for floating point parsing purposes and restore the original string decoding buffer just as it was. This fixes the segmentation fault in the tinylinux mode tests. I suspect the floating point buffer was ending up in the free list somehow which resulted in a double free when the function exited.
Indeed, I was only able to reproduce the segfault when building in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. You're making a lot of people happy with this change, which is something we've really wanted for a while.
PR jart#924 appears to use `unget()` subtly incorrectly when parsing floating point numbers. The rest of the code only uses `unget()` immediately followed by `goto Done;` to return back the symbol that can't possibly belong to the directive we're processing. With floating-point, however, the ungot characters could very well be valid for the *next* directive, so we will essentially read them twice. It can't be seen in `sscanf()` tests because `unget()` is a no-op there, but the test I added for `fscanf()` fails like this: ... EXPECT_EQ(0xDEAD, i1) need 57005 (or 0xdead) = got 908973 (or 0x000ddead) ... EXPECT_EQ(0xBEEF, i2) need 48879 (or 0xbeef) = got 769775 (or 0x000bbeef) This means we read 0xDDEAD instead of 0xDEAD and 0xBBEEF instead of 0xBEEF. I checked that both musl and glibc read 0xDEAD/0xBEEF, as expected. Fix the failing test by removing the unneeded `unget()` calls.
PR jart#924 appears to use `unget()` subtly incorrectly when parsing floating point numbers. The rest of the code only uses `unget()` immediately followed by `goto Done;` to return back the symbol that can't possibly belong to the directive we're processing. With floating-point, however, the ungot characters could very well be valid for the *next* directive, so we will essentially read them twice. It can't be seen in `sscanf()` tests because `unget()` is a no-op there, but the test I added for `fscanf()` fails like this: ... EXPECT_EQ(0xDEAD, i1) need 57005 (or 0xdead) = got 908973 (or 0x000ddead) ... EXPECT_EQ(0xBEEF, i2) need 48879 (or 0xbeef) = got 769775 (or 0x000bbeef) This means we read 0xDDEAD instead of 0xDEAD and 0xBBEEF instead of 0xBEEF. I checked that both musl and glibc read 0xDEAD/0xBEEF, as expected. Fix the failing test by removing the unneeded `unget()` calls.
* Fix reading the same symbol twice when using `{f,}scanf()` PR #924 appears to use `unget()` subtly incorrectly when parsing floating point numbers. The rest of the code only uses `unget()` immediately followed by `goto Done;` to return back the symbol that can't possibly belong to the directive we're processing. With floating-point, however, the ungot characters could very well be valid for the *next* directive, so we will essentially read them twice. It can't be seen in `sscanf()` tests because `unget()` is a no-op there, but the test I added for `fscanf()` fails like this: ... EXPECT_EQ(0xDEAD, i1) need 57005 (or 0xdead) = got 908973 (or 0x000ddead) ... EXPECT_EQ(0xBEEF, i2) need 48879 (or 0xbeef) = got 769775 (or 0x000bbeef) This means we read 0xDDEAD instead of 0xDEAD and 0xBBEEF instead of 0xBEEF. I checked that both musl and glibc read 0xDEAD/0xBEEF, as expected. Fix the failing test by removing the unneeded `unget()` calls. * Don't read invalid floating-point numbers in `*scanf()` Currently, we just ignore any errors from `strtod()`. They can happen either because no valid float can be parsed at all, or because the state machine recognizes only a prefix of a valid floating-point number. Fix this by making sure `strtod()` parses everything we recognized, provided it's non-empty. This requires to pop the last character off the FP buffer, which is supposed to be parsed by the next `*scanf()` directive. * Make `%c` parsing in `*scanf()` respect the C standard Currently, `%c`-style directives always succeed even if there are actually fewer characters in the input than requested. Before the fix, the added test fails like this: ... EXPECT_EQ(2, sscanf("ab", "%c %c %c", &c2, &c3, &c4)) need 2 (or 0x02 or '\2' or ENOENT) = got 3 (or 0x03 or '\3' or ESRCH) ... EXPECT_EQ(0, sscanf("abcd", "%5c", s2)) need 0 (or 0x0 or '\0') = got 1 (or 0x01 or '\1' or EPERM) musl and glibc pass this test.
I have implemented floating point parsing logic in the
vcscanf
function and consequently all the other functions which call it.There is a small state machine in the code which attempts to recognize the floating point number before passing it to the
strtod
function from thegdtoa
third party library. I attempted to usedouble-conversion
but it's a C++ library.I'm open to review and any feedback. Support for the maximum length specifier is missing, I'll add it when I figure out a better implementation. Meanwhile I thought it was a good idea to contribute this and see what everyone thinks.
Relevant issue: #456