Floating point parsing support for scanf family #924

matheusmoreira · 2023-11-02T10:00:35Z

I have implemented floating point parsing logic in the vcscanf function and consequently all the other functions which call it.

There is a small state machine in the code which attempts to recognize the floating point number before passing it to the strtod function from the gdtoa third party library. I attempted to use double-conversion but it's a C++ library.

I'm open to review and any feedback. Support for the maximum length specifier is missing, I'll add it when I figure out a better implementation. Meanwhile I thought it was a good idea to contribute this and see what everyone thinks.

Relevant issue: #456

This buffer is currently used only when decoding strings. Place it at the top level so other code can also use it. Ensure it is initialized to NULL.

The data it points to will be accessed directly. Also remove casts which are now unnecessary.

Keeps track of current buffer position.

This is similar to the READ macro. READs a character and places it in the buffer. Automatically maintains the NUL terminator. If the end is reached, reallocate the buffer.

This condition indicates that the floating point parser exited early.

Holds the parsed floating point value.

Handle all the documented format specifiers. Exit with error if given a length modifier other than "l". Skip spaces, initialize the buffer and then begin parsing. References: https://en.cppreference.com/w/c/io/fscanfî1;129A

The parsing of floating point numbers is quite complex. Reuse available libraries to accomplish the task.

The vcscanf function attempts to recognize the floating point number and then uses the strtod function to calculate its numeric value. It begins by recognizing an optional plus or minus sign, followed by either hexadecimal notation or NaN and infinity constants. Then it looks for an optional integer part, a point or comma, a fractional part, and an optional exponent symbol. If it finds an exponent, it recognizes the optional sign and digits. Every time a character is accepted, it is also buffered. The accumulated buffer is passed to strtod when recognition is finished and the output is assigned to the output pointer of appropriate length. This only happens if not in discard mode. Finally, the buffer is freed and its state is reset. References: https://en.cppreference.com/w/c/string/byte/strtof

Essentially an expanded version of the basic test case posted in GitHub issue jart#456. Also includes examples from documentation. References: jart#456 https://en.cppreference.com/w/c/string/byte/strtof

The presence of a pointer in buf could result in double free bugs when execution reaches the end of the function and it is freed.

jart

Thanks for sending this.

libc/stdio/vcscanf.c

examples/parsefloat.c

Somehow I ended up introducing mixed tabs and spaces into the file.

Makes the goto statement easier to read. Also consistent with the rest of the codebase.

Test that floating point numbers are correctly parsed from strings.

matheusmoreira · 2023-11-02T22:32:51Z

There seems to be some bug lurking about. Only shows up if I parse several numbers from the same string one after the other. Hunting it down now.

By the way, please let me know if you'd like me to clean up the git history later!

Makes the goto statement easier to read. Also consistent with the rest of the codebase.

I don't know why I thought this wasn't necessary.

For some reason I thought the do while (c = BUFFER ...) loop already did it for me. Apparently not!

matheusmoreira · 2023-11-02T23:39:29Z

Alright! All tests pass.

I fixed a bug somewhat related to this. Make sure it doesn't come back.

Place it after the GotFloatingPointNumber label where it belongs.

jart

Looks good to me. Thank you so much for doing this!

matheusmoreira · 2023-11-05T06:11:48Z

It appears the test segfaulted during the continuous integration build process.

make MODE=tinylinux -j2 o/tinylinux/test/libc/stdio/sscanf_test.com.runs` terminated by SIGSEGV

Any indication as to why? Tests pass on my machine but I did not test the tinylinux mode.

jart

After checking the change out locally and taking a closer look, could you please refactor all usages of the buf variable? free(buf) is crashing later in the program. I think the issue is that buf is supposed to be used for the growing array, but it's also being assigned in two other places for seemingly unrelated purposes.

Create a dedicated buffer for floating point parsing purposes and restore the original string decoding buffer just as it was. This fixes the segmentation fault in the tinylinux mode tests. I suspect the floating point buffer was ending up in the free list somehow which resulted in a double free when the function exited.

matheusmoreira · 2023-11-13T02:43:09Z

Indeed, I was only able to reproduce the segfault when building in tinylinux mode. Refactored as requested. Created separate buffers for the floating point parser and string decoder. Now tests pass in both build modes.

jart

Looks good. You're making a lot of people happy with this change, which is something we've really wanted for a while.

PR jart#924 appears to use `unget()` subtly incorrectly when parsing floating point numbers. The rest of the code only uses `unget()` immediately followed by `goto Done;` to return back the symbol that can't possibly belong to the directive we're processing. With floating-point, however, the ungot characters could very well be valid for the *next* directive, so we will essentially read them twice. It can't be seen in `sscanf()` tests because `unget()` is a no-op there, but the test I added for `fscanf()` fails like this: ... EXPECT_EQ(0xDEAD, i1) need 57005 (or 0xdead) = got 908973 (or 0x000ddead) ... EXPECT_EQ(0xBEEF, i2) need 48879 (or 0xbeef) = got 769775 (or 0x000bbeef) This means we read 0xDDEAD instead of 0xDEAD and 0xBBEEF instead of 0xBEEF. I checked that both musl and glibc read 0xDEAD/0xBEEF, as expected. Fix the failing test by removing the unneeded `unget()` calls.

* Fix reading the same symbol twice when using `{f,}scanf()` PR #924 appears to use `unget()` subtly incorrectly when parsing floating point numbers. The rest of the code only uses `unget()` immediately followed by `goto Done;` to return back the symbol that can't possibly belong to the directive we're processing. With floating-point, however, the ungot characters could very well be valid for the *next* directive, so we will essentially read them twice. It can't be seen in `sscanf()` tests because `unget()` is a no-op there, but the test I added for `fscanf()` fails like this: ... EXPECT_EQ(0xDEAD, i1) need 57005 (or 0xdead) = got 908973 (or 0x000ddead) ... EXPECT_EQ(0xBEEF, i2) need 48879 (or 0xbeef) = got 769775 (or 0x000bbeef) This means we read 0xDDEAD instead of 0xDEAD and 0xBBEEF instead of 0xBEEF. I checked that both musl and glibc read 0xDEAD/0xBEEF, as expected. Fix the failing test by removing the unneeded `unget()` calls. * Don't read invalid floating-point numbers in `*scanf()` Currently, we just ignore any errors from `strtod()`. They can happen either because no valid float can be parsed at all, or because the state machine recognizes only a prefix of a valid floating-point number. Fix this by making sure `strtod()` parses everything we recognized, provided it's non-empty. This requires to pop the last character off the FP buffer, which is supposed to be parsed by the next `*scanf()` directive. * Make `%c` parsing in `*scanf()` respect the C standard Currently, `%c`-style directives always succeed even if there are actually fewer characters in the input than requested. Before the fix, the added test fails like this: ... EXPECT_EQ(2, sscanf("ab", "%c %c %c", &c2, &c3, &c4)) need 2 (or 0x02 or '\2' or ENOENT) = got 3 (or 0x03 or '\3' or ESRCH) ... EXPECT_EQ(0, sscanf("abcd", "%5c", s2)) need 0 (or 0x0 or '\0') = got 1 (or 0x01 or '\1' or EPERM) musl and glibc pass this test.

matheusmoreira added 11 commits November 2, 2023 06:01

vcscanf: move buffer variables to top of function

64c883a

This buffer is currently used only when decoding strings. Place it at the top level so other code can also use it. Ensure it is initialized to NULL.

vcscanf: change buf type to unsigned char *

3403b6c

The data it points to will be accessed directly. Also remove casts which are now unnecessary.

vcscanf: add buffer cursor variable

44cb305

Keeps track of current buffer position.

vcscanf: define BUFFER macro

c161713

This is similar to the READ macro. READs a character and places it in the buffer. Automatically maintains the NUL terminator. If the end is reached, reallocate the buffer.

vcscanf: free buf before returning if not NULL

0e7c6aa

This condition indicates that the floating point parser exited early.

vcscanf: define floating point result variable

c8e941b

Holds the parsed floating point value.

vcscanf: add floating point number scanner case

ec2b272

Handle all the documented format specifiers. Exit with error if given a length modifier other than "l". Skip spaces, initialize the buffer and then begin parsing. References: https://en.cppreference.com/w/c/io/fscanfî1;129A

vcscanf: include the gdtoa library

a2225a2

The parsing of floating point numbers is quite complex. Reuse available libraries to accomplish the task.

examples: add a parsefloat example

758cdb7

Essentially an expanded version of the basic test case posted in GitHub issue jart#456. Also includes examples from documentation. References: jart#456 https://en.cppreference.com/w/c/string/byte/strtof

vcscanf: ensure buf is NULL after string decoding

90f19ed

The presence of a pointer in buf could result in double free bugs when execution reaches the end of the function and it is freed.

matheusmoreira marked this pull request as draft November 2, 2023 10:24

jart requested changes Nov 2, 2023

View reviewed changes

libc/stdio/vcscanf.c Outdated Show resolved Hide resolved

libc/stdio/vcscanf.c Outdated Show resolved Hide resolved

examples/parsefloat.c Show resolved Hide resolved

matheusmoreira added 3 commits November 2, 2023 08:00

vcscanf: fix formatting errors

bb0d4c2

Somehow I ended up introducing mixed tabs and spaces into the file.

vcscanf: add curly braces around else clause

cc79f62

Makes the goto statement easier to read. Also consistent with the rest of the codebase.

test: add sscanf floating point parser tests

c38aec9

Test that floating point numbers are correctly parsed from strings.

matheusmoreira added 3 commits November 2, 2023 20:05

vcscanf: add curly braces around another else

ded97bf

Makes the goto statement easier to read. Also consistent with the rest of the codebase.

vcscanf: buffer the last 'y' of "infinity"

c78ae96

I don't know why I thought this wasn't necessary.

vcscanf: buffer the ')' of "nan(whatever)"

2daad6e

For some reason I thought the do while (c = BUFFER ...) loop already did it for me. Apparently not!

matheusmoreira marked this pull request as ready for review November 2, 2023 23:39

matheusmoreira requested a review from jart November 2, 2023 23:39

matheusmoreira added 2 commits November 2, 2023 20:47

test: parse more floats at once

64d2b12

I fixed a bug somewhat related to this. Make sure it doesn't come back.

vcscanf: remove strtod call repetition

8f12a06

Place it after the GotFloatingPointNumber label where it belongs.

jart approved these changes Nov 5, 2023

View reviewed changes

jart requested changes Nov 5, 2023

View reviewed changes

matheusmoreira requested a review from jart November 13, 2023 02:43

jart approved these changes Nov 18, 2023

View reviewed changes

jart merged commit 3ac473d into jart:master Nov 18, 2023
5 checks passed

G4Vi pushed a commit to G4Vi/cosmopolitan that referenced this pull request Jan 19, 2024

Floating point parsing support for scanf family (jart#924)

9345214

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Floating point parsing support for scanf family #924

Floating point parsing support for scanf family #924

matheusmoreira commented Nov 2, 2023

jart left a comment

matheusmoreira commented Nov 2, 2023

matheusmoreira commented Nov 2, 2023

jart left a comment

matheusmoreira commented Nov 5, 2023

jart left a comment

matheusmoreira commented Nov 13, 2023

jart left a comment

Floating point parsing support for scanf family #924

Floating point parsing support for scanf family #924

Conversation

matheusmoreira commented Nov 2, 2023

jart left a comment

Choose a reason for hiding this comment

matheusmoreira commented Nov 2, 2023

matheusmoreira commented Nov 2, 2023

jart left a comment

Choose a reason for hiding this comment

matheusmoreira commented Nov 5, 2023

jart left a comment

Choose a reason for hiding this comment

matheusmoreira commented Nov 13, 2023

jart left a comment

Choose a reason for hiding this comment