Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Floating point parsing support for scanf family #924

Merged
merged 20 commits into from
Nov 18, 2023

Conversation

matheusmoreira
Copy link
Contributor

I have implemented floating point parsing logic in the vcscanf function and consequently all the other functions which call it.

There is a small state machine in the code which attempts to recognize the floating point number before passing it to the strtod function from the gdtoa third party library. I attempted to use double-conversion but it's a C++ library.

I'm open to review and any feedback. Support for the maximum length specifier is missing, I'll add it when I figure out a better implementation. Meanwhile I thought it was a good idea to contribute this and see what everyone thinks.

Relevant issue: #456

This buffer is currently used only when decoding strings.
Place it at the top level so other code can also use it.
Ensure it is initialized to NULL.
The data it points to will be accessed directly.
Also remove casts which are now unnecessary.
Keeps track of current buffer position.
This is similar to the READ macro.
READs a character and places it in the buffer.
Automatically maintains the NUL terminator.
If the end is reached, reallocate the buffer.
This condition indicates that the floating point parser exited early.
Holds the parsed floating point value.
Handle all the documented format specifiers.
Exit with error if given a length modifier other than "l".
Skip spaces, initialize the buffer and then begin parsing.

References:

https://en.cppreference.com/w/c/io/fscanfî1;129A
The parsing of floating point numbers is quite complex.
Reuse available libraries to accomplish the task.
The vcscanf function attempts to recognize the floating point number
and then uses the strtod function to calculate its numeric value.
It begins by recognizing an optional plus or minus sign, followed by
either hexadecimal notation or NaN and infinity constants.
Then it looks for an optional integer part, a point or comma,
a fractional part, and an optional exponent symbol.
If it finds an exponent, it recognizes the optional sign and digits.

Every time a character is accepted, it is also buffered.
The accumulated buffer is passed to strtod when recognition is finished
and the output is assigned to the output pointer of appropriate length.
This only happens if not in discard mode.
Finally, the buffer is freed and its state is reset.

References:

https://en.cppreference.com/w/c/string/byte/strtof
Essentially an expanded version of the basic test case
posted in GitHub issue jart#456. Also includes examples from documentation.

References:

jart#456
https://en.cppreference.com/w/c/string/byte/strtof
The presence of a pointer in buf could result in double free bugs
when execution reaches the end of the function and it is freed.
@matheusmoreira matheusmoreira marked this pull request as draft November 2, 2023 10:24
Copy link
Owner

@jart jart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sending this.

libc/stdio/vcscanf.c Outdated Show resolved Hide resolved
libc/stdio/vcscanf.c Outdated Show resolved Hide resolved
examples/parsefloat.c Show resolved Hide resolved
Somehow I ended up introducing mixed tabs and spaces into the file.
Makes the goto statement easier to read.
Also consistent with the rest of the codebase.
Test that floating point numbers are correctly parsed from strings.
@matheusmoreira
Copy link
Contributor Author

There seems to be some bug lurking about. Only shows up if I parse several numbers from the same string one after the other. Hunting it down now.

By the way, please let me know if you'd like me to clean up the git history later!

Makes the goto statement easier to read.
Also consistent with the rest of the codebase.
I don't know why I thought this wasn't necessary.
For some reason I thought the do while (c = BUFFER ...) loop
already did it for me. Apparently not!
@matheusmoreira
Copy link
Contributor Author

Alright! All tests pass.

@matheusmoreira matheusmoreira marked this pull request as ready for review November 2, 2023 23:39
@matheusmoreira matheusmoreira requested a review from jart November 2, 2023 23:39
I fixed a bug somewhat related to this. Make sure it doesn't come back.
Place it after the GotFloatingPointNumber label where it belongs.
Copy link
Owner

@jart jart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thank you so much for doing this!

@matheusmoreira
Copy link
Contributor Author

It appears the test segfaulted during the continuous integration build process.

make MODE=tinylinux -j2 o/tinylinux/test/libc/stdio/sscanf_test.com.runs` terminated by SIGSEGV

Any indication as to why? Tests pass on my machine but I did not test the tinylinux mode.

Copy link
Owner

@jart jart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After checking the change out locally and taking a closer look, could you please refactor all usages of the buf variable? free(buf) is crashing later in the program. I think the issue is that buf is supposed to be used for the growing array, but it's also being assigned in two other places for seemingly unrelated purposes.

Create a dedicated buffer for floating point parsing purposes
and restore the original string decoding buffer just as it was.
This fixes the segmentation fault in the tinylinux mode tests.
I suspect the floating point buffer was ending up in the free list
somehow which resulted in a double free when the function exited.
@matheusmoreira
Copy link
Contributor Author

Indeed, I was only able to reproduce the segfault when building in tinylinux mode. Refactored as requested. Created separate buffers for the floating point parser and string decoder. Now tests pass in both build modes.

@matheusmoreira matheusmoreira requested a review from jart November 13, 2023 02:43
Copy link
Owner

@jart jart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. You're making a lot of people happy with this change, which is something we've really wanted for a while.

@jart jart merged commit 3ac473d into jart:master Nov 18, 2023
5 checks passed
G4Vi pushed a commit to G4Vi/cosmopolitan that referenced this pull request Jan 19, 2024
dfyz added a commit to dfyz/cosmopolitan that referenced this pull request Feb 21, 2024
PR jart#924 appears to use `unget()` subtly incorrectly when parsing
floating point numbers. The rest of the code only uses `unget()`
immediately followed by `goto Done;` to return back the symbol that
can't possibly belong to the directive we're processing.

With floating-point, however, the ungot characters could very well
be valid for the *next* directive, so we will essentially read them
twice. It can't be seen in `sscanf()` tests because `unget()` is a
no-op there, but the test I added for `fscanf()` fails like this:

        ...
        EXPECT_EQ(0xDEAD, i1)
                need 57005 (or 0xdead) =
                 got 908973 (or 0x000ddead)
        ...
        EXPECT_EQ(0xBEEF, i2)
                need 48879 (or 0xbeef) =
                 got 769775 (or 0x000bbeef)

This means we read 0xDDEAD instead of 0xDEAD and 0xBBEEF instead of
0xBEEF. I checked that both musl and glibc read 0xDEAD/0xBEEF, as
expected.

Fix the failing test by removing the unneeded `unget()` calls.
dfyz added a commit to dfyz/cosmopolitan that referenced this pull request Feb 23, 2024
PR jart#924 appears to use `unget()` subtly incorrectly when parsing
floating point numbers. The rest of the code only uses `unget()`
immediately followed by `goto Done;` to return back the symbol that
can't possibly belong to the directive we're processing.

With floating-point, however, the ungot characters could very well
be valid for the *next* directive, so we will essentially read them
twice. It can't be seen in `sscanf()` tests because `unget()` is a
no-op there, but the test I added for `fscanf()` fails like this:

        ...
        EXPECT_EQ(0xDEAD, i1)
                need 57005 (or 0xdead) =
                 got 908973 (or 0x000ddead)
        ...
        EXPECT_EQ(0xBEEF, i2)
                need 48879 (or 0xbeef) =
                 got 769775 (or 0x000bbeef)

This means we read 0xDDEAD instead of 0xDEAD and 0xBBEEF instead of
0xBEEF. I checked that both musl and glibc read 0xDEAD/0xBEEF, as
expected.

Fix the failing test by removing the unneeded `unget()` calls.
jart pushed a commit that referenced this pull request Feb 23, 2024
* Fix reading the same symbol twice when using `{f,}scanf()`

PR #924 appears to use `unget()` subtly incorrectly when parsing
floating point numbers. The rest of the code only uses `unget()`
immediately followed by `goto Done;` to return back the symbol that
can't possibly belong to the directive we're processing.

With floating-point, however, the ungot characters could very well
be valid for the *next* directive, so we will essentially read them
twice. It can't be seen in `sscanf()` tests because `unget()` is a
no-op there, but the test I added for `fscanf()` fails like this:

        ...
        EXPECT_EQ(0xDEAD, i1)
                need 57005 (or 0xdead) =
                 got 908973 (or 0x000ddead)
        ...
        EXPECT_EQ(0xBEEF, i2)
                need 48879 (or 0xbeef) =
                 got 769775 (or 0x000bbeef)

This means we read 0xDDEAD instead of 0xDEAD and 0xBBEEF instead of
0xBEEF. I checked that both musl and glibc read 0xDEAD/0xBEEF, as
expected.

Fix the failing test by removing the unneeded `unget()` calls.

* Don't read invalid floating-point numbers in `*scanf()`

Currently, we just ignore any errors from `strtod()`. They can
happen either because no valid float can be parsed at all, or
because the state machine recognizes only a prefix of a valid
floating-point number.

Fix this by making sure `strtod()` parses everything we recognized,
provided it's non-empty. This requires to pop the last character
off the FP buffer, which is supposed to be parsed by the next
`*scanf()` directive.

* Make `%c` parsing in `*scanf()` respect the C standard

Currently, `%c`-style directives always succeed even if there
are actually fewer characters in the input than requested.

Before the fix, the added test fails like this:

        ...
        EXPECT_EQ(2, sscanf("ab", "%c %c %c", &c2, &c3, &c4))
                need 2 (or 0x02 or '\2' or ENOENT) =
                 got 3 (or 0x03 or '\3' or ESRCH)
        ...
        EXPECT_EQ(0, sscanf("abcd", "%5c", s2))
                need 0 (or 0x0 or '\0') =
                 got 1 (or 0x01 or '\1' or EPERM)

musl and glibc pass this test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants