Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make E.String always ascii or utf-8, never utf-16 #15163

Draft
wants to merge 73 commits into
base: main
Choose a base branch
from

Conversation

pfgithub
Copy link
Contributor

@pfgithub pfgithub commented Nov 15, 2024

What does this PR do?

  • Fixes a bug where ("0" + "1") == 0 was true
  • Supports folding "\x00" + "1" to "\x001"
  • Supports folding +"§" to NaN
  • Supports folding unicode strings:
    • "æ" + "™" to "æ™"
    • "\uD800" + "\uDF34" to "𐌴".
  • Fixes invalid utf-8 in a string did not correctly convert to the replacement character
  • Fixes Syntax Error in String.raw #14213
  • Fixes let abc\u0070 = 5 would define a variable called "p" rather than "abcp"
  • Disallows `\8` because numeric escapes are not allowed in untagged template literals
  • Fixes a bug where the utf-8 sequence (4)(cont)(cont)(cont)(ascii) would parse incorrectly, eg
    • 0b11110010 0b10101100 0b10110111 0b00101011 would parse as "\u{ACDEB}" rather than "���+" (or "�+")

Reworks E.String

Previously:

  • For javascript, E.String was ascii or wtf-16, and constant folding "a"[0] and "a".charCodeAt(0) depended on not having utf-8.
  • For JSON, E.String was always utf-8 with no rope segments. Lots of code accessed string.data directly.
  • At least one call site in javascript forgot to collapse the rope string before using it (why ("0" + "1") == 0 gave the wrong answer)

Now:

  • In javascript, E.String is wtf-8 or a wtf-8 rope string.
  • In json, E.String is always wtf-8 and never a rope string
  • It is now more difficult to access an uncollapsed rope segment by accident (.e_string.data is now .e_string._value_or_segment_value)

TODO:

benchmark is in path ~/Dev/Node/temp/generated/fdae05571e21265e7f76681435cbed48/tmp

performance:

  • must remove strings.isValidUTF8 check. this can be done by making unicode special and then after seeing a unicode character once, next
    itertation it doesn't have to be special anymore and we can go back to strings.isValidUTF8
  • text_.len < 64 doesn't have a significant effect
  • try arraylist.toOwnedSlice() over allocator.dupe() in parseStringCharacters

code:

  • remove fn decodeEscapeSequences

correctness:

  • fix is_all_ascii is only set if a non-ascii byte is found but it also needs to be set if a non-acii codepoint is added in escape parsing
  • implement the looser string parsing for tagged template literals
  • add more tests
  • compare benchnark bundle output before and after string parsing changes, make sure it is identical
  • make sure json parsing works right and disallows things it shouldn't allow
  • run parser tests (https://github.com/tc39/test262-parser-tests) and add them to automated testing
    • pass, todo, or skip all tests
    • use test.skip rather than not adding the test for skip tests
    • add so that CI runs it
    • make sure CI runs with --todo flag
  • run json tests (https://github.com/nst/JSONTestSuite). re-emit and run again and make sure the result is identical.
    • pass, todo, or skip all tests
    • add so that CI runs it
    • make sure CI runs with --todo flag
  • eliminate lexer.peek(); it's used exactly once and implemented wrong
  • nextCodepointSlice() doesn't validate the codepoint - an invalid codepoint should return a 1-len slice rather than eg 3-len (ie c3 c3 should nextCodepointSlice to [c3] not [c3, c3]`)
  • Should octal escapes in strings be allowed? In 1.1.34 they half-worked (\32 syntax error but \324 ok)
  • make sure bundler_string_2.test.ts runs with --todo flag
  • js_printer.zig writePreQuotedString \u{{{d}}} -> \u{{{X}}}
  • remove "decodeWTF8RuneT", "wtf8ByteSequenceLength", "codepointSize"
  • audit "NewCodePointIterator", "CodepointIterator", "UnsignedCodepointIterator" - these have been changed to not stop on a null byte, make sure the call sites expect that. consider removing these iterators and replacing them with a much simpler one.
    • If keeping, nextCodepointSlice has an incorrect implementation, nextCodepoint asserts utf-8 (not wtf-8), peek is wrong, and the api seems overcomplicated?
  • compare "PackedCodepointIterator" vs regular codepoint iterator
  • remove "std.unicode."
  • compare strings.wtf8Sequence(), strings.encodeWTF8RuneT(), strings.unicode.encodeWtf8WithInvalid() and deduplicate
  • identify and fix performance regression in 7dae15b and 65a8dcc
  • See if there were any bugs around U+FFFD handling in previous bun versions and if they have been fixed (there is probably a case of ? -> ???)
    • Maybe in toml
  • See if there were any bugs around null byte handling in previous bun versions (there is probably a case of null -> ?)
  • Add a test for 0b11110010 0b10101100 0b10110111 0b00101011
  • Compare performance on windows because it is probably worse due to the added assertions
  • Pass tests
    • [ ] esbuild/default.test.ts, specifically default/RequireShimSubstitutionBrowser
  • E.String size optimizations:
    • Make sure the union tag and is_from_template_string can share the same 8 bytes, rather than each taking 8 bytes
  • Check if this fixes a bug:
    import {$} from "bun";
    const enc = t => new TextEncoder().encode(t);
    
    await Bun.write("./emitted.js", new Uint8Array([
    	...enc("console.log('"),
    	0, 0b1111_0001, 0b10_000000,
    	...enc("');\n\n\n\nthrow new Error('here')\n"),
    ]));
    
    await $`${Bun.argv[0]} ./emitted.js`; // before: (lines stop getting split after null byte)
    
  • esbuild/default.test.ts RequireShimSubstitutionBrowser seems to be broken? In node 23, it outputs default and module.exports. But shouldn't it be converting the esm import() to require()?
  • pass tests:
    • test262-parser-tests.with_todo.test.ts
    • JSONTestSuite.with_todo.test.ts
    • bundler_string_2.with_todo.test.ts on windows

@robobun
Copy link

robobun commented Nov 15, 2024

Updated 5:07 PM PT - Jan 9th, 2025

@pfgithub, your commit 344c9ea has 1 failures in #9333:


🧪   try this PR locally:

bunx bun-pr 15163

@pfgithub pfgithub marked this pull request as ready for review November 18, 2024 20:57
@pfgithub pfgithub marked this pull request as draft November 22, 2024 05:32
…es decodeWtf8Rune, removes CodepointIterator.needsUTF8Decoding)
…g because it no longer matters & rename 'asWtf8JSON' to 'asWtf8AssertNotRope'
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Syntax Error in String.raw
2 participants