Don't restrict Unicode properties based on target (closes #10)
slevithan committed Dec 21, 2024
1 parent 0067f62 commit 2a57623
Showing 3 changed files with 15 additions and 41 deletions.
29 changes: 14 additions & 15 deletions README.md
@@ -1,4 +1,4 @@
-# Oniguruma-To-ES (鬼車➜ES)
+# Oniguruma-To-ES (鬼車➡️ES)

[![npm version][npm-version-src]][npm-version-href]
[![npm downloads][npm-downloads-src]][npm-downloads-href]
@@ -230,7 +230,7 @@ JavaScript version used for generated regexes. Using `auto` detects the best val
<summary>More details</summary>

- `ES2018`: Uses JS flag `u`.
-- Emulation restrictions: Character class intersection, nested negated character classes, and Unicode properties added after ES2018 are not allowed.
+- Emulation restrictions: Character class intersection and nested negated character classes are not allowed.
- Generated regexes might use ES2018 features that require Node.js 10 or a browser version released during 2018 to 2023 (in Safari's case). Minimum requirement for any regex is Node.js 6 or a 2016-era browser.
- `ES2024`: Uses JS flag `v`.
- No emulation restrictions.
@@ -515,7 +515,7 @@ Notice that nearly every feature below has at least subtle differences from Java
<code>\p{L}</code>,<br>
<code>\P{L}</code>
</td>
-<td align="middle">✅<sup>[1]</sup></td>
+<td align="middle">✅</td>
<td align="middle">✅</td>
<td>
✔ Binary properties<br>
@@ -528,7 +528,7 @@ Notice that nearly every feature below has at least subtle differences from Java
✔ <code>\p</code>, <code>\P</code> without <code>{</code> is an identity escape<br>
✔ Error for key prefixes<br>
✔ Error for props of strings<br>
-❌ Blocks (wontfix<sup>[2]</sup>)<br>
+❌ Blocks (wontfix<sup>[1]</sup>)<br>
</td>
</tr>

@@ -590,7 +590,7 @@ Notice that nearly every feature below has at least subtle differences from Java
<code>[[:word:]]</code>,<br>
<code>[[:^word:]]</code>
</td>
-<td align="middle">☑️<sup>[3]</sup></td>
+<td align="middle">☑️<sup>[2]</sup></td>
<td align="middle">✅</td>
<td>
✔ All use Unicode definitions<br>
@@ -599,7 +599,7 @@ Notice that nearly every feature below has at least subtle differences from Java
<tr valign="top">
<td>Nested class</td>
<td><code>[…[…]]</code></td>
-<td align="middle">☑️<sup>[4]</sup></td>
+<td align="middle">☑️<sup>[3]</sup></td>
<td align="middle">✅</td>
<td>
✔ Same as JS with flag <code>v</code><br>
@@ -800,7 +800,7 @@ Notice that nearly every feature below has at least subtle differences from Java
<td align="middle">☑️</td>
<td align="middle">☑️</td>
<td>
-✔ Error if group to the right<sup>[5]</sup><br>
+✔ Error if group to the right<sup>[4]</sup><br>
✔ Duplicate names (and subroutines) to the right not included in multiplex<br>
✔ Fail to match (or don't include in multiplex) ancestor groups and groups in preceding alternation paths<br>
❌ Some rare cases are indeterminable at compile time and use the JS behavior of matching an empty string<br>
@@ -854,7 +854,7 @@ Notice that nearly every feature below has at least subtle differences from Java
<td align="middle">☑️</td>
<td align="middle">☑️</td>
<td>
-● Has depth limit<sup>[6]</sup><br>
+● Has depth limit<sup>[5]</sup><br>
</td>
</tr>
<tr valign="top">
@@ -867,7 +867,7 @@ Notice that nearly every feature below has at least subtle differences from Java
<td align="middle">☑️</td>
<td align="middle">☑️</td>
<td>
-● Has depth limit<sup>[6]</sup><br>
+● Has depth limit<sup>[5]</sup><br>
</td>
</tr>

@@ -935,12 +935,11 @@ The table above doesn't include all aspects that Oniguruma-To-ES emulates (inclu

### Footnotes

-1. Target `ES2018` doesn't allow using Unicode property names added in JavaScript specifications after ES2018.
-2. Unicode blocks (which in Oniguruma are used with an `In…` prefix) are easily emulatable but their character data would significantly increase library weight. They're also a flawed and arguably unuseful feature, given the ability to use Unicode scripts and other properties.
-3. With target `ES2018`, the specific POSIX classes `[:graph:]` and `[:print:]` use ASCII-based versions rather than the Unicode versions available for target `ES2024` and later, and they result in an error if using strict `accuracy`.
-4. Target `ES2018` doesn't support nested *negated* character classes.
-5. It's not an error for *numbered* backreferences to come before their referenced group in Oniguruma, but an error is the best path for Oniguruma-To-ES because (1) most placements are mistakes and can never match (based on the Oniguruma behavior for backreferences to nonparticipating groups), (2) erroring matches the behavior of named backreferences, and (3) the edge cases where they're matchable rely on rules for backreference resetting within quantified groups that are different in JavaScript and aren't emulatable. Note that it's not a backreference in the first place if using `\10` or higher and not as many capturing groups are defined to the left (it's an octal or identity escape).
-6. The recursion depth limit is specified by option `maxRecursionDepth`. Overlapping recursions and the use of backreferences when the recursed subpattern contains captures aren't yet supported. Patterns that would error in Oniguruma due to triggering infinite recursion might find a match in Oniguruma-To-ES since recursion is bounded (future versions will detect this and error at transpilation time).
+1. Unicode blocks (which in Oniguruma are used with an `In…` prefix) are easily emulatable but their character data would significantly increase library weight. They're also a flawed and arguably unuseful feature, given the ability to use Unicode scripts and other properties.
+2. With target `ES2018`, the specific POSIX classes `[:graph:]` and `[:print:]` use ASCII-based versions rather than the Unicode versions available for target `ES2024` and later, and they result in an error if using strict `accuracy`.
+3. Target `ES2018` doesn't support nested *negated* character classes.
+4. It's not an error for *numbered* backreferences to come before their referenced group in Oniguruma, but an error is the best path for Oniguruma-To-ES because (1) most placements are mistakes and can never match (based on the Oniguruma behavior for backreferences to nonparticipating groups), (2) erroring matches the behavior of named backreferences, and (3) the edge cases where they're matchable rely on rules for backreference resetting within quantified groups that are different in JavaScript and aren't emulatable. Note that it's not a backreference in the first place if using `\10` or higher and not as many capturing groups are defined to the left (it's an octal or identity escape).
+5. The recursion depth limit is specified by option `maxRecursionDepth`. Overlapping recursions and the use of backreferences when the recursed subpattern contains captures aren't yet supported. Patterns that would error in Oniguruma due to triggering infinite recursion might find a match in Oniguruma-To-ES since recursion is bounded (future versions will detect this and error at transpilation time).

## ❌ Unsupported features

6 changes: 1 addition & 5 deletions src/generate.js
@@ -1,7 +1,7 @@
import {getOptions} from './options.js';
import {AstAssertionKinds, AstCharacterSetKinds, AstTypes} from './parse.js';
import {traverse} from './traverse.js';
-import {getIgnoreCaseMatchChars, JsUnicodePropertiesPostEs2018, UnicodePropertiesWithSpecificCase} from './unicode.js';
+import {getIgnoreCaseMatchChars, UnicodePropertiesWithSpecificCase} from './unicode.js';
import {cp, getNewCurrentFlags, isMinTarget, r} from './utils.js';
import {isLookaround} from './utils-node.js';

@@ -72,7 +72,6 @@ function generate(ast, options) {
useDuplicateNames: minTargetEs2025,
useFlagMods: minTargetEs2025,
useFlagV: minTargetEs2024,
-usePostEs2018Properties: minTargetEs2024,
verbose: opts.verbose,
};
function gen(node) {
@@ -349,9 +348,6 @@ function genCharacterSet({kind, negate, value, key}, state) {
return negate ? r`\D` : r`\d`;
}
if (kind === AstCharacterSetKinds.property) {
-if (!state.usePostEs2018Properties && JsUnicodePropertiesPostEs2018.has(value)) {
-throw new Error(`Unicode property "${value}" unavailable in target ES2018`);
-}
if (
state.useAppliedIgnoreCase &&
state.currentFlags.ignoreCase &&
21 changes: 0 additions & 21 deletions src/unicode.js
@@ -159,26 +159,6 @@ for (const p of JsUnicodePropertiesOfStrings) {
JsUnicodePropertiesOfStringsMap.set(slug(p), p);
}

-// Unicode scripts and binary properties (and their aliases) added after ES2018
-// See <github.com/eslint-community/regexpp/blob/main/src/unicode/properties.ts>
-const JsUnicodePropertiesPostEs2018 = new Set((
-// ES2019 scripts
-'Dogr Dogra Gong Gunjala_Gondi Hanifi_Rohingya Maka Makasar Medefaidrin Medf Old_Sogdian Rohg Sogd Sogdian Sogo' +
-// ES2019 binary properties
-' Extended_Pictographic' +
-// ES2020 scripts
-' Elym Elymaic Hmnp Nand Nandinagari Nyiakeng_Puachue_Hmong Wancho Wcho' +
-// ES2021 scripts
-' Chorasmian Chrs Diak Dives_Akuru Khitan_Small_Script Kits Yezi Yezidi' +
-// ES2021 binary properties
-' EBase EComp EMod EPres ExtPict' +
-// ES2022 scripts
-' Cpmn Cypro_Minoan Old_Uyghur Ougr Tangsa Tnsa Toto Vith Vithkuqi' +
-// ES2023 scripts
-' Gara Garay Gukh Gurung_Khema Hrkt Katakana_Or_Hiragana Kawi Kirat_Rai Krai Nag_Mundari Nagm Ol_Onal Onao Sunu Sunuwar Todhri Todr Tulu_Tigalari Tutg Unknown Zzzz'
-// ES2024: None, but added `JsUnicodePropertiesOfStrings`
-).split(' '));
-
const LowerToAlternativeLowerCaseMap = new Map([
['s', cp(0x17F)], // s, ſ
[cp(0x17F), 's'], // ſ, s
@@ -291,7 +271,6 @@ export {
JsUnicodeProperties,
JsUnicodePropertiesMap,
JsUnicodePropertiesOfStringsMap,
-JsUnicodePropertiesPostEs2018,
PosixClassesMap,
PosixProperties,
slug,
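With the target-based check removed, whether a given Unicode property name works appears to depend on the JavaScript engine that runs the generated regex rather than on the `target` option. The commit does not say how callers should detect support, so the following is only a minimal sketch of one possible runtime check. The helper name `supportsUnicodeProperty` is hypothetical and is not part of Oniguruma-To-ES; it relies only on the standard behavior that `new RegExp` with flag `u` throws a `SyntaxError` for an unrecognized property name.

```javascript
// Hypothetical helper (not part of Oniguruma-To-ES): reports whether the
// current JavaScript engine accepts a Unicode property name with flag `u`.
function supportsUnicodeProperty(name) {
  // Binary properties and general categories use the lone form `\p{Name}`.
  // Script names need an explicit key, so `\p{Script=Name}` is tried as well.
  const candidates = [`\\p{${name}}`, `\\p{Script=${name}}`];
  return candidates.some((source) => {
    try {
      new RegExp(source, 'u');
      return true;
    } catch (err) {
      // An unknown property name or value is a SyntaxError at construction time
      if (err instanceof SyntaxError) return false;
      throw err;
    }
  });
}

// Example: check one of the script names from the removed set before relying on it
const kawiAvailable = supportsUnicodeProperty('Kawi');
```

The result for newer names such as `Kawi` varies by engine version, which is the reason for checking at runtime in this sketch instead of keeping a hardcoded list per ECMAScript edition.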
