Add rules.captureName option
slevithan committed Dec 19, 2024
1 parent 37699d9 commit 4f116aa
Showing 13 changed files with 334 additions and 145 deletions.
42 changes: 32 additions & 10 deletions README.md
@@ -81,6 +81,7 @@ type OnigurumaToEsOptions = {
allowOrphanBackrefs?: boolean;
allowUnhandledGAnchors?: boolean;
asciiWordBoundaries?: boolean;
captureGroup?: boolean;
};
target?: 'auto' | 'ES2025' | 'ES2024' | 'ES2018';
verbose?: boolean;
@@ -117,6 +118,9 @@ function toOnigurumaAst(
pattern: string,
options?: {
flags?: string;
rules?: {
captureGroup?: boolean;
};
}
): OnigurumaAst;
```
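The `rules.captureGroup` field added to the signatures above corresponds to Oniguruma's `ONIG_OPTION_CAPTURE_GROUP`. As a rough illustration of what the option changes, the sketch below lists which groups capture in a pattern. It rests on the assumption that, without the option, Oniguruma stops capturing unnamed groups once a pattern contains a named group. `listCapturingGroups` is a hypothetical helper written for this example and is not part of Oniguruma-To-ES; it only understands escapes, non-nested character classes, and the `(?<name>…)` form of named groups.

```javascript
// Hypothetical helper for illustration only; not an Oniguruma-To-ES API.
// Simplified scan: handles escapes, non-nested character classes, and `(?<name>…)`.
function listCapturingGroups(pattern, {captureGroup = false} = {}) {
  const groups = [];
  let inClass = false;
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern[i];
    if (ch === '\\') {
      i++; // Skip the escaped character
    } else if (inClass) {
      if (ch === ']') inClass = false;
    } else if (ch === '[') {
      inClass = true;
    } else if (ch === '(') {
      const named = /^\(\?<([A-Za-z_]\w*)>/.exec(pattern.slice(i));
      if (named) {
        groups.push({name: named[1]});
      } else if (pattern[i + 1] !== '?') {
        groups.push({name: null}); // Plain `(…)` group
      }
    }
  }
  const hasNamed = groups.some(g => g.name !== null);
  // Assumed default rule: unnamed groups don't capture when a named group is present
  return hasNamed && !captureGroup ? groups.filter(g => g.name !== null) : groups;
}

const pattern = '(a)(?<n>b)(c)';
const label = g => g.name ?? '(unnamed)';
const byDefault = listCapturingGroups(pattern).map(label);
const withOption = listCapturingGroups(pattern, {captureGroup: true}).map(label);
console.log(byDefault); // ['n']
console.log(withOption); // ['(unnamed)', 'n', '(unnamed)']
```

With the option on, the unnamed groups keep their capture numbers, which is what makes numbered backreferences and numbered calls meaningful alongside named groups.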
@@ -210,7 +214,8 @@ Advanced pattern options that override standard error checking and flags when en
- `allowOrphanBackrefs`: Useful with TextMate grammars that merge backreferences across patterns.
- `allowUnhandledGAnchors`: Applies flag `y` for unsupported uses of `\G`, rather than erroring.
- Oniguruma-To-ES uses a variety of strategies to accurately emulate many common uses of `\G`. When using this option, if a `\G` is found that doesn't have a known emulation strategy, the `\G` is simply removed and JavaScript's `y` (`sticky`) flag is added. This might lead to some false positives and negatives, but is useful for non-critical matching (like syntax highlighting) when having some mismatches is better than not working.
- `asciiWordBoundaries`: Use ASCII-based `\b` and `\B`, which increases performance.
- `asciiWordBoundaries`: Use ASCII-based `\b` and `\B`, which increases search performance of generated regexes.
- `captureGroup`: Emulates Oniguruma option `ONIG_OPTION_CAPTURE_GROUP`, which allows unnamed captures and numbered calls in patterns that also use named capture.
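The `allowUnhandledGAnchors` fallback described above relies on JavaScript's `y` (sticky) flag. The snippet below uses only built-in `RegExp` behavior, with no Oniguruma-To-ES calls, to show how a sticky regex attempts a match only at `lastIndex`. That is the mechanism standing in for `\G`, and it is also why results can differ from true `\G` semantics in some patterns.

```javascript
// Built-in JavaScript behavior only; the patterns and string are arbitrary examples.
const sticky = /b/y;
const plain = /b/g;
const str = 'ab';

sticky.lastIndex = 0;
const stickyAtStart = sticky.test(str); // false: 'b' is not at index 0

plain.lastIndex = 0;
const plainFromStart = plain.test(str); // true: a non-sticky search scans forward

sticky.lastIndex = 1;
const stickyAtOne = sticky.test(str); // true: 'b' is exactly at index 1

console.log(stickyAtStart, plainFromStart, stickyAtOne); // false true true
```

Because `test` and `exec` update `lastIndex` on regexes with flag `g` or `y`, callers that reuse such a regex typically reset `lastIndex` before each independent search.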

### `target`

@@ -616,7 +621,7 @@ Notice that nearly every feature below has at least subtle differences from Java
<td>
✔ Always "multiline"<br>
✔ Only <code>\n</code> as newline<br>
No match after string-terminating <code>\n</code><br>
<code>^</code> doesn't match after string-terminating <code>\n</code><br>
</td>
</tr>
<tr valign="top">
@@ -911,6 +916,17 @@ Notice that nearly every feature below has at least subtle differences from Java
✔ Error<br>
</td>
</tr>

<tr valign="top">
<th align="left" rowspan="1">Compile-time options</th>
<td colspan="2"><code>ONIG_OPTION_CAPTURE_GROUP</code></td>
<td align="middle">✅</td>
<td align="middle">✅</td>
<td>
✔ Unnamed captures and numbered calls allowed when using named capture<br>
✔ Allows numbered subroutine refs to duplicate group names<br>
</td>
</tr>
</table>

The table above doesn't include all aspects that Oniguruma-To-ES emulates (including error handling, most aspects that work the same as in JavaScript, and many aspects of non-JavaScript features that work the same in the other regex flavors that support them).
@@ -928,14 +944,20 @@ The table above doesn't include all aspects that Oniguruma-To-ES emulates (inclu

The following don't yet have any support, and throw errors. They're all infrequently-used features, with most being *extremely* rare.

- Grapheme boundaries: `\y`, `\Y`.
- Flags `P` (POSIX is ASCII) and `y{g}`/`y{w}` (grapheme boundary modes).
- Whole-pattern modifiers: Don't capture `(?C)`, ignore-case is ASCII `(?I)`, find longest `(?L)`.
- Absence functions: `(?~…)`, etc.
- Conditionals: `(?(…)…)`, etc.
- Rarely-used character specifiers: Non-A-Za-z with `\cx`, `\C-x`; meta `\M-x`, `\M-\C-x`; bracketed octals `\o{…}`; octal UTF-8 encoded bytes (≥ `\200`).
- Code point sequences: `\x{H H …}`, `\o{O O …}`.
- Callout functions: `(?{…})`, etc.
- Supportable:
- Grapheme boundaries: `\y`, `\Y`.
- Flags `P` (POSIX is ASCII) and `y{g}`/`y{w}` (grapheme boundary modes).
- Rarely-used character specifiers: Non-A-Za-z with `\cx`, `\C-x`; meta `\M-x`, `\M-\C-x`; bracketed octals `\o{…}`; octal UTF-8 encoded bytes (≥ `\200`).
- Code point sequences: `\x{H H …}`, `\o{O O …}`.
- Whole-pattern modifiers: Don't capture `(?C)`, ignore-case is ASCII `(?I)`.
- Supportable for some uses:
- Absence functions: `(?~…)`, etc.
- Conditionals: `(?(…)…)`, etc.
- Whole-pattern modifiers: Find longest `(?L)`.
- Not supportable:
- Callout functions: `(?{…})`, etc.

Despite the current omissions, Oniguruma-To-ES handles more than 99.9% of real-world Oniguruma regexes, based on patterns used in a large [collection](https://github.com/shikijs/textmate-grammars-themes/tree/main/packages/tm-grammars/grammars) of TextMate grammars.

## ㊗️ Unicode / mixed case-sensitivity

6 changes: 3 additions & 3 deletions demo/demo.css
@@ -162,12 +162,12 @@ pre, code, kbd, textarea {
border-radius: 0.375em;
}

#more-options {
#more-options-cols {
display: flex;
}

#more-options div {
margin-right: 3%;
#more-options-cols div {
margin-right: 5%;
}

#output, textarea {
1 change: 1 addition & 0 deletions demo/demo.js
@@ -24,6 +24,7 @@ const state = {
allowOrphanBackrefs: getValue('option-allowOrphanBackrefs'),
allowUnhandledGAnchors: getValue('option-allowUnhandledGAnchors'),
asciiWordBoundaries: getValue('option-asciiWordBoundaries'),
captureGroup: getValue('option-captureGroup'),
},
target: getValue('option-target'),
verbose: getValue('option-verbose'),
121 changes: 66 additions & 55 deletions demo/index.html
@@ -74,31 +74,74 @@ <h2>Try it</h2>
</p>
<details>
<summary>More options</summary>
<section id="more-options">
<div>
<p>
<label>
<input type="checkbox" id="option-global" onchange="setOption('global', this.checked)">
<code>global</code>
<span class="tip tip-md">Add JS flag <kbd>g</kbd> to result</span>
</label>
</p>
<p>
<label>
<input type="checkbox" id="option-hasIndices" onchange="setOption('hasIndices', this.checked)">
<code>hasIndices</code>
<span class="tip tip-md">Add JS flag <kbd>d</kbd> to result</span>
</label>
</p>
<section>
<div id="more-options-cols">
<div>
<p>
<label>
<input type="checkbox" id="option-global" onchange="setOption('global', this.checked)">
<code>global</code>
<span class="tip tip-md">Add JS flag <kbd>g</kbd> to result</span>
</label>
</p>
<p>
<label>
<input type="checkbox" id="option-hasIndices" onchange="setOption('hasIndices', this.checked)">
<code>hasIndices</code>
<span class="tip tip-md">Add JS flag <kbd>d</kbd> to result</span>
</label>
</p>
</div>
<div>
<p>
<label>
<input type="checkbox" id="option-avoidSubclass" onchange="setOption('avoidSubclass', this.checked)">
<code>avoidSubclass</code>
<span class="tip tip-lg">Disables advanced emulation that relies on a <code>RegExp</code> subclass</span>
</label>
</p>
<p>
<label>
<input type="checkbox" id="option-verbose" onchange="setOption('verbose', this.checked)">
<code>verbose</code>
<span class="tip tip-lg">Disables optimizations that simplify the pattern without changing the meaning</span>
</label>
</p>
</div>
<div>
<p>
<label>
<input type="checkbox" id="option-allowOrphanBackrefs" onchange="setRule('allowOrphanBackrefs', this.checked)">
<code>allowOrphanBackrefs</code>
<span class="tip tip-xl">Useful with TextMate grammars that merge backrefs across <code>begin</code> and <code>end</code> patterns</span>
</label>
</p>
<p>
<label>
<input type="checkbox" id="option-allowUnhandledGAnchors" onchange="setRule('allowUnhandledGAnchors', this.checked)">
<code>allowUnhandledGAnchors</code>
<span class="tip tip-xl">Applies flag <code>y</code> for unsupported uses of <code>\G</code>, rather than erroring</span>
</label>
</p>
</div>
<div>
<p>
<label>
<input type="checkbox" id="option-asciiWordBoundaries" onchange="setRule('asciiWordBoundaries', this.checked)">
<code>asciiWordBoundaries</code>
<span class="tip tip-lg">Use ASCII-based <code>\b</code> and <code>\B</code></span>
</label>
</p>
<p>
<label>
<input type="checkbox" id="option-captureGroup" onchange="setRule('captureGroup', this.checked)">
<code>captureGroup</code>
<span class="tip tip-xl">Unnamed captures and numbered calls allowed when using named capture</span>
</label>
</p>
</div>
</div>
<div>
<p>
<label>
<input type="checkbox" id="option-avoidSubclass" onchange="setOption('avoidSubclass', this.checked)">
<code>avoidSubclass</code>
<span class="tip tip-lg">Disables advanced emulation that relies on a <code>RegExp</code> subclass</span>
</label>
</p>
<p>
<label>
<input type="number" id="option-maxRecursionDepth" value="5" min="2" max="100" onchange="setOption('maxRecursionDepth', this.value)" onkeyup="setOption('maxRecursionDepth', this.value)">
@@ -107,38 +150,6 @@ <h2>Try it</h2>
</label>
</p>
</div>
<div>
<p>
<label>
<input type="checkbox" id="option-allowOrphanBackrefs" onchange="setRule('allowOrphanBackrefs', this.checked)">
<code>allowOrphanBackrefs</code>
<span class="tip tip-xl">Useful with TextMate grammars that merge backrefs across <code>begin</code> and <code>end</code> patterns</span>
</label>
</p>
<p>
<label>
<input type="checkbox" id="option-allowUnhandledGAnchors" onchange="setRule('allowUnhandledGAnchors', this.checked)">
<code>allowUnhandledGAnchors</code>
<span class="tip tip-xl">Applies flag <code>y</code> for unsupported uses of <code>\G</code>, rather than erroring</span>
</label>
</p>
</div>
<div>
<p>
<label>
<input type="checkbox" id="option-asciiWordBoundaries" onchange="setRule('asciiWordBoundaries', this.checked)">
<code>asciiWordBoundaries</code>
<span class="tip tip-lg">Use ASCII-based <code>\b</code> and <code>\B</code></span>
</label>
</p>
<p>
<label>
<input type="checkbox" id="option-verbose" onchange="setOption('verbose', this.checked)">
<code>verbose</code>
<span class="tip tip-lg">Disables optimizations that simplify the pattern without changing the meaning</span>
</label>
</p>
</div>
</section>
</details>
<pre id="output"></pre>
4 changes: 3 additions & 1 deletion scripts/utils.js
@@ -76,7 +76,9 @@ function getMatchDetails(match) {
const transpiledRegExpResult = (pattern, str, pos) => {
let result;
try {
const options = {};
// `vscode-oniguruma` uses option `ONIG_OPTION_CAPTURE_GROUP` by default; see
// <github.com/microsoft/vscode-oniguruma/blob/1970c417eb0ebcaf8c6607774934ab2f89549c92/src/index.ts#L380>
const options = {rules: {captureGroup: true}};
if (pos) {
options.global = true;
}
5 changes: 3 additions & 2 deletions spec/helpers/matchers.js
@@ -12,6 +12,7 @@ function getArgs(actual, expected) {
pattern: typeof expected === 'string' ? expected : expected.pattern,
flags: expected.flags ?? '',
accuracy: expected.accuracy ?? 'default',
rules: expected.rules ?? {},
strings: Array.isArray(actual) ? actual : [actual],
targets: targeted,
};
@@ -24,9 +25,9 @@ function wasFullStrMatch(match, str) {
// Expects `negate` to be set by `negativeCompare` and doesn't rely on Jasmine's automatic matcher
// negation because when negated we don't want to early return `true` when looping over the array
// of strings and one is found to not match; they all need to not match
function matchWithAllTargets({pattern, flags, strings, targets, accuracy}, {exact, negate}) {
function matchWithAllTargets({pattern, flags, accuracy, rules, strings, targets}, {exact, negate}) {
for (const target of targets) {
const re = toRegExp(pattern, {accuracy, flags, target});
const re = toRegExp(pattern, {accuracy, flags, rules, target});
for (const str of strings) {
// In case the regex includes flag g or y
re.lastIndex = 0;