Issues with whitespace definition #361

thobe · 2019-04-04T10:53:00Z

Neither Java's Character.isWhitespace(int) nor Character.isSpaceChar(int), nor the Unicode [:White_Space:] specification, treats \u180E (MONGOLIAN VOWEL SEPARATOR) as whitespace.

Yet the openCypher grammar considers it a whitespace character. Why?

openCypher/grammar/basic-grammar.xml

Line 781 in 346aa0d

Furthermore, the definition of whitespace in the openCypher grammar does not consider \u0085 (NEXT LINE) to be whitespace, although it is part of the Unicode [:White_Space:] specification. Perhaps it should be added? (Neither Character.isWhitespace(int) nor Character.isSpaceChar(int) considers it whitespace, which would explain why it is not in the grammar.)

thobe · 2019-04-04T11:13:45Z

I came across this difference while looking into why the whitespace production rule spells out all whitespace characters individually instead of just referencing the Unicode [:White_Space:] specification. So I investigated the difference.

The conclusion of this exercise is that, apart from \u0085 (NEXT LINE), the grammar includes all characters of the Unicode [:White_Space:] specification, and additionally includes \u001C (FILE SEPARATOR), \u001D (GROUP SEPARATOR), \u001E (RECORD SEPARATOR), and \u001F (UNIT SEPARATOR).

Tabulating the characters involved:

| Code Point | `Character.isWhitespace(...)` | `Character.isSpaceChar(...)` | `[:White_Space:]` |
|---|---|---|---|
| `\u0009` | True | False | True |
| `\u000a` | True | False | True |
| `\u000b` | True | False | True |
| `\u000c` | True | False | True |
| `\u000d` | True | False | True |
| `\u001c` | True | False | False |
| `\u001d` | True | False | False |
| `\u001e` | True | False | False |
| `\u001f` | True | False | False |
| `\u0020` | True | True | True |
| `\u0085` | False | False | True |
| `\u00a0` | False | True | True |
| `\u1680` | True | True | True |
| ~~`\u180E`~~ | ~~True (Java 8)~~ | ~~True (Java 8)~~ | ~~True (Unicode 4.0 - 6.2)~~ |
| ~~`\u180E`~~ | ~~False (Java 11)~~ | ~~False (Java 11)~~ | ~~False (Unicode 3.0 - 3.2; 6.3 -)~~ |
| `\u2000` | True | True | True |
| `\u2001` | True | True | True |
| `\u2002` | True | True | True |
| `\u2003` | True | True | True |
| `\u2004` | True | True | True |
| `\u2005` | True | True | True |
| `\u2006` | True | True | True |
| `\u2007` | False | True | True |
| `\u2008` | True | True | True |
| `\u2009` | True | True | True |
| `\u200a` | True | True | True |
| `\u2028` | True | True | True |
| `\u2029` | True | True | True |
| `\u202f` | False | True | True |
| `\u205f` | True | True | True |
| `\u3000` | True | True | True |
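
For anyone who wants to regenerate rows like these on their own JDK, here is a minimal sketch. The class name `WhitespaceTable` and its helper methods are made up for illustration. The third column relies on the regex property `\p{IsWhite_Space}`, which the JDK derives from its own bundled Unicode data, so the result for \u180E will depend on the Java version you run it on.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of openCypher or Neo4j.
final class WhitespaceTable {
    // The JDK regex engine exposes White_Space as a binary Unicode property.
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    static boolean isUnicodeWhiteSpace(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return WHITE_SPACE.matcher(s).matches();
    }

    // One tab-separated row: code point, isWhitespace, isSpaceChar, White_Space.
    // Code points are printed in U+XXXX notation.
    static String row(int codePoint) {
        return String.format("U+%04X\t%b\t%b\t%b",
                codePoint,
                Character.isWhitespace(codePoint),
                Character.isSpaceChar(codePoint),
                isUnicodeWhiteSpace(codePoint));
    }

    public static void main(String[] args) {
        // A sample of the code points from the table, not the full list.
        int[] sample = {0x0009, 0x001C, 0x0020, 0x0085, 0x00A0, 0x180E, 0x2007, 0x202F, 0x3000};
        System.out.println("Code Point\tisWhitespace\tisSpaceChar\tWhite_Space");
        for (int codePoint : sample) {
            System.out.println(row(codePoint));
        }
    }
}
```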

thobe · 2019-04-04T12:08:56Z

If we agree to use the Unicode [:White_Space:] specification, we could define whitespace as:

<production name="whitespace">
  <alt>
    <character set="White_Space"/>
    <character set="FS"/>
    <character set="GS"/>
    <character set="RS"/>
    <character set="US"/>
  </alt>
</production>
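
As a rough illustration of what this production would accept, the following sketch treats a code point as whitespace if it has the Unicode White_Space property or is one of the four separator controls \u001C to \u001F. The class name `ProposedWhitespace` is hypothetical, and using the JDK's `\p{IsWhite_Space}` as a stand-in for the `White_Space` character set is an assumption, since the JDK's answer follows whichever Unicode version it ships with.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the proposed rule, not the actual parser code.
final class ProposedWhitespace {
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    static boolean isWhitespace(int codePoint) {
        // FS, GS, RS and US occupy the contiguous range U+001C to U+001F.
        boolean isSeparatorControl = codePoint >= 0x1C && codePoint <= 0x1F;
        if (isSeparatorControl) {
            return true;
        }
        String s = new String(Character.toChars(codePoint));
        return WHITE_SPACE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println("U+0020: " + isWhitespace(0x0020));
        System.out.println("U+0085: " + isWhitespace(0x0085));
        System.out.println("U+001C: " + isWhitespace(0x001C));
        System.out.println("U+0061: " + isWhitespace(0x0061));
    }
}
```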

thobe · 2019-04-04T12:12:22Z

Looking at the commit history, it appears that at some point Java's Character.isWhitespace(int) treated \u180E (MONGOLIAN VOWEL SEPARATOR) as whitespace. At least that is what the code comments say. And indeed, Java 8 includes it, but Java 11 does not.
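
A quick way to check this on a given JDK is sketched below. The class name is made up for illustration. The sketch assumes the behaviour follows the general category the JDK assigns to the character: `Character.isWhitespace` only accepts it while it is classified as a space separator.

```java
// Hypothetical check, intended to be run on different JDK versions.
final class MongolianVowelSeparatorCheck {
    static final int MVS = 0x180E; // MONGOLIAN VOWEL SEPARATOR

    // True when this JDK's Unicode data puts the character in category Zs.
    static boolean classifiedAsSpaceSeparator() {
        return Character.getType(MVS) == Character.SPACE_SEPARATOR;
    }

    public static void main(String[] args) {
        System.out.println("java.version    = " + System.getProperty("java.version"));
        System.out.println("space separator = " + classifiedAsSpaceSeparator());
        System.out.println("isWhitespace    = " + Character.isWhitespace(MVS));
        System.out.println("isSpaceChar     = " + Character.isSpaceChar(MVS));
    }
}
```

Running it on both Java 8 and Java 11 should show the three answers flipping together.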

Mats-SX · 2019-04-04T12:38:50Z

I think it makes good sense to stick with Unicode here. Do we even need the special additions of FS, GS, RS and US?

thobe · 2019-04-04T13:01:06Z

The FILE SEPARATOR, GROUP SEPARATOR, RECORD SEPARATOR, and UNIT SEPARATOR have been explicitly treated as whitespace by Java since forever, and thus by the Neo4j Cypher parser.

They are unlikely to occur in Cypher queries. I'd say it's harmless to either include or exclude them.

Mats-SX · 2019-04-05T08:01:30Z

I agree. I would lean towards going with Unicode rather than Java (and abandoning Cypher's implementation history), but I don't feel strongly about it. I wonder if either of the two alternatives makes a difference for implementability? I doubt it.
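
To get a feel for how far apart the two alternatives are in practice, here is a small sketch that scans every code point and lists those where Java's `Character.isWhitespace` and the Unicode White_Space property disagree. The class and method names are hypothetical, and the White_Space side again uses the JDK's `\p{IsWhite_Space}` as an approximation tied to the JDK's Unicode version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical comparison tool, not part of any openCypher tooling.
final class WhitespaceDiff {
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    // Returns one line per code point on which the two definitions disagree.
    static List<String> disagreements() {
        List<String> result = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            boolean javaSays = Character.isWhitespace(cp);
            boolean unicodeSays =
                    WHITE_SPACE.matcher(new String(Character.toChars(cp))).matches();
            if (javaSays != unicodeSays) {
                result.add(String.format("U+%04X java=%b unicode=%b", cp, javaSays, unicodeSays));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : disagreements()) {
            System.out.println(line);
        }
    }
}
```

The list should be short: the four separator controls on the Java side, and NEXT LINE plus the three non-breaking spaces on the Unicode side, which matches the rows of the table where the first and last columns differ.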

hvub · 2022-03-18T14:42:52Z

See #530

thobe added bug grammar labels Apr 4, 2019

hvub mentioned this issue Mar 18, 2022

Update basic-grammar.xml #530

Open
