Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On JavaScript, string.replace \r splits \r\n graphemes #584

Open
richard-viney opened this issue May 10, 2024 · 7 comments
Open

On JavaScript, string.replace \r splits \r\n graphemes #584

richard-viney opened this issue May 10, 2024 · 7 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@richard-viney
Copy link
Contributor

I was expecting string.replace to treat control characters like any other code points, but then ran into this behaviour which looks wrong.

The issue is only on the Erlang target, works fine under JS.

I can dig a bit deeper if it would be helpful, possible that other control characters are affected too.

import gleam/bit_array
import gleam/string
import gleeunit
import gleeunit/should

pub fn string_replace_test() {
  // Check \r\n binary representation is as expected
  "\r\n"
  |> bit_array.from_string
  |> bit_array.inspect
  |> should.equal("<<13, 10>>")

  // Test replacing \n with a space character (U+0020)
  "\r\n"
  |> string.replace("\n", " ")
  |> bit_array.from_string
  |> bit_array.inspect
  |> should.equal("<<13, 32>>")

  // Test replacing \r with a space character (U+0020) fails with:
  //
  //   Values were not equal
  //   expected: "<<32, 10>>"
  //   got: "<<13, 10>>"
  //
  "\r\n"
  |> string.replace("\r", " ")
  |> bit_array.from_string
  |> bit_array.inspect
  |> should.equal("<<32, 10>>")
}
@lpil
Copy link
Member

lpil commented May 13, 2024

Thank you

@lpil lpil added bug Something isn't working help wanted Extra attention is needed labels May 13, 2024
@mooreryan
Copy link
Contributor

This is a bit subtle, but as far as I can tell from skimming through the Erlang/OTP string module is that the functions are treating the \r\n as a single grapheme cluster.

First, check out string:length which returns the number of grapheme clusters the given string.

12> string:length("\r").
1
13> string:length("\r\n").
1
14> string:length("\n").
1
15> string:length("\n\n").
2
16> string:length("\r\r").
2

For trying to use the "\r" as the needle to replace in the haystack "\r\n", here is my understanding from following along with the code to see why \r will not work like the \n or \r\n.

  • replace calls split
  • split calls split_1
  • split_1 has many clauses, but many end up calling prefix_1
  • prefix_1 also has many clauses, but the one that matches if you pass non-binary is the first which will call unicode_util:gc on the \r\n, which returns ["\r\n"] (ie one grapheme cluster), which will then return the nomatch from prefix_1 when searching for \r
  • Because it got nomatch, so then split_1 is called again with the tail of "\r\n". But tail of that thing is the "\n" because the strings in erlang are list of codepoints:
1> [Head|Tail] = "\r\n".
"\r\n"
2> Head.
13
3> Tail.
"\n"

So, (as far as I can tell from going through the Erlang/OTP code right now), that's how the \r\n can match, and the \n can match, but not the \r by itself.

@richard-viney
Copy link
Contributor Author

Yep agreed it's due to the \r\n being treated as a grapheme cluster, this is mentioned in several places including here: https://www.erlang.org/doc/man/string#trim-3:

"Notice that [$\r,$\n] is one grapheme cluster according to the Unicode Standard."

So the question is which of JS or Erlang is the desired/intended behaviour? Should it replace by codepoint or by grapheme cluster 🤔.

@lpil
Copy link
Member

lpil commented May 20, 2024

Sounds like the JS implementation is incorrect.

@lpil lpil changed the title On Erlang, string.replace doesn't replace \r characters On JavaScript, string.replace \r splits \r\n graphemes May 20, 2024
@Pi-Cla
Copy link

Pi-Cla commented May 27, 2024

Erlang's splitting behaviour seems inconsistent though
Currently when you call the split function for either the Erlang or Javascript target we end up with this:

import gleam/should
import gleam/string

pub fn split_newline_test() {
  string.split("\r\n","\n")
  |> should.equal(["\r",""])
}

Meaning Erlang just like Javascript is perfectly ok with removing the \n from a \r\n grapheme cluster.

However Erlang is not ok with

import gleam/should
import gleam/string

pub fn split_newline_test() {
  string.split("\r\n","\r")
  |> should.equal(["","\n"])
}

whereas Javascript is.

@Pi-Cla
Copy link

Pi-Cla commented May 27, 2024

I would have expected Erlang to reject attempting to split a \n on a \r\n cluster if it was always treating \r\n as a separate thing from either \r or \n

@Pi-Cla
Copy link

Pi-Cla commented May 28, 2024

So if we want to fully copy what Erlang does we need to ensure Javascript still DOES split on \n which is a bit odd...

Edit:
The most "correct" behaviour to me would be either:

  1. Make \r\n split on neither \r nor \n
    or
  2. Treat \r\n as \n so that when you split on \n it removes the entire cluster \r\n
import gleam/should
import gleam/string

pub fn split_newline_test() {
  string.split("thing\r\nthing","\n")
  |> should.equal(["thing","thing"])
}

Both sequences are semantically the same in most cases and leaving a dangling \r would be very odd to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

4 participants