Make the ISO-8859 code page encodings use the Unicode Consortium mappings. #93

lrhn · 2023-11-07T13:40:27Z

Built on top of Brian's conversion of the first three encodings.

Added a script to generate the string constants from the official table files, just to be safe.

(We can probably generate the reverse-mapping Map<int, int>, which is created when someone uses the encoding, as generated constants too. But that can be done later.)

lrhn · 2023-11-07T13:43:03Z

pubspec.yaml

@@ -1,5 +1,5 @@
 name: convert
-version: 3.1.2-dev
+version: 4.0.0-dev


(This is going to be hard as long as test depends on someting which depends on ^3.0.0.
Might be better to just declare this change "non-breaking".)

We could discuss at the ecosystem meeting tomorrow? Clearly changing a data mapping is breaking but it is hard to imagine people being negatively impacted. For now why don't you change it back so that you can write tests?

I think that we decided to use a 3.X version.

brianquinlan · 2023-11-07T18:06:49Z

tools/update_unicode_iso8859_tables.dart

+void writeChar(StringSink buffer, int charCode) {
+  if (specialChars[charCode] case var s?) {
+    buffer.write(s);
+  } else if (charCode >= 0x20 && charCode < 0x7f) {


Instead of explicitly coding these character ranges, consider using the category of the unicode character to decide whether to escape it or not.

The script that I wrote only escaped characters in the "C" and "Z" categories, which results in this string for ISO-8859-6:

\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13 \x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f\x20!"#$%&'()*+,-./0123456789:; <=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ \x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91 \x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\ufffd\ufffd \ufffd¤\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd،\xad\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd؛\ufffd\ufffd\ufffd؟\ufffd ءآأؤإئابةتثجحخدذرزسشصضطظعغ\ufffd\ufffd\ufffd\ufffd\ufffdـفقكلمنهوىيًٌٍَُِّْ\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd

It would also reduce your "special characters" map to just:

const specialChars = { 0x27: r"\'", 0x5c: r'\\', }

Speaking of which, it is really hard to identify escaped characters in these strings. I'd suggest changing the quote character if the string contains a special character.

I chose, perhaps not entirely necessarily, to keep the source ASCII, and encode any non-ASCII character.

I know Dart code is always UTF-8, but if I am reading this file at all, I prefer being able to tell the difference between "Α" and "A" .
(The former is upper-case Greek Alpha, U+0391, ISO-8859-7 char 0xC1, and the latter is the "plain" latin A, U+0041.)

The "ءآأؤإئابةتثجحخدذرزسشصضطظعغ" above also contains some right-to-left characters, which makes visually parsing the file even harder.

So, the options are:

Optimize for size by using the actual code point for every character, not an escape.

Optimize for size, to some degree, by using all visible characters, and escaping non-visible characters.

Optimize for "readability", to some degree, by using all visible ASCII characters, and escaping the rest.

Optimize for parsability by escaping everything with the same size escapes, \u0061 for A.

Any option "works", because only the final String constant content matters. The rest only matters if someone reads the source, of a generated file - instead of looking at the linked file that has the same information in a much more readable way.
Not a lot to do a data-drive choice from.

I don't think size is relevant to optimize for. It's not a big download no matter what we do, and plain ASCII compresses well.

I ended up choosing the second to last option. Mainly because I was looking at the file myself while generating it, and that's the step I reached before I was satisfied.
But escaping everything would have been sufficient (and simpler, I wouldn't need to start counting entries per line for that.)

Having \n->r'\n' as a special character is really more of a gimmick. It's there, but ... nobody cares, because it's always in the same position in every table anyway. (And who even knows what \f and \v does. I looked it up.)

I don't feel strongly about this. One advantage of using the actual characters instead of escapes is that you can visually see what types of characters your are encoding. But do this how you like.

If you are interested in how other languages do this, Python uses your approach but has one character per line with a comment like:

... '\x1e' # 0x1E -> RECORD SEPARATOR '\x1f' # 0x1F -> UNIT SEPARATOR ' ' # 0x20 -> SPACE '!' # 0x21 -> EXCLAMATION MARK '"' # 0x22 -> QUOTATION MARK ...

Go does something similar.

One difference is that they do one code page per file so it is less obnoxious to use a lot of vertical space for each encoding.

I think I have a slight preference for the the current approach of escaping non-ascii.

I'd be on board with other patterns like escaping everything in favor of a simpler generator.

brianquinlan · 2023-11-07T18:15:24Z

tools/update_unicode_iso8859_tables.dart

+/// Find the number of characters that fits in a single-line string literal.
+///
+/// Finds the largest number of characters starting from [from],
+/// which is a power of 2, and where the characters representation,


Why do you want to limit this to powers-of-2?

Mostly for a miniscule amount of readability. Which likely isn't really needed.

I started just doing eight \uHHHH chars per line, which works, and ensures that you can visually find the individual characters in the mapping if you really want to.
But it's so big, especially for the ASCII part, so I tried to reduce the size by using \xHH encodings and visible ASCII chars, but it's entirely a source-size-and-readability thing, it shouldn't make any difference to the string.

But, to make it still somewhat possible to eyeball what character is at which position, I wanted the lines to all be
power-of-2 aligned.
Or really, I wanted them to have lengths that are powers of 2 and be aligned by their size.
I think I might have gotten the latter, just because of how things end up being encoded.
(... no, 8859-4 has a 16-character line starting at 0xD8.)

If we don't care about the readability of the generated file, you can do this however you want.

Yeah. I'm coming around to that too.
In which case anything other than repeated \uHHHH entries is a waste of developer time and complexity budget. (I'll still put 8 of them on each line, to get a little alignment. I like alignment. Maybe even has room for a // 0xc0..0xc7 comment. And there I go trying for readability again. Because debugging ...? Don't debug generated code, debug the generator!)

brianquinlan · 2023-11-07T18:28:29Z

lib/src/codepage.dart

+///
+/// This is the authoritative mapping between ISO-8859-2 and
+/// Unicode text, as [specified by the Unicode Consortium](
+/// https://unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT).


Bonus points for putting the table version here. You can get that from the mapping file:
# Table version: 2.0

Oooh, nice.

brianquinlan · 2023-11-07T18:30:31Z

pubspec.yaml

@@ -1,5 +1,5 @@
 name: convert
-version: 3.1.2-dev
+version: 4.0.0-dev


We could discuss at the ecosystem meeting tomorrow? Clearly changing a data mapping is breaking but it is hard to imagine people being negatively impacted. For now why don't you change it back so that you can write tests?

brianquinlan · 2023-11-07T18:43:40Z

tools/update_unicode_iso8859_tables.dart

+/// Shared HTTP client used by [fetchTable], initialized if needed.
+///
+/// Closed by [generateTables] when done fetching.
+HttpClient? httpClient;


Could you use package:http for this?

I could, but adding more dependencies isn't something I do lightly.

In this case, it's a dev-dependency, so it only affects ourselves.
(Also, I don't actually have experience with package:http, so I just used what I know.)

I don't have strong feelings about this but we are probably going to ask people to not use HttpClient directly in the next couple of months so, if we can avoid using it ourselves, that would be ideal.

That is a very good reason. And again, it's a dev-dependency, so nobody else should care.

What is our reasoning for avoiding HttpClient when the code is know to only be used on platforms that support it?

What is our reasoning for avoiding HttpClient when the code is know to only be used on platforms that support it?

Developers will assume that our code reflects best practices so, if we use HttpClient, then other people will assume that they should too. In my mind, there are very few reasons to use HttpClient directly.

IMO keeping core package dependencies as flat as possible should weigh higher than making our code look like recommended app code. I do think that having more sample code will make it less likely authors will need to go looking in packages like this to find a pattern to copy/paste.

IMO keeping core package dependencies as flat as possible should weigh higher than making our code look like recommended app code. I do think that having more sample code will make it less likely authors will need to go looking in packages like this to find a pattern to copy/paste.

I don't feel very strongly about this. The dependency would only be for dev though so it wouldn't impact users.

Another possibility is that we remove the download code, and just require the user to have pre-downloaded the files to somewhere, so they can point to that directory.

This was just me making my own life easier, when I had to do the same thing multiple times, but it's unnecessary complexity when we can just ask the user to use curl first. Like:

curl -OOOOOOOOOOOOOOO https://unicode.org/Public/MAPPINGS/ISO8859/8859-{1,2,3,4,5,6,7,8,9,10,11,13,14,15,16}.TXT

I think I'll just remove the download code. It's safer and simpler.

SGTM I like the idea of operating only on local files.

brianquinlan · 2023-11-07T18:50:54Z

tools/update_unicode_iso8859_tables.dart

+    ..write('const iso8859_$isoNumber = // From $uri');
+  // Write 8 chars per line, as Unicode escapes.
+  for (var i = 0; i < text.length;) {
+    var charsOnLine = nextRangeLength(text, i, 74);


Would this be more readable if nextRangeLength were a generator? So you could write:

for(var next in splitTextIntoLines(text, 74)) { ... }

Could work too. And could more easily keep state, that would let me ensure that every line is size-aligned, so a 16-character line will always start at position that is a multiplum of 16. (Which is what I really want, and I thought I actually got it by accident. Only almost.)

kevmoo · 2023-11-07T18:54:14Z

We could bump shelf_static to support a broader range, if we feel confident it won't break anything there!

lrhn · 2023-11-08T18:31:30Z

About publishing the script for others to use, to automatically download a specific mapping and generate a string for it:
If we do thatm I'd remove the automatica downloading, and ask people to download the files themselves, and just have thm point the script to the already downloaded sources. Or require them to provide the URIs to download, then the script can do it.

I'm not comfortable providing a script to users which is hard-coded to use resources from a third-party, without explicit permission from that third-party.

Anyone can download the files manually, but if a Google script downloads a UC web-page automatically, it's too close to doing it automatically and systematically, which is not OK. We'd be providing users a service based on other people's resources.

So, either:

Don't download files at all, require the user to do it ahead of time.
Only download files that the user has provided the URIs for, so the user chooses which files to download, not us.

Either also allows users to have their own mirrors of the data files, if they are not comfortable depending on a third-party directly in their build steps.
(But then I think it would be a nice feature, to parse and generate a constant for any mapping file, no matter how the users provides that mapping file.)

brianquinlan · 2023-11-15T17:04:51Z

You are still working on this, right? AFAIK, you still need to:

Change the version package to 3.X
Update the unit tests
Assume that the mapping files are already downloaded, i.e., remove the HttpClient code.

lrhn · 2023-11-15T17:06:37Z

I am. Just got caught up on another CL (making a change for String.fromCharCodes work on Wasm)

brianquinlan · 2023-11-15T17:14:12Z

No worries but I'll be out for the next two weeks so I'll remove myself as a reviewer.

lrhn · 2023-11-17T14:29:07Z

Updated, PTAL

natebosch · 2023-11-20T18:03:57Z

CHANGELOG.md

@@ -1,8 +1,10 @@
-## 3.1.2-dev
+## 4.0.0-dev


Did we decide on a nonbreaking release during the sync?

natebosch · 2023-11-20T18:04:56Z

tools/update_unicode_iso8859_tables.dart

+// for details. All rights reserved. Use of this source code is governed by a
+// BSD-style license that can be found in the LICENSE file.
+
+// Generator for tables for ISO-8859 code page encoding of Unicode text,


Should this be a doc comment on a library;?

natebosch · 2023-11-20T18:08:42Z

tools/update_unicode_iso8859_tables.dart

@@ -0,0 +1,266 @@
+// Copyright (c) 2023, the Dart project authors.  Please see the AUTHORS file


If we want this to be usable by other packages I think it needs to be in bin/ instead of tool/

mosuem · 2024-10-16T08:26:27Z

Closing as the dart-lang/convert repository is merged into the dart-lang/core monorepo. Please re-open this PR there!

brianquinlan · 2024-10-17T17:57:13Z

@lrhn Are you intending on reopening this in dart-lang/core?

lrhn · 2024-10-17T18:45:05Z

Yeah, let's do that.

brianquinlan and others added 3 commits November 6, 2023 09:47

Update the ISO-8859 encodings to match the current standards

56f6283

Increase major version

4f6b9b2

Use Unicode Consortium mappings for all ISO-8859 code pages.

1480e60

lrhn requested review from kevmoo and brianquinlan November 7, 2023 13:40

lrhn mentioned this pull request Nov 7, 2023

Update the ISO-8859 encodings to match the current standards #92

Closed

1 task

lrhn commented Nov 7, 2023

View reviewed changes

brianquinlan reviewed Nov 7, 2023

View reviewed changes

brianquinlan mentioned this pull request Nov 9, 2023

Provide a user-accessible script to generate encodings from Unicode.org mappings. dart-lang/core#287

Open

Address comments.

b125965

lrhn force-pushed the ascii-super-set branch from b15086b to b125965 Compare November 9, 2023 17:41

brianquinlan assigned natebosch Nov 15, 2023

Removed HTTP download feature.

880dc66

Make lints happy.

58f4623

natebosch reviewed Nov 20, 2023

View reviewed changes

kevmoo removed their request for review June 27, 2024 23:03

mosuem closed this Oct 16, 2024

		@@ -0,0 +1,266 @@
		// Copyright (c) 2023, the Dart project authors. Please see the AUTHORS file

Make the ISO-8859 code page encodings use the Unicode Consortium mappings. #93

Make the ISO-8859 code page encodings use the Unicode Consortium mappings. #93

Conversation

lrhn commented Nov 7, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lrhn Nov 8, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

natebosch Nov 9, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lrhn Nov 8, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

brianquinlan Nov 8, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lrhn Nov 13, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lrhn Nov 8, 2023 • edited Loading

Choose a reason for hiding this comment

kevmoo commented Nov 7, 2023

lrhn commented Nov 8, 2023

brianquinlan commented Nov 15, 2023 • edited by lrhn Loading

lrhn commented Nov 15, 2023

brianquinlan commented Nov 15, 2023

lrhn commented Nov 17, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mosuem commented Oct 16, 2024

brianquinlan commented Oct 17, 2024

lrhn commented Oct 17, 2024

lrhn Nov 8, 2023 •

edited

Loading

natebosch Nov 9, 2023 •

edited

Loading

lrhn Nov 8, 2023 •

edited

Loading

brianquinlan Nov 8, 2023 •

edited

Loading

lrhn Nov 13, 2023 •

edited

Loading

lrhn Nov 8, 2023 •

edited

Loading

brianquinlan commented Nov 15, 2023 •

edited by lrhn

Loading