Skip to content
This repository has been archived by the owner on Oct 17, 2024. It is now read-only.

Make the ISO-8859 code page encodings use the Unicode Consortium mappings. #93

Closed
wants to merge 6 commits into from

Conversation

lrhn
Copy link
Contributor

@lrhn lrhn commented Nov 7, 2023

Built on top of Brian's conversion of the first three encodings.

Added a script to generate the string constants from the official table files, just to be safe.

(We can probably generate the reverse-mapping Map<int, int>, which is created when someone uses the encoding, as generated constants too. But that can be done later.)

pubspec.yaml Outdated
@@ -1,5 +1,5 @@
name: convert
version: 3.1.2-dev
version: 4.0.0-dev
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is going to be hard as long as test depends on someting which depends on ^3.0.0.
Might be better to just declare this change "non-breaking".)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could discuss at the ecosystem meeting tomorrow? Clearly changing a data mapping is breaking but it is hard to imagine people being negatively impacted. For now why don't you change it back so that you can write tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we decided to use a 3.X version.

void writeChar(StringSink buffer, int charCode) {
if (specialChars[charCode] case var s?) {
buffer.write(s);
} else if (charCode >= 0x20 && charCode < 0x7f) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of explicitly coding these character ranges, consider using the category of the unicode character to decide whether to escape it or not.

The script that I wrote only escaped characters in the "C" and "Z" categories, which results in this string for ISO-8859-6:

\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13
\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f\x20!"#$%&'()*+,-./0123456789:;
<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91
\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\ufffd\ufffd
\ufffd¤\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd،\xad\ufffd\ufffd\ufffd\ufffd
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd؛\ufffd\ufffd\ufffd؟\ufffd
ءآأؤإئابةتثجحخدذرزسشصضطظعغ\ufffd\ufffd\ufffd\ufffd\ufffdـفقكلمنهوىيًٌٍَُِّْ\ufffd\ufffd\ufffd
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd

It would also reduce your "special characters" map to just:

const specialChars = {
  0x27: r"\'",
  0x5c: r'\\',
}

Speaking of which, it is really hard to identify escaped characters in these strings. I'd suggest changing the quote character if the string contains a special character.

Copy link
Contributor Author

@lrhn lrhn Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose, perhaps not entirely necessarily, to keep the source ASCII, and encode any non-ASCII character.

I know Dart code is always UTF-8, but if I am reading this file at all, I prefer being able to tell the difference between "Α" and "A" .
(The former is upper-case Greek Alpha, U+0391, ISO-8859-7 char 0xC1, and the latter is the "plain" latin A, U+0041.)

The "ءآأؤإئابةتثجحخدذرزسشصضطظعغ" above also contains some right-to-left characters, which makes visually parsing the file even harder.

So, the options are:

  • Optimize for size by using the actual code point for every character, not an escape.
  • Optimize for size, to some degree, by using all visible characters, and escaping non-visible characters.
  • Optimize for "readability", to some degree, by using all visible ASCII characters, and escaping the rest.
  • Optimize for parsability by escaping everything with the same size escapes, \u0061 for A.

Any option "works", because only the final String constant content matters. The rest only matters if someone reads the source, of a generated file - instead of looking at the linked file that has the same information in a much more readable way.
Not a lot to do a data-drive choice from.

I don't think size is relevant to optimize for. It's not a big download no matter what we do, and plain ASCII compresses well.

I ended up choosing the second to last option. Mainly because I was looking at the file myself while generating it, and that's the step I reached before I was satisfied.
But escaping everything would have been sufficient (and simpler, I wouldn't need to start counting entries per line for that.)

Having \n->r'\n' as a special character is really more of a gimmick. It's there, but ... nobody cares, because it's always in the same position in every table anyway. (And who even knows what \f and \v does. I looked it up.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly about this. One advantage of using the actual characters instead of escapes is that you can visually see what types of characters your are encoding. But do this how you like.

If you are interested in how other languages do this, Python uses your approach but has one character per line with a comment like:

    ...
    '\x1e'     #  0x1E -> RECORD SEPARATOR
    '\x1f'     #  0x1F -> UNIT SEPARATOR
    ' '        #  0x20 -> SPACE
    '!'        #  0x21 -> EXCLAMATION MARK
    '"'        #  0x22 -> QUOTATION MARK
    ...

Go does something similar.

One difference is that they do one code page per file so it is less obnoxious to use a lot of vertical space for each encoding.

Copy link
Contributor

@natebosch natebosch Nov 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have a slight preference for the the current approach of escaping non-ascii.

I'd be on board with other patterns like escaping everything in favor of a simpler generator.

/// Find the number of characters that fits in a single-line string literal.
///
/// Finds the largest number of characters starting from [from],
/// which is a power of 2, and where the characters representation,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want to limit this to powers-of-2?

Copy link
Contributor Author

@lrhn lrhn Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly for a miniscule amount of readability. Which likely isn't really needed.

I started just doing eight \uHHHH chars per line, which works, and ensures that you can visually find the individual characters in the mapping if you really want to.
But it's so big, especially for the ASCII part, so I tried to reduce the size by using \xHH encodings and visible ASCII chars, but it's entirely a source-size-and-readability thing, it shouldn't make any difference to the string.

But, to make it still somewhat possible to eyeball what character is at which position, I wanted the lines to all be
power-of-2 aligned.
Or really, I wanted them to have lengths that are powers of 2 and be aligned by their size.
I think I might have gotten the latter, just because of how things end up being encoded.
(... no, 8859-4 has a 16-character line starting at 0xD8.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't care about the readability of the generated file, you can do this however you want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I'm coming around to that too.
In which case anything other than repeated \uHHHH entries is a waste of developer time and complexity budget. (I'll still put 8 of them on each line, to get a little alignment. I like alignment. Maybe even has room for a // 0xc0..0xc7 comment. And there I go trying for readability again. Because debugging ...? Don't debug generated code, debug the generator!)

///
/// This is the authoritative mapping between ISO-8859-2 and
/// Unicode text, as [specified by the Unicode Consortium](
/// https://unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bonus points for putting the table version here. You can get that from the mapping file:
# Table version: 2.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh, nice.

pubspec.yaml Outdated
@@ -1,5 +1,5 @@
name: convert
version: 3.1.2-dev
version: 4.0.0-dev
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could discuss at the ecosystem meeting tomorrow? Clearly changing a data mapping is breaking but it is hard to imagine people being negatively impacted. For now why don't you change it back so that you can write tests?

/// Shared HTTP client used by [fetchTable], initialized if needed.
///
/// Closed by [generateTables] when done fetching.
HttpClient? httpClient;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use package:http for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could, but adding more dependencies isn't something I do lightly.

In this case, it's a dev-dependency, so it only affects ourselves.
(Also, I don't actually have experience with package:http, so I just used what I know.)

Copy link
Contributor

@brianquinlan brianquinlan Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have strong feelings about this but we are probably going to ask people to not use HttpClient directly in the next couple of months so, if we can avoid using it ourselves, that would be ideal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a very good reason. And again, it's a dev-dependency, so nobody else should care.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is our reasoning for avoiding HttpClient when the code is know to only be used on platforms that support it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is our reasoning for avoiding HttpClient when the code is know to only be used on platforms that support it?

Developers will assume that our code reflects best practices so, if we use HttpClient, then other people will assume that they should too. In my mind, there are very few reasons to use HttpClient directly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO keeping core package dependencies as flat as possible should weigh higher than making our code look like recommended app code. I do think that having more sample code will make it less likely authors will need to go looking in packages like this to find a pattern to copy/paste.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO keeping core package dependencies as flat as possible should weigh higher than making our code look like recommended app code. I do think that having more sample code will make it less likely authors will need to go looking in packages like this to find a pattern to copy/paste.

I don't feel very strongly about this. The dependency would only be for dev though so it wouldn't impact users.

Copy link
Contributor Author

@lrhn lrhn Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possibility is that we remove the download code, and just require the user to have pre-downloaded the files to somewhere, so they can point to that directory.

This was just me making my own life easier, when I had to do the same thing multiple times, but it's unnecessary complexity when we can just ask the user to use curl first. Like:

curl -OOOOOOOOOOOOOOO https://unicode.org/Public/MAPPINGS/ISO8859/8859-{1,2,3,4,5,6,7,8,9,10,11,13,14,15,16}.TXT

I think I'll just remove the download code. It's safer and simpler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM I like the idea of operating only on local files.

..write('const iso8859_$isoNumber = // From $uri');
// Write 8 chars per line, as Unicode escapes.
for (var i = 0; i < text.length;) {
var charsOnLine = nextRangeLength(text, i, 74);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be more readable if nextRangeLength were a generator? So you could write:

  for(var next in splitTextIntoLines(text, 74)) {
    ...
  }

Copy link
Contributor Author

@lrhn lrhn Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could work too. And could more easily keep state, that would let me ensure that every line is size-aligned, so a 16-character line will always start at position that is a multiplum of 16. (Which is what I really want, and I thought I actually got it by accident. Only almost.)

@kevmoo
Copy link
Contributor

kevmoo commented Nov 7, 2023

We could bump shelf_static to support a broader range, if we feel confident it won't break anything there!

@lrhn
Copy link
Contributor Author

lrhn commented Nov 8, 2023

About publishing the script for others to use, to automatically download a specific mapping and generate a string for it:
If we do thatm I'd remove the automatica downloading, and ask people to download the files themselves, and just have thm point the script to the already downloaded sources. Or require them to provide the URIs to download, then the script can do it.

I'm not comfortable providing a script to users which is hard-coded to use resources from a third-party, without explicit permission from that third-party.

Anyone can download the files manually, but if a Google script downloads a UC web-page automatically, it's too close to doing it automatically and systematically, which is not OK. We'd be providing users a service based on other people's resources.

So, either:

  • Don't download files at all, require the user to do it ahead of time.
  • Only download files that the user has provided the URIs for, so the user chooses which files to download, not us.

Either also allows users to have their own mirrors of the data files, if they are not comfortable depending on a third-party directly in their build steps.
(But then I think it would be a nice feature, to parse and generate a constant for any mapping file, no matter how the users provides that mapping file.)

@brianquinlan
Copy link
Contributor

brianquinlan commented Nov 15, 2023

You are still working on this, right? AFAIK, you still need to:

  • Change the version package to 3.X
  • Update the unit tests
  • Assume that the mapping files are already downloaded, i.e., remove the HttpClient code.

@lrhn
Copy link
Contributor Author

lrhn commented Nov 15, 2023

I am. Just got caught up on another CL (making a change for String.fromCharCodes work on Wasm)

@brianquinlan
Copy link
Contributor

No worries but I'll be out for the next two weeks so I'll remove myself as a reviewer.

@lrhn
Copy link
Contributor Author

lrhn commented Nov 17, 2023

Updated, PTAL

@@ -1,8 +1,10 @@
## 3.1.2-dev
## 4.0.0-dev
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we decide on a nonbreaking release during the sync?

// for details. All rights reserved. Use of this source code is governed by a
// BSD-style license that can be found in the LICENSE file.

// Generator for tables for ISO-8859 code page encoding of Unicode text,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a doc comment on a library;?

@@ -0,0 +1,266 @@
// Copyright (c) 2023, the Dart project authors. Please see the AUTHORS file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want this to be usable by other packages I think it needs to be in bin/ instead of tool/

@kevmoo kevmoo removed their request for review June 27, 2024 23:03
@mosuem
Copy link
Contributor

mosuem commented Oct 16, 2024

Closing as the dart-lang/convert repository is merged into the dart-lang/core monorepo. Please re-open this PR there!

@mosuem mosuem closed this Oct 16, 2024
@brianquinlan
Copy link
Contributor

@lrhn Are you intending on reopening this in dart-lang/core?

@lrhn
Copy link
Contributor Author

lrhn commented Oct 17, 2024

Yeah, let's do that.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants