Make the ISO-8859 code page encodings use the Unicode Consortium mappings. #93
Conversation
pubspec.yaml (Outdated)
@@ -1,5 +1,5 @@
 name: convert
-version: 3.1.2-dev
+version: 4.0.0-dev
(This is going to be hard as long as test depends on something which depends on ^3.0.0.
Might be better to just declare this change "non-breaking".)
We could discuss at the ecosystem meeting tomorrow? Clearly changing a data mapping is breaking but it is hard to imagine people being negatively impacted. For now why don't you change it back so that you can write tests?
I think that we decided to use a 3.X version.
void writeChar(StringSink buffer, int charCode) {
  if (specialChars[charCode] case var s?) {
    buffer.write(s);
  } else if (charCode >= 0x20 && charCode < 0x7f) {
Instead of explicitly coding these character ranges, consider using the category of the Unicode character to decide whether to escape it or not.
The script that I wrote only escaped characters in the "C" and "Z" categories, which results in this string for ISO-8859-6:
\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13
\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f\x20!"#$%&'()*+,-./0123456789:;
<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91
\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\ufffd\ufffd
\ufffd¤\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd،\xad\ufffd\ufffd\ufffd\ufffd
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd؛\ufffd\ufffd\ufffd؟\ufffd
ءآأؤإئابةتثجحخدذرزسشصضطظعغ\ufffd\ufffd\ufffd\ufffd\ufffdـفقكلمنهوىيًٌٍَُِّْ\ufffd\ufffd\ufffd
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd
It would also reduce your "special characters" map to just:
const specialChars = {
  0x27: r"\'",
  0x5c: r'\\',
};
Speaking of which, it is really hard to identify escaped characters in these strings. I'd suggest changing the quote character if the string contains a special character.
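The category-based script itself isn't shown in the thread; here is a rough Python sketch of what such a filter might look like, using `unicodedata` (the escaping rules and the `escape_char` name are illustrative, not the reviewer's actual code):

```python
import unicodedata

def escape_char(ch: str) -> str:
    # Hypothetical sketch of category-based escaping: escape only the quote,
    # the backslash, and characters in the Unicode "C" (other/control) or
    # "Z" (separator) categories; everything else stays literal.
    if ch == "\\":
        return r"\\"
    if ch == "'":
        return r"\'"
    if unicodedata.category(ch)[0] in ("C", "Z"):
        cp = ord(ch)
        return f"\\x{cp:02x}" if cp <= 0xFF else f"\\u{cp:04x}"
    return ch

print("".join(escape_char(c) for c in "A \x1f\u0104"))
```

Note that a plain space is category Zs, so it comes out as `\x20`, which matches the ISO-8859-6 dump above.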
I chose, perhaps not entirely necessarily, to keep the source ASCII, and encode any non-ASCII character.
I know Dart code is always UTF-8, but if I am reading this file at all, I prefer being able to tell the difference between "Α" and "A".
(The former is upper-case Greek Alpha, U+0391, ISO-8859-7 char 0xC1, and the latter is the "plain" latin A, U+0041.)
The "ءآأؤإئابةتثجحخدذرزسشصضطظعغ" above also contains some right-to-left characters, which makes visually parsing the file even harder.
So, the options are:
- Optimize for size by using the actual code point for every character, not an escape.
- Optimize for size, to some degree, by using all visible characters, and escaping non-visible characters.
- Optimize for "readability", to some degree, by using all visible ASCII characters, and escaping the rest.
- Optimize for parsability by escaping everything with same-size escapes, \u0061 for A.
Any option "works", because only the final String constant content matters. The rest only matters if someone reads the source of a generated file, instead of looking at the linked file that has the same information in a much more readable way.
Not a lot to make a data-driven choice from.
I don't think size is relevant to optimize for. It's not a big download no matter what we do, and plain ASCII compresses well.
I ended up choosing the second to last option. Mainly because I was looking at the file myself while generating it, and that's the step I reached before I was satisfied.
But escaping everything would have been sufficient (and simpler, I wouldn't need to start counting entries per line for that.)
Having \n -> r'\n' as a special character is really more of a gimmick. It's there, but ... nobody cares, because it's always in the same position in every table anyway. (And who even knows what \f and \v do? I looked it up.)
I don't feel strongly about this. One advantage of using the actual characters instead of escapes is that you can visually see what types of characters you are encoding. But do this how you like.
If you are interested in how other languages do this, Python uses your approach but has one character per line with a comment like:
...
'\x1e' # 0x1E -> RECORD SEPARATOR
'\x1f' # 0x1F -> UNIT SEPARATOR
' ' # 0x20 -> SPACE
'!' # 0x21 -> EXCLAMATION MARK
'"' # 0x22 -> QUOTATION MARK
...
Go does something similar.
One difference is that they do one code page per file so it is less obnoxious to use a lot of vertical space for each encoding.
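For illustration, lines in that per-character style can be produced mechanically; a small Python sketch (the `table_line` helper is hypothetical, and control characters get a generic placeholder instead of the Unicode 1.0 names like "UNIT SEPARATOR" that Python's generated codecs show):

```python
import unicodedata

def table_line(cp: int, mapped: str) -> str:
    # One table entry per line, codec-generator style: the (possibly
    # escaped) literal, then a comment with the byte value and the
    # character's Unicode name.
    if mapped.isprintable():
        literal = mapped
    else:
        literal = f"\\x{ord(mapped):02x}"
    name = unicodedata.name(mapped, "UNDEFINED")
    return f"    '{literal}'  # 0x{cp:02X} -> {name}"

print(table_line(0x21, "!"))  # → "    '!'  # 0x21 -> EXCLAMATION MARK"
```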
I think I have a slight preference for the current approach of escaping non-ASCII.
I'd be on board with other patterns like escaping everything in favor of a simpler generator.
/// Find the number of characters that fits in a single-line string literal.
///
/// Finds the largest number of characters starting from [from],
/// which is a power of 2, and where the characters' representation,
Why do you want to limit this to powers-of-2?
Mostly for a minuscule amount of readability. Which likely isn't really needed.
I started just doing eight \uHHHH chars per line, which works, and ensures that you can visually find the individual characters in the mapping if you really want to.
But it's so big, especially for the ASCII part, so I tried to reduce the size by using \xHH encodings and visible ASCII chars. It's entirely a source-size-and-readability thing; it shouldn't make any difference to the string.
But, to make it still somewhat possible to eyeball what character is at which position, I wanted the lines to all be power-of-2 aligned.
Or really, I wanted them to have lengths that are powers of 2 and be aligned by their size.
I think I might have gotten the latter, just because of how things end up being encoded.
(... no, 8859-4 has a 16-character line starting at 0xD8.)
If we don't care about the readability of the generated file, you can do this however you want.
Yeah. I'm coming around to that too.
In which case anything other than repeated \uHHHH entries is a waste of developer time and complexity budget. (I'll still put 8 of them on each line, to get a little alignment. I like alignment. Maybe there's even room for a // 0xc0..0xc7 comment. And there I go trying for readability again. Because debugging ...? Don't debug generated code, debug the generator!)
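That plan (eight \uHHHH escapes per line, each tagged with a byte-range comment) is mechanical enough to sketch; this hypothetical Python fragment shows the shape of the output, not the actual generator:

```python
def emit_lines(mapping: str) -> list[str]:
    # Eight \uHHHH escapes per line, each line tagged with the byte
    # range it covers, e.g. "// 0xc0..0xc7".
    lines = []
    for start in range(0, len(mapping), 8):
        chunk = mapping[start:start + 8]
        escapes = "".join(f"\\u{ord(c):04x}" for c in chunk)
        end = start + len(chunk) - 1
        lines.append(f"    '{escapes}' // 0x{start:02x}..0x{end:02x}")
    return lines

print(emit_lines("ABCDEFGHIJKLMNOP")[1])
```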
///
/// This is the authoritative mapping between ISO-8859-2 and
/// Unicode text, as [specified by the Unicode Consortium](
/// https://unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT).
Bonus points for putting the table version here. You can get that from the mapping file:
# Table version: 2.0
Oooh, nice.
/// Shared HTTP client used by [fetchTable], initialized if needed.
///
/// Closed by [generateTables] when done fetching.
HttpClient? httpClient;
Could you use package:http for this?
I could, but adding more dependencies isn't something I do lightly.
In this case, it's a dev-dependency, so it only affects ourselves.
(Also, I don't actually have experience with package:http, so I just used what I know.)
I don't have strong feelings about this but we are probably going to ask people to not use HttpClient directly in the next couple of months so, if we can avoid using it ourselves, that would be ideal.
That is a very good reason. And again, it's a dev-dependency, so nobody else should care.
What is our reasoning for avoiding HttpClient when the code is known to only be used on platforms that support it?
> What is our reasoning for avoiding HttpClient when the code is known to only be used on platforms that support it?

Developers will assume that our code reflects best practices so, if we use HttpClient, then other people will assume that they should too. In my mind, there are very few reasons to use HttpClient directly.
IMO keeping core package dependencies as flat as possible should weigh higher than making our code look like recommended app code. I do think that having more sample code will make it less likely authors will need to go looking in packages like this to find a pattern to copy/paste.
> IMO keeping core package dependencies as flat as possible should weigh higher than making our code look like recommended app code. I do think that having more sample code will make it less likely authors will need to go looking in packages like this to find a pattern to copy/paste.
I don't feel very strongly about this. The dependency would only be for dev though so it wouldn't impact users.
Another possibility is that we remove the download code, and just require the user to have pre-downloaded the files to somewhere, so they can point to that directory.
This was just me making my own life easier, when I had to do the same thing multiple times, but it's unnecessary complexity when we can just ask the user to use curl first. Like:
curl -OOOOOOOOOOOOOOO https://unicode.org/Public/MAPPINGS/ISO8859/8859-{1,2,3,4,5,6,7,8,9,10,11,13,14,15,16}.TXT
I think I'll just remove the download code. It's safer and simpler.
SGTM. I like the idea of operating only on local files.
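With the download code gone, the generator only needs to parse local copies of the mapping files. A minimal Python sketch of that parse, assuming the usual two-column `0xNN<tab>0xNNNN<tab># name` layout of the unicode.org files (unmapped bytes, which lack a second column, are skipped; the function name is illustrative):

```python
def parse_mapping(text: str) -> dict[int, int]:
    # Map each byte value to its Unicode code point; comment-only lines,
    # blank lines, and entries without a mapping are ignored.
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = data.split()
        if len(fields) >= 2:
            table[int(fields[0], 16)] = int(fields[1], 16)
    return table

sample = "0x41\t0x0041\t#\tLATIN CAPITAL LETTER A\n0xA1\t#UNDEFINED\n"
print(parse_mapping(sample))  # → {65: 65}
```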
..write('const iso8859_$isoNumber = // From $uri');
// Write 8 chars per line, as Unicode escapes.
for (var i = 0; i < text.length;) {
  var charsOnLine = nextRangeLength(text, i, 74);
Would this be more readable if nextRangeLength were a generator? So you could write:
for (var next in splitTextIntoLines(text, 74)) {
  ...
}
Could work too. And could more easily keep state, which would let me ensure that every line is size-aligned, so a 16-character line will always start at a position that is a multiple of 16. (Which is what I really want, and I thought I actually got it by accident. Only almost.)
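A generator keeping that alignment invariant might look like this (in Python rather than Dart, purely to illustrate the idea; names are made up):

```python
def size_aligned_chunks(text: str, max_len: int = 16):
    # Yield chunks whose lengths are powers of two, where each chunk
    # starts at a position that is a multiple of its own length.
    i = 0
    while i < len(text):
        size = max_len
        # Shrink to the largest power-of-two size that both fits in the
        # remaining text and divides the current position.
        while size > 1 and (i % size != 0 or i + size > len(text)):
            size //= 2
        yield text[i:i + size]
        i += size

print([len(c) for c in size_aligned_chunks("x" * 26)])  # → [16, 8, 2]
```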
We could bump
About publishing the script for others to use, to automatically download a specific mapping and generate a string for it: I'm not comfortable providing a script to users which is hard-coded to use resources from a third-party, without explicit permission from that third-party. Anyone can download the files manually, but if a Google script downloads a UC web-page automatically, it's too close to doing it automatically and systematically, which is not OK. We'd be providing users a service based on other people's resources. So, either:
Either also allows users to have their own mirrors of the data files, if they are not comfortable depending on a third-party directly in their build steps.
You are still working on this, right? AFAIK, you still need to:
I am. Just got caught up on another CL (making a change for
No worries but I'll be out for the next two weeks so I'll remove myself as a reviewer.
Updated, PTAL
@@ -1,8 +1,10 @@
-## 3.1.2-dev
+## 4.0.0-dev
Did we decide on a nonbreaking release during the sync?
// for details. All rights reserved. Use of this source code is governed by a
// BSD-style license that can be found in the LICENSE file.

// Generator for tables for ISO-8859 code page encoding of Unicode text,
Should this be a doc comment on a library;?
@@ -0,0 +1,266 @@
// Copyright (c) 2023, the Dart project authors. Please see the AUTHORS file
If we want this to be usable by other packages I think it needs to be in bin/ instead of tool/.
Closing as the dart-lang/convert repository is merged into the dart-lang/core monorepo. Please re-open this PR there!
@lrhn Are you intending on reopening this in dart-lang/core?
Yeah, let's do that.
Built on top of Brian's conversion of the first three encodings.
Added a script to generate the string constants from the official table files, just to be safe.
(We can probably generate the reverse-mapping Map<int, int>, which is created when someone uses the encoding, as generated constants too. But that can be done later.)
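The reverse mapping mentioned above is straightforward to derive from a 256-character forward table; a hypothetical sketch in Python (skipping U+FFFD, which marks unmapped bytes in these tables):

```python
def reverse_mapping(forward: str) -> dict[int, int]:
    # code point -> byte value; unmapped positions (U+FFFD) are dropped.
    return {ord(ch): byte for byte, ch in enumerate(forward) if ch != "\ufffd"}

print(reverse_mapping("AB\ufffdD"))  # → {65: 0, 66: 1, 68: 3}
```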