-
Notifications
You must be signed in to change notification settings - Fork 217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix dhall-to-nix encoding of symbols with special keys #2426
Fix dhall-to-nix encoding of symbols with special keys #2426
Conversation
I have to say that I pretty much implemented this blindly, I don’t know if there’s any golden tests that have to be corrected now, or if I missed any symbols in the AST. |
I kept as closely to the GHC version of z-encoding as possible, most of the code is copied verbatim (except I switched the function to |
93b0d1c
to
82d5b93
Compare
Oh, another problem I found is that sum keys can be the same nearly arbitrary symbols in Dhall, but the nix representation of anonymous function keys can not: The dhall <Foo | Bar/Baz>.Bar/Baz generates the nix code { Bar/Baz, Foo }:
Bar/Baz but nix function attribute arguments cannot be quoted, and cannot contain |
82d5b93
to
791d9e8
Compare
I added a second commit which should fix this problem. Now only function arguments and let-bindings should be z-encoded anymore, because those can’t really be reduced any further. Maybe there could be an encoding via a record field, but that would impact performance. |
I haven’t found a way to make a |
And another one, we also need to quote field selectors. |
8212381
to
8fbdaf7
Compare
I don't really use Dhall anymore, so don't have much input here. It looks generally fine, but I haven't given it a thorough review. I'm probably not the right person to be reviewing, so I'm untagging myself. |
8fbdaf7
to
1313716
Compare
1313716
to
2e26784
Compare
I totally forgot about this; @Gabriella439 I think I addressed your comments, PTAL |
9aab03b
to
5e3a407
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Just some tiny suggestions
then StaticKey (VarName key) | ||
else DynamicKey (Plain (DoubleQuoted [Plain key])) | ||
where | ||
simpleChar c = isAsciiLower c || isAsciiUpper c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestion: I think this makes the intention more clear
simpleChar c = isAsciiLower c || isAsciiUpper c | |
simpleChar c = isAscii c && isLetter c |
This would also require a matching change to the imports
Edit: Fixed to use &&
instead of ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I … don’t think all ascii values are a valid nix symbol. It includes stuff like 0x04 (Ctrl+c); and I’d have to verify if they are allowed to start with a digit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example
nix-repl> { 1 = "foo"; }
error: syntax error, unexpected INT
at «string»:1:3:
1| { 1 = "foo"; }
| ^
2|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, I meant to say isAscii c && isLetter c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But isAscii
still includes all control characters for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i.e.
-- | Selects the first 128 characters of the Unicode character set,
-- corresponding to the ASCII character set.
isAscii :: Char -> Bool
isAscii c = c < '\x80'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please read my responses more closely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, using &&
… at least to me filtering out by boolean and is not more intuitive than set logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason I suggested using isLetter
is that it seems more direct than specifying that something is a letter by saying that it is uppercase or lowercase. What you've written reads to me like:
simpleChar c = (isAscii c && isUpper c) || (isAscii c && isLower c)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, coming back to this, apart from the aesthetics, isLetter
needs support for unicode ICU (it’s not just a simple ord
check):
isLetter :: Char -> Bool
isLetter c = case generalCategory c of
GeneralCategory
UppercaseLetter -> True
LowercaseLetter -> True
TitlecaseLetter -> True
ModifierLetter -> True
OtherLetter -> True
_ -> False
generalCategory :: Char -> GeneralCategory
generalCategory c = toEnum $ fromIntegral $ wgencat $ fromIntegral $ ord c
foreign import ccall unsafe "u_gencat"
wgencat :: Int -> Int
vs
isAsciiLower :: Char -> Bool
isAsciiLower c = c >= 'a' && c <= 'z'
( isAsciiLower c | ||
|| isAsciiUpper c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps reuse the simpleChar
function here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I’d have to think about whether they are semantically the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, thinking about it they are not the same, since we either double quote where necessary (using the simpleChar
filter), or z-encode where necessary (internal symbols that cannot be quoted). In particular, z-encoded values should never land inside a struct for example.
I think this made me notice a place where I’d forgotten the encoding, namely Project
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is that right now simpleChar c
is defined as isAsciiLower c || isAsciiUpper c
, so it seems like you should be able to replace isAsciiLower c || isAsciiUpper c
right here with simpleChar c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still feel like you can reuse the simpleChar
function here:
needsEncoding c = not (simpleChar c)
-- @ | ||
-- Before After | ||
-- -------------------------- | ||
-- Trak Trak | ||
-- foo-wib foozmwib | ||
-- \> zg | ||
-- \>1 zg1 | ||
-- foo\# foozh | ||
-- foo\#\# foozhzh | ||
-- foo\#\#1 foozhzh1 | ||
-- fooZ fooZZ | ||
-- :+ ZCzp | ||
-- @ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you begin each line with >
then you don't need to escape special characters like \#
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the change I'm referring to:
-- @ | |
-- Before After | |
-- -------------------------- | |
-- Trak Trak | |
-- foo-wib foozmwib | |
-- \> zg | |
-- \>1 zg1 | |
-- foo\# foozh | |
-- foo\#\# foozhzh | |
-- foo\#\#1 foozhzh1 | |
-- fooZ fooZZ | |
-- :+ ZCzp | |
-- @ | |
-- > Before After | |
-- > -------------------------- | |
-- > Trak Trak | |
-- > foo-wib foozmwib | |
-- > > zg | |
-- > >1 zg1 | |
-- > foo# foozh | |
-- > foo## foozhzh | |
-- > foo##1 foozhzh1 | |
-- > fooZ fooZZ | |
-- > :+ ZCzp |
The fix unfortunately does not apply anymore. I don’t know if it was fixed on HEAD in the meantime |
This is not fixed in |
Symbols in nix can only consist of a very restricted amount of characters, whereas in dhall they can be basically anything. So let’s use an encoding scheme similar to what GHC uses to generate C symbols. Code slightly changed (some GHC-specific cases removed). I might have missed some cases of dhall symbols that are translated verbatim.
Symbols in nix can only consist of a very restricted amount of characters, whereas in dhall they can be basically anything. When you want to get a value of a union, before it was generated into ``` { Foo, Bar }: Foo ``` where `Foo` cannot be a complex symbol like `{ Foo/Baz, Bar }: Foo/Baz`, because nix does not allow it in anonymous record arguments. So now we generate it as ``` u: u."Foo/Baz" ``` which should always work and is equal (though it loses the information of what other fields are there in the nix code). Before I faultily encoded some of these symbols with a Z-encoding, but that was wrong, so it was undone.
Another one I missed, when you have a field selector, you want to quote it, in case it has some symbols nix does not know how to handle. `x.Foo/bar` will now be `x."Foo/bar"`, which is valid nix.
We copied the Z-encoding functions from GHC more or less verbatim, but we can rewrite it in terms of `Text.concatMap`, which should perform better. In this first step, we change all the helper functions from `Char -> String` to `Char -> Text`, and apply hlint warnings.
We can drop the extra `any needsEncoding` check, since it should be performant enough on its own when using `Text.concatMap` and simplifies the code a bit.
A small improvement in the generation logic. The hnix should really just do this for us.
5e3a407
to
7a7e767
Compare
Rebased on master |
Is this intended to stay broken? |
Fixes #2425