Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix dhall-to-nix encoding of symbols with special keys #2426

Merged

Conversation

Profpatsch
Copy link
Member

Fixes #2425

@Profpatsch
Copy link
Member Author

I have to say that I pretty much implemented this blindly, I don’t know if there’s any golden tests that have to be corrected now, or if I missed any symbols in the AST.

@Profpatsch
Copy link
Member Author

I kept as closely to the GHC version of z-encoding as possible, most of the code is copied verbatim (except I switched the function to Text and added a check that skips the conversion to String if there is no characters that have to be escaped).

@Profpatsch Profpatsch force-pushed the fix-dhall-to-nix-key-encoding branch from 93b0d1c to 82d5b93 Compare June 19, 2022 02:09
@Profpatsch
Copy link
Member Author

Oh, another problem I found is that sum keys can be the same nearly arbitrary symbols in Dhall, but the nix representation of anonymous function keys can not:

The dhall

<Foo | Bar/Baz>.Bar/Baz

generates the nix code

{ Bar/Baz, Foo }:
  Bar/Baz

but nix function attribute arguments cannot be quoted, and cannot contain / for example. So we are gonna have to Z-Encode every union key in the generated nix.

@Profpatsch Profpatsch force-pushed the fix-dhall-to-nix-key-encoding branch from 82d5b93 to 791d9e8 Compare June 19, 2022 03:42
@Profpatsch
Copy link
Member Author

I added a second commit which should fix this problem.

Now only function arguments and let-bindings should be z-encoded anymore, because those can’t really be reduced any further. Maybe there could be an encoding via a record field, but that would impact performance.

@Profpatsch
Copy link
Member Author

I haven’t found a way to make a let binding appear in the nix code though, maybe that is not even possible? They always normalize out in my examples and never appear in the beta-reduced output.

@Profpatsch
Copy link
Member Author

And another one, we also need to quote field selectors.

@Profpatsch Profpatsch force-pushed the fix-dhall-to-nix-key-encoding branch from 8212381 to 8fbdaf7 Compare June 19, 2022 20:19
@ocharles
Copy link
Member

I don't really use Dhall anymore, so don't have much input here. It looks generally fine, but I haven't given it a thorough review. I'm probably not the right person to be reviewing, so I'm untagging myself.

@ocharles ocharles requested review from ocharles and removed request for ocharles June 20, 2022 07:14
@Profpatsch Profpatsch force-pushed the fix-dhall-to-nix-key-encoding branch from 8fbdaf7 to 1313716 Compare June 20, 2022 11:53
dhall-nix/src/Dhall/Nix.hs Outdated Show resolved Hide resolved
dhall-nix/src/Dhall/Nix.hs Outdated Show resolved Hide resolved
@Profpatsch Profpatsch force-pushed the fix-dhall-to-nix-key-encoding branch from 1313716 to 2e26784 Compare December 25, 2022 18:28
@Profpatsch
Copy link
Member Author

I totally forgot about this; @Gabriella439 I think I addressed your comments, PTAL

@Profpatsch Profpatsch force-pushed the fix-dhall-to-nix-key-encoding branch from 9aab03b to 5e3a407 Compare December 25, 2022 21:08
Copy link
Collaborator

@Gabriella439 Gabriella439 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Just some tiny suggestions

then StaticKey (VarName key)
else DynamicKey (Plain (DoubleQuoted [Plain key]))
where
simpleChar c = isAsciiLower c || isAsciiUpper c
Copy link
Collaborator

@Gabriella439 Gabriella439 Dec 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion: I think this makes the intention more clear

Suggested change
simpleChar c = isAsciiLower c || isAsciiUpper c
simpleChar c = isAscii c && isLetter c

This would also require a matching change to the imports

Edit: Fixed to use && instead of ||

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I … don’t think all ascii values are a valid nix symbol. It includes stuff like 0x04 (Ctrl+c); and I’d have to verify if they are allowed to start with a digit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example

nix-repl> { 1 = "foo"; }
error: syntax error, unexpected INT

       at «string»:1:3:

            1| { 1 = "foo"; }
             |   ^
            2|

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, I meant to say isAscii c && isLetter c

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But isAscii still includes all control characters for example

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e.

-- | Selects the first 128 characters of the Unicode character set,
-- corresponding to the ASCII character set.
isAscii :: Char -> Bool
isAscii c = c < '\x80'

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please read my responses more closely

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, using && … at least to me filtering out by boolean and is not more intuitive than set logic.

Copy link
Collaborator

@Gabriella439 Gabriella439 Dec 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason I suggested using isLetter is that it seems more direct than specifying that something is a letter by saying that it is uppercase or lowercase. What you've written reads to me like:

simpleChar c = (isAscii c && isUpper c) || (isAscii c && isLower c)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, coming back to this, apart from the aesthetics, isLetter needs support for unicode ICU (it’s not just a simple ord check):

isLetter :: Char -> Bool
isLetter c = case generalCategory c of
 GeneralCategory
 UppercaseLetter -> True
 LowercaseLetter -> True
 TitlecaseLetter -> True
 ModifierLetter -> True
 OtherLetter -> True
 _ -> False

generalCategory :: Char -> GeneralCategory
generalCategory c = toEnum $ fromIntegral $ wgencat $ fromIntegral $ ord c

foreign import ccall unsafe "u_gencat"
 wgencat :: Int -> Int

vs

isAsciiLower :: Char -> Bool
isAsciiLower c = c >= 'a' && c <= 'z'

Comment on lines +812 to +813
( isAsciiLower c
|| isAsciiUpper c
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps reuse the simpleChar function here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I’d have to think about whether they are semantically the same.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, thinking about it they are not the same, since we either double quote where necessary (using the simpleChar filter), or z-encode where necessary (internal symbols that cannot be quoted). In particular, z-encoded values should never land inside a struct for example.

I think this made me notice a place where I’d forgotten the encoding, namely Project.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is that right now simpleChar c is defined as isAsciiLower c || isAsciiUpper c, so it seems like you should be able to replace isAsciiLower c || isAsciiUpper c right here with simpleChar c

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still feel like you can reuse the simpleChar function here:

needsEncoding c = not (simpleChar c)

Comment on lines +787 to +799
-- @
-- Before After
-- --------------------------
-- Trak Trak
-- foo-wib foozmwib
-- \> zg
-- \>1 zg1
-- foo\# foozh
-- foo\#\# foozhzh
-- foo\#\#1 foozhzh1
-- fooZ fooZZ
-- :+ ZCzp
-- @
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you begin each line with > then you don't need to escape special characters like \#

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the change I'm referring to:

Suggested change
-- @
-- Before After
-- --------------------------
-- Trak Trak
-- foo-wib foozmwib
-- \> zg
-- \>1 zg1
-- foo\# foozh
-- foo\#\# foozhzh
-- foo\#\#1 foozhzh1
-- fooZ fooZZ
-- :+ ZCzp
-- @
-- > Before After
-- > --------------------------
-- > Trak Trak
-- > foo-wib foozmwib
-- > > zg
-- > >1 zg1
-- > foo# foozh
-- > foo## foozhzh
-- > foo##1 foozhzh1
-- > fooZ fooZZ
-- > :+ ZCzp

@Profpatsch
Copy link
Member Author

The fix unfortunately does not apply anymore. I don’t know if it was fixed on HEAD in the meantime

@Gabriella439
Copy link
Collaborator

This is not fixed in HEAD

Symbols in nix can only consist of a very restricted amount of
characters, whereas in dhall they can be basically anything.

So let’s use an encoding scheme similar to what GHC uses to generate C
symbols. Code slightly changed (some GHC-specific cases removed).

I might have missed some cases of dhall symbols that are translated
verbatim.
Symbols in nix can only consist of a very restricted amount of
characters, whereas in dhall they can be basically anything.

When you want to get a value of a union, before it was generated into

```
{ Foo, Bar }: Foo
```

where `Foo` cannot be a complex symbol like `{ Foo/Baz, Bar }:
Foo/Baz`, because nix does not allow it in anonymous record arguments.

So now we generate it as

```
u: u."Foo/Baz"
```

which should always work and is equal (though it loses the information
of what other fields are there in the nix code).

Before I faultily encoded some of these symbols with a Z-encoding, but
that was wrong, so it was undone.
Another one I missed, when you have a field selector, you want to
quote it, in case it has some symbols nix does not know how to handle.

`x.Foo/bar`  will now be `x."Foo/bar"`, which is valid nix.
We copied the Z-encoding functions from GHC more or less verbatim, but
we can rewrite it in terms of `Text.concatMap`, which should perform
better.

In this first step, we change all the helper functions from `Char ->
String` to `Char -> Text`, and apply hlint warnings.
We can drop the extra `any needsEncoding` check, since it should be
performant enough on its own when using `Text.concatMap` and
simplifies the code a bit.
A small improvement in the generation logic.
The hnix should really just do this for us.
@Profpatsch Profpatsch force-pushed the fix-dhall-to-nix-key-encoding branch from 5e3a407 to 7a7e767 Compare February 15, 2024 09:46
@Profpatsch
Copy link
Member Author

Rebased on master

@Profpatsch
Copy link
Member Author

Is this intended to stay broken?

@Gabriella439 Gabriella439 merged commit 4305dab into dhall-lang:main Aug 21, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dhall-nix Use z-encoding for nix symbol names
3 participants