Reevaluate JSON schema converter behaviour when node lacks type keyword #135

Open
414owen opened this issue Dec 20, 2024 · 1 comment

414owen commented Dec 20, 2024

JSON schema nodes without the type keyword are valid, but I think we currently misinterpret the semantics.

We compile

{
  "properties": {
    "a": {
      "const": "a"
    }
  }
}

to

\{("a":"a")?\}

However, according to this thread, numbers, strings, and other JSON types are also valid here. The properties keyword doesn't restrict the type of the node; it only says that if the node is an object, its fields must validate against the corresponding entries in properties.

Frankly, the thread is a little confusing, and the JSON schema spec itself seems ambiguous, but there does seem to be consensus among those concerned and across validators.
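To make the difference concrete, here is a minimal sketch in Python. The `is_valid` function is a hypothetical toy validator written for this comment (it only understands `const` and `properties`), not the converter's code or a real validator library. It follows the reading described above, where `properties` is ignored for non-object instances, and compares that against the regex we currently emit.

```python
import json
import re

def is_valid(schema: dict, instance) -> bool:
    """Toy validator covering only `const` and `properties` (illustrative)."""
    if "const" in schema and instance != schema["const"]:
        return False
    # `properties` only constrains instances that are objects; others pass.
    if "properties" in schema and isinstance(instance, dict):
        for key, subschema in schema["properties"].items():
            # Absent keys are fine: `properties` does not imply `required`.
            if key in instance and not is_valid(subschema, instance[key]):
                return False
    return True

schema = {"properties": {"a": {"const": "a"}}}

# The regex the converter currently produces for this schema.
CURRENT_REGEX = re.compile(r'\{("a":"a")?\}')

for instance in ["foo", 42, {}, {"a": "a"}, {"a": "b"}, {"b": 1}]:
    text = json.dumps(instance, separators=(",", ":"))
    accepted_by_regex = CURRENT_REGEX.fullmatch(text) is not None
    print(text, is_valid(schema, instance), accepted_by_regex)
```

Under this reading, `"foo"`, `42`, and `{"b":1}` are all valid instances, yet the current regex rejects each of them.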

@torymur @dpsimpson

414owen commented Dec 21, 2024

I want to highlight a couple of quirks/features of the JSON schema spec, starting from the fact that JSON schema keywords add constraints, so it's possible to write a schema like this:

{
  "type": "string",
  "minLength": 5,
  "maxLength": 10,
  "pattern": "hello world"
}

Currently we support:

  • the "type" constraint
  • either the "pattern" constraint or the "minLength"/"maxLength" constraints, but not both, and only when the "type" field is present, even though the spec doesn't require "type" for these constraints to apply.

In reality, this JSON schema represents the conjunction of all four keyword constraints: "type", "minLength", "maxLength", and "pattern". An instance is valid only if it satisfies every one of them.

Now, no JSON value satisfies all of these constraints, because any string matching "hello world" is at least 11 characters long, which exceeds the maxLength of 10. One way to model this is with a regex that can never match, such as (?!x)x, although since the aim is really to power LLM inference, throwing an error at schema compile time would probably be more apt.
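A rough Python sketch of both options. `check_string_schema` and `REGEX_META` are hypothetical names made up for this comment, and the check only covers the two easy cases (minLength greater than maxLength, and a metacharacter-free literal pattern longer than maxLength). Detecting unsatisfiable schemas in general is much harder.

```python
import re

# A regex that can never match: the lookahead forbids the very next atom.
NEVER = re.compile(r"(?!x)x")

# Characters that make a pattern more than a plain literal (assumption:
# anything without these is treated as a literal substring).
REGEX_META = set(r".^$*+?{}[]\|()")

def check_string_schema(schema: dict) -> None:
    """Hypothetical compile-time check; raises on two obvious conflicts."""
    lo = schema.get("minLength", 0)
    hi = schema.get("maxLength")
    if hi is not None and lo > hi:
        raise ValueError(f"minLength {lo} exceeds maxLength {hi}")
    pattern = schema.get("pattern")
    is_literal = pattern is not None and not (set(pattern) & REGEX_META)
    if is_literal and hi is not None and len(pattern) > hi:
        raise ValueError(
            f"pattern {pattern!r} needs at least {len(pattern)} characters, "
            f"but maxLength is {hi}"
        )

schema = {"type": "string", "minLength": 5, "maxLength": 10,
          "pattern": "hello world"}
try:
    check_string_schema(schema)
except ValueError as err:
    print("unsatisfiable schema:", err)
```

Failing early like this seems friendlier than silently handing the model a regex with no valid completions.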

In any case, it would be nice to support more combinations of constraints (e.g. minLength + pattern), as well as having fewer dependencies between constraints (e.g. maxLength without type).

Obviously we'll never be able to support all of the JSON schema spec (e.g. the multipleOf constraint), but maybe we can support the conjunction of multiple constraints at the syntactic level, using lookahead groups (a regex can contain several of them)?

As an example, the constraints for the above schema could be represented as:

(?="[^"]*hello world)"[^"]{5,10}"
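Here is a small Python sketch of the lookahead idea. `string_schema_to_regex` is a hypothetical helper, not existing converter code, and it makes several simplifying assumptions: string contents are modelled as `[^"]` (no escape sequences), the pattern is unanchored and never matches a quote character, and the target regex engine supports lookahead (Python's `re` does; other engines may not).

```python
import re

def string_schema_to_regex(schema: dict) -> str:
    """Hypothetical helper: combine pattern and length bounds via lookahead."""
    lo = schema.get("minLength", 0)
    hi = schema.get("maxLength")
    bounds = f"{{{lo},{hi}}}" if hi is not None else f"{{{lo},}}"
    body = f'"[^"]{bounds}"'
    pattern = schema.get("pattern")
    if pattern is None:
        return body
    # JSON Schema patterns are unanchored, so allow any prefix inside the
    # string before the pattern; the body then enforces the length bounds.
    return f'(?="[^"]*(?:{pattern})){body}'

# A satisfiable variant of the schema above (pattern shortened to "hello").
ok = re.compile(string_schema_to_regex(
    {"type": "string", "minLength": 5, "maxLength": 10, "pattern": "hello"}))
for candidate in ['"hello"', '"say hello"', '"hi there"', '"hello world!"']:
    print(candidate, ok.fullmatch(candidate) is not None)

# The original schema: the generated regex accepts nothing of length <= 10.
bad = re.compile(string_schema_to_regex(
    {"type": "string", "minLength": 5, "maxLength": 10,
     "pattern": "hello world"}))
print(bad.fullmatch('"hello world"') is not None)
```

Each additional constraint could become another lookahead in front of the same body, which is what makes this composable, at the cost of requiring an engine that supports lookaround.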

It seems like a pretty big design space to explore.
