more preprocessing described in docs
simonmandlik committed Oct 26, 2024
1 parent 564dc06 commit cfe0a0f
Showing 6 changed files with 167 additions and 25 deletions.
8 changes: 4 additions & 4 deletions docs/src/examples/mutagenesis/mutagenesis.ipynb
@@ -38,9 +38,9 @@
"Status `~/.julia/dev/JsonGrinder/docs/src/examples/mutagenesis/Project.toml`\n",
" [587475ba] Flux v0.14.22\n",
" [682c06a0] JSON v0.21.4\n",
" [d201646e] JsonGrinder v2.5.1 `../../../..`\n",
" [d201646e] JsonGrinder v2.5.2 `../../../..`\n",
" [f1d291b0] MLUtils v0.4.4\n",
" [1d0525e4] Mill v2.10.5\n"
" [1d0525e4] Mill v2.10.6\n"
]
}
],
@@ -472,11 +472,11 @@
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.11.0"
"version": "1.11.1"
},
"kernelspec": {
"name": "julia-1.11",
"display_name": "Julia 1.11.0",
"display_name": "Julia 1.11.1",
"language": "julia"
}
},
10 changes: 5 additions & 5 deletions docs/src/examples/recipes/recipes.ipynb
@@ -38,10 +38,10 @@
" Activating project at `~/.julia/dev/JsonGrinder/docs/src/examples/recipes`\n",
"Status `~/.julia/dev/JsonGrinder/docs/src/examples/recipes/Project.toml`\n",
" [587475ba] Flux v0.14.22\n",
" [0f8b85d8] JSON3 v1.14.0\n",
" [d201646e] JsonGrinder v2.5.1 `../../../..`\n",
" [0f8b85d8] JSON3 v1.14.1\n",
" [d201646e] JsonGrinder v2.5.2 `../../../..`\n",
" [f1d291b0] MLUtils v0.4.4\n",
" [1d0525e4] Mill v2.10.5\n",
" [1d0525e4] Mill v2.10.6\n",
" [0b1bfda6] OneHotArrays v0.2.5\n"
]
}
@@ -551,11 +551,11 @@
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.11.0"
"version": "1.11.1"
},
"kernelspec": {
"name": "julia-1.11",
"display_name": "Julia 1.11.0",
"display_name": "Julia 1.11.1",
"language": "julia"
}
},
5 changes: 5 additions & 0 deletions docs/src/manual/extraction.md
@@ -60,6 +60,11 @@ Applying `e` on the first JSON document yields the following hierarchy of
x = e(jss[1])
```

!!! ukn "Consistent preprocessing"
If any preprocessing was performed for input documents as for example discussed in
[Preprocessing](@ref) make sure to apply the same preprocessing before passing
documents to any [`Extractor`](@ref) as well!

!!! ukn "Missing key"
Note that we didn't include any extractor for the `"siblings"` key. In such case, the
key in the JSON document is simply ignored and never extracted.
64 changes: 62 additions & 2 deletions docs/src/manual/schema_inference.md
@@ -127,7 +127,7 @@ Similarly, [`JsonGrinder.jl`](https://github.com/CTUAvastLab/JsonGrinder.jl) als
that are too long before saving them to schema. This can be governed with the
[`JsonGrinder.max_string_length`](@ref) parameter.

## Unstable schema
## Preprocessing

Sometimes, input JSON documents do not adhere to a stable schema, which for example happens if one
key has children of multiple different types in different documents. An example would be:
@@ -148,7 +148,67 @@ schema(jss)

Should this happen, we recommend to deal with such cases by suitable preprocessing.

## Null values
### Mapping paths

Assume that input documents contain information about port numbers, some of which are encoded as
integers and some of which as strings:

```@repl schema
jss = [
""" {"ports": [70, 80, 443], "protocol": "TCP" } """,
""" {"ports": ["22", "80", "500"], "protocol": "UDP" } """,
]
```
```@repl schema
schema(JSON.parse, jss)
```

We recommend to deal with these cases using optic approach from
[`Accessors.jl`](https://juliaobjects.github.io/Accessors.jl/stable/), available also as
`JsonGrinder: Accessors`. We can use `Accessors.modify` to modify the problematic paths,
turning all into `String`s:

```@example schema
using Accessors
```
```@repl schema
f = js -> Accessors.modify(string, js, @optic _["ports"][∗])
f.(JSON.parse.(jss))
schema(f ∘ JSON.parse, jss)
```

or parsing them as `Integer`s:

```@repl schema
schema(jss) do doc
js = JSON.parse(doc)
Accessors.modify(x -> x isa Integer ? x : parse(Int, x), js, @optic _["ports"][∗])
end
```

!!! ukn "Writing `∗`"
Asterisk for selecting all elements of the array (`∗`) is not the standard star (`*`), but is
written as `\ast<TAB>` in Julia REPL, see also [`Accessors.jl`
docstrings](https://juliaobjects.github.io/Accessors.jl/stable/docstrings/).

We can also get rid of this path completely with `Accessors.delete`:

```@repl schema
schema(jss) do doc
Accessors.delete(JSON.parse(doc), @optic _["ports"])
end
```

If [`JSON3`](https://github.com/quinnj/JSON3.jl) is used for parsing, it uses `Symbol`s for keys
in objects instead of `String`s so make sure to use `Symbol`s:

```@repl schema
using JSON3
Accessors.delete(JSON3.read(""" {"port": 1} """), @optic _["port"])
Accessors.delete(JSON3.read(""" {"port": 1} """), @optic _[:port])
```

### Null values

In the current version, [`JsonGrinder.jl`](https://github.com/CTUAvastLab/JsonGrinder.jl) does not
support `null` values in JSON documents (represented as `nothing` in Julia):
2 changes: 1 addition & 1 deletion src/preprocessing.jl
@@ -1,7 +1,7 @@
"""
remove_nulls(js)
Returns a new document in which all `null` values (represented as `nothing` in julia) are removed.
Return a new document in which all `null` values (represented as `nothing` in julia) are removed.
# Examples
```jldoctest
Expand Down
103 changes: 90 additions & 13 deletions test/schema.jl
@@ -148,10 +148,10 @@ end
@testset "Consistency 2" begin
jss = map(parsef(), [
""" {} """,
""" { "a": [] } """,
""" { "a": [1, 2] } """,
""" { "b": {} } """,
""" { "b": { "c": "foo" } } """,
""" { "a": []} """,
""" { "a": [1, 2]} """,
""" { "b": {}} """,
""" { "b": { "c": "foo"}} """,
])

test_permutations_merging(jss)
@@ -309,8 +309,8 @@ end
end

@testset "`nothing` values" begin
jss1 = [ """ {"a": "test" } """, """ {} """ ]
jss2 = [ """ {"a": "test" } """, """ {"a": null } """ ]
jss1 = [ """ {"a": "test"} """, """ {} """ ]
jss2 = [ """ {"a": "test"} """, """ {"a": null} """ ]
@test schema(parsef(), jss1) == schema(remove_nulls ∘ parsef(), jss2)
@test_throws NullValues schema(parsef(), jss2)

@@ -322,9 +322,9 @@ end
@test_throws NullValues schema(parsef(), jss2)
@test_throws NullValues schema(parsef(), jss3)

jss1 = [ """ {} """, """ {"a": {"b": 1 }} """, """{"a": {}}""" ]
jss2 = [ """ {"a": {"b": 1 }} """, """ {"a": null } """, """ {"a": {"b": null }} """ ]
jss3 = [ """ {"a": null } """, """ {"a": {"b": 1 }} """, """ {"a": {"b": null }} """ ]
jss1 = [ """ {} """, """ {"a": {"b": 1}} """, """{"a": {}}""" ]
jss2 = [ """ {"a": {"b": 1}} """, """ {"a": null} """, """ {"a": {"b": null}} """ ]
jss3 = [ """ {"a": null} """, """ {"a": {"b": 1}} """, """ {"a": {"b": null}} """ ]
@test schema(parsef(), jss1) == schema(remove_nulls ∘ parsef(), jss2) ==
schema(remove_nulls ∘ parsef(), jss3)
@test_throws NullValues schema(parsef(), jss2)
@@ -338,17 +338,94 @@ end
@test_throws NullValues schema(parsef(), jss2)
@test_throws NullValues schema(parsef(), jss3)

jss1 = [ """ {"a": [ {"b": 1 }, {}]} """ ]
jss2 = [ """ {"a": [ {"b": 1 }, {"b": null }]} """ ]
jss1 = [ """ {"a": [ {"b": 1}, {}]} """ ]
jss2 = [ """ {"a": [ {"b": 1}, {"b": null}]} """ ]
@test schema(parsef(), jss1) == schema(remove_nulls ∘ parsef(), jss2)
@test_throws NullValues schema(parsef(), jss2)

jss1 = [ """ {"a": [ {}, {"b": {"c": 1 }}, {"b": {}}]} """ ]
jss2 = [ """ {"a": [ {"b": null }, {"b": {"c": null }}, {"b": {"c": 1 }}]} """ ]
jss1 = [ """ {"a": [ {}, {"b": {"c": 1}}, {"b": {}}]} """ ]
jss2 = [ """ {"a": [ {"b": null}, {"b": {"c": null}}, {"b": {"c": 1}}]} """ ]
@test schema(parsef(), jss1) == schema(remove_nulls ∘ parsef(), jss2)
@test_throws NullValues schema(parsef(), jss2)
end

@testset "modifying paths" begin
for jss in (
[
""" {"a": "80"} """,
""" {"a": 80} """,
],
[
""" {"a": "80", "b": "foo"} """,
""" {"a": 80, "b": "bar"} """,
]
)
@test_throws InconsistentSchema schema(parsef(), jss)

pf = JSON.parse
f = js -> Accessors.modify(string, js, @optic _["a"])
sch = schema(f ∘ pf, jss)
@test sch[:a] isa LeafEntry{String}
f = js -> Accessors.modify(x -> x isa Integer ? x : parse(Int, x), js, @optic _["a"])
sch = schema(f ∘ pf, jss)
@test sch[:a] isa LeafEntry{Real}
f = js -> Accessors.delete(js, @optic _["a"])
sch = schema(f ∘ pf, jss)
@test !haskey(sch, :a)

pf = JSON3.read
f = js -> Accessors.modify(string, js, @optic _[:a])
sch = schema(f ∘ pf, jss)
@test sch[:a] isa LeafEntry{String}
f = js -> Accessors.modify(x -> x isa Integer ? x : parse(Int, x), js, @optic _[:a])
sch = schema(f ∘ pf, jss)
@test sch[:a] isa LeafEntry{Real}
f = js -> Accessors.delete(js, @optic _[:a])
sch = schema(f ∘ pf, jss)
@test !haskey(sch, :a)
end

for jss in (
[
""" {"a": [1, 2, "3"]} """,
""" {"a": ["1", "2", "3"]} """,
],
[
""" {"a": [], "b": "foo"} """,
""" {"a": ["1", 1], "b": "bar"} """,
]
)
@test_throws InconsistentSchema schema(parsef(), jss)

pf = JSON.parse
f = js -> Accessors.modify(string, js, @optic _["a"][∗])
sch = schema(f ∘ pf, jss)
@test sch[:a] isa ArrayEntry
@test sch[:a].items isa LeafEntry{String}
f = js -> Accessors.modify(x -> x isa Integer ? x : parse(Int, x), js, @optic _["a"][∗])
sch = schema(f ∘ pf, jss)
@test sch[:a] isa ArrayEntry
@test sch[:a].items isa LeafEntry{Real}
f = js -> Accessors.delete(js, @optic _["a"])
sch = schema(f ∘ pf, jss)
@test !haskey(sch, :a)

pf = JSON3.read
f = js -> Accessors.modify(string, js, @optic _[:a][∗])
sch = schema(f ∘ pf, jss)
@test sch[:a] isa ArrayEntry
@test sch[:a].items isa LeafEntry{String}
f = js -> Accessors.modify(x -> x isa Integer ? x : parse(Int, x), js, @optic _[:a][∗])
sch = schema(f ∘ pf, jss)
@test sch[:a] isa ArrayEntry
@test sch[:a].items isa LeafEntry{Real}
f = js -> Accessors.delete(js, @optic _[:a])
sch = schema(f ∘ pf, jss)
@test !haskey(sch, :a)
end
end


@testset "representative_example" begin
sch = DictEntry(Dict(
:a => ArrayEntry(