Schema inference
The schema describes the structure of JSON documents and stores basic statistics about the values present in the dataset. All this information is later taken into account by the suggestextractor function, which takes a schema and, using a few reasonable heuristics, suggests a suitable extractor for converting JSONs to Mill.jl structures.
In this simple example, we start with a dataset of three JSON documents:
using JSON
julia> jss = JSON.parse.([
           """{ "a": "Hello", "b": { "c": 1, "d": [] } }""",
           """{ "b": { "c": 1, "d": [1, 2, 3] } }""",
           """{ "a": "World", "b": { "c": 2, "d": [1, 3] } }""",
       ]);
julia> jss[1]
Dict{String, Any} with 2 entries:
  "b" => Dict{String, Any}("c"=>1, "d"=>Any[])
  "a" => "Hello"
The main function for creating a schema is schema, which accepts an array of documents and produces a Schema:
julia> sch = schema(jss[1:2])
DictEntry 2x updated
  ├── a: LeafEntry (1 unique `String` values) 1x updated
  ╰── b: DictEntry 2x updated
           ├── c: LeafEntry (1 unique `Real` values) 2x updated
           ╰── d: ArrayEntry 2x updated
                    ╰── LeafEntry (3 unique `Real` values) 3x updated
The schema alone may already come in handy for a quick statistical insight into the dataset at hand, which we discuss further below.
A schema can always be updated with another document using the update! function:
julia> update!(sch, jss[3]);
julia> sch
DictEntry 3x updated
  ├── a: LeafEntry (2 unique `String` values) 2x updated
  ╰── b: DictEntry 3x updated
           ├── c: LeafEntry (2 unique `Real` values) 3x updated
           ╰── d: ArrayEntry 3x updated
                    ╰── LeafEntry (3 unique `Real` values) 5x updated
Schema merging
Lastly, it is also possible to merge two or more schemas together with Base.merge and Base.merge!. It is thus possible to easily parallelize schema inference, merging together individual (sub)schemas as follows:
julia> sch1 = schema(jss[1:2])
DictEntry 2x updated
  ├── a: LeafEntry (1 unique `String` values) 1x updated
  ╰── b: DictEntry 2x updated
           ├── c: LeafEntry (1 unique `Real` values) 2x updated
           ╰── d: ArrayEntry 2x updated
                    ╰── LeafEntry (3 unique `Real` values) 3x updated
julia> sch2 = schema(jss[3:3])
DictEntry 1x updated
  ├── a: LeafEntry (1 unique `String` values) 1x updated
  ╰── b: DictEntry 1x updated
           ├── c: LeafEntry (1 unique `Real` values) 1x updated
           ╰── d: ArrayEntry 1x updated
                    ╰── LeafEntry (2 unique `Real` values) 2x updated
julia> merge(sch1, sch2)
DictEntry 3x updated
  ├── a: LeafEntry (2 unique `String` values) 2x updated
  ╰── b: DictEntry 3x updated
           ├── c: LeafEntry (2 unique `Real` values) 3x updated
           ╰── d: ArrayEntry 3x updated
                    ╰── LeafEntry (3 unique `Real` values) 5x updated
or merge in place into sch1:
julia> merge!(sch1, sch2);
julia> sch1
DictEntry 3x updated
  ├── a: LeafEntry (2 unique `String` values) 2x updated
  ╰── b: DictEntry 3x updated
           ├── c: LeafEntry (2 unique `Real` values) 3x updated
           ╰── d: ArrayEntry 3x updated
                    ╰── LeafEntry (3 unique `Real` values) 5x updated
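The parallelization mentioned above could look roughly as follows. This is a minimal sketch, not a function provided by JsonGrinder.jl: the name parallel_schema and the chunk_size keyword are hypothetical, and the sketch assumes that schema accepts any vector of parsed documents and that merge can be applied pairwise, as in the examples above.

```julia
using JSON, JsonGrinder

# Hypothetical helper, not part of JsonGrinder.jl: infer one sub-schema per chunk
# of documents in its own task, then merge the partial schemas pairwise.
# Assumes `docs` is non-empty (reducing an empty collection would throw).
function parallel_schema(docs::AbstractVector; chunk_size::Int=1_000)
    chunks = Iterators.partition(docs, chunk_size)
    tasks = map(chunks) do chunk
        Threads.@spawn schema(collect(chunk))
    end
    return reduce(merge, fetch.(tasks))
end

jss = JSON.parse.([
    """{ "a": "Hello", "b": { "c": 1, "d": [] } }""",
    """{ "b": { "c": 1, "d": [1, 2, 3] } }""",
    """{ "a": "World", "b": { "c": 2, "d": [1, 3] } }""",
])

# Two chunks here: documents 1-2 and document 3.
sch = parallel_schema(jss; chunk_size=2)
```

The tasks only run concurrently if Julia was started with more than one thread (for example with julia -t 4). With a single thread the result is the same, just computed sequentially.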
Advanced
Statistics are collected in a hierarchical structure composed of DictEntry, ArrayEntry, and LeafEntry nodes, which mirrors the structure of the documents. These structures are direct counterparts to those in JSON: Dict, Array, and Value.
In this example, we load a larger dataset of JSON documents (available here):
julia> jss = JSON.parsefile("json_examples.json");
julia> jss[1]
Dict{String, Any} with 7 entries:
  "bib_entries" => Dict{String, Any}("BIBREF9"=>Dict{String, Any}("ref_id"=>"b9…
  "body_text"   => Any[Dict{String, Any}("ref_spans"=>Any[], "cite_spans"=>Any[…
  "back_matter" => Any[Dict{String, Any}("ref_spans"=>Any[], "cite_spans"=>Any[…
  "metadata"    => Dict{String, Any}("title"=>"", "authors"=>Any[Dict{String, A…
  "abstract"    => Any[Dict{String, Any}("ref_spans"=>Any[], "cite_spans"=>Any[…
  "ref_entries" => Dict{String, Any}("FIGREF0"=>Dict{String, Any}("latex"=>noth…
  "paper_id"    => "0000fcce604204b1b9d876dc073eb529eb5ce305"
and compute the schema:
julia> sch = schema(jss)
ERROR: NullValues at path [:bib_entries][:BIBREF16][:year]: JsonGrinder.jl doesn't support `null` values (`nothing` in julia). Preprocess documents appropriately, e.g. with `remove_nulls`.
To save space, hierarchical structures like schema and extractor are not shown in full in the REPL. To inspect the full schema, we can use printtree from HierarchicalUtils.jl. We use vtrunc=3, which shows at most 3 children of each node in the tree, and trav=true, which also shows traversal codes of individual nodes.
julia> printtree(sch; vtrunc=3, trav=true)
DictEntry [""] 3x updated
  ├── a: LeafEntry (2 unique `String` values) ["E"] 2x updated
  ╰── b: DictEntry ["U"] 3x updated
           ├── c: LeafEntry (2 unique `Real` values) ["Y"] 3x updated
           ╰── d: ArrayEntry ["c"] 3x updated
                    ╰── LeafEntry (3 unique `Real` values) ["e"] 5x updated
Traversal codes (strings printed at the end of rows) can be used to access individual elements of the schema.
julia> sch["JQ"]
ERROR: Invalid index!
julia> sch["sc"]
ERROR: Invalid index!
As indicated in the displayed tree, the LeafEntry accessible as sch["JQ"] was updated 3 times in input documents. On the other hand, the LeafEntry accessible as sch["sc"] was updated 12 times, each time with a different value.
Note that, for example, the ArrayEntry accessible as sch["AU"] has been updated 22 times but doesn't have any children. This is because on the path with keys "abstract" and "cite_spans", we have seen 22 arrays, but all of them were empty.
To learn more about the HierarchicalUtils.jl package, check also this section of docs, or this section in the Mill.jl docs.
Schema parameters
It may happen that leaves of the documents contain too many unique values. Saving all of them might quickly become too memory demanding. JsonGrinder.jl thus works with the JsonGrinder.max_values parameter. Once the number of unique values in one leaf exceeds this parameter, schema will no longer remember newly appearing values in this leaf. This behavior might be relevant when calling suggestextractor, especially with the categorical_limit argument.
Similarly, JsonGrinder.jl also shortens strings that are too long before saving them to the schema. This can be governed with the JsonGrinder.max_string_codeunits parameter.
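The effect of a cap on remembered values can be illustrated with a toy counter in plain Julia. This is only a sketch of the idea and not how JsonGrinder.jl is implemented: the CappedCounter type, its fields, and the observe! function are hypothetical names introduced here, the limit is passed explicitly instead of being read from JsonGrinder.max_values, and the exact boundary at which new values stop being stored is an assumption.

```julia
# Hypothetical illustration only; none of these names exist in JsonGrinder.jl.
mutable struct CappedCounter
    counts::Dict{Any, Int}  # occurrences of each remembered value
    updated::Int            # total number of observations
    max_values::Int         # stop remembering new values beyond this many
end

CappedCounter(max_values::Int) = CappedCounter(Dict{Any, Int}(), 0, max_values)

function observe!(c::CappedCounter, v)
    c.updated += 1
    if haskey(c.counts, v)
        c.counts[v] += 1                    # known value: keep counting
    elseif length(c.counts) < c.max_values
        c.counts[v] = 1                     # new value and still room: remember it
    end                                     # new value and cap reached: not stored
    return c
end

c = CappedCounter(2)
foreach(v -> observe!(c, v), ["a", "b", "a", "c", "d", "b"])

c.updated         # 6 observations in total
length(c.counts)  # only 2 values remembered ("a" and "b")
```

In a structure like this, once the cap is reached the number of remembered values is only a lower bound on the true number of unique values, which is one way to see why the cap can matter for decisions based on the count of unique values.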
Preprocessing
Sometimes, input JSON documents do not adhere to a stable schema, which for example happens if one key has children of multiple different types in different documents. An example would be:
julia> jss = JSON.parse.([
           """ {"a": [1, 2, 3] } """,
           """ {"a": { "b": 1 } } """,
           """ {"a": "hello" } """
       ])
3-element Vector{Dict{String, Any}}:
 Dict("a" => Any[1, 2, 3])
 Dict("a" => Dict{String, Any}("b" => 1))
 Dict("a" => "hello")
In these cases, schema creation fails with an error indicating what went wrong:
julia> schema(jss)
ERROR: InconsistentSchema at path [:a]: Can't store `Dict{String, Any}` into `ArrayEntry`!
Should this happen, we recommend dealing with such cases by suitable preprocessing.
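Before deciding on a preprocessing step, it can help to list every path at which the documents disagree, since schema reports only the first conflict it encounters. The following is a minimal sketch in plain Julia. The helpers kind, collect_kinds!, and inconsistent_paths are hypothetical and not part of JsonGrinder.jl, and the coarse grouping of values into kinds is an assumption that only approximates the checks schema performs. The documents are built from Dict literals so that the snippet has no dependencies; documents produced by JSON.parse have the same shape.

```julia
# Hypothetical helpers, not part of JsonGrinder.jl.

# Coarse "kind" of a parsed JSON value.
kind(::AbstractDict) = :dict
kind(::AbstractVector) = :array
kind(::AbstractString) = :string
kind(::Real) = :number
kind(::Nothing) = :null
kind(x) = Symbol(typeof(x))

# Walk one document and record the kinds seen at every path. Paths are written
# in the style of the error messages, e.g. "[:a]" or "[:ports][]".
function collect_kinds!(kinds::Dict{String, Set{Symbol}}, x, path::String="")
    push!(get!(kinds, path, Set{Symbol}()), kind(x))
    if x isa AbstractDict
        for (k, v) in x
            collect_kinds!(kinds, v, string(path, "[:", k, "]"))
        end
    elseif x isa AbstractVector
        for v in x
            collect_kinds!(kinds, v, string(path, "[]"))
        end
    end
    return kinds
end

# Keep only the paths at which more than one kind of value was observed.
function inconsistent_paths(docs)
    kinds = Dict{String, Set{Symbol}}()
    for doc in docs
        collect_kinds!(kinds, doc)
    end
    return filter(p -> length(last(p)) > 1, kinds)
end

docs = [Dict("a" => [1, 2, 3]), Dict("a" => Dict("b" => 1)), Dict("a" => "hello")]
bad = inconsistent_paths(docs)  # a single entry, for the path "[:a]"
```

Each reported path is a candidate for one of the fixes described in this section, such as converting all values to one type or deleting the path.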
Mapping paths
Assume that input documents contain information about port numbers, some of which are encoded as integers and some of which as strings:
julia> jss = [
           """ {"ports": [70, 80, 443], "protocol": "TCP" } """,
           """ {"ports": ["22", "80", "500"], "protocol": "UDP" } """,
       ]
2-element Vector{String}:
 " {\"ports\": [70, 80, 443], \"protocol\": \"TCP\" } "
 " {\"ports\": [\"22\", \"80\", \"500\"], \"protocol\": \"UDP\" } "
julia> schema(JSON.parse, jss)
ERROR: InconsistentSchema at path [:ports][]: Can't store `String` into `LeafEntry{Real}`!
We recommend dealing with these cases using the optics approach from Accessors.jl (and possibly also from AccessorsExtra.jl). We can use Accessors.modify to modify the problematic paths, turning all values into Strings:
using Accessors
julia> f = js -> Accessors.modify(string, js, @optic _["ports"][∗])
#1 (generic function with 1 method)
julia> f.(JSON.parse.(jss))
2-element Vector{Dict{String, Any}}:
 Dict("protocol" => "TCP", "ports" => ["70", "80", "443"])
 Dict("protocol" => "UDP", "ports" => ["22", "80", "500"])
julia> schema(f ∘ JSON.parse, jss)
DictEntry 2x updated
  ├───── ports: ArrayEntry 2x updated
  │               ╰── LeafEntry (5 unique `String` values) 6x updated
  ╰── protocol: LeafEntry (2 unique `String` values) 2x updated
or parsing them as Integers:
julia> schema(jss) do doc
           js = JSON.parse(doc)
           Accessors.modify(x -> x isa Integer ? x : parse(Int, x), js, @optic _["ports"][∗])
       end
DictEntry 2x updated
  ├───── ports: ArrayEntry 2x updated
  │               ╰── LeafEntry (5 unique `Real` values) 6x updated
  ╰── protocol: LeafEntry (2 unique `String` values) 2x updated
The asterisk for selecting all elements of the array (∗) is not the standard star (*); it is written as \ast<TAB> in the Julia REPL. See also the Accessors.jl docstrings.
We can also get rid of this path completely with Accessors.delete:
julia> schema(jss) do doc
           Accessors.delete(JSON.parse(doc), @optic _["ports"])
       end
DictEntry 2x updated
  ╰── protocol: LeafEntry (2 unique `String` values) 2x updated
If JSON3 is used for parsing, keys in objects are Symbols instead of Strings, so make sure to use Symbols in the optics as well:
using JSON3
julia> Accessors.delete(JSON3.read(""" {"port": 1} """), @optic _["port"])
Dict{Symbol, Any} with 1 entry:
  :port => 1
julia> Accessors.delete(JSON3.read(""" {"port": 1} """), @optic _[:port])
Dict{Symbol, Any}()
Null values
In the current version, JsonGrinder.jl does not support null values in JSON documents (represented as nothing in Julia):
julia> schema(JSON.parse, [ """ {"a": null } """ ])
ERROR: NullValues at path [:a]: JsonGrinder.jl doesn't support `null` values (`nothing` in julia). Preprocess documents appropriately, e.g. with `remove_nulls`.
julia> schema(JSON.parse, [ """ {"a": [1, null, 3] } """ ])
ERROR: NullValues at path [:a][]: JsonGrinder.jl doesn't support `null` values (`nothing` in julia). Preprocess documents appropriately, e.g. with `remove_nulls`.
julia> schema(JSON.parse, [ """ {"a": {"b": null} } """ ])
ERROR: NullValues at path [:a][:b]: JsonGrinder.jl doesn't support `null` values (`nothing` in julia). Preprocess documents appropriately, e.g. with `remove_nulls`.
These values usually do not carry any relevant information. Therefore, as the error suggests, the most straightforward solution is to filter them out using the remove_nulls function:
julia> schema(remove_nulls ∘ JSON.parse, [ """ {"a": null } """ ])
DictEntry 1x updated
julia> schema(remove_nulls ∘ JSON.parse, [ """ {"a": [1, null, 3] } """ ])
DictEntry 1x updated
  ╰── a: ArrayEntry 1x updated
           ╰── LeafEntry (2 unique `Real` values) 2x updated
julia> schema(remove_nulls ∘ JSON.parse, [ """ {"a": {"b": null} } """ ])
DictEntry 1x updated
  ╰── a: DictEntry 1x updated
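To make the filtering concrete, the following plain-Julia sketch reproduces the behavior visible in the three outputs above: null-valued keys are dropped, null elements are removed from arrays, and a dictionary left without keys is kept as an empty dictionary. The function drop_nulls is a hypothetical stand-in written for illustration. It is not the implementation of remove_nulls and may differ from it in cases not shown here.

```julia
# Hypothetical stand-in for illustration; use `remove_nulls` in practice.
drop_nulls(x) = x  # leaves (strings, numbers, ...) are returned unchanged
drop_nulls(v::AbstractVector) = [drop_nulls(x) for x in v if !isnothing(x)]
drop_nulls(d::AbstractDict) = Dict(k => drop_nulls(x) for (k, x) in d if !isnothing(x))

doc = Dict(
    "a" => Any[1, nothing, 3],      # null element inside an array
    "b" => nothing,                 # null value under a key
    "c" => Dict("d" => nothing),    # null nested one level deeper
)

clean = drop_nulls(doc)
# "b" is gone, "a" holds [1, 3], and "c" is kept as an empty dictionary
```

Because the function returns a new document and does not modify its input, it can be composed with a parser in the same way as in the examples above.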