Schema inference

The schema helps to understand the structure of JSON documents and stores some basic statistics of values present in the dataset. All this information is later taken into account by the suggestextractor function, which takes a schema and, using a few reasonable heuristics, suggests a suitable extractor for converting JSONs to Mill.jl structures.

In this simple example, we start with a dataset of three JSON documents:

julia> using JSON
julia> jss = JSON.parse.([
          """{ "a": "Hello", "b": { "c": 1, "d": [] } }""",
          """{ "b": { "c": 1, "d": [1, 2, 3] } }""",
          """{ "a": "World", "b": { "c": 2, "d": [1, 3] } }""",
       ]);
julia> jss[1]
Dict{String, Any} with 2 entries:
  "b" => Dict{String, Any}("c"=>1, "d"=>Any[])
  "a" => "Hello"

The main function for creating a schema is schema, which accepts an array of documents and produces a Schema:

julia> sch = schema(jss[1:2])
DictEntry 2x updated
  ├── a: LeafEntry (1 unique `String` values) 1x updated
  ╰── b: DictEntry 2x updated
           ├── c: LeafEntry (1 unique `Real` values) 2x updated
           ╰── d: ArrayEntry 2x updated
                    ╰── LeafEntry (3 unique `Real` values) 3x updated

This schema might already come in handy for quick statistical insight into the dataset at hand, which we discuss further below.

A schema can always be updated with another document using the update! function:

julia> update!(sch, jss[3]);
julia> sch
DictEntry 3x updated
  ├── a: LeafEntry (2 unique `String` values) 2x updated
  ╰── b: DictEntry 3x updated
           ├── c: LeafEntry (2 unique `Real` values) 3x updated
           ╰── d: ArrayEntry 3x updated
                    ╰── LeafEntry (3 unique `Real` values) 5x updated

Schema merging

Lastly, it is also possible to merge two or more schemas together with Base.merge and Base.merge!. It is thus possible to easily parallelize schema inference, merging together individual (sub)schemas as follows:

julia> sch1 = schema(jss[1:2])
DictEntry 2x updated
  ├── a: LeafEntry (1 unique `String` values) 1x updated
  ╰── b: DictEntry 2x updated
           ├── c: LeafEntry (1 unique `Real` values) 2x updated
           ╰── d: ArrayEntry 2x updated
                    ╰── LeafEntry (3 unique `Real` values) 3x updated
julia> sch2 = schema(jss[3:3])
DictEntry 1x updated
  ├── a: LeafEntry (1 unique `String` values) 1x updated
  ╰── b: DictEntry 1x updated
           ├── c: LeafEntry (1 unique `Real` values) 1x updated
           ╰── d: ArrayEntry 1x updated
                    ╰── LeafEntry (2 unique `Real` values) 2x updated
julia> merge(sch1, sch2)
DictEntry 3x updated
  ├── a: LeafEntry (2 unique `String` values) 2x updated
  ╰── b: DictEntry 3x updated
           ├── c: LeafEntry (2 unique `Real` values) 3x updated
           ╰── d: ArrayEntry 3x updated
                    ╰── LeafEntry (3 unique `Real` values) 5x updated

or merge in place into sch1:

julia> merge!(sch1, sch2);
julia> sch1
DictEntry 3x updated
  ├── a: LeafEntry (2 unique `String` values) 2x updated
  ╰── b: DictEntry 3x updated
           ├── c: LeafEntry (2 unique `Real` values) 3x updated
           ╰── d: ArrayEntry 3x updated
                    ╰── LeafEntry (3 unique `Real` values) 5x updated
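
The parallelization idea mentioned above can be sketched as a chunk-and-fold pattern. The helper below is hypothetical (chunked_reduce is not part of JsonGrinder.jl) and uses only Base Julia: it builds one partial result per chunk on its own task and then folds the partial results together. Plain numbers stand in for documents so that the sketch runs without any packages.

```julia
# Hypothetical helper, not part of JsonGrinder.jl: build one partial result
# per chunk of documents on separate tasks, then fold the results together.
function chunked_reduce(build, combine, docs; nchunks::Integer = Threads.nthreads())
    isempty(docs) && throw(ArgumentError("no documents given"))
    chunksize = cld(length(docs), nchunks)
    tasks = map(Iterators.partition(docs, chunksize)) do chunk
        # Each chunk is copied into its own vector and processed on its own task.
        Threads.@spawn build(collect(chunk))
    end
    return reduce(combine, fetch.(tasks))
end

# Stand-in usage with plain numbers: partial sums combined with `+`.
total = chunked_reduce(sum, +, collect(1:10); nchunks = 3)
```

With JsonGrinder.jl loaded, the analogous call would presumably be chunked_reduce(schema, merge, jss). Whether schema can safely be called from several tasks at once is an assumption worth verifying before relying on this.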

Advanced

Statistics are collected in a hierarchical structure that mirrors the structure of the documents and is composed of DictEntry, ArrayEntry, and LeafEntry nodes. These structures are direct counterparts to those in JSON: Dict, Array, and Value.

In this example, we will load a larger dataset of JSON documents (available here):

julia> jss = JSON.parsefile("json_examples.json");
julia> jss[1]
Dict{String, Any} with 7 entries:
  "bib_entries" => Dict{String, Any}("BIBREF9"=>Dict{String, Any}("ref_id"=>"b9…
  "body_text"   => Any[Dict{String, Any}("ref_spans"=>Any[], "cite_spans"=>Any[…
  "back_matter" => Any[Dict{String, Any}("ref_spans"=>Any[], "cite_spans"=>Any[…
  "metadata"    => Dict{String, Any}("title"=>"", "authors"=>Any[Dict{String, A…
  "abstract"    => Any[Dict{String, Any}("ref_spans"=>Any[], "cite_spans"=>Any[…
  "ref_entries" => Dict{String, Any}("FIGREF0"=>Dict{String, Any}("latex"=>noth…
  "paper_id"    => "0000fcce604204b1b9d876dc073eb529eb5ce305"

and compute the schema:

julia> sch = schema(jss)
ERROR: NullValues at path [:bib_entries][:BIBREF16][:year]: JsonGrinder.jl doesn't support `null` values (`nothing` in julia). Preprocess documents appropriately, e.g. with `remove_nulls`.

To save space, hierarchical structures like schemas and extractors are not shown in full in the REPL. To inspect the full schema, we can use printtree from HierarchicalUtils.jl. We use

  • vtrunc=3, which only shows at most 3 children of each node in the tree, and
  • trav=true, which also shows traversal codes of individual nodes.

julia> printtree(sch; vtrunc=3, trav=true)
DictEntry [""] 3x updated
  ├── a: LeafEntry (2 unique `String` values) ["E"] 2x updated
  ╰── b: DictEntry ["U"] 3x updated
           ├── c: LeafEntry (2 unique `Real` values) ["Y"] 3x updated
           ╰── d: ArrayEntry ["c"] 3x updated
                    ╰── LeafEntry (3 unique `Real` values) ["e"] 5x updated

Traversal codes (strings printed at the end of rows) can be used to access individual elements of the schema.

julia> sch["JQ"]
ERROR: Invalid index!
julia> sch["sc"]
ERROR: Invalid index!

As indicated in the displayed tree, the LeafEntry accessible as sch["JQ"] was updated 3 times in the input documents. On the other hand, the LeafEntry accessible as sch["sc"] was updated 12 times, each time with a different value.

Empty arrays

Note that, for example, the ArrayEntry accessible as sch["AU"] has been updated 22 times but doesn't have any children. This is because on the path with keys "abstract" and "cite_spans" we have seen 22 arrays, all of them empty.

To learn more about the HierarchicalUtils.jl package, check also this section of docs, or this section in the Mill.jl docs.

Schema parameters

It may happen that the leaves of the documents contain too many unique values, and saving all of them might quickly become too memory demanding. JsonGrinder.jl therefore uses the JsonGrinder.max_values parameter. Once the number of unique values in one leaf exceeds this parameter, the schema no longer remembers newly appearing values in this leaf. This behavior might be relevant when calling suggestextractor, especially with the categorical_limit argument.
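
To make the effect of such a cap concrete, here is a toy sketch in plain Julia. ToyLeaf and observe! are hypothetical names used only for illustration; this is not how JsonGrinder.jl implements LeafEntry, and the exact boundary at which new values stop being stored is an assumption.

```julia
# Toy leaf statistics; an illustration only, not JsonGrinder's LeafEntry.
mutable struct ToyLeaf
    counts::Dict{Any, Int}   # remembered values and how often each was seen
    updated::Int             # total number of observed values
end
ToyLeaf() = ToyLeaf(Dict{Any, Int}(), 0)

function observe!(leaf::ToyLeaf, value; max_values::Int = 3)
    leaf.updated += 1
    if haskey(leaf.counts, value)
        leaf.counts[value] += 1          # known value: keep counting it
    elseif length(leaf.counts) < max_values
        leaf.counts[value] = 1           # still room: remember the new value
    end                                  # otherwise the new value is not stored
    return leaf
end

leaf = ToyLeaf()
for v in 1:10
    observe!(leaf, v; max_values = 3)
end
```

After the loop, the toy leaf has been updated ten times but remembers only three distinct values, so any heuristic that reads the number of unique values sees a capped count rather than the true one.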

Similarly, JsonGrinder.jl also shortens strings that are too long before saving them to schema. This can be governed with the JsonGrinder.max_string_codeunits parameter.
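
As a rough illustration of what limiting a string by code units means, the hypothetical helper below truncates a string to at most a given number of code units without cutting a multi-byte character in half. It is a sketch in plain Julia under the assumption that the limit counts UTF-8 code units; how JsonGrinder.jl actually shortens strings may differ.

```julia
# Hypothetical helper: keep the longest prefix of `s` that fits into
# `max_codeunits` code units and ends on a character boundary.
function shorten(s::AbstractString, max_codeunits::Integer)
    ncodeunits(s) <= max_codeunits && return s
    last_start = 0                      # start index of the last kept character
    for i in eachindex(s)
        # nextind(s, i) - 1 is the last code unit of the character starting at i
        nextind(s, i) - 1 > max_codeunits && break
        last_start = i
    end
    return s[1:last_start]
end

short = shorten("hello world", 5)
```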

Preprocessing

Sometimes, input JSON documents do not adhere to a stable schema, which for example happens if one key has children of multiple different types in different documents. An example would be:

julia> jss = JSON.parse.([
           """ {"a": [1, 2, 3] } """,
           """ {"a": { "b": 1 } } """,
           """ {"a": "hello" } """
       ])
3-element Vector{Dict{String, Any}}:
 Dict("a" => Any[1, 2, 3])
 Dict("a" => Dict{String, Any}("b" => 1))
 Dict("a" => "hello")

In these cases, schema creation fails with an error indicating what went wrong:

julia> schema(jss)
ERROR: InconsistentSchema at path [:a]: Can't store `Dict{String, Any}` into `ArrayEntry`!

Should this happen, we recommend dealing with such cases by suitable preprocessing.
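
Before deciding on a preprocessing step, it can help to find out which keys are affected. The following is a minimal sketch in plain Julia; kind, kinds_per_key, and inconsistent_keys are hypothetical helpers, not JsonGrinder.jl functions, and they inspect only top-level keys of already parsed documents.

```julia
# Hypothetical helpers: classify parsed JSON values into coarse kinds.
kind(::AbstractDict)   = :object
kind(::AbstractVector) = :array
kind(::AbstractString) = :string
kind(::Real)           = :number   # note: Bool is also a Real in Julia
kind(::Any)            = :other

# Collect the set of kinds seen under each top-level key.
function kinds_per_key(docs)
    seen = Dict{String, Set{Symbol}}()
    for doc in docs, (k, v) in doc
        push!(get!(seen, string(k), Set{Symbol}()), kind(v))
    end
    return seen
end

# Keys whose values appear with more than one kind across the documents.
inconsistent_keys(docs) =
    sort([k for (k, kinds) in kinds_per_key(docs) if length(kinds) > 1])

docs = [Dict("a" => [1, 2, 3]), Dict("a" => Dict("b" => 1)), Dict("a" => "hello")]
bad = inconsistent_keys(docs)
```

For the three documents above, the key "a" is reported because it holds an array, an object, and a string in different documents.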

Mapping paths

Assume that input documents contain information about port numbers, some of which are encoded as integers and some of which as strings:

julia> jss = [
           """ {"ports": [70, 80, 443], "protocol": "TCP" } """,
           """ {"ports": ["22", "80", "500"], "protocol": "UDP" } """,
       ]
2-element Vector{String}:
 " {\"ports\": [70, 80, 443], \"protocol\": \"TCP\" } "
 " {\"ports\": [\"22\", \"80\", \"500\"], \"protocol\": \"UDP\" } "
julia> schema(JSON.parse, jss)
ERROR: InconsistentSchema at path [:ports][]: Can't store `String` into `LeafEntry{Real}`!

We recommend dealing with these cases using the optics approach from Accessors.jl (and possibly also from AccessorsExtra.jl). We can use Accessors.modify to modify the problematic paths, turning all values into Strings:

julia> using Accessors
julia> f = js -> Accessors.modify(string, js, @optic _["ports"][∗])
#1 (generic function with 1 method)
julia> f.(JSON.parse.(jss))
2-element Vector{Dict{String, Any}}:
 Dict("protocol" => "TCP", "ports" => ["70", "80", "443"])
 Dict("protocol" => "UDP", "ports" => ["22", "80", "500"])
julia> schema(f ∘ JSON.parse, jss)
DictEntry 2x updated
  ├───── ports: ArrayEntry 2x updated
  │               ╰── LeafEntry (5 unique `String` values) 6x updated
  ╰── protocol: LeafEntry (2 unique `String` values) 2x updated

or parsing them as Integers:

julia> schema(jss) do doc
           js = JSON.parse(doc)
           Accessors.modify(x -> x isa Integer ? x : parse(Int, x), js, @optic _["ports"][∗])
       end
DictEntry 2x updated
  ├───── ports: ArrayEntry 2x updated
  │               ╰── LeafEntry (5 unique `Real` values) 6x updated
  ╰── protocol: LeafEntry (2 unique `String` values) 2x updated

Writing `∗`

The asterisk for selecting all elements of the array (∗) is not the standard star (*), but is written as \ast<TAB> in the Julia REPL; see also the Accessors.jl docstrings.

We can also get rid of this path completely with Accessors.delete:

julia> schema(jss) do doc
           Accessors.delete(JSON.parse(doc), @optic _["ports"])
       end
DictEntry 2x updated
  ╰── protocol: LeafEntry (2 unique `String` values) 2x updated
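
For simple, shallow cases like this one, the same preprocessing can also be written without optics. The sketch below uses only Base Julia and is a swapped-in alternative to the Accessors.jl approach, not a replacement recommended by the package; the three function names are hypothetical, and each function returns a modified copy instead of mutating its input.

```julia
# Hypothetical helpers operating on an already parsed document (a Dict).
# Turn every port into a String.
ports_to_strings(doc) = merge(doc, Dict("ports" => string.(doc["ports"])))

# Turn every port into an Int, parsing the ones given as strings.
ports_to_ints(doc) = merge(doc, Dict(
    "ports" => [p isa Integer ? p : parse(Int, p) for p in doc["ports"]]))

# Drop the "ports" key entirely.
without_ports(doc) = filter(kv -> first(kv) != "ports", doc)

doc = Dict{String, Any}("ports" => Any[70, "80"], "protocol" => "TCP")
as_strings = ports_to_strings(doc)
as_ints = ports_to_ints(doc)
dropped = without_ports(doc)
```

By analogy with the examples above, such a function could then be composed with the parser, for instance schema(ports_to_strings ∘ JSON.parse, jss). For deeply nested paths, the optics approach is likely to stay more readable.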

If JSON3 is used for parsing, keys in objects are Symbols instead of Strings, so make sure to use Symbols in optics as well:

julia> using JSON3
julia> Accessors.delete(JSON3.read(""" {"port": 1} """), @optic _["port"])
Dict{Symbol, Any} with 1 entry:
  :port => 1
julia> Accessors.delete(JSON3.read(""" {"port": 1} """), @optic _[:port])
Dict{Symbol, Any}()

Null values

In the current version, JsonGrinder.jl does not support null values in JSON documents (represented as nothing in Julia):

julia> schema(JSON.parse, [
           """ {"a": null } """
       ])
ERROR: NullValues at path [:a]: JsonGrinder.jl doesn't support `null` values (`nothing` in julia). Preprocess documents appropriately, e.g. with `remove_nulls`.
julia> schema(JSON.parse, [
           """ {"a": [1, null, 3] } """
       ])
ERROR: NullValues at path [:a][]: JsonGrinder.jl doesn't support `null` values (`nothing` in julia). Preprocess documents appropriately, e.g. with `remove_nulls`.
julia> schema(JSON.parse, [
           """ {"a": {"b": null} } """
       ])
ERROR: NullValues at path [:a][:b]: JsonGrinder.jl doesn't support `null` values (`nothing` in julia). Preprocess documents appropriately, e.g. with `remove_nulls`.

These values usually do not carry any relevant information. Therefore, as the error suggests, the simplest solution is to filter them out using the remove_nulls function:

julia> schema(remove_nulls ∘ JSON.parse, [
           """ {"a": null } """
       ])
DictEntry 1x updated
julia> schema(remove_nulls ∘ JSON.parse, [
           """ {"a": [1, null, 3] } """
       ])
DictEntry 1x updated
  ╰── a: ArrayEntry 1x updated
           ╰── LeafEntry (2 unique `Real` values) 2x updated
julia> schema(remove_nulls ∘ JSON.parse, [
           """ {"a": {"b": null} } """
       ])
DictEntry 1x updated
  ╰── a: DictEntry 1x updated
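
The outputs above suggest what this filtering amounts to: keys with null values and null array elements disappear, while containers that become empty are kept. The following plain-Julia sketch reproduces that behavior on parsed documents. drop_nothing is a hypothetical name used for illustration; it is not the package's remove_nulls, whose implementation may differ.

```julia
# Hypothetical helper: recursively drop `nothing` from parsed JSON.
drop_nothing(x) = x                                   # leaves stay as they are
drop_nothing(v::AbstractVector) =
    [drop_nothing(e) for e in v if e !== nothing]     # drop null elements
drop_nothing(d::AbstractDict) =
    Dict(k => drop_nothing(v) for (k, v) in d if v !== nothing)  # drop null keys

cleaned_leaf   = drop_nothing(Dict("a" => nothing))
cleaned_array  = drop_nothing(Dict("a" => Any[1, nothing, 3]))
cleaned_nested = drop_nothing(Dict("a" => Dict("b" => nothing)))
```

In the third case, the inner dictionary survives as an empty dictionary, which corresponds to the childless DictEntry shown for key a in the last example above.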