Extraction

An Extractor is responsible for converting JSON documents into Mill.jl structures. The main idea is that the extractor follows the same hierarchical structure as the previously inferred Schema. An extractor for a whole JSON document is created by composing (sub-)extractors that mirror the JSON structure.

Assume the following dataset of two JSON documents for which we infer a Schema:

using JSON

julia> jss = JSON.parse("""[
           {
               "name": "Karl",
               "siblings": ["Gertruda", "Heike", "Fritz"],
               "hobby": ["running", "pingpong"],
               "age": 21
           },
           {
               "name": "Heike",
               "siblings": ["Gertruda", "Heike", "Fritz"],
               "hobby": ["yoga"],
               "age": 24
           }
       ]""");
julia> sch = schema(jss)
DictEntry 2x updated
  ├─────── age: LeafEntry (2 unique `Real` values) 2x updated
  ├───── hobby: ArrayEntry 2x updated
  │               ╰── LeafEntry (3 unique `String` values) 3x updated
  ├────── name: LeafEntry (2 unique `String` values) 2x updated
  ╰── siblings: ArrayEntry 2x updated
                  ╰── LeafEntry (3 unique `String` values) 6x updated

Manual creation of `Extractor`s

One possible way to create an Extractor is to manually define it from all the required pieces. One extractor corresponding to sch might look like this:

julia> e = DictExtractor((
           name = NGramExtractor(),
           age = ScalarExtractor(),
           hobby = ArrayExtractor(CategoricalExtractor(["running", "swimming","yoga"]))
       ))
DictExtractor
  ├─── name: NGramExtractor(n=3, b=256, m=2053)
  ├──── age: ScalarExtractor(c=0.0, s=1.0)
  ╰── hobby: ArrayExtractor
               ╰── CategoricalExtractor(n=4)

We have just created a DictExtractor with

NGramExtractor to extract Strings under the "name" key,
ScalarExtractor to extract age under the "age" key, and finally
ArrayExtractor for extracting arrays under the "hobby" key. This extractor has one child, a CategoricalExtractor, which operates on three hobby categories.

Applying e on the first JSON document yields the following hierarchy of Mill.jl structures:

julia> x = e(jss[1])
ProductNode  1 obs
  ├─── name: ArrayNode(2053×1 NGramMatrix with Int64 elements)  1 obs
  ├──── age: ArrayNode(1×1 Array with Float32 elements)  1 obs
  ╰── hobby: BagNode  1 obs
               ╰── ArrayNode(4×2 OneHotArray with Bool elements)  2 obs

Consistent preprocessing

If any preprocessing was performed on the input documents, as discussed for example in Preprocessing, make sure to apply the same preprocessing before passing documents to any Extractor as well!

Missing key

Note that we didn't include any extractor for the "siblings" key. In such a case, the key in the JSON document is simply ignored and never extracted.

Every (sub)extractor, i.e. every node in the extractor "tree", is also callable, for example:

julia> e[:hobby](jss[1]["hobby"])
BagNode  1 obs
  ╰── ArrayNode(4×2 OneHotArray with Bool elements)  2 obs
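
The same pattern applies to the other children. A minimal sketch, assuming JSON.jl and JsonGrinder.jl are installed and using only the constructors and the symbol indexing shown above (the single document `js` is an illustrative stand-in for `jss[1]`):

```julia
using JSON, JsonGrinder

# One document with the same keys as in the example above.
js = JSON.parse("""{"name": "Karl", "hobby": ["running", "pingpong"], "age": 21}""")

e = DictExtractor((
    name = NGramExtractor(),
    age = ScalarExtractor(),
    hobby = ArrayExtractor(CategoricalExtractor(["running", "swimming", "yoga"]))
))

# Each child extractor is called directly on the matching JSON value.
x_name = e[:name](js["name"])
x_age = e[:age](js["age"])
x_hobby = e[:hobby](js["hobby"])
```

Calling a child on its own is handy for checking one branch of a large extractor without extracting the whole document.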

Let's inspect how the subtree under the "hobby" key in the JSON was extracted:

julia> printtree(x; trav=true)
ProductNode [""]  1 obs
  ├─── name: ArrayNode(2053×1 NGramMatrix with Int64 elements) ["E"]  1 obs
  ├──── age: ArrayNode(1×1 Array with Float32 elements) ["U"]  1 obs
  ╰── hobby: BagNode ["k"]  1 obs
               ╰── ArrayNode(4×2 OneHotArray with Bool elements) ["s"]  2 obs
julia> x["s"]
4×2 ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
 1  ⋅
 ⋅  ⋅
 ⋅  ⋅
 ⋅  1

The first column of the OneHotMatrix corresponds to "running", which is the first category in the corresponding CategoricalExtractor. The second column corresponds to "pingpong", which is not among the extractor's categories and is therefore encoded in the last (fourth) row. Any other unknown String would be extracted in the same way:

julia> e["s"]("unknown")
4×1 ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
 ⋅
 ⋅
 ⋅
 1
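
The handling of unknown categories seen above can be illustrated without the library. The following is only a sketch of the idea in plain Julia: the helper `onehot_with_unknown` is hypothetical, is not part of JsonGrinder.jl, and makes no claim about how CategoricalExtractor is implemented internally.

```julia
# Hypothetical helper, not part of JsonGrinder.jl: one-hot encode `value`
# against `categories`, reserving one extra (last) row for unseen values.
function onehot_with_unknown(categories::Vector{String}, value::String)
    col = falses(length(categories) + 1)   # last row stands for "unknown"
    idx = findfirst(==(value), categories)
    col[idx === nothing ? length(col) : idx] = true
    return col
end

categories = ["running", "swimming", "yoga"]
known = onehot_with_unknown(categories, "running")    # sets the first row
unseen = onehot_with_unknown(categories, "pingpong")  # sets the last row
```

Because every unseen value lands in the same row, the encoded dimension stays fixed no matter what strings appear at extraction time.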

Semi-automatic `Extractor` creation

Manually creating Extractors is a laborious and error-prone process once the hierarchical structure of the input JSON documents gets large. For this reason, JsonGrinder.jl provides the suggestextractor function, which greatly simplifies this process:

julia> e = suggestextractor(sch)
DictExtractor
  ├─────── age: CategoricalExtractor(n=3)
  ├────── name: CategoricalExtractor(n=3)
  ├── siblings: ArrayExtractor
  │               ╰── CategoricalExtractor(n=4)
  ╰───── hobby: ArrayExtractor
                  ╰── CategoricalExtractor(n=4)

The function uses a simple heuristic to choose reasonable extractors for values in leaves: if there are few unique values (fewer than the categorical_limit keyword argument), it uses a CategoricalExtractor; otherwise it uses either an NGramExtractor or a ScalarExtractor, depending on the type.
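
The decision rule described above can be sketched in plain Julia. This is an illustration under stated assumptions, not the library's code: `suggest_leaf` is a hypothetical function, the returned symbols are just labels, and the threshold is passed explicitly rather than assuming any default value.

```julia
# Hypothetical stand-in for the leaf heuristic. The returned symbols are
# illustrative labels, not JsonGrinder.jl types.
function suggest_leaf(values::AbstractVector, categorical_limit::Int)
    if length(unique(values)) < categorical_limit
        return :categorical          # few unique values
    elseif all(v -> v isa AbstractString, values)
        return :ngram                # many unique strings
    else
        return :scalar               # many unique numbers
    end
end

few = suggest_leaf(["running", "yoga", "pingpong"], 10)
many_strings = suggest_leaf(string.(1:50), 10)
many_numbers = suggest_leaf(collect(1:50), 10)
```

With only two documents in the example dataset, every leaf has few unique values, which is consistent with all leaves being suggested as CategoricalExtractor above.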

Hooking into the behavior

It is possible to hook into the internals of how suggestextractor treats values in leaves by redefining _suggestextractor(e::LeafEntry).

Please refer to the suggestextractor docs for all possible keyword arguments.

Inspect the result

It is recommended to check the proposed extractor manually and to modify it where it makes sense.

Stable `Extractor`s

Sometimes not all JSON documents in a dataset are complete. For example:

julia> jss = JSON.parse.([
           """ { "a" : 1, "b" : "foo" } """,
           """ { "b" : "bar" } """
       ]);
julia> sch = schema(jss)
DictEntry 2x updated
  ├── a: LeafEntry (1 unique `Real` values) 1x updated
  ╰── b: LeafEntry (2 unique `String` values) 2x updated

In such a case, suggestextractor wraps the extractor corresponding to the key with missing data ("a") in a StableExtractor:

julia> e = suggestextractor(sch)
DictExtractor
  ├── a: StableExtractor(CategoricalExtractor(n=2))
  ╰── b: CategoricalExtractor(n=3)

and the extraction works fine:

julia> extract(e, jss)
ProductNode  2 obs
  ├── a: ArrayNode(2×2 MaybeHotMatrix with Union{Missing, Bool} elements)  2 o ⋯
  ╰── b: ArrayNode(3×2 OneHotArray with Bool elements)  2 obs
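
The MaybeHotMatrix in the output above holds Union{Missing, Bool} elements. A plain-Julia sketch of that idea follows, assuming that an observation with a missing value becomes a column of missing entries. The helper `maybehot` is hypothetical and is not the Mill.jl or JsonGrinder.jl implementation.

```julia
# Hypothetical helper, not the Mill.jl implementation: encode `value` against
# `categories`, keeping one extra row for unseen values and returning an
# all-missing column when the value itself is missing.
function maybehot(categories::Vector{String}, value::Union{Missing,String})
    n = length(categories) + 1
    col = Vector{Union{Missing,Bool}}(undef, n)
    if ismissing(value)
        fill!(col, missing)
    else
        fill!(col, false)
        idx = findfirst(==(value), categories)
        col[idx === nothing ? n : idx] = true
    end
    return col
end

present = maybehot(["foo", "bar"], "bar")    # known category
absent = maybehot(["foo", "bar"], missing)   # key missing in the document
```

Keeping the missing column the same length as a regular one is what lets complete and incomplete documents be stacked into a single matrix.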

If the dataset used for schema inference is undersampled and contains no document in which the key is missing, suggestextractor will infer an unsuitable Extractor:

julia> sch = schema(jss[1:1])
DictEntry 1x updated
  ├── a: LeafEntry (1 unique `Real` values) 1x updated
  ╰── b: LeafEntry (1 unique `String` values) 1x updated
julia> e = suggestextractor(sch)
DictExtractor
  ├── a: CategoricalExtractor(n=2)
  ╰── b: CategoricalExtractor(n=2)
julia> e(jss[2])
ERROR: IncompatibleExtractor at path [:a]: this path contains missing data not supported by this extractor! See the `Stable Extractors` section in the docs.

There are multiple ways to deal with this problem:

Manually wrap the problematic node (here with the help of Accessors.jl):

using Accessors

julia> e_stable = @set e.children[:a] = StableExtractor(e[:a])
DictExtractor
  ├── a: StableExtractor(CategoricalExtractor(n=2))
  ╰── b: CategoricalExtractor(n=2)
julia> e_stable(jss[2])
ProductNode  1 obs
  ├── a: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements)  1 o ⋯
  ╰── b: ArrayNode(2×1 OneHotArray with Bool elements)  1 obs

Use stabilizeextractor on the whole tree (or a subtree):

julia> e_stable = stabilizeextractor(e)
DictExtractor
  ├── a: StableExtractor(CategoricalExtractor(n=2))
  ╰── b: StableExtractor(CategoricalExtractor(n=2))
julia> e_stable(jss[2])
ProductNode  1 obs
  ├── a: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements)  1 o ⋯
  ╰── b: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements)  1 o ⋯

Call suggestextractor with all_stable=true. Now all document values are treated as possibly missing. The results of stabilizeextractor(suggestextractor(...)) and suggestextractor(...; all_stable=true) are roughly equivalent:

julia> e_stable = suggestextractor(sch; all_stable=true)
DictExtractor
  ├── a: StableExtractor(CategoricalExtractor(n=2))
  ╰── b: StableExtractor(CategoricalExtractor(n=2))
julia> e_stable(jss[2])
ProductNode  1 obs
  ├── a: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements)  1 o ⋯
  ╰── b: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements)  1 o ⋯

Preprocess the data (delete the problematic key from all documents or the schema, or make sure that documents with the missing key are present in the data when calling schema).
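
The first preprocessing option, deleting the problematic key from all documents, might look like the sketch below. It uses plain Dicts as stand-ins for parsed JSON documents and only Base functions. The variable name `docs` is illustrative.

```julia
# Plain Dicts stand in for parsed JSON documents.
docs = [
    Dict{String,Any}("a" => 1, "b" => "foo"),
    Dict{String,Any}("b" => "bar"),
]

# Drop the key that is not present in every document. `delete!` leaves a
# Dict unchanged when the key is absent, so incomplete documents need no
# special handling.
foreach(doc -> delete!(doc, "a"), docs)
```

As noted under Consistent preprocessing, the same deletion then has to be applied to every document passed to the extractor later.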
