Extraction
An Extractor is responsible for converting JSON documents into Mill.jl structures. The main idea is that the extractor follows the same hierarchical structure as the previously inferred Schema. An extractor for a whole JSON document is created by composing (sub-)extractors in a way that mirrors the JSON structure.
Assume the following dataset of two JSON documents for which we infer a Schema:
using JSON

julia> jss = JSON.parse("""[
           {
               "name": "Karl",
               "siblings": ["Gertruda", "Heike", "Fritz"],
               "hobby": ["running", "pingpong"],
               "age": 21
           },
           {
               "name": "Heike",
               "siblings": ["Gertruda", "Heike", "Fritz"],
               "hobby": ["yoga"],
               "age": 24
           }
       ]""");

julia> sch = schema(jss)
DictEntry 2x updated
  ├─────── age: LeafEntry (2 unique `Real` values) 2x updated
  ├───── hobby: ArrayEntry 2x updated
  │               ╰── LeafEntry (3 unique `String` values) 3x updated
  ├────── name: LeafEntry (2 unique `String` values) 2x updated
  ╰── siblings: ArrayEntry 2x updated
                  ╰── LeafEntry (3 unique `String` values) 6x updated
Manual creation of Extractors
One possible way to create an Extractor is to manually define it from all the required pieces. One extractor corresponding to sch might look like this:
julia> e = DictExtractor((
           name = NGramExtractor(),
           age = ScalarExtractor(),
           hobby = ArrayExtractor(CategoricalExtractor(["running", "swimming", "yoga"]))
       ))
DictExtractor
  ├─── name: NGramExtractor(n=3, b=256, m=2053)
  ├──── age: ScalarExtractor(c=0.0, s=1.0)
  ╰── hobby: ArrayExtractor
               ╰── CategoricalExtractor(n=4)
We have just created a DictExtractor with
- NGramExtractor to extract Strings under the "name" key,
- ScalarExtractor to extract age under the "age" key, and finally
- ArrayExtractor for extracting arrays under the "hobby" key. This extractor has one child, a CategoricalExtractor, which operates on three hobby categories.
Applying e on the first JSON document yields the following hierarchy of Mill.jl structures:
julia> x = e(jss[1])
ProductNode 1 obs
  ├─── name: ArrayNode(2053×1 NGramMatrix with Int64 elements) 1 obs
  ├──── age: ArrayNode(1×1 Array with Float32 elements) 1 obs
  ╰── hobby: BagNode 1 obs
               ╰── ArrayNode(4×2 OneHotArray with Bool elements) 2 obs
If any preprocessing was applied to the input documents (for example, as discussed in Preprocessing), make sure to apply the same preprocessing before passing documents to any Extractor as well!
Note that we didn't include any extractor for the "siblings" key. In such a case, the key in the JSON document is simply ignored and never extracted.
Every (sub)extractor, that is, every node in the extractor "tree", is also callable, for example:
julia> e[:hobby](jss[1]["hobby"])
BagNode 1 obs
  ╰── ArrayNode(4×2 OneHotArray with Bool elements) 2 obs
Let's inspect how the subtree under the "hobby" key in the JSON was extracted:
julia> printtree(x; trav=true)
ProductNode [""] 1 obs
  ├─── name: ArrayNode(2053×1 NGramMatrix with Int64 elements) ["E"] 1 obs
  ├──── age: ArrayNode(1×1 Array with Float32 elements) ["U"] 1 obs
  ╰── hobby: BagNode ["k"] 1 obs
               ╰── ArrayNode(4×2 OneHotArray with Bool elements) ["s"] 2 obs

julia> x["s"]
4×2 ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
 1  ⋅
 ⋅  ⋅
 ⋅  ⋅
 ⋅  1
The first column in the OneHotMatrix corresponds to "running", which is the first category in the corresponding CategoricalExtractor. The second column corresponds to "pingpong", which is an unknown category in the extractor. Any other unknown String would be extracted in the same way:
julia> e["s"]("unknown")
4×1 ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
 ⋅
 ⋅
 ⋅
 1
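To make the handling of unknown values concrete, here is a minimal plain-Julia sketch of the encoding idea. The helper onehot_with_unknown is hypothetical and is not part of JsonGrinder.jl. It only mimics the behaviour shown above, where one extra row is reserved for values outside the known categories.

```julia
# Hypothetical helper (NOT JsonGrinder.jl API): one-hot encode `values` against
# the known `categories`, reserving the last row for any unknown value.
function onehot_with_unknown(categories::Vector{String}, values::Vector{String})
    n = length(categories) + 1                       # one extra row for unknowns
    index = Dict(c => i for (i, c) in enumerate(categories))
    m = falses(n, length(values))                    # n × length(values) BitMatrix
    for (j, v) in enumerate(values)
        m[get(index, v, n), j] = true                # unknown values land in row n
    end
    return m
end

# Same categories and values as in the "hobby" example above.
m = onehot_with_unknown(["running", "swimming", "yoga"], ["running", "pingpong"])
```

Here the column for "running" is hot in row 1 and the column for "pingpong" is hot in row 4, which matches the 4×2 matrix printed for x["s"].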
For more information about individual subtypes of Extractor, see their docs or the Extractor API.
Semi-automatic Extractor creation
Manually creating Extractors is a laborious and error-prone process once the hierarchical structure of input JSON documents gets large. For this reason, JsonGrinder.jl provides the suggestextractor function, which greatly simplifies this process:
julia> e = suggestextractor(sch)
DictExtractor
  ├─────── age: CategoricalExtractor(n=3)
  ├────── name: CategoricalExtractor(n=3)
  ├── siblings: ArrayExtractor
  │               ╰── CategoricalExtractor(n=4)
  ╰───── hobby: ArrayExtractor
                  ╰── CategoricalExtractor(n=4)
The function uses a simple heuristic for choosing reasonable extractors for values in leaves: if there are not many unique values (less than the categorical_limit keyword argument), use CategoricalExtractor, else use either NGramExtractor or ScalarExtractor depending on the type.
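The heuristic can be illustrated with a small plain-Julia sketch. The function choose_leaf_extractor below is hypothetical and is not how JsonGrinder.jl implements it. It returns only a Symbol naming the kind of extractor, and the default limit of 100 is an arbitrary assumption made for this example.

```julia
# Hypothetical sketch of the leaf heuristic (NOT the JsonGrinder.jl implementation).
# Returns a Symbol naming the kind of extractor that would be suggested.
function choose_leaf_extractor(values::AbstractVector; categorical_limit::Int=100)
    if length(unique(values)) < categorical_limit
        return :CategoricalExtractor     # few unique values: treat them as categories
    elseif eltype(values) <: AbstractString
        return :NGramExtractor           # many unique strings
    else
        return :ScalarExtractor          # many unique numbers
    end
end

few_names  = choose_leaf_extractor(["Karl", "Heike"])
many_names = choose_leaf_extractor(["Karl", "Heike"]; categorical_limit=1)
many_ages  = choose_leaf_extractor([21, 24]; categorical_limit=1)
```

With the default limit both names count as categories, while a limit of 1 pushes the strings to an n-gram extractor and the numbers to a scalar extractor.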
It is possible to hook into the internals of how suggestextractor treats values in leaves by redefining _suggestextractor(e::LeafEntry).
Please refer to the suggestextractor docs for all possible keyword arguments.
It is recommended to check the proposed extractor manually and to modify it where that makes sense.
Stable Extractors
Sometimes not all JSON documents in a dataset are complete. For example:
julia> jss = JSON.parse.([
           """ { "a" : 1, "b" : "foo" } """,
           """ { "b" : "bar" } """
       ]);

julia> sch = schema(jss)
DictEntry 2x updated
  ├── a: LeafEntry (1 unique `Real` values) 1x updated
  ╰── b: LeafEntry (2 unique `String` values) 2x updated
In such a case, suggestextractor wraps the extractor corresponding to the key with missing data ("a") in a StableExtractor:
julia> e = suggestextractor(sch)
DictExtractor
  ├── a: StableExtractor(CategoricalExtractor(n=2))
  ╰── b: CategoricalExtractor(n=3)
and the extraction works fine:
julia> extract(e, jss)
ProductNode 2 obs
  ├── a: ArrayNode(2×2 MaybeHotMatrix with Union{Missing, Bool} elements) 2 o ⋯
  ╰── b: ArrayNode(3×2 OneHotArray with Bool elements) 2 obs
If the dataset used for schema inference is undersampled and the missing key doesn't show up, suggestextractor will infer an unsuitable Extractor:
julia> sch = schema(jss[1:1])
DictEntry 1x updated
  ├── a: LeafEntry (1 unique `Real` values) 1x updated
  ╰── b: LeafEntry (1 unique `String` values) 1x updated

julia> e = suggestextractor(sch)
DictExtractor
  ├── a: CategoricalExtractor(n=2)
  ╰── b: CategoricalExtractor(n=2)

julia> e(jss[2])
ERROR: IncompatibleExtractor at path [:a]: this path contains missing data not supported by this extractor! See the `Stable Extractors` section in the docs.
There are multiple ways to deal with this problem:
- Manually wrap the problematic node (here with the help of Accessors.jl):
using Accessors

julia> e_stable = @set e.children[:a] = StableExtractor(e[:a])
DictExtractor
  ├── a: StableExtractor(CategoricalExtractor(n=2))
  ╰── b: CategoricalExtractor(n=2)

julia> e_stable(jss[2])
ProductNode 1 obs
  ├── a: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements) 1 o ⋯
  ╰── b: ArrayNode(2×1 OneHotArray with Bool elements) 1 obs
- Use stabilizeextractor on the whole tree (or a subtree):
julia> e_stable = stabilizeextractor(e)
DictExtractor
  ├── a: StableExtractor(CategoricalExtractor(n=2))
  ╰── b: StableExtractor(CategoricalExtractor(n=2))

julia> e_stable(jss[2])
ProductNode 1 obs
  ├── a: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements) 1 o ⋯
  ╰── b: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements) 1 o ⋯
- Call suggestextractor with all_stable=true. Now all document values are treated as possibly missing. The results of stabilizeextractor(suggestextractor(...)) and suggestextractor(...; all_stable=true) are roughly equivalent:
julia> e_stable = suggestextractor(sch; all_stable=true)
DictExtractor
  ├── a: StableExtractor(CategoricalExtractor(n=2))
  ╰── b: StableExtractor(CategoricalExtractor(n=2))

julia> e_stable(jss[2])
ProductNode 1 obs
  ├── a: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements) 1 o ⋯
  ╰── b: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements) 1 o ⋯
- Preprocess the data (delete the problematic key from all documents or the schema, or make sure that documents with the missing key are present in the data when calling schema).
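The preprocessing option can be sketched in plain Julia on documents that have already been parsed. The documents below are built by hand as Dicts to keep the example self-contained, and the variable names are illustrative only. This is one possible approach, not a procedure prescribed by JsonGrinder.jl.

```julia
# Parsed documents represented as plain Dicts (built by hand for this sketch).
docs = [
    Dict{String, Any}("a" => 1, "b" => "foo"),
    Dict{String, Any}("b" => "bar"),
]

# Option 1: drop the problematic key "a" from a copy of every document.
# `delete!` is a no-op for documents that do not contain the key.
cleaned = [delete!(copy(d), "a") for d in docs]

# Option 2: before inferring a schema, check that at least one document lacks
# the key, so that the missing value is visible during schema inference.
has_incomplete = any(d -> !haskey(d, "a"), docs)
```

Whichever cleaning step is used has to be applied again to every document before it is passed to an extractor.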