Extraction
An Extractor is responsible for converting JSON documents into Mill.jl structures. The main idea is that the extractor follows the same hierarchical structure as the previously inferred Schema. The extractor for a whole JSON document is created by composing (sub-)extractors that mirror the JSON structure.
Assume the following dataset of two JSON documents for which we infer a Schema:
julia> using JSON, JsonGrinder

julia> jss = JSON.parse("""[
           { "name": "Karl", "siblings": ["Gertruda", "Heike", "Fritz"], "hobby": ["running", "pingpong"], "age": 21 },
           { "name": "Heike", "siblings": ["Gertruda", "Heike", "Fritz"], "hobby": ["yoga"], "age": 24 }
       ]""");
julia> sch = schema(jss)
DictEntry 2x updated
  ├─────── age: LeafEntry (2 unique `Real` values) 2x updated
  ├───── hobby: ArrayEntry 2x updated
  │               ╰── LeafEntry (3 unique `String` values) 3x updated
  ├────── name: LeafEntry (2 unique `String` values) 2x updated
  ╰── siblings: ArrayEntry 2x updated
                  ╰── LeafEntry (3 unique `String` values) 6x updated
Manual creation of Extractors
One possible way to create an Extractor is to define it manually from all the required pieces. One extractor corresponding to sch might look like this:
julia> e = DictExtractor((
           name = NGramExtractor(),
           age = ScalarExtractor(),
           hobby = ArrayExtractor(CategoricalExtractor(["running", "swimming", "yoga"]))
       ))
DictExtractor
  ├─── name: NGramExtractor(n=3, b=256, m=2053)
  ├──── age: ScalarExtractor(c=0.0, s=1.0)
  ╰── hobby: ArrayExtractor
               ╰── CategoricalExtractor(n=4)
We have just created a DictExtractor with an NGramExtractor to extract Strings under the "name" key, a ScalarExtractor to extract the age under the "age" key, and finally an ArrayExtractor for extracting arrays under the "hobby" key. This extractor has one child, a CategoricalExtractor, which operates on three hobby categories.
Applying e to the first JSON document yields the following hierarchy of Mill.jl structures:
julia> x = e(jss[1])
ProductNode 1 obs
  ├─── name: ArrayNode(2053×1 NGramMatrix with Int64 elements) 1 obs
  ├──── age: ArrayNode(1×1 Array with Float32 elements) 1 obs
  ╰── hobby: BagNode 1 obs
               ╰── ArrayNode(4×2 OneHotArray with Bool elements) 2 obs
If any preprocessing was performed on the input documents, as discussed for example in Preprocessing, make sure to apply the same preprocessing before passing documents to any Extractor as well!
Note that we didn't include any extractor for the "siblings" key. In such a case, the key in the JSON document is simply ignored and never extracted.
Every (sub-)extractor, that is, every node in the extractor "tree", is also callable, for example:
julia> e[:hobby](jss[1]["hobby"])
BagNode 1 obs
  ╰── ArrayNode(4×2 OneHotArray with Bool elements) 2 obs
Let's inspect how the subtree under the "hobby" key in the JSON was extracted:
julia> printtree(x; trav=true)
ProductNode [""] 1 obs
  ├─── name: ArrayNode(2053×1 NGramMatrix with Int64 elements) ["E"] 1 obs
  ├──── age: ArrayNode(1×1 Array with Float32 elements) ["U"] 1 obs
  ╰── hobby: BagNode ["k"] 1 obs
               ╰── ArrayNode(4×2 OneHotArray with Bool elements) ["s"] 2 obs
julia> x["s"]
4×2 ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
 1  ⋅
 ⋅  ⋅
 ⋅  ⋅
 ⋅  1
The first column in the OneHotMatrix corresponds to "running", which is the first category in the corresponding CategoricalExtractor. The second column corresponds to "pingpong", which is an unknown category in the extractor. Any other unknown String would be extracted in the same way:
julia> e["s"]("unknown")
4×1 ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
 ⋅
 ⋅
 ⋅
 1
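To make the encoding above concrete, here is a minimal sketch in plain Julia of how a categorical value could be mapped to a one-hot column with one extra row reserved for unknown values. The helper onehot_column is hypothetical and written only for illustration; it is not part of JsonGrinder.jl and makes no claim about how CategoricalExtractor is implemented internally.

```julia
# Hypothetical helper for illustration only, not part of JsonGrinder.jl.
# Known categories occupy rows 1..n, row n + 1 is reserved for unknown values.
function onehot_column(categories::Vector{String}, value::String)
    column = falses(length(categories) + 1)
    # findfirst returns nothing for an unseen value, so fall back to the last row
    row = something(findfirst(==(value), categories), length(categories) + 1)
    column[row] = true
    return column
end

categories = ["running", "swimming", "yoga"]
known = onehot_column(categories, "running")     # hot entry in row 1
unknown = onehot_column(categories, "pingpong")  # hot entry in row 4
```

This mirrors why a CategoricalExtractor built from three categories reports n=4 in the listings above: the fourth row collects every value that was not among the known categories.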
For more information about individual subtypes of Extractor, see their docs, or Extractor API.
Semi-automatic Extractor creation
Manually creating Extractors is a laborious and error-prone process once the hierarchical structure of input JSON documents gets large. For this reason, JsonGrinder.jl provides the suggestextractor function, which greatly simplifies this process:
julia> e = suggestextractor(sch)
DictExtractor
  ├─────── age: CategoricalExtractor(n=3)
  ├────── name: CategoricalExtractor(n=3)
  ├── siblings: ArrayExtractor
  │               ╰── CategoricalExtractor(n=4)
  ╰───── hobby: ArrayExtractor
                  ╰── CategoricalExtractor(n=4)
The function uses a simple heuristic for choosing reasonable extractors for values in leaves: if there are not many unique values (fewer than the categorical_limit keyword argument), use CategoricalExtractor, else use either NGramExtractor or ScalarExtractor depending on the type.
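The decision rule described above can be sketched in plain Julia as follows. The function choose_leaf_extractor is hypothetical and only mimics the described heuristic by returning the name of the extractor type as a Symbol; it is not the actual JsonGrinder.jl implementation, and the assumption that Strings map to NGramExtractor and other values to ScalarExtractor follows the manual example earlier on this page.

```julia
# Hypothetical sketch of the leaf heuristic, not the actual JsonGrinder.jl code.
# Returns a Symbol naming the kind of extractor that would be chosen.
function choose_leaf_extractor(leaf_values::Vector; categorical_limit::Int)
    if length(unique(leaf_values)) < categorical_limit
        # few unique values: treat the leaf as categorical
        return :CategoricalExtractor
    elseif all(v -> v isa AbstractString, leaf_values)
        # many unique strings (assumption: strings go to n-grams)
        return :NGramExtractor
    else
        # many unique non-string values (assumption: numbers go to scalars)
        return :ScalarExtractor
    end
end

few = choose_leaf_extractor(["yoga", "running", "yoga"]; categorical_limit=10)
strings = choose_leaf_extractor(["Karl", "Heike"]; categorical_limit=2)
numbers = choose_leaf_extractor([21, 24]; categorical_limit=2)
```

In this sketch, lowering categorical_limit pushes more leaves away from CategoricalExtractor, which is the knob the real keyword argument is described as providing.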
It is possible to hook into the internals of how suggestextractor treats values in leaves by redefining _suggestextractor(e::LeafEntry).
Please refer to the suggestextractor docs for all possible keyword arguments.
It is recommended to check the proposed extractor manually and to modify it where that makes sense.
Stable Extractors
Sometimes not all JSON documents in a dataset are complete. For example:
julia> jss = JSON.parse.([
           """ { "a" : 1, "b" : "foo" } """,
           """ { "b" : "bar" } """
       ]);

julia> sch = schema(jss)
DictEntry 2x updated
  ├── a: LeafEntry (1 unique `Real` values) 1x updated
  ╰── b: LeafEntry (2 unique `String` values) 2x updated
In such a case, suggestextractor wraps the extractor corresponding to the key with missing data ("a") into a StableExtractor:
julia> e = suggestextractor(sch)
DictExtractor
  ├── a: StableExtractor(CategoricalExtractor(n=2))
  ╰── b: CategoricalExtractor(n=3)
and the extraction works fine:
julia> extract(e, jss)
ProductNode 2 obs
  ├── a: ArrayNode(2×2 MaybeHotMatrix with Union{Missing, Bool} elements) 2 o ⋯
  ╰── b: ArrayNode(3×2 OneHotArray with Bool elements) 2 obs
If the dataset used for schema inference is undersampled and the missing key doesn't show up, suggestextractor will infer an unsuitable Extractor:
julia> sch = schema(jss[1:1])
DictEntry 1x updated
  ├── a: LeafEntry (1 unique `Real` values) 1x updated
  ╰── b: LeafEntry (1 unique `String` values) 1x updated

julia> e = suggestextractor(sch)
DictExtractor
  ├── a: CategoricalExtractor(n=2)
  ╰── b: CategoricalExtractor(n=2)
julia> e(jss[2])
ERROR: IncompatibleExtractor at path [:a]: this path contains missing data not supported by this extractor! See the `Stable Extractors` section in the docs.
There are multiple ways to deal with this problem:
- Manually wrap the problematic node (here with the help of Accessors.jl):
julia> using Accessors

julia> e_stable = @set e.children[:a] = StableExtractor(e[:a])
DictExtractor
  ├── a: StableExtractor(CategoricalExtractor(n=2))
  ╰── b: CategoricalExtractor(n=2)

julia> e_stable(jss[2])
ProductNode 1 obs
  ├── a: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements) 1 o ⋯
  ╰── b: ArrayNode(2×1 OneHotArray with Bool elements) 1 obs
- Use stabilizeextractor on the whole tree (or a subtree):
julia> e_stable = stabilizeextractor(e)
DictExtractor
  ├── a: StableExtractor(CategoricalExtractor(n=2))
  ╰── b: StableExtractor(CategoricalExtractor(n=2))

julia> e_stable(jss[2])
ProductNode 1 obs
  ├── a: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements) 1 o ⋯
  ╰── b: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements) 1 o ⋯
- Call suggestextractor with all_stable=true. Now all document values are treated as possibly missing. The results of stabilizeextractor(suggestextractor(...)) and suggestextractor(...; all_stable=true) are roughly equivalent:
julia> e_stable = suggestextractor(sch; all_stable=true)
DictExtractor
  ├── a: StableExtractor(CategoricalExtractor(n=2))
  ╰── b: StableExtractor(CategoricalExtractor(n=2))

julia> e_stable(jss[2])
ProductNode 1 obs
  ├── a: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements) 1 o ⋯
  ╰── b: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements) 1 o ⋯
- Preprocess the data (delete the problematic key from all documents or the schema, or make sure that documents with the missing key are present in the data when calling schema).
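The first preprocessing option, deleting the problematic key from all documents, can be sketched with plain Julia dictionaries as follows. The documents are written as Dict literals that mirror the two JSON documents from this section, an assumption made to keep the example self-contained; the variable names docs and cleaned are hypothetical.

```julia
# Documents written as plain Dicts, mirroring the two JSON documents above
# (assumption: parsed JSON objects behave like Dict{String, Any}).
docs = [
    Dict{String, Any}("a" => 1, "b" => "foo"),
    Dict{String, Any}("b" => "bar"),
]

# Drop the key that is missing in some documents; filter returns new Dicts,
# so the original documents are left untouched.
cleaned = [filter(kv -> first(kv) != "a", d) for d in docs]
```

The cleaned documents would then be passed to schema, so that neither the inferred schema nor the suggested extractor contains the key "a".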