Extractors overview
Below, we first describe extractors of scalar values (i.e. leaves of the JSON tree), then proceed to the extractors of Array and Dict, and finish with some special extractors.
Extractors of scalar values are arguably the most important, but fortunately also the best understood ones. They control how values are converted to a Vector (or, more generally, a tensor) for the neural network. For example, they control whether a number should be represented as a number or as a one-hot encoded categorical variable. Similarly, they control how a String should be treated, although we admit that only n-grams are supported natively.
Because the mapping from JSON (or a different hierarchical structure) to Mill structures can be non-trivial, extractors have a keyword argument store_input, which, if true, causes the input data to be stored as metadata of the respective Mill structure. By default, it is false, because storing inputs can cause type instability in case of irregular input data and thus a performance loss. The store_input argument is propagated to leaves and is used primarily to store leaf values.
Because JsonGrinder supports working with missing values, each leaf extractor has a uniontypes field, which determines whether it can return missing values, and the extractor returns the appropriate data type based on this field. By default, uniontypes is true, so missing values are supported out of the box, but we advise setting it during extractor construction according to your data, because it may otherwise create unnecessarily many parameters. suggestextractor takes into account where missing values can and cannot be observed, based on statistics in the schema, and provides a sensible default extractor.
Numbers
struct ExtractScalar{T} <: AbstractExtractor
    c::T
    s::T
    uniontypes::Bool
end
Extracts a numerical value, centered by subtracting c and scaled by multiplying by s. Strings are converted to numbers. The extractor returns ArrayNode{Matrix{T}} with a single row if uniontypes is false, and ArrayNode{Matrix{Union{Missing, T}}} with a single row if uniontypes is true.
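The centering and scaling step can be illustrated without JsonGrinder. The following is a minimal plain-Julia sketch, assuming the transformation is s * (x - c), which agrees with the example in this section (4.0 * (1 - 0.5) = 2.0). The name toy_scalar is hypothetical and not part of the package.

```julia
# Toy re-implementation of the centering/scaling step (not JsonGrinder's code).
# Assumption: the transformation is s * (x - c).
toy_scalar(x::Number; c = 0.0f0, s = 1.0f0) = Float32(s) * (Float32(x) - Float32(c))

# Strings are parsed as numbers first.
toy_scalar(x::AbstractString; kwargs...) = toy_scalar(parse(Float32, x); kwargs...)

# A missing input stays missing (the uniontypes = true behaviour).
toy_scalar(::Missing; kwargs...) = missing

toy_scalar("1"; c = 0.5, s = 4.0)      # 2.0f0
toy_scalar(missing; c = 0.5, s = 4.0)  # missing
```

The real extractor additionally wraps the result in an ArrayNode with a single row, which this sketch leaves out.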
e = ExtractScalar(Float32, 0.5, 4.0, true)
e("1").data
1×1 Matrix{Union{Missing, Float32}}:
2.0f0
A missing value is extracted as a missing value, as it is automatically handled downstream by Mill.
e(missing)
1×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
missing
The call e("1") is equivalent to e("1", store_input=false). To see the input data in the metadata of the ArrayNode, we can run
e("1", store_input=true).metadata
1×1 Matrix{String}:
"1"
The data remain unchanged:
e("1", store_input=true).data
1×1 Matrix{Union{Missing, Float32}}:
2.0f0
By default, metadata contains nothing.
If uniontypes is false, it looks as follows:
julia> e = ExtractScalar(Float32, 0.5, 4.0, false)
Float32
julia> e("1").data
1×1 Matrix{Float32}:
 2.0f0
julia> e("1", store_input=true).data
1×1 Matrix{Float32}:
 2.0f0
julia> e("1", store_input=true).metadata
1×1 Matrix{String}:
 "1"
julia> e(missing)
ERROR: This extractor does not support missing values
Strings
struct ExtractString <: AbstractExtractor
    n::Int
    b::Int
    m::Int
    uniontypes::Bool
end
Represents String as n-grams (NGramMatrix from Mill.jl) with base b and modulo m.
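To give an intuition for the roles of n, b and m, here is a simplified plain-Julia sketch of n-gram hashing: every window of n bytes is read as a number in base b and reduced modulo m, which yields a row index. This is an illustration only. The function name toy_ngram_indices is hypothetical, the defaults n = 3 and b = 256 are assumptions (m = 2053 matches the number of rows in the outputs of this section), and Mill's NGramMatrix may pad the string and compute indices differently.

```julia
# Toy n-gram hashing (a simplification, not Mill's implementation).
# Each window of n bytes is read as a number in base b and reduced modulo m.
function toy_ngram_indices(s::AbstractString; n::Int = 3, b::Int = 256, m::Int = 2053)
    bytes = codeunits(s)
    indices = Int[]
    for start in 1:(length(bytes) - n + 1)
        h = 0
        for k in 0:(n - 1)
            # Horner's rule, reduced at every step to avoid overflow
            h = (h * b + Int(bytes[start + k])) % m
        end
        push!(indices, h + 1)   # 1-based row index in 1:m
    end
    return indices
end

idx = toy_ngram_indices("Hello")   # one index per trigram: "Hel", "ell", "llo"
```

Counting how often each index occurs gives a fixed-length vector of dimension m regardless of the string length, which is why a small modulo keeps the representation compact at the cost of hash collisions.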
e = ExtractString()
e("Hello")
2053×1 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}:
"Hello"
A missing value is extracted as a missing value, as it is automatically handled downstream by Mill.
e(missing)
2053×1 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}:
missing
Storing input works in the same manner as for ExtractScalar:
e("Hello", store_input=true).metadata
1-element Vector{String}:
"Hello"
It works the same way with missing values:
e(missing, store_input=true).metadata
1-element Vector{Missing}:
missing
If we know we won't have missing strings, we can disable uniontypes:
julia> e = ExtractString(false)
String
julia> e("Hello")
2053×1 ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}:
 "Hello"
julia> e(missing)
ERROR: This extractor does not support missing values
julia> e("Hello", store_input=true).metadata
1-element Vector{String}:
 "Hello"
Categorical
struct ExtractCategorical{V,I} <: AbstractExtractor
    keyvalemap::Dict{V,I}
    n::Int
    uniontypes::Bool
end
Converts a single item to a one-hot encoded vector. For safety, there is always an extra item reserved for an unknown value.
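The idea of a reserved slot for unknown values can be sketched in plain Julia. The function toy_onehot below is hypothetical and only mimics the encoding. In particular, the row order (known values in the given order, unknown last) is an assumption of this sketch, and missing values are not handled.

```julia
# Toy one-hot encoding with a reserved slot for unknown values (not JsonGrinder's code).
function toy_onehot(keys::Vector{String}, v::AbstractString)
    index = Dict(k => i for (i, k) in enumerate(keys))  # known value => row
    n = length(keys) + 1                                 # last row is the "unknown" slot
    x = falses(n)
    x[get(index, v, n)] = true                           # unseen values fall into row n
    return x
end

toy_onehot(["A", "B", "C"], "B")   # rows: 0 1 0 0
toy_onehot(["A", "B", "C"], "D")   # rows: 0 0 0 1 (the unknown slot)
```

Three known values thus give a vector of dimension four, matching the 4×1 shapes in the outputs of this section.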
e = ExtractCategorical(["A","B","C"])
e(["A","B","C","D"]).data
4×1 MaybeHotMatrix with eltype Union{Missing, Bool}:
missing
missing
missing
missing
A missing value is extracted as a missing value, as it is automatically handled downstream by Mill.
e(missing)
4×1 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
missing
missing
missing
missing
Storing input in this case looks as follows
e(["A","B","C","D"], store_input=true).metadata
1-element Vector{Vector{String}}:
["A", "B", "C", "D"]
The uniontypes setting works the same way as with scalars or strings.
Use-cases for unknown value
The last dimension for the unknown value can be used e.g. to represent sparse data, where frequent values have their own dimension and all scarce values share this single dimension. This is useful for heavy-tailed distributions, where many dimensions would be used only rarely but would raise the number of trained parameters significantly.
Of course, if all values in the training set are represented explicitly and the unknown value is not present during training, the model will not learn the unknown representation, and it will produce noise when an unknown value is encountered at inference time.
An example of a schema with a heavy tail is the following histogram, in which the number of observations decays exponentially:
using Random
ht_hist = Dict((i==1 ? "aaaaa" : randstring(5))=>ceil(100*ℯ^(-i/5)) for i in 1:1000)
entry = JsonGrinder.Entry{String}(ht_hist, sum(values(ht_hist)))
[Scalar - String], 1000 unique values # updated = 1437
We can see that it has 1000 unique values and was created from 1437 observations, where 977 values were observed only once. Creating the extractor directly and applying it to a value produces a one-hot encoded vector of dimension 1001 (1000 unique values + 1 dimension for the unknown).
ExtractCategorical(entry)("aaaaa")
1001×1 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
 ⋅
 ⋅
 ⋮
 true
 ⋮
 ⋅
 ⋅
But when we keep only values seen at least 5 times, the extractor produces a one-hot vector of dimension 17, which is significantly smaller.
ExtractCategorical(keys(filter(kv->kv[2]>=5, entry.counts)))("aaaaa")
17×1 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
⋅
⋅
⋅
⋅
⋅
⋅
⋅
⋅
⋅
true
⋅
⋅
⋅
⋅
⋅
⋅
⋅
For training, the latter approach may be beneficial, because the number of weights will be significantly lower.
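The numbers quoted in this example can be checked without JsonGrinder. The sketch below recomputes only the counts of the histogram (the random keys are irrelevant for the arithmetic); the variable names are illustrative.

```julia
# Recompute the counts of the heavy-tailed histogram without JsonGrinder.
# Only the counts are generated, because the keys do not affect the arithmetic.
counts = [ceil(Int, 100 * exp(-i / 5)) for i in 1:1000]

total      = sum(counts)            # number of observations
singletons = count(==(1), counts)   # values observed exactly once
frequent   = count(>=(5), counts)   # values observed at least 5 times

println((total, singletons, frequent, frequent + 1))   # (1437, 977, 16, 17)
```

The last number is the dimension of the thresholded one-hot vector: 16 frequent values plus one dimension for the unknown.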
Difference between unknown and missing value
Note that there is a semantic difference between an unknown and a missing value. For an unknown value, a special dimension is trained. The missing value is handled similarly: the neural network contains a trainable vector, which is used as the output of the specific layer every time a missing value is encountered there. This allows the model to distinguish between missing values (e.g. when a key in a dict is not present, the value under that key is treated as missing) and unknown, previously unseen values.
Array (Lists / Sets)
struct ExtractArray{T}
    item::T
end
Converts an array of values to a Mill.BagNode with items converted by item. The entire array is assumed to be a single bag. The BagNode contains data in the field data and information about bags in the field bags.
sc = ExtractArray(ExtractCategorical(["A","B","C"]))
sc(["A","B","C","D"])
BagNode # 1 obs, 88 bytes
╰── ArrayNode(4×4 MaybeHotMatrix with Union{Missing, Bool} elements) # 4 obs, 92 bytes
Empty arrays are represented as an empty bag.
sc([])
BagNode # 1 obs, 88 bytes
╰── ArrayNode(4×0 MaybeHotMatrix with Union{Missing, Bool} elements) # 0 obs, 72 bytes
The information about the bags themselves can be seen here:
sc([]).bags
AlignedBags{Int64}(UnitRange{Int64}[0:-1])
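The range 0:-1 above is an empty range, which is how the empty bag is encoded. As a rough plain-Julia illustration (not Mill's implementation; the names data, bags and items are made up for this sketch), bags can be thought of as index ranges into the columns of the data:

```julia
# Toy illustration of bags as index ranges into the columns of the data matrix.
data = ["A" "B" "C" "D"]     # 1×4 matrix: four observations (items)
bags = [1:4, 0:-1]           # the first bag holds all four items, the second is empty

items(bag) = data[:, bag]    # select the columns that belong to a bag

size(items(bags[1]))   # (1, 4)
size(items(bags[2]))   # (1, 0): an empty bag still yields a matrix of the same type
isempty(0:-1)          # true
```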
The data of an empty bag can be either missing or an empty sample. The latter is more convenient, as it makes all samples have the same type, which is friendlier to automatic differentiation. This behavior is controlled by Mill.emptyismissing. The extractor of a BagNode can signal to child extractors to extract a sample with zero observations using a special singleton JsonGrinder.extractempty. For example
Mill.emptyismissing!(true)
sc([]).data
missing
Mill.emptyismissing!(false)
sc([]).data
4×0 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}
Storing input is delegated to leaf extractors, so metadata of bag itself are empty
sc(["A","B","C","D"], store_input=true).metadata
but the metadata of the underlying ArrayNode contain the inputs.
sc(["A","B","C","D"], store_input=true).data.metadata
4-element Vector{String}:
"A"
"B"
"C"
"D"
In case of empty arrays, the input is stored in the metadata of the BagNode itself, because there might not be any underlying ArrayNode.
sc([], store_input=true).metadata
1-element Vector{Vector{Any}}:
[]
Dict
struct ExtractDict{S} <: AbstractExtractor
    dict::S
end
Extracts all items in dict and returns them as a ProductNode. Keys in dict correspond to keys in the JSON.
ex = ExtractDict(Dict(:a => ExtractScalar(),
                      :b => ExtractString(),
                      :c => ExtractCategorical(["A","B"]),
                      :d => ExtractArray(ExtractString())))
ex(Dict(:a => "1",
        :b => "Hello",
        :c => "A",
        :d => ["Hello", "world"]))
ProductNode # 1 obs, 96 bytes
├── a: ArrayNode(1×1 Array with Union{Missing, Float32} elements) # 1 obs, 53 bytes
├── b: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements) # 1 obs, 125 bytes
├── d: BagNode # 1 obs, 104 bytes
│ ╰── ArrayNode(2053×2 NGramMatrix with Union{Missing, Int64} elements) # 2 obs, 146 bytes
╰── c: ArrayNode(3×1 MaybeHotMatrix with Union{Missing, Bool} elements) # 1 obs, 77 bytes
Missing keys are replaced by missing and handled by child extractors.
ex(Dict(:a => "1",
        :c => "A"))
ProductNode # 1 obs, 96 bytes
├── a: ArrayNode(1×1 Array with Union{Missing, Float32} elements) # 1 obs, 53 bytes
├── b: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements) # 1 obs, 112 bytes
├── d: BagNode # 1 obs, 104 bytes
│ ╰── ArrayNode(2053×0 NGramMatrix with Union{Missing, Int64} elements) # 0 obs, 104 bytes
╰── c: ArrayNode(3×1 MaybeHotMatrix with Union{Missing, Bool} elements) # 1 obs, 77 bytes
Storing input data works in a similar manner as for ExtractArray: input data are delegated to leaf extractors.
julia> ex(Dict(:a => "1", :c => "A"), store_input=true).metadata
julia> ex(Dict(:a => "1", :c => "A"), store_input=true)[:a].metadata
1×1 Matrix{String}:
 "1"
julia> ex(Dict(:a => "1", :c => "A"), store_input=true)[:b].metadata
1-element Vector{Nothing}:
 nothing
julia> ex(Dict(:a => "1", :c => "A"), store_input=true)[:c].metadata
1-element Vector{String}:
 "A"
julia> ex(Dict(:a => "1", :c => "A"), store_input=true)[:d].metadata
1-element Vector{Nothing}:
 nothing
or
julia> ex(Dict(:a => "1", :b => "Hello", :c => "A", :d => ["Hello", "world"]), store_input=true).metadata
julia> ex(Dict(:a => "1", :b => "Hello", :c => "A", :d => ["Hello", "world"]), store_input=true)[:a].metadata
1×1 Matrix{String}:
 "1"
julia> ex(Dict(:a => "1", :b => "Hello", :c => "A", :d => ["Hello", "world"]), store_input=true)[:b].metadata
1-element Vector{String}:
 "Hello"
julia> ex(Dict(:a => "1", :b => "Hello", :c => "A", :d => ["Hello", "world"]), store_input=true)[:c].metadata
1-element Vector{String}:
 "A"
julia> ex(Dict(:a => "1", :b => "Hello", :c => "A", :d => ["Hello", "world"]), store_input=true)[:d].metadata
julia> ex(Dict(:a => "1", :b => "Hello", :c => "A", :d => ["Hello", "world"]), store_input=true)[:d].data.metadata
2-element Vector{String}:
 "Hello"
 "world"
Specials
ExtractKeyAsField
Some JSONs we have encountered use Dicts to hold an array of named lists (or other types). Coming from a computer security background, our prototypical example is storing a list of DLLs with a corresponding list of imported functions in a single structure. For example, a JSON
{ "foo.dll" : ["print","write", "open","close"],
"bar.dll" : ["send", "recv"]
}
should be better written as
[{"key": "foo.dll",
"item": ["print","write", "open","close"]},
{"key": "bar.dll",
"item": ["send", "recv"]}
]
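The restructuring shown above can be sketched in plain Julia as follows. The function name toy_key_as_field is hypothetical, and the sketch only mirrors the reshaping of the data, not what the extractor does internally.

```julia
# Toy restructuring of a Dict with data in its keys into a list of key/item records.
# The field names "key" and "item" follow the JSON above; the function name is hypothetical.
function toy_key_as_field(d::AbstractDict)
    # Keys are sorted only to make the order of records deterministic.
    [Dict("key" => k, "item" => d[k]) for k in sort!(collect(keys(d)))]
end

imports = Dict("foo.dll" => ["print", "write", "open", "close"],
               "bar.dll" => ["send", "recv"])
records = toy_key_as_field(imports)
```

After the transformation every record has the same two fields, so the number of distinct keys in the schema no longer grows with the number of DLLs.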
JsonGrinder tries to detect these cases, as they typically manifest as Dicts with an excessively large number of keys in a schema. The detection logic in suggestextractor(e::DictEntry) is simple: if the number of unique keys in a specific Dict is greater than settings.key_as_field = 500, such a Dict is considered to hold values in its keys, and ExtractKeyAsField is used instead of ExtractDict. key_as_field can be set to any value based on the specific data or domain, but we have found 500 to be a reasonable default.
The extractor itself is simple as well. For the case above, it would look like
s = JSON.parse("{ \"foo.dll\" : [\"print\",\"write\", \"open\",\"close\"],
\"bar.dll\" : [\"send\", \"recv\"]
}")
ex = ExtractKeyAsField(ExtractString(),ExtractArray(ExtractString()))
ex(s)
BagNode # 1 obs, 144 bytes
╰── ProductNode # 2 obs, 72 bytes
├─── key: ArrayNode(2053×2 NGramMatrix with Union{Missing, Int64} elements) # 2 obs, 150 bytes
╰── item: BagNode # 2 obs, 120 bytes
╰── ArrayNode(2053×6 NGramMatrix with Union{Missing, Int64} elements) # 6 obs, 227 bytes
Because it returns a BagNode, missing values are treated in a similar manner as in ExtractArray, and the setting of Mill.emptyismissing applies here too.
Mill.emptyismissing!(true)
ex(Dict()).data
missing
Mill.emptyismissing!(false)
ex(Dict()).data
ProductNode # 0 obs, 72 bytes
├─── key: ArrayNode(2053×0 NGramMatrix with Union{Missing, Int64} elements) # 0 obs, 104 bytes
╰── item: BagNode # 0 obs, 88 bytes
╰── ArrayNode(2053×0 NGramMatrix with Union{Missing, Int64} elements) # 0 obs, 104 bytes
MultipleRepresentation
Provides a way to have multiple representations for a single value or subtree in a JSON. For example, imagine that you are extracting strings with some very frequently occurring values and a lot of clutter, which might be important without you knowing about it. MultipleRepresentation(extractors::Tuple) contains a Tuple or NamedTuple of extractors and applies them to a single subtree in a JSON. The corresponding Mill structure will contain a ProductNode of all the representations.
For example, a String with Categorical and NGram representations will look like
ex = MultipleRepresentation((c = ExtractCategorical(["Hello","world"]), s = ExtractString()))
reduce(catobs,ex.(["Hello","world","from","Prague"]))
ProductNode # 4 obs, 48 bytes
├── c: ArrayNode(3×4 MaybeHotMatrix with Union{Missing, Bool} elements) # 4 obs, 92 bytes
╰── s: ArrayNode(2053×4 NGramMatrix with Union{Missing, Int64} elements) # 4 obs, 188 bytes
Because it produces a ProductNode, missing values are delegated to leaf extractors.
ex(missing)
ProductNode # 1 obs, 48 bytes
├── c: ArrayNode(3×1 MaybeHotMatrix with Union{Missing, Bool} elements) # 1 obs, 77 bytes
╰── s: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements) # 1 obs, 112 bytes
MultipleRepresentation together with the handling of missing values enables JsonGrinder to deal with JSONs with a non-stable schema.
A minimal example of such a non-stable schema is a JSON which sometimes has a string and sometimes an array of numbers under the same key. Let's create an appropriate MultipleRepresentation (although in real-world usage the most suitable MultipleRepresentation is proposed by suggestextractor based on the observed data):
julia> ex = MultipleRepresentation((ExtractString(), ExtractArray(ExtractScalar(Float32))));
julia> e_hello = ex("Hello")
ProductNode # 1 obs, 48 bytes
├── e1: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements) # 1 obs, 125 bytes
╰── e2: BagNode # 1 obs, 80 bytes
    ╰── ArrayNode(1×0 Array with Union{Missing, Float32} elements) # 0 obs, 48 bytes
julia> e_hello[:e1].data
2053×1 NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}:
 "Hello"
julia> e_hello[:e2].data
1×0 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}
julia> e_123 = ex([1,2,3])
ProductNode # 1 obs, 48 bytes
├── e1: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements) # 1 obs, 112 bytes
╰── e2: BagNode # 1 obs, 80 bytes
    ╰── ArrayNode(1×3 Array with Union{Missing, Float32} elements) # 3 obs, 63 bytes
julia> e_123[:e1].data
2053×1 NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}:
 missing
julia> e_123[:e2].data
1×3 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
 1.0f0  2.0f0  3.0f0
julia> e_2 = ex([2])
ProductNode # 1 obs, 48 bytes
├── e1: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements) # 1 obs, 112 bytes
╰── e2: BagNode # 1 obs, 80 bytes
    ╰── ArrayNode(1×1 Array with Union{Missing, Float32} elements) # 1 obs, 53 bytes
julia> e_2[:e1].data
2053×1 NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}:
 missing
julia> e_2[:e2].data
1×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
 2.0f0
julia> e_world = ex("world")
ProductNode # 1 obs, 48 bytes
├── e1: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements) # 1 obs, 125 bytes
╰── e2: BagNode # 1 obs, 80 bytes
    ╰── ArrayNode(1×0 Array with Union{Missing, Float32} elements) # 0 obs, 48 bytes
julia> e_world[:e1].data
2053×1 NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}:
 "world"
julia> e_world[:e2].data
1×0 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}
In this example we can see that each time, one representation is missing or empty, while the other one contains data.
ExtractEmpty
As mentioned earlier, ExtractEmpty is a type used to extract a sample with 0 observations. There is a singleton extractempty, which is an instance of the ExtractEmpty type. MLUtils.numobs(ex(JsonGrinder.extractempty)) == 0 is required to hold for every extractor ex in order for it to work correctly.
All above-mentioned extractors are able to extract this, as we can see here
julia> ExtractString()(JsonGrinder.extractempty)
2053×0 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}
julia> ExtractString()(JsonGrinder.extractempty) |> numobs
0
julia> ExtractCategorical(["A","B"])(JsonGrinder.extractempty)
3×0 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}
julia> ExtractCategorical(["A","B"])(JsonGrinder.extractempty) |> numobs
0
julia> ExtractScalar()(JsonGrinder.extractempty)
1×0 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}
julia> ExtractScalar()(JsonGrinder.extractempty) |> numobs
0
julia> ExtractArray(ExtractString())(JsonGrinder.extractempty)
BagNode # 0 obs, 88 bytes
╰── ArrayNode(2053×0 NGramMatrix with Union{Missing, Int64} elements) # 0 obs, 104 bytes
julia> ExtractArray(ExtractString())(JsonGrinder.extractempty) |> numobs
0
julia> ExtractDict(Dict(:a => ExtractScalar(), :b => ExtractString(), :c => ExtractCategorical(["A","B"]), :d => ExtractArray(ExtractString())))(JsonGrinder.extractempty)
ProductNode # 0 obs, 96 bytes
├── a: ArrayNode(1×0 Array with Union{Missing, Float32} elements) # 0 obs, 48 bytes
├── b: ArrayNode(2053×0 NGramMatrix with Union{Missing, Int64} elements) # 0 obs, 104 bytes
├── d: BagNode # 0 obs, 88 bytes
│   ╰── ArrayNode(2053×0 NGramMatrix with Union{Missing, Int64} elements) # 0 obs, 104 bytes
╰── c: ArrayNode(3×0 MaybeHotMatrix with Union{Missing, Bool} elements) # 0 obs, 72 bytes
julia> ExtractDict(Dict(:a => ExtractScalar(), :b => ExtractString(), :c => ExtractCategorical(["A","B"]), :d => ExtractArray(ExtractString())))(JsonGrinder.extractempty) |> numobs
0