Extractors overview

Below, we first describe extractors of values (i.e. the leaves of a JSON tree), then proceed to extractors of Array and Dict, and finish with some special extractors.

Extractors of scalar values are arguably the most important, but fortunately also the best understood. They control how values are converted to a Vector (or, more generally, a tensor) for the neural network. For example, they control whether a number is represented as a number or as a one-hot encoded categorical variable. Similarly, they control how a String is treated, although we admit that only n-grams are supported natively.

Because the mapping from JSON (or another hierarchical structure) to Mill structures can be non-trivial, extractors accept a keyword argument store_input which, if true, causes the input data to be stored as metadata of the respective Mill structure. It is false by default, because storing irregular input data can cause type instability and thus a loss of performance. The store_input argument is propagated to the leaves and is used primarily to store leaf values.

Because JsonGrinder supports missing values, each leaf extractor has a uniontypes field which determines whether it can return missing values, and the extractor returns the appropriate data type based on this field. By default, uniontypes is true, so missing values are supported off the shelf, but we advise setting it during extractor construction according to your data, because it may otherwise create unnecessarily many parameters. suggestextractor uses the statistics in the schema to determine where missing values can be observed and provides a sensible default extractor.

Recall

Numbers

struct ExtractScalar{T} <: AbstractExtractor
	c::T
	s::T
	uniontypes::Bool
end

Extracts a numerical value, centered by subtracting c and scaled by multiplying by s. Strings are converted to numbers. The extractor returns ArrayNode{Matrix{T}} with a single row if uniontypes is false, and ArrayNode{Matrix{Union{Missing, T}}} with a single row if uniontypes is true.

e = ExtractScalar(Float32, 0.5, 4.0, true)
e("1").data

1×1 Matrix{Union{Missing, Float32}}:
 2.0f0

A missing value is extracted as a missing value, as it is automatically handled downstream by Mill.

e(missing)

1×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
 missing

Calling e("1") is equivalent to e("1", store_input=false). To see the input data in the metadata of the ArrayNode, we can run

e("1", store_input=true).metadata

1×1 Matrix{String}:
 "1"

The data remain unchanged

e("1", store_input=true).data

1×1 Matrix{Union{Missing, Float32}}:
 2.0f0

By default, metadata contains nothing.

And if uniontypes is false, it looks as follows

julia> e = ExtractScalar(Float32, 0.5, 4.0, false)
Float32
julia> e("1").data
1×1 Matrix{Float32}:
 2.0
julia> e("1", store_input=true).data
1×1 Matrix{Float32}:
 2.0
julia> e("1", store_input=true).metadata
1×1 Matrix{String}:
 "1"
julia> e(missing)
ERROR: This extractor does not support missing values
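
The centering and scaling described above can be sketched in plain Julia. The helper center_and_scale below is hypothetical and not part of JsonGrinder; it only mirrors the documented arithmetic (subtract c, multiply by s) on a bare number instead of an ArrayNode, and passes missing through unchanged.

```julia
# Hypothetical helper (an illustration, not JsonGrinder's implementation).
# Strings are parsed to numbers first, then centered by c and scaled by s.
function center_and_scale(x, c::T, s::T) where {T<:AbstractFloat}
    x === missing && return missing
    v = x isa AbstractString ? parse(T, x) : T(x)
    return (v - c) * s
end

center_and_scale("1", 0.5f0, 4.0f0)      # (1 - 0.5) * 4
center_and_scale(missing, 0.5f0, 4.0f0)  # stays missing
```

With c = 0.5 and s = 4.0, the input "1" maps to 2.0, which agrees with the value extracted by ExtractScalar in the examples of this section.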

Strings

struct ExtractString <: AbstractExtractor
	n::Int
	b::Int
	m::Int
	uniontypes::Bool
end

Represents String as n-grams (NGramMatrix from Mill.jl) with base b and modulo m.
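
To make the roles of n, b and m more concrete, here is a simplified sketch of hashing n-grams into a fixed number of rows. The helpers ngram_indices and ngram_counts are hypothetical; the NGramMatrix in Mill.jl may pad the string and compute indices differently. The values n = 3 and b = 256 are assumptions chosen for illustration, while m = 2053 matches the number of rows in the outputs of this section.

```julia
# Simplified illustration only, not Mill.jl's NGramMatrix.
# Every window of n bytes is read as a number in base b and reduced modulo m.
function ngram_indices(s::AbstractString, n::Int, b::Int, m::Int)
    bytes = codeunits(s)
    idxs = Int[]
    for i in 1:(length(bytes) - n + 1)
        h = 0
        for j in 0:(n - 1)
            h = (h * b + Int(bytes[i + j])) % m  # polynomial hash, kept below m
        end
        push!(idxs, h + 1)                       # shift to a 1-based row index
    end
    return idxs
end

# Histogram of n-gram indices: a vector of length m representing one string.
function ngram_counts(s::AbstractString, n::Int, b::Int, m::Int)
    x = zeros(Int, m)
    for i in ngram_indices(s, n, b, m)
        x[i] += 1
    end
    return x
end

x = ngram_counts("Hello", 3, 256, 2053)  # "Hel", "ell", "llo" land in at most 3 rows
```

The point of the modulo is that the representation has a fixed dimension m regardless of how many distinct n-grams occur in the data.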

e = ExtractString()
e("Hello")

2053×1 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}:
 "Hello"

A missing value is extracted as a missing value, as it is automatically handled downstream by Mill.

e(missing)

2053×1 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}:
 missing

Storing input works in the same manner as for ExtractScalar, see

e("Hello", store_input=true).metadata

1-element Vector{String}:
 "Hello"

It works the same way with missing values

e(missing, store_input=true).metadata

1-element Vector{Missing}:
 missing

and if we know we won't have missing strings, we can disable uniontypes:

julia> e = ExtractString(false)
String
julia> e("Hello")
2053×1 ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}:
 "Hello"
julia> e(missing)
ERROR: This extractor does not support missing values
julia> e("Hello", store_input=true).metadata
1-element Vector{String}:
 "Hello"

Categorical

struct ExtractCategorical{V,I} <: AbstractExtractor
	keyvalemap::Dict{V,I}
	n::Int
	uniontypes::Bool
end

Converts a single item to a one-hot encoded vector. For safety, an extra item is always reserved for an unknown value.

e = ExtractCategorical(["A","B","C"])
e(["A","B","C","D"]).data

4×1 MaybeHotMatrix with eltype Union{Missing, Bool}:
 missing
 missing
 missing
 missing

A missing value is extracted as a missing value, as it is automatically handled downstream by Mill.

e(missing)

4×1 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
 missing
 missing
 missing
 missing

Storing input in this case looks as follows

e(["A","B","C","D"], store_input=true).metadata

1-element Vector{Vector{String}}:
 ["A", "B", "C", "D"]

The uniontypes setting works the same way as for scalars and strings.
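
The idea of a one-hot encoding with one extra dimension reserved for unknown values can be sketched in plain Julia. The function onehot_with_unknown is a hypothetical helper, not the ExtractCategorical implementation; it returns a plain BitVector rather than a MaybeHotMatrix and does not handle missing.

```julia
# Hypothetical helper (illustration only): one dimension per known value,
# plus a last dimension shared by all values outside the vocabulary.
function onehot_with_unknown(value, vocabulary::Vector{String})
    index = Dict(v => i for (i, v) in enumerate(vocabulary))
    n = length(vocabulary) + 1          # last position is the unknown slot
    x = falses(n)
    x[get(index, value, n)] = true      # unseen values fall back to position n
    return x
end

known = onehot_with_unknown("B", ["A", "B", "C"])    # hot at position 2
unknown = onehot_with_unknown("D", ["A", "B", "C"])  # hot at the last position
```

A vocabulary of three values therefore yields vectors of dimension four, which is why the outputs above have four rows for the keys "A", "B" and "C".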

Use-cases for unknown value

The last dimension, reserved for unknown values, can be used e.g. to represent sparse data, where frequent values have their own dimension and all scarce values share this single dimension. This is useful for heavy-tailed distributions, where many dimensions would be used only rarely but would significantly raise the number of trained parameters.

Of course, if all values in the training set are represented explicitly and no unknown value is present during training, the model will not learn the unknown representation and will produce noise when it encounters an unknown value at inference time.

An example of a schema with a heavy tail is the following histogram, in which the number of observations decays exponentially

using Random
ht_hist = Dict((i==1 ? "aaaaa" : randstring(5))=>ceil(100*ℯ^(-i/5)) for i in 1:1000)
entry = JsonGrinder.Entry{String}(ht_hist, sum(values(ht_hist)))

[Scalar - String], 1000 unique values  # updated = 1437

We can see that it has 1000 unique values and was created from 1437 observations, with 977 values observed only once. Creating the extractor directly and extracting a value with it produces a one-hot encoded vector of dimension 1001 (1000 unique values + 1 dimension for the unknown).

ExtractCategorical(entry)("aaaaa")

1001×1 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
   ⋅
   ⋅
   ⋮
  true
   ⋮
   ⋅
   ⋅

But if we keep only the values seen at least 5 times, the extractor produces a one-hot vector of dimension 17, which is significantly smaller.

ExtractCategorical(keys(filter(kv->kv[2]>=5, entry.counts)))("aaaaa")

17×1 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
   ⋅  
   ⋅  
   ⋅  
   ⋅  
   ⋅  
   ⋅  
   ⋅  
   ⋅  
   ⋅  
  true
   ⋅  
   ⋅  
   ⋅  
   ⋅  
   ⋅  
   ⋅  
   ⋅

For training, the latter approach may be beneficial, because the number of weights will be significantly lower.
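
The numbers quoted above can be reproduced without JsonGrinder. The sketch below rebuilds the same exponentially decaying counts with deterministic keys (the names "value1", "value2", ... are placeholders standing in for the random strings) and counts the dimensions with and without the threshold.

```julia
# Same counts as in the histogram above, but with deterministic placeholder keys.
hist = Dict("value$i" => ceil(Int, 100 * exp(-i / 5)) for i in 1:1000)

total_observations = sum(values(hist))
singletons = count(==(1), values(hist))        # values observed exactly once
frequent = [k for (k, c) in hist if c >= 5]    # values observed at least 5 times

full_dim = length(hist) + 1        # every value plus the unknown dimension
reduced_dim = length(frequent) + 1 # frequent values plus the unknown dimension
```

Here full_dim is 1001 and reduced_dim is 17: all 984 values observed fewer than 5 times collapse into the single unknown dimension.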

Difference between unknown and missing value

Note that there is a semantic difference between an unknown and a missing value. For an unknown value, a special dimension is trained. A missing value is handled similarly: for each layer where a missing value can be encountered, the neural network contains a trainable vector that is used as the output in case of a missing observation. This allows the model to distinguish between missing values (e.g. when a key is not present in a dict, the value under that key is treated as missing) and unknown, previously unseen values.

Array (Lists / Sets)

struct ExtractArray{T}
	item::T
end

Converts an array of values to a Mill.BagNode with items converted by item. The entire array is assumed to be a single bag. The BagNode contains data in field data and information about bags in bags field.

sc = ExtractArray(ExtractCategorical(["A","B","C"]))
sc(["A","B","C","D"])

BagNode  # 1 obs, 88 bytes
  ╰── ArrayNode(4×4 MaybeHotMatrix with Union{Missing, Bool} elements)  # 4 obs, 92 bytes

Empty arrays are represented as an empty bag.

sc([])

BagNode  # 1 obs, 88 bytes
  ╰── ArrayNode(4×0 MaybeHotMatrix with Union{Missing, Bool} elements)  # 0 obs, 72 bytes

The information about the bags themselves can be seen here

sc([]).bags

AlignedBags{Int64}(UnitRange{Int64}[0:-1])
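
The bags field can be pictured as one column range per input array. The following plain-Julia sketch uses a hypothetical helper aligned_bags (not Mill's implementation) to show how several arrays, concatenated column-wise into a single data matrix, could be indexed, with 0:-1 marking an empty bag as in the output above.

```julia
# Hypothetical helper (illustration only): turn the lengths of consecutive
# arrays into column ranges of the concatenated data matrix.
function aligned_bags(lengths::Vector{Int})
    bags = UnitRange{Int}[]
    stop = 0
    for len in lengths
        # an empty array owns no columns, which is encoded as the empty range 0:-1
        push!(bags, len == 0 ? (0:-1) : ((stop + 1):(stop + len)))
        stop += len
    end
    return bags
end

bags = aligned_bags([4, 0, 2])  # arrays with 4, 0 and 2 items
```

In this sketch the first array owns columns 1:4, the second owns none, and the third owns columns 5:6.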

The data of an empty bag can be either missing or an empty sample. The latter is more convenient, as it makes all samples the same type, which is friendlier to automatic differentiation (AD). This behavior is controlled by Mill.emptyismissing. The extractor of a BagNode can signal to child extractors to extract a sample with zero observations using the special singleton JsonGrinder.extractempty. For example

Mill.emptyismissing!(true)
sc([]).data

missing

Mill.emptyismissing!(false)
sc([]).data

4×0 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}

Storing input is delegated to leaf extractors, so the metadata of the bag itself is empty

sc(["A","B","C","D"], store_input=true).metadata

but the metadata of the underlying ArrayNode contains the inputs.

sc(["A","B","C","D"], store_input=true).data.metadata

4-element Vector{String}:
 "A"
 "B"
 "C"
 "D"

In case of empty arrays, input is stored in metadata of BagNode itself, because there might not be any underlying ArrayNode.

sc([], store_input=true).metadata

1-element Vector{Vector{Any}}:
 []

Dict

struct ExtractDict{S} <: AbstractExtractor
	dict::S
end

Extracts all items in dict and returns them as a ProductNode. Keys in dict correspond to keys in JSON.

ex = ExtractDict(Dict(:a => ExtractScalar(),
	:b => ExtractString(),
	:c => ExtractCategorical(["A","B"]),
	:d => ExtractArray(ExtractString())))
ex(Dict(:a => "1",
	:b => "Hello",
	:c => "A",
	:d => ["Hello", "world"]))

ProductNode  # 1 obs, 96 bytes
  ├── a: ArrayNode(1×1 Array with Union{Missing, Float32} elements)  # 1 obs, 53 bytes
  ├── b: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements)  # 1 obs, 125 bytes
  ├── d: BagNode  # 1 obs, 104 bytes
  │        ╰── ArrayNode(2053×2 NGramMatrix with Union{Missing, Int64} elements)  # 2 obs, 146 bytes
  ╰── c: ArrayNode(3×1 MaybeHotMatrix with Union{Missing, Bool} elements)  # 1 obs, 77 bytes

Missing keys are replaced by missing and handled by child extractors.

ex(Dict(:a => "1",
	:c => "A"))

ProductNode  # 1 obs, 96 bytes
  ├── a: ArrayNode(1×1 Array with Union{Missing, Float32} elements)  # 1 obs, 53 bytes
  ├── b: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements)  # 1 obs, 112 bytes
  ├── d: BagNode  # 1 obs, 104 bytes
  │        ╰── ArrayNode(2053×0 NGramMatrix with Union{Missing, Int64} elements)  # 0 obs, 104 bytes
  ╰── c: ArrayNode(3×1 MaybeHotMatrix with Union{Missing, Bool} elements)  # 1 obs, 77 bytes
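
The way missing keys reach the child extractors can be sketched with plain functions. Everything in the block below is hypothetical: extract_fields stands in for the dictionary extractor and the two anonymous functions stand in for leaf extractors. The only point is that a key absent from the sample is looked up as missing and passed on to the child.

```julia
# Hypothetical stand-in for a dictionary extractor: each child receives the
# value stored under its key, or missing when the key is absent.
function extract_fields(extractors::AbstractDict, sample::AbstractDict)
    return Dict(k => f(get(sample, k, missing)) for (k, f) in extractors)
end

# Toy leaf "extractors": each one decides how to treat a missing input.
extractors = Dict(
    :a => x -> x === missing ? missing : parse(Float32, x),
    :b => x -> x === missing ? missing : uppercase(x),
)

full = extract_fields(extractors, Dict(:a => "1", :b => "Hello"))
partial = extract_fields(extractors, Dict(:a => "1"))  # :b is absent
```

In partial, the entry for :b is missing while :a is extracted as usual, so the result has the same set of keys for every sample.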

Storing input data works in a similar manner as for ExtractArray: storing is delegated to leaf extractors.

julia> ex(Dict(:a => "1",
       	:c => "A"), store_input=true).metadata
julia> ex(Dict(:a => "1",
       	:c => "A"), store_input=true)[:a].metadata
1×1 Matrix{String}:
 "1"
julia> ex(Dict(:a => "1",
       	:c => "A"), store_input=true)[:b].metadata
1-element Vector{Nothing}:
 nothing
julia> ex(Dict(:a => "1",
       	:c => "A"), store_input=true)[:c].metadata
1-element Vector{String}:
 "A"
julia> ex(Dict(:a => "1",
       	:c => "A"), store_input=true)[:d].metadata
1-element Vector{Nothing}:
 nothing

julia> ex(Dict(:a => "1",
       	:b => "Hello",
       	:c => "A",
       	:d => ["Hello", "world"]), store_input=true).metadata
julia> ex(Dict(:a => "1",
       	:b => "Hello",
       	:c => "A",
       	:d => ["Hello", "world"]), store_input=true)[:a].metadata
1×1 Matrix{String}:
 "1"
julia> ex(Dict(:a => "1",
       	:b => "Hello",
       	:c => "A",
       	:d => ["Hello", "world"]), store_input=true)[:b].metadata
1-element Vector{String}:
 "Hello"
julia> ex(Dict(:a => "1",
       	:b => "Hello",
       	:c => "A",
       	:d => ["Hello", "world"]), store_input=true)[:c].metadata
1-element Vector{String}:
 "A"
julia> ex(Dict(:a => "1",
       	:b => "Hello",
       	:c => "A",
       	:d => ["Hello", "world"]), store_input=true)[:d].metadata
julia> ex(Dict(:a => "1",
       	:b => "Hello",
       	:c => "A",
       	:d => ["Hello", "world"]), store_input=true)[:d].data.metadata
2-element Vector{String}:
 "Hello"
 "world"

Specials

ExtractKeyAsField

Some JSONs we have encountered use Dicts to hold an array of named lists (or other types). Coming from a computer security background, a prototypical example for us is storing a list of DLLs with the corresponding lists of imported functions in a single structure. For example, a JSON

{ "foo.dll" : ["print","write", "open","close"],
  "bar.dll" : ["send", "recv"]
}

should be better written as

[{"key": "foo.dll",
  "item": ["print","write", "open","close"]},
  {"key": "bar.dll",
  "item": ["send", "recv"]}
]
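
The rewriting shown above can be expressed as a small transformation on a parsed Dict. The function keys_as_fields is a hypothetical helper for illustration only; it is not how ExtractKeyAsField is implemented, and sorting by key is added here just to make the result deterministic.

```julia
# Hypothetical helper (illustration only): turn a Dict whose keys carry data
# into an array of records with explicit "key" and "item" fields.
function keys_as_fields(d::AbstractDict)
    pairs_sorted = sort!(collect(d); by = first)  # deterministic order by key
    return [Dict("key" => k, "item" => v) for (k, v) in pairs_sorted]
end

imports = Dict(
    "foo.dll" => ["print", "write", "open", "close"],
    "bar.dll" => ["send", "recv"],
)
records = keys_as_fields(imports)
```

Each record has the same two fields regardless of the DLL name, so the collection can be treated as a bag of uniformly structured items.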

JsonGrinder tries to detect these cases, as they typically manifest as Dicts with an excessively large number of keys in a schema. The detection logic in suggestextractor(e::DictEntry) is simple: if the number of unique keys in a specific Dict is greater than settings.key_as_field = 500, the Dict is considered to hold values in its keys and ExtractKeyAsField is used instead of ExtractDict. key_as_field can be set to any value based on the specific data or domain, but we have found 500 to be a reasonable default.

The extractor itself is simple as well. For the case above, it would look like

s = JSON.parse("{ \"foo.dll\" : [\"print\",\"write\", \"open\",\"close\"],
  \"bar.dll\" : [\"send\", \"recv\"]
}")
ex = ExtractKeyAsField(ExtractString(),ExtractArray(ExtractString()))
ex(s)

BagNode  # 1 obs, 144 bytes
  ╰── ProductNode  # 2 obs, 72 bytes
        ├─── key: ArrayNode(2053×2 NGramMatrix with Union{Missing, Int64} elements)  # 2 obs, 150 bytes
        ╰── item: BagNode  # 2 obs, 120 bytes
                    ╰── ArrayNode(2053×6 NGramMatrix with Union{Missing, Int64} elements)  # 6 obs, 227 bytes

Because it returns a BagNode, missing values are treated in a similar manner as in ExtractArray, and the Mill.emptyismissing setting applies here too.

Mill.emptyismissing!(true)
ex(Dict()).data

missing

Mill.emptyismissing!(false)
ex(Dict()).data

ProductNode  # 0 obs, 72 bytes
  ├─── key: ArrayNode(2053×0 NGramMatrix with Union{Missing, Int64} elements)  # 0 obs, 104 bytes
  ╰── item: BagNode  # 0 obs, 88 bytes
              ╰── ArrayNode(2053×0 NGramMatrix with Union{Missing, Int64} elements)  # 0 obs, 104 bytes

MultipleRepresentation

Provides a way to have multiple representations of a single value or subtree in JSON. For example, imagine that you are extracting strings with some very frequently occurring values and a lot of clutter, which might be important without you knowing about it. MultipleRepresentation(extractors::Tuple) contains a Tuple or NamedTuple of extractors and applies them to a single subtree in a JSON. The corresponding Mill structure will contain a ProductNode of all the representations.

For example, a String with both Categorical and NGram representations will look like

ex = MultipleRepresentation((c = ExtractCategorical(["Hello","world"]), s = ExtractString()))
reduce(catobs,ex.(["Hello","world","from","Prague"]))

ProductNode  # 4 obs, 48 bytes
  ├── c: ArrayNode(3×4 MaybeHotMatrix with Union{Missing, Bool} elements)  # 4 obs, 92 bytes
  ╰── s: ArrayNode(2053×4 NGramMatrix with Union{Missing, Int64} elements)  # 4 obs, 188 bytes

Because it produces ProductNode, missing values are delegated to leaf extractors.

ex(missing)

ProductNode  # 1 obs, 48 bytes
  ├── c: ArrayNode(3×1 MaybeHotMatrix with Union{Missing, Bool} elements)  # 1 obs, 77 bytes
  ╰── s: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements)  # 1 obs, 112 bytes

MultipleRepresentation together with handling of missing values enables JsonGrinder to deal with JSONs with non-stable schema.

A minimalistic example of such a non-stable schema is a JSON which sometimes has a string and sometimes an array of numbers under the same key. Let's create an appropriate MultipleRepresentation (although in real-world usage, the most suitable MultipleRepresentation is proposed by suggestextractor based on the observed data):

julia> ex = MultipleRepresentation((ExtractString(), ExtractArray(ExtractScalar(Float32))));
julia> e_hello = ex("Hello")
ProductNode  # 1 obs, 48 bytes
  ├── e1: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements)  # 1 obs, 125 bytes
  ╰── e2: BagNode  # 1 obs, 80 bytes
            ╰── ArrayNode(1×0 Array with Union{Missing, Float32} elements)  # 0 obs, 48 bytes
julia> e_hello[:e1].data
2053×1 NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}:
 "Hello"
julia> e_hello[:e2].data
1×0 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}
julia> e_123 = ex([1,2,3])
ProductNode  # 1 obs, 48 bytes
  ├── e1: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements)  # 1 obs, 112 bytes
  ╰── e2: BagNode  # 1 obs, 80 bytes
            ╰── ArrayNode(1×3 Array with Union{Missing, Float32} elements)  # 3 obs, 63 bytes
julia> e_123[:e1].data
2053×1 NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}:
 missing
julia> e_123[:e2].data
1×3 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
 1.0f0  2.0f0  3.0f0
julia> e_2 = ex([2])
ProductNode  # 1 obs, 48 bytes
  ├── e1: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements)  # 1 obs, 112 bytes
  ╰── e2: BagNode  # 1 obs, 80 bytes
            ╰── ArrayNode(1×1 Array with Union{Missing, Float32} elements)  # 1 obs, 53 bytes
julia> e_2[:e1].data
2053×1 NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}:
 missing
julia> e_2[:e2].data
1×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
 2.0f0
julia> e_world = ex("world")
ProductNode  # 1 obs, 48 bytes
  ├── e1: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements)  # 1 obs, 125 bytes
  ╰── e2: BagNode  # 1 obs, 80 bytes
            ╰── ArrayNode(1×0 Array with Union{Missing, Float32} elements)  # 0 obs, 48 bytes
julia> e_world[:e1].data
2053×1 NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}:
 "world"
julia> e_world[:e2].data
1×0 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}

In this example we can see that each input populates exactly one representation, while the other one is missing or empty.
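
The mechanism can be mimicked with ordinary functions: a NamedTuple of functions is applied to the same input, and each function returns missing or an empty result when the input is not of the type it understands. The names represent, as_string and as_numbers are hypothetical and exist only in this sketch; it is not the MultipleRepresentation implementation.

```julia
# Hypothetical sketch: apply every "representation" to the same input.
represent(extractors::NamedTuple, x) = map(f -> f(x), extractors)

# Each toy representation handles one input type and gives up on the other.
as_string(x) = x isa AbstractString ? x : missing
as_numbers(x) = x isa AbstractVector ? Float32.(x) : Float32[]

reps = (e1 = as_string, e2 = as_numbers)

r_hello = represent(reps, "Hello")   # e1 holds the string, e2 is empty
r_123 = represent(reps, [1, 2, 3])   # e1 is missing, e2 holds the numbers
```

Both results have the same fields e1 and e2, which is what lets samples with differing input types be combined into one structure.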

ExtractEmpty

As mentioned earlier, ExtractEmpty is a type used to extract an observation with 0 samples. The singleton extractempty can be used to obtain an instance of the ExtractEmpty type. MLUtils.numobs(ex(JsonGrinder.extractempty)) == 0 is required to hold for every extractor in order for it to work correctly.

All above-mentioned extractors are able to extract this, as we can see here

julia> ExtractString()(JsonGrinder.extractempty)
2053×0 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}
julia> ExtractString()(JsonGrinder.extractempty) |> numobs
0
julia> ExtractCategorical(["A","B"])(JsonGrinder.extractempty)
3×0 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}
julia> ExtractCategorical(["A","B"])(JsonGrinder.extractempty) |> numobs
0
julia> ExtractScalar()(JsonGrinder.extractempty)
1×0 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}
julia> ExtractScalar()(JsonGrinder.extractempty) |> numobs
0
julia> ExtractArray(ExtractString())(JsonGrinder.extractempty)
BagNode  # 0 obs, 88 bytes
  ╰── ArrayNode(2053×0 NGramMatrix with Union{Missing, Int64} elements)  # 0 obs, 104 bytes
julia> ExtractArray(ExtractString())(JsonGrinder.extractempty) |> numobs
0
julia> ExtractDict(Dict(:a => ExtractScalar(),
       	:b => ExtractString(),
       	:c => ExtractCategorical(["A","B"]),
       	:d => ExtractArray(ExtractString())))(JsonGrinder.extractempty)
ProductNode  # 0 obs, 96 bytes
  ├── a: ArrayNode(1×0 Array with Union{Missing, Float32} elements)  # 0 obs, 48 bytes
  ├── b: ArrayNode(2053×0 NGramMatrix with Union{Missing, Int64} elements)  # 0 obs, 104 bytes
  ├── d: BagNode  # 0 obs, 88 bytes
  │        ╰── ArrayNode(2053×0 NGramMatrix with Union{Missing, Int64} elements)  # 0 obs, 104 bytes
  ╰── c: ArrayNode(3×0 MaybeHotMatrix with Union{Missing, Bool} elements)  # 0 obs, 72 bytes
julia> ExtractDict(Dict(:a => ExtractScalar(),
       	:b => ExtractString(),
       	:c => ExtractCategorical(["A","B"]),
       	:d => ExtractArray(ExtractString())))(JsonGrinder.extractempty) |> numobs
0