Extractor API
Index
JsonGrinder.ArrayExtractor
JsonGrinder.CategoricalExtractor
JsonGrinder.DictExtractor
JsonGrinder.Extractor
JsonGrinder.LeafExtractor
JsonGrinder.NGramExtractor
JsonGrinder.PolymorphExtractor
JsonGrinder.ScalarExtractor
JsonGrinder.StableExtractor
JsonGrinder.extract
JsonGrinder.stabilizeextractor
JsonGrinder.suggestextractor
API
JsonGrinder.suggestextractor — Function
suggestextractor(e::Schema; min_occurences=1, all_stable=false, categorical_limit=100)
Given a schema e, creates a corresponding Extractor.
min_occurences specifies the minimum number of occurrences a key needs in order to be included in the extractor.
all_stable makes all leaf extractors strictly stable.
categorical_limit specifies the maximum number of different values in a leaf for it to be considered a categorical variable.
ngram_params makes it possible to override the default parameters for NGramExtractor.
Examples
julia> s = schema([ Dict("a" => 1, "b" => ["foo"], "c" => Dict("d" => 1)),
                    Dict("a" => 2, "c" => Dict())])
DictEntry 2x updated
├── a: LeafEntry (2 unique `Real` values) 2x updated
├── b: ArrayEntry 1x updated
│ ╰── LeafEntry (1 unique `String` values) 1x updated
╰── c: DictEntry 2x updated
  ╰── d: LeafEntry (1 unique `Real` values) 1x updated
julia> suggestextractor(s)
DictExtractor
├── a: CategoricalExtractor(n=3)
├── b: ArrayExtractor
│ ╰── StableExtractor(CategoricalExtractor(n=2))
╰── c: DictExtractor
  ╰── d: StableExtractor(CategoricalExtractor(n=2))
julia> suggestextractor(s; all_stable=true)
DictExtractor
├── a: StableExtractor(CategoricalExtractor(n=3))
├── b: ArrayExtractor
│ ╰── StableExtractor(CategoricalExtractor(n=2))
╰── c: DictExtractor
  ╰── d: StableExtractor(CategoricalExtractor(n=2))
julia> suggestextractor(s; min_occurences=2)
DictExtractor
╰── a: CategoricalExtractor(n=3)
julia> suggestextractor(s; categorical_limit=0)
DictExtractor
├── a: ScalarExtractor(c=1.0, s=1.0)
├── b: ArrayExtractor
│ ╰── StableExtractor(NGramExtractor(n=3, b=256, m=2053))
╰── c: DictExtractor
  ╰── d: StableExtractor(ScalarExtractor(c=1.0, s=1.0))
See also: extract, stabilizeextractor.
JsonGrinder.stabilizeextractor — Function
stabilizeextractor(e::Extractor)
Returns a new extractor with a structure similar to that of e, containing StableExtractor in its leaves.
Examples
julia> e = (a=ScalarExtractor(), b=CategoricalExtractor(1:5)) |> DictExtractor
DictExtractor
├── a: ScalarExtractor(c=0.0, s=1.0)
╰── b: CategoricalExtractor(n=6)
julia> e_stable = stabilizeextractor(e)
DictExtractor
├── a: StableExtractor(ScalarExtractor(c=0.0, s=1.0))
╰── b: StableExtractor(CategoricalExtractor(n=6))
julia> e(Dict("a" => 0))
ERROR: IncompatibleExtractor at path [:b]: This extractor does not support missing values! See the `Stable Extractors` section in the docs.
[...]
julia> e_stable(Dict("a" => 0))
ProductNode 1 obs, 0 bytes
├── a: ArrayNode(1×1 Array with Union{Missing, Float32} elements) 1 obs, 62 bytes
╰── b: ArrayNode(6×1 MaybeHotMatrix with Union{Missing, Bool} elements) 1 obs, 62 bytes
See also: suggestextractor, extract.
JsonGrinder.extract — Function
extract(e::Extractor, samples; store_input=Val(false))
Efficient extraction of multiple samples at once.
Note that extract expects samples to be an iterable of samples (of known length), whereas calling the extractor directly as e(sample) processes a single sample. In other words, e(sample) is equivalent to extract(e, [sample]).
See also: suggestextractor, stabilizeextractor, schema.
Examples
julia> sample = Dict("a" => 0, "b" => "foo");
julia> e = suggestextractor(schema([sample]))
DictExtractor
├── a: CategoricalExtractor(n=2)
╰── b: CategoricalExtractor(n=2)
julia> e(sample)
ProductNode 1 obs, 0 bytes
├── a: ArrayNode(2×1 OneHotArray with Bool elements) 1 obs, 60 bytes
╰── b: ArrayNode(2×1 OneHotArray with Bool elements) 1 obs, 60 bytes
julia> e(sample) == extract(e, [sample])
true
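The example above uses a single sample only. As a minimal sketch of the batch case, the snippet below passes a two-element collection to extract and, for comparison, also calls the extractor once per sample. The sample data is made up for illustration; only schema, suggestextractor, extract and the direct call e(sample) from this page are used.

```julia
using JsonGrinder

# Illustrative data: both keys are present in every sample.
samples = [Dict("a" => 0, "b" => "foo"), Dict("a" => 1, "b" => "bar")]
e = suggestextractor(schema(samples))

# One call for the whole collection ...
batch = extract(e, samples)

# ... versus one call per sample.
singles = [e(s) for s in samples]
```

Per the equivalence stated above, each element of singles should equal extract applied to a one-element collection holding the same sample.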
JsonGrinder.Extractor — Type
Extractor
Supertype for all extractor node types.
JsonGrinder.LeafExtractor — Type
LeafExtractor
Supertype for all leaf extractor node types, i.e. those that reside in the leaves of the hierarchy.
JsonGrinder.StableExtractor — Type
struct StableExtractor{T <: LeafExtractor} <: LeafExtractor
Wraps any other LeafExtractor and makes it output stable results with respect to missing input values.
See also: stabilizeextractor.
JsonGrinder.ScalarExtractor — Type
ScalarExtractor{T} <: Extractor
Extracts a numerical value, centered by subtracting c and then scaled by multiplying by s.
Examples
julia> e = ScalarExtractor(2, 3)
ScalarExtractor(c=2.0, s=3.0)
julia> e(0)
1×1 ArrayNode{Matrix{Float32}, Nothing}:
-6.0
julia> e(1)
1×1 ArrayNode{Matrix{Float32}, Nothing}:
-3.0
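The two outputs above are consistent with the arithmetic s * (x - c). The following plain-Julia sketch reproduces them without JsonGrinder; center_and_scale is a hypothetical helper introduced here for illustration, not a function of the package, and the formula is inferred from the example rather than taken from the source code.

```julia
# Hypothetical helper (not part of JsonGrinder): subtract the center c,
# then multiply by the scale s, and store the result as Float32.
center_and_scale(x, c, s) = Float32(s * (x - c))

center_and_scale(0, 2, 3)  # -6.0f0, same value as e(0) above
center_and_scale(1, 2, 3)  # -3.0f0, same value as e(1) above
```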
JsonGrinder.CategoricalExtractor — Type
CategoricalExtractor{V, I} <: Extractor
Extracts a single item interpreted as a categorical variable into a one-hot encoded vector.
There is always an extra category for an unknown value (and hence the displayed n is one more than the number of categories).
Examples
julia> e = CategoricalExtractor(1:3)
CategoricalExtractor(n=4)
julia> e(2)
4×1 ArrayNode{OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
⋅
1
⋅
⋅
julia> e(-1)
4×1 ArrayNode{OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
⋅
⋅
⋅
1
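To make the role of the extra category concrete, here is a small plain-Julia sketch of one-hot encoding with a trailing "unknown" slot. The helpers onehot_index and onehot are hypothetical and written only for this illustration; how JsonGrinder orders or stores its categories internally is not shown on this page and is not assumed here beyond what the example above displays.

```julia
# Hypothetical helper (not part of JsonGrinder): index of v among the known
# categories, or the extra last index when v is not one of them.
function onehot_index(categories, v)
    i = findfirst(==(v), categories)
    return i === nothing ? length(categories) + 1 : i
end

# Hypothetical helper: a Boolean vector with exactly one entry set.
function onehot(categories, v)
    x = falses(length(categories) + 1)
    x[onehot_index(categories, v)] = true
    return x
end

onehot(1:3, 2)   # second of four entries is set
onehot(1:3, -1)  # unknown value: the extra fourth entry is set
```

With three known categories the vector has four entries, which matches the n=4 displayed for CategoricalExtractor(1:3).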
JsonGrinder.NGramExtractor — Type
NGramExtractor{T} <: Extractor
Extracts a String as n-grams (Mill.NGramMatrix).
Examples
julia> e = NGramExtractor()
NGramExtractor(n=3, b=256, m=2053)
julia> e("foo")
2053×1 ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}:
"foo"
JsonGrinder.DictExtractor — Type
DictExtractor{S} <: Extractor
Extracts all items in a Dict and returns them as a Mill.ProductNode.
Examples
julia> e = (a=ScalarExtractor(), b=CategoricalExtractor(1:5)) |> DictExtractor
DictExtractor
├── a: ScalarExtractor(c=0.0, s=1.0)
╰── b: CategoricalExtractor(n=6)
julia> e(Dict("a" => 1, "b" => 1))
ProductNode 1 obs, 0 bytes
├── a: ArrayNode(1×1 Array with Float32 elements) 1 obs, 60 bytes
╰── b: ArrayNode(6×1 OneHotArray with Bool elements) 1 obs, 60 bytes
JsonGrinder.ArrayExtractor — Type
ArrayExtractor{T}
Extracts all items in an Array and returns them as a Mill.BagNode.
Examples
julia> e = ArrayExtractor(CategoricalExtractor(2:4))
ArrayExtractor
╰── CategoricalExtractor(n=4)
julia> e([2, 3, 1, 4])
BagNode 1 obs, 64 bytes
╰── ArrayNode(4×4 OneHotArray with Bool elements) 4 obs, 72 bytes
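The suggestextractor output earlier on this page shows an ArrayExtractor nested inside a DictExtractor, so the two can presumably also be composed by hand. The sketch below assumes that a DictExtractor built from a NamedTuple accepts non-leaf children in the same way it accepts the leaf extractors in its own example; the input data is made up for illustration.

```julia
using JsonGrinder

# A DictExtractor whose "b" child is itself an ArrayExtractor.
e = (a=ScalarExtractor(), b=ArrayExtractor(CategoricalExtractor(1:3))) |> DictExtractor

# "a" holds a single number, "b" a variable-length list of categories.
x = e(Dict("a" => 1, "b" => [1, 2, 3, 2]))
```

Both keys are supplied because neither child is wrapped in a StableExtractor; as the stabilizeextractor example shows, a missing key would raise an error in that case.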
JsonGrinder.PolymorphExtractor — Type
PolymorphExtractor
Extracts to a Mill.ProductNode where each item is the result of a different extractor applied to the same input.
Examples
julia> e = (NGramExtractor(), CategoricalExtractor(["tcp", "udp", "dhcp"])) |> PolymorphExtractor
PolymorphExtractor
├── NGramExtractor(n=3, b=256, m=2053)
╰── CategoricalExtractor(n=4)
julia> e("tcp")
ProductNode 1 obs, 0 bytes
├── ArrayNode(2053×1 NGramMatrix with Int64 elements) 1 obs, 91 bytes
╰── ArrayNode(4×1 OneHotArray with Bool elements) 1 obs, 60 bytes
julia> e("http")
ProductNode 1 obs, 0 bytes
├── ArrayNode(2053×1 NGramMatrix with Int64 elements) 1 obs, 92 bytes
╰── ArrayNode(4×1 OneHotArray with Bool elements) 1 obs, 60 bytes