Extractor API

API

JsonGrinder.suggestextractor - Function
suggestextractor(e::Schema; min_occurences=1, all_stable=false, categorical_limit=100)

Given a schema e, creates a corresponding Extractor.

  • min_occurences specifies the minimum number of occurrences a key needs to be included in the extractor.
  • all_stable makes all leaf extractors strictly stable.
  • categorical_limit specifies the maximum number of different values in a leaf for it to be considered a categorical variable.
  • ngram_params makes it possible to override default params for NGramExtractor.

Examples

julia> s = schema([ Dict("a" => 1, "b" => ["foo"], "c" => Dict("d" => 1)),
                    Dict("a" => 2,                 "c" => Dict())])
DictEntry 2x updated
  ├── a: LeafEntry (2 unique `Real` values) 2x updated
  ├── b: ArrayEntry 1x updated
  │        ╰── LeafEntry (1 unique `String` values) 1x updated
  ╰── c: DictEntry 2x updated
           ╰── d: LeafEntry (1 unique `Real` values) 1x updated

julia> suggestextractor(s)
DictExtractor
  ├── a: CategoricalExtractor(n=3)
  ├── b: ArrayExtractor
  │        ╰── StableExtractor(CategoricalExtractor(n=2))
  ╰── c: DictExtractor
           ╰── d: StableExtractor(CategoricalExtractor(n=2))

julia> suggestextractor(s; all_stable=true)
DictExtractor
  ├── a: StableExtractor(CategoricalExtractor(n=3))
  ├── b: ArrayExtractor
  │        ╰── StableExtractor(CategoricalExtractor(n=2))
  ╰── c: DictExtractor
           ╰── d: StableExtractor(CategoricalExtractor(n=2))

julia> suggestextractor(s; min_occurences=2)
DictExtractor
  ╰── a: CategoricalExtractor(n=3)

julia> suggestextractor(s; categorical_limit=0)
DictExtractor
  ├── a: ScalarExtractor(c=1.0, s=1.0)
  ├── b: ArrayExtractor
  │        ╰── StableExtractor(NGramExtractor(n=3, b=256, m=2053))
  ╰── c: DictExtractor
           ╰── d: StableExtractor(ScalarExtractor(c=1.0, s=1.0))

See also: extract, stabilizeextractor.

JsonGrinder.stabilizeextractor - Function
stabilizeextractor(e::Extractor)

Returns a new extractor with a structure similar to e, in which every leaf is wrapped in a StableExtractor.

Examples

julia> e = (a=ScalarExtractor(), b=CategoricalExtractor(1:5)) |> DictExtractor
DictExtractor
  ├── a: ScalarExtractor(c=0.0, s=1.0)
  ╰── b: CategoricalExtractor(n=6)

julia> e_stable = stabilizeextractor(e)
DictExtractor
  ├── a: StableExtractor(ScalarExtractor(c=0.0, s=1.0))
  ╰── b: StableExtractor(CategoricalExtractor(n=6))

julia> e(Dict("a" => 0))
ERROR: IncompatibleExtractor at path [:b]: This extractor does not support missing values! See the `Stable Extractors` section in the docs.
[...]

julia> e_stable(Dict("a" => 0))
ProductNode  1 obs, 0 bytes
  ├── a: ArrayNode(1×1 Array with Union{Missing, Float32} elements)  1 obs, 62 bytes
  ╰── b: ArrayNode(6×1 MaybeHotMatrix with Union{Missing, Bool} elements)  1 obs, 62 bytes

See also: suggestextractor, extract.

JsonGrinder.extract - Function
extract(e::Extractor, samples; store_input=Val(false))

Efficient extraction of multiple samples at once.

Note that whereas extract expects samples to be an iterable of samples (of known length), calling the extractor directly with e(sample) works for a single sample. In other words, e(sample) is equivalent to extract(e, [sample]).

See also: suggestextractor, stabilizeextractor, schema.

Examples

julia> sample = Dict("a" => 0, "b" => "foo");

julia> e = suggestextractor(schema([sample]))
DictExtractor
  ├── a: CategoricalExtractor(n=2)
  ╰── b: CategoricalExtractor(n=2)

julia> e(sample)
ProductNode  1 obs, 0 bytes
  ├── a: ArrayNode(2×1 OneHotArray with Bool elements)  1 obs, 60 bytes
  ╰── b: ArrayNode(2×1 OneHotArray with Bool elements)  1 obs, 60 bytes

julia> e(sample) == extract(e, [sample])
true
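
The single-sample example above does not show the batch form that extract is meant for. The following is a minimal sketch of it, using only calls that appear on this page; the second sample is made up for illustration, and both samples share the same keys so that no leaf has to handle missing values.

```julia
using JsonGrinder

# Illustrative data: two samples with identical keys.
samples = [Dict("a" => 0, "b" => "foo"), Dict("a" => 1, "b" => "bar")]

# Infer a schema from the data and derive an extractor from it.
e = suggestextractor(schema(samples))

# Extract the whole collection in one call.
batch = extract(e, samples)

# Extract a single sample by calling the extractor directly.
single = e(samples[1])

# Calling the extractor matches extracting a one-element collection.
@assert single == extract(e, [samples[1]])
```

The printed form of batch is omitted here because it depends on the installed versions of JsonGrinder and Mill.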

JsonGrinder.ScalarExtractor - Type
ScalarExtractor{T} <: Extractor

Extracts a numerical value, centered by subtracting c and then scaled by multiplying by s, so that an input x yields s * (x - c).

Examples

julia> e = ScalarExtractor(2, 3)
ScalarExtractor(c=2.0, s=3.0)

julia> e(0)
1×1 ArrayNode{Matrix{Float32}, Nothing}:
 -6.0

julia> e(1)
1×1 ArrayNode{Matrix{Float32}, Nothing}:
 -3.0
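
The outputs above are consistent with first subtracting c and then multiplying by s. The plain-Julia sketch below reproduces only that arithmetic; scale_value is a hypothetical helper, not part of JsonGrinder, and it returns a bare Float32 instead of an ArrayNode.

```julia
# Hypothetical helper, not part of JsonGrinder: mimics the arithmetic only.
# Subtract the center c first, then multiply by the scale s.
scale_value(x, c, s) = Float32(s * (x - c))

scale_value(0, 2, 3)  # -6.0f0, matching e(0) above
scale_value(1, 2, 3)  # -3.0f0, matching e(1) above
```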

JsonGrinder.CategoricalExtractor - Type
CategoricalExtractor{V, I} <: Extractor

Extracts a single item interpreted as a categorical variable into a one-hot encoded vector.

There is always an extra category for an unknown value (and hence the displayed n is one more than the number of categories).

Examples

julia> e = CategoricalExtractor(1:3)
CategoricalExtractor(n=4)

julia> e(2)
4×1 ArrayNode{OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
 ⋅
 1
 ⋅
 ⋅

julia> e(-1)
4×1 ArrayNode{OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
 ⋅
 ⋅
 ⋅
 1
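
The extra category for unknown values can be pictured with a small plain-Julia sketch. onehot_with_unknown is a hypothetical helper, not part of JsonGrinder, and it assumes that categories are indexed in the order given and that the unknown slot comes last, which is what the outputs above suggest for 1:3.

```julia
# Hypothetical helper, not part of JsonGrinder: one-hot encoding with a
# trailing slot reserved for values outside the known categories.
function onehot_with_unknown(categories, x)
    n = length(categories)
    # Position of x among the categories, or n + 1 when it is not found.
    i = something(findfirst(==(x), categories), n + 1)
    v = falses(n + 1)
    v[i] = true
    return v
end

onehot_with_unknown(1:3, 2)   # hot entry at position 2
onehot_with_unknown(1:3, -1)  # hot entry at position 4 (unknown value)
```

The result has length(categories) + 1 entries, which corresponds to the n=4 displayed for CategoricalExtractor(1:3).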

JsonGrinder.NGramExtractor - Type
NGramExtractor{T} <: Extractor

Extracts a String as n-grams (Mill.NGramMatrix).

Examples

julia> e = NGramExtractor()
NGramExtractor(n=3, b=256, m=2053)

julia> e("foo")
2053×1 ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}:
 "foo"

JsonGrinder.DictExtractor - Type
DictExtractor{S} <: Extractor

Extracts all items in a Dict and returns them as a Mill.ProductNode.

Examples

julia> e = (a=ScalarExtractor(), b=CategoricalExtractor(1:5)) |> DictExtractor
DictExtractor
  ├── a: ScalarExtractor(c=0.0, s=1.0)
  ╰── b: CategoricalExtractor(n=6)

julia> e(Dict("a" => 1, "b" => 1))
ProductNode  1 obs, 0 bytes
  ├── a: ArrayNode(1×1 Array with Float32 elements)  1 obs, 60 bytes
  ╰── b: ArrayNode(6×1 OneHotArray with Bool elements)  1 obs, 60 bytes

JsonGrinder.ArrayExtractor - Type
ArrayExtractor{T}

Extracts all items in an Array and returns them as a Mill.BagNode.

Examples

julia> e = ArrayExtractor(CategoricalExtractor(2:4))
ArrayExtractor
  ╰── CategoricalExtractor(n=4)

julia> e([2, 3, 1, 4])
BagNode  1 obs, 64 bytes
  ╰── ArrayNode(4×4 OneHotArray with Bool elements)  4 obs, 72 bytes

JsonGrinder.PolymorphExtractor - Type
PolymorphExtractor

Extracts to a Mill.ProductNode in which each item is the result of a different extractor.

Examples

julia> e = (NGramExtractor(), CategoricalExtractor(["tcp", "udp", "dhcp"])) |> PolymorphExtractor
PolymorphExtractor
  ├── NGramExtractor(n=3, b=256, m=2053)
  ╰── CategoricalExtractor(n=4)

julia> e("tcp")
ProductNode  1 obs, 0 bytes
  ├── ArrayNode(2053×1 NGramMatrix with Int64 elements)  1 obs, 91 bytes
  ╰── ArrayNode(4×1 OneHotArray with Bool elements)  1 obs, 60 bytes

julia> e("http")
ProductNode  1 obs, 0 bytes
  ├── ArrayNode(2053×1 NGramMatrix with Int64 elements)  1 obs, 92 bytes
  ╰── ArrayNode(4×1 OneHotArray with Bool elements)  1 obs, 60 bytes