Public API Reference
Documentation for JsonGrinder.jl's public interface.
See the Internals section of the manual for internal package docs covering all functions.
Index
JsonGrinder.AuxiliaryExtractor
JsonGrinder.ExtractArray
JsonGrinder.ExtractCategorical
JsonGrinder.ExtractDict
JsonGrinder.ExtractKeyAsField
JsonGrinder.ExtractScalar
JsonGrinder.ExtractString
JsonGrinder.ExtractVector
JsonGrinder.MultipleRepresentation
JsonGrinder.extractbatch
JsonGrinder.generate_html
JsonGrinder.schema
JsonGrinder.suggestextractor
Public Interface
JsonGrinder.schema — Function
schema(samples::AbstractArray{<:Dict})
schema(samples::AbstractArray{<:AbstractString})
schema(samples::AbstractArray, map_fun::Function)
schema(map_fun::Function, samples::AbstractArray)
Creates a schema from an array of parsed or unparsed JSONs.
JsonGrinder.suggestextractor — Function
suggestextractor(e::DictEntry, settings = NamedTuple())
Creates a converter from JSON to a tree structure of DataNode.
e: top level of the JSON hierarchy, typically returned by invoking schema
settings: can be any container supporting the get function
settings.mincountkey: minimum number of repetitions of a key for it to be included in the extractor (if missing, it is equal to zero)
settings.key_as_field: if the number of keys exceeds this value, it is assumed that the keys contain a value, which means that they will be treated as strings
settings.scalar_extractors: rules for determining which extractor to use for leaves. The default value is the return value of default_scalar_extractor(). It is an array of pairs where the first element is a predicate; if the predicate matches, the second element, a function that maps the schema to a specific extractor, is called.
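Because settings only has to support get, both a NamedTuple and a Dict keyed by symbols would qualify. The sketch below is plain Julia and does not call JsonGrinder; the helper name read_mincountkey is hypothetical and only illustrates the lookup-with-default pattern described above.

```julia
# Hypothetical helper (not part of JsonGrinder): reads `mincountkey` from any
# container supporting `get`, falling back to zero when the key is absent.
read_mincountkey(settings) = get(settings, :mincountkey, 0)

nt_settings   = (mincountkey = 5, key_as_field = 500)  # a NamedTuple
dict_settings = Dict(:key_as_field => 500)             # a Dict without mincountkey

read_mincountkey(nt_settings)    # 5
read_mincountkey(dict_settings)  # 0
read_mincountkey(NamedTuple())   # 0, using the default value of `settings`
```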
JsonGrinder.generate_html — Function
generate_html(sch::DictEntry; max_vals=100, max_len=1_000)
generate_html(file_name, sch::DictEntry; max_vals=100, max_len=1_000)
Exports the schema to HTML, including CSS styles and JS that allow expanding and hiding sub-parts of the schema, countmaps, and lengthmaps.
Arguments
max_vals: controls the maximum number of exported values in a countmap
max_len: controls the maximum number of exported lengths of arrays
file_name: name of the file in which to save the HTML with the schema
Return
If file_name is provided, nothing is returned. Otherwise, the generated HTML+CSS+JS is returned as a String.
Example
You can either open the HTML file in any browser, or display it directly using ElectronDisplay:
using ElectronDisplay
using ElectronDisplay: newdisplay
generated_html = generate_html(sch, max_vals = 100)
display(newdisplay(), MIME{Symbol("text/html")}(), generated_html)
JsonGrinder.extractbatch — Function
extractbatch(extractor, samples)
Utility function, a shortcut for mapreduce(extractor, catobs, samples).
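The map-then-concatenate pattern behind this shortcut can be illustrated without Mill. In the sketch below, toy_extractor and toy_extractbatch are hypothetical names, and Base's hcat stands in for Mill's catobs; it only shows how per-sample results end up as columns of one batch.

```julia
# Toy stand-in for an extractor: maps one sample to a 2×1 matrix.
toy_extractor(x) = reshape(Float32[x, 2x], 2, 1)

# Hypothetical analogue of extractbatch, with `hcat` standing in for `catobs`.
toy_extractbatch(extractor, samples) = mapreduce(extractor, hcat, samples)

batch = toy_extractbatch(toy_extractor, [1, 2, 3])
size(batch)  # (2, 3): one column per sample
```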
JsonGrinder.ExtractScalar — Type
struct ExtractScalar{T} <: AbstractExtractor
    c::T
    s::T
    uniontypes::Bool
end
Extracts a numerical value, centred by subtracting c and scaled by multiplying by s. Strings are converted to numbers.
The extractor returns ArrayNode{Matrix{Union{Missing, Int64}},Nothing} or its subtypes. If passed missing, it extracts missing values, which Mill understands and can work with.
The uniontypes field determines whether the extractor accepts missing. If uniontypes is false, it does not accept missing values. If uniontypes is true, it accepts missing values and always returns a Mill structure of type Union{Missing, T} for type-stability reasons.
It can also be created using extractscalar(Float32, 5, 2).
Example
julia> ExtractScalar(Float32, 2, 3, true)(1)
1×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
-3.0f0
julia> ExtractScalar(Float32, 2, 3, true)(missing)
1×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
missing
julia> ExtractScalar(Float32, 2, 3, false)(1)
1×1 ArrayNode{Matrix{Float32}, Nothing}:
-3.0
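The arithmetic in the outputs above follows from the description: the value is first centred, then scaled, i.e. (x - c) * s. A plain-Julia sketch of that formula, with the hypothetical helper name center_scale (not part of JsonGrinder):

```julia
# Hypothetical helper mirroring the documented transformation:
# subtract c, then multiply by s. `missing` propagates through the arithmetic.
center_scale(x, c, s) = (x - c) * s

center_scale(1f0, 2f0, 3f0)      # -3.0f0, the same value as in the example with c = 2, s = 3
center_scale(missing, 2f0, 3f0)  # missing
```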
JsonGrinder.ExtractCategorical — Type
struct ExtractCategorical{V,I} <: AbstractExtractor
    keyvalemap::Dict{V,I}
    n::Int
    uniontypes::Bool
end
ExtractCategorical(s::Entry, uniontypes = true)
ExtractCategorical(s::UnitRange, uniontypes = true)
ExtractCategorical(s::Vector, uniontypes = true)
Converts a single item to a one-hot encoded vector, and an array of items to a matrix of one-hot encoded columns. An extra element is always allocated for an unknown value. If passed missing and uniontypes is true, it returns a column of missing values; otherwise it raises an error. If uniontypes is true, extracting missing values is allowed and all extracted values will be of type Union{Missing, <other type>} for type-stability reasons. Otherwise, extraction of missing values is not allowed.
Examples
julia> using Mill: catobs
julia> e = ExtractCategorical(2:4, true);
julia> mapreduce(e, catobs, [2,3,1,4])
4×4 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
true ⋅ ⋅ ⋅
⋅ true ⋅ ⋅
⋅ ⋅ ⋅ true
⋅ ⋅ true ⋅
julia> mapreduce(e, catobs, [1,missing,5])
4×3 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
⋅ missing ⋅
⋅ missing ⋅
⋅ missing ⋅
true missing true
julia> e(4)
4×1 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
⋅
⋅
true
⋅
julia> e(missing)
4×1 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
missing
missing
missing
missing
julia> e = ExtractCategorical(2:4, false);
julia> mapreduce(e, catobs, [2, 3, 1, 4])
4×4 ArrayNode{OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
1 ⋅ ⋅ ⋅
⋅ 1 ⋅ ⋅
⋅ ⋅ ⋅ 1
⋅ ⋅ 1 ⋅
julia> e(4)
4×1 ArrayNode{OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
⋅
⋅
1
⋅
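The row layout in these outputs (known values first, one extra last row for anything unknown) can be reproduced with a small plain-Julia sketch. The helper names onehot_index and onehot_column are hypothetical, and sorting the keys is an assumption made here so that the indices agree with the outputs shown in this document; it is not a statement about the library's internals.

```julia
# Hypothetical helper: row index of the hot element for value `x`.
# Known keys (assumed unique) get indices 1:length(keys); anything else
# falls into the extra last slot reserved for unknown values.
function onehot_index(keys, x)
    keymap = Dict(k => i for (i, k) in enumerate(sort(collect(keys))))
    n = length(keymap) + 1
    return get(keymap, x, n)
end

# Hypothetical helper: the corresponding one-hot column as a BitVector.
function onehot_column(keys, x)
    col = falses(length(keys) + 1)
    col[onehot_index(keys, x)] = true
    return col
end

onehot_index(2:4, 4)   # 3
onehot_index(2:4, 1)   # 4, the "unknown" row
onehot_column(2:4, 2)  # hot at row 1
```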
JsonGrinder.ExtractArray — Type
struct ExtractArray{T}
    item::T
end
Converts an array of values to a Mill.BagNode with items converted by item. The entire array is assumed to be a single bag.
Examples
julia> ec = ExtractArray(ExtractCategorical(2:4));
julia> ec([2, 3, 1, 4])
BagNode # 1 obs, 88 bytes
╰── ArrayNode(4×4 MaybeHotMatrix with Union{Missing, Bool} elements) # 4 obs, 92 bytes
julia> ans.data
4×4 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
true ⋅ ⋅ ⋅
⋅ true ⋅ ⋅
⋅ ⋅ ⋅ true
⋅ ⋅ true ⋅
julia> es = ExtractArray(ExtractScalar());
julia> es([2,3,4])
BagNode # 1 obs, 80 bytes
╰── ArrayNode(1×3 Array with Union{Missing, Float32} elements) # 3 obs, 63 bytes
julia> es([2,3,4]).data
1×3 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
2.0f0 3.0f0 4.0f0
JsonGrinder.ExtractDict — Type
struct ExtractDict{S} <: AbstractExtractor
    dict::S
end
Extracts all items in dict and returns them as a Mill.ProductNode. If a key is missing in the extracted dict, nothing is passed to the child extractors.
Examples
julia> e = ExtractDict(Dict(:a=>ExtractScalar(Float32, 2, 3),
:b=>ExtractCategorical(1:5)))
Dict
├── a: Float32
╰── b: Categorical d = 6
julia> res1 = e(Dict("a"=>1, "b"=>1))
ProductNode # 1 obs, 24 bytes
├── a: ArrayNode(1×1 Array with Union{Missing, Float32} elements) # 1 obs, 53 bytes
╰── b: ArrayNode(6×1 MaybeHotMatrix with Union{Missing, Bool} elements) # 1 obs, 77 bytes
julia> res1[:a]
1×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
-3.0f0
julia> res1[:b]
6×1 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
true
⋅
⋅
⋅
⋅
⋅
julia> res2 = e(Dict("a"=>0))
ProductNode # 1 obs, 24 bytes
├── a: ArrayNode(1×1 Array with Union{Missing, Float32} elements) # 1 obs, 53 bytes
╰── b: ArrayNode(6×1 MaybeHotMatrix with Union{Missing, Bool} elements) # 1 obs, 77 bytes
julia> res2[:a]
1×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
-6.0f0
julia> res2[:b]
6×1 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
missing
missing
missing
missing
missing
missing
JsonGrinder.ExtractVector — Type
struct ExtractVector{T} <: AbstractExtractor
    n::Int
    uniontypes::Bool
end
Represents an array of a fixed length, typically a feature vector of numbers of type T.
julia> sc = ExtractVector(4)
julia> sc([2, 3, 1, 4])
4×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
2.0f0
3.0f0
1.0f0
4.0f0
JsonGrinder.MultipleRepresentation — Type
MultipleRepresentation(extractors::Tuple)
Extracts an item to a ProductNode in which each child comes from a different extractor; the item is extracted by all extractors in the multirepresentation.
Examples
Example of both categorical and string representation
One use case is to use a string representation for strings and a categorical representation for the most frequent values. This allows the model to more easily learn frequent or otherwise significant values, while creating a meaningful representation for previously unseen inputs.
julia> e = MultipleRepresentation((ExtractString(false),
ExtractCategorical(["tcp", "udp", "dhcp"], false)));
julia> s1 = e("tcp")
ProductNode # 1 obs, 48 bytes
├── e1: ArrayNode(2053×1 NGramMatrix with Int64 elements) # 1 obs, 123 bytes
╰── e2: ArrayNode(4×1 OneHotArray with Bool elements) # 1 obs, 76 bytes
julia> s1[:e1]
2053×1 ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}:
"tcp"
julia> s1[:e2]
4×1 ArrayNode{OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
⋅
1
⋅
⋅
julia> s2 = e("http")
ProductNode # 1 obs, 48 bytes
├── e1: ArrayNode(2053×1 NGramMatrix with Int64 elements) # 1 obs, 124 bytes
╰── e2: ArrayNode(4×1 OneHotArray with Bool elements) # 1 obs, 76 bytes
julia> s2[:e1]
2053×1 ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}:
"http"
julia> s2[:e2]
4×1 ArrayNode{OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
⋅
⋅
⋅
1
Example of irregular schema representation
The other use case is handling an irregular schema, where an extractor returns a missing representation if it is unable to extract the value properly. The extractors do not all have to be leaf-value extractors; some may be ExtractDict, while others extract leaves, etc.
julia> e = MultipleRepresentation((ExtractString(), ExtractScalar(Float32, 2, 3)));
julia> s1 = e(5)
ProductNode # 1 obs, 40 bytes
├── e1: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements) # 1 obs, 112 bytes
╰── e2: ArrayNode(1×1 Array with Union{Missing, Float32} elements) # 1 obs, 53 bytes
julia> s1[:e1]
2053×1 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}:
missing
julia> s1[:e2]
1×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
9.0f0
julia> s2 = e("hi")
ProductNode # 1 obs, 40 bytes
├── e1: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements) # 1 obs, 122 bytes
╰── e2: ArrayNode(1×1 Array with Union{Missing, Float32} elements) # 1 obs, 53 bytes
julia> s2[:e1]
2053×1 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}:
"hi"
julia> s2[:e2]
1×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
missing
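The fan-out behaviour in both examples (every extractor sees the same item, and the results are keyed e1, e2, ...) can be sketched in plain Julia. Everything below is hypothetical and independent of JsonGrinder and Mill: multi_represent, as_string and as_number are toy stand-ins, and a NamedTuple plays the role of the ProductNode.

```julia
# Hypothetical sketch: apply every extractor to the same item and
# name the results e1, e2, ... in order.
function multi_represent(extractors::Tuple, x)
    names = ntuple(i -> Symbol("e", i), length(extractors))
    return NamedTuple{names}(map(f -> f(x), extractors))
end

# Toy stand-ins for extractors: each returns `missing` for inputs it cannot handle.
as_string(x) = x isa AbstractString ? x : missing
as_number(x) = x isa Number ? Float32(x) : missing

r1 = multi_represent((as_string, as_number), 5)     # e1 is missing, e2 is 5.0f0
r2 = multi_represent((as_string, as_number), "hi")  # e1 is "hi", e2 is missing
```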
JsonGrinder.ExtractString — Type
struct ExtractString{T} <: AbstractExtractor
    n::Int
    b::Int
    m::Int
    uniontypes::Bool
end
Represents a String as n-grams (NGramMatrix from Mill.jl) with base b and modulo m.
The uniontypes field determines whether the extractor accepts missing. If uniontypes is false, it does not accept missing values. If uniontypes is true, it accepts missing values and always returns a Mill structure of type Union{Missing, T} for type-stability reasons.
Example
julia> using Mill: catobs
julia> ExtractString(true)("hello")
2053×1 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}:
"hello"
julia> mapreduce(ExtractString(true), catobs, (["hello", "world"]))
2053×2 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}:
"hello"
"world"
julia> mapreduce(ExtractString(true), catobs, ["hello", missing])
2053×2 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}:
"hello"
missing
julia> ExtractString(true)(missing)
2053×1 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}:
missing
julia> ExtractString(false)("hello")
2053×1 ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}:
"hello"
julia> mapreduce(ExtractString(false), catobs, (["hello", "world"]))
2053×2 ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}:
"hello"
"world"
julia> ExtractString(false)(["hello", "world"])
ERROR: This extractor does not support missing values
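To give a rough intuition for what "n-grams with base b and modulo m" can mean, here is a plain-Julia sketch of polynomial hashing of byte n-grams. It is an illustration only: the function name ngram_indices is hypothetical, the values n = 3 and b = 256 are assumptions (only m = 2053 is suggested by the 2053 rows in the outputs above), and Mill's actual NGramMatrix may pad strings or compute indices differently.

```julia
# Hypothetical sketch (not Mill's implementation): hash every n-gram of bytes
# as a polynomial in base b, then reduce modulo m to get a row index.
function ngram_indices(s::AbstractString, n::Int, b::Int, m::Int)
    bytes = codeunits(s)
    idxs = Int[]
    for i in 1:(length(bytes) - n + 1)
        h = 0
        for j in 0:(n - 1)
            h = h * b + Int(bytes[i + j])  # polynomial in base b
        end
        push!(idxs, mod(h, m) + 1)         # 1-based index in 1:m
    end
    return idxs
end

idxs = ngram_indices("hello", 3, 256, 2053)  # one index per trigram: "hel", "ell", "llo"
```

With a fixed modulo, every string maps to indices in the same range 1:m, which is why all outputs above have the same number of rows regardless of string length.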
JsonGrinder.ExtractKeyAsField — Type
struct ExtractKeyAsField{S,V} <: AbstractExtractor
    key::S
    item::V
end
Extracts all items in vec and in other and returns them as a ProductNode.
JsonGrinder.AuxiliaryExtractor — Type
struct AuxiliaryExtractor <: AbstractExtractor
    extractor::AbstractExtractor
    extract_fun::Function
end
Universal extractor for applying any function, which lets you embed any transformation into the AbstractExtractor machinery. Useful e.g. for extractors accompanying trained models, where you need to apply yet another transformation.
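The wrapping idea can be sketched in plain Julia as a callable struct that forwards the wrapped extractor and the sample to a user-supplied function. The names ToyAuxiliary, toy_dict and toy_aux are hypothetical, and the two-argument form (extractor, sample) is an assumption taken from the lambda used in this section's own example.

```julia
# Hypothetical sketch (not the library's definition): a callable wrapper that
# passes the wrapped extractor and the sample to a user-supplied function.
struct ToyAuxiliary{E,F}
    extractor::E
    extract_fun::F
end
(a::ToyAuxiliary)(x) = a.extract_fun(a.extractor, x)

# Toy "extractors": plain functions stored under symbol keys.
toy_dict = Dict(:a => uppercase, :b => length)

# Only the :a child is applied, and only to the "a" field of the sample.
toy_aux = ToyAuxiliary(toy_dict, (e, x) -> e[:a](x["a"]))
result = toy_aux(Dict("a" => "Hello", "b" => "World"))  # "HELLO"
```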
julia> e1 = ExtractDict(Dict(:a=>ExtractString(), :b=>ExtractString()));
julia> e2 = AuxiliaryExtractor(e1, (e,x)->e[:a](x["a"]))
Auxiliary extractor with
╰── Dict
├── a: String
╰── b: String
julia> e2(Dict("a"=>"Hello", "b"=>"World"))
ArrayNode{NGramMatrix{String,Array{String,1},Int64},Nothing}:
"Hello"