Public API Reference

Documentation for JsonGrinder.jl's public interface.

See the Internals section of the manual for internal package docs covering all functions.

Public Interface

JsonGrinder.schema - Function
schema(samples::AbstractArray{<:Dict})
schema(samples::AbstractArray{<:AbstractString})
schema(samples::AbstractArray, map_fun::Function)
schema(map_fun::Function, samples::AbstractArray)

Creates a schema from an array of parsed or unparsed JSON documents.
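
As a rough illustration of the string-based method listed above, the following sketch builds a schema from unparsed JSON documents. The sample documents are made up for this example, and it assumes that the result for JSON objects is a DictEntry, the type named in the suggestextractor signature on this page.

```julia
using JsonGrinder

# Two unparsed JSON documents (made-up sample data).
samples = [
    """{"a": 1, "b": "tcp"}""",
    """{"a": 2, "b": "udp"}""",
]

# Dispatches to schema(samples::AbstractArray{<:AbstractString}).
sch = schema(samples)
```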

JsonGrinder.suggestextractor - Function
suggestextractor(e::DictEntry, settings = NamedTuple())

Creates an extractor that converts JSON into a tree structure of DataNodes.

  • e: top level of the JSON hierarchy, typically returned by invoking schema
  • settings: any container supporting the get function
  • settings.mincountkey: minimum number of occurrences a key needs in order to be included in the extractor (zero if missing)
  • settings.key_as_field: if the number of keys exceeds this value, the keys are assumed to carry values themselves, so they are treated as strings
  • settings.scalar_extractors: rules that determine which extractor to use for leaves. The default is the return value of default_scalar_extractor(), an array of pairs: the first element is a predicate and, if it matches, the second element (a function mapping a schema to a specific extractor) is called.
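
A minimal sketch of the typical workflow, assuming the made-up sample data below: a schema is computed first, and an extractor is then suggested from it, once with default settings and once with a mincountkey threshold passed as a NamedTuple. The threshold value 2 is an arbitrary choice for illustration.

```julia
using JsonGrinder

# Made-up sample data: key "a" occurs four times, key "b" only once.
samples = [
    """{"a": 1, "b": "tcp"}""",
    """{"a": 2}""",
    """{"a": 3}""",
    """{"a": 4}""",
]
sch = schema(samples)

# Extractor suggested with the default settings (an empty NamedTuple).
extractor = suggestextractor(sch)

# Same schema, but keys occurring less often than the threshold are left out.
pruned = suggestextractor(sch, (mincountkey = 2,))
```

Any container supporting get should work in place of the NamedTuple, as the settings description states.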

JsonGrinder.generate_html - Function
generate_html(sch::DictEntry; max_vals=100, max_len=1_000)
generate_html(file_name, sch::DictEntry; max_vals=100, max_len=1_000)

Exports the schema to HTML, including CSS styles and JS that allow expanding / hiding sub-parts of the schema, countmaps, and lengthmaps.

Arguments

  • max_vals controls the maximum number of exported values in a countmap
  • max_len controls the maximum number of exported lengths of arrays
  • file_name is the name of the file in which the HTML with the schema is saved

Return

If file_name is provided, nothing is returned. Otherwise, the generated HTML+CSS+JS is returned as a String.

Example

You can either open the HTML file in any browser or display the generated HTML directly using ElectronDisplay:

using ElectronDisplay
using ElectronDisplay: newdisplay
generated_html = generate_html(sch, max_vals = 100)
display(newdisplay(), MIME{Symbol("text/html")}(), generated_html)
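
The two call forms can be compared side by side in the following sketch. The sample documents, the small max_vals / max_len values, and the temporary output path are all assumptions made for this illustration.

```julia
using JsonGrinder

# Made-up sample data used only to obtain a schema.
sch = schema(["""{"a": 1, "b": [1, 2, 3]}""", """{"a": 2, "b": [4]}"""])

# Without a file name, the HTML is returned as a String.
html = generate_html(sch; max_vals = 10, max_len = 10)

# With a file name, the HTML is written to that file instead.
path = joinpath(mktempdir(), "schema.html")   # arbitrary location chosen for this sketch
generate_html(path, sch; max_vals = 10, max_len = 10)
```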

JsonGrinder.ExtractScalar - Type
struct ExtractScalar{T} <: AbstractExtractor
	c::T
	s::T
	uniontypes::Bool
end

Extracts a numerical value, centred by subtracting c and scaled by multiplying by s. Strings are converted to numbers.

The extractor returns ArrayNode{Matrix{Union{Missing, Int64}},Nothing} or its subtypes. If passed missing, it extracts missing values, which Mill understands and can work with.

The uniontypes field determines whether the extractor accepts missing values. If uniontypes is false, missing values are not accepted. If uniontypes is true, missing values are accepted and the returned Mill structure always has element type Union{Missing, T}, for type stability.

It can also be created using extractscalar(Float32, 5, 2).
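
The centring and scaling described above amounts to (x - c) * s. The plain-Julia sketch below reproduces only that arithmetic. Here center_and_scale is a hypothetical helper that is not part of JsonGrinder, and parsing strings with parse is an assumption about how the string conversion could be done.

```julia
# Hypothetical helper, not part of JsonGrinder: subtract c, then multiply by s.
center_and_scale(x::Number, c, s) = (x - c) * s

# Assumption for illustration: strings are parsed to Float32 before the arithmetic.
center_and_scale(x::AbstractString, c, s) = center_and_scale(parse(Float32, x), c, s)

y1 = center_and_scale(1, 2f0, 3f0)    # (1 - 2) * 3
y2 = center_and_scale("5", 2f0, 3f0)  # (5 - 2) * 3
```

With c = 2 and s = 3, the input 1 maps to -3, which agrees with the value printed by ExtractScalar(Float32, 2, 3, true)(1) in this entry's example.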

Example

julia> ExtractScalar(Float32, 2, 3, true)(1)
1×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
 -3.0f0

julia> ExtractScalar(Float32, 2, 3, true)(missing)
1×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
 missing

julia> ExtractScalar(Float32, 2, 3, false)(1)
1×1 ArrayNode{Matrix{Float32}, Nothing}:
 -3.0

JsonGrinder.ExtractCategorical - Type
struct ExtractCategorical{V,I} <: AbstractExtractor
	keyvalemap::Dict{V,I}
	n::Int
	uniontypes::Bool
end
ExtractCategorical(s::Entry, uniontypes = true)
ExtractCategorical(s::UnitRange, uniontypes = true)
ExtractCategorical(s::Vector, uniontypes = true)

Converts a single item to a one-hot encoded vector and an array of items into a matrix of one-hot encoded columns. An extra element is always allocated for unknown values. If uniontypes is true, missing values can be extracted: passing missing returns a column of missing values, and all extracted values are of type Union{Missing, <other type>} for type stability. If uniontypes is false, passing missing raises an error.
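
The role of the extra element for unknown values can be sketched in plain Julia. This is not JsonGrinder's implementation: row, onehot, and dim are hypothetical names, and a dense Bool vector stands in for the one-hot column.

```julia
# Hypothetical sketch, not JsonGrinder's implementation.
categories = 2:4
keyvalemap = Dict(v => i for (i, v) in enumerate(categories))
dim = length(keyvalemap) + 1          # one extra row reserved for unknown values

# Row index of the active element; unknown values fall into the last row.
row(x) = get(keyvalemap, x, dim)

# Dense Bool vector standing in for a one-hot column.
onehot(x) = [i == row(x) for i in 1:dim]
```

Under these assumptions, onehot(2) activates the first row, while an unseen value such as 1 activates the fourth (last) row, the same layout as the columns shown in the examples for ExtractCategorical(2:4).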

Examples

julia> using Mill: catobs

julia> e = ExtractCategorical(2:4, true);

julia> mapreduce(e, catobs, [2,3,1,4])
4×4 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
  true    ⋅      ⋅      ⋅
   ⋅     true    ⋅      ⋅
   ⋅      ⋅      ⋅     true
   ⋅      ⋅     true    ⋅

julia> mapreduce(e, catobs, [1,missing,5])
4×3 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
   ⋅    missing    ⋅
   ⋅    missing    ⋅
   ⋅    missing    ⋅
  true  missing   true

julia> e(4)
4×1 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
   ⋅
   ⋅
  true
   ⋅

julia> e(missing)
4×1 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
 missing
 missing
 missing
 missing

julia> e = ExtractCategorical(2:4, false);

julia> mapreduce(e, catobs, [2, 3, 1, 4])
4×4 ArrayNode{OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
 1  ⋅  ⋅  ⋅
 ⋅  1  ⋅  ⋅
 ⋅  ⋅  ⋅  1
 ⋅  ⋅  1  ⋅

julia> e(4)
4×1 ArrayNode{OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
 ⋅
 ⋅
 1
 ⋅

JsonGrinder.ExtractArray - Type
struct ExtractArray{T}
	item::T
end

Converts an array of values to a Mill.BagNode with items converted by item. The entire array is assumed to be a single bag.

Examples

julia> ec = ExtractArray(ExtractCategorical(2:4));

julia> ec([2, 3, 1, 4])
BagNode  # 1 obs, 88 bytes
  ╰── ArrayNode(4×4 MaybeHotMatrix with Union{Missing, Bool} elements)  # 4 obs, 92 bytes

julia> ans.data
4×4 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
  true    ⋅      ⋅      ⋅  
   ⋅     true    ⋅      ⋅
   ⋅      ⋅      ⋅     true
   ⋅      ⋅     true    ⋅

julia> es = ExtractArray(ExtractScalar());

julia> es([2,3,4])
BagNode  # 1 obs, 80 bytes
  ╰── ArrayNode(1×3 Array with Union{Missing, Float32} elements)  # 3 obs, 63 bytes

julia> es([2,3,4]).data
1×3 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
 2.0f0  3.0f0  4.0f0

JsonGrinder.ExtractDict - Type
struct ExtractDict{S} <: AbstractExtractor
	dict::S
end

Extracts all items in dict and returns them as a Mill.ProductNode. If a key is missing in the extracted dict, nothing is passed to the child extractors.

Examples

julia> e = ExtractDict(Dict(:a=>ExtractScalar(Float32, 2, 3),
                            :b=>ExtractCategorical(1:5)))
Dict
  ├── a: Float32
  ╰── b: Categorical d = 6

julia> res1 = e(Dict("a"=>1, "b"=>1))
ProductNode  # 1 obs, 24 bytes
  ├── a: ArrayNode(1×1 Array with Union{Missing, Float32} elements)  # 1 obs, 53 bytes
  ╰── b: ArrayNode(6×1 MaybeHotMatrix with Union{Missing, Bool} elements)  # 1 obs, 77 bytes

julia> res1[:a]
1×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
 -3.0f0

julia> res1[:b]
6×1 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
  true
   ⋅
   ⋅
   ⋅
   ⋅
   ⋅

julia> res2 = e(Dict("a"=>0))
ProductNode  # 1 obs, 24 bytes
  ├── a: ArrayNode(1×1 Array with Union{Missing, Float32} elements)  # 1 obs, 53 bytes
  ╰── b: ArrayNode(6×1 MaybeHotMatrix with Union{Missing, Bool} elements)  # 1 obs, 77 bytes

julia> res2[:a]
1×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
 -6.0f0

julia> res2[:b]
6×1 ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}:
 missing
 missing
 missing
 missing
 missing
 missing

JsonGrinder.ExtractVector - Type
struct ExtractVector{T} <: AbstractExtractor
    n::Int
    uniontypes::Bool
end

Represents an array of fixed length, typically a feature vector of numbers of type T.

Example

julia> sc = ExtractVector(4);

julia> sc([2, 3, 1, 4])
4×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
 2.0f0
 3.0f0
 1.0f0
 4.0f0

JsonGrinder.MultipleRepresentation - Type
MultipleRepresentation(extractors::Tuple)

Extracts an item to a ProductNode in which each child is produced by a different extractor; the item is extracted by all extractors in the MultipleRepresentation.

Examples

Example of both categorical and string representation

One use case is to combine a string representation with a categorical representation of the most frequent values. This lets the model learn frequent or otherwise significant values more easily, while still creating a meaningful representation for previously unseen inputs.

julia> e = MultipleRepresentation((ExtractString(false),
                        ExtractCategorical(["tcp", "udp", "dhcp"], false)));

julia> s1 = e("tcp")
ProductNode  # 1 obs, 48 bytes
  ├── e1: ArrayNode(2053×1 NGramMatrix with Int64 elements)  # 1 obs, 123 bytes
  ╰── e2: ArrayNode(4×1 OneHotArray with Bool elements)  # 1 obs, 76 bytes

julia> s1[:e1]
2053×1 ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}:
 "tcp"

julia> s1[:e2]
4×1 ArrayNode{OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
 ⋅
 1
 ⋅
 ⋅

julia> s2 = e("http")
ProductNode  # 1 obs, 48 bytes
  ├── e1: ArrayNode(2053×1 NGramMatrix with Int64 elements)  # 1 obs, 124 bytes
  ╰── e2: ArrayNode(4×1 OneHotArray with Bool elements)  # 1 obs, 76 bytes

julia> s2[:e1]
2053×1 ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}:
 "http"

julia> s2[:e2]
4×1 ArrayNode{OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
 ⋅
 ⋅
 ⋅
 1

Example of irregular schema representation

The other use case is handling an irregular schema, where an extractor returns a missing representation if it is unable to extract the value properly. The extractors need not all be leaf-value extractors: some may be ExtractDict, while others extract leaves, and so on.

julia> e = MultipleRepresentation((ExtractString(), ExtractScalar(Float32, 2, 3)));

julia> s1 = e(5)
ProductNode  # 1 obs, 40 bytes
  ├── e1: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements)  # 1 obs, 112 bytes
  ╰── e2: ArrayNode(1×1 Array with Union{Missing, Float32} elements)  # 1 obs, 53 bytes

julia> s1[:e1]
2053×1 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}:
 missing

julia> s1[:e2]
1×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
 9.0f0

julia> s2 = e("hi")
ProductNode  # 1 obs, 40 bytes
  ├── e1: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements)  # 1 obs, 122 bytes
  ╰── e2: ArrayNode(1×1 Array with Union{Missing, Float32} elements)  # 1 obs, 53 bytes

julia> s2[:e1]
2053×1 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}:
 "hi"

julia> s2[:e2]
1×1 ArrayNode{Matrix{Union{Missing, Float32}}, Nothing}:
 missing

JsonGrinder.ExtractString - Type
struct ExtractString{T} <: AbstractExtractor
	n::Int
	b::Int
	m::Int
	uniontypes::Bool
end

Represents String as n-grams (NGramMatrix from Mill.jl) with base b and modulo m.
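
To give an intuition for what "n-grams with base b and modulo m" means, the sketch below hashes every n-gram of a string into one of m buckets using a polynomial hash in base b. It is a simplified illustration and not Mill.jl's actual algorithm (which may, for example, pad the string). The helper name ngram_buckets and the defaults n = 3, b = 256, m = 2053 are assumptions; m = 2053 is chosen only because it matches the 2053 rows seen in this entry's example output.

```julia
# Hypothetical helper: hash every n-gram of `s` into a bucket in 1:m.
function ngram_buckets(s::AbstractString; n = 3, b = 256, m = 2053)
    bytes = codeunits(s)
    buckets = Int[]
    for start in 1:(length(bytes) - n + 1)
        h = 0
        for k in 0:(n - 1)
            # Polynomial hash in base b, reduced modulo m at every step.
            h = (h * b + bytes[start + k]) % m
        end
        push!(buckets, h + 1)   # shift from 0:m-1 to Julia's 1-based indexing
    end
    return buckets
end

buckets = ngram_buckets("hello")   # one bucket per trigram: "hel", "ell", "llo"
```

Strings shorter than n yield no n-grams in this simplified version, so the result is an empty vector.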

The uniontypes field determines whether the extractor accepts missing values. If uniontypes is false, missing values are not accepted. If uniontypes is true, missing values are accepted and the returned Mill structure always has element type Union{Missing, T}, for type stability.

Example

julia> using Mill: catobs

julia> ExtractString(true)("hello")
2053×1 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}:
 "hello"

julia> mapreduce(ExtractString(true), catobs, (["hello", "world"]))
2053×2 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}:
 "hello"
 "world"
 
julia> mapreduce(ExtractString(true), catobs, ["hello", missing])
2053×2 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}:
 "hello"
 missing

julia> ExtractString(true)(missing)
2053×1 ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}:
 missing

julia> ExtractString(false)("hello")
2053×1 ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}:
 "hello"

julia> mapreduce(ExtractString(false), catobs, (["hello", "world"]))
2053×2 ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}:
 "hello"
 "world"

julia> ExtractString(false)(["hello", "world"])
ERROR: This extractor does not support missing values

JsonGrinder.ExtractKeyAsField - Type
struct ExtractKeyAsField{S,V} <: AbstractExtractor
    key::S
    item::V
end

Extracts the keys with key and the corresponding values with item, and returns them as a ProductNode.

JsonGrinder.AuxiliaryExtractor - Type
struct AuxiliaryExtractor <: AbstractExtractor
	extractor::AbstractExtractor
	extract_fun::Function
end

Universal extractor for applying any function, which lets you embed any transformation into the AbstractExtractor machinery. Useful, for example, for extractors accompanying trained models, where you need to apply yet another transformation.

julia> e1 = ExtractDict(Dict(:a=>ExtractString(), :b=>ExtractString()));

julia> e2 = AuxiliaryExtractor(e1, (e,x)->e[:a](x["a"]))
Auxiliary extractor with
  ╰── Dict
        ├── a: String
        ╰── b: String

julia> e2(Dict("a"=>"Hello", "b"=>"World"))
ArrayNode{NGramMatrix{String,Array{String,1},Int64},Nothing}:
 "Hello"