Schema API Reference

This section of the internal API reference covers the creation, manipulation, and visualization of schemas.

Index

Internal functions

Base.delete! - Function

Deletes the field at the specified path from the schema sch. For instance, delete!(schema, ".field.subfield.[]", "x") deletes the field x from schema at schema.childs[:field].childs[:subfield].items.childs.

Base.merge - Function

Dispatch of Base.merge on JsonGrinder.JSONEntry structures. Merges multiple schemas into a single one.

merge(es::Entry...)
merge(es::DictEntry...)
merge(es::ArrayEntry...)
merge(es::MultiEntry...)
merge(es::JsonGrinder.JSONEntry...)

It can be used to distribute the calculation of a schema across multiple workers and then merge their partial results into a larger one.

Example

To calculate a schema from an array of JSONs in a distributed manner, given an array jsons, we can use

using ThreadsX
# cld rounds the chunk size up, so it is never zero for a non-empty jsons
ThreadsX.mapreduce(schema, merge, Iterators.partition(jsons, cld(length(jsons), Threads.nthreads())))

or

using ThreadTools
# merge takes its schemas as separate arguments, so splat the collection returned by tmap
merge(tmap(schema, Threads.nthreads(), Iterators.partition(jsons, cld(length(jsons), Threads.nthreads())))...)

or, if you prefer to split the work into many smaller jobs and have them processed by multiple threads, it can look like

using ThreadTools
# merge takes its schemas as separate arguments, so splat the collection returned by tmap
merge(tmap(schema, Threads.nthreads(), Iterators.partition(jsons, 1_000))...)

where we split the array into smaller arrays of 1,000 elements each and let all available threads create partial schemas.

If your data is too large to fit into RAM, this approach also works well with file names and similar ways of processing large data.
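The partition, map, and merge pattern described above can be illustrated without JsonGrinder. The following is a minimal sketch in which toy_counts and toy_merge are hypothetical stand-ins for schema and merge; it only shows why merging partial statistics gives the same result as a single pass, and it is not how JsonGrinder implements either function.

```julia
# Hypothetical stand-in for `schema`: count how often each value occurs in a chunk.
function toy_counts(chunk)
    counts = Dict{Any,Int}()
    for v in chunk
        counts[v] = get(counts, v, 0) + 1
    end
    return counts
end

# Hypothetical stand-in for `merge`: add up the counts of two partial results.
toy_merge(a, b) = mergewith(+, a, b)

data = ["x", "y", "x", "z", "x", "y"]
chunks = Iterators.partition(data, 2)

# Each chunk could be processed by a different worker or thread.
partial = map(toy_counts, chunks)
combined = reduce(toy_merge, partial)

# Reference result computed in a single pass over all the data.
single_pass = toy_counts(data)
```

Because the partial results are combined by simple addition, the order in which chunks finish does not matter, which is what makes this pattern convenient for threads or workers.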

JsonGrinder.prune_json - Function
prune_json(json, schema)

Removes keys from json which are not part of the schema.

Example

julia> using JSON

julia> j1 = JSON.parse("{\"a\": 4, \"b\": {\"a\":1, \"b\": 1}}");

julia> j2 = JSON.parse("{\"a\": 4, \"b\": {\"a\":1}}");

julia> sch = JsonGrinder.schema([j1,j2])
[Dict]  # updated = 2
  ├── a: [Scalar - Int64], 1 unique values  # updated = 2
  ╰── b: [Dict]  # updated = 2
           ├── a: [Scalar - Int64], 1 unique values  # updated = 2
           ╰── b: [Scalar - Int64], 1 unique values  # updated = 1

julia> j3 = Dict("a" => 4, "b" => Dict("a"=>1), "c" => 1, "d" => 2)
Dict{String, Any} with 4 entries:
  "c" => 1
  "b" => Dict("a"=>1)
  "a" => 4
  "d" => 2

julia> JsonGrinder.prune_json(j3, sch)
Dict{String, Any} with 2 entries:
  "b" => Dict("a"=>1)
  "a" => 4

Here JsonGrinder.prune_json removes the keys c and d, which are not part of the schema.
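The idea behind pruning can be sketched with plain dictionaries. In the sketch below, toy_prune and the nested allowed dictionary are hypothetical: allowed stands in for a schema by listing the known keys (with nothing marking a leaf), and the function is an illustration of the recursion, not the implementation of JsonGrinder.prune_json.

```julia
# Toy stand-in for a schema: nested Dicts list the known keys, `nothing` marks a leaf.
allowed = Dict{String,Any}(
    "a" => nothing,
    "b" => Dict{String,Any}("a" => nothing, "b" => nothing),
)

# Hypothetical helper: scalars and anything without nested key info pass through unchanged.
toy_prune(json, allowed) = json

# For objects, keep only the known keys and recurse into their values.
function toy_prune(json::AbstractDict, allowed::AbstractDict)
    out = Dict{String,Any}()
    for (k, v) in json
        haskey(allowed, k) || continue     # drop keys that are not listed
        out[k] = toy_prune(v, allowed[k])  # descend into nested objects
    end
    return out
end

j3 = Dict{String,Any}("a" => 4, "b" => Dict{String,Any}("a" => 1), "c" => 1, "d" => 2)
pruned = toy_prune(j3, allowed)
```

The input dictionary is left untouched; the sketch builds a new dictionary containing only the listed keys.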

JsonGrinder.updatemaxlen! - Function
updatemaxlen!(n::Int)

Limits the maximum length of string values kept in the statistics of JSON nodes. The default value is 10_000.
Longer strings are trimmed, and their length and hash are appended to preserve uniqueness.
This is because some strings are very long and can make the schema an order of magnitude larger than needed.
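The trimming described above can be sketched as follows. The helper toy_shorten and the underscore-separated suffix are assumptions made for illustration; the exact format JsonGrinder uses is not specified here.

```julia
# Hypothetical helper; the real suffix format used by JsonGrinder may differ.
function toy_shorten(s::AbstractString, maxlen::Int)
    # Strings within the limit are kept as they are.
    length(s) <= maxlen && return String(s)
    # Keep the first `maxlen` characters, then append the original length and a hash
    # so that two long strings sharing a prefix still map to different keys.
    return string(first(s, maxlen), "_", length(s), "_", hash(s))
end

short = toy_shorten("abc", 5)
long1 = toy_shorten("aaaaaaaaaa", 5)
long2 = toy_shorten("aaaaaaaaab", 5)
```

Both long strings share the same five-character prefix and length, so only the appended hash keeps them apart.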
JsonGrinder.Entry - Type
mutable struct Entry <: JSONEntry
	counts::Dict{Any,Int}
	updated::Int
end

Keeps statistics about the scalar values of a single key and also about items inside a key.

  • counts records how many times each value appeared (at most max_keys() values are held)
  • updated counts how many times the entry was updated
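How the two fields relate can be shown with a small toy type. ToyEntry, toy_update!, and the max_keys keyword with its value of 3 are hypothetical and chosen only for illustration; the update rule is an assumption about the behavior described above, not the JsonGrinder.Entry implementation.

```julia
# Toy mirror of the documented fields; not the JsonGrinder.Entry type itself.
mutable struct ToyEntry
    counts::Dict{Any,Int}
    updated::Int
end
ToyEntry() = ToyEntry(Dict{Any,Int}(), 0)

# Assumed update rule: known values are always counted,
# new values are added only while fewer than `max_keys` are stored.
function toy_update!(e::ToyEntry, v; max_keys::Int = 3)
    if haskey(e.counts, v)
        e.counts[v] += 1
    elseif length(e.counts) < max_keys
        e.counts[v] = 1
    end
    e.updated += 1  # every observation counts as an update, stored or not
    return e
end

e = ToyEntry()
foreach(v -> toy_update!(e, v), [1, 1, 2, 3, 4, 1])
```

In this run the value 4 arrives after three distinct values are already stored, so it is not added to counts, while updated still reflects all six observations.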
JsonGrinder.ArrayEntry - Type
mutable struct ArrayEntry <: JSONEntry
	items
	l::Dict{Int,Int}
	updated::Int
end

Keeps statistics about an array entry in JSON.

  • items is an Entry or nothing and keeps statistics about the elements of the array
  • l keeps a histogram of the lengths of the observed arrays
  • updated counts how many times the struct was updated.
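As a rough illustration of the l field, the sketch below builds a length histogram in a Dict{Int,Int} from a few sample arrays. It assumes that l maps an array length to the number of arrays observed with that length; the variable names and data are hypothetical.

```julia
# Sample arrays as they might appear under one key across several JSON documents.
arrays = [[1, 2], [3], [4, 5], Int[]]

# Assumed meaning of `l`: array length => number of arrays seen with that length.
l = Dict{Int,Int}()
for a in arrays
    n = length(a)
    l[n] = get(l, n, 0) + 1
end
```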
JsonGrinder.MultiEntry - Type
mutable struct MultiEntry <: JSONEntry
	childs::Vector{Any}
end

Support for JSON that does not adhere to a fixed type. A container for the multiple types of entries observed at the same place in JSON.

JsonGrinder.DictEntry - Type
mutable struct DictEntry <: JSONEntry
	childs::Dict{String, Any}
	updated::Int
end

Keeps statistics about an object in JSON.

  • childs maintains key-value statistics of the children. All values should be JSONEntries
  • updated counts how many times the struct was updated.