Custom nodes
Mill.jl data nodes are lightweight wrappers around data, such as Array, DataFrame, and others. It is of course possible to define custom data (and model) nodes. A useful abstraction for implementing custom data nodes, suitable for most cases, is LazyNode, which you can easily use to extend the functionality of Mill.jl.
Unix path example
Let's define a custom node type for representing path names in Unix and one custom model type for processing it. LazyNode serves as boilerplate for simple extensions of the Mill.jl ecosystem. We start by defining an example of such a node:
julia> ds = LazyNode{:Path}(["/var/lib/blob_files/myfile.blob"])
LazyNode{:Path, Vector{String}, Nothing}:
 "/var/lib/blob_files/myfile.blob"
An entirely new type is not needed, because we can dispatch on the first type parameter. Specifically, the :Path "tag" in this case defines a special kind of LazyNode. Consequently, we can define multiple variations of custom LazyNode without any conflicts in dispatch.
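The dispatch mechanism described above can be illustrated without Mill.jl at all. The following is a minimal sketch, assuming a hypothetical stand-in type Tagged and a hypothetical function describe (neither is part of Mill.jl); it only shows how a Symbol in the first type parameter selects a method:

```julia
# Hypothetical stand-in for LazyNode: the first type parameter N is a Symbol tag,
# the second one is the type of the wrapped data.
struct Tagged{N, D}
    data::D
end

# Convenience constructor so that only the tag has to be written explicitly.
Tagged{N}(data::D) where {N, D} = Tagged{N, D}(data)

# One method per tag; Julia picks the method from the tag alone,
# so the two definitions never conflict.
describe(::Tagged{:Path}) = "a Unix path"
describe(::Tagged{:Url}) = "a URL"

p = Tagged{:Path}(["/var/lib/python"])
u = Tagged{:Url}(["https://example.com"])

println(describe(p))
println(describe(u))
```

Both values wrap a Vector{String}, yet each reaches its own method, which is the same property that lets several LazyNode variants coexist.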
As a next step, we extend the Mill.unpack2mill function, which always takes one LazyNode and produces an arbitrary Mill.jl structure. We will represent individual file and directory names (as obtained by splitpath) using an NGramMatrix representation and, for simplicity, the whole path as a bag of individual names:
function Mill.unpack2mill(ds::LazyNode{:Path})
    ss = splitpath.(ds.data)
    x = NGramMatrix(reduce(vcat, ss), 3)
    BagNode(ArrayNode(x), Mill.length2bags(length.(ss)))
end
julia> Mill.unpack2mill(ds)
BagNode  1 obs
  ╰── ArrayNode(2053×5 NGramMatrix with Int64 elements)  5 obs
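To see where the five instances in the bag come from, the splitting and grouping steps can be reproduced with Base Julia only. This is a sketch under assumptions: it presumes a Unix-like system (splitpath behaves differently on Windows), and the index ranges are computed by hand here merely to illustrate the idea of turning per-path lengths into bags; it is not the implementation of Mill.length2bags.

```julia
paths = ["/var/lib/blob_files/myfile.blob", "/var/lib/python"]

# Split every path into its components; the root "/" counts as a component.
ss = splitpath.(paths)

# Number of components per path, i.e. the size of each bag.
lens = length.(ss)

# All components flattened into a single vector of instances.
names = reduce(vcat, ss)

# Index ranges into `names` that belong to each path (computed by hand).
stops = cumsum(lens)
starts = stops .- lens .+ 1
ranges = [a:b for (a, b) in zip(starts, stops)]

println(ss[1])
println(lens)
println(ranges)
```

The first path yields five components, which matches the 5 obs reported for the inner ArrayNode above.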
Also, note that the node keeps an array of strings instead of just one string. This is because we want our node to be able to hold more than one observation. Methods such as catobs work as expected:
ds1 = LazyNode{:Path}(["/var/lib/blob_files/myfile.blob"])
ds2 = LazyNode{:Path}(["/var/lib/python"])
julia> ds = catobs(ds1, ds2)
LazyNode{:Path, Vector{String}, Nothing}:
 "/var/lib/blob_files/myfile.blob"
 "/var/lib/python"
The Mill.unpack2mill function is called lazily during inference by a LazyModel counterpart.
Model reflection works too:
julia> pm = reflectinmodel(ds, d -> Dense(d, 3))
LazyModel{Path}
  ╰── BagModel ↦ BagCount([SegmentedMean(3); SegmentedMax(3)]) ↦ Dense(7 => 3)  ⋯
        ╰── ArrayModel(Dense(2053 => 3))  2 arrays, 6_162 params, 24.156 KiB
We can use the obtained model to perform inference as we would do with any other model.
julia> pm(ds)
3×2 Matrix{Float32}:
 -1.09527   -0.984371
  1.29677    1.15081
  0.464608   0.439383
Adding custom nodes without LazyNode
The solution using LazyNode is sufficient in most scenarios. For other cases, it is recommended to equip custom nodes with the following functionality:
- allow nesting (if needed)
- implement Base.getindex to obtain subsets of observations. We make use of Mill.metadata_getindex to index the metadata.
- allow concatenation of nodes with catobs. Optionally, implement reduce(catobs, ...) as well to avoid excessive compilation if the number of arguments varies a lot.
- define a specialized method for MLUtils.numobs, which we can however import directly from Mill.jl.
- register the custom node with HierarchicalUtils.jl to obtain pretty printing, iterators and other functionality
Here is an example of a custom node with the same functionality as in the Unix path example section:
using Mill
import Base: getindex, show
import Mill: catobs, numobs, data, metadata, VecOrRange, AbstractMillNode, reflectinmodel
import Flux
import HierarchicalUtils: NodeType, LeafNode
# A leaf node holding one path string per observation, plus optional metadata
struct PathNode{S <: AbstractString, C} <: AbstractMillNode
    data::Vector{S}
    metadata::C
end
PathNode(data::Vector{S}) where {S <: AbstractString} = PathNode(data, nothing)
Base.show(io::IO, n::PathNode) = print(io, "PathNode ($(numobs(n)) obs)")
Base.ndims(::PathNode) = Colon()
# One observation per stored path
numobs(n::PathNode) = length(n.data)
# Concatenate any number of nodes (note the splat in the signature)
catobs(ns::PathNode...) = PathNode(vcat(data.(ns)...), catobs(metadata.(ns)...))
# Select a subset of observations together with their metadata
Base.getindex(n::PathNode, i::VecOrRange{<:Int}) = PathNode(n.data[i], Mill.metadata_getindex(n.metadata, i))
# Register the node as a leaf with HierarchicalUtils.jl
NodeType(::Type{<:PathNode}) = LeafNode()
We also have to define a corresponding model node type which will be a counterpart processing the data:
struct PathModel{T, F} <: AbstractMillModel
    m::T
    path2mill::F
end
Flux.@layer :ignore PathModel
show(io::IO, ::PathModel) = print(io, "PathModel")
NodeType(::Type{<:PathModel}) = LeafNode()
path2mill(ds::PathNode) = path2mill(ds.data)
path2mill(ss::Vector{<:AbstractString}) = reduce(catobs, map(path2mill, ss))
function path2mill(s::String)
    ss = splitpath(s)
    BagNode(ArrayNode(NGramMatrix(ss, 3)), AlignedBags([1:length(ss)]))
end
(m::PathModel)(x::PathNode) = m.m(m.path2mill(x))
function reflectinmodel(ds::PathNode, args...)
    pm = reflectinmodel(path2mill(ds), args...)
    PathModel(pm, path2mill)
end
Example of usage:
julia> ds = PathNode(["/etc/passwd", "/home/tonda/.bashrc"])
PathNode (2 obs) 2 obs
julia> pm = reflectinmodel(ds, d -> Dense(d, 3))
PathModel 6 arrays, 6_192 params, 24.438 KiB
julia> pm(ds)
3×2 Matrix{Float32}:
 0.258106  0.307619
 0.298125  0.404265
 0.669369  0.869995