Custom nodes

Mill.jl data nodes are lightweight wrappers around data, such as Array, DataFrame, and others. It is, of course, possible to define custom data (and model) nodes. A useful abstraction for implementing custom data nodes, suitable for most cases, is LazyNode, which you can use to extend the functionality of Mill.

Unix path example

Let's define a custom node type for representing path names in Unix and one custom model type for processing it. LazyNode serves as boilerplate for simple extensions of the Mill ecosystem. We start by defining an example of such a node:

julia> ds = LazyNode{:Path}(["/var/lib/blob_files/myfile.blob"])
LazyNode{:Path, Vector{String}, Nothing}:
 "/var/lib/blob_files/myfile.blob"

An entirely new type is not needed, because we can dispatch on the first type parameter. Specifically, the :Path "tag" defines a special kind of LazyNode. Consequently, we can define multiple variations of custom LazyNode without any conflicts in dispatch.
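To illustrate dispatch on a type-parameter tag without depending on Mill, here is a minimal sketch in plain Julia. The `Tagged` struct and the `describe` function are hypothetical names introduced only for this illustration; they stand in for LazyNode and for any function extended per tag.

```julia
# Hypothetical stand-in for LazyNode: the first type parameter is a Symbol tag,
# the second is the type of the wrapped data.
struct Tagged{T, D}
    data::D
end

# Convenience constructor so that only the tag has to be written explicitly.
Tagged{T}(data::D) where {T, D} = Tagged{T, D}(data)

# One method per tag; Julia selects the method from the type parameter alone.
describe(::Tagged{:Path}) = "a path node"
describe(::Tagged{:URL}) = "a URL node"
describe(::Tagged) = "a node with an unknown tag"   # fallback for other tags

p = Tagged{:Path}(["/var/lib/python"])
u = Tagged{:URL}(["https://example.org"])
o = Tagged{:Other}([1, 2, 3])

println(describe(p))
println(describe(u))
println(describe(o))
```

Because every tag yields a distinct concrete type, methods written for one tag never collide with methods written for another.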

As a next step, we extend the Mill.unpack2mill function, which always takes one LazyNode and produces an arbitrary Mill structure. We will represent individual file and directory names (as obtained by splitpath) using an NGramMatrix representation and, for simplicity, the whole path as a bag of individual names:

function Mill.unpack2mill(ds::LazyNode{:Path})
    ss = splitpath.(ds.data)
    x = NGramMatrix(reduce(vcat, ss), 3)
    BagNode(ArrayNode(x), Mill.length2bags(length.(ss)))
end
julia> Mill.unpack2mill(ds)
BagNode  # 1 obs, 104 bytes
  ╰── ArrayNode(2053×5 NGramMatrix with Int64 elements)  # 5 obs, 212 bytes
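The bookkeeping behind this output can be sketched with Base Julia alone. The snippet below assumes a Unix-like system (splitpath is platform dependent) and only mimics how per-path lengths translate into contiguous index ranges over the flattened list of names; the variable names are illustrative, and Mill's own bag representation may differ in detail.

```julia
paths = ["/var/lib/blob_files/myfile.blob", "/var/lib/python"]

# Split each path into its components; the root "/" counts as a component.
parts = splitpath.(paths)
lens = length.(parts)

# Flatten all components into one vector, as done before building the NGramMatrix.
flat = reduce(vcat, parts)

# Turn the lengths into contiguous ranges: one range (bag) per original path.
stops = cumsum(lens)
starts = stops .- lens .+ 1
ranges = [a:b for (a, b) in zip(starts, stops)]

println(lens)
println(ranges)
println(flat[ranges[2]])
```

The first path splits into five names, which matches the five instances reported for the inner ArrayNode above.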

Also, note that the node keeps an array of strings instead of just one string. This is because we want our node to be able to hold more than one observation. Methods such as catobs work as expected:

ds1 = LazyNode{:Path}(["/var/lib/blob_files/myfile.blob"])
ds2 = LazyNode{:Path}(["/var/lib/python"])
julia> ds = catobs(ds1, ds2)
LazyNode{:Path, Vector{String}, Nothing}:
 "/var/lib/blob_files/myfile.blob"
 "/var/lib/python"

The Mill.unpack2mill function is called lazily during inference by the LazyModel counterpart.

Model reflection works too:

julia> pm = reflectinmodel(ds, d -> Dense(d, 3))
LazyModel{Path}
  ╰── BagModel ↦ BagCount([SegmentedMean(3); SegmentedMax(3)]) ↦ Dense(7 => 3)  # 4 arrays, 30 params, 280 bytes
        ╰── ArrayModel(Dense(2053 => 3))  # 2 arrays, 6_162 params, 24.148 KiB

We can use the obtained model to perform inference as we would do with any other model.

julia> pm(ds)
3×2 Matrix{Float32}:
 -1.09527   -0.984371
  1.29677    1.15081
  0.464608   0.439383

Adding custom nodes without LazyNode

The solution using LazyNode is sufficient in most scenarios. For other cases, it is recommended to equip custom nodes with the following functionality:

  • allow nesting (if needed)
  • implement Mill.subset and optionally Base.getindex to obtain subsets of observations. Mill already defines Mill.subset for common datatypes, which can be reused.
  • allow concatenation of nodes with catobs. Optionally, implement reduce(catobs, ...) as well to avoid excessive compilation when the number of arguments varies a lot.
  • define a specialized method for MLUtils.numobs, which can be imported directly from Mill.
  • register the custom node with HierarchicalUtils.jl to obtain pretty printing, iterators and other functionality

Here is an example of a custom node with the same functionality as in the Unix path example section:

using Mill

import Base: getindex, show
import Mill: catobs, numobs, data, metadata, VecOrRange, AbstractMillNode, reflectinmodel
import Flux
import HierarchicalUtils: NodeType, LeafNode

struct PathNode{S <: AbstractString, C} <: AbstractMillNode
    data::Vector{S}
    metadata::C
end

PathNode(data::Vector{S}) where {S <: AbstractString} = PathNode(data, nothing)
Base.show(io::IO, n::PathNode) = print(io, "PathNode ($(numobs(n)) obs)")

Base.ndims(n::PathNode) = Colon()
numobs(n::PathNode) = length(n.data)
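# Varargs method: concatenate the data and metadata of all given nodes.
catobs(ns::PathNode...) = PathNode(vcat(data.(ns)...), catobs(metadata.(ns)...))
# Select observations by index; Mill.subset is qualified because it is not in the import list.
Base.getindex(n::PathNode, i::VecOrRange{<:Int}) = PathNode(Mill.subset(data(n), i),
                                                            Mill.subset(metadata(n), i))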
NodeType(::Type{<:PathNode}) = LeafNode()

We also have to define a corresponding model node type, which serves as the counterpart that processes the data:

struct PathModel{T, F} <: AbstractMillModel
    m::T
    path2mill::F
end

Flux.@functor PathModel
show(io::IO, n::PathModel) = print(io, "PathModel")
NodeType(::Type{<:PathModel}) = LeafNode()

path2mill(ds::PathNode) = path2mill(ds.data)
path2mill(ss::Vector{<:AbstractString}) = reduce(catobs, map(path2mill, ss))
function path2mill(s::String)
    ss = splitpath(s)
    BagNode(ArrayNode(NGramMatrix(ss, 3)), AlignedBags([1:length(ss)]))
end

(m::PathModel)(x::PathNode) = m.m(m.path2mill(x))

function reflectinmodel(ds::PathNode, args...)
    pm = reflectinmodel(path2mill(ds), args...)
    PathModel(pm, path2mill)
end

Example of usage:

julia> ds = PathNode(["/etc/passwd", "/home/tonda/.bashrc"])
PathNode (2 obs)  # 2 obs, 110 bytes

julia> pm = reflectinmodel(ds, d -> Dense(d, 3))
PathModel # 6 arrays, 6_192 params, 24.422 KiB

julia> pm(ds)
3×2 Matrix{Float32}:
 0.258106  0.307619
 0.298125  0.404265
 0.669369  0.869995