Missing data

One detail that was left out so far is how Mill.jl handles incomplete or missing data. This phenomenon is nowadays ubiquitous in many data sources and occurs due to:

  • the high cost of obtaining (a part of) an observation
  • information being unavailable for privacy reasons
  • a gradual change in the definition of data being gathered
  • a faulty collection process

and many other possible reasons. At the same time, it is wasteful to throw away incomplete observations altogether. Thanks to the hierarchical structure of both samples and models, we can still process samples with information fragments missing at various levels of abstraction. Problems of this type can be categorized into three (not necessarily disjoint) types:

  1. Missing parts of raw data in a leaf ArrayNode
  2. Empty bags with no instances in a BagNode
  3. An entire key missing in a ProductNode

At the moment, Mill is capable of handling the first two cases. The solution always involves an additional vector of parameters (always denoted ψ) that is used during model evaluation to substitute the missing values. The parameters ψ can be either fixed or learned during training. Everything is done automatically.

Empty bags

It may happen that some bags in the dataset are empty by definition, or that no associated instances were obtained during data collection. Recall that an empty bag is specified as the empty range 0:-1 in the case of AlignedBags, and as an empty vector Int[] when ScatteredBags are used:

julia> empty_bags_1 = AlignedBags([1:2, 0:-1, 3:5, 0:-1])
AlignedBags{Int64}(UnitRange{Int64}[1:2, 0:-1, 3:5, 0:-1])

julia> empty_bags_2 = ScatteredBags([[1, 2], Int[], [3, 4, 5], Int[]])
ScatteredBags{Int64}([[1, 2], Int64[], [3, 4, 5], Int64[]])

To obtain the vector representation of a bag, be it for directly predicting some value or for representing some higher-level structure, we need to deal with these empty bags. This is done during bag aggregation. Each AbstractAggregation operator carries a vector of parameters ψ, initialized to zeros upon creation:

julia> a = SegmentedSumMax(2)
AggregationStack:
 SegmentedSum(ψ = Float32[0.0, 0.0])
 SegmentedMax(ψ = Float32[0.0, 0.0])

When we evaluate any BagModel, these values are used to compute output for empty bags instead of the aggregation itself. See the demo below:

julia> an = ArrayNode(randn(Float32, 2, 5))
2×5 ArrayNode{Matrix{Float32}, Nothing}:
  0.6467423  0.31947362  -0.92877233  1.0960424    1.049499
 -1.0114008  1.3934115    0.38794893  0.62901175  -2.0970821

julia> ds = BagNode(an, empty_bags_2)
BagNode  # 4 obs, 296 bytes
  ╰── ArrayNode(2×5 Array with Float32 elements)  # 5 obs, 88 bytes

julia> m = BagModel(identity, a, identity)
BagModel ↦ [SegmentedSum(2); SegmentedMax(2)] ↦ identity  # 2 arrays, 4 params (all zero), 96 bytes
  ╰── ArrayModel(identity)

julia> m(ds)
4×4 Matrix{Float32}:
 0.966216  0.0   1.21677   0.0
 0.382011  0.0  -1.08012   0.0
 0.646742  0.0   1.09604   0.0
 1.39341   0.0   0.629012  0.0

The vector ψ is learnable, and therefore after training it will contain a suitable representation of an empty bag for the given problem.
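
As a quick sanity check (a minimal sketch only; it assumes the model m and sample ds from the demo above and uses Flux's gradient, and the concrete numbers depend on random initialization), one can verify that ψ receives a non-zero gradient whenever a batch contains empty bags:

using Flux

# gradient of a dummy scalar loss with respect to the whole model;
# the ψ vectors inside the aggregation operators receive gradient
# contributions from the two empty bags, so they get updated during
# training just like any other parameter
grads = gradient(m -> sum(m(ds)), m)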

When a BagNode is entirely empty, it can be constructed with missing instead of a matrix wrapped in an ArrayNode:

julia> bn1 = BagNode(ArrayNode(rand(3, 4)), [1:4])
BagNode  # 1 obs, 80 bytes
  ╰── ArrayNode(3×4 Array with Float64 elements)  # 4 obs, 144 bytes
julia> bn2 = BagNode(missing, [0:-1])
BagNode  # 1 obs, 72 bytes
  ╰── ∅

and everything will work as expected. For example, we can concatenate these two:

julia> x = catobs(bn1, bn2)
BagNode  # 2 obs, 96 bytes
  ╰── ArrayNode(3×4 Array with Float64 elements)  # 4 obs, 144 bytes

Notice that the resulting ArrayNode still has the same dimensions as the ArrayNode inside bn1. The emptiness of bn2 is stored in the bags:

julia> x.bags
AlignedBags{Int64}(UnitRange{Int64}[1:4, 0:-1])

Individual BagNodes can be obtained again by indexing. The second element of x is, of course, different from bn1:

julia> bn1 == x[2]
false

Even though this approach of using missing for the data field in BagNodes is the most accurate from the semantic point of view, it may cause excessive compilation, as the types will differ. Therefore, if this happens in multiple places in the sample tree, it may be better to use an empty matrix instead for type consistency:

julia> BagNode(ArrayNode(zeros(3, 0)), [0:-1])
BagNode  # 1 obs, 80 bytes
  ╰── ArrayNode(3×0 Array with Float64 elements)  # 0 obs, 48 bytes

How indexing behaves in this respect depends on a global switch (off by default), which can be changed with the Mill.emptyismissing! function:

julia> a = BagNode(ArrayNode(rand(3, 2)), [1:2, 0:-1, 0:-1])
BagNode  # 3 obs, 112 bytes
  ╰── ArrayNode(3×2 Array with Float64 elements)  # 2 obs, 96 bytes

julia> a[2:3] |> Mill.data
3×0 ArrayNode{Matrix{Float64}, Nothing}

julia> Mill.emptyismissing!(true)

julia> a[2:3] |> Mill.data
missing
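
The default behaviour can be restored by calling the same function with false:

julia> Mill.emptyismissing!(false)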

Post imputing

Storing missing strings in NGramMatrix is straightforward:

julia> missing_ngrams = NGramMatrix(["foo", missing, "bar"], 3, 256, 5)
5×3 NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}:
 "foo"
 missing
 "bar"

When some values of categorical variables are missing, Mill defines a new type to represent them:

julia> missing_categorical = maybehotbatch([missing, 2, missing], 1:5)
5×3 MaybeHotMatrix with eltype Union{Missing, Bool}:
 missing    ⋅    missing
 missing   true  missing
 missing    ⋅    missing
 missing    ⋅    missing
 missing    ⋅    missing

MaybeHotMatrix behaves similarly to OneHotMatrix from Flux.jl, but it additionally supports missing values. In case no values are missing, it looks exactly like OneHotMatrix:

julia> maybehotbatch([5, 2, 1], 1:5)
5×3 MaybeHotMatrix with eltype Bool:
 ⋅  ⋅  1
 ⋅  1  ⋅
 ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅
 1  ⋅  ⋅

MaybeHotMatrix behaves like an AbstractMatrix and again supports multiplication from the left:

julia> missing_categorical::AbstractMatrix{Union{Bool, Missing}}
5×3 MaybeHotMatrix with eltype Union{Missing, Bool}:
 missing    ⋅    missing
 missing   true  missing
 missing    ⋅    missing
 missing    ⋅    missing
 missing    ⋅    missing

However, multiplying these matrices with missing data leads to missing values in the output:

julia> W = rand(2, 5)
2×5 Matrix{Float64}:
 0.880801  0.52829   0.567866  0.657427  0.839293
 0.45467   0.255719  0.657579  0.617     0.533037
julia> W * missing_ngrams
2×3 Matrix{Union{Missing, Float64}}:
 3.69705  missing  3.69512
 2.35568  missing  2.8359
julia> W * missing_categorical
2×3 Matrix{Union{Missing, Float64}}:
 missing  0.52829   missing
 missing  0.255719  missing

Consequently, gradients cannot be computed and no model can be trained.

Model debugging

A Flux gradient call returns an error like Output should be scalar; gradients are not defined for output missing when attempted on a missing result. Similarly to NaNs in a model, this signifies that some missing input is not handled anywhere in the model and propagates all the way up. Generally speaking, it is recommended to deal with missing values as soon as possible (at the leaf level) so that they do not propagate and cause type instabilities.
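
To see how such an error arises, consider the following toy example (hypothetical and not specific to Mill; the names W, x and the loss are made up for illustration):

using Flux

W = randn(Float32, 2, 2)
x = [missing, 1.0f0]             # one input feature is missing

W * x                            # every entry of the product is missing
# gradient(w -> sum(w * x), W) would therefore fail with the error
# quoted above, since the "loss" sum(w * x) evaluates to missing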

PostImputingMatrix is a solution for this. It can be constructed as follows:

julia> A = PostImputingMatrix(W)
2×5 PostImputingMatrix{Float64, Matrix{Float64}, Vector{Float64}}:
W:
 0.880801  0.52829   0.567866  0.657427  0.839293
 0.45467   0.255719  0.657579  0.617     0.533037

ψ:
 0.0
 0.0

The matrix W is stored inside, and on top of that, A carries one vector of parameters ψ of length size(W, 1). Suddenly, multiplication works automagically:

julia> A * missing_ngrams
2×3 Matrix{Float64}:
 3.69705  0.0  3.69512
 2.35568  0.0  2.8359
julia> A * missing_categorical
2×3 Matrix{Float64}:
 0.0  0.52829   0.0
 0.0  0.255719  0.0

What happens under the hood is that whenever A encounters a missing column in the matrix, it fills in values from ψ after the multiplication is performed (effectively replacing all missing values in the result of multiplying with W, but implemented more efficiently). The vector ψ can be learned during training as well, and everything works out of the box.
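
For instance (a small sketch, assuming the imputing vector is accessible as the ψ field shown in the printout above), setting ψ by hand makes its effect visible in the columns that correspond to missing inputs:

A.ψ .= [1.0, -1.0]       # pretend these values were learned during training

A * missing_categorical  # the first and third columns (missing inputs) now
                         # contain [1.0, -1.0] instead of zeros, while the
                         # second column is unchanged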

Pre imputing

If we have to deal with inputs where some elements of the input matrix are missing:

julia> X = [missing 1 2; 3 missing missing]
2×3 Matrix{Union{Missing, Int64}}:
  missing  1         2
 3          missing   missing

we can make use of PreImputingMatrix:

julia> W = rand(1:2, 3, 2)
3×2 Matrix{Int64}:
 1  1
 1  2
 2  2
julia> A = PreImputingMatrix(W)
3×2 PreImputingMatrix{Int64, Matrix{Int64}, Vector{Int64}}:
W:
 1  1
 1  2
 2  2

ψ:
 0  0

As opposed to PostImputingMatrix, A now stores a vector of values ψ with length size(W, 2). When we use it for multiplication:

julia> A * X
3×3 Matrix{Int64}:
 3  1  2
 6  1  2
 6  2  4

what happens is that before the dot product of a row of A and a column of X is computed, the missing values in the column are first filled in from ψ. Again, it is possible to compute gradients with respect to all three of W, ψ and X, and therefore learn appropriate default values in ψ from the data:

julia> gradient((A, X) -> sum(A * X), A, X)
((W = [3 3; 3 3; 3 3], ψ = [4, 10]), Real[0 4.0 4.0; 5.0 0 0])

Model reflection with missing values

Model reflection takes missing values and types into account and creates appropriate (sub)models to handle them:

julia> ds = ProductNode((missing_ngrams, missing_categorical, X))
ProductNode  # 3 obs, 56 bytes
  ├── ArrayNode(5×3 NGramMatrix with Union{Missing, Int64} elements)  # 3 obs, 150 bytes
  ├── ArrayNode(5×3 MaybeHotMatrix with Union{Missing, Bool} elements)  # 3 obs, 87 bytes
  ╰── ArrayNode(2×3 Array with Union{Missing, Int64} elements)  # 3 obs, 102 bytes
julia> m = reflectinmodel(ds)
ProductModel ↦ Dense(30 => 10)  # 2 arrays, 310 params, 1.289 KiB
  ├── ArrayModel([postimputing]Dense(5 => 10))  # 3 arrays, 70 params, 400 bytes
  ├── ArrayModel([postimputing]Dense(5 => 10))  # 3 arrays, 70 params, 400 bytes
  ╰── ArrayModel([preimputing]Dense(2 => 10))  # 3 arrays, 32 params, 248 bytes

Here, [preimputing]Dense and [postimputing]Dense are standard dense layers with a special matrix inside:

julia> dense = m.ms[1].m; typeof(dense.weight)
PostImputingMatrix{Float32, Matrix{Float32}, Vector{Float32}}

Inside Mill, we add a special Base.show definition for these types for compact printing.

The reflectinmodel method uses types to determine whether imputing is needed or not. Compare the following:

julia> reflectinmodel(ArrayNode(randn(Float32, 2, 3)))
ArrayModel(Dense(2 => 10))  # 2 arrays, 30 params, 200 bytes

julia> reflectinmodel(ArrayNode([1.0f0 2.0f0 missing; 4.0f0 missing missing]))
ArrayModel([preimputing]Dense(2 => 10))  # 3 arrays, 32 params, 248 bytes

julia> reflectinmodel(ArrayNode(Matrix{Union{Missing, Float32}}(randn(2, 3))))
ArrayModel([preimputing]Dense(2 => 10))  # 3 arrays, 32 params, 248 bytes

In the last case, the imputing type is returned even though there is no missing element in the matrix. Of course, the same applies to MaybeHotVector, MaybeHotMatrix and NGramMatrix. This way, we can signify that even though there are no missing values in the available sample, we expect them to appear in the future and want our model to be compatible. If it is hard to determine this in advance, a safe bet is to make all leaves in the model imputing. Performance will not suffer, because the imputing types are as fast as their non-imputing counterparts on data without missing values, and the only tradeoff is a slight increase in the number of parameters, some of which may never be used.
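
For instance, building on the behaviour shown above, a leaf can be forced to be imputing simply by widening the element type of its data before reflection, even when the sample at hand contains no missing values (a sketch with made-up data):

x = randn(Float32, 2, 3)                                  # fully observed data

# the widened eltype makes reflectinmodel insert a [preimputing]Dense leaf
reflectinmodel(ArrayNode(Matrix{Union{Missing, Float32}}(x)))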