# Missing data
One detail that was left out so far is how `Mill.jl` handles incomplete or missing data. This phenomenon is nowadays ubiquitous in many data sources and occurs due to:
- the high cost of obtaining (a part of) an observation
- information being unreachable due to privacy reasons
- a gradual change in the definition of data being gathered
- a faulty collection process
and many other possible reasons. At the same time, it is wasteful to throw away incomplete observations altogether. Thanks to the hierarchical structure of both samples and models, we can still process samples with missing information fragments at various levels of abstraction. Problems of this type fall into three (not necessarily disjoint) categories:
- missing parts of raw data in a leaf `ArrayNode`
- empty bags with no instances in a `BagNode`
- an entire key missing in a `ProductNode`
At the moment, `Mill.jl` is capable of handling the first two cases. The solution always involves an additional vector of parameters (always denoted by `ψ`) that is used during model evaluation to substitute the missing values. Parameters `ψ` can be either fixed or learned during training. Everything is done automatically.
## Empty bags
It may happen that some bags in the dataset are empty by definition, or that no associated instances were obtained during data collection. Recall that an empty bag is specified as the empty range `0:-1` in the case of `AlignedBags`, and as an empty vector `Int[]` when `ScatteredBags` are used:
```julia
julia> empty_bags_1 = AlignedBags([1:2, 0:-1, 3:5, 0:-1])
AlignedBags{Int64}(UnitRange{Int64}[1:2, 0:-1, 3:5, 0:-1])

julia> empty_bags_2 = ScatteredBags([[1, 2], Int[], [3, 4, 5], Int[]])
ScatteredBags{Int64}([[1, 2], Int64[], [3, 4, 5], Int64[]])
```
To obtain the vector representation for a bag, be it for directly predicting some value or for representing some higher-level structure, we need to deal with these empty bags. This is done in Bag aggregation: each `AbstractAggregation` operator carries a vector of parameters `ψ`, initialized to zeros upon creation:
```julia
julia> a = SegmentedSumMax(2)
AggregationStack:
 SegmentedSum(ψ = Float32[0.0, 0.0])
 SegmentedMax(ψ = Float32[0.0, 0.0])
```
When we evaluate any `BagModel`, these values are used to compute the output for empty bags instead of the aggregation itself. See the demo below:
```julia
julia> an = ArrayNode(randn(Float32, 2, 5))
2×5 ArrayNode{Matrix{Float32}, Nothing}:
  0.6467423  0.31947362  -0.92877233  1.0960424    1.049499
 -1.0114008  1.3934115    0.38794893  0.62901175  -2.0970821

julia> ds = BagNode(an, empty_bags_2)
BagNode  4 obs
  ╰── ArrayNode(2×5 Array with Float32 elements)  5 obs

julia> m = BagModel(identity, a, identity)
BagModel ↦ [SegmentedSum(2); SegmentedMax(2)] ↦ identity  2 arrays, 4 params ( ⋯
  ╰── ArrayModel(identity)

julia> m(ds)
4×4 Matrix{Float32}:
 0.966216  0.0   1.21677   0.0
 0.382011  0.0  -1.08012   0.0
 0.646742  0.0   1.09604   0.0
 1.39341   0.0   0.629012  0.0
```
Vector `ψ` is learnable, and therefore after training it will contain a suitable representation of an empty bag for the given problem.
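To verify that these parameters indeed receive gradients, we can differentiate a scalar function of the model output. The following is a minimal sketch, assuming `Flux.jl` is loaded and `m` and `ds` are defined as above:

```julia
using Flux

# The ψ vectors inside both aggregation operators receive nonzero gradients
# originating from the two empty bags in ds, so an optimizer can learn
# suitable empty-bag defaults.
grads = gradient(m -> sum(m(ds)), m)
```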
When a `BagNode` is entirely empty, it can be constructed with `missing` instead of a matrix wrapped in an `ArrayNode`:
```julia
julia> bn1 = BagNode(ArrayNode(rand(3, 4)), [1:4])
BagNode  1 obs
  ╰── ArrayNode(3×4 Array with Float64 elements)  4 obs

julia> bn2 = BagNode(missing, [0:-1])
BagNode  1 obs
  ╰── ∅
```
and everything will work as expected. For example, we can concatenate these two:
```julia
julia> x = catobs(bn1, bn2)
BagNode  2 obs
  ╰── ArrayNode(3×4 Array with Float64 elements)  4 obs
```
Notice that the resulting `ArrayNode` still has the same dimensions as the `ArrayNode` inside `bn1`. The emptiness of `bn2` is stored in `bags`:
```julia
julia> x.bags
AlignedBags{Int64}(UnitRange{Int64}[1:4, 0:-1])
```
Indexing again works as expected; the second element of `x` is the empty `BagNode`, not `bn1`:
```julia
julia> bn1 == x[2]
false
```
Even though this approach of using `missing` for the `data` field in `BagNode`s is the most accurate from the semantic point of view, it may cause excessive compilation, since the types will differ. Therefore, if this happens in multiple places in the sample tree, it may be better to use an empty matrix instead, for type consistency:
```julia
julia> BagNode(ArrayNode(zeros(3, 0)), [0:-1])
BagNode  1 obs
  ╰── ArrayNode(3×0 Array with Float64 elements)  0 obs
```
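Such a node behaves like any other. For instance, a model derived from it evaluates without issues, representing the single empty bag purely by the `ψ` defaults. A minimal sketch, assuming the default `reflectinmodel` settings:

```julia
# Build a model from the empty-bag sample and evaluate it; the output has
# one column, the (learnable) representation of the empty bag.
ds_empty = BagNode(ArrayNode(zeros(Float32, 3, 0)), [0:-1])
bm = reflectinmodel(ds_empty)
bm(ds_empty)
```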
How indexing behaves with respect to this issue depends on a global switch (off by default), which can be changed with the `Mill.emptyismissing!` function:
```julia
julia> a = BagNode(ArrayNode(rand(3, 2)), [1:2, 0:-1, 0:-1])
BagNode  3 obs
  ╰── ArrayNode(3×2 Array with Float64 elements)  2 obs

julia> a[2:3] |> Mill.data
3×0 ArrayNode{Matrix{Float64}, Nothing}

julia> Mill.emptyismissing!(true)

julia> a[2:3] |> Mill.data
missing
```
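Note that `Mill.emptyismissing!` flips a global switch and thus affects all subsequent indexing. To restore the default behavior afterwards, simply set it back:

```julia
julia> Mill.emptyismissing!(false)
```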
## Post imputing
Storing missing strings in an `NGramMatrix` is straightforward:
```julia
julia> missing_ngrams = NGramMatrix(["foo", missing, "bar"], 3, 256, 5)
5×3 NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}:
 "foo"
 missing
 "bar"
```
When some values of categorical variables are missing, `Mill.jl` defines a new type for their representation:
```julia
julia> missing_categorical = maybehotbatch([missing, 2, missing], 1:5)
5×3 MaybeHotMatrix with eltype Union{Missing, Bool}:
 missing  ⋅  missing
 missing  1  missing
 missing  ⋅  missing
 missing  ⋅  missing
 missing  ⋅  missing
```
`MaybeHotMatrix` behaves similarly to `OneHotMatrix` from `Flux.jl`, but it additionally supports `missing` values. When no values are `missing`, it looks exactly like `OneHotMatrix`:
```julia
julia> maybehotbatch([5, 2, 1], 1:5)
5×3 MaybeHotMatrix with eltype Bool:
 ⋅  ⋅  1
 ⋅  1  ⋅
 ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅
 1  ⋅  ⋅
```
`MaybeHotMatrix` behaves like an `AbstractMatrix` and again supports multiplication from the left:
```julia
julia> missing_categorical::AbstractMatrix{Union{Bool, Missing}}
5×3 MaybeHotMatrix with eltype Union{Missing, Bool}:
 missing  ⋅  missing
 missing  1  missing
 missing  ⋅  missing
 missing  ⋅  missing
 missing  ⋅  missing
```
However, multiplying these matrices with `missing` data leads to `missing` values in the output:
```julia
julia> W = rand(2, 5)
2×5 Matrix{Float64}:
 0.880801  0.52829   0.567866  0.657427  0.839293
 0.45467   0.255719  0.657579  0.617     0.533037

julia> W * missing_ngrams
2×3 Matrix{Union{Missing, Float64}}:
 3.69705  missing  3.69512
 2.35568  missing  2.8359

julia> W * missing_categorical
2×3 Matrix{Union{Missing, Float64}}:
 missing  0.52829   missing
 missing  0.255719  missing
```
Consequently, gradients cannot be computed and no model can be trained. A `Flux.jl` `gradient` call attempted on a `missing` result returns an error like `Output should be scalar; gradients are not defined for output missing`. Similarly to `NaN`s appearing in a model, this signifies that some `missing` input is not treated anywhere in the model and propagates all the way up. Generally speaking, it is recommended to deal with missing values as soon as possible (at the leaf level), so that they do not propagate and cause type instabilities.
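This failure mode is easy to reproduce. A short sketch, assuming `Flux.jl` is loaded and `W` and `missing_ngrams` are defined as above:

```julia
using Flux

# sum over a matrix containing missing yields missing, so the "loss" is not
# a real scalar and the gradient call throws the error quoted above.
gradient(W -> sum(W * missing_ngrams), W)
```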
`PostImputingMatrix` is a solution for this. It can be constructed as follows:
```julia
julia> A = PostImputingMatrix(W)
2×5 PostImputingMatrix{Float64, Matrix{Float64}, Vector{Float64}}:
W:
 0.880801  0.52829   0.567866  0.657427  0.839293
 0.45467   0.255719  0.657579  0.617     0.533037

ψ:
 0.0
 0.0
```
Matrix `W` is stored inside, and on top of that, `A` carries one vector of parameters `ψ` of length `size(W, 1)`. Suddenly, multiplication automagically works:
```julia
julia> A * missing_ngrams
2×3 Matrix{Float64}:
 3.69705  0.0  3.69512
 2.35568  0.0  2.8359

julia> A * missing_categorical
2×3 Matrix{Float64}:
 0.0  0.52829   0.0
 0.0  0.255719  0.0
```
What happens under the hood is that whenever `A` encounters a `missing` column in the matrix, it fills in values from `ψ` after the multiplication is performed (effectively replacing all `missing` values in the result of multiplying with `W`, but implemented more efficiently). Vector `ψ` can be learned during training as well, and everything works out of the box.
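For instance, gradients flow into `ψ` through the imputed columns. A minimal sketch, assuming `Flux.jl` is loaded and `A` and `missing_categorical` are defined as above:

```julia
using Flux

# The two missing columns of missing_categorical are imputed with ψ, so ψ
# receives nonzero gradients, while W is updated from the observed column.
grads = gradient(A -> sum(A * missing_categorical), A)
```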
## Pre imputing
If we have to deal with inputs where some elements of the input matrix are `missing`:
```julia
julia> X = [missing 1 2; 3 missing missing]
2×3 Matrix{Union{Missing, Int64}}:
  missing  1         2
 3          missing   missing
```
we can make use of a `PreImputingMatrix`:
```julia
julia> W = rand(1:2, 3, 2)
3×2 Matrix{Int64}:
 1  1
 1  2
 2  2

julia> A = PreImputingMatrix(W)
3×2 PreImputingMatrix{Int64, Matrix{Int64}, Vector{Int64}}:
W:
 1  1
 1  2
 2  2

ψ:
 0  0
```
As opposed to `PostImputingMatrix`, `A` now stores a vector of values `ψ` of length `size(W, 2)`. When we use it for multiplication:
```julia
julia> A * X
3×3 Matrix{Int64}:
 3  1  2
 6  1  2
 6  2  4
```
what happens is that whenever we compute the dot product of a row of `A` and a column of `X`, we first fill in values from `ψ` into the `missing` elements of the column before the multiplication is performed. Again, it is possible to compute gradients with respect to all three of `W`, `ψ` and `X`, and therefore to learn appropriate default values in `ψ` from the data:
```julia
julia> gradient((A, X) -> sum(A * X), A, X)
((W = [3 3; 3 3; 3 3], ψ = [4, 10]), Real[0 4.0 4.0; 5.0 0 0])
```
## Model reflection with missing values
Model reflection takes `missing` values and types into account and creates appropriate (sub)models to handle them:
```julia
julia> ds = ProductNode((missing_ngrams, missing_categorical, X))
ProductNode  3 obs
  ├── ArrayNode(5×3 NGramMatrix with Union{Missing, Int64} elements)  3 obs
  ├── ArrayNode(5×3 MaybeHotMatrix with Union{Missing, Bool} elements)  3 obs
  ╰── ArrayNode(2×3 Array with Union{Missing, Int64} elements)  3 obs
```
```julia
julia> m = reflectinmodel(ds)
ProductModel ↦ Dense(30 => 10)  2 arrays, 310 params, 1.297 KiB
  ├── ArrayModel([postimputing]Dense(5 => 10))  3 arrays, 70 params, 408 bytes
  ├── ArrayModel([postimputing]Dense(5 => 10))  3 arrays, 70 params, 408 bytes
  ╰── ArrayModel([preimputing]Dense(2 => 10))  3 arrays, 32 params, 256 bytes
```
Here, `[preimputing]Dense` and `[postimputing]Dense` are standard dense layers with a special matrix inside:
```julia
julia> dense = m.ms[1].m; typeof(dense.weight)
PostImputingMatrix{Float32, Matrix{Float32}, Vector{Float32}}
```
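Imputing dense layers can also be constructed directly, without `reflectinmodel`. The sketch below assumes the `preimputing_dense` and `postimputing_dense` helpers exported by recent `Mill.jl` versions, which accept the same arguments as `Flux.Dense`:

```julia
# A Dense layer whose weight is a PreImputingMatrix (fills missing inputs).
d1 = preimputing_dense(2, 10)

# A Dense layer whose weight is a PostImputingMatrix (fills missing columns).
d2 = postimputing_dense(5, 10)
```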
Inside `Mill.jl`, we add a special `Base.show` definition for these types for compact printing.
The `reflectinmodel` method uses types to determine whether imputing is needed or not. Compare the following:
```julia
julia> reflectinmodel(ArrayNode(randn(Float32, 2, 3)))
ArrayModel(Dense(2 => 10))  2 arrays, 30 params, 208 bytes

julia> reflectinmodel(ArrayNode([1.0f0 2.0f0 missing; 4.0f0 missing missing]))
ArrayModel([preimputing]Dense(2 => 10))  3 arrays, 32 params, 256 bytes

julia> reflectinmodel(ArrayNode(Matrix{Union{Missing, Float32}}(randn(2, 3))))
ArrayModel([preimputing]Dense(2 => 10))  3 arrays, 32 params, 256 bytes
```
In the last case, an imputing model is returned even though there is no `missing` element in the matrix. Of course, the same applies to `MaybeHotVector`, `MaybeHotMatrix` and `NGramMatrix`. This way, we can signify that even though there are no missing values in the available sample, we expect them to appear in the future and want our model to be compatible. If this is hard to determine in advance, a safe bet is to make all leaf layers in the model imputing. Performance will not suffer, because imputing types are as fast as their non-imputing counterparts on data containing no `missing` values, and the only tradeoff is a slight increase in the number of parameters, some of which may never be used.
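One way to achieve this is sketched below, assuming a recent `Mill.jl` version in which `reflectinmodel` accepts the `all_imputing` keyword argument (check the docstring in your installed version):

```julia
# Request imputing layers in all leaves, even where the reflected sample
# contains no missing values, so that future samples with missing data
# are handled without rebuilding the model.
reflectinmodel(ArrayNode(randn(Float32, 2, 3)); all_imputing=true)
```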