Bag aggregation

Aggregation operators in Mill.jl are all subtypes of AbstractAggregation. These structures map the vector representations of multiple instances into a single vector. They all operate element-wise and independently in each dimension, so the output has the same size as the input representations, unless a concatenation of multiple operators is used.

Some setup:

julia> d = 2
2

julia> X = Float32.([1 2 3 4; 8 7 6 5])
2×4 Matrix{Float32}:
 1.0  2.0  3.0  4.0
 8.0  7.0  6.0  5.0

julia> bags = AlignedBags([1:1, 2:3, 4:4])
AlignedBags{Int64}(UnitRange{Int64}[1:1, 2:3, 4:4])

Different operators, or combinations of them, are suitable for different problems. Nevertheless, because the input is interpreted as an unordered bag of instances, every operator is invariant to permutation of the instances, and the size of its output does not grow with the size of the bag.
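Both properties can be checked in plain Julia without Mill. The following is a minimal sketch that assumes a bag is simply a matrix with one instance per column; aggregate_max is a hypothetical helper used only for illustration, not part of the Mill API.

```julia
# Hypothetical helper: element-wise max over the columns (instances) of one bag.
aggregate_max(bag) = vec(maximum(bag; dims=2))

bag      = Float32[2 3; 7 6]          # two instances with two features each
permuted = bag[:, [2, 1]]             # the same instances in a different order
larger   = hcat(bag, Float32[0, 0])   # the same bag with one more instance

# Permutation invariance: the order of instances does not matter.
@assert aggregate_max(bag) == aggregate_max(permuted)

# The output length depends on the number of features, not on the bag size.
@assert length(aggregate_max(bag)) == length(aggregate_max(larger)) == 2
```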

Non-parametric aggregation

Max aggregation

SegmentedMax implements a simple max and is the most straightforward operator. In one dimension, it is defined as follows:

\[a_{\max}(\{x_1, \ldots, x_k\}) = \max_{i = 1, \ldots, k} x_i\]

where $\{x_1, \ldots, x_k\}$ are all instances of the given bag. In Mill, the operator is constructed this way:

julia> a_max = SegmentedMax(d)
SegmentedMax(ψ = Float32[0.0, 0.0])

The application is straightforward:

julia> a_max(X, bags)
2×3 Matrix{Float32}:
 1.0  3.0  4.0
 8.0  7.0  5.0

Since we have three bags, we have three columns in the output, each storing the maximal element over all instances of the given bag.

Mean aggregation

SegmentedMean implements the mean function, defined as:

\[a_{\operatorname{mean}}(\{x_1, \ldots, x_k\}) = \frac{1}{k} \sum_{i = 1}^{k} x_i\]

and used the same way:

julia> a_mean = SegmentedMean(d)
SegmentedMean(ψ = Float32[0.0, 0.0])

julia> a_mean(X, bags)
2×3 Matrix{Float32}:
 1.0  2.5  4.0
 8.0  6.5  5.0

Sufficiency of the mean operator

In theory, mean aggregation is sufficient for approximation, as proven in [6], but in practice, a combination of multiple operators performs better.

The max aggregation is suitable for cases when one instance in the bag may give evidence strong enough to predict the label. On the other side of the spectrum lies the mean aggregation function, which is good at detecting trends that are identifiable globally over the whole bag.

Sum aggregation

The last non-parametric operator is SegmentedSum, defined as:

\[a_{\operatorname{sum}}(\{x_1, \ldots, x_k\}) = \sum_{i = 1}^{k} x_i\]

and used the same way:

julia> a_sum = SegmentedSum(d)
SegmentedSum(ψ = Float32[0.0, 0.0])

julia> a_sum(X, bags)
2×3 Matrix{Float32}:
 1.0   5.0  4.0
 8.0  13.0  5.0
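To make the three definitions concrete, the outputs above can be reproduced in plain Julia. This is only an illustrative sketch, not how Mill computes them: aggregate is a hypothetical helper, and a plain vector of ranges stands in for AlignedBags.

```julia
using Statistics  # provides mean

X = Float32[1 2 3 4; 8 7 6 5]
ranges = [1:1, 2:3, 4:4]   # plain ranges standing in for AlignedBags

# Hypothetical helper: apply `f` to each bag along the instance dimension
# and place the per-bag results side by side, one column per bag.
aggregate(f, X, ranges) = reduce(hcat, [f(X[:, r]; dims=2) for r in ranges])

out_max  = aggregate(maximum, X, ranges)  # same values as a_max(X, bags)
out_mean = aggregate(mean, X, ranges)     # same values as a_mean(X, bags)
out_sum  = aggregate(sum, X, ranges)      # same values as a_sum(X, bags)
```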

Parametric aggregation

Whereas non-parametric aggregations do not use any parameters, parametric aggregations represent an entire class of functions parametrized by one or more real vectors of parameters, which can even be learned during training.

Log-sum-exp (LSE) aggregation

SegmentedLSE (log-sum-exp) aggregation ([8]) is parametrized by a vector of positive numbers $\bm{r} \in (\mathbb{R}^+)^d$ that specifies one real parameter for the computation in each output dimension:

\[a_{\operatorname{lse}}(\{x_1, \ldots, x_k\}; r) = \frac{1}{r}\log \left(\frac{1}{k} \sum_{i = 1}^{k} \exp({r\cdot x_i})\right)\]

With different values of $r$, LSE behaves differently, and in fact both the max and mean operators are limiting cases of LSE. If $r$ is very small, the output approaches the simple mean; if $r$ is a large number, LSE becomes a smooth approximation of the max function. Naively implementing the definition above may lead to numerical instabilities. The Mill implementation, however, is numerically stable.

julia> a_lse = SegmentedLSE(d)
SegmentedLSE(ψ = Float32[0.0, 0.0], ρ = Float32[0.249335, -0.0607043])

julia> a_lse(X, bags)
2×3 Matrix{Float32}:
 1.0  2.60039  4.0
 8.0  6.58143  5.0
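The limiting behaviour and the instability of the naive formula can be illustrated in plain Julia for a single dimension. The sketch below is an assumption-laden illustration, not Mill's implementation: lse_naive and lse_stable are hypothetical names, and the stable variant uses the common trick of subtracting the largest exponent before exponentiating.

```julia
using Statistics  # provides mean

# Direct transcription of the definition; exp overflows when r * x is large.
lse_naive(x, r) = log(mean(exp.(r .* x))) / r

# Hypothetical stable variant (for r > 0): shift by the largest exponent m,
# using log(mean(exp(z))) == m + log(mean(exp(z - m))).
function lse_stable(x, r)
    m = maximum(r .* x)
    return (m + log(mean(exp.(r .* x .- m)))) / r
end

x = [2.0, 3.0]
lse_stable(x, 1e-4)   # very small r: close to the mean, 2.5
lse_stable(x, 100.0)  # large r: close to the max, 3.0

big = [1000.0, 1001.0]
lse_naive(big, 1.0)   # Inf, because exp(1000.0) overflows Float64
lse_stable(big, 1.0)  # finite, between 1000 and 1001
```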

$p$-norm aggregation

The (normalized) $p$-norm operator ([9]) is parametrized by a vector of real numbers $\bm{p} \in (\mathbb{R}^+)^d$, where $\forall i \in \{1, \ldots, d\} \colon p_i \geq 1$, and another vector $\bm{c} \in (\mathbb{R}^+)^d$. It is computed with the formula:

\[a_{\operatorname{pnorm}}(\{x_1, \ldots, x_k\}; p, c) = \left(\frac{1}{k} \sum_{i = 1}^{k} \vert x_i - c \vert ^ {p} \right)^{\frac{1}{p}}\]

Again, the Mill implementation is stable.

julia> a_pnorm = SegmentedPNorm(d)
SegmentedPNorm(ψ = Float32[0.0, 0.0], ρ = Float32[0.937752, -1.37727], c = Float32[0.0, 0.0])

julia> a_pnorm(X, bags)
2×3 Matrix{Float32}:
 1.0  2.56238  4.0
 8.0  6.50433  5.0
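For intuition about how $p$ shapes the result, the one-dimensional formula can be transcribed directly into plain Julia. This is a naive sketch under the assumption of scalar p and c; pnorm is a hypothetical helper, not the Mill implementation, and it makes no attempt at numerical stability.

```julia
using Statistics  # provides mean

# Direct transcription of the one-dimensional definition.
pnorm(x, p, c) = mean(abs.(x .- c) .^ p) ^ (1 / p)

x = [2.0, 3.0]
pnorm(x, 1, 0.0)     # p = 1, c = 0: mean of absolute values, 2.5
pnorm(x, 2, 0.0)     # p = 2: root mean square, sqrt(6.5)
pnorm(x, 200, 0.0)   # large p: approaches the largest |x_i - c|, here 3.0
```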

Because all parameter constraints are included implicitly (field ρ in both types is a real number that undergoes appropriate transformation before being used), both parametric operators are easy to use and do not require any special treatment. Replacing the definition of aggregation operators while constructing a model (either manually or with reflectinmodel) is enough.

Concatenation

To use a concatenation of two or more operators, one can use the AggregationStack constructor:

julia> a = AggregationStack(a_mean, a_max)
AggregationStack:
 SegmentedMean(ψ = Float32[0.0, 0.0])
 SegmentedMax(ψ = Float32[0.0, 0.0])

julia> a(X, bags)
4×3 Matrix{Float32}:
 1.0  2.5  4.0
 8.0  6.5  5.0
 1.0  3.0  4.0
 8.0  7.0  5.0

For the most common combinations, Mill provides some convenience definitions:

julia> SegmentedMeanMax(d)
AggregationStack:
 SegmentedMean(ψ = Float32[0.0, 0.0])
 SegmentedMax(ψ = Float32[0.0, 0.0])

julia> SegmentedPNormLSE(d)
AggregationStack:
 SegmentedPNorm(ψ = Float32[0.0, 0.0], ρ = Float32[-2.59288, -0.492026], c = Float32[0.0, 0.0])
 SegmentedLSE(ψ = Float32[0.0, 0.0], ρ = Float32[1.66734, 1.67681])

Weighted aggregation

Sometimes, different instances in the bag are not equally important and contribute to the output to a different extent. For instance, this may come in handy when performing importance sampling over very large bags. SegmentedMean and SegmentedPNorm have definitions taking weights into account:

\[a_{\operatorname{mean}}(\{(x_i, w_i)\}_{i=1}^k) = \frac{1}{\sum_{i=1}^k w_i} \sum_{i = 1}^{k} w_i \cdot x_i\]

\[a_{\operatorname{pnorm}}(\{(x_i, w_i)\}_{i=1}^k; p, c) = \left(\frac{1}{\sum_{i=1}^k w_i} \sum_{i = 1}^{k} w_i\cdot\vert x_i - c \vert ^ {p} \right)^{\frac{1}{p}}\]

This is done in Mill by passing an additional parameter:

julia> w = Float32.([1.0, 0.2, 0.8, 0.5])
4-element Vector{Float32}:
 1.0
 0.2
 0.8
 0.5

julia> a_mean(X, bags, w)
2×3 Matrix{Float32}:
 1.0  2.8  4.0
 8.0  6.2  5.0

julia> a_pnorm(X, bags, w)
2×3 Matrix{Float32}:
 1.0  2.83521  4.0
 8.0  6.20283  5.0
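As a check on the weighted mean, the second bag holds instances 2 and 3 with weights 0.2 and 0.8, so its two output dimensions evaluate to:

\[\frac{0.2 \cdot 2 + 0.8 \cdot 3}{0.2 + 0.8} = 2.8, \qquad \frac{0.2 \cdot 7 + 0.8 \cdot 6}{0.2 + 0.8} = 6.2\]

which agrees with the second column of a_mean(X, bags, w). The single-instance bags are unaffected, because a lone weight cancels out in the normalization.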

For SegmentedMax and SegmentedLSE it is possible to pass in weights, but they are ignored during computation:

julia> a_max(X, bags, w) == a_max(X, bags)
true

Weighted nodes

WeightedBagNode is used to store instance weights in a dataset. It accepts weights in the constructor:

julia> wbn = WeightedBagNode(X, bags, w)
WeightedBagNode  # 3 obs, 176 bytes
  ╰── ArrayNode(2×4 Array with Float32 elements)  # 4 obs, 80 bytes

and passes them to aggregation operators:

julia> m = reflectinmodel(wbn, d -> Dense(d, 3))
BagModel ↦ BagCount([SegmentedMean(3); SegmentedMax(3)]) ↦ Dense(7 => 3)  # 4 arrays, 30 params, 280 bytes
  ╰── ArrayModel(Dense(2 => 3))  # 2 arrays, 9 params, 116 bytes

julia> m(wbn)
3×3 Matrix{Float32}:
 3.76052   3.6409   3.24472
 0.712644  0.87318  0.36645
 4.20341   3.50676  3.27765

Otherwise, WeightedBagNode behaves exactly like the standard BagNode.

Bag count

For some problems, it may be beneficial to use the size of the bag directly and feed it to subsequent layers. To do this, wrap an instance of AbstractAggregation or AggregationStack in the BagCount type.

In the aggregation phase, bag count appends one more element, which stores the bag size, to the output after all operators are applied. Furthermore, Mill applies the mapping $x \mapsto \log(x + 1)$ to this element:

julia> a_mean_bc = BagCount(a_mean)
BagCount(SegmentedMean(ψ = Float32[0.0, 0.0]))

julia> a_mean_bc(X, bags)
3×3 Matrix{Float32}:
 1.0       2.5      4.0
 8.0       6.5      5.0
 0.693147  1.09861  0.693147

The matrix now has three rows, the last one storing the transformed size of each bag.
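The values in the last row of the output above are consistent with taking $\log(x + 1)$ of the bag sizes 1, 2 and 1. A minimal plain-Julia check, with a vector of ranges standing in for AlignedBags and variable names chosen only for this illustration:

```julia
ranges = [1:1, 2:3, 4:4]            # plain ranges standing in for AlignedBags
sizes  = Float32.(length.(ranges))  # bag sizes: 1, 2, 1
counts = log.(sizes .+ 1)           # roughly 0.693147, 1.09861, 0.693147
```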

Model reflection adds BagCount after each aggregation operator by default.

julia> bn = BagNode(X, bags)
BagNode  # 3 obs, 112 bytes
  ╰── ArrayNode(2×4 Array with Float32 elements)  # 4 obs, 80 bytes

julia> bm = reflectinmodel(bn, d -> Dense(d, 3))
BagModel ↦ BagCount([SegmentedMean(3); SegmentedMax(3)]) ↦ Dense(7 => 3)  # 4 arrays, 30 params, 280 bytes
  ╰── ArrayModel(Dense(2 => 3))  # 2 arrays, 9 params, 116 bytes

Note that the bm (sub)model field of the BagModel is Dense(7 => 3). Its input has 7 dimensions: 3 for each of the two aggregation outputs and 1 for the sizes of bags.

julia> bm(bn)
3×3 Matrix{Float32}:
 -3.99267   -3.3032    -4.15971
 -0.604963  -3.69212   -4.94037
  0.849309  -0.414772  -1.12406

Default aggregation values

When all aggregation operators are printed, one may notice that all of them store one additional vector ψ. This is a vector of default parameters, initialized to all zeros, that are used for empty bags:

julia> bags = AlignedBags([1:1, 0:-1, 2:3, 0:-1, 4:4])
AlignedBags{Int64}(UnitRange{Int64}[1:1, 0:-1, 2:3, 0:-1, 4:4])

julia> a_mean(X, bags)
2×5 Matrix{Float32}:
 1.0  0.0  2.5  0.0  4.0
 8.0  0.0  6.5  0.0  5.0

That's why the dimension of the input is required in the constructor. See the Missing data page for more information.
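The role of the default vector can be sketched in plain Julia. An empty bag contains no instance from which the output length could be inferred, so the defaults must already have one entry per input dimension. The helper mean_or_default below is hypothetical and only mimics the behaviour shown above; it is not the Mill implementation.

```julia
using Statistics  # provides mean

X = Float32[1 2 3 4; 8 7 6 5]
ranges = [1:1, 0:-1, 2:3, 0:-1, 4:4]   # 0:-1 is an empty range, i.e. an empty bag
ψ = zeros(Float32, 2)                  # default values, one per input dimension

# Hypothetical helper: mean of a bag, or ψ when the bag has no instances.
mean_or_default(X, r, ψ) = isempty(r) ? ψ : vec(mean(X[:, r]; dims=2))

out = reduce(hcat, [mean_or_default(X, r, ψ) for r in ranges])  # 2×5 matrix
```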