Data in leaves

In Mill.jl tree-like data representations, there are always some raw data at the leaf level; on higher levels, instances are grouped into bags (BagNodes) and different sets are joined together with Cartesian products (ProductNodes), building progressively more abstract concepts. In this section we look at several examples of how the lowest-level data can be represented.

For this purpose, let's assume that we would like to identify infected clients in a network from their HTTP traffic. Since one client can make an arbitrary number of connections during the observation period, modelling the client as a bag of connections seems like the most natural approach:

julia> connections = AlignedBags([1:2, 3:3, 4:6])
AlignedBags{Int64}(UnitRange{Int64}[1:2, 3:3, 4:6])

Thus, each of the six connections becomes an instance in one of the three bags representing clients. How should we represent the connection instances themselves? Each HTTP flow has properties that can be expressed as standard numerical features, categorical features, or strings of characters.

Numerical features

We have already shown how to represent standard numerical features in previous parts of this manual. It is as simple as wrapping a type that behaves like a matrix into an ArrayNode:

julia> content_lengths = [4446, 1957, 4310,
                          11604, 17019, 13947]
6-element Vector{Int64}:
  4446
  1957
  4310
 11604
 17019
 13947

julia> dates = [1435420950, 1376190532, 1316869962, 1302775198, 1555598383, 1562237892]
6-element Vector{Int64}:
 1435420950
 1376190532
 1316869962
 1302775198
 1555598383
 1562237892

julia> numerical_node = ArrayNode([content_lengths'; dates'])
2×6 ArrayNode{Matrix{Int64}, Nothing}:
       4446        1957        4310       11604       17019       13947
 1435420950  1376190532  1316869962  1302775198  1555598383  1562237892

We use the Content-Length and Date request headers, the latter converted to a Unix timestamp.
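The timestamp conversion can be sketched with the Dates standard library. The header string below is an illustrative value (it happens to correspond to the first entry of dates above), not taken from any real dataset:

```julia
using Dates

# Parse a date string and convert it to a Unix timestamp (seconds since epoch).
header = "27 Jun 2015 16:02:30"
dt = DateTime(header, dateformat"d u y H:M:S")
ts = round(Int, datetime2unix(dt))  # 1435420950
```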

Categorical features

For categorical variables, we proceed in the same way, but we use the one-hot encoding implemented in Flux.jl. This way, we can encode, for example, the verb of the request:

julia> ALL_VERBS = ["GET", "HEAD", "POST", "PUT", "DELETE"] # etc...
5-element Vector{String}:
 "GET"
 "HEAD"
 "POST"
 "PUT"
 "DELETE"

julia> verbs = ["GET", "GET", "POST", "HEAD", "HEAD", "PUT"]
6-element Vector{String}:
 "GET"
 "GET"
 "POST"
 "HEAD"
 "HEAD"
 "PUT"

julia> verb_node = ArrayNode(Flux.onehotbatch(verbs, ALL_VERBS))
5×6 ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
 1  1  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  1  1  ⋅
 ⋅  ⋅  1  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  1
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅

or the Content-Encoding header:

julia> ALL_ENCODINGS = ["bzip2", "gzip", "xz", "identity"] # etc...
4-element Vector{String}:
 "bzip2"
 "gzip"
 "xz"
 "identity"

julia> encodings = ["xz", "gzip", "bzip2", "xz", "identity", "bzip2"]
6-element Vector{String}:
 "xz"
 "gzip"
 "bzip2"
 "xz"
 "identity"
 "bzip2"

julia> encoding_node = ArrayNode(Flux.onehotbatch(encodings, ALL_ENCODINGS))
4×6 ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
 ⋅  ⋅  1  ⋅  ⋅  1
 ⋅  1  ⋅  ⋅  ⋅  ⋅
 1  ⋅  ⋅  1  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  1  ⋅

Because Flux.OneHotMatrix supports multiplication, it can be wrapped in an ArrayNode like any other matrix.
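Why multiplication is all a model needs can be seen with a plain Bool stand-in for the one-hot matrix (a sketch, not using Flux itself): multiplying a weight matrix by a one-hot matrix merely selects columns of the weights.

```julia
# One-hot stand-in built with a comprehension (not Flux's OneHotMatrix):
ALL_VERBS = ["GET", "HEAD", "POST", "PUT", "DELETE"]
verbs = ["GET", "GET", "POST"]
onehot = [v == verb for v in ALL_VERBS, verb in verbs]  # 5×3 Bool matrix

# A toy "Dense"-layer weight matrix: multiplying by the one-hot matrix
# selects the columns of W corresponding to each verb.
W = rand(2, 5)
W * onehot == W[:, [1, 1, 3]]  # true
```

This column-selection property is what a Dense layer exploits, so no dense materialization of the encoding is ever needed.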

Strings

The last example we will consider is string features. This could be, for example, the Host header:

julia> hosts = [
           "www.foo.com", "www.foo.com", "www.baz.com",
           "www.foo.com", "www.baz.com", "www.foo.com"
       ]
6-element Vector{String}:
 "www.foo.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"

Mill offers an ngram histogram-based representation for strings. To get started, we pass the vector of strings to the NGramMatrix constructor:

julia> hosts_ngrams = NGramMatrix(hosts, 3, 256, 7)
7×6 NGramMatrix{String, Vector{String}, Int64}:
 "www.foo.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"

Each string is processed into ngrams (trigrams in this case, as specified by the second argument of the constructor). Each character is then transformed into an integer via the codeunits function, and the whole trigram is interpreted as a three-digit number in base b, specified by the third argument. Here we use a base of 256, the most reasonable choice for ASCII URLs. For example, for the trigram foo, we obtain:

julia> c = codeunits("foo")
3-element Base.CodeUnits{UInt8, String}:
 0x66
 0x6f
 0x6f

julia> c[1] * 256^2 + c[2] * 256 + c[3]
6713199

The last step is taking this result modulo some prime m, in this case 7 (the last argument of the constructor), which leaves us with 3. Therefore, for the trigram foo, we would add 1 to the third row[1]. We can convert this NGramMatrix to a sparse matrix and then to a standard dense array:

julia> using SparseArrays

julia> hosts_dense = hosts_ngrams |> SparseMatrixCSC |> Matrix
7×6 Matrix{Int64}:
 1  1  3  1  3  1
 1  1  0  1  0  1
 3  3  4  3  4  3
 2  2  1  2  1  2
 3  3  3  3  3  3
 2  2  1  2  1  2
 1  1  1  1  1  1
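The hashing just described can be condensed into a small standalone function (a hypothetical helper for illustration, not Mill's implementation; unlike NGramMatrix, it omits the string start/end padding tokens for brevity):

```julia
# Hash every length-n window of the code units of `s` as a base-b number
# reduced modulo m, i.e. the scheme described above without padding tokens.
function ngram_hashes(s::AbstractString; n::Int=3, b::Int=256, m::Int=7)
    c = codeunits(s)
    [foldl((acc, u) -> (acc * b + u) % m, c[i:i+n-1]; init=0) for i in 1:length(c)-n+1]
end

ngram_hashes("foo")  # [3], matching the manual computation above
```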

Again, we get one column per string, and the matrix has as many rows as the modulus m. For each string s, we get length(s) + n - 1 ngrams:

julia> sum(hosts_dense; dims=1)
1×6 Matrix{Int64}:
 13  13  13  13  13  13

This is because special abstract characters (or tokens) are used for the start and the end of each string. If we denote these ^ and $, respectively, then from the string "foo" we get the trigrams ^^f, ^fo, foo, oo$, o$$. Note that these special tokens are purely abstract; the ^ and $ used here for illustration are ordinary characters like any other. Both the string-start and string-end tokens have unique integer codes, which can be read as well as set:

julia> Mill.string_start_code()
0x02

julia> Mill.string_end_code()
0x03

julia> Mill.string_start_code!(42)

julia> Mill.string_start_code()
0x2a
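The padding scheme can be illustrated with ordinary characters standing in for the abstract tokens (a sketch for illustration, not Mill code); it also explains why each string yields exactly length(s) + n - 1 ngrams:

```julia
# "^" and "$" stand in for the abstract start/end tokens here.
function padded_ngrams(s::AbstractString, n::Int)
    t = '^'^(n - 1) * s * '$'^(n - 1)
    [t[i:i+n-1] for i in 1:length(s)+n-1]
end

padded_ngrams("foo", 3)                  # ["^^f", "^fo", "foo", "oo$", "o$$"]
length(padded_ngrams("www.foo.com", 3))  # 13, as in the column sums above
```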

NGramMatrix behaves like a matrix and implements efficient left-multiplication, so it can be used in an ArrayNode:

julia> hosts_ngrams::AbstractMatrix{Int64}
7×6 NGramMatrix{String, Vector{String}, Int64}:
 "www.foo.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
julia> host_node = ArrayNode(hosts_ngrams)
7×6 ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}:
 "www.foo.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"

The Custom nodes section shows one more, slightly more complex way of processing strings, specifically Unix paths.

Putting it all together

Now we can finally wrap all features of all six connections into one ProductNode and construct a BagNode representing a bag of all connections (corresponding to the three clients):

julia> ds = BagNode(ProductNode(
           numerical=numerical_node,
           verb=verb_node,
           encoding=encoding_node,
           hosts=host_node
       ), connections)
BagNode  # 3 obs, 176 bytes
  ╰── ProductNode  # 6 obs, 72 bytes
        ├── numerical: ArrayNode(2×6 Array with Int64 elements)  # 6 obs, 144 bytes
        ├─────── verb: ArrayNode(5×6 OneHotArray with Bool elements)  # 6 obs, 96 bytes
        ├─── encoding: ArrayNode(4×6 OneHotArray with Bool elements)  # 6 obs, 96 bytes
        ╰────── hosts: ArrayNode(7×6 NGramMatrix with Int64 elements)  # 6 obs, 266 bytes

Then we create a model for training and run one forward pass:

julia> m = reflectinmodel(ds)
BagModel ↦ BagCount([SegmentedMean(10); SegmentedMax(10)]) ↦ Dense(21 => 10)  # 4 arrays, 240 params, 1.094 KiB
  ╰── ProductModel ↦ Dense(40 => 10)  # 2 arrays, 410 params, 1.680 KiB
        ├── numerical: ArrayModel(Dense(2 => 10))  # 2 arrays, 30 params, 200 bytes
        ├─────── verb: ArrayModel(Dense(5 => 10))  # 2 arrays, 60 params, 320 bytes
        ├─── encoding: ArrayModel(Dense(4 => 10))  # 2 arrays, 50 params, 280 bytes
        ╰────── hosts: ArrayModel(Dense(7 => 10))  # 2 arrays, 80 params, 400 bytes
julia> m(ds)
10×3 Matrix{Float32}:
  1.28439f8   1.23432f8   1.17321f8
 -5.59327f7  -4.71452f7  -8.44016f7
 -7.23352f8  -6.77337f8  -7.52425f8
  3.82563f8   3.63774f8   3.74053f8
 -3.25157f8  -3.05034f8  -3.36274f8
 -1.77164f8  -1.68485f8  -1.74555f8
 -4.77705f8  -4.48281f8  -4.97372f8
  4.16182f7   3.67129f7   4.93028f7
  4.13563f7   4.18648f7   3.00047f7
  3.72453f7   3.75997f7   2.38858f7

We obtain a matrix with three columns, each corresponding to one of the clients. We can also, for example, calculate gradients with respect to the model parameters:

julia> gradient(m -> sum(m(ds)), m)
((im = (ms = (numerical = (m = (weight = Float32[3543.7795 3.467772f8; 12015.21 2.9824387f9; … ; -10205.47 -2.5080468f9; -22158.354 -4.05619f9], bias = Float32[0.12283072, 2.3017697, -6.108818, -0.07409614, -2.9991698, 2.245645, -2.9834032, 0.46019074, -1.9266042, -2.8137956], σ = nothing),), verb = (m = (weight = Float32[-0.6740063 -0.70653045 … 0.03252409 0.0; 0.799402 0.48524243 … 0.31415942 0.0; … ; 0.37385964 0.10423415 … 0.26962546 0.0; 1.1983932 0.8594307 … 0.3389626 0.0], bias = Float32[-2.022019, 2.3982058, -1.1765642, -0.78065914, 0.99973947, -2.0380163, 1.3930469, 4.333061, 1.1215789, 3.5951798], σ = nothing),), encoding = (m = (weight = Float32[-0.06326684 0.74241585 0.37819934 -0.15449953; -1.5059438 -0.29936808 -0.8562048 -0.18580464; … ; -1.4919614 -0.3994601 -0.92160493 -0.09642328; 1.679817 -0.015670985 0.7583598 0.2948524], bias = Float32[0.90284884, -2.847321, 4.356851, 0.28094852, -2.0077217, -5.3011537, 3.5605748, -1.0072757, -2.9094498, 2.717358], σ = nothing),), hosts = (m = (weight = Float32[-18.11323 -2.8004198 … -2.8004198 -4.7428136; -5.9066224 -0.88825744 … -0.88825744 -1.5366275; … ; 3.351283 0.6230595 … 0.6230595 0.9194803; -8.199025 -1.2190037 … -1.2190037 -2.1274068], bias = Float32[-4.7428136, -1.5366275, -2.2574577, -0.43867037, 2.0060303, 0.528619, -1.6092172, -0.7405628, 0.9194803, -2.1274068], σ = nothing),)), m = (weight = Float32[-4.019252f8 6.513404f8 … -1.2286562 0.8835508; 5.433767f8 -8.805694f8 … 1.4225271 -1.0247681; … ; 6.43344f8 -1.0425715f9 … 1.6505618 -1.1893108; -3.5949668f7 5.8265244f7 … -0.5622243 0.39958024], bias = Float32[1.9439511, -2.6168752, -7.0138664, 8.038992, -3.5198772, 3.923693, 0.69940704, -0.5939465, -3.09123, -0.07146776], σ = nothing)), a = (a = (fs = ((ψ = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],), (ψ = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],)),),), bm = (weight = Float32[-1.05757606f9 4.0311667f8 … -1.6887172f9 3.1780539; -1.05757606f9 4.0311667f8 … -1.6887172f9 3.1780539; … ; -1.05757606f9 4.0311667f8 … -1.6887172f9 3.1780539; -1.05757606f9 4.0311667f8 … -1.6887172f9 3.1780539], bias = Fill(3.0f0, 10), σ = nothing)),)
A note on numerical features

Putting all numerical features into one ArrayNode is a design choice. We could instead introduce more keys in the final ProductNode. The model treats these two cases slightly differently (see the Nodes section).
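The alternative layout with separate keys could be sketched as follows (a sketch assuming Mill.jl is loaded; the key names content_length and date are our own choice, not part of any API):

```julia
using Mill

content_lengths = [4446, 1957, 4310, 11604, 17019, 13947]
dates = [1435420950, 1376190532, 1316869962, 1302775198, 1555598383, 1562237892]

# One ProductNode key per numerical feature instead of a single 2×6 ArrayNode.
# With reflectinmodel, each key then gets its own Dense submodel.
ds2 = BagNode(ProductNode(
    content_length=ArrayNode(content_lengths'),  # 1×6 matrix
    date=ArrayNode(dates'),                      # 1×6 matrix
), AlignedBags([1:2, 3:3, 4:6]))
```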

This dummy example illustrates the versatility of Mill. With little to no preprocessing, we are able to process complex hierarchical structures and avoid manually designing feature-extraction procedures. For a more involved study of processing Internet traffic with Mill, see for example [10].

  • [1] An appropriate value of the modulus m for real problems is, for example, 2053.