Data in leaves
In Mill.jl tree-like data representations, there is always some raw data at the leaf level, whereas on higher levels instances are grouped into bags (`BagNode`s) and different sets are joined together with Cartesian products (`ProductNode`s), creating more and more abstract concepts. In this section we look into several examples of how the lowest-level data can be represented.
For this purpose, let's assume that we would like to identify infected clients in a network from their HTTP traffic. Since one client can make an arbitrary number of connections during the observation period, modelling the client as a bag of connections seems like the most natural approach:
```julia
julia> connections = AlignedBags([1:2, 3:3, 4:6])
AlignedBags{Int64}(UnitRange{Int64}[1:2, 3:3, 4:6])
```
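If the instances did not come in contiguous blocks, Mill.jl also provides `ScatteredBags`, which accepts an explicit vector of index vectors. A minimal sketch of the same grouping in that form:

```julia
# the same three bags, specified by explicit index lists
connections_scattered = ScatteredBags([[1, 2], [3], [4, 5, 6]])
```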
Thus, each of the six connections becomes an instance in one of the three bags representing clients. How should we represent such connection instances? Each HTTP flow has properties that can be expressed as standard numerical features, categorical features, or strings of characters.
Numerical features
We have already shown how to represent standard numerical features in previous parts of this manual. It is as simple as wrapping a type that behaves like a matrix in an `ArrayNode`:
```julia
julia> content_lengths = [4446, 1957, 4310, 11604, 17019, 13947]
6-element Vector{Int64}:
  4446
  1957
  4310
 11604
 17019
 13947

julia> dates = [1435420950, 1376190532, 1316869962, 1302775198, 1555598383, 1562237892]
6-element Vector{Int64}:
 1435420950
 1376190532
 1316869962
 1302775198
 1555598383
 1562237892

julia> numerical_node = ArrayNode([content_lengths'; dates'])
2×6 ArrayNode{Matrix{Int64}, Nothing}:
       4446        1957        4310       11604       17019       13947
 1435420950  1376190532  1316869962  1302775198  1555598383  1562237892
```
We use the `Content-Length` and `Date` request headers, the latter converted to a Unix timestamp.
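Raw values like Unix timestamps span ranges several orders of magnitude apart, which can slow down training. One might standardize the features first; a minimal sketch (this preprocessing is our own choice, not something Mill.jl requires):

```julia
using Statistics

# standardize each feature (row) to zero mean and unit variance
X = Float32.([content_lengths'; dates'])
X_std = (X .- mean(X; dims=2)) ./ std(X; dims=2)
numerical_node_std = ArrayNode(X_std)
```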
Categorical features
For categorical variables, we proceed in the same way, but we use the one-hot encoding implemented in Flux.jl. This way, we can encode, for example, the verb of the request:
```julia
julia> ALL_VERBS = ["GET", "HEAD", "POST", "PUT", "DELETE"] # etc...
5-element Vector{String}:
 "GET"
 "HEAD"
 "POST"
 "PUT"
 "DELETE"

julia> verbs = ["GET", "GET", "POST", "HEAD", "HEAD", "PUT"]
6-element Vector{String}:
 "GET"
 "GET"
 "POST"
 "HEAD"
 "HEAD"
 "PUT"

julia> verb_node = ArrayNode(Flux.onehotbatch(verbs, ALL_VERBS))
5×6 ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
 1  1  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  1  1  ⋅
 ⋅  ⋅  1  ⋅  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  1
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅
```
or the `Content-Encoding` header:
```julia
julia> ALL_ENCODINGS = ["bzip2", "gzip", "xz", "identity"] # etc...
4-element Vector{String}:
 "bzip2"
 "gzip"
 "xz"
 "identity"

julia> encodings = ["xz", "gzip", "bzip2", "xz", "identity", "bzip2"]
6-element Vector{String}:
 "xz"
 "gzip"
 "bzip2"
 "xz"
 "identity"
 "bzip2"

julia> encoding_node = ArrayNode(Flux.onehotbatch(encodings, ALL_ENCODINGS))
4×6 ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}:
 ⋅  ⋅  1  ⋅  ⋅  1
 ⋅  1  ⋅  ⋅  ⋅  ⋅
 1  ⋅  ⋅  1  ⋅  ⋅
 ⋅  ⋅  ⋅  ⋅  1  ⋅
```
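Real traffic may contain categories missing from the predefined list. `Flux.onehotbatch` accepts an optional default label to fall back on for such values; a small sketch (treating unknown encodings as `"identity"` is our own choice):

```julia
# "br" is not in ALL_ENCODINGS, so it is encoded as the default "identity"
Flux.onehotbatch(["gzip", "br"], ALL_ENCODINGS, "identity")
```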
Because `Flux.OneHotMatrix` supports multiplication, it is possible to wrap it in an `ArrayNode`.
Strings
The last example we will consider is string features, which could, for example, come from the `Host` header:
```julia
julia> hosts = ["www.foo.com", "www.foo.com", "www.baz.com", "www.foo.com", "www.baz.com", "www.foo.com"]
6-element Vector{String}:
 "www.foo.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
```
Mill.jl offers an n-gram histogram-based representation of strings. To get started, we pass the vector of strings into the constructor of `NGramMatrix`:
```julia
julia> hosts_ngrams = NGramMatrix(hosts, 3, 256, 7)
7×6 NGramMatrix{String, Vector{String}, Int64}:
 "www.foo.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
```
Each string gets processed into n-grams (trigrams in this case, as specified by the first parameter after the strings). Then, each character is transformed into an integer via the `codeunits` function, and the whole trigram is interpreted as a three-digit number written in the base `b` given in the next parameter. Here, we use a base of 256, which is the most reasonable choice for ASCII URLs. For example, for the trigram `foo`, we obtain:
```julia
julia> c = codeunits("foo")
3-element Base.CodeUnits{UInt8, String}:
 0x66
 0x6f
 0x6f

julia> c[1] * 256^2 + c[2] * 256 + c[3]
6713199
```
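To see which histogram row this trigram contributes to, we can apply the final step of the construction ourselves:

```julia
julia> mod(6713199, 7)
3
```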
This last step takes the result modulo some prime `m`, in this case 7 (the last parameter in the constructor), leaving us with 3. Therefore, for the trigram `foo`, we would add 1 to the third row[1]. We can convert this `NGramMatrix` into a sparse matrix and then into a standard dense array:
```julia
julia> using SparseArrays

julia> hosts_dense = hosts_ngrams |> SparseMatrixCSC |> Matrix
7×6 Matrix{Int64}:
 1  1  3  1  3  1
 1  1  0  1  0  1
 3  3  4  3  4  3
 2  2  1  2  1  2
 3  3  3  3  3  3
 2  2  1  2  1  2
 1  1  1  1  1  1
```
Again, we get one column for each string, and the matrix has as many rows as the modulo `m`. For each string `s`, we get `length(s) + n - 1` n-grams:
```julia
julia> sum(hosts_dense; dims=1)
1×6 Matrix{Int64}:
 13  13  13  13  13  13
```
This is because we use special abstract characters (or tokens) for the start and the end of a string. If we denote these `^` and `$`, respectively, then from the string `"foo"` we get the trigrams `^^f`, `^fo`, `foo`, `oo$`, and `o$$`. Note that these special characters are purely abstract, whereas `^` and `$`, used here only for illustration, are characters like any other. Both the string-start and string-end special characters have a unique mapping to integers, which can be obtained as well as set:
```julia
julia> Mill.string_start_code()
0x02

julia> Mill.string_end_code()
0x03

julia> Mill.string_start_code!(42)

julia> Mill.string_start_code()
0x2a
```
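Since this is a global setting, it is advisable to restore the default (0x02, as shown above) once you are done experimenting:

```julia
julia> Mill.string_start_code!(0x02)

julia> Mill.string_start_code()
0x02
```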
`NGramMatrix` behaves like a matrix, implements an efficient left-multiplication, and thus can be used inside an `ArrayNode`:
```julia
julia> hosts_ngrams::AbstractMatrix{Int64}
7×6 NGramMatrix{String, Vector{String}, Int64}:
 "www.foo.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"

julia> host_node = ArrayNode(hosts_ngrams)
7×6 ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}:
 "www.foo.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
```
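The left-multiplication can also be exercised directly; a minimal sketch with a made-up weight matrix `W`:

```julia
W = rand(Float32, 2, 7)  # any dense matrix with 7 columns (matching the modulo m)
W * hosts_ngrams         # a 2×6 Matrix{Float32}, one column per string
```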
The Custom nodes section shows another, slightly more complex way of processing strings, specifically Unix paths.
Putting it all together
Now, we can finally wrap all features of all six connections in one `ProductNode` and construct a `BagNode` grouping these connections into bags corresponding to the three clients:
```julia
julia> ds = BagNode(ProductNode(numerical=numerical_node,
                                verb=verb_node,
                                encoding=encoding_node,
                                hosts=host_node),
                    connections)
BagNode  3 obs
  ╰── ProductNode  6 obs
        ├── numerical: ArrayNode(2×6 Array with Int64 elements)  6 obs
        ├─────── verb: ArrayNode(5×6 OneHotArray with Bool elements)  6 obs
        ├─── encoding: ArrayNode(4×6 OneHotArray with Bool elements)  6 obs
        ╰────── hosts: ArrayNode(7×6 NGramMatrix with Int64 elements)  6 obs
```
Then, we create a model for training and run one forward pass:
```julia
julia> m = reflectinmodel(ds)
BagModel ↦ BagCount([SegmentedMean(10); SegmentedMax(10)]) ↦ Dense(21 => 10) ⋯
  ╰── ProductModel ↦ Dense(40 => 10)  2 arrays, 410 params, 1.688 KiB
        ├── numerical: ArrayModel(Dense(2 => 10))  2 arrays, 30 params, 208 by ⋯
        ├─────── verb: ArrayModel(Dense(5 => 10))  2 arrays, 60 params, 328 by ⋯
        ├─── encoding: ArrayModel(Dense(4 => 10))  2 arrays, 50 params, 288 by ⋯
        ╰────── hosts: ArrayModel(Dense(7 => 10))  2 arrays, 80 params, 408 by ⋯
```
```julia
julia> m(ds)
10×3 Matrix{Float32}:
  1.28439f8   1.23432f8   1.17321f8
 -5.59327f7  -4.71452f7  -8.44016f7
 -7.23352f8  -6.77337f8  -7.52425f8
  3.82563f8   3.63774f8   3.74053f8
 -3.25157f8  -3.05034f8  -3.36274f8
 -1.77164f8  -1.68485f8  -1.74555f8
 -4.77705f8  -4.48281f8  -4.97372f8
  4.16182f7   3.67129f7   4.93028f7
  4.13563f7   4.18648f7   3.00047f7
  3.72453f7   3.75997f7   2.38858f7
```
We obtain a matrix with three columns, each corresponding to one of the clients. We can now, for example, calculate gradients with respect to the model parameters:
```julia
julia> gradient(m -> sum(m(ds)), m)
((im = (ms = (numerical = (m = (weight = Float32[3543.7795 3.467772f8; 12015.21 2.9824387f9; … ; -10205.47 -2.5080468f9; -22158.354 -4.05619f9], bias = Float32[0.12283072, 2.3017697, -6.108818, -0.07409614, -2.9991698, 2.245645, -2.9834032, 0.46019074, -1.9266042, -2.8137956], σ = nothing),), verb = (m = (weight = Float32[-0.6740063 -0.70653045 … 0.03252409 0.0; 0.799402 0.48524243 … 0.31415942 0.0; … ; 0.37385964 0.10423415 … 0.26962546 0.0; 1.1983932 0.8594307 … 0.3389626 0.0], bias = Float32[-2.022019, 2.3982058, -1.1765642, -0.78065914, 0.99973947, -2.0380163, 1.3930469, 4.333061, 1.1215789, 3.5951798], σ = nothing),), encoding = (m = (weight = Float32[-0.06326684 0.74241585 0.37819934 -0.15449953; -1.5059438 -0.29936808 -0.8562048 -0.18580464; … ; -1.4919614 -0.3994601 -0.92160493 -0.09642328; 1.679817 -0.015670985 0.7583598 0.2948524], bias = Float32[0.90284884, -2.847321, 4.356851, 0.28094852, -2.0077217, -5.3011537, 3.5605748, -1.0072757, -2.9094498, 2.717358], σ = nothing),), hosts = (m = (weight = Float32[-18.11323 -2.8004198 … -2.8004198 -4.7428136; -5.9066224 -0.88825744 … -0.88825744 -1.5366275; … ; 3.351283 0.6230595 … 0.6230595 0.9194803; -8.199025 -1.2190037 … -1.2190037 -2.1274068], bias = Float32[-4.7428136, -1.5366275, -2.2574577, -0.43867037, 2.0060303, 0.528619, -1.6092172, -0.7405628, 0.9194803, -2.1274068], σ = nothing),)), m = (weight = Float32[-4.019252f8 6.513404f8 … -1.2286562 0.8835508; 5.433767f8 -8.805694f8 … 1.4225271 -1.0247681; … ; 6.43344f8 -1.0425715f9 … 1.6505618 -1.1893108; -3.5949668f7 5.8265244f7 … -0.5622243 0.39958024], bias = Float32[1.9439511, -2.6168752, -7.0138664, 8.038992, -3.5198772, 3.923693, 0.69940704, -0.5939465, -3.09123, -0.07146776], σ = nothing)), a = (a = (fs = ((ψ = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],), (ψ = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],)),),), bm = (weight = Float32[-1.05757606f9 4.0311667f8 … -1.6887172f9 3.1780539; -1.05757606f9 4.0311667f8 … -1.6887172f9 3.1780539; … ; -1.05757606f9 4.0311667f8 … -1.6887172f9 3.1780539; -1.05757606f9 4.0311667f8 … -1.6887172f9 3.1780539], bias = Fill(3.0f0, 10), σ = nothing)),)
```
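To train the model, one would typically attach a classification head and follow the standard Flux training loop. A minimal sketch, assuming made-up labels `y` for the three clients (everything here except `m` and `ds` is our own scaffolding):

```julia
using Flux

y = Flux.onehotbatch([1, 0, 0], 0:1)       # hypothetical labels: infected or clean
model = Chain(m, Dense(10 => 2))           # classification head on top of the Mill model
opt_state = Flux.setup(Adam(1f-3), model)

# one gradient step on the cross-entropy loss
grads = Flux.gradient(model -> Flux.logitcrossentropy(model(ds), y), model)
Flux.update!(opt_state, model, grads[1])
```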
Putting all numerical features into one `ArrayNode` is a design choice. We could just as well introduce more keys in the final `ProductNode`, as sketched below. The model treats these two cases slightly differently (see the Nodes section).
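For illustration, the variant with a separate key per numerical feature might look as follows (a sketch; the key names are our own):

```julia
# each numerical feature gets its own key instead of sharing one ArrayNode
ds_alt = BagNode(ProductNode(content_length=ArrayNode(content_lengths'),
                             date=ArrayNode(dates'),
                             verb=verb_node,
                             encoding=encoding_node,
                             hosts=host_node),
                 connections)
```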
This dummy example illustrates the versatility of Mill.jl. With little to no preprocessing, we are able to process complex hierarchical structures and avoid manually designing feature-extraction procedures. For a more involved study of processing Internet traffic with Mill.jl, see for example [10].
[1]: One appropriate value of the modulo `m` for real problems is 2053.