Data Views
There exists a wide variety of machine learning algorithms, each unique in its own way. Yet, there are clear similarities in how these algorithms utilize the data set during model training. In fact, most algorithms belong to at least one of the following categories.
- Some algorithms use the whole training data in every iteration of the training procedure. If that is the case, then it is not necessary to partition or otherwise prepare the data any further.
- In contrast, many modern algorithms process the training data piece by piece in smaller chunks. These “chunks” are usually of some fixed size and are commonly referred to as mini-batches.
- Yet another (but overlapping) group of algorithms processes the training data just one single observation at a time.
What we can learn from this is that regardless of the concrete algorithm and all its details, it is quite likely that at some point during a machine learning experiment, we need to iterate over the training data in a particular way. Most of the time we either iterate over the training data one observation at a time, or one mini-batch at a time.
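The two iteration patterns can be sketched in plain Julia, without this package. The snippet is only an illustration and assumes that each column of a matrix is one observation; the variable names are made up for the example.

```julia
# Toy data: 2 features, 6 observations (one observation per column).
X = reshape(collect(1.0:12.0), 2, 6)
nobs = size(X, 2)

# Pattern 1: one observation at a time.
obs_lengths = [length(view(X, :, i)) for i in 1:nobs]

# Pattern 2: one mini-batch at a time (here: up to 4 observations per batch).
batch_sizes = [size(view(X, :, idx), 2) for idx in Iterators.partition(1:nobs, 4)]
```

Note that `Iterators.partition` keeps a shorter final chunk, so the second batch here holds only 2 observations.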
This package provides multiple approaches to accomplish this. Multiple approaches are necessary because there are different forms of data sources with very different properties. Recall how we differentiated between data containers and data iterators. In this document, however, we will focus solely on data sources that are considered data containers.
As Vector of Observations
One could be inclined to believe that, in order to iterate over some data container one observation at a time, it should suffice to iterate over the data container itself. While this may work as expected for some data container types, it will not work in general.
Consider the following two example data containers, the matrix X and the vector y. While both are different types of data container, we will make each one store exactly 5 observations.
julia> X = rand(2, 5)
2×5 Array{Float64,2}:
0.226582 0.933372 0.505208 0.0443222 0.812814
0.504629 0.522172 0.0997825 0.722906 0.245457
julia> y = rand(5)
5-element Array{Float64,1}:
0.11202
0.000341996
0.380001
0.505277
0.841177
When we iterate over y, it turns out we also iterate over each observation. In other words, for a Vector it would work to just iterate over the data container itself, if our goal is to process it one observation at a time.
julia> foreach(println, y)
0.11201971329803984
0.0003419958128361156
0.3800005018641015
0.5052774551949404
0.8411766784932724
On the other hand, when we iterate over X, we iterate over all individual array elements, and not over what we consider to be the observations. In general that is the behaviour we want for an Array, but it is not in line with our domain interpretation of a data container.
julia> foreach(println, X)
0.22658190197881312
0.5046291972412908
0.9333724636943255
0.5221721267193593
0.5052080505550971
0.09978246027514359
0.04432218813798294
0.7229058081423172
0.8128138585478044
0.24545709827626805
This means that we need a more general approach for iterating over a data container one observation at a time. Recall how data containers have the nice property of knowing how many observations they contain, and how to access each individual observation. Because of this we need not even limit ourselves to just iteration here; instead we can create a new “view”. For that purpose we provide the type ObsView, which can be used to treat any data container as a vector-like representation of that data container, where each vector element corresponds to a single observation.
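To make the idea of a lazy, vector-like view concrete, here is a minimal sketch in plain Julia. The type `LazyColumns` is hypothetical and written only for this illustration; it is not how ObsView is implemented, and it assumes a matrix whose columns are the observations.

```julia
# Hypothetical type for illustration only; it is not how ObsView is implemented.
struct LazyColumns{M<:AbstractMatrix} <: AbstractVector{AbstractVector}
    data::M
end

# An AbstractVector only needs `size` and `getindex` to support
# indexing, `length`, and iteration.
Base.size(v::LazyColumns) = (size(v.data, 2),)
Base.getindex(v::LazyColumns, i::Int) = view(v.data, :, i)

X = [1.0 3.0 5.0; 2.0 4.0 6.0]   # 3 observations with 2 features each
ov = LazyColumns(X)

second = ov[2]                   # a view into X, nothing is copied
X[1, 2] = 30.0                   # changing X is visible through the view
column_sums = [sum(o) for o in ov]
```

Because indexing returns a view, the element `second` reflects the later change to `X`, which is the behaviour one would expect from a representation that avoids copying.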
- ObsView <: DataView <: AbstractVector
  Lazy representation of a data container as a vector of individual observations.
  Any data access is delayed until getindex is called, and even getindex returns the result of datasubset(), which in general avoids data movement until getobs() is invoked.
- obsview(data[, obsdim]) → ObsView
  Create an ObsView for the given data container. It will serve as a vector-like view into data, where every element of the vector points to a single observation in data.
  Parameters:
  - data – The object representing a data container.
  - obsdim – Optional. If it makes sense for the type of data, then obsdim can be used to specify which dimension of data denotes the observations. It can be specified in a type-stable manner as a positional argument, or as a more convenient keyword parameter. See Observation Dimension for more information.
Let us consider our toy matrix X again, which we will interpret as containing 5 observations with 2 features each. This time we pass it to obsview() before iterating over it. Notice how the resulting ObsView will look like a vector of vectors. As we will see from the type, each element of the ObsView ov is just a SubArray into X. As such, no data from X is copied.
julia> X = rand(2, 5)
2×5 Array{Float64,2}:
0.226582 0.933372 0.505208 0.0443222 0.812814
0.504629 0.522172 0.0997825 0.722906 0.245457
julia> ov = obsview(X)
5-element obsview(::Array{Float64,2}, ObsDim.Last()) with element type SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true}:
[0.226582,0.504629]
[0.933372,0.522172]
[0.505208,0.0997825]
[0.0443222,0.722906]
[0.812814,0.245457]
julia> ov[2] # access second observation
2-element SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true}:
0.933372
0.522172
julia> foreach(println, ov) # now we iterate over observations
[0.226582,0.504629]
[0.933372,0.522172]
[0.505208,0.0997825]
[0.0443222,0.722906]
[0.812814,0.245457]
If there is more than one array dimension, all but the observation dimension are implicitly assumed to be features (i.e. part of that observation). As we have seen with X in the example above, the default assumption is that the last array dimension enumerates the observations. This can be overridden by explicitly specifying the obsdim. In the following code snippet we treat X as a data set that has 2 observations with 5 features each.
julia> X = rand(2, 5)
2×5 Array{Float64,2}:
0.226582 0.933372 0.505208 0.0443222 0.812814
0.504629 0.522172 0.0997825 0.722906 0.245457
julia> ov = obsview(X, obsdim = 1)
2-element obsview(::Array{Float64,2}, ObsDim.Constant{1}()) with element type SubArray{Float64,1,Array{Float64,2},Tuple{Int64,Colon},true}:
[0.226582,0.933372,0.505208,0.0443222,0.812814]
[0.504629,0.522172,0.0997825,0.722906,0.245457]
julia> ov = obsview(X, ObsDim.First()); # same as above but type-stable
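The effect of choosing the observation dimension can be mimicked for a plain matrix with the Base functions `eachcol` and `eachrow`. This is only an analogy for the matrix case, not a substitute for obsdim.

```julia
X = [1 2 3; 4 5 6]            # a 2×3 matrix

# Last dimension as observations: 3 observations with 2 features each.
by_last = collect(eachcol(X))

# First dimension as observations: 2 observations with 3 features each.
by_first = collect(eachrow(X))
```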
Similarly, we can also call obsview() with our toy vector y. Recall that a Vector is just an Array with only one dimension. This example will help demonstrate how an ObsView handles data containers that are already in a vector-like form.
julia> y = rand(5)
5-element Array{Float64,1}:
0.11202
0.000341996
0.380001
0.505277
0.841177
julia> ov = obsview(y)
5-element obsview(::Array{Float64,1}, ObsDim.Last()) with element type SubArray{Float64,0,Array{Float64,1},Tuple{Int64},false}:
0.11202
0.000341996
0.380001
0.505277
0.841177
julia> ov[2] # access second observation
0-dimensional SubArray{Float64,0,Array{Float64,1},Tuple{Int64},false}:
0.000341996
At first glance, the result of indexing into ov may seem unintuitive. Why does it return a 0-dimensional SubArray instead of simply the value? The main reason for this behaviour is that we try to avoid data movement unless getobs() is called. Until that point, we just create subsets into the original data container.
julia> getobs(ov[2])
0.0003419958128361156
julia> getobs(ov)
5-element Array{Float64,1}:
0.11202
0.000341996
0.380001
0.505277
0.841177
julia> getobs(ov, 2)
0.0003419958128361156
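The difference between a 0-dimensional view and the value it points to can be reproduced with Base alone. The sketch below uses `view` and empty-bracket indexing; it illustrates the general concept and is not what getobs() does internally.

```julia
y = [0.1, 0.2, 0.3]

v = view(y, 2)       # a 0-dimensional view of the second element, no copy
value = v[]          # indexing with [] extracts the element itself

y[2] = 0.9           # the view follows changes to the original vector ...
after_change = v[]   # ... while `value` keeps what was extracted earlier
```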
You may have noticed in all the examples so far that creating an ObsView preserves the order of the observations. This is of course on purpose and the desired behaviour. However, since ObsView is commonly used as an iterator, one may prefer to iterate over the data in a random order. To do so, simply combine the functions obsview() and shuffleobs().
julia> ov = obsview(shuffleobs(y))
5-element obsview(::SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}, ObsDim.Last()) with element type SubArray{Float64,0,Array{Float64,1},Tuple{Int64},false}:
0.505277
0.11202
0.841177
0.380001
0.000341996
julia> ov = shuffleobs(obsview(y)) # also possible
5-element obsview(::SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}, ObsDim.Last()) with element type SubArray{Float64,0,Array{Float64,1},Tuple{Int64},false}:
0.505277
0.380001
0.000341996
0.841177
0.11202
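Conceptually, a lazy shuffle amounts to indexing the data through a random permutation of the observation indices. The following Base-only sketch shows that idea; it is an assumption about the concept, not a description of how shuffleobs() is implemented.

```julia
using Random

y = [0.1, 0.2, 0.3, 0.4, 0.5]
rng = MersenneTwister(42)        # fixed seed, only to make reruns repeatable
perm = randperm(rng, length(y))  # random permutation of the observation indices
shuffled = view(y, perm)         # lazy: y itself is left untouched
```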
It is also possible to link multiple different data containers together on a per-observation level. To do that, simply put all the relevant data containers into a single Tuple before passing it to obsview(). Each element of the resulting ObsView will then be a Tuple, with each observation in the same tuple position as the data container it came from.
julia> ov = obsview((X, y))
5-element obsview(::Tuple{Array{Float64,2},Array{Float64,1}}, (ObsDim.Last(),ObsDim.Last())) with element type Tuple{SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true},SubArray{Float64,0,Array{Float64,1},Tuple{Int64},false}}:
([0.226582,0.504629],0.11202)
([0.933372,0.522172],0.000341996)
([0.505208,0.0997825],0.380001)
([0.0443222,0.722906],0.505277)
([0.812814,0.245457],0.841177)
julia> ov[2] # access second observation
([0.933372,0.522172],0.000341996)
julia> typeof(ov[2])
Tuple{SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true},SubArray{Float64,0,Array{Float64,1},Tuple{Int64},false}}
It is worth pointing out that the tuple elements (i.e. data containers) that are passed to obsview() need not be of the same type, nor of the same shape. You can observe this in the code above, where X is a Matrix while y is a Vector. Note, however, that all tuple elements must be data containers themselves. Furthermore, they must all contain exactly the same number of observations.
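For a matrix and a vector, the per-observation linking described here resembles zipping the columns of the matrix with the elements of the vector. The sketch below uses only Base and checks the observation counts by hand; the name `linked` is made up for the example.

```julia
X = [1.0 2.0 3.0; 4.0 5.0 6.0]   # 3 observations with 2 features each
y = [10.0, 20.0, 30.0]           # 3 targets

# Both containers must hold the same number of observations.
size(X, 2) == length(y) || error("number of observations differs")

# Each element pairs one observation of X with the matching element of y.
linked = collect(zip(eachcol(X), y))
```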
As Vector of Batches
Another common use case is to iterate over the given data set in small equal-sized chunks. These chunks are usually referred to as mini-batches.
Not unlike ObsView, this package provides a vector-like type called BatchView, that can be used to treat any data container as a vector of equal-sized batches.
- BatchView <: DataView <: AbstractVector
  Lazy representation of a data container as a vector of batches. Each batch will contain an equal number of observations. In the case that the number of observations is not divisible by the specified (or inferred) batch-size, the remaining observations will be ignored.
  Any data access is delayed until getindex is called, and even getindex returns the result of datasubset(), which in general avoids data movement until getobs() is invoked.
- batchview(data[, size|maxsize][, count][, obsdim]) → BatchView
  Create a BatchView for the given data container. It will serve as a vector-like view into data, where every element of the vector points to a batch of size observations from data. The number of batches and the batch-size can be specified using (keyword) parameters count and size (or alternatively maxsize).
  In the case that the size of the dataset is not divisible by the specified (or inferred) size, the remaining observations will be ignored. If maxsize is provided instead of size, then the largest size no greater than maxsize that divides the number of observations evenly will be used, such that no observations are ignored.
  Parameters:
  - data – The object representing a data container.
  - size (Integer) – Optional. The exact number of observations in each batch.
  - maxsize (Integer) – Optional alternative to size. The maximal number of observations in each batch, such that all observations get used.
  - count (Integer) – Optional. The number of batches that should be used. This will also be the length of the return value.
  - obsdim – Optional. If it makes sense for the type of data, then obsdim can be used to specify which dimension of data denotes the observations. It can be specified in a type-stable manner as a positional argument, or as a more convenient keyword parameter. See Observation Dimension for more information.
Consider the following toy data-matrix X, which we will interpret as containing a total of 5 observations, where each observation consists of 2 features.
julia> X = rand(2, 5)
2×5 Array{Float64,2}:
0.226582 0.933372 0.505208 0.0443222 0.812814
0.504629 0.522172 0.0997825 0.722906 0.245457
Using a prime number for the total number of observations makes this data container a particularly interesting example for using batchview(). Unless we choose a batch-size of 1 or 5, there is no way to iterate over the whole data in terms of equally-sized batches. BatchView deals with such edge cases by ignoring the excess observations with an informative message.
julia> bv = batchview(X, size = 2)
INFO: The specified values for size and/or count will result in 1 unused data points
2-element batchview(::Array{Float64,2}, 2, 2, ObsDim.Last()) with element type SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
[0.226582 0.933372; 0.504629 0.522172]
[0.505208 0.0443222; 0.0997825 0.722906]
julia> bv = batchview(X, maxsize = 2)
5-element batchview(::Array{Float64,2}, 1, 5, ObsDim.Last()) with element type SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
[0.226582; 0.504629]
[0.933372; 0.522172]
[0.505208; 0.0997825]
[0.0443222; 0.722906]
[0.812814; 0.245457]
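The arithmetic behind size and maxsize can be written down in a few lines. The helpers `full_batches` and `fitting_size` below are hypothetical names invented for this sketch; they reproduce the numbers seen in the two examples above but are not taken from the package.

```julia
# Hypothetical helpers; the names are invented for this sketch.

# Exact batch size: number of full batches and number of unused observations.
full_batches(nobs, bsize) = (nbatches = div(nobs, bsize), unused = rem(nobs, bsize))

# Largest batch size <= maxsize that divides nobs evenly, so nothing is left over.
function fitting_size(nobs, maxsize)
    for s in min(maxsize, nobs):-1:1
        rem(nobs, s) == 0 && return s
    end
end

with_size = full_batches(5, 2)     # 2 batches, 1 observation unused
with_maxsize = fitting_size(5, 2)  # falls back to a batch size of 1
```

Since 5 is prime, any maxsize below 5 ends up at a batch size of 1, which is why the second view above has 5 elements.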
You can query the size of each batch by using the function batchsize() on any BatchView. The call below uses the view that was created with size = 2.
julia> batchsize(bv)
2
Similar to ObsView, a BatchView acts like a vector and can be used as such. The one big difference from the former is that each element is now a batch of X instead of a single observation.
julia> bv[2] # access second batch
2×2 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
0.505208 0.0443222
0.0997825 0.722906
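For a plain matrix, a batch is simply a view over a contiguous range of columns. The following Base-only sketch builds such views by hand, assuming the last dimension enumerates the observations and incomplete batches are dropped; it is not the package's implementation.

```julia
X = reshape(collect(1.0:10.0), 2, 5)   # 5 observations with 2 features each
bsize = 2
nbatches = div(size(X, 2), bsize)      # only complete batches are kept

# Batch b covers the columns (b-1)*bsize+1 through b*bsize.
batches = [view(X, :, (b - 1) * bsize + 1 : b * bsize) for b in 1:nbatches]
second_batch = batches[2]
```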
Naturally, batchview() also supports the optional parameter obsdim, which can be used to specify which dimension denotes the observation. If that concept of dimensionality does not make sense for the given data container, then obsdim can simply be omitted.
julia> bv = batchview(X', size = 2, obsdim = 1) # note the transpose
INFO: The specified values for size and/or count will result in 1 unused data points
2-element batchview(::Array{Float64,2}, 2, 2, ObsDim.Constant{1}()) with element type SubArray{Float64,2,Array{Float64,2},Tuple{UnitRange{Int64},Colon},false}:
[0.226582 0.504629; 0.933372 0.522172]
[0.505208 0.0997825; 0.0443222 0.722906]
So far we used the parameter size to explicitly specify how many observations we want to be in each batch. Alternatively, we can also use the parameter count to specify the total number of batches that we would like to use.
julia> bv = batchview(X, count = 4)
INFO: The specified values for size and/or count will result in 1 unused data points
4-element batchview(::Array{Float64,2}, 1, 4, ObsDim.Last()) with element type SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
[0.226582; 0.504629]
[0.933372; 0.522172]
[0.505208; 0.0997825]
[0.0443222; 0.722906]
julia> bv[2] # access second batch
2×1 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
0.933372
0.522172
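The numbers in this example are consistent with inferring the batch size by integer division of the number of observations by count. That rule is an assumption drawn from the output above rather than from the package source; as a quick check:

```julia
nobs, nbatches = 5, 4
bsize = div(nobs, nbatches)         # inferred batch size under this assumption
unused = nobs - bsize * nbatches    # observations that do not fit into any batch
```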
Note how in the above example, the inferred batch-size is 1. Arguably, this makes the resulting BatchView, bv, appear very similar to an ObsView. The big difference, though, is that BatchView preserves the shape on indexing. Consequently, each element of bv is a subtype of AbstractMatrix and not AbstractVector.
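The same distinction exists for ordinary array views in Base: indexing a dimension with an integer drops it, while indexing with a one-element range keeps it. A short illustration:

```julia
X = [1.0 3.0 5.0; 2.0 4.0 6.0]

as_obs = view(X, :, 2)      # integer index drops the dimension: a vector
as_batch = view(X, :, 2:2)  # range index keeps the dimension: a 2×1 matrix
```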
It is also possible to call batchview() with multiple data containers wrapped in a Tuple. Note, however, that all data containers must have the same total number of observations. Using a tuple this way will link those data containers together on a per-observation basis.
julia> y = rand(5)
5-element Array{Float64,1}:
0.11202
0.000341996
0.380001
0.505277
0.841177
julia> bv = batchview((X, y))
INFO: The specified values for size and/or count will result in 1 unused data points
2-element batchview(::Tuple{Array{Float64,2},Array{Float64,1}}, 2, 2, (ObsDim.Last(),ObsDim.Last())) with element type Tuple{SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true},SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}}:
([0.226582 0.933372; 0.504629 0.522172], [0.11202,0.000341996])
([0.505208 0.0443222; 0.0997825 0.722906], [0.380001,0.505277])
julia> bv[2]
([0.505208 0.0443222; 0.0997825 0.722906], [0.380001,0.505277])
As Vector of Sequences
Time series data, or more generally sequence data, often requires a special type of preparation in order to work with it in a machine learning experiment. The big difference from “normal” data is that in sequence data the observations are not independent of each other. For example, if you think of a piece of text as a sequence of words (i.e. each word is an observation), you’ll notice that there is an inherent order in the data.
julia> data = split("The quick brown fox jumps over the lazy dog")
9-element Array{SubString{String},1}:
"The"
"quick"
"brown"
"fox"
"jumps"
"over"
"the"
"lazy"
"dog"
Unlabeled Windows
There are some common scenarios when working with sequence data such as the above. Before we think about more practical use cases that require labeled windows, let us quickly consider the case where we would like to process our sequence data in equally sized windows.
- slidingwindow(data, size[, stride][, obsdim])
  Return a vector-like view of the data for which each element is a fixed size “window” of size adjacent observations. By default these windows are not overlapping.
  Parameters:
  - data – The object representing a data container.
  - size (Integer) – The number of observations in each window.
  - stride (Integer) – Optional. The step size between the starting observation of subsequent windows. Defaults to size.
  - obsdim – Optional. If it makes sense for the type of data, then obsdim can be used to specify which dimension of data denotes the observations. It can be specified in a type-stable manner as a positional argument, or as a more convenient keyword parameter. See Observation Dimension for more information.
It's worth pointing out that only complete windows are included in the output. This implies that it is possible for excess observations to be omitted from the view. The following code snippet shows an example that partitions 22 observations into 5 windows of size 4, where the last two observations are omitted.
julia> A = slidingwindow(1:22, 4)
5-element slidingwindow(::UnitRange{Int64}, 4) with element type SubArray{Int64,1,UnitRange{Int64},Tuple{UnitRange{Int64}},true}:
[1, 2, 3, 4]
[5, 6, 7, 8]
[9, 10, 11, 12]
[13, 14, 15, 16]
[17, 18, 19, 20]
Note that the values of the given data are not actually copied. Instead the function datasubset() is called when getindex is invoked. To actually get a copy of the data at some window use the function getobs().
julia> A[2]
4-element SubArray{Int64,1,UnitRange{Int64},Tuple{UnitRange{Int64}},true}:
5
6
7
8
julia> getobs(A, 2)
4-element Array{Int64,1}:
5
6
7
8
Up to this point the behaviour may be very reminiscent of a batchview(), but this is where the similarities end. The optional parameter stride can be used to specify the distance between the start elements of each adjacent window. By default the stride is equal to the window size.
julia> A = slidingwindow(1:22, 4, stride=2)
10-element slidingwindow(::UnitRange{Int64}, 4, stride = 2) with element type SubArray{Int64,1}:
[1, 2, 3, 4]
[3, 4, 5, 6]
[5, 6, 7, 8]
[7, 8, 9, 10]
[9, 10, 11, 12]
[11, 12, 13, 14]
[13, 14, 15, 16]
[15, 16, 17, 18]
[17, 18, 19, 20]
[19, 20, 21, 22]
julia> A = slidingwindow(data, 4, stride=2)
3-element slidingwindow(::Array{SubString{String},1}, 4, stride = 2) with element type SubArray{...}:
["The", "quick", "brown", "fox"]
["brown", "fox", "jumps", "over"]
["jumps", "over", "the", "lazy"]
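The window positions in these examples follow from simple index arithmetic: windows start at 1, 1 + stride, 1 + 2·stride, and so on, as long as a complete window still fits. The helper `window_ranges` below is a hypothetical name used only for this sketch and is not part of the package.

```julia
# Hypothetical helper: the index range covered by each complete window.
window_ranges(nobs, winsize; stride = winsize) =
    [i:(i + winsize - 1) for i in 1:stride:(nobs - winsize + 1)]

no_overlap = window_ranges(22, 4)              # stride defaults to the window size
overlapping = window_ranges(22, 4; stride = 2)

# Applied to the sentence from above, using views so that no words are copied.
words = split("The quick brown fox jumps over the lazy dog")
word_windows = [view(words, r) for r in window_ranges(length(words), 4; stride = 2)]
```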
Labeled Windows
Now that we have seen the general idea of slidingwindow, let us consider a more practical variation of it. A conceptually simple use case may be that we want to predict the next word in a sentence given all the words that came before it (e.g. for autocompletion).
An interesting aspect of sequence prediction is that we can transform an unlabeled sequence into a number of labeled sub-sequences. Let’s again use our original (unlabeled) data for this.
julia> data = split("The quick brown fox jumps over the lazy dog")
9-element Array{SubString{String},1}:
"The"
"quick"
"brown"
"fox"
"jumps"
"over"
"the"
"lazy"
"dog"
If we were to train a model that, given two words, should predict the next word, we would need to rearrange our data quite a bit. To make this process more convenient we provide a custom method for slidingwindow that expects a target-index function f as its first parameter.
- slidingwindow(f, data, size[, stride][, excludetarget][, obsdim])
  Return a vector-like view of the data for which each element is a tuple of two elements:
  - A fixed size “window” of size adjacent observations. By default these windows are not overlapping. This can be changed by explicitly specifying a stride.
  - A single target (or vector of targets) for the window. The content of the target(s) is defined by the target-index function f.
  Parameters:
  - f (Function) – A unary function that takes the index of the first observation in the current window and should return the index (or indices) of the associated target(s) for that window.
  - data – The object representing a data container.
  - size (Integer) – The number of observations in each window.
  - stride (Integer) – Optional. The step size between the starting observation of subsequent windows. Defaults to size.
  - excludetarget (Bool) – Optional. Should a target index returned by f also occur in the window, then setting this to true will make sure that such elements are removed from the window. Defaults to false.
  - obsdim – Optional. If it makes sense for the type of data, then obsdim can be used to specify which dimension of data denotes the observations. It can be specified in a type-stable manner as a positional argument, or as a more convenient keyword parameter. See Observation Dimension for more information.
Note that only complete and in-bound windows are included in the output, which implies that it is possible for excess observations to be omitted from the resulting view.
julia> A = slidingwindow(i->i+2, data, 2, stride=1)
7-element slidingwindow(::##9#10, ::Array{SubString{String},1}, 2, stride = 1) with element type Tuple{...}:
(["The", "quick"], "brown")
(["quick", "brown"], "fox")
(["brown", "fox"], "jumps")
(["fox", "jumps"], "over")
(["jumps", "over"], "the")
(["over", "the"], "lazy")
(["the", "lazy"], "dog")
julia> A = slidingwindow(i->i-1, data, 2, stride=1)
7-element slidingwindow(::##11#12, ::Array{SubString{String},1}, 2, stride = 1) with element type Tuple{...}:
(["quick", "brown"], "The")
(["brown", "fox"], "quick")
(["fox", "jumps"], "brown")
(["jumps", "over"], "fox")
(["over", "the"], "jumps")
(["the", "lazy"], "over")
(["lazy", "dog"], "the")
As hinted above, it is also allowed for f to return a vector of indices. This can be useful for emulating techniques such as skip-gram.
julia> A = slidingwindow(i->[i-2:i-1; i+1:i+2], data, 1)
5-element slidingwindow(::##11#12, ::Array{SubString{String},1}, 1) with element type Tuple{...}:
(["brown"], ["The", "quick", "fox", "jumps"])
(["fox"], ["quick", "brown", "jumps", "over"])
(["jumps"], ["brown", "fox", "over", "the"])
(["over"], ["fox", "jumps", "the", "lazy"])
(["the"], ["jumps", "over", "lazy", "dog"])
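The behaviour shown in these examples can be approximated with a short eager function. `labeled_windows` is a hypothetical name for this sketch; it assumes, based on the outputs above, that windows whose targets fall outside the data are skipped, and unlike slidingwindow it builds the whole result up front.

```julia
# Hypothetical eager sketch of labeled windows; not the package's implementation.
function labeled_windows(f, data, winsize; stride = winsize)
    n = length(data)
    out = []
    for i in 1:stride:(n - winsize + 1)
        t = f(i)                                  # target index or indices
        all(j -> 1 <= j <= n, t) || continue      # skip out-of-bounds targets
        push!(out, (view(data, i:(i + winsize - 1)), data[t]))
    end
    return out
end

words = split("The quick brown fox jumps over the lazy dog")
next_word = labeled_windows(i -> i + 2, words, 2; stride = 1)
prev_word = labeled_windows(i -> i - 1, words, 2; stride = 1)
skip_gram = labeled_windows(i -> [i-2:i-1; i+1:i+2], words, 1)
```

The window itself is a view, while `data[t]` copies the target(s); a lazy variant would defer both until indexing.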
Should it so happen that the targets overlap with the features, then the affected observation(s) will be present in both. To change this behaviour one can set the optional parameter excludetarget = true. This will remove the target(s) from the feature window.
julia> slidingwindow(i->i+2, data, 5, stride = 1, excludetarget = true)
5-element slidingwindow(::##17#18, ::Array{SubString{String},1}, 5, stride = 1) with element type Tuple{...}:
(["The", "quick", "fox", "jumps"], "brown")
(["quick", "brown", "jumps", "over"], "fox")
(["brown", "fox", "over", "the"], "jumps")
(["fox", "jumps", "the", "lazy"], "over")
(["jumps", "over", "lazy", "dog"], "the")
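In terms of indices, excluding the target amounts to removing the target index from the window's index range. The Base-only sketch below reproduces the first element of the output above; the variable names are made up for the illustration.

```julia
words = split("The quick brown fox jumps over the lazy dog")

start, winsize = 1, 5
window_idx = start:(start + winsize - 1)   # indices 1 through 5
target_idx = start + 2                     # same rule as i -> i + 2

kept_idx = setdiff(window_idx, target_idx) # window indices without the target
features = words[kept_idx]
target = words[target_idx]
```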