Background and Motivation¶

In this section we discuss what data partitioning entails and how one could go about approaching this task efficiently when performing it manually. Furthermore, we will outline some of the pitfalls one might encounter when doing so, which will lead to the motivation of the design decisions that this package follows.

When it comes to subsetting/partitioning index-based data into individual samples (which we call observations), or groups of samples (which we will refer to as batches or mini-batches) the problem at hand really breaks down to one core task: keeping track of indices.

To get a better understanding of what exactly we mean by “index-based data” and “tracking indices”, let us together explore how one would typically approach data partitioning scenarios without the use of external packages (i.e. by getting our hands dirty and coding it ourselves!). We will do so using a number of different but commonly used forms of data storage such as matrices and data frames.

Warning

This section and its sub-sections serve soley as example to explain the underyling problem of partitioning/subsetting data and further to motivate the solution provided by this package. As such this section is not intended as a guide on how to apply this package.

Two Kinds of Data Sources¶

In the context of this package, we differentiate between two “kinds” of data sources, which we will call iteration-based and index-based .

Iteration-based (aka Data iterator)

To belong to this group, a data source must implement the iterator interface. It may or may not know the total amount of observations it can provide, which means that knowing \(N\) is not necessary.

The key requirement for a iteration-based data source is that every iteration either returns a single observation or a batch of observations.

These kind of data sources are primarily used for either streaming data, continuous resampling, or for large/remote data sets where even storing the indices requires too much memory.

Index-based (aka Data Container)

For a data source to belong in this category it needs to be able to provide two things:

The total number of observations \(N\), that the data source contains.
A way to query a specific observation or set of observations. This must be done using indices, where every observation has a unique index \(i \in I\) assigned from the set of indices \(I = \{1, 2, ..., N\}\).

We will go into more detail about data sources and their differences in later sections. The key takeaway from this little discussion here is that these two kinds of data sources offer distinct challenges and need to be reasoned with differently.

For the rest of this document we will focus on working with index-based data.

A Manual Solution for Arrays¶

Matrices and other (multi-dimensional) arrays are one of the most commonly used data storage containers in Machine Learning. As such, it is quite likely that you will find (or have already found) yourself in the position of working with such data sooner or later.

Let’s say you are interested in working with the Iris data set in order to test some clustering or classification algorithm that you are working on. The package MLDataUtils provides a convenience function load_iris for loading the data set in array form. Calling this function will give us two variables, the feature-matrix X and the target-vector of labels Y.

julia> using MLDataUtils

julia> X, Y = load_iris();

The first variable, X, contains all our features, sometimes called independent variables or predictors. In this case each column of the matrix corresponds to a single observation, or sample. Each observation in X thus has 4 features each. These features represent some quantitative information known about the corresponding observation, which for the sake of keeping this document concise, is about the extend to which we will discuss their meaning in this tutorial.

julia> X
4×150 Array{Float64,2}:
 5.1  4.9  4.7  4.6  5.0  5.4  4.6  …  6.8  6.7  6.7  6.3  6.5  6.2  5.9
 3.5  3.0  3.2  3.1  3.6  3.9  3.4     3.2  3.3  3.0  2.5  3.0  3.4  3.0
 1.4  1.4  1.3  1.5  1.4  1.7  1.4     5.9  5.7  5.2  5.0  5.2  5.4  5.1
 0.2  0.2  0.2  0.2  0.2  0.4  0.3     2.3  2.5  2.3  1.9  2.0  2.3  1.8

The second variable, Y, denotes the labels (also often called classes or categories) of each observation. These terms are usually used in the context of predicting categorical variables, such as we do in this example. The more general term for Y, which also includes the case of numerical outcomes, is targets, responses, or dependent variables.

julia> Y
150-element Array{String,1}:
 "setosa"
 "setosa"
 "setosa"
 ⋮
 "virginica"
 "virginica"
 "virginica"

Together, X and Y represent our data set. Both variables contain 150 observations and the individual elements of the two variables are linked together through the corresponding observation-index. For example, the following code snippet shows how to access the 30-th observation of the data set.

julia> X[:, 30], Y[30]
([4.7,3.2,1.6,0.2],"setosa")

This link is an important detail that we need to keep in mind when thinking about how to partition our data set into subsets. The main lesson here is that whatever kind of sub-setting strategy we apply to one of the variables we need to apply the exact same sub-setting operation to the other one as well.

Now that we have our full data set we could consider splitting it into two differently sized subsets: a training set and a test set.

One naive and dangerous approach to achieve this is to do a “static” split, i.e. use the first \(n\) observations as training set and the remaining observations as test set. I say dangerous because this strategy makes a strong assumption that may not be true for the data we are working with (and in fact it is not true for the Iris data set). But more on that later.

To perform a static split we first need to decide how many observations we want in our training set and how many observations we would like to hold out on and put in our test set. It is often more convenient to think in terms of proportions instead of absolute numbers. Let’s say we decide on using 80% of our data for training. To split our data set in such a way, we first need to derive which elements of X and Y we need assign to each subset in order to accomplish this exact effect.

julia> idx_train = 1:floor(Int, 0.8 * 150)
1:120

julia> idx_test = (floor(Int, 0.8 * 150) + 1):150
121:150

As we can see, we made sure that the two ranges do not overlap, implying that our two subsets will be disjoint. At this point we can use these ranges as indices to subset our variables into a training and a test portion.

julia> X_train, Y_train = X[:, idx_train], Y[idx_train];
julia> size(X_train)
(4,120)

julia> X_test, Y_test = X[:, idx_test], Y[idx_test];
julia> size(X_test)
(4,30)

Note

To put this into perspective: In order to perform this type of static split using the provided functions of this package, one would type the following code:

(X_train,Y_train), (X_test,Y_test) = splitobs((X,Y), at = 0.8)

For more information take a look at the documentation for the function splitobs().

So far so good. For many data sets, this approach would actually work pretty fine. However - as we teased before - performing static splits is not necessarily a good idea if you are not sure that both your resulting subsets (individually!) would end up being representative of the complete data set or population under study.

The concrete issue in our current example is that the iris data set has structure in the order of its observations. In fact, the data set is ordered according to their labels. The first 50 observations all belong to the class setosa, the next 50 to versicolor, and the last 50 observation to virginica. Knowing that piece of trivia it is now plain to see that our supposed test set only contains observation that belong to the class virginica.

julia> Y_test
30-element Array{String,1}:
 "virginica"
 "virginica"
 "virginica"
 ⋮
 "virginica"
 "virginica"
 "virginica"

As a consequence our prediction results would not give us good estimates and chances are that any output we get from our model is completely nonsensical.

Tip

While it surely depends on the situation, as a rough guide we would advise to only use static splits in one of the following two situations:

You are absolutely confident that the order of the observations in your data set is random.
You are working with a data set for which there is a convention to use the last \(n\) observations as a test set or validation set.

Well, so we saw that a static split would not be a good idea for this data set. What we really want in our situation is a random assignment of each observation to one (and only one) of the two subsets. Turns out we can quite conveniently randomize the order of our observations by using the function shuffle.

julia> idx = shuffle(1:150)
150-element Array{Int64,1}:
  56
  41
 146
   ⋮
  90
   5
  13

The naive thing to do now would be to first create a shuffled version of our full data set using X[:,idx] and Y[idx] and then do a static split on the new shuffled version. That, however, would in general be quite inefficient as we would copy the data set around unnecessarily a few times before even using it for training our model. The data set usually takes up a lot more memory than just the indices, and if we think about it, we will see that reasoning with the indices is all we really need to do in order to accomplish our partitioning strategy.

Instead of first shuffling the whole data set, let us just perform a static split on idx, similar to how we initially did on the data directly. In other words we perform our static sub-setting on the indices in idx instead of the observations in data. This is already hinting to what we meant at the beginning of this document with “keeping track of indices”, since this concept of index-accumulation is quite powerful.

julia> idx_train = idx[1:floor(Int, 0.8 * 150)]
120-element Array{Int64,1}:
  56
  41
   ⋮
 121
   7

julia> idx_test = idx[(floor(Int, 0.8 * 150) + 1):150]
30-element Array{Int64,1}:
 102
  92
   ⋮
   5
  13

Using these new training- and test indices we can now construct our two data subsets as we did before, but this time we end up with randomly assigned observations for both.

julia> Y_test
30-element Array{String,1}:
 "virginica"
 "versicolor"
 ⋮
 "setosa"
 "setosa"

Very well! Now we have a training set and a test set. In many situations we may want to consider further sub-setting of our training set before feeding the subsets into some learning algorithm.

In a typical scenario we would be inclined to split our newly created training set into a smaller training set and a validation set, the later of which we would like to use to test the impact of our hyper-parameters on the prediction quality of our model. And if additionally we employ a stochastic learning algorithm, chances are that we also want to chunk our training data into equally sized mini-batches before feeding those individually into the training procedure.

Even though this is starting to sound rather complex, it turns out that all we really need to do is keep track of our indices properly. In other words, all these sub-setting of sub-sets can be done by just accumulating indices. The following code snippet shows how this could be achieved if implemented manually.

X, Y = load_iris()

# trainingset: 100 obs
# validationset: 20 obs
# testset: 30 obs
n_cv    = 120
n_train = 100

# randomly assign observations to either CV set or test set
# the CV set will later be divided into training and validation set
idx = shuffle(1:150)
idx_cv   = idx[1:n_cv]
idx_test = idx[(n_cv + 1):150]

# we will perform 10 different partitions of the CV set into
# a training and validation portion to get a better estimate
# NOTE: This is just a very rudimentary resampling strategy for
#       the sake of keeping this example simple.
for i = 1:10
    # each iteration we shuffle around the CV indices so that
    # a static split into training and validation set will be
    # the same as a random assignment
    shuffle!(idx_cv)
    idx_train = idx_cv[1:n_train]
    idx_val   = idx_cv[(n_train+1):n_cv]

    # iterate over our training set in 20 batches of batch-size 5
    for j = 1:20
        idx_batch = idx_train[(1:5) + (j*5-5)]

        # Now we actually allocate the current batch of data
        # that we need for our computation in this step.
        X_batch = X[:, idx_batch]
        Y_batch = Y[idx_batch]

        # ... train some model on current batch here ...
    end
end

I would argue that this code is still quite readable and we managed to delay accessing and sub-setting of our data set to the latest possible moment. Also note how we only copy the portion of the data that we actually need at that iteration.

The main point of this exercise is to show that nesting data access pattern can be reduced to just keeping track of indices. This is the core design principle that the access pattern of MLDataPattern follow.

Note

To put this into perspective: In order to perform this type of partitioning scheme using the provided functions of this package, one would type the following code:

cv, test = splitobs(shuffleobs((X,Y), at = 0.8)

for i = 1:10
    train, val = splitobs(shuffleobs(cv), at = 0.84)

    # iterate over our training set in 20 batches of batch-size 5
    for (X_batch, Y_batch) in eachbatch(train, 5)
        # ... train some model on current batch here ...
    end
end

For more information take a look at the documentation for the functions splitobs(), shuffleobs(), and eachbatch() respectively.

While this is already a decent enough implementation, we could further reduce our memory footprint by using views. We should not forget that that even if we only copy indices, we still copy around memory.

X, Y = load_iris()

# same as before
n_cv    = 120
n_train = 100

# instead of static splits create views into idx
idx = shuffle(1:150)
idx_cv   = view(idx, 1:n_cv)
idx_test = view(idx, (n_cv + 1):150)

# preallocate batch buffers. We will re-use them in every
# iteration to avoid temporary arrays
X_batch = zeros(Float64, 4, 5)
Y_batch = Y[1:5]

# We can create our training and validation views outside the loop,
# as their elements will be mutated when we shuffle idx_cv
idx_train = view(idx_cv, 1:n_train)
idx_val   = view(idx_cv, (n_train+1):n_cv)

for i = 1:10
    # this little trick will randomly assign observations to
    # either training set or validation set in each iteration
    shuffle!(idx_cv)

    for j = 1:20
        idx_batch = view(idx_train, (1:5) + (j*5-5))

        # copy the current batch of interest into a proper
        # array that is a continuous block of memory
        copy!(X_batch, view(X, :, idx_batch))
        # to be fair it makes less difference for an array
        # of strings, but you get the idea.
        copy!(Y_batch, view(Y, idx_batch))

        # .. train some model on current batch here ...
    end
end

In this version of the code we did quite a lot of micro-optimization, which at least on paper yields a cleaner solution to our task. While probably improving our performance a little, it did not really help readability of our code however. And if we end up with a bug somewhere we may have a nasty time deducing which little “trick” does not do what we thought it would.

Warning

These kind of hand-crafted micro-optimizations, while fun to think about, can be quite error prone. In some situations they may not even turn out to have been worth the effort when measuring its influence on the training time of your model. Keep that in mind when tinkering on a project. Premature optimization without profiling can cost a lot of valueable time and energy.

Now to the good part. MLDataPattern tries to do these kind of performance tricks for you in certain situations (specifically when working directly with DataSubset). So if it makes sense, our provided pattern try to avoid allocating unnecessary index-vectors. Naturally, one will always be able to hand craft some better optimized solution for some special use-case such as this one, but most of the time just avoiding common pitfalls will get you 80% of the way. With an interesting enough problem the other 20% of performance-gain you could achieve by dwelling on this issue would likely be negligible in relation to the training time of your learning algorithm.

Array Dimension for Observations¶

Before we move on from our array example to a data frame, let us briefly think about the “observation dimension” of some array. Let us consider the Iris data set again.

julia> X, Y = load_iris();

julia> size(X)
(4,150)

The variable X is our feature Matrix{Float64}, which in Julia is a typealias for a two dimensional array Array{Float64,2}. As such the variable has two dimensions that we can assign meaning to.

So far we acted on the convention that the first dimension encodes our features, and the second dimension encodes our observations. However, there is no law that dictates that this is the right way around. In fact it is much more common in the literature as well as other languages to have the first dimension encode the observations and the second dimension denote the features. This would also be much more relatable to how we organize some data in a spreadsheet.

Note

There is a good reason that you will often find the convention to use the last array dimension to encode the observations when working with Julia. This has to do with how Julia arrays access their memory. For more information on this topic take a look at the corresponding section in the Julia documentation

There have been many discussions on which convention is more useful and/or efficient, but the only answer you will find here is a humble it depends on what you are doing.

Consider the following scenario. Let’s say we would again like to work with the Iris dataset, but this time we use the RDatasets package to load it. This will give us the same data, but in a quite different data-storage type.

julia> using RDatasets

julia> iris = dataset("datasets", "iris")
150×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species     │
├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────────┤
│ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ "setosa"    │
│ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ "setosa"    │
│ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ "setosa"    │
│ 4   │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ "setosa"    │
⋮
│ 147 │ 6.3         │ 2.5        │ 5.0         │ 1.9        │ "virginica" │
│ 148 │ 6.5         │ 3.0        │ 5.2         │ 2.0        │ "virginica" │
│ 149 │ 6.2         │ 3.4        │ 5.4         │ 2.3        │ "virginica" │
│ 150 │ 5.9         │ 3.0        │ 5.1         │ 1.8        │ "virginica" │

There are two common ways of how to go about using such a data frame for some Machine Learning purposes:

Using a formula to compute a model-matrix and work with that. This is a typical approach for when one wants to use models that need numerical features, such as linear regression. By using a formula we can transform the categorical features to numerical ones using so-called dummy variables.
Using the data frame directly. Some models, such as decision trees, can deal with categorical features themself and don’t require the features in a numerical form.

Before we dive into the second scenario, let us consider building a model matrix. This will give us a motivating example to deal with different conventions for the observation dimension.

Without any explanation that does it justice, let us create a feature matrix X from the data frame iris using the following code snippet:

julia> X = ModelMatrix(ModelFrame(@formula(Species ~ SepalLength + SepalWidth + PetalLength + PetalWidth), iris)).m
150×5 Array{Float64,2}:
0  5.1  3.5  1.4  0.2
0  4.9  3.0  1.4  0.2
0  4.7  3.2  1.3  0.2
0  4.6  3.1  1.5  0.2
 ⋮
0  6.3  2.5  5.0  1.9
0  6.5  3.0  5.2  2.0
0  6.2  3.4  5.4  2.3
0  5.9  3.0  5.1  1.8

Notice two things. First, we now have a feature matrix X for which the first dimension (i.e. the rows) denotes the observations. Secondly, we ended up with 5 features for each observation, while in our previous example he had 4. This is because by default the model matrix is augmented with a constant variable that models can use to fit an intercept to. But that need not trouble us right now. The main point is that different tasks often have different conventions, and ideally we would like to have tools that can adapt to the current situation.

So how would this change of convention be reflected in our sub-setting strategy? Well, everywhere we previously wrote X[:, indices], we would now write X[indices, :]. This looks like a simple enough change, but it has the consequence that the reuse already written partitioning code can be rather limited without some more coding effort. And even then, what if next time we work with 3 or 4 dimensional arrays (e.g. image data)? Generalizing this concept requires careful considerations.

Note

To put this into perspective: In order to be able to diverge from the convention of using the last array dimension as observation, all relevant methods of this package have an optional parameter obsdim, which can be specified either as a positional and type-stable argument, or as a convenient keyword argument

train, test = splitobs(X, obsdim = 1)
train, test = splitobs(X, obsdim = :first)
train, test = splitobs(X, ObsDim.First())

For more information take a look at the section on Observation Dimension.

Generalizing to Other Data¶

So far we have discussed how to implement a solution to the task of partitioning some data that is in array form. We also showed that it is feasible to consider supporting different conventions for which dimension to use to denote the individual observations.

Now, what if we would like to work with data that is not in array form, such as data frames or any other kind of data container, really. Well, if we look back at the code snippets we have written so far, we will see that we haven’t actually specified any type- or structure requirement of the learning algorithm we are interested in. Indeed, we haven’t said much about any learning algorithm at all, only that it expects the data in mini-batches. Instead we focused on how to represent our array-like data-subset and even considered to buffer it efficiently by preallocating the subset storage.

Whatever kind of partitioning scheme we code, we would like it to be agnostic about our learning algorithm. What it should really care about is the type of data storage it is working with and how to communicate with it. Ideally we would like to abstract whatever information we need from our data and whatever action we need to perform with our data.

Turns out we only need our data container to expose two things:

How many observation the data contains.
A way to access the observations of a given index or indices.

Let’s consider data in the form of a data frame. We can query the total number of observations using nrow(iris), since each row contains a single observation. Further we can access the observations of some given indices idx using iris[idx, :]. That is all that is needed to make our first code snippet from the array example work with data frames (we leave the proof of this as an exercise). However, there are a few things to note.

When we access the observations of a given index we get a DataFrame in return. This makes sense for the data we are working with. Our learning algorithm may or may not support working with data frames, but that is not the responsibility of the partitioning logic.
Notice how no buffering of the mini-batches would occur in this case, as each access to a getindex of iris would create a new data frame. That said, we can’t do much better here because the lack of efficient buffering is a property of the type of data we are working with.

Great! At this point we know how to partition any data set that provides a way to query the number of observation it contains, and has a method available to access observations of specific indices. That does not free us from the burden of tracking the indices, however.

This is where MLDataPattern comes in.