Data Subsets

It is a common requirement in machine learning experiments to partition a data set in one way or another. At its essence, data partitioning can be thought of as a process that assigns observations to one or more subsets of the original data. This abstraction also holds for other important and widely used data access patterns in machine learning (e.g. over- and under-sampling of a labeled data set).

In other words, the core problem that needs to be addressed efficiently is how to create and represent a data subset in a generic way. Once we can subset arbitrary index-based data, more complicated tasks such as data partitioning, shuffling, or resampling can be expressed through data subsetting in a coherent manner.

Before we move on, let us quickly clarify what exactly we mean when we talk about a data “subset”. We don’t use the term “subset” in the mathematical sense of the word. Instead, when we attempt to subset some data source, what we are really interested in is a representation (i.e. a subset) of a specific sequence of observations from the original data source. We specify which observations we want to be part of this subset by using observation-indices from the set \(I = \{1,2,...,N\}\), where \(N\) is the total number of observations in our data source. This interpretation of “subset” implies the following:

We can only subset data sources that are considered data containers. Furthermore, a subset of a data container is again considered a data container.
When specifying a subset, the order of the requested observation-indices matters. That means that different index permutations will result in conceptually different “subsets”.
A subset can contain the exact same observation an arbitrary number of times (including zero). Furthermore, an observation can be part of multiple distinct subsets.
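The three implications above can be illustrated with plain Julia arrays. The following is only a sketch that uses Base's `view` as a stand-in for a subset type; it assumes that each column of the matrix is one observation and does not use the package itself.

```julia
# Base-only sketch: `view` stands in for the package's subset type.
X = [1.0 2.0 3.0 4.0;
     5.0 6.0 7.0 8.0]          # 4 observations (columns), 2 features each

a = view(X, :, [2, 4])         # observations 2 and 4
b = view(X, :, [4, 2])         # same observations, different order
c = view(X, :, [2, 2, 2])      # the same observation three times

a == b                         # false: the order of the indices matters
size(c)                        # (2, 3): repeated indices are allowed
c[:, 1] == X[:, 2]             # true
```

Note that `a`, `b`, and `c` are themselves arrays, so each of them could be subsetted again in the same way.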

We will spend the rest of this document discussing how to use this package to create data subsets and how to interact with them. After introducing the basics, we will go over the multiple high-level functions that create data subsets for you. These include splitting your data into train and test portions, shuffling your data, and resampling your data using a k-folds partitioning scheme.

Subsetting a Data Container

We have seen before that when confronted with a data container, nesting various subsetting operations really just breaks down to keeping track of the observation-indices. This in turn is much cheaper than copying observation-values around needlessly (see Background and Motivation for an in-depth discussion).

Ideally, when we “subset” a data container, what we want is a lazy representation of that subset. In other words, we would like to avoid copying the values of our data set around until we actually need them. To that end, we provide the function datasubset(), which tries to choose the most appropriate type of subset for the given data container.

datasubset(data[, idx][, obsdim])

Returns a lazy subset of the observations in data that correspond to the given index/indices in idx. No data will be copied except for the indices.

This function is similar to calling the constructor for DataSubset, with the main difference that datasubset() will return a SubArray if the type of data is an Array or SubArray. Furthermore, this function can be extended for custom types of data that also want to provide their own subset-type.

The returned subset will in general not be of the same type as the underlying observations it represents. If you want to query the actual observations corresponding to the given indices in their true form, use getobs() instead.

Parameters:

data – The object representing a data container.
idx –
Optional. The index or indices of the observation(s) in data that should be part of the subset. Can be of type Int or some subtype of AbstractVector{Int}. Defaults to 1:nobs(data,obsdim).
obsdim –
Optional. If it makes sense for the type of data, then obsdim can be used to specify which dimension of data denotes the observations. It can be specified in a type-stable manner as a positional argument, or as a more convenient keyword parameter. See Observation Dimension for more information.

Returns:

An object representing a lazy subset of data for the observation-indices in idx. The type of the return value depends on the type of data.
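To make the distinction between a lazy subset and the materialized observations more tangible, here is a minimal Base-only sketch. It assumes that a view behaves like the lazy subset and that ordinary indexing behaves like materializing the observations; neither line calls the package.

```julia
X = [1.0 2.0 3.0 4.0;
     5.0 6.0 7.0 8.0]

lazy  = view(X, :, [2, 4])   # stores only a reference to X and the indices
eager = X[:, [2, 4]]         # allocates a new 2×2 Array with copied values

X[1, 2] = -1.0               # mutate the original data afterwards

lazy[1, 1]                   # -1.0: the view reads through to X
eager[1, 1]                  # 2.0: the copy is unaffected
parent(lazy) === X           # true
```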

Out of the box, this package provides custom support for all subtypes of AbstractArray. With the exception of sparse arrays, we represent all subsets of arrays in the form of a SubArray. To give a concrete example of what we mean, let us consider the following random matrix X. We will think about it as a small data set that has 4 observations with 2 features each.

julia> X = rand(2,4)
2×4 Array{Float64,2}:
 0.226582  0.933372  0.505208   0.0443222
 0.504629  0.522172  0.0997825  0.722906

julia> datasubset(X, 2) # single observation at index 2
2-element SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true}:
 0.933372
 0.522172

julia> datasubset(X, [2,4]) # batch of 2 observations
2×2 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}:
 0.933372  0.0443222
 0.522172  0.722906

If there is more than one array dimension, all but the observation dimension are implicitly assumed to be features (i.e. part of that observation). As you can see in the example above, the default assumption is that the last array dimension enumerates the observations. This can be overridden by explicitly specifying the obsdim. In the following code snippet we treat X as a data set that has 2 observations with 4 features each.

julia> datasubset(X, 2, ObsDim.First())
4-element SubArray{Float64,1,Array{Float64,2},Tuple{Int64,Colon},true}:
 0.504629
 0.522172
 0.0997825
 0.722906

julia> datasubset(X, 2, obsdim = 1)
4-element SubArray{Float64,1,Array{Float64,2},Tuple{Int64,Colon},true}:
 0.504629
 0.522172
 0.0997825
 0.722906

Note how obsdim can either be provided using a type-stable positional argument from the namespace ObsDim, or by using a more flexible and convenient keyword argument. For more information, take a look at Observation Dimension.
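The effect of the observation dimension can also be sketched with Base's `selectdim`, which returns a view of an array along a chosen dimension. The helper `obsview` below is hypothetical and only illustrates the idea; it is not how the package implements obsdim.

```julia
X = [1.0 2.0 3.0 4.0;
     5.0 6.0 7.0 8.0]

# Hypothetical helper (not part of the package): observation `i` along `obsdim`.
# The default mirrors the convention that the last dimension holds observations.
obsview(A::AbstractArray, i::Integer; obsdim::Integer = ndims(A)) =
    selectdim(A, obsdim, i)

obsview(X, 2)               # second column: [2.0, 6.0]
obsview(X, 2, obsdim = 1)   # second row:    [5.0, 6.0, 7.0, 8.0]
```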

Remember that every data subset - which includes SubArray - is again a fully qualified data container. As such, it supports both nobs() and getobs().

julia> mysubset = datasubset(X, [2,4]) # batch of 2 observations
2×2 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}:
 0.933372  0.0443222
 0.522172  0.722906

julia> nobs(mysubset)
2

julia> getobs(mysubset)
2×2 Array{Float64,2}:
 0.933372  0.0443222
 0.522172  0.722906

Because a SubArray is also a data container, it can be subsetted even further by using datasubset() again. The result will be a new SubArray into the original data container X, which uses the accumulated indices of both subsetting steps. In other words, while subsetting operations can be nested, they will be combined into a single layer (i.e. you don’t want a subset of a subset of a subset represented as nested types).

julia> datasubset(mysubset, 1) # will still be a view into X
2-element SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true}:
 0.933372
 0.522172
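The accumulation of indices can be reproduced by hand. The sketch below uses only Base and assumes columns are observations; it shows that re-indexing the stored indices gives the same result as nesting two views, and that Base also keeps a single layer over the original array.

```julia
X = [1.0 2.0 3.0 4.0;
     5.0 6.0 7.0 8.0]

outer = [2, 4]                 # first subsetting step
inner = 1                      # second step, relative to the first subset

flat = outer[inner]            # accumulated index into X, here 2

nested = view(view(X, :, outer), :, inner)
parent(nested) === X           # true: still a view into the original array
nested == view(X, :, flat)     # true
```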

It is also possible to link multiple different data containers together on a per-observation level. This way they can be subsetted as one coherent unit. To do that, simply put all the relevant data containers into a single Tuple before passing it to datasubset() (or any other function that expects a data container). The return value will then be a Tuple of the same length, with the resulting data subsets in the same tuple position.

julia> X = rand(2,4)
2×4 Array{Float64,2}:
 0.226582  0.933372  0.505208   0.0443222
 0.504629  0.522172  0.0997825  0.722906

julia> y = rand(4)
4-element Array{Float64,1}:
 0.812814
 0.245457
 0.11202
 0.000341996

julia> datasubset((X,y), 2) # single observation at index 2
([0.933372,0.522172],0.24545709827626805)

julia> Xs, ys = datasubset((X,y), [2,4]) # batch of 2 observations
([0.933372 0.0443222; 0.522172 0.722906], [0.245457,0.000341996])

julia> Xs
2×2 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}:
 0.933372  0.0443222
 0.522172  0.722906

julia> ys
2-element SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}:
 0.245457
 0.000341996

It is worth pointing out that the tuple elements (i.e. data containers) need not be of the same type, nor of the same shape. You can observe this in the code above, where X is a Matrix while y is a Vector. Note, however, that all tuple elements must be data containers themselves. Furthermore, they must all contain the exact same number of observations. This is required even if the requested observation-index would be in-bounds for each data container individually.

julia> datasubset((rand(3), rand(4)), 2)
ERROR: DimensionMismatch("all data container must have the same number of observations")
[...]
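The following sketch shows one way such a check could look. The function `subset_together` is hypothetical, works only for arrays, and assumes that the last dimension of every array enumerates the observations; it is meant to illustrate the rule, not to reproduce the package's implementation.

```julia
# Hypothetical helper: subset several arrays with the same observation indices.
# Assumption: observations lie along the last dimension of each array.
function subset_together(containers::Tuple, idx)
    counts = map(c -> size(c, ndims(c)), containers)
    all(n -> n == first(counts), counts) ||
        throw(DimensionMismatch("all data containers must have the same number of observations"))
    # Keep every leading dimension whole and index only the last one.
    map(c -> view(c, ntuple(_ -> Colon(), ndims(c) - 1)..., idx), containers)
end

X = [1.0 2.0 3.0 4.0;
     5.0 6.0 7.0 8.0]
y = [10.0, 20.0, 30.0, 40.0]

Xs, ys = subset_together((X, y), [2, 4])
Xs   # columns 2 and 4 of X
ys   # [20.0, 40.0]
```

Calling `subset_together(([1, 2, 3], [1, 2, 3, 4]), 2)` throws a `DimensionMismatch`, even though index 2 is in-bounds for both vectors.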

When grouping data containers in a Tuple, it is of course possible to specify the obsdim for each data container. If all data containers share the same observation dimension, it suffices to specify it once.

julia> Xs, ys = datasubset((X,y), [2,4], obsdim = :last);

julia> Xs
2×2 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}:
 0.933372  0.0443222
 0.522172  0.722906

julia> ys
2-element SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}:
 0.245457
 0.000341996

Note that if obsdim is specified as a Tuple, then it needs to have the same number of elements as the Tuple of data containers.

julia> Xs, ys = datasubset((X,y), [2,4], obsdim = (2,1));

julia> Xs
2×2 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}:
 0.933372  0.0443222
 0.522172  0.722906

julia> ys
2-element SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}:
 0.245457
 0.000341996

Multiple obsdim can of course also be specified using type-stable positional arguments.

julia> Xs, ys = datasubset((X',y), [2,4], (ObsDim.First(),ObsDim.Last())); # note the transpose

julia> Xs
2×2 SubArray{Float64,2,Array{Float64,2},Tuple{Array{Int64,1},Colon},false}:
 0.933372   0.522172
 0.0443222  0.722906

julia> ys
2-element SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}:
 0.245457
 0.000341996

The DataSubset Type

So far we have only considered subsetting data containers of type Array. However, what if we want to subset some other data container that does not implement the AbstractArray interface? Naturally, we can’t just use SubArray to represent those subsets. For that reason we provide a generic type DataSubset, which serves as the default subset type for every data container that does not implement its own methods for datasubset().

class DataSubset

Used as the default type to represent a subset of some arbitrary data container. Its main task is to keep track of which observation-indices the subset spans. As such it is designed in a way that makes sure that subsequent subsettings are accumulated without needing to access the actual data.

The main purpose of DataSubset is to delay data access and data movement until an actual batch of data (or a single observation) is needed for some computation. This is particularly useful when the data is not located in memory, but on the hard drive or in some remote location. In such a scenario one wants to load the required data only when needed.

DataSubset(data[, idx][, obsdim]) → DataSubset

Create an instance of DataSubset that will represent a lazy subset of the observations in data corresponding to the given index/indices in idx. No data will be copied except for the indices.

If data is a DataSubset, then the indices of the subset will be combined with idx and consequently an accumulated DataSubset will be created and returned.

In general we advise using datasubset() instead of calling DataSubset() directly. This is because datasubset() will only invoke DataSubset() if there is no alternative choice of subset-type known for the given data.

Parameters:

data – The object representing a data container.
idx –
Optional. The index or indices of the observation(s) in data that should be part of the subset. Can be of type Int or some subtype of AbstractVector{Int}. Defaults to 1:nobs(data,obsdim).
obsdim –
Optional. If it makes sense for the type of data, then obsdim can be used to specify which dimension of data denotes the observations. It can be specified in a type-stable manner as a positional argument, or as a more convenient keyword parameter. See Observation Dimension for more information.

The type DataSubset can be used to represent a subset of any type of data container. This even includes arrays, which we have seen provide their own special type of subset.

julia> X = rand(2,4)
2×4 Array{Float64,2}:
 0.226582  0.933372  0.505208   0.0443222
 0.504629  0.522172  0.0997825  0.722906

julia> DataSubset(X, 2) # single observation at index 2
DataSubset(::Array{Float64,2}, ::Int64, ObsDim.Last())
 1 observations

julia> DataSubset(X, [2, 4]) # batch of 2 observations
DataSubset(::Array{Float64,2}, ::Array{Int64,1}, ObsDim.Last())
 2 observations

As you can see, a DataSubset does not tell you a lot of information about the observations it represents. The reason for this is that it was designed around the requirement of not needlessly accessing actual data unless requested using getobs(). That said, remember that every data subset is also a fully qualified data container. As such, it supports both nobs() and getobs().

julia> mysubset = DataSubset(X, [2, 4]) # batch of 2 observations
DataSubset(::Array{Float64,2}, ::Array{Int64,1}, ObsDim.Last())
 2 observations

julia> nobs(mysubset)
2

julia> getobs(mysubset) # request the data it represents
2×2 Array{Float64,2}:
 0.933372  0.0443222
 0.522172  0.722906

The real strength of the DataSubset type (or any data subset, really) is that it can be subsetted even further. The result will be a new DataSubset into the original data container X that uses the accumulated indices. In other words, while subsetting operations can be nested, they will be combined into a single layer (i.e. you don’t want a subset of a subset of a subset represented as nested types).

julia> mysubset2 = DataSubset(mysubset, 2) # second observation of mysubset
DataSubset(::Array{Float64,2}, ::Int64, ObsDim.Last())
 1 observations

julia> getobs(mysubset2) # request the data it represents
2-element Array{Float64,1}:
 0.0443222
 0.722906
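To illustrate how such a type can accumulate indices without touching the data, consider the following minimal sketch. The names `LazySubset`, `count_obs`, and `materialize` are hypothetical and chosen to avoid confusion with the package; the real DataSubset differs in detail (for example, it also stores the obsdim). Observations are assumed to be the columns of a matrix.

```julia
# Hypothetical type for illustration only.
struct LazySubset{D,I}
    data::D      # the original container, never copied
    indices::I   # observation indices into `data`
end

# Subsetting a subset re-indexes the stored indices instead of nesting types.
LazySubset(s::LazySubset, idx) = LazySubset(s.data, s.indices[idx])

count_obs(s::LazySubset) = length(s.indices)

# Only this call reads the data (assumption: columns are observations).
materialize(s::LazySubset) = s.data[:, s.indices]

X = [1.0 2.0 3.0 4.0;
     5.0 6.0 7.0 8.0]

s1 = LazySubset(X, [2, 4])   # batch of 2 observations
s2 = LazySubset(s1, 2)       # second observation of s1

s2.data === X                # true: still a single layer over X
s2.indices                   # 4
materialize(s2)              # [4.0, 8.0]
```

The design choice worth noting is that the second constructor never reads `s.data`; it only indexes into the stored indices, which is what keeps nested subsetting cheap.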

As you can see in the example above, DataSubset also stores the utilized obsdim. Because we are using an Array as example data container, the default assumption is that the last array dimension enumerates the observations. This can be overridden by explicitly specifying the obsdim. As always, the obsdim can be specified in a type-stable manner using a positional argument, or by using a more convenient keyword argument.

julia> mysubset = DataSubset(X', 2, obsdim = 1) # note the transpose
DataSubset(::Array{Float64,2}, ::Int64, ObsDim.Constant{1}())
 1 observations

julia> getobs(mysubset)
2-element Array{Float64,1}:
  0.933372
  0.522172

It is worth pointing out that DataSubset remembers the specified obsdim, which means that it is not required to specify it again for subsequent data access patterns. In contrast to this, a SubArray does not have the means to remember it, and as such one needs to specify the obsdim every time.

It is also possible to link multiple different data containers together on a per-observation level. This way they can be subsetted as one coherent unit. To do that, simply put all the relevant data containers into a single Tuple before passing it to DataSubset().

julia> X = rand(2,4)
2×4 Array{Float64,2}:
 0.226582  0.933372  0.505208   0.0443222
 0.504629  0.522172  0.0997825  0.722906

julia> y = rand(4)
4-element Array{Float64,1}:
 0.812814
 0.245457
 0.11202
 0.000341996

julia> Xs, ys = DataSubset((X,y), [2,4]) # batch of 2 observations
(DataSubset(::Array{Float64,2}, ::Array{Int64,1}, ObsDim.Last())
  2 observations,
 DataSubset(::Array{Float64,1}, ::Array{Int64,1}, ObsDim.Last())
  2 observations)

julia> getobs(Xs)
2×2 Array{Float64,2}:
 0.933372  0.0443222
 0.522172  0.722906

julia> getobs(ys)
2-element Array{Float64,1}:
 0.245457
 0.000341996

Note something subtle but important in the code snippet above. The constructor DataSubset() does not return a DataSubset when it is called with a tuple of data containers. Instead, it maps the constructor onto each data container individually. Thus if we invoke DataSubset() with a Tuple, it will return a Tuple of DataSubset.

Support for Custom Types

We have seen in the previous section what the type DataSubset is, and why it exists. We also mentioned that an end-user does not usually need to work with the constructor DataSubset() directly. Instead, we recommended always using datasubset().

You may ask yourself right now why we were using this DataSubset type in the first place. After all, we saw that calling the function datasubset() gave us a more convenient SubArray to work with. Well, as we hinted before, not every data container can be expected to be a subtype of AbstractArray. To get a better understanding of why we care about this, let us together explore the implications on a couple of commonly used data sources that are available in the Julia package ecosystem.

Example: DataFrames.jl

Note

If you are using MLDataUtils.jl then support for DataFrame is already provided for you.

Let’s consider a type of data source that is very different from an Array: a DataFrame from the DataFrames.jl package. By default, a DataFrame is not a data container, because it does not implement the required interface. We can change that, however.

julia> using DataFrames, LearnBase

julia> LearnBase.getobs(df::DataFrame, idx) = df[idx,:]

julia> StatsBase.nobs(df::DataFrame) = nrow(df)

With those two methods defined, every DataFrame is a fully qualified data container. This means that it can now be subsetted.

julia> df = DataFrame(x1 = rand(4), x2 = rand(4))
4×2 DataFrames.DataFrame
│ Row │ x1       │ x2        │
├─────┼──────────┼───────────┤
│ 1   │ 0.226582 │ 0.505208  │
│ 2   │ 0.504629 │ 0.0997825 │
│ 3   │ 0.933372 │ 0.0443222 │
│ 4   │ 0.522172 │ 0.722906  │

julia> mysubset = datasubset(df, [2,4])
DataSubset(::DataFrames.DataFrame, ::Array{Int64,1})
 2 observations

julia> getobs(mysubset)
2×2 DataFrames.DataFrame
│ Row │ x1       │ x2        │
├─────┼──────────┼───────────┤
│ 1   │ 0.504629 │ 0.0997825 │
│ 2   │ 0.522172 │ 0.722906  │

Notice how we used datasubset() here, instead of invoking the DataSubset() constructor directly. This is the recommended way of creating data subsets. The main difference is that datasubset() will try to choose the most appropriate type to represent a subset for the given container, while the constructor will always use DataSubset. For this example we did not specify any special kind of data subset for DataFrame, and thus the default DataSubset is used.

Example: DataTables.jl

Another good example of a custom data source is the DataTable from the DataTables.jl package. This rather new, table-like type is advertised as the “future of working with tabular data in Julia”. To make it more interesting after the DataFrame example, we will also make use of a native view-type called SubDataTable, which is a perfect candidate for a custom data subset type.

Not unlike DataFrame, a DataTable is by default not a data container, because it does not implement the required interface. We will again change that. In contrast to before, however, we will also implement a custom method for datasubset().

julia> using DataTables, LearnBase

julia> StatsBase.nobs(dt::AbstractDataTable) = nrow(dt)

julia> LearnBase.getobs(dt::AbstractDataTable, idx) = dt[idx,:]

julia> LearnBase.datasubset(dt::AbstractDataTable, idx, ::ObsDim.Undefined) = view(dt, idx)

It is worth pointing out that it is a current limitation that any custom method for datasubset() must also include the third parameter obsdim (even if it is undefined).

Now that we have the required interface implemented, every DataTable is regarded as a fully qualified data container. In contrast to the DataFrame example, it even has its own custom type for representing a data subset (note that we could also do the same thing for DataFrame using the type SubDataFrame).

julia> dt = DataTable(x1 = rand(4), x2 = rand(4))
4×2 DataTables.DataTable
│ Row │ x1       │ x2        │
├─────┼──────────┼───────────┤
│ 1   │ 0.226582 │ 0.505208  │
│ 2   │ 0.504629 │ 0.0997825 │
│ 3   │ 0.933372 │ 0.0443222 │
│ 4   │ 0.522172 │ 0.722906  │

julia> mysubset = datasubset(dt, [2, 4])
2×2 DataTables.SubDataTable{Array{Int64,1}}
│ Row │ x1       │ x2        │
├─────┼──────────┼───────────┤
│ 1   │ 0.504629 │ 0.0997825 │
│ 2   │ 0.522172 │ 0.722906  │

julia> datasubset(mysubset, 2) # subsetting a subset
1×2 DataTables.SubDataTable{Array{Int64,1}}
│ Row │ x1       │ x2       │
├─────┼──────────┼──────────┤
│ 1   │ 0.522172 │ 0.722906 │

julia> getobs(mysubset)
2×2 DataTables.DataTable
│ Row │ x1       │ x2        │
├─────┼──────────┼───────────┤
│ 1   │ 0.504629 │ 0.0997825 │
│ 2   │ 0.522172 │ 0.722906  │

One may ask why we go through this trouble if we could just use Base.view instead. Aside from the observation dimension aspect when working with arrays, there are good reasons for having such a neutral interface. After all, a data subset is just a means to an end. We will see in the following sections how higher-level functions can create various data subsets in much more useful ways than us just calling datasubset() ourselves. So once some data source supports the data container interface, all the high-level functionality that we will spend the rest of this document on comes with it for free.
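The benefit of a neutral interface can be sketched without the package. In the following illustration, `count_obs` and `lazy_subset` are hypothetical stand-ins for nobs() and datasubset(), and the "table" is just a NamedTuple of column vectors. The point is only that `first_half` is written once and works for both container types.

```julia
# Hypothetical two-function interface, standing in for nobs() and datasubset().
count_obs(A::AbstractMatrix) = size(A, 2)             # columns are observations
lazy_subset(A::AbstractMatrix, idx) = view(A, :, idx)

# A toy column table: a NamedTuple of equally long vectors.
count_obs(t::NamedTuple) = length(first(t))
lazy_subset(t::NamedTuple, idx) = map(col -> view(col, idx), t)

# Written once against the interface; works for every supported container.
first_half(data) = lazy_subset(data, 1:(count_obs(data) ÷ 2))

X = [1.0 2.0 3.0 4.0;
     5.0 6.0 7.0 8.0]
t = (x1 = [1.0, 2.0, 3.0, 4.0], x2 = [5.0, 6.0, 7.0, 8.0])

first_half(X)      # lazy view of columns 1:2 of X
first_half(t).x1   # lazy view of rows 1:2 of column x1
```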

Shuffling a Data Container

A vastly under-appreciated duty of any machine learning framework is shuffling a data set (or parts of a data set). Shuffling the order of the observations before training a model on that data set is important for various practical and well-known reasons. We still call it under-appreciated, however, because it is easy to implement “shuffling” inefficiently. That in turn can affect a lot of dependent functionality, especially if big data sets are involved. For example, it is not unusual for the shuffling to be performed very early in the ML pipeline. Depending on the design of the framework, this could cause a lot of unnecessary data movement.

In this package we follow the simple idea that the “shuffling” of a data set should be performed at the index level, and not at the observation level. What that means is that instead of copying or mutating the actual data, we simply create a lazy “subset” of that data using shuffled indices. As a consequence, the actual data remains untouched by the process until getobs() is called. In other words, while the resulting subset points to the same observations, it has the order of the indices shuffled. The function that implements this functionality is called shuffleobs().

shuffleobs(data[, obsdim])

Return a “subset” of data that spans the same exact observations, but has the order of those observations permuted.

The values of data itself are not copied. Instead only the indices are shuffled. This function calls datasubset() to accomplish that, which means that the return value is likely of a different type than data.

Parameters:

data – The object representing a data container.
obsdim –
Optional. If it makes sense for the type of data, then obsdim can be used to specify which dimension of data denotes the observations. It can be specified in a type-stable manner as a positional argument, or as a more convenient keyword parameter. See Observation Dimension for more information.
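The idea of shuffling at the index level can be sketched with Base and the Random standard library alone. This is an illustration under the assumption that columns are observations; it is not the package's implementation, and the fixed seed is only there to make the sketch reproducible.

```julia
using Random

X = [1.0 2.0 3.0 4.0;
     5.0 6.0 7.0 8.0]

perm = randperm(MersenneTwister(42), size(X, 2))   # shuffled observation indices
X_shuf = view(X, :, perm)                           # lazy: X itself is untouched

parent(X_shuf) === X              # true
sort(X_shuf[1, :]) == X[1, :]     # true: same observations, possibly new order
```

Only the four integers in `perm` are created by the shuffle; the values of `X` are read when `X_shuf` is indexed or copied.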

This is where we will start to see the subtle beauty of the package design. We have previously discussed in some detail how to interact with (and subset) data containers such as Array, DataTable, and DataFrame. Let us now take a look at what it means to “shuffle” each of those. First, we will consider a plain Julia Array.

julia> X = rand(2,4)
2×4 Array{Float64,2}:
 0.226582  0.933372  0.505208   0.0443222
 0.504629  0.522172  0.0997825  0.722906

julia> X_shuf = shuffleobs(X) # each column is an observation
2×4 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}:
 0.933372  0.505208   0.0443222  0.226582
 0.522172  0.0997825  0.722906   0.504629

julia> getobs(X_shuf) # copy into an Array
2×4 Array{Float64,2}:
 0.933372  0.505208   0.0443222  0.226582
 0.522172  0.0997825  0.722906   0.504629

julia> shuffleobs(X, obsdim = 1) # each row is an observation
2×4 SubArray{Float64,2,Array{Float64,2},Tuple{Array{Int64,1},Colon},false}:
 0.504629  0.522172  0.0997825  0.722906
 0.226582  0.933372  0.505208   0.0443222

As we can see, shuffleobs() returns a SubArray instead of an Array. As such, it still points at the data in X. To get the actual data as a proper Array (e.g. for memory locality) we can use getobs() on the result. Also note how the result of shuffleobs() depends on the specified obsdim. This is because we just want to permute the order of the observations, not the features.

Next we will take a look at what happens when we call shuffleobs() with a DataTable. Note that for this to work it is required that the data container interface is implemented (which we did as an exercise in Example: DataTables.jl).

julia> dt = DataTable(x1 = rand(4), x2 = rand(4))
4×2 DataTables.DataTable
│ Row │ x1       │ x2        │
├─────┼──────────┼───────────┤
│ 1   │ 0.226582 │ 0.505208  │
│ 2   │ 0.504629 │ 0.0997825 │
│ 3   │ 0.933372 │ 0.0443222 │
│ 4   │ 0.522172 │ 0.722906  │

julia> dt_shuf = shuffleobs(dt)
4×2 DataTables.SubDataTable{Array{Int64,1}}
│ Row │ x1       │ x2        │
├─────┼──────────┼───────────┤
│ 1   │ 0.933372 │ 0.0443222 │
│ 2   │ 0.504629 │ 0.0997825 │
│ 3   │ 0.226582 │ 0.505208  │
│ 4   │ 0.522172 │ 0.722906  │

julia> getobs(dt_shuf)
4×2 DataTables.DataTable
│ Row │ x1       │ x2        │
├─────┼──────────┼───────────┤
│ 1   │ 0.933372 │ 0.0443222 │
│ 2   │ 0.504629 │ 0.0997825 │
│ 3   │ 0.226582 │ 0.505208  │
│ 4   │ 0.522172 │ 0.722906  │

Note how the actual code did not change much, even though a DataTable is quite different from an Array. We can again observe how shuffleobs() did not return a new DataTable, but instead a lazy view in the form of a SubDataTable.

To mix it up a little, let us take a look at a data container that does not provide its own type of data subset: a DataFrame. Note that for the following code to work, it is required that the data container interface is implemented (which we did as an exercise in Example: DataFrames.jl).

julia> df = DataFrame(x1 = rand(4), x2 = rand(4))
4×2 DataFrames.DataFrame
│ Row │ x1       │ x2        │
├─────┼──────────┼───────────┤
│ 1   │ 0.226582 │ 0.505208  │
│ 2   │ 0.504629 │ 0.0997825 │
│ 3   │ 0.933372 │ 0.0443222 │
│ 4   │ 0.522172 │ 0.722906  │

julia> df_shuf = shuffleobs(df)
DataSubset(::DataFrames.DataFrame, ::Array{Int64,1})
 4 observations

julia> getobs(df_shuf)
4×2 DataFrames.DataFrame
│ Row │ x1       │ x2        │
├─────┼──────────┼───────────┤
│ 1   │ 0.933372 │ 0.0443222 │
│ 2   │ 0.504629 │ 0.0997825 │
│ 3   │ 0.226582 │ 0.505208  │
│ 4   │ 0.522172 │ 0.722906  │

Admittedly, the result of shuffleobs() does not look as intuitive or informative in this example. It does, however, do its job perfectly, which is avoiding data access. This property of DataSubset is particularly useful if our data container is some interface to a big remote data set. In such a case we would like to avoid loading any data until we really need it.

Aside from a common interface for different data types, the real power of using shuffleobs() is in linking multiple data containers together on a per-observation level. This way they can be shuffled as one coherent unit. To do that, simply put all the relevant data containers into a single Tuple before passing it to shuffleobs(). For example, let’s say that our features are contained in a DataTable and the targets are stored in a separate Vector.

julia> dt = DataTable(x1 = rand(4), x2 = rand(4))
4×2 DataTables.DataTable
│ Row │ x1       │ x2        │
├─────┼──────────┼───────────┤
│ 1   │ 0.226582 │ 0.505208  │
│ 2   │ 0.504629 │ 0.0997825 │
│ 3   │ 0.933372 │ 0.0443222 │
│ 4   │ 0.522172 │ 0.722906  │

julia> y = rand(4)
4-element Array{Float64,1}:
 0.812814
 0.245457
 0.11202
 0.000341996

julia> dt_shuf, y_shuf = shuffleobs((dt, y))
(4×2 DataTables.SubDataTable{Array{Int64,1}}
│ Row │ x1       │ x2        │
├─────┼──────────┼───────────┤
│ 1   │ 0.504629 │ 0.0997825 │
│ 2   │ 0.933372 │ 0.0443222 │
│ 3   │ 0.522172 │ 0.722906  │
│ 4   │ 0.226582 │ 0.505208  │,[0.245457,0.11202,0.000341996,0.812814])

julia> typeof(y_shuf)
SubArray{Float64,1,Array{Float64,1},Tuple{Array{Int64,1}},false}

As we can see, the observations in dt and y are both shuffled in the same manner. Thus the per-observation link is preserved and we can continue to treat it as a single data set.
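The reason the link survives is that a single permutation is applied to every container. A Base-only sketch of that idea, assuming a feature matrix with observations as columns and a target vector constructed so that the link is easy to verify:

```julia
using Random

X = [1.0 2.0 3.0 4.0;
     5.0 6.0 7.0 8.0]
y = [10.0, 20.0, 30.0, 40.0]     # by construction y[i] == 10 * X[1, i]

perm = randperm(size(X, 2))      # one permutation shared by both containers
X_shuf = view(X, :, perm)
y_shuf = view(y, perm)

# Every shuffled target still belongs to its shuffled feature column.
all(y_shuf[i] == 10 * X_shuf[1, i] for i in eachindex(y_shuf))   # true
```

Drawing a separate permutation for `y` would break this relationship for most permutations, which is why the containers are shuffled together rather than one at a time.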

Splitting into Train and Test

Some data preparation tasks, such as partitioning the data set into a training-, (validation-,) and test-set, are often performed offline or sometimes even predefined by a third party (e.g. the initial authors of a benchmark data set). That said, it is useful to efficiently and conveniently be able to split a given data set into differently sized subsets. For that purpose, this package provides a function called splitobs(). As the name subtly hints, this function does not shuffle the content, but instead performs a static split at the relative position specified in at.

To begin with, splitobs() provides a method to pre-compute a partition that is applicable to any data set of some fixed size.

splitobs(n[, at = 0.7]) → Tuple

Compute the indices for two disjoint subsets and return them as a tuple of two ranges. The first range will span the first at fraction of possible indices, while the second range will cover the rest. These indices are applicable to any data container of size n.

Parameters:

n (Integer) – Total number of observations to compute the partition indices for.
at (AbstractFloat) –
Optional. The fraction of observations that should be in the first subset. Must be in the interval (0,1). Can be specified as positional or keyword argument. Defaults to 0.7 (i.e. 70% of the observations in the first subset).

The following code snippet will pre-compute the subset indices for a training and a test portion of some data set that has 100 observations in it. The training indices will cover 70% of the observations, while the test indices will cover the other 30%.

julia> train_idx, test_idx = splitobs(100, at = 0.7)
(1:70,71:100)
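How such index ranges might be computed can be sketched in a few lines. The helper `split_indices` is hypothetical, and rounding `at * n` to the nearest integer is an assumption; it is chosen here because it reproduces the split sizes shown in the examples of this document, not because the package is known to do exactly this.

```julia
# Hypothetical helper, not the package's splitobs().
function split_indices(n::Integer, at::AbstractFloat = 0.7)
    0 < at < 1 || throw(ArgumentError("at must be in the interval (0,1)"))
    # Assumption: round to the nearest integer, keeping the first range non-empty.
    n1 = clamp(round(Int, at * n), 1, n)
    return 1:n1, (n1 + 1):n
end

train_idx, test_idx = split_indices(100, 0.7)   # (1:70, 71:100)
```

Because the result consists of two ranges, no index vector is allocated, regardless of how large `n` is.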

These pre-computed indices could then be used to create the subsets of some data container manually. Naturally, most of the time it would be much more convenient to just specify the data and have the function do all the work. To that end we provide a more convenient method for splitobs() as well.

splitobs(data[, at = 0.7][, obsdim]) → Tuple

Split the given data into two disjoint subsets and return them as a Tuple. The first subset contains the fraction at of observations in data, and the second subset contains the rest.

Note that this function will perform the splits statically and thus not perform any shuffling or sampling. If you want to perform a random assignment of observations to the subsets, you can use the function in combination with shuffleobs().

Parameters:

data – The object representing a data container.
at (AbstractFloat) –
Optional. The fraction of observations that should be in the first subset. Must be in the interval (0,1). Can be specified as positional or keyword argument. Defaults to 0.7 (i.e. 70% of the observations in the first subset).
obsdim –
Optional. If it makes sense for the type of data, then obsdim can be used to specify which dimension of data denotes the observations. It can be specified in a type-stable manner as a positional argument, or as a more convenient keyword parameter. See Observation Dimension for more information.

Let’s consider an example feature matrix X in the form of an Array, which has 8 observations with 2 features each.

julia> X = rand(2, 8)
2×8 Array{Float64,2}:
 0.226582  0.933372  0.505208   0.0443222  0.812814  0.11202      0.380001  0.841177
 0.504629  0.522172  0.0997825  0.722906   0.245457  0.000341996  0.505277  0.326561

We can split this data container into two subsets by calling splitobs() with the desired relative split point. If at is specified as a floating point number, then the return-value will be a Tuple with two elements (i.e. subsets), in which the first subset contains the fraction of observations specified by at and the second subset contains the rest.

In the following code the first subset train will contain the first 60% of the observations and the second subset test the rest. Note how we can provide the split point at as either a type-stable positional argument, or as a more descriptive keyword argument.

julia> train, test = splitobs(X, at = 0.6); # or splitobs(X, 0.6)

julia> train
2×5 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
 0.226582  0.933372  0.505208   0.0443222  0.812814
 0.504629  0.522172  0.0997825  0.722906   0.245457

julia> test
2×3 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
 0.11202      0.380001  0.841177
 0.000341996  0.505277  0.326561
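
The `SubArray` types in the output above indicate that both subsets are views into `X` rather than copies. Assuming the observations lie along the last dimension, roughly the same result can be obtained manually with Base Julia's `view` and a pair of index ranges. The ranges below are written out by hand for the 5/3 split shown above; this is a sketch of the idea, not the package's internals.

```julia
X = rand(2, 8)

# Index ranges for the 5/3 split of 8 observations shown above
train_idx, test_idx = 1:5, 6:8

# `view` creates a SubArray that references the parent's memory
train = view(X, :, train_idx)
test  = view(X, :, test_idx)

# Because the subsets are views, a write to the parent is visible through them
X[1, 1] = -1.0
train[1, 1]  # reflects the change made to X
```

Working with views keeps subsetting cheap for large arrays, at the cost that the subsets stay tied to the original data.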

It is worth pointing out explicitly that splitobs() works for any type that implements the data container interface. The following code shows a concrete example using a DataTable (see Example: DataTables.jl to make the following code work).

julia> dt = DataTable(x1 = rand(4), x2 = rand(4))
4×2 DataTables.DataTable
│ Row │ x1       │ x2        │
├─────┼──────────┼───────────┤
│ 1   │ 0.226582 │ 0.505208  │
│ 2   │ 0.504629 │ 0.0997825 │
│ 3   │ 0.933372 │ 0.0443222 │
│ 4   │ 0.522172 │ 0.722906  │

julia> train, test = splitobs(dt, at = 0.8);

julia> train
3×2 DataTables.SubDataTable{UnitRange{Int64}}
│ Row │ x1       │ x2        │
├─────┼──────────┼───────────┤
│ 1   │ 0.226582 │ 0.505208  │
│ 2   │ 0.504629 │ 0.0997825 │
│ 3   │ 0.933372 │ 0.0443222 │

julia> test
1×2 DataTables.SubDataTable{UnitRange{Int64}}
│ Row │ x1       │ x2       │
├─────┼──────────┼──────────┤
│ 1   │ 0.522172 │ 0.722906 │

Naturally, splitobs() also supports the optional parameter obsdim, which is especially useful for arrays. It can be specified as either a positional argument, or as a keyword argument. See Observation Dimension for more information.

julia> train, test = splitobs(X', at = 0.6, obsdim = 1); # note the transpose

julia> train
5×2 SubArray{Float64,2,Array{Float64,2},Tuple{UnitRange{Int64},Colon},false}:
 0.226582   0.504629
 0.933372   0.522172
 0.505208   0.0997825
 0.0443222  0.722906
 0.812814   0.245457

julia> test
3×2 SubArray{Float64,2,Array{Float64,2},Tuple{UnitRange{Int64},Colon},false}:
 0.11202   0.000341996
 0.380001  0.505277
 0.841177  0.326561
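
Conceptually, the observation dimension only determines along which axis the index ranges are applied. The following sketch mimics that with Base Julia's `selectdim`; the helper name `subset_along` is hypothetical, and this is not meant to describe how the package implements `obsdim`.

```julia
# Hypothetical helper: take the observations `idx` along dimension `obsdim`.
# `selectdim` returns a view, so no data is copied.
subset_along(A::AbstractArray, obsdim::Integer, idx) = selectdim(A, obsdim, idx)

X  = rand(2, 8)        # observations are columns (last dimension)
Xt = permutedims(X)    # 8×2: observations are now rows (first dimension)

train_cols = subset_along(X, 2, 1:5)    # 2×5
train_rows = subset_along(Xt, 1, 1:5)   # 5×2

# Both subsets hold the same observations, just laid out differently
same_observations = train_rows == permutedims(train_cols)
```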

It is also possible to call splitobs() with multiple data containers wrapped in a Tuple, all of which must have the same total number of observations. This will link the data containers together on a per-observation basis. Consider the following example feature matrix X and the corresponding target vector y. Note how both data containers have 8 observations.

julia> X = rand(2,8)
2×8 Array{Float64,2}:
 0.226582  0.933372  0.505208   0.0443222  0.812814  0.11202      0.380001  0.841177
 0.504629  0.522172  0.0997825  0.722906   0.245457  0.000341996  0.505277  0.326561

julia> y = rand(8)
8-element Array{Float64,1}:
 0.810857
 0.850456
 0.478053
 0.179066
 0.44701
 0.219519
 0.677372
 0.746407

We can pass both data containers to splitobs() using a tuple to group them together. The result of calling the function will still be a tuple just like in the examples we have seen so far.

julia> train, test = splitobs((X, y), at = 0.6);

Unlike previous examples, however, both train and test will themselves be tuples as well. Their elements and order will correspond to the elements of the given data container tuple passed to splitobs() (here (X, y)). We can see this explicitly by destructuring their elements into variables.

julia> (x_train,y_train), (x_test,y_test) = splitobs((X, y), at = 0.6); # same, but destructured

julia> x_train
2×5 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
 0.226582  0.933372  0.505208   0.0443222  0.812814
 0.504629  0.522172  0.0997825  0.722906   0.245457

julia> y_train
5-element SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}:
 0.810857
 0.850456
 0.478053
 0.179066
 0.44701

julia> x_test
2×3 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
 0.11202      0.380001  0.841177
 0.000341996  0.505277  0.326561

julia> y_test
3-element SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}:
 0.219519
 0.677372
 0.746407
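
The per-observation link between the containers boils down to applying the same index ranges to every element of the tuple. The sketch below illustrates this with Base Julia only. The helpers `nobs_last` and `split_linked` are hypothetical names, the rounding rule is an assumption, and observations are assumed to lie along the last dimension of each array.

```julia
# Hypothetical helper: number of observations, assuming they lie along
# the last dimension of the array
nobs_last(A::AbstractArray) = size(A, ndims(A))

# Hypothetical helper: split every container in the tuple at the same position
function split_linked(containers::Tuple, at::AbstractFloat)
    n = nobs_last(first(containers))
    all(c -> nobs_last(c) == n, containers) ||
        throw(DimensionMismatch("all containers must have the same number of observations"))
    n1 = clamp(round(Int, at * n), 1, n)   # assumed rounding rule
    # Apply the same observation indices to each container (as views)
    pick(idx) = map(c -> selectdim(c, ndims(c), idx), containers)
    return pick(1:n1), pick((n1 + 1):n)
end

X = rand(2, 8)
y = rand(8)
(x_train, y_train), (x_test, y_test) = split_linked((X, y), 0.6)
```

Since every container is indexed with identical ranges, the i-th column of `x_train` always belongs to the i-th element of `y_train`.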

As we can see in all previous examples, the function performs a static split and not a random assignment. This may not always be what we really want. For that purpose, this package provides a function called shuffleobs(), which we introduced in an earlier section. Using shuffleobs() in combination with splitobs() will result in a random assignment of observations to the data partitions.

julia> train, test = splitobs(shuffleobs(X), at = 0.6);

julia> train
2×5 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}:
 0.841177  0.812814  0.226582  0.11202      0.933372
 0.326561  0.245457  0.504629  0.000341996  0.522172

julia> test
2×3 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}:
 0.0443222  0.380001  0.505208
 0.722906   0.505277  0.0997825
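
Note that the subsets above are indexed by an `Array{Int64,1}` instead of a `UnitRange`, which is what one would expect when a static split is applied on top of a permutation of the observation indices. A minimal Base Julia sketch of that idea, using `randperm` from the `Random` standard library, could look as follows. The fixed seed and the hand-written split position are assumptions made for the example, and the code is not the package's implementation of shuffleobs().

```julia
using Random

X = rand(2, 8)
n = size(X, 2)

# Shuffle the observation indices, then split the shuffled order statically
rng  = MersenneTwister(42)   # fixed seed so that the split is reproducible
perm = randperm(rng, n)

train = view(X, :, perm[1:5])   # first 5 shuffled observations
test  = view(X, :, perm[6:n])   # remaining 3 shuffled observations
```

Every observation still ends up in exactly one of the two subsets; only the assignment is random.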

So far we have only considered how to partition one or more data containers into exactly two disjoint data subsets. The function splitobs() can, however, partition data into an arbitrary number of subsets. To partition the given data into \(N\) subsets, you simply need to specify a tuple of \(N-1\) fractions. The sum of all fractions must be in the interval (0,1).

splitobs(data, at[, obsdim]) → NTuple

Split the given data into multiple disjoint subsets with sizes proportional to the value(s) of at.

Note that this function will perform the splits statically and thus not perform any randomization. The function creates an NTuple of data subsets in which the first \(N-1\) elements/subsets contain the fraction of observations from data that is specified by the values in at. The last tuple element will then contain the rest of the data.

Parameters:

data – The object representing a data container.
at (Tuple) –
Tuple of fractions. All elements must be positive and their sum must be in the interval (0,1).
obsdim –
Optional. If it makes sense for the type of data, then obsdim can be used to specify which dimension of data denotes the observations. It can be specified in a type-stable manner as a positional argument, or as a more convenient keyword parameter. See Observation Dimension for more information.

Creating more than two data subsets is particularly convenient for creating an additional validation set. In the following example train will contain the first 50% of the observations, val will have the next 40%, and test the last 10%. For the 8 observations used here this amounts to 4, 3, and 1 observations, respectively.

julia> X = rand(2,8)
2×8 Array{Float64,2}:
 0.226582  0.933372  0.505208   0.0443222  0.812814  0.11202      0.380001  0.841177
 0.504629  0.522172  0.0997825  0.722906   0.245457  0.000341996  0.505277  0.326561

julia> train, val, test = splitobs(X, at = (0.5, 0.4));

julia> train
2×4 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
 0.226582  0.933372  0.505208   0.0443222
 0.504629  0.522172  0.0997825  0.722906

julia> val
2×3 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
 0.812814  0.11202      0.380001
 0.245457  0.000341996  0.505277

julia> test
2×1 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
 0.841177
 0.326561
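
The subset sizes above (4, 3, and 1 out of 8 observations) can be reproduced by rounding the cumulative fractions to index boundaries. The following sketch shows one way to do so. The name `multi_split_indices` is hypothetical, and rounding the cumulative sums is an assumption that happens to match this example; the package may compute the boundaries differently.

```julia
# Hypothetical helper: turn a tuple of N-1 fractions into N contiguous
# index ranges that together cover 1:n.
function multi_split_indices(n::Integer, at::NTuple{N,AbstractFloat}) where {N}
    all(a -> a > 0, at) ||
        throw(ArgumentError("all fractions must be positive"))
    0 < sum(at) < 1 ||
        throw(ArgumentError("the fractions must sum to a value in (0,1)"))
    # Assumed rule: the end of each subset is the rounded cumulative fraction
    stops  = [round(Int, c * n) for c in cumsum(collect(at))]
    starts = [1; stops .+ 1]
    stops  = [stops; n]        # the last subset receives the rest
    return Tuple(s:e for (s, e) in zip(starts, stops))
end

train_idx, val_idx, test_idx = multi_split_indices(8, (0.5, 0.4))
```

Here `train_idx`, `val_idx`, and `test_idx` are the ranges `1:4`, `5:7`, and `8:8`, which could then be used to create views into `X` along the observation dimension.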

While the ability to partition a data set this way is very useful, a fixed validation set is rarely the best approach for estimating a model’s performance on the held-out test set. In Repartitioning Strategies we will introduce various alternatives, including \(k\)-folds. These usually allow for a more effective use of the training data.