Getting Started

Typical machine learning experiments require a lot of rather mundane but error prone data handling glue code. One particularly interesting category of data handling functionality - and the sole focus of this package - are what we call data access pattern. These “pattern” include subsetting, resampling, iteration, and partitioning of various types of data sets.

MLDataPattern was designed around the core requirment of providing first class support for user-defined data sources. This idea is based on the assumption that the data source a user is working with, is likely of some very user-specific custom type. That said, we also put a lot of attention into first class support for the most commonly used data containers, such as Array.

In this section we will provide a condensed overview of the package functionality. We will cover the topics in about the same relative order as the main documentation. To keep this overview concise, we will not discuss anything here in detail. We will, however, link to the appropriate places. Please take a look at the corresponding sections for more information about a function or concept.

Installation

To install MLDataPattern.jl, start up Julia and type the following code snippet into the REPL. It makes use of the native Julia package manger.

Pkg.add("MLDataPattern")

Additionally, for example if you encounter any sudden issues, or in the case you would like to contribute to the package, you can manually choose to be on the latest (untagged) version.

Pkg.checkout("MLDataPattern")

Overview

If there is one requirement that almost all machine learning experiments have in common, it is that they have to interact with “data” in one way or the other. After all, the goal is for a program to learn from the implicit information contained in that data. Consequently, it is of no surprise that over time a number of particularly useful pattern emerged for how to utilize this data effectively. For instance, we learned that we should leave a subset of the available data out of the training process in order to spot and subsequently prevent over-fitting.

In the context of this package we differentiate between two categories of data sources based on some useful properties. A “data source”, by the way, is simply any Julia type that can provide data. We need not be more precise with this definition, since it is of little practical consequence. The definitions that matter are for the two sub-categories of data sources that this package can actually interact with: Data Containers and Data Iterators. These abstractions will allow us to interact with many different types of data using a coherent and non-invasive interface.

Data Container

For a data source to belong in this category it needs to be able to provide two things:

  1. The total number of observations \(N\), that the data source contains.
  2. A way to query a specific observation or sequence of observations. This must be done using indices, where every observation has a unique index \(i \in I\) assigned from the set of indices \(I = \{1, 2, ..., N\}\).
Data Iterator

To belong to this group, a data source must implement Julia’s iterator interface. The data source may or may not know the total amount of observations it can provide, which means that knowing \(N\) is not necessary.

The key requirement for a iteration-based data source is that every iteration consistently returns either a single observation or a batch of observations.

The more flexible of the two categories are what we call data containers. A good example for such a type is a plain Julia Array. Well, almost. To be considered a data container, the type has to implement the required interface. In particular, a data container has to implement the functions getobs() and nobs(). For convenience both of those implementations are already provided for Array out of the box. Thus on package import the type Array becomes a data container type. For more details on the required interface take a look at the section on Data Container.

Let us take a look at a few examples to get a feeling of how one can use this library. This package is registered in the Julia package ecosystem. Once installed the package can be imported as usual.

using MLDataPattern

Consider the following toy feature matrix X, which has 2 rows and 6 columns. We can use nobs() to query the number of observations it contains, and getobs() to query one or more specific observation(s).

julia> X = rand(2, 6)
2×6 Array{Float64,2}:
 0.226582  0.933372  0.505208   0.0443222  0.812814  0.11202
 0.504629  0.522172  0.0997825  0.722906   0.245457  0.000341996

julia> nobs(X)
6

julia> getobs(X, 2) # query the second observation
2-element Array{Float64,1}:
 0.933372
 0.522172

julia> getobs(X, [4, 1]) # create a batch with observation 4 and 1
2×2 Array{Float64,2}:
 0.0443222  0.226582
 0.722906   0.504629

As you may have noticed, the two functions make a pretty strong assumption about how to interpret the shape of X. In particular, they assume that each column denotes a single observation. This may not be what we want. Given that X has two dimensions that we could assign meaning to, we should have the opportunity to choose which dimension enumerates the observations. After all, we can think of X as a data container that has 6 observations with 2 features each, or as a data container that has 2 observations with 6 features each. To allow for that choice, all relevant functions accept the optional parameter obsdim. For more information take a look at the section on Observation Dimension.

julia> nobs(X, obsdim = 1)
2

julia> getobs(X, 2, obsdim = 1)
6-element Array{Float64,1}:
 0.504629
 0.522172
 0.0997825
 0.722906
 0.245457
 0.000341996

Every data container can be subsetted manually using the low-level function datasubset(). Its signature is identical to getobs(), but instead of copying the data it returns a lazy subset. A lot of the higher-level functions use datasubset() internally to provide their functionality. This allows for delaying the actual data access until the data is actually needed. For arrays the returned subset is in the form of a SubArray. For more information take a look at the section on Data Subsets.

julia> datasubset(X, 2)
2-element SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true}:
 0.933372
 0.522172

julia> datasubset(X, [4, 1])
2×2 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}:
 0.0443222  0.226582
 0.722906   0.504629

julia> datasubset(X, 2, obsdim = 1)
6-element SubArray{Float64,1,Array{Float64,2},Tuple{Int64,Colon},true}:
 0.504629
 0.522172
 0.0997825
 0.722906
 0.245457
 0.000341996

Note that a data subset doesn’t strictly have to be a true “subset” of the data set. For example, the function shuffleobs() returns a lazy data subset, which contains exactly the same observations, but in a randomly permuted order.

julia> shuffleobs(X)
2×6 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}:
 0.0443222  0.812814  0.226582  0.11202      0.505208   0.933372
 0.722906   0.245457  0.504629  0.000341996  0.0997825  0.522172

Since this function is non-deterministic, it raises the question of what to do when our data set is made up of multiple variables. It is not uncommon, for example, that the targets of a labeled data set are stored in a separate Vector. To support such a scenario, all relevant functions also accept a Tuple as the data argument. If that is the case, then all elements of the given tuple will be processed in the exact same manner. The return value will then again be a tuple with the individual results. As you can see in the following code snippet, the observation-link between x and y is preserved after the shuffling. For more information about grouping data containers in a Tuple, take a look at the section on Tuples and Labeled Data.

julia> x = collect(1:6);

julia> y = [:a, :b, :c, :d, :e, :f];

julia> xs, ys = shuffleobs((x, y))
([6,1,4,5,3,2],Symbol[:f,:a,:d,:e,:c,:b])

A common requirement in a machine learning experiment is to split the data set into a training and a test portion. While we could already do this manually using datasubset(), this package also provides a high-level convenience function splitobs().

julia> y1, y2 = splitobs(y, at = 0.6)
(Symbol[:a,:b,:c,:d],Symbol[:e,:f])

As we can see in the example above, the function splitobs() performs a static “split” of the given data at the relative position at, and returns the result in the form of two data subsets. It is also possible to specify multiple fractions, which will cause the function to perform additional splits.

julia> y1, y2, y3 = splitobs(y, at = (0.5, 0.3))
(Symbol[:a,:b,:c],Symbol[:d,:e],Symbol[:f])

Of course, a simple static split isn’t always what we want. In most situations we would rather partition the data set into two disjoint subsets using random assignment. We can do this by combining splitobs() with shuffleobs(). Since neither of which copies actual data we do not pay any significant performance penalty for nesting “subsetting” functions.

julia> y1, y2 = splitobs(shuffleobs(y), at = 0.6)
(Symbol[:c,:e,:f,:a],Symbol[:b,:d])

julia> y1, y2, y3 = splitobs(shuffleobs(y), at = (0.5, 0.3))
(Symbol[:b,:f,:e],Symbol[:d,:a],Symbol[:c])

It is also possible to call splitobs() with two data containers grouped in a Tuple. While this is especially useful for working with labeled data, neither implies the other. That means that one can use tuples to group together unlabeled data, or have a labeled data container that is not a tuple (see Labeled Data Container for some examples). For instance, since the function splitobs() performs a static split, it doesn’t actually care if the given Tuple describes a labeled data set. In fact, it makes no difference.

julia> X = rand(2, 6)
2×6 Array{Float64,2}:
 0.226582  0.933372  0.505208   0.0443222  0.812814  0.11202
 0.504629  0.522172  0.0997825  0.722906   0.245457  0.000341996

julia> y = ["a", "b", "b", "b", "b", "a"]
6-element Array{String,1}:
 "a"
 "b"
 "b"
 "b"
 "b"
 "a"

julia> (X1, y1), (X2, y2) = splitobs((X, y), at = 0.6);

julia> y1, y2
(String["a","b","b","b"],String["b","a"])

On the other hand, some functions require the presence of targets to perform their respective tasks. In such a case, it is always assumed that the last tuple element contains the targets. An alternative to splitobs() that is explicitly for labeled data is stratifiedobs(), which tries to preserve the class distribution. The following example shows that both, y1 and y2, contain twice as much "b" as "a", just like y does.

julia> (X1, y1), (X2, y2) = stratifiedobs((X, y), p = 0.5);

julia> y1, y2
(String["a","b","b"],String["b","b","a"])

Other functions that deal with supervised data sets are undersample() and oversample(), which can be used to re-sample a labeled data container in such a way, that the resulting class distribution is uniform.

julia> undersample(y)
4-element SubArray{String,1,Array{String,1},Tuple{Array{Int64,1}},false}:
 "a"
 "b"
 "b"
 "a"

julia> Xnew, ynew = undersample((X, y), shuffle = false)
([0.226582 0.933372 0.812814 0.11202; 0.504629 0.522172 0.245457 0.000341996],
 String["a","b","b","a"])

julia> Xnew, ynew = oversample((X, y), shuffle = true)
([0.11202 0.933372 … 0.505208 0.0443222; 0.000341996 0.522172 … 0.0997825 0.722906],
 String["a","b","a","a","b","a","b","b"])

If need be, all functions that require a labeled data container accept a target-extraction-function as an optional first parameter. If such a function is provided, it will be applied to each observation individually. In the following example the function indmax will be applied to each column slice of Y in order to derive a class label, which is then used for down-sampling. For more information take a look at the section on Labeled Data Container.

julia> Y = [1. 0. 0. 0. 0. 1.; 0. 1. 1. 1. 1. 0.]
2×6 Array{Float64,2}:
 1.0  0.0  0.0  0.0  0.0  1.0
 0.0  1.0  1.0  1.0  1.0  0.0

julia> Xnew, Ynew = undersample(indmax, (X, Y));

julia> Ynew
2×4 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}:
 1.0  0.0  0.0  1.0
 0.0  1.0  1.0  0.0

This package also provides functions to perform re-partitioning strategies. These result in vector-like views that can be iterated over, in which each element is a different partition of the original data. Note again that all partitions are just lazy subsets, which means that no data is copied. For more information take a look at Repartitioning Strategies.

julia> x = collect(1:10);

julia> folds = kfolds(x, k = 5)
5-element FoldsView(::Array{Int64,1}, ::Array{Array{Int64,1},1}, ::Array{UnitRange{Int64},1}, ObsDim.Last()) with element type Tuple{SubArray{Int64,1,Array{Int64,1},Tuple{Array{Int64,1}},false},SubArray{Int64,1,Array{Int64,1},Tuple{UnitRange{Int64}},true}}:
 ([3,4,5,6,7,8,9,10],[1,2])
 ([1,2,5,6,7,8,9,10],[3,4])
 ([1,2,3,4,7,8,9,10],[5,6])
 ([1,2,3,4,5,6,9,10],[7,8])
 ([1,2,3,4,5,6,7,8],[9,10])

julia> train, val = folds[1] # access first fold
([3,4,5,6,7,8,9,10],[1,2])

Such “views” also exist for other purposes. For example, the function obsview() will create a decorator around some data container, that makes the given data container appear as a vector of individual observations. This “vector” can then be indexed into or iterated over.

julia> X = rand(2, 6)
2×6 Array{Float64,2}:
 0.226582  0.933372  0.505208   0.0443222  0.812814  0.11202
 0.504629  0.522172  0.0997825  0.722906   0.245457  0.000341996

julia> ov = obsview(X)
6-element obsview(::Array{Float64,2}, ObsDim.Last()) with element type SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true}:
 [0.226582,0.504629]
 [0.933372,0.522172]
 [0.505208,0.0997825]
 [0.0443222,0.722906]
 [0.812814,0.245457]
 [0.11202,0.000341996]

Similarly, the function batchview() creates a decorator that makes the given data container appear as a vector of equally sized mini-batches.

julia> bv = batchview(X, size = 2)
3-element batchview(::Array{Float64,2}, 2, 3, ObsDim.Last()) with element type SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
 [0.226582 0.933372; 0.504629 0.522172]
 [0.505208 0.0443222; 0.0997825 0.722906]
 [0.812814 0.11202; 0.245457 0.000341996]

A third but conceptually different kind of view is provided by slidingwindow(). This function is particularly useful for preparing sequence data for various training tasks. For more information take a look at the section on Data Views.

julia> data = split("The quick brown fox jumps over the lazy dog")
9-element Array{SubString{String},1}:
 "The"
 "quick"
 "brown"
 "fox"
 "jumps"
 "over"
 "the"
 "lazy"
 "dog"

julia> A = slidingwindow(i->i+2, data, 2, stride=1)
7-element slidingwindow(::##9#10, ::Array{SubString{String},1}, 2, stride = 1) with element type Tuple{...}:
 (["The", "quick"], "brown")
 (["quick", "brown"], "fox")
 (["brown", "fox"], "jumps")
 (["fox", "jumps"], "over")
 (["jumps", "over"], "the")
 (["over", "the"], "lazy")
 (["the", "lazy"], "dog")

julia> A = slidingwindow(i->[i-2:i-1; i+1:i+2], data, 1)
5-element slidingwindow(::##11#12, ::Array{SubString{String},1}, 1) with element type Tuple{...}:
 (["brown"], ["The", "quick", "fox", "jumps"])
 (["fox"], ["quick", "brown", "jumps", "over"])
 (["jumps"], ["brown", "fox", "over", "the"])
 (["over"], ["fox", "jumps", "the", "lazy"])
 (["the"], ["jumps", "over", "lazy", "dog"])

Aside from data containers, there is also another sub-category of data sources, called data iterators, that can not be indexed into. For example the following code creates an object that when iterated over, continuously and indefinitely samples a random observation (with replacement) from the given data container.

julia> iter = RandomObs(X)
RandomObs(::Array{Float64,2}, ObsDim.Last())
 Iterator providing Inf observations

To give a second example for a data iterator, the type RandomBatches generates randomly sampled mini-batches for a fixed size. For more information on that topic, take a look at the section on Data Iterators.

julia> iter = RandomBatches(X, size = 10)
RandomBatches(::Array{Float64,2}, 10, ObsDim.Last())
 Iterator providing Inf batches with size 10

julia> iter = RandomBatches(X, count = 50, size = 10)
RandomBatches(::Array{Float64,2}, 10, 50, ObsDim.Last())
 Iterator providing 50 batches with size 10

Let us round out this introduction by taking a look at a “hello world” example (with little explanation) to get a feeling for how to combine the various functions of this package in a typical ML scenario.

# X is a matrix; Y is a vector
X, Y = rand(4, 150), rand(150)

# The iris dataset is ordered according to their labels,
# which means that we should shuffle the dataset before
# partitioning it into training- and test-set.
Xs, Ys = shuffleobs((X, Y))
# Notice how we use tuples to group data.

# We leave out 15 % of the data for testing
(cv_X, cv_Y), (test_X, test_Y) = splitobs((Xs, Ys); at = 0.85)

# Next we partition the data using a 10-fold scheme.
# Notice how we do not need to splat train into X and Y
for (train, (val_X, val_Y)) in kfolds((cv_X, cv_Y); k = 10)

    for epoch = 1:100
        # Iterate over the data using mini-batches of 5 observations each
        for (batch_X, batch_Y) in eachbatch(train, size = 5)
            # ... train supervised model on minibatches here
        end
    end
end

In the above code snippet, the inner loop for eachbatch() is the only place where data other than indices is actually being copied. That is because cv_X, test_X, val_X, etc. are all array views of type SubArray (the same applies to all the Y’s of course). In contrast to this, batch_X and batch_Y will be of type Array. Naturally, array views only work for arrays, but we provide a generalization of such a data subset for any type of data container.

Furthermore both, batch_X and batch_Y, will be the same instances each iteration with only their values changed. In other words, they both are preallocated buffers that will be reused each iteration and filled with the data for the current batch. Naturally, one is not required to work with buffers like this, as stateful iterators can have undesired side-effects when used without care. For example collect(eachbatch(X)) would result in an array that has the exact same batch in each position. Oftentimes, though, reusing buffers is preferable. This package provides different alternatives for different use-cases.

Getting Help

To get help on specific functionality you can either look up the information here, or if you prefer you can make use of Julia’s native doc-system. The following example shows how to get additional information on DataSubset within Julia’s REPL:

?DataSubset

If you find yourself stuck or have other questions concerning the package you can find us at gitter or the Machine Learning domain on discourse.julialang.org

If you encounter a bug or would like to participate in the further development of this package come find us on Github.