20141221

Monitoring Experiments in Pylearn2

In an earlier post I covered the basics of running experiments in Pylearn2. However, I only covered the bare minimum commands required, leaving out many details. One fundamental concept in running experiments in Pylearn2 is knowing how to monitor their progress, or "monitoring" for short.

In this tutorial we will look at two forms of monitoring: the basic form, which is always performed, and a new approach for real-time remote monitoring.

Basic Monitoring

We will build upon the bare-bones example from the previous tutorial, which means we will be using the MNIST dataset. Most datasets have two or three parts. At a minimum they have a part for training and a part for testing. If a dataset has a third part, its purpose is validation: measuring the performance of our learner without unduly biasing it towards the test data.

Pylearn2 performs monitoring at the end of each epoch, and it can monitor any combination of the parts of the dataset. When using Stochastic Gradient Descent (SGD) as the training algorithm, one uses the monitoring_dataset parameter to specify which parts of the dataset are to be monitored. For example, if we are only interested in monitoring the training set, we would add the following entry to the SGD parameter dictionary:

monitoring_dataset:
{
    'train': !obj:pylearn2.datasets.mnist.MNIST { which_set: 'train' }
}

This will instruct Pylearn2 to calculate statistics about the performance of our learner using the training part of the dataset at the end of each epoch. This will change the default output after each epoch from:

Monitoring step:
 Epochs seen: 1
 Batches seen: 1875
 Examples seen: 60000
Time this epoch: 2.875921 seconds

to:

Monitoring step:
 Epochs seen: 0
 Batches seen: 0
 Examples seen: 0
 learning_rate: 0.0499996989965
 total_seconds_last_epoch: 0.0
 train_objective: 2.29713964462
 train_y_col_norms_max: 0.164925798774
 train_y_col_norms_mean: 0.161361783743
 train_y_col_norms_min: 0.158035755157
 train_y_max_max_class: 0.118635632098
 train_y_mean_max_class: 0.109155222774
 train_y_min_max_class: 0.103917405009
 train_y_misclass: 0.910533130169
 train_y_nll: 2.29713964462
 train_y_row_norms_max: 0.0255156457424
 train_y_row_norms_mean: 0.018013747409
 train_y_row_norms_min: 0.00823106430471
 training_seconds_this_epoch: 0.0
Time this epoch: 2.823628 seconds

Each of the entries in the output (e.g. learning_rate, train_objective) is called a channel. Channels give one insight into what the learner is doing. The two most frequently used are train_objective and train_y_nll. The channel train_objective reports the cost being optimized by training, while train_y_nll monitors the negative log likelihood of the data under the current parameter values. In this particular example these two channels monitor the same thing, but this will not always be the case.
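
For reference, train_y_nll is the mean negative log likelihood over the $N$ monitored examples $(x_i, y_i)$; for a classifier with parameters $\theta$ it can be written as

\[ \mathrm{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid x_i; \theta) \]

At initialization a softmax over 10 classes assigns roughly uniform probability to each class, so this value starts near $\log 10 \approx 2.30$, which matches the train_y_nll value of about 2.297 shown above.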

Monitoring the train part of the dataset is useful for debugging purposes. However, it alone is not enough to evaluate the performance of our learner, because the learner will likely keep improving until at some point it begins to overfit on the training data. In other words, it will find parameters that work well on the data used to train it but not on data it has not seen during training. To combat this we use a validation set. MNIST does not explicitly reserve a part of the data for validation, but it has become a de facto standard to use the last 10,000 samples from the train part. To specify this one uses the start and stop parameters when instantiating MNIST. If we were only monitoring the validation set, our monitoring_dataset parameter to SGD would be:

monitoring_dataset:
{
    'valid': !obj:pylearn2.datasets.mnist.MNIST {
        which_set: 'train',
        start: 50000,
        stop: 60000
    }
}

Note that the key to the dictionary, 'valid' in this case, is merely a label. It can be whatever we choose. The name of each channel monitored for the associated dataset is prefixed with this value; for example, with the label 'valid' the objective channel is reported as valid_objective.

It's also worth noting that we are not limited to monitoring just one part of the dataset. It is usually helpful to monitor both the train and validation parts of a dataset. This is done as follows:

monitoring_dataset:
{
    'train': !obj:pylearn2.datasets.mnist.MNIST {
        which_set: 'train',
        start: 0,
        stop: 50000
    },
    'valid': !obj:pylearn2.datasets.mnist.MNIST {
        which_set: 'train',
        start: 50000,
        stop: 60000
    }
}

Note that here we use the start and stop parameters when loading both the train and valid parts in order to partition the dataset appropriately. We do not want the learner to validate on data from the train part, otherwise we will not be able to identify overfitting.

Putting it all together, our complete YAML now looks like:

!obj:pylearn2.train.Train {
    dataset: &train !obj:pylearn2.datasets.mnist.MNIST {
        which_set: 'train',
        start: 0,
        stop: 50000
    },
    model: !obj:pylearn2.models.softmax_regression.SoftmaxRegression {
        batch_size: 20,
        n_classes: 10,
        nvis: 784,
        irange: 0.01
    },
    algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
        learning_rate: 0.05,
        monitoring_dataset:
        {
            'train': *train,
            'valid': !obj:pylearn2.datasets.mnist.MNIST {
                which_set: 'train',
                start: 50000,
                stop: 60000
            }
        }
    }
}

Note that here we have used a YAML trick to reference a previously instantiated object to save ourselves typing. Specifically, the dataset has been tagged "&train" and, when specifying monitoring_dataset, the reference "*train" is used to identify the previously instantiated object.

Live Monitoring

There are two problems with the basic monitoring mechanism in Pylearn2. First, the output is raw text. This alone can make it difficult to understand how the values of the various channels are evolving in time, especially when attempting to track multiple channels simultaneously. Second, due in part to the ability to add channels for monitoring, the amount of output after each epoch can and frequently does grow quickly. Combined, these problems make the basic monitoring mechanism difficult to use.

An alternative approach is to use a new mechanism called live monitoring. To be completely forthright, the live monitoring mechanism is something that I developed to combat the aforementioned problems. Furthermore, I am interested in feedback regarding its user interface and what additional functionality people would like. Please feel free to send an E-mail to the Pylearn2 users mailing list or leave a comment below with feedback.

The live monitoring mechanism has two parts. The first part is a training extension, i.e. an optional plug-in that modifies the way training is performed. The second part is a utility class that can query the training extension for data about channels being monitored.

Training extensions can be selected using the extensions parameter to the train object. In other words, add the following to the parameters dictionary for the train object in any YAML:

extensions: [
    !obj:pylearn2.train_extensions.live_monitoring.LiveMonitoring {}
]

The full YAML would look like:

!obj:pylearn2.train.Train {
    dataset: &train !obj:pylearn2.datasets.mnist.MNIST {
        which_set: 'train',
        start: 0,
        stop: 50000
    },
    model: !obj:pylearn2.models.softmax_regression.SoftmaxRegression {
        batch_size: 20,
        n_classes: 10,
        nvis: 784,
        irange: 0.01
    },
    algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
        learning_rate: 0.05,
        monitoring_dataset:
        {
            'train': *train,
            'valid': !obj:pylearn2.datasets.mnist.MNIST {
                which_set: 'train',
                start: 50000,
                stop: 60000
            }
        }
    },
    extensions: [
        !obj:pylearn2.train_extensions.live_monitoring.LiveMonitoring {}
    ]
}

The LiveMonitoring training extension listens for queries about the channels being monitored. To perform queries one need only instantiate LiveMonitor and use its methods to request data. Currently it has three methods:
  • list_channels: Returns a list of channels being monitored.
  • update_channels: Retrieves data about the list of specified channels.
  • follow_channels: Plots the data for the specified channels. This command blocks other commands from being executed because it repeatedly requests the latest data for the specified channels and redraws the plot as new data arrives.

To instantiate LiveMonitor, start IPython and execute the following commands:

from pylearn2.train_extensions.live_monitoring import LiveMonitor
lm = LiveMonitor()

Each of the methods listed above returns a different message object. The data of interest is contained in the data member of that object. As such, given an instance of LiveMonitor, one would view the channels being monitored as follows:

print lm.list_channels().data

If we're running the experiment specified by the YAML above, this will yield:

['train_objective', 'train_y_col_norms_max',
'train_y_row_norms_min', 'train_y_nll', 'train_y_col_norms_mean',
'train_y_max_max_class', 'train_y_min_max_class', 'train_y_row_norms_max',
'train_y_misclass', 'train_y_col_norms_min', 'train_y_row_norms_mean',
'train_y_mean_max_class', 'valid_objective', 'valid_y_col_norms_max',
'valid_y_row_norms_min', 'valid_y_nll', 'valid_y_col_norms_mean',
'valid_y_max_max_class', 'valid_y_min_max_class', 'valid_y_row_norms_max',
'valid_y_misclass', 'valid_y_col_norms_min', 'valid_y_row_norms_mean',
'valid_y_mean_max_class', 'learning_rate', 'training_seconds_this_epoch',
'total_seconds_last_epoch']

From this we can pick channels to plot using follow_channels:

lm.follow_channels(['train_objective', 'valid_objective'])

This command will then display a graph like the one in Figure 1 and continually update the plot at the end of each epoch.

Figure 1: Example output from the follow_channels method of the LiveMonitor utility object.
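
While follow_channels plots data, update_channels returns the raw values, which is handy for custom analysis. Below is a minimal sketch; I am assuming here that update_channels accepts a list of channel names, as follow_channels does, so check live_monitoring.py for the exact signature and the layout of the returned data:

# Request the raw monitoring data for two channels using the existing
# LiveMonitor instance; the values are carried in the message's data member.
msg = lm.update_channels(['train_objective', 'valid_objective'])
print msg.data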

The live monitoring mechanism is network aware: by default it answers queries on port 5555 of every network interface of the computer on which the experiment is being executed. It is not necessary for a user to know anything about networking to use live monitoring, however. By default the live monitoring mechanism assumes the experiment of interest is being executed on the same computer as the LiveMonitor utility class. If that is not the case, and one knows the IP address of the computer on which the experiment is running, then one need only specify that address when instantiating LiveMonitor. The live monitoring mechanism will automatically take care of the networking.
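
For example, if the experiment were running on a machine reachable at 192.168.1.42 (a made-up address for illustration), the connection might look like the sketch below; I am assuming the address is passed as the first constructor argument, which is worth verifying against live_monitoring.py:

from pylearn2.train_extensions.live_monitoring import LiveMonitor

# Query an experiment running on a remote machine instead of localhost.
lm = LiveMonitor(address='192.168.1.42')
print lm.list_channels().data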

Live monitoring is also very efficient: it only ever requests data it does not already have, and the underlying networking utility waits for new data without taking unnecessary CPU time.

The live monitoring mechanism has many benefits including:
  • The ability to filter the channels being monitored.
  • The ability to plot data for any given set of channels being monitored.
  • The ability to retrieve data from an experiment in real-time*
  • The ability to query for data from an experiment running on a remote machine.
  • The ability to change which channels are being followed or plotted without restarting an experiment.
* Updates only occur at the end of each epoch but this is real-time with respect to Pylearn2 experiments.

Conclusion

Monitoring the progress of experiments in Pylearn2 is as easy as setting up an experiment. Monitoring is also very flexible, offering output both as text directly in the terminal and graphically via a training extension.

13 comments:

  1. Hi Dustin, thanks for the post. I was particularly interested in the follow_channels plot, but it isn't working for me on Ubuntu 14.04. The most immediate reason seems to be a typo (I think self.update_channel should be self.update_channels), but even if I fix this, the plot still doesn't display. I think the reason is the use of matplotlib's interactive mode as the means of delivering live updates.

    I've had trouble in the past trying to get matplotlib's ion() mode to update a plot from inside a loop, with inconsistent behavior between Ubuntu and Windows. In hunting for a solution, I got the impression that ion() is really only supported for interactive usage directly from the terminal and may or may not happen to work when used in the way it's used here. For this sort of dynamically updating use case, the recommended approach seems to be using a UI toolkit like wxPython, PySide, PyQt, etc. (see e.g. http://stackoverflow.com/questions/1940387/interactive-mode-in-matplolib).

    I was able to get it working using a PySide GUI instead of ion(). I think this is a more robust solution for live plotting when PySide is available, and it could be extended with dynamic UI elements e.g. for changing selected channels while the plot is running. If you're interested, I submitted it as a pull request here https://github.com/lisa-lab/pylearn2/pull/1328

    1. Hi Adam!

      I'm not sure how the spelling mistake made it through the merge process given that we have unit tests (perhaps I didn't implement one for that method) and we have two developers look at the code. Thanks for catching that and fixing it.

      Suffice it to say the plotting code has been more than a little troublesome. It would be great to have a more robust solution. I also really like the idea of dynamic UI elements. The ability to change the channels being displayed while the plot is running is very appealing. In fact I plan to look into implementing it as soon as your pull request is merged. Beware that most of the Pylearn2 developers are on holiday so the merge may take a little longer than usual. But I will be sure to follow up on it.

      Thanks for the pull request and the great contribution and have a happy new year!

  2. Thanks for the post. Is there a way to print out the final weights of the model or save them into a CSV file?

    1. Though the live monitor could be adapted to do this easily enough, it would be a bit of a hack in my opinion. I can think of a number of other ways to do it, though I can't think of a standard way off the top of my head. A simple approach would be to use the save_path and save_freq parameters of the train object to ensure that a pickle of the trained model is saved. Then load the file and extract the weights. This last part is going to be model specific. If you're using an MLP, for instance, then you will need to iterate through the layers of the MLP, access the W and b Theano shared variable members, and write them out to your file. In this case you will have to watch out for more complicated structures like embedded MLPs and handle them correctly; see the sketch below for the simple MLP case. If this doesn't make immediate sense then I would advise posting this question to the pylearn2-users mailing list with additional details about what you're trying to do.
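
      A rough sketch of that last part for an MLP with simple (non-nested) layers, assuming the model was saved to 'model.pkl' via save_path:

      import numpy as np
      from pylearn2.utils import serial

      # Load the pickled model written out by the train object.
      model = serial.load('model.pkl')

      # Write each layer parameter (e.g. W and b) to its own CSV file,
      # named after the underlying Theano shared variable.
      for layer in model.layers:
          for param in layer.get_params():
              np.savetxt('%s.csv' % param.name, param.get_value(), delimiter=',')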

    2. Thank you very much. Yes, I am using an MLP; I will try to list the W and b.

  3. Great posts!

    Just a minor mistake: you should use the keyword stop rather than end to specify the training and validation sets.

  4. Hi Dustin, thanks for your awesome post! I'm just starting my studies on deep learning and I found pylearn extremely useful both for experimentation and my own studies. I would like to know if there is any solution in the current version of pylearn regarding the live monitor. I'm also using Ubuntu 14.04 and I found your and Adam's discussion, but I couldn't get this to work. Thanks once again for your contributions!

    1. Hi Gian!

      I'm not sure what you are asking precisely but I suspect you're running into a bug for which I recently submitted a fix. Try pulling the latest version of Pylearn2 and follow the steps again. If you continue to run into trouble feel free to post details about the error here or on the Pylearn2 users E-mail list and I will take a look.

  5. Hi Dustin,

    Sorry about the confusion. I read my comment again and even I found it strange... My question is about the live monitor on Ubuntu 14.04. I have the version from June 03 and I'm following your instructions. I can instantiate the live monitor and I can print the channels, but the picture does not show up... I will try again, and if I can't get it to work I will post to the users e-mail list.

    Thanks again!

  6. Hi, sorry if I'm making a dumb mistake, but I just tried to run train.py on the YAML code you posted here and I get the error:

    File "/afs/inf.ed.ac.uk/user/s13/s1333293/pylearn2/pylearn2/train_extensions/live_monitoring.py", line 247, in on_monitor
    except zmq.Again:
    AttributeError: 'module' object has no attribute 'Again'

    Do you know why this is happening?

    1. I have not seen this error myself. It seems you have a different version of ZMQ than the one this extension was written for, or perhaps a different package of the same name installed.
