Web Interface: Inputs and outputs


The SVM server takes as input three files: a training data set, a corresponding set of classification labels, and a test data set. Each row in each of these files corresponds to one example. The definition of "example" may vary widely, depending upon what kind of data you are interested in classifying. For example, if you are analyzing gene expression data, each example in the data files might correspond to one gene or to one experimental condition. On the other hand, if you are learning to recognize hand-written digits, each example would correspond to one such digit.

The important point is that, in whatever domain you are working, you must be able to convert each example into a fixed-length list of numbers. This list is called a "vector." Every element of each vector should have the same semantics. For example, if you have measured gene expression levels in, say, 20 different experimental conditions, then you could represent each gene in your experiment as a vector of 20 numbers. Of course, the order of the numbers must remain constant across all the genes, so that the SVM can compare one gene to another. Similarly, if you are looking at a digitized picture of a hand-written digit, your vector might consist of the grayscale value at each pixel in the image. Each element in the vector is called a "feature."

The format for the data files is tab-delimited text. The first row should contain the name of each feature, and the first column should contain the name of each example in the data set. The entry in the first row and first column (the corner of the table) is arbitrary. Every other entry, that is, every entry in the second or higher row and in the second or higher column, should be a number. Here is a very simple data file, containing ten examples with four features each:

corner     feature_1  feature_2  feature_3  feature_4
example_1   -0.9      -3.9       -3.1        0.7
example_2    2.1       1.1        0.3       -1.6
example_3    3.5       2.0       -0.3        3.1
example_4   -2.3      -0.4       -0.4       -0.1
example_5   -1.4       0.1       -1.7       -0.1
example_6    4.2      -0.4       -0.3        0.4
example_7    1.2      -2.4       -1.5       -4.2
example_8    1.8       3.3        3.2        1.9
example_9   -1.8       3.9        2.5        1.1
example_10  -1.8       3.1        4.5        3.0

Note that the white space between fields must be tab characters. This requirement allows the example and feature names to contain spaces, if you so desire.
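The format rules above can be checked before uploading a file. The following Python sketch is not part of the SVM server; the function name read_data_file is hypothetical, and the two-row sample is a shortened copy of the data file shown above. It splits on tab characters only, so names containing spaces are preserved, and it rejects rows with the wrong number of fields or with non-numeric entries.

```python
import csv
import io

def read_data_file(handle):
    """Parse a tab-delimited data file into (feature_names, example_names, rows)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    feature_names = header[1:]  # header[0] is the arbitrary corner entry
    example_names, rows = [], []
    for line_number, fields in enumerate(reader, start=2):
        if not fields:
            continue  # ignore completely empty lines
        if len(fields) != len(header):
            raise ValueError(
                f"line {line_number}: expected {len(header)} fields, got {len(fields)}"
            )
        example_names.append(fields[0])
        # float() raises ValueError if an entry is not a number
        rows.append([float(value) for value in fields[1:]])
    return feature_names, example_names, rows

# Shortened copy of the sample data file, with explicit tab characters.
sample = (
    "corner\tfeature_1\tfeature_2\tfeature_3\tfeature_4\n"
    "example_1\t-0.9\t-3.9\t-3.1\t0.7\n"
    "example_2\t2.1\t1.1\t0.3\t-1.6\n"
)
features, names, rows = read_data_file(io.StringIO(sample))
print(features)
print(names)
print(rows)
```

A file that was aligned with spaces instead of tabs would be read as a single field per row and rejected by the field-count check.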

Along with the training data set, you must also supply a set of binary classification labels. These labels are what you are asking the SVM to learn to predict. The SVM is fundamentally a binary classifier. Although it is possible to train a collection of SVMs to recognize multiple classes, this web server trains only a single SVM at a time. Thus, for example, if you are classifying genes by their function, you cannot ask, "In what functional class does this gene belong?" Instead, you can only ask, for example, "Does this gene belong to the cytoplasmic ribosomal functional class?" Similarly, you cannot ask, "What digit is this?", but you can ask, "Is this the digit '9'?"

The format for the label file is very similar to the training data set format, except that the file contains only two columns: the example names and the corresponding binary labels. Each label must be either "1" or "-1". Furthermore, the example names must be the same as the example names in the training set, and they must appear in the same order. A sample label file to go with the sample training set above might look like this:

corner    class
example_1  -1
example_2   1
example_3   1
example_4  -1
example_5  -1
example_6   1
example_7  -1
example_8   1
example_9   1
example_10  1

This label file would tell the SVM to attempt to find a hyperplane that separates examples 1, 4, 5 and 7 (labeled "-1") from examples 2, 3, 6, 8, 9 and 10 (labeled "1").
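If your examples are annotated with more than two classes, a label file for one yes-or-no question can be generated from those annotations. The Python sketch below is only an illustration: the function make_label_file, the gene names, and the class names are all hypothetical. It writes "1" for examples in the chosen class and "-1" for all others, keeping the examples in training-set order.

```python
def make_label_file(example_names, annotations, positive_class):
    """Return label-file text: 1 if the example is in positive_class, else -1."""
    lines = ["corner\tclass"]
    for name in example_names:  # keep the same order as the training set
        label = 1 if annotations[name] == positive_class else -1
        lines.append(f"{name}\t{label}")
    return "\n".join(lines) + "\n"

# Hypothetical annotations, invented for illustration only.
annotations = {
    "gene_a": "ribosomal",
    "gene_b": "proteasome",
    "gene_c": "ribosomal",
}
training_order = ["gene_a", "gene_b", "gene_c"]

label_text = make_label_file(training_order, annotations, "ribosomal")
print(label_text, end="")
```

To ask a different question, such as membership in the "proteasome" class, you would generate a second label file from the same annotations and train a separate SVM.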

The third input file is the test data set. This file contains examples for which you do not necessarily know what the right classification is. The SVM will learn from the training set and will make predictions on the test set. The test set file format is identical to that of the training set. The names and number of examples may be different from the training set, but the feature names and order must be the same. Here is a sample test set:

corner     feature_1  feature_2  feature_3  feature_4
example_11     0.3        0.3       -2.2    -0.1
example_12    -1.9       -1.8        0.5     2.6
example_13    -1.0        3.0        2.1    -0.1
example_14    -1.0       -2.6       -0.9    -4.3
example_15    -2.3        0.1       -2.9    -4.4
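Because the feature names and their order must match between the training and test sets, it can be worth comparing the two header rows before submitting. This Python sketch assumes you have the first line of each file as a string; the function name check_feature_compatibility is hypothetical and is not something the server provides.

```python
def check_feature_compatibility(train_header_line, test_header_line):
    """Return a list of problems; an empty list means the feature columns agree."""
    # Drop the first field (the arbitrary corner entry) from each header row.
    train_features = train_header_line.rstrip("\n").split("\t")[1:]
    test_features = test_header_line.rstrip("\n").split("\t")[1:]
    problems = []
    if len(train_features) != len(test_features):
        problems.append(
            f"feature count differs: {len(train_features)} vs {len(test_features)}"
        )
    pairs = zip(train_features, test_features)
    for position, (train_name, test_name) in enumerate(pairs, start=1):
        if train_name != test_name:
            problems.append(f"feature {position}: '{train_name}' vs '{test_name}'")
    return problems

train_header = "corner\tfeature_1\tfeature_2\tfeature_3\tfeature_4\n"
swapped_header = "corner\tfeature_2\tfeature_1\tfeature_3\tfeature_4\n"

print(check_feature_compatibility(train_header, train_header))
print(check_feature_compatibility(train_header, swapped_header))
```

The corner entry and the example names are deliberately ignored, since they are allowed to differ between the two files.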


The SVM server produces two primary outputs. The first is a weights file. This file contains a weight associated with each training set example. A non-zero weight indicates that the training set example is considered a "support vector"; i.e., that example lies near (or on the wrong side of) the separating hyperplane found by the SVM learning algorithm. In addition to the weights, the file contains the classification label (as specified in the label file you provided), a predicted classification label (which indicates which side of the hyperplane the example lies on), and a discriminant value (which is proportional to the distance between the example and the hyperplane). Here is a sample weights file:

corner      class  weight train_classification train_discriminant
example_1     -1      -0          -1              -2.341
example_2      1  0.1321           1              0.9991
example_3      1       0           1                1.83
example_4     -1      -0          -1              -1.058
example_5     -1  -0.09971        -1                  -1
example_6      1       0           1               1.229
example_7     -1  -0.06195        -1                  -1
example_8      1       0           1               2.603
example_9      1       0           1               1.281
example_10     1       0           1                1.77

This weights file indicates that examples 2, 5 and 7 are support vectors and that the hyperplane successfully separates the two classes (since the value under "class" is always equal to the value under "train_classification"). Note that a real weights file will also contain a header at the top of the file, which contains detailed information about the SVM parameters that were used during training.
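The same reading of the weights file can be done programmatically. The Python sketch below is a rough illustration, not a tool supplied by the server. It assumes that the parameter header of a real weights file has already been removed, so that the text begins at the row of column names, and it assumes the column names shown in the sample above. The function name summarize_weights is hypothetical.

```python
import csv
import io

def summarize_weights(handle):
    """Return (support_vectors, misclassified) as lists of example names."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    name_column = reader.fieldnames[0]  # the arbitrary corner entry
    support_vectors, misclassified = [], []
    for row in reader:
        # "-0" parses to -0.0, which compares equal to 0.0
        if float(row["weight"]) != 0.0:
            support_vectors.append(row[name_column])
        if int(row["class"]) != int(row["train_classification"]):
            misclassified.append(row[name_column])
    return support_vectors, misclassified

# Three rows copied from the sample weights file, with explicit tabs.
sample = (
    "corner\tclass\tweight\ttrain_classification\ttrain_discriminant\n"
    "example_1\t-1\t-0\t-1\t-2.341\n"
    "example_2\t1\t0.1321\t1\t0.9991\n"
    "example_5\t-1\t-0.09971\t-1\t-1\n"
)
support_vectors, misclassified = summarize_weights(io.StringIO(sample))
print("support vectors:", support_vectors)
print("misclassified:", misclassified)
```

An empty list of misclassified examples corresponds to the case described above, in which the hyperplane separates the two training classes.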

The second output, the prediction file, is usually more interesting than the weights file. It contains, for each example in the test set, a predicted classification, along with a discriminant value. Again, the discriminant value is proportional to the example's distance from the hyperplane. A positive value indicates that the example is predicted to be in the positive class, and a negative value indicates that it is predicted to be in the negative class. The smaller the absolute value of the discriminant, the more likely it is that the prediction is incorrect. Here is a sample prediction file for the test set shown above:

corner      classification   discriminant
example_11       -1           -0.03785
example_12        1            0.0522
example_13       -1           -0.08235
example_14       -1           -0.04615
example_15       -1           -0.2354
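Since predictions with a small absolute discriminant are the least reliable, one simple way to review a prediction file is to sort it by that value. The Python sketch below assumes the column names shown in the sample above and uses a tab-delimited copy of the sample prediction file; the function name rank_by_confidence is hypothetical.

```python
import csv
import io

def rank_by_confidence(handle):
    """Return (name, classification, discriminant) tuples, least confident first."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    name_column = reader.fieldnames[0]  # the arbitrary corner entry
    rows = [
        (row[name_column], int(row["classification"]), float(row["discriminant"]))
        for row in reader
    ]
    # Smallest absolute discriminant means closest to the hyperplane.
    return sorted(rows, key=lambda item: abs(item[2]))

# The sample prediction file, with explicit tab characters.
sample = (
    "corner\tclassification\tdiscriminant\n"
    "example_11\t-1\t-0.03785\n"
    "example_12\t1\t0.0522\n"
    "example_13\t-1\t-0.08235\n"
    "example_14\t-1\t-0.04615\n"
    "example_15\t-1\t-0.2354\n"
)
ranked = rank_by_confidence(io.StringIO(sample))
for name, classification, discriminant in ranked:
    print(f"{name}\t{classification}\t{discriminant}")
```

In this sample, example_11 is listed first because its discriminant is closest to zero, and example_15 is listed last.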