Description: Train a support vector machine using a simple iterative update procedure first described by Jaakkola, Diekhans and Haussler.
Usage: compute-weights [options] <train filename> <class filename>
Input:
- <train filename> - a tab-delimited file of training examples. The first column contains labels, and the remaining columns contain real-valued features.
- <class filename> - a multi-column, tab-delimited file of training set labels. This file must contain exactly the same number of lines as the training data file. The first column contains labels, which must appear in the same order as in the training data file. The second and subsequent columns contain binary classifications (1 or -1). By default, the first classification column is used; a different column can be selected with the -useclassnumber option described below.
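As an illustration of these formats, a toy training file with three examples and two features, and a matching class file with two classification columns, might look as follows (all names and values are hypothetical; columns are tab-delimited):

    ex1    0.5     1.2
    ex2    -0.3    0.8
    ex3    1.1     -0.4

    ex1    1       -1
    ex2    -1      -1
    ex3    1       1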
Output: A five-column, tab-delimited file. The first two columns are identical to the classification file that was provided as input. Column three contains learned weights for the SVM. Columns four and five contain the predicted classification and the corresponding discriminant value. This output file is suitable for input to classify.
Note that the predicted classifications are computed by default using a 2-norm soft margin. Because the soft margin incorporates information about the training set labels, the predictions given in the weights file will differ from the predictions you would get by running your original training set through classify.
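For intuition about the training procedure itself: the iterative update named in the description can be read as coordinate-wise ascent on the SVM dual objective, with no separate bias term. The Python sketch below illustrates that reading under standard assumptions; it is a schematic, not the program's actual implementation, and the parameter names are invented:

    import numpy as np

    def train_weights(K, y, c_pos=np.inf, c_neg=np.inf,
                      threshold=1e-6, max_iter=10000):
        # Coordinate-wise update of example weights, given a kernel
        # matrix K and labels y in {+1, -1}.  A sketch only.
        n = len(y)
        w = np.zeros(n)
        prev_obj = -np.inf
        for _ in range(max_iter):
            for i in range(n):
                # Weighted kernel sum for example i, excluding itself.
                s = np.dot(w * y, K[i]) - w[i] * y[i] * K[i, i]
                # Closed-form coordinate update, clipped at zero and at
                # a per-class bound (cf. -posconstraint/-negconstraint).
                w[i] = (1.0 - y[i] * s) / K[i, i]
                w[i] = min(max(w[i], 0.0), c_pos if y[i] > 0 else c_neg)
            # Halt when the dual objective changes by less than the
            # convergence threshold (cf. -threshold and -maxiter).
            obj = 2.0 * w.sum() - np.dot(w * y, K @ (w * y))
            if abs(obj - prev_obj) < threshold:
                break
            prev_obj = obj
        return w

Under this reading, the discriminant value reported for an example would be the weighted kernel sum over all training examples, and the predicted classification its sign.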
Options:
- -matrix - By default, the base kernel function is a dot product; this option replaces it with an arbitrary function by reading a kernel matrix, rather than training set examples, from the file given as <train filename>. The matrix is an (n+1) by (n+1) matrix, where n is the number of training examples. The first row and column contain data labels, and the entry at row x, column y contains the kernel value K(x,y). Note that special options, including -selftrain and -selftest, must be invoked when an SVM trained with -matrix is later used with classify; see the documentation for classify for details. An example matrix file is shown after this group of options.
- -useclassnumber <value> - If the class file contains multiple classes, use the class indicated by this number. The first column of class labels is column 1. If this option is omitted, the first column of classifications is used.
- -initial <file> - Initialize the weights to the given values. The weights should appear in column 3 of the file. Output files produced by this program may be used to initialize the weights.
- -holdout <percent> - Add two additional columns to the output, which will contain the predicted classification and corresponding discriminant values computed via hold-one-out cross-validation. The specified <percent> determines what percentage of the training set will be randomly selected for cross-validation. For the remaining, non-held-out examples, the final two columns will contain the value "NaN".
- -zeromean - Subtract from each element in the input data the mean of the elements in that row, giving the row a mean of zero.
- -varone - Divide each element in the input data by the standard deviation of the elements in that row, giving the row a variance of one.
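To make the -matrix file format concrete, here is the (n+1) by (n+1) kernel matrix for the three hypothetical examples from the Input section above, using the default dot-product kernel (the corner cell is shown as a placeholder, since its required content is not specified here; columns are tab-delimited):

    corner    ex1     ex2     ex3
    ex1       1.69    0.81    0.07
    ex2       0.81    0.73    -0.65
    ex3       0.07    -0.65   1.37

For instance, K(ex1,ex1) = 0.5*0.5 + 1.2*1.2 = 1.69.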
The following four options control feature selection, which is only available in conjunction with hold-one-out cross-validation. In order to perform feature selection on distinct training and test sets, you must first use fselect to select a feature subset. A sketch of the scoring and thresholding logic appears after these options.
- -fselect fisher|ttest|welch|mannwhitney|sam|tnom - Specify the metric used to evaluate individual features. See the documentation for fselect for more information.
- -fthreshtype percent|number|value - Select different means of setting the feature selection threshold. The "percent" option chooses the top n% of the features. The "number" option chooses the top n features. The "value" option chooses features that score above n. The default setting is "percent".
- -fthreshold <value> - Set the threshold for feature selection. The default setting depends upon the threshold type: for "percent" and "number", the default is 10; for "value" it is 1.
- -fscores <file> - Write to the given file a matrix containing the computed quality scores for each feature. Each row corresponds to one feature. The first column contains the feature name, and the second column contains the Fisher score, the t-test score, or the negative log2 of the t-test p-value. If the "-holdout" option is specified, additional columns are included, corresponding to each held-out example.
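As a sketch of how these options interact, the code below computes a textbook Fisher score for one feature and applies the three -fthreshtype policies. The authoritative metric definitions are in the fselect documentation, so the exact formulas used by the program may differ:

    import numpy as np

    def fisher_score(x, y):
        # Textbook Fisher criterion for one feature: squared difference
        # of the class means over the sum of the class variances.
        pos, neg = x[y == 1], x[y == -1]
        return (pos.mean() - neg.mean()) ** 2 / (pos.var() + neg.var())

    def select_features(scores, kind="percent", n=10):
        # The three -fthreshtype policies, given one score per feature.
        order = np.argsort(scores)[::-1]  # best score first
        if kind == "percent":
            return order[: max(1, int(len(scores) * n / 100))]
        if kind == "number":
            return order[:n]
        return np.flatnonzero(scores > n)  # kind == "value"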
The following eight options modify the base kernel function. The operations occur in the order listed below; a code sketch of the full sequence appears below, after the -negconstraint option.
- -adddiag <value> - Add the given value to the diagonal of the training kernel matrix. This option effects a 2-norm soft margin and should therefore not be used in conjunction with the -posconstraint and -negconstraint options. The default value is 0.
- -nonormalize - Do not normalize the kernel matrix. By default, the matrix is normalized by dividing K(x,y) by sqrt(K(x,x) * K(y,y)).
- -constant <value> - Add a given constant to the kernel. The default constant is 10.
- -coefficient <value> - Multiply the kernel by a given coefficient. The default coefficient is 1.
- -power <value> - Raise the kernel to a given power. The default power is 1.
- -radial - Convert the kernel to a radial basis function. If K is the base kernel, this option creates a kernel of the form exp[-D(x,y)^2 / (2 w^2)], where w is the width of the kernel (see below) and D(x,y) is the distance between x and y in the feature space induced by K, defined as D(x,y) = sqrt[K(x,x) - 2 K(x,y) + K(y,y)].
- -widthfactor <value> - The width w of the radial basis kernel is set using a heuristic: it is the median of the distance from each positive training point to the nearest negative training point. This option specifies a multiplicative factor to be applied to that width.
- -diagfactor <value> - Add to the diagonal of the kernel matrix the quantity (n+/N) * m * k, where n+ is the number of positive training examples if the current example is positive (and similarly for negative training examples), N is the total number of training examples, m is the median value of the diagonal of the kernel matrix, and k is the value specified here. This option effects a 2-norm soft margin and should therefore not be used in conjunction with the -posconstraint and -negconstraint options. The default diagonal factor is 0.1.
- -posconstraint <value> - Set an explicit upper bound on the magnitude of the weights for positive training examples. By default, the magnitude is unconstrained. Note that this option (and the next) should be used in combination with a -diagfactor of 0.
- -negconstraint <value> - Set an explicit upper bound on the magnitude of the weights for negative training examples. By default, the magnitude is unconstrained.
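Putting the kernel options together, the transformations described above can be sketched as follows. This is a schematic of the documented behavior, not the program's source; the parameter names and the numerical guard are invented:

    import numpy as np

    def transform_kernel(K, y, adddiag=0.0, normalize=True, constant=10.0,
                         coefficient=1.0, power=1.0, radial=False,
                         widthfactor=1.0, diagfactor=0.0):
        # Apply the kernel transformations in the order the options are
        # listed, for labels y in {+1, -1}.  A sketch only.
        K = K + adddiag * np.eye(len(K))
        if normalize:  # skipped under -nonormalize
            d = np.sqrt(np.diag(K))
            K = K / np.outer(d, d)
        K = (coefficient * (K + constant)) ** power
        if radial:
            # Squared distance in the feature space induced by K.
            diag = np.diag(K)
            D2 = np.maximum(diag[:, None] - 2.0 * K + diag[None, :], 0.0)
            D = np.sqrt(D2)
            # Width heuristic: median distance from each positive example
            # to the nearest negative example, scaled by -widthfactor.
            width = widthfactor * np.median(D[y == 1][:, y == -1].min(axis=1))
            K = np.exp(-D2 / (2.0 * width ** 2))
        if diagfactor > 0:
            # 2-norm soft margin: add (n+/N) * m * k to the diagonal,
            # where n+ counts the current example's own class.
            m = np.median(np.diag(K))
            n_same = np.where(y == 1, (y == 1).sum(), (y == -1).sum())
            K = K + np.diag((n_same / len(y)) * m * diagfactor)
        return K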
- -rdb - Allow the program to read and create RDB formatted files, which contain an additional format line after the first line of text.
- -kernelout - Compute and print the kernel matrix to stdout. Do not compute the weights.
- -threshold <value> - Set the convergence threshold. Training halts when the objective function changes by less than this amount. Default is 0.000001. Note that lowering the threshold also increases the precision with which weights are reported by the program.
- -maxiter <value> - Set the maximum number of iterations for the optimization routine. Default is 10000.
- -seed <value> - Set the seed for the random number generator. By default the seed is set from the clock.
- -notime - Do not include timing information in the output header.
- -verbose 1|2|3|4|5 - Set the verbosity level of the output to stderr. The default level is 2.
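As a usage example, the following hypothetical invocation trains a radial-basis SVM with hold-one-out cross-validation on 10 percent of the examples (the file names are invented, and the weights table is assumed to go to standard output, which the -kernelout and -verbose descriptions suggest):

    compute-weights -radial -widthfactor 0.5 -holdout 10 \
        train.txt classes.txt > weights.txt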
Bugs:
- The program does not verify that the labels in the class file match the labels in the data file.
- Tie breaking is not actually random, in the sense that repeated runs will not give different results: the decision depends only on the original order of the examples in the input files.