Description: Select features from a given data set based upon a specified measure of feature quality. Usually you will also want to set the threshold for the measure you select using the -fthreshold option.
Usage: gist-fselect [options] <labels> <primary> <secondary>
Input:
- <labels> - a multi-column, tab-delimited file of classification labels. This file must contain exactly the same number of lines as the primary data file. The first column contains labels, which must appear in the same order as in the primary data file (the program does not check this). The second and subsequent columns contain binary classifications for each label (1 for positive examples, -1 or 0 for negatives). By default the first classification column is used; a later column can be selected with the -useclassnumber option described below.
- <primary> - a tab-delimited matrix file containing strings in the first row and column, and floating point values in the rest of the matrix. Each row in the primary data file corresponds to one row in the label file.
- <secondary> - similar to <primary>. The two files must have the same number of features but may have different numbers of rows.
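Because the program does not verify that the label file and the primary matrix list their rows in the same order, it can help to check this before running gist-fselect. The following Python sketch is not part of Gist; the function names are hypothetical, and it assumes that the label file begins with a header line, as the matrix files do.

```python
import csv
import io


def read_labels(handle, class_number=1):
    """Parse a label file and return (names, is_positive).

    Assumption: the first line is a header row, mirroring the matrix files.
    class_number is 1-based, so 1 selects the first classification column.
    """
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    names, is_positive = [], []
    for row in rows[1:]:
        value = int(row[class_number])
        if value not in (1, 0, -1):
            raise ValueError(f"unexpected class value {value!r} for {row[0]}")
        names.append(row[0])
        is_positive.append(value == 1)  # -1 and 0 both mean negative
    return names, is_positive


def read_matrix(handle):
    """Parse a matrix file and return (feature_names, row_names, values)."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    feature_names = rows[0][1:]
    row_names = [row[0] for row in rows[1:]]
    values = [[float(cell) for cell in row[1:]] for row in rows[1:]]
    return feature_names, row_names, values


def check_alignment(label_names, row_names):
    """Raise if the label file and the primary matrix order rows differently."""
    if label_names != row_names:
        raise ValueError("label file and primary matrix rows differ in order")


# Small in-memory example; real use would pass open file handles instead.
label_text = "name\tclass1\tclass2\ng1\t1\t0\ng2\t-1\t1\n"
matrix_text = "name\tf1\tf2\ng1\t1.5\t2\ng2\t3\t4\n"
names, is_positive = read_labels(io.StringIO(label_text))
features, row_names, values = read_matrix(io.StringIO(matrix_text))
check_alignment(names, row_names)
```

Passing class_number=2 to read_labels corresponds to choosing the second classification column, as the -useclassnumber option does.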
Output: Chooses a subset of features from the primary data matrix using a given quality metric. Writes to standard output a version of the secondary data matrix in which low-quality columns have been removed.
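The output step can be pictured as a column filter applied to the secondary matrix. The Python sketch below is illustrative only and is not the Gist implementation; filter_columns and write_matrix are hypothetical names, and the indices of the retained features are assumed to have been chosen already from the primary matrix.

```python
import sys


def filter_columns(rows, keep):
    """Keep the row-name column plus the feature columns listed in keep.

    rows: list of rows, each a list of cells with the row name first.
    keep: set of 0-based feature indices (not counting the name column).
    """
    return [
        [row[0]] + [cell for j, cell in enumerate(row[1:]) if j in keep]
        for row in rows
    ]


def write_matrix(rows, out=sys.stdout):
    """Write rows as tab-delimited text."""
    for row in rows:
        out.write("\t".join(row) + "\n")


# Hypothetical secondary matrix with three features; features 0 and 2 are kept.
secondary = [
    ["name", "f1", "f2", "f3"],
    ["s1", "0.1", "0.2", "0.3"],
]
filtered = filter_columns(secondary, keep={0, 2})
```

Because the same column indices are applied to both matrices, the two files need the same features in the same order, while their row counts may differ.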
Options:
- -metric fisher|ttest|welch|mannwhitney|sam|tnom - Specify the metric used to evaluate individual features.
By default, features are scored using the Fisher criterion score.
- The Fisher criterion score is (m1 - m2)^2 / (v1 + v2), where mi and vi are the mean and variance of the given feature in class i.
- The standard t-test is |m1 - m2| / sqrt((v/n1) + (v/n2)), where ni is the number of examples in class i, and v is the pooled variance across both classes. The score is reported as a negative log10 p-value.
- Welch's approximate t-test is |m1 - m2| / sqrt((v1/n1) + (v2/n2)). The score is reported as a negative log10 p-value.
- The Mann-Whitney test is a nonparametric test. Student's t-test is used to break ties. The score is reported as a negative log10 p-value.
- The SAM metric is adapted from the "Significance analysis of microarrays" method developed by Tusher et al. (Proc Natl Acad Sci U S A 2001 Apr 24;98(9):5116-21). SAM uses the calculation method described in detail in the SAM user manual (Chu et al.). Our implementation of SAM is currently only partial: it calculates the raw statistic for ranking genes but does not calculate error rates at particular score cutoffs.
- The threshold number of misclassifications (tnom) method is adapted from "Tissue Classification with Gene Expression Profiles" (Ben-Dor et al., Journal of Computational Biology 7 pp 559-583 (2000)). It is a nonparametric method based on "decision stumps". Our implementation of tnom is currently only partial: it calculates raw scores but not p-values. This provides a ranking of the genes but cannot be used to determine error rates.
- -scores <file> - Write to the given file a two-column matrix containing the computed quality score for each feature. The score depends on the metric: the Fisher score for fisher, the negative log10 of the p-value for ttest, welch, and mannwhitney, and the raw score for sam and tnom.
- -useclassnumber <value> - If the label file contains multiple classification columns, use the one indicated by this number. The first classification column is column 1. If this option is omitted, the first classification column is used.
- -threshtype percent|number|value - Select different means of setting the feature selection threshold. The "percent" option chooses the top n% of the features. The "number" option chooses the top n features. The "value" option chooses features that score above n. The default setting is "percent".
- -fthreshold <value> - Set the threshold for feature selection. The default setting depends upon the threshold type: for "percent" and "number", the default is 10; for "value" it is 1.
- -rdb - Allow the program to read and create RDB formatted files, which contain an additional format line after the first line of text.
- -verbose 1|2|3|4|5 - Set the verbosity level of the output to stderr. The default level is 2.
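To make the interaction between -metric, -threshtype, and -fthreshold concrete, here is a minimal Python sketch of the default Fisher scoring followed by threshold selection. It is an illustration under stated assumptions, not the Gist source: the function names are hypothetical, the variances are taken to be sample variances (n - 1 denominator), the handling of a zero denominator is a guess, and the "percent" count is rounded down. The actual program may differ on each of these points.

```python
from statistics import mean, variance


def fisher_scores(values, is_positive):
    """Return (m1 - m2)^2 / (v1 + v2) for each feature column.

    values: list of rows of floats; is_positive: one bool per row.
    Assumption: sample variance, so each class needs at least two rows.
    """
    scores = []
    for j in range(len(values[0])):
        pos = [row[j] for row, p in zip(values, is_positive) if p]
        neg = [row[j] for row, p in zip(values, is_positive) if not p]
        numerator = (mean(pos) - mean(neg)) ** 2
        denominator = variance(pos) + variance(neg)
        if denominator > 0:
            scores.append(numerator / denominator)
        else:
            # Assumption: constant features score 0 unless the class means differ.
            scores.append(float("inf") if numerator > 0 else 0.0)
    return scores


def select_features(scores, threshtype="percent", fthreshold=None):
    """Return the sorted 0-based indices of the features that pass the threshold."""
    if fthreshold is None:
        # Documented defaults: 10 for "percent" and "number", 1 for "value".
        fthreshold = 1 if threshtype == "value" else 10
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    if threshtype == "value":
        keep = [j for j in ranked if scores[j] > fthreshold]
    elif threshtype == "number":
        keep = ranked[: int(fthreshold)]
    elif threshtype == "percent":
        # Assumption: the count is rounded down.
        keep = ranked[: int(len(scores) * fthreshold / 100)]
    else:
        raise ValueError(f"unknown threshold type: {threshtype}")
    return sorted(keep)


# Two positive and two negative examples; only the first feature separates them.
values = [[1.0, 5.0], [2.0, 5.5], [8.0, 5.0], [9.0, 5.5]]
is_positive = [True, True, False, False]
scores = fisher_scores(values, is_positive)
top_one = select_features(scores, threshtype="number", fthreshold=1)
```

With only two features, the default of "percent" with a threshold of 10 selects nothing under the round-down assumption, which illustrates why the threshold usually needs to be set explicitly for a given data set.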