rankfeatures

Rank key features by class separability criteria

Syntax

IDX = rankfeatures(X,GROUP)

IDX = rankfeatures(X,GROUP,Name=Value)

[IDX,Z] = rankfeatures(X,GROUP,___)

Description

`IDX = rankfeatures(X,GROUP)` ranks the features in `X` using an independent evaluation criterion for binary classification. `X` is a matrix where every column is an observed vector and the number of rows corresponds to the original number of features. `GROUP` contains the class labels. `IDX` is a list of indices to the rows of `X` with the most significant features.


`IDX = rankfeatures(X,GROUP,Name=Value)` uses additional options specified by one or more name-value arguments.


`[IDX,Z] = rankfeatures(X,GROUP,___)` also returns a list of absolute values of the criterion used for every feature.
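The three syntaxes can be exercised on synthetic data, as in the minimal sketch below. The data, the variable names (`labels`, `idxTop`), and the planted informative row are assumptions made for illustration and are not part of the shipped examples. The sketch requires Bioinformatics Toolbox, and the `Name=Value` form requires R2021a or later.

```matlab
% Illustrative synthetic data (an assumption, not a shipped data set):
% 50 features (rows) by 40 observations (columns), two classes of 20.
rng(0);                                  % reproducible random numbers
X = randn(50,40);
labels = [zeros(1,20) ones(1,20)];       % numeric labels with two unique values

% Plant one strongly discriminating feature in row 7.
X(7,labels == 1) = X(7,labels == 1) + 5;

% Syntax 1: rank all features with the default criterion ("ttest").
idx = rankfeatures(X,labels);

% Syntax 2: choose a criterion and limit the number of returned indices.
idxTop = rankfeatures(X,labels,Criterion="wilcoxon",NumberOfIndices=3);

% Syntax 3: also return the criterion values.
[idx,z] = rankfeatures(X,labels);

disp(idx(1))                             % row index of the top-ranked feature
```

Because row 7 is shifted by five standard deviations in one class, it should appear first in both rankings.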

Examples


Find a reduced set of genes to differentiate breast cancer cells


Find a reduced set of genes that is sufficient for differentiating breast cancer cells from all other types of cancer in the t-matrix NCI60 data set.

Load sample data.

load NCI60tmatrix

Get a logical index vector to the breast cancer cells.

BC = GROUP == 8;

Select features.

I = rankfeatures(X,BC,NumberOfIndices=12);

Test features with a linear discriminant classifier.

C = classify(X(I,:)',X(I,:)',double(BC));
cp = classperf(BC,C);
cp.CorrectRate

ans = 1

Use cross-correlation weighting to further reduce the required number of genes.

I = rankfeatures(X,BC,'CCWeighting',0.7,'NumberOfIndices',8);
C = classify(X(I,:)',X(I,:)',double(BC));
cp = classperf(BC,C);
cp.CorrectRate

ans = 1

Find discriminant peaks of two groups of signals


Find the discriminant peaks of two groups of signals with Gaussian pulses modulated by two different sources.

Load data.

load GaussianPulses

Specify the regional information to outweigh Z-value of features as a function handle. Set the number of output indices to 5.

f = rankfeatures(y',grp,NWeighting=@(x) x/10+5,NumberOfIndices=5);
plot(t,y(grp==1,:),'b',t,y(grp==2,:),'g',t(f),1.35,'vr');

Figure contains an axes object. The axes object contains 45 objects of type line.

Input Arguments


`X`—Sample data
numeric matrix

Sample data, specified as a numeric matrix. Each column is an observed vector, and each row is a feature.

Data Types: `double`

`GROUP`—Class labels
numeric vector | string vector | cell array of character vectors

Class labels, specified as a numeric vector, string vector, or cell array of character vectors. `numel(GROUP)` is the same as the number of columns in `X`. `GROUP` must have only two unique values. If it contains any `NaN` values, the function ignores the corresponding observation vector in `X`.

Data Types: `double` | `string` | `cell`

Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: `[idx,x] = rankfeatures(x,groups,Criterion="entropy",NWeighting=0.2)` specifies to use the relative entropy as the criterion to assess the feature significance and a regional information value of 0.2 to outweigh the Z-value of potential features.

Before R2021a, use commas to separate each name and value, and enclose `Name` in quotes.

Example: `[idx,x] = rankfeatures(x,groups,'Criterion',"entropy",'NWeighting',0.2)`

`Criterion`—Criterion to assess significance of feature
`"ttest"` (default) | `"entropy"` | `"bhattacharyya"` | `"roc"` | `"wilcoxon"`

Criterion to assess the significance of each feature for separating two labeled groups, specified as one of the following:

`"ttest"` — Absolute value of the two-sample t-test statistic with pooled variance estimate.
`"entropy"` — Relative entropy, also known as Kullback-Leibler distance or divergence.
`"bhattacharyya"` — Minimum attainable classification error or Chernoff bound.
`"roc"` — Area between the empirical receiver operating characteristic (ROC) curve and the random classifier slope.
`"wilcoxon"` — Absolute value of the standardized u-statistic of a two-sample unpaired Wilcoxon test, also known as Mann-Whitney.

Note

`"ttest"`, `"entropy"`, and `"bhattacharyya"` assume normally distributed classes, while `"roc"` and `"wilcoxon"` are nonparametric tests. All tests are feature independent.

Data Types: `char` | `string`
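To make the `"ttest"` criterion concrete, the sketch below computes the textbook absolute two-sample t-statistic with a pooled variance estimate for each feature, using only base MATLAB. The toy matrix and the variable names are assumptions for illustration, and the code is not claimed to reproduce the internal implementation of `rankfeatures`.

```matlab
% Hypothetical toy data: 3 features (rows) by 6 observations (columns).
X = [1 2 3 7 8 9;
     5 5 6 5 6 5;
     0 1 0 1 0 1];
grp = logical([0 0 0 1 1 1]);    % two classes, three observations each

A = X(:,~grp);                   % observations of the first class
B = X(:,grp);                    % observations of the second class
nA = size(A,2);
nB = size(B,2);

% Pooled variance of the two classes, computed per feature (row).
sp2 = ((nA-1)*var(A,0,2) + (nB-1)*var(B,0,2)) / (nA + nB - 2);

% Absolute two-sample t-statistic per feature.
z = abs(mean(A,2) - mean(B,2)) ./ sqrt(sp2*(1/nA + 1/nB));

% Rank features from most to least separable.
[~,order] = sort(z,'descend');
disp(order')
```

The first row separates the two classes cleanly, so it receives the largest statistic; the second row has identical class means and scores zero.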

`CCWeighting`—Correlation information to outweigh Z-value of features
`0` (default) | numeric scalar between `0` and `1`

Correlation information to outweigh the Z-value of potential features, specified as a numeric scalar between `0` and `1`.

The function uses $Z \times (1 - \alpha \times \rho)$ to calculate the weight, where ρ is the average of the absolute values of the cross-correlation coefficient between the candidate feature and all previously selected features. α is the `CCWeighting` value that sets the weighting factor.

By default, α is `0`, and the function does not weight the potential features. A large value of ρ (close to 1) outweighs the significance statistic, meaning that features that are highly correlated with the features already picked are less likely to be included in the output list.

Data Types: `double`
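The effect of the correlation weight can be seen in a small base-MATLAB sketch of the formula $Z \times (1 - \alpha \times \rho)$. The toy data, the criterion values in `Z`, and the names `selected` and `candidates` are hypothetical; this only evaluates the stated formula for one selection step and is not the function's internal code.

```matlab
% Hypothetical toy data: rows are features, columns are observations.
X = [1  2  3  4  5  6;      % feature 1
     2  4  6  8 10 12;      % feature 2, perfectly correlated with feature 1
     1 -1  1 -1  1 -1];     % feature 3, weakly correlated with feature 1
Z = [5; 4.9; 3];            % assumed unweighted criterion values
alpha = 0.7;                % plays the role of CCWeighting

selected = 1;               % feature 1 has the largest Z and is picked first
candidates = [2 3];

% Absolute correlation between every pair of features.
R = abs(corrcoef(X'));

% rho: average absolute correlation with the already selected features.
rho = mean(R(candidates,selected),2);

% Weighted criterion for the remaining candidates.
weighted = Z(candidates) .* (1 - alpha*rho);
disp(weighted')
```

Feature 2 has a higher raw Z-value than feature 3, but its perfect correlation with the already selected feature 1 reduces its weighted value below that of feature 3.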

`NWeighting`—Regional information to outweigh Z-value of features
`0` (default) | nonnegative scalar | function handle

Regional information to outweigh the Z-value of potential features, specified as a nonnegative scalar or function handle.

The function uses $Z \times \left(1 - e^{-(D/\beta)^{2}}\right)$ to calculate the weight, where D is the distance (in rows) between the candidate feature and previously selected features. β is the `NWeighting` value that sets the weighting factor. β must be greater than or equal to `0`.

By default, β is `0`, and the function does not weight the potential features. A small value of D (close to `0`) outweighs the significance statistics of only close features. This means that features that are close to already picked features are less likely to be included in the output list. This option is useful for extracting features from time series with temporal correlation.

β can also be a function of the feature location, specified using `@` or an anonymous function. In both cases, `rankfeatures` passes the row position of the feature to the specified function and expects back a value greater than or equal to `0`.

Note

You can use `CCWeighting` and `NWeighting` together.

Data Types: `double` | `function_handle`
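As a rough illustration of the regional weight $Z \times (1 - e^{-(D/\beta)^{2}})$, the base-MATLAB sketch below applies it for a single selection step. The criterion values, the scalar `beta`, and the simplification to one previously selected row (`picked`) are assumptions; how the function combines distances to several selected features is not covered here.

```matlab
% Hypothetical criterion values, one per feature (row position 1 to 10).
Z = [2 9 10 9 2 1 1 6 1 1]';
beta = 2;                        % plays the role of a scalar NWeighting

picked = 3;                      % row with the largest Z, selected first

% Distance in rows between each candidate and the selected feature.
D = abs((1:numel(Z))' - picked);

% Regional weighting: neighbors of the selected row are suppressed.
weighted = Z .* (1 - exp(-(D/beta).^2));

% Exclude the already selected row and pick the next feature.
weighted(picked) = -Inf;
[~,next] = max(weighted);
disp(next)
```

Rows 2 and 4 have high raw values but sit next to the selected row, so the more distant row 8 is chosen instead.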

`NumberOfIndices`—Number of output indices
positive scalar

Number of output indices in `IDX`, specified as a positive scalar. By default, the number of indices is the same as the number of features when α and β are `0`. Otherwise, the number of indices is set to `20`.

Data Types: `double`

`CrossNorm`—Method for independent normalization across observations
`"none"` (default) | `"meanvar"` | `"softmax"` | `"minmax"`

Method for independent normalization across observations for every feature, specified as one of the following:

`"none"` (default) — No normalization.
`"meanvar"` — $X_{\text{new}} = \frac{X - \mu}{\sigma}$
`"softmax"` — $X_{\text{new}} = \frac{1}{1 + e^{\frac{\mu - X}{\sigma}}}$
`"minmax"` — $X_{\text{new}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$

In these equations, μ = `mean(X)`, σ = `std(X)`, $X_{\min}$ = `min(X)`, and $X_{\max}$ = `max(X)`.

Cross-normalization ensures comparability among different features, although it is not always necessary because the selected criterion might already account for this.

Data Types: `char` | `string`
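The three normalization formulas can be tried directly in base MATLAB. The sketch below assumes the statistics are taken per feature (row) across observations (columns), which matches the description of this argument; the toy matrix and variable names are illustrative, and implicit expansion requires R2016b or later.

```matlab
% Hypothetical data: 2 features (rows) on very different scales.
X = [ 1  2  3  4  5;
     10 20 30 40 50];

mu    = mean(X,2);               % per-feature mean across observations
sigma = std(X,0,2);              % per-feature standard deviation
xmin  = min(X,[],2);
xmax  = max(X,[],2);

Xmeanvar = (X - mu) ./ sigma;                     % "meanvar"
Xsoftmax = 1 ./ (1 + exp((mu - X) ./ sigma));     % "softmax"
Xminmax  = (X - xmin) ./ (xmax - xmin);           % "minmax"

disp(Xminmax)
```

After any of the three normalizations, the two rows in this example become identical, which is the sense in which features on different scales are made comparable.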

Output Arguments


`IDX`—List of indices
numeric vector

List of indices to the rows of `X` with the most significant features, returned as a numeric vector.

`Z`—List of absolute values of criterion for features
numeric vector

List of absolute values of the `Criterion` used for the features, returned as a numeric vector.


Version History

Introduced before R2006a
