Main Content

fscmrmr

Rank features for classification using minimum redundancy maximum relevance (MRMR) algorithm

Description

idx= fscmrmr(Tbl,ResponseVarName)ranks features (predictors) using theMRMR algorithm. The tableTblcontains predictor variables and a response variable, andResponseVarNameis the name of the response variable inTbl. The function returnsidx, which contains the indices of predictors ordered by predictor importance. You can useidxto select important predictors for classification problems.

idx= fscmrmr(Tbl,formula)specifies a response variable and predictor variables to consider among the variables inTblby usingformula.

example

idx= fscmrmr(Tbl,Y)ranks predictors inTblusing the response variableY.

example

idx= fscmrmr(X,Y)ranks predictors inXusing the response variableY.

idx= fscmrmr(___,Name,Value)specifies additional options using one or more name-value pair arguments in addition to any of the input argument combinations in the previous syntaxes. For example, you can specify prior probabilities and observation weights.

[idx,scores) = fscmrmr (___)also returns the predictor scoresscores. A large score value indicates that the corresponding predictor is important.

Examples

collapse all

Load the sample data.

loadionosphere

Rank the predictors based on importance.

[idx,scores] = fscmrmr(X,Y);

Create a bar plot of the predictor importance scores.

bar(scores(idx)) xlabel('Predictor rank') ylabel('Predictor importance score')

Figure contains an axes object. The axes object contains an object of type bar.

The drop in score between the first and second most important predictors is large, while the drops after the sixth predictor are relatively small. A drop in the importance score represents the confidence of feature selection. Therefore, the large drop implies that the software is confident of selecting the most important predictor. The small drops indicate that the difference in predictor importance are not significant.

Select the top five most important predictors. Find the columns of these predictors inX.

idx(1:5)
ans =1×55 4 1 7 24

The fifth column ofXis the most important predictor ofY.

Find important predictors by usingfscmrmr. Then compare the accuracies of the full classification model (which uses all the predictors) and a reduced model that uses the five most important predictors by usingtestckfold.

Load the census1994 data set.

loadcensus1994

The tableadultdataincensus1994包含demographicdata from the US Census Bureau to predict whether an individual makes over $50,000 per year. Display the first three rows of the table.

head(adultdata,3)
ans=3×15 tableage workClass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country salary ___ ________________ __________ _________ _____________ __________________ _________________ _____________ _____ ____ ____________ ____________ ______________ ______________ ______ 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K 38 Private 2.1565e+05 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K

The output arguments offscmrmrinclude only the variables ranked by the function. Before passing a table to the function, move the variables that you do not want to rank, including the response variable and weight, to the end of the table so that the order of the output arguments is consistent with the order of the table.

In the tableadultdata, the third columnfnlwgtis the weight of the samples, and the last columnsalaryis the response variable. Movefnlwgtto the left ofsalaryby using themovevarsfunction.

adultdata = movevars(adultdata,'fnlwgt','before','salary'); head(adultdata,3)
ans=3×15 tableage workClass education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country fnlwgt salary ___ ________________ _________ _____________ __________________ _________________ _____________ _____ ____ ____________ ____________ ______________ ______________ __________ ______ 39 State-gov Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States 77516 <=50K 50 Self-emp-not-inc Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States 83311 <=50K 38 Private HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States 2.1565e+05 <=50K

Rank the predictors inadultdata. Specify the columnsalaryas the response variable.

[idx,scores] = fscmrmr(adultdata,'salary',“重量”,'fnlwgt');

Create a bar plot of predictor importance scores. Use the predictor names for thex-axis tick labels.

bar(scores(idx)) xlabel('Predictor rank') ylabel('Predictor importance score') xticklabels(strrep(adultdata.Properties.VariableNames(idx),'_','\_')) xtickangle(45)

Figure contains an axes object. The axes object contains an object of type bar.

The five most important predictors arerelationship,capital_loss,capital_gain,education, andhours_per_week.

Compare the accuracy of a classification tree trained with all predictors to the accuracy of one trained with the five most important predictors.

Create a classification tree template using the default options.

C = templateTree;

Define the tabletbl1to contain all predictors and the tabletbl2to contain the five most important predictors.

tbl1 = adultdata(:,adultdata.Properties.VariableNames(idx(1:13))); tbl2 = adultdata(:,adultdata.Properties.VariableNames(idx(1:5)));

Pass the classification tree template and the two tables to thetestckfoldfunction. The function compares the accuracies of the two models by repeated cross-validation. Specify'Alternative','greater'to test the null hypothesis that the model with all predictors is, at most, as accurate as the model with the five predictors. The'greater'option is available when'Test'is'5x2t'(5-by-2 pairedttest) or'10x10t'(10-by-10 repeated cross-validationttest).

[h,p] = testckfold(C,C,tbl1,tbl2,adultdata.salary,“重量”,adultdata.fnlwgt,'Alternative','greater','Test','5x2t')
h =logical0
p = 0.9969

hequals 0 and thep-value is almost 1, indicating failure to reject the null hypothesis. Using the model with the five predictors does not result in loss of accuracy compared to the model with all the predictors.

Now train a classification tree using the selected predictors.

mdl = fitctree(adultdata,'salary ~ relationship + capital_loss + capital_gain + education + hours_per_week',...“重量”,adultdata.fnlwgt)
mdl = ClassificationTree PredictorNames: {1x5 cell} ResponseName: 'salary' CategoricalPredictors: [1 2] ClassNames: [<=50K >50K] ScoreTransform: 'none' NumObservations: 32561 Properties, Methods

Input Arguments

collapse all

Sample data, specified as a table. Multicolumn variables and cell arrays other than cell arrays of character vectors are not allowed.

Each row ofTblcorresponds to one observation, and each column corresponds to one predictor variable. Optionally,Tblcan contain additional columns for a response variable and observation weights.

A response variable can be a categorical, character, or string array, logical or numeric vector, or cell array of character vectors. If the response variable is a character array, then each element of the response variable must correspond to one row of the array.

  • IfTblcontains the response variable, and you want to use all remaining variables inTblas predictors, then specify the response variable by usingResponseVarName. IfTblalso contains the observation weights, then you can specify the weights by usingWeights.

  • IfTblcontains the response variable, and you want to use only a subset of the remaining variables inTblas predictors, then specify the subset of variables by usingformula.

  • IfTbldoes not contain the response variable, then specify a response variable by usingY. The response variable andTblmust have the same number of rows.

Iffscmrmruses a subset of variables inTblas predictors, then the function indexes the predictors using only the subset. The values in the'CategoricalPredictors'name-value pair argument and the output argumentidxdo not count the predictors that the function does not rank.

fscmrmrconsidersNaN,''(empty character vector),""(empty string),, andvalues inTblfor a response variable to be missing values.fscmrmrdoes not use observations with missing values for a response variable.

Data Types:table

Response variable name, specified as a character vector or string scalar containing the name of a variable inTbl.

For example, if a response variable is the columnYofTbl(Tbl.Y), then specifyResponseVarNameas"Y".

Data Types:char|string

Explanatory model of the response variable and a subset of the predictor variables, specified as a character vector or string scalar in the form“Y ~ x1 + x2 + x3”. In this form,Yrepresents the response variable, andx1,x2, andx3represent the predictor variables.

To specify a subset of variables inTblas predictors, use a formula. If you specify a formula, thenfscmrmrdoes not rank any variables inTblthat do not appear informula.

The variable names in the formula must be both variable names inTbl(Tbl.Properties.VariableNames) and valid MATLAB®identifiers. You can verify the variable names inTblby using theisvarnamefunction. If the variable names are not valid, then you can convert them by using thematlab.lang.makeValidNamefunction.

Data Types:char|string

Response variable, specified as a numeric, categorical, or logical vector, a character or string array, or a cell array of character vectors. Each row ofYrepresents the labels of the corresponding row ofX.

fscmrmrconsidersNaN,''(empty character vector),""(empty string),, andvalues inYto be missing values.fscmrmrdoes not use observations with missing values forY.

Data Types:single|double|categorical|logical|char|string|cell

Predictor data, specified as a numeric matrix. Each row ofXcorresponds to one observation, and each column corresponds to one predictor variable.

Data Types:single|double

Name-Value Arguments

Specify optional pairs of arguments asName1=Value1,...,NameN=ValueN, whereNameis the argument name andValue相应的价值。名称-值参数must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and encloseNamein quotes.

Example:'CategoricalPredictors',[1 2],'Verbose',2specifies the first two predictor variables as categorical variables and specifies the verbosity level as 2.

分类预测列表,指定为一个of the values in this table.

Value Description
Vector of positive integers

Each entry in the vector is an index value indicating that the corresponding predictor is categorical. The index values are between 1 andp, wherepis the number of predictors used to train the model.

Iffscmrmruses a subset of input variables as predictors, then the function indexes the predictors using only the subset. TheCategoricalPredictorsvalues do not count the response variable, observation weight variable, or any other variables that the function does not use.

Logical vector

Atrueentry means that the corresponding predictor is categorical. The length of the vector isp.

Character matrix Each row of the matrix is the name of a predictor variable. The names must match the names inTbl. Pad the names with extra blanks so each row of the character matrix has the same length.
String array or cell array of character vectors Each element in the array is the name of a predictor variable. The names must match the names inTbl.
"all" All predictors are categorical.

By default, if the predictor data is in a table (Tbl),fscmrmrassumes that a variable is categorical if it is a logical vector, unordered categorical vector, character array, string array, or cell array of character vectors. If the predictor data is a matrix (X),fscmrmrassumes that all predictors are continuous. To identify any other predictors as categorical predictors, specify them by using theCategoricalPredictorsname-value argument.

Example:"CategoricalPredictors","all"

Example:CategoricalPredictors=[1 5 6 8]

Data Types:single|double|logical|char|string|cell

Names of the classes to use for ranking, specified as the comma-separated pair consisting of'ClassNames'and a categorical, character, or string array, a logical or numeric vector, or a cell array of character vectors.ClassNamesmust have the same data type asYor the response variable inTbl.

IfClassNamesis a character array, then each element must correspond to one row of the array.

Use'ClassNames'to:

  • Specify the order of thePriordimensions that corresponds to the class order.

  • Select a subset of classes for ranking. For example, suppose that the set of all distinct class names inYis{'a','b','c'}. To rank predictors using observations from classes'a'and'c'only, specify'ClassNames',{'a','c'}.

The default value for'ClassNames'is the set of all distinct class names inYor the response variable inTbl. The default'ClassNames'value has mathematical ordering if the response variable is ordinal. Otherwise, the default value has alphabetical ordering.

Example:'ClassNames',{'b','g'}

Data Types:categorical|char|string|logical|single|double|cell

Prior probabilities for each class, specified as one of the following:

  • Character vector or string scalar.

    • 'empirical'determines class probabilities from class frequencies in the response variable inYorTbl. If you pass observation weights,fscmrmruses the weights to compute the class probabilities.

    • 'uniform'sets all class probabilities to be equal.

  • Vector (one scalar value for each class). To specify the class order for the corresponding elements of'Prior', set the'ClassNames'name-value argument.

  • StructureSwith two fields.

    • S.ClassNamescontains the class names as a variable of the same type as the response variable inYorTbl.

    • S.ClassProbscontains a vector of corresponding probabilities.

fscmrmrnormalizes the weights in each class (“重量”) to add up to the value of the prior probability of the respective class.

Example:'Prior','uniform'

Data Types:char|string|single|double|struct

指标是否使用missing values in predictors, specified as eithertrueto use the values for ranking, orfalseto discard the values.

fscmrmrconsidersNaN,''(empty character vector),""(empty string),, andvalues to be missing values.

If you specifyUseMissingastrue, thenfscmrmruses missing values for ranking. For a categorical variable,fscmrmrtreats missing values as an extra category. For a continuous variable,fscmrmrplacesNaNvalues in a separate bin for binning.

If you specifyUseMissingasfalse, thenfscmrmrdoes not use missing values for ranking. Becausefscmrmrcomputes mutual information for each pair of variables, the function does not discard an entire row when values in the row are partially missing.fscmrmruses all pair values that do not include missing values.

Example:"UseMissing",true

Example:UseMissing=true

Data Types:logical

Verbosity level, specified as the comma-separated pair consisting of'Verbose'and a nonnegative integer. The value ofVerbosecontrols the amount of diagnostic information that the software displays in the Command Window.

  • 0 —fscmrmrdoes not display any diagnostic information.

  • 1 —fscmrmrdisplays the elapsed times for computingMutual Informationand ranking predictors.

  • ≥ 2 —fscmrmrdisplays the elapsed times and more messages related to computing mutual information. The amount of information increases as you increase the'Verbose'value.

Example:'Verbose',1

Data Types:single|double

Observation weights, specified as the comma-separated pair consisting of“重量”and a vector of scalar values or the name of a variable inTbl. The function weights the observations in each row ofXorTblwith the corresponding value inWeights. The size ofWeightsmust equal the number of rows inXorTbl.

If you specify the input data as a tableTbl, thenWeightscan be the name of a variable inTblthat contains a numeric vector. In this case, you must specifyWeightsas a character vector or string scalar. For example, if the weight vector is the columnWofTbl(Tbl.W), then specify'Weights,'W'.

fscmrmrnormalizes the weights in each class to add up to the value of the prior probability of the respective class.

Data Types:single|double|char|string

Output Arguments

collapse all

Indices of predictors inXorTblordered by predictor importance, returned as a 1-by-rnumeric vector, whereris the number of ranked predictors.

Iffscmrmruses a subset of variables inTblas predictors, then the function indexes the predictors using only the subset. For example, supposeTblincludes 10 columns and you specify the last five columns ofTblas the predictor variables by usingformula. Ifidx(3)is5, then the third most important predictor is the 10th column inTbl, which is the fifth predictor in the subset.

Predictor scores, returned as a 1-by-rnumeric vector, whereris the number of ranked predictors.

A large score value indicates that the corresponding predictor is important. Also, a drop in the feature importance score represents the confidence of feature selection. For example, if the software is confident of selecting a featurex, then the score value of the next most important feature is much smaller than the score value ofx.

  • If you useXto specify the predictors or use all the variables inTblas predictors, then the values inscoreshave the same order as the predictors inXorTbl.

  • If you specify a subset of variables inTblas predictors, then the values inscoreshave the same order as the subset.

For example, supposeTblincludes 10 columns and you specify the last five columns ofTblas the predictor variables by usingformula. Then,score(3)contains the score value of the 8th column inTbl, which is the third predictor in the subset.

More About

collapse all

Mutual Information

The mutual information between two variables measures how much uncertainty of one variable can be reduced by knowing the other variable.

The mutual informationIof the discrete random variablesXandZis defined as

I ( X , Z ) = i , j P ( X = x i , Z = z j ) log P ( X = x i , Z = z j ) P ( X = x i ) P ( Z = z j ) .

IfXandZare independent, thenIequals 0. IfXandZare the same random variable, thenIequals the entropy ofX.

Thefscmrmrfunction uses this definition to compute the mutual information values for both categorical (discrete) and continuous variables.fscmrmrdiscretizes a continuous variable into 256 bins or the number of unique values in the variable if it is less than 256. The function finds optimal bivariate bins for each pair of variables using the adaptive algorithm[2].

Algorithms

collapse all

Minimum Redundancy Maximum Relevance (MRMR) Algorithm

The MRMR algorithm[1]finds an optimal set of features that is mutually and maximally dissimilar and can represent the response variable effectively. The algorithm minimizes the redundancy of a feature set and maximizes the relevance of a feature set to the response variable. The algorithm quantifies the redundancy and relevance using the mutual information of variables—pairwise mutual information of features and mutual information of a feature and the response. You can use this algorithm for classification problems.

The goal of the MRMR algorithm is to find an optimal setSof features that maximizesVS, the relevance ofSwith respect to a response variabley, and minimizesWS, the redundancy ofS, whereVSandWSare defined withmutual informationI:

V S = 1 | S | x S I ( x , y ) ,

W S = 1 | S | 2 x , z S I ( x , z ) .

|的|is the number of features inS.

Finding an optimal setSrequires considering all2|Ω|combinations, whereΩis the entire feature set. Instead, the MRMR algorithm ranks features through the forward addition scheme, which requiresO(|Ω|·|S|)computations, by using the mutual information quotient (MIQ) value.

MIQ x = V x W x ,

whereVxandWxare the relevance and redundancy of a feature, respectively:

V x = I ( x , y ) ,

W x = 1 | S | z S I ( x , z ) .

Thefscmrmrfunction ranks all features inΩ并返回idx(the indices of features ordered by feature importance) using the MRMR algorithm. Therefore, the computation cost becomesO(|Ω|2). The function quantifies the importance of a feature using a heuristic algorithm and returns a score (scores). A large score value indicates that the corresponding predictor is important. Also, a drop in the feature importance score represents the confidence of feature selection. For example, if the software is confident of selecting a featurex, then the score value of the next most important feature is much smaller than the score value ofx. You can use the outputs to find an optimal setSfor a given number of features.

fscmrmrranks features as follows:

  1. Select the feature with the largest relevance, max x Ω V x . Add the selected feature to an empty setS.

  2. Find the features with nonzero relevance and zero redundancy in the complement ofS,Sc.

    • IfScdoes not include a feature with nonzero relevance and zero redundancy, go to step 4.

    • Otherwise, select the feature with the largest relevance, max x S c , W x = 0 V x . Add the selected feature to the setS.

  3. Repeat Step 2 until the redundancy is not zero for all features inSc.

  4. Select the feature that has the largest MIQ value with nonzero relevance and nonzero redundancy inSc, and add the selected feature to the setS.

    max x S c MIQ x = max x S c I ( x , y ) 1 | S | z S I ( x , z ) .

  5. Repeat Step 4 until the relevance is zero for all features inSc.

  6. Add the features with zero relevance toSin random order.

The software can skip any step if it cannot find a feature that satisfies the conditions described in the step.

References

[1] Ding, C., and H. Peng. "Minimum redundancy feature selection from microarray gene expression data."Journal of Bioinformatics and Computational Biology.Vol. 3, Number 2, 2005, pp. 185–205.

[2] Darbellay, G. A., and I. Vajda. "Estimation of the information by an adaptive partitioning of the observation space."IEEE Transactions on Information Theory.Vol. 45, Number 4, 1999, pp. 1315–1321.

Version History

介绍了R2019b

expand all

Behavior changed in R2020a