bindata

Binned predictor variables

Description

bdata = bindata(sc) returns a table of binned predictor variables. The table is the same size as the creditscorecard data, but only the predictors specified in the creditscorecard object's PredictorVars property are binned; the remaining variables are unchanged.

bdata = bindata(sc,data) bins the predictor variables in data and returns them as a table. The table is the same size as the data input, but only the predictors specified in the creditscorecard object's PredictorVars property are binned; the remaining variables are unchanged.

bdata = bindata(sc,Name,Value) returns a table of binned predictor variables using optional name-value pair arguments. The table is the same size as the data input, but only the predictors specified in the creditscorecard object's PredictorVars property are binned; the remaining variables are unchanged.

Examples

This example shows how to use the bindata function to bin or discretize data.

Suppose bin ranges of

  • '0 to 30'

  • '31 to 50'

  • '51 and up'

are determined for the age variable (through manual or automatic binning). Binning a data point with age 41 means placing it in the bin that contains 41, which is the second bin, '31 to 50'. Binning is thus the mapping from the original data into discrete groups, or bins. In this example, you can say that a 41-year-old is mapped into bin number 2, or that it is binned into the '31 to 50' category. If you know the Weight of Evidence (WOE) value for each of the three bins, you can also replace the data point 41 with the WOE value of the second bin. bindata supports the three binning formats just mentioned:

  • Bin number (where the 'OutputType' name-value pair argument is set to 'BinNumber'); this is the default option, and in this case, 41 is mapped to bin 2.

  • Categorical (where the 'OutputType' name-value pair argument is set to 'Categorical'); in this case, 41 is mapped to the '31 to 50' bin.

  • WOE value (where the 'OutputType' name-value pair argument is set to 'WOE'); in this case, 41 is mapped to the WOE value of bin number 2.
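The three output formats can be illustrated without a creditscorecard object. The following is a minimal sketch, assuming the hypothetical bin edges 0, 31, 51, and Inf for the age ranges above and made-up WOE values (woePerBin is not taken from any data set). It uses the base MATLAB function discretize rather than bindata.

```matlab
% Hypothetical bin edges for '0 to 30', '31 to 50', and '51 and up'
% (ages are assumed to be whole numbers).
edges  = [0 31 51 Inf];
labels = {'0 to 30','31 to 50','51 and up'};

% Made-up WOE values, one per bin, for illustration only.
woePerBin = [-0.5 0.1 0.4];

age = 41;

% Bin number: index of the bin that contains the age.
binNumber = discretize(age,edges);

% Categorical: label of the bin that contains the age.
binLabel = discretize(age,edges,'categorical',labels);

% WOE: look up the WOE value of that bin.
woeValue = woePerBin(binNumber);

fprintf('Bin number: %d, label: %s, WOE: %g\n', ...
    binNumber,char(binLabel),woeValue)
```

With these assumed edges, age 41 lands in the second bin, so all three results refer to the same bin and differ only in how they represent it.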

Create a creditscorecard object using the CreditCardData.mat file to load the data (using a dataset from Refaat 2011). Use the 'IDVar' argument to indicate that 'CustID' contains ID information and should not be included as a predictor variable.

load CreditCardData
sc = creditscorecard(data,'IDVar','CustID')

sc = 
  creditscorecard with properties:

                GoodLabel: 0
              ResponseVar: 'status'
               WeightsVar: ''
                 VarNames: {1x11 cell}
        NumericPredictors: {1x6 cell}
    CategoricalPredictors: {'ResStatus'  'EmpStatus'  'OtherCC'}
           BinMissingData: 0
                    IDVar: 'CustID'
            PredictorVars: {1x9 cell}
                     Data: [1200x11 table]

Perform automatic binning.

sc = autobinning(sc);

Show the bin information for 'CustAge'.

bininfo(sc,'CustAge')

ans=8×6 table
         Bin         Good    Bad     Odds        WOE       InfoValue
    _____________    ____    ___    ______    _________    _________

    {'[-Inf,33)'}     70      53    1.3208     -0.42622     0.019746
    {'[33,37)'  }     64      47    1.3617     -0.39568     0.015308
    {'[37,40)'  }     73      47    1.5532     -0.26411    0.0072573
    {'[40,46)'  }    174      94    1.8511    -0.088658     0.001781
    {'[46,48)'  }     61      25      2.44      0.18758    0.0024372
    {'[48,58)'  }    263     105    2.5048      0.21378     0.013476
    {'[58,Inf]' }     98      26    3.7692      0.62245       0.0352
    {'Totals'   }    803     397    2.0227          NaN     0.095205
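The WOE and InfoValue columns in the table above are consistent with the usual definitions: the WOE of a bin is the logarithm of the bin's share of "Good" observations divided by its share of "Bad" observations, and the information value of a bin is the difference of those shares times the WOE. The following sketch recomputes both from the Good and Bad counts shown above. It illustrates the formulas only and is not a description of the toolbox's internal implementation.

```matlab
% Good and Bad counts per bin, copied from the bininfo table for 'CustAge'.
good = [70 64 73 174 61 263 98]';
bad  = [53 47 47 94 25 105 26]';

% Share of all goods and of all bads that fall in each bin.
pGood = good/sum(good);
pBad  = bad/sum(bad);

% Weight of Evidence per bin: log of the bin odds relative to the total odds.
woe = log(pGood./pBad);

% Information value contribution of each bin, and the total.
infoValue = (pGood - pBad).*woe;
totalIV   = sum(infoValue);

disp(table(good,bad,woe,infoValue))
fprintf('Total InfoValue: %.6f\n',totalIV)
```

The recomputed values can be compared row by row with the WOE and InfoValue columns reported by bininfo.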

These are the first 10 age values in the original data, used to create the creditscorecard object.

data(1:10,'CustAge')

ans=10×1 table
    CustAge
    _______

      53
      61
      47
      50
      68
      65
      34
      50
      50
      49

Bin scorecard data into bin numbers (default behavior).

bdata = bindata(sc);

According to the bin information, the first age (53) should be mapped into the sixth bin, the second age (61) into the seventh bin, and so on. These are the first 10 binned ages, in bin-number format.

bdata(1:10,'CustAge')

ans=10×1 table
    CustAge
    _______

       6
       7
       5
       6
       7
       7
       2
       6
       6
       6

Bin the scorecard data and show the bin labels. To do this, set the bindata name-value pair argument for 'OutputType' to 'Categorical'.

bdata = bindata(sc,'OutputType','Categorical');

These are the first 10 binned ages, in categorical format.

bdata(1:10,'CustAge')

ans=10×1 table
    CustAge
    ________

    [48,58)
    [58,Inf]
    [46,48)
    [48,58)
    [58,Inf]
    [58,Inf]
    [33,37)
    [48,58)
    [48,58)
    [48,58)

Convert the scorecard data to WOE values. To do this, set the bindata name-value pair argument for 'OutputType' to 'WOE'.

bdata = bindata(sc,'OutputType','WOE');

These are the first 10 binned ages, in WOE format. The ages are mapped to the WOE values displayed by the bininfo function.

bdata(1:10,'CustAge')

ans=10×1 table
    CustAge
    ________

     0.21378
     0.62245
     0.18758
     0.21378
     0.62245
     0.62245
    -0.39568
     0.21378
     0.21378
     0.21378

This example shows how to use the bindata function's optional input for the data to bin. If this input is not provided, bindata bins the creditscorecard training data. However, if a different dataset needs to be binned, for example, some "test" data, it can be passed into bindata as an optional input.

Create a creditscorecard object using the CreditCardData.mat file to load the data (using a dataset from Refaat 2011). Use the 'IDVar' argument to indicate that 'CustID' contains ID information and should not be included as a predictor variable.

load CreditCardData
sc = creditscorecard(data,'IDVar','CustID')

sc = 
  creditscorecard with properties:

                GoodLabel: 0
              ResponseVar: 'status'
               WeightsVar: ''
                 VarNames: {1x11 cell}
        NumericPredictors: {1x6 cell}
    CategoricalPredictors: {'ResStatus'  'EmpStatus'  'OtherCC'}
           BinMissingData: 0
                    IDVar: 'CustID'
            PredictorVars: {1x9 cell}
                     Data: [1200x11 table]

Perform automatic binning.

sc = autobinning(sc);

Show the bin information for 'CustAge'.

bininfo(sc,'CustAge')

ans=8×6 table
         Bin         Good    Bad     Odds        WOE       InfoValue
    _____________    ____    ___    ______    _________    _________

    {'[-Inf,33)'}     70      53    1.3208     -0.42622     0.019746
    {'[33,37)'  }     64      47    1.3617     -0.39568     0.015308
    {'[37,40)'  }     73      47    1.5532     -0.26411    0.0072573
    {'[40,46)'  }    174      94    1.8511    -0.088658     0.001781
    {'[46,48)'  }     61      25      2.44      0.18758    0.0024372
    {'[48,58)'  }    263     105    2.5048      0.21378     0.013476
    {'[58,Inf]' }     98      26    3.7692      0.62245       0.0352
    {'Totals'   }    803     397    2.0227          NaN     0.095205

For the purpose of illustration, take a few rows from the original data as "test" data and display the first 10 age values in the test data.

tdata = data(101:110,:);
tdata(1:10,'CustAge')

ans=10×1 table
    CustAge
    _______

      34
      59
      64
      61
      28
      65
      55
      37
      49
      51

Convert the test data to WOE values. To do this, set the bindata name-value pair argument for 'OutputType' to 'WOE', passing the test data (tdata) as an optional input.

bdata = bindata(sc,tdata,'OutputType','WOE')

bdata=10×11 table
    CustID     CustAge    TmAtAddress    ResStatus    EmpStatus    CustIncome     TmWBank     OtherCC    AMBalance    UtilRate    status
    ______    ________    ___________    _________    _________    __________    ________    ________    _________    ________    ______

       101    -0.39568      -0.087767    -0.095564       0.2418     -0.011271     0.76889    0.053364     -0.11274    0.048576         0
       102     0.62245        0.14288     0.019329     -0.19947       0.20579    -0.13107    -0.26832     -0.11274    0.048576         1
       103     0.62245        0.02263     0.019329       0.2418       0.47972    -0.12109    0.053364      0.24418    0.092164         0
       104     0.62245        0.02263    -0.095564       0.2418       0.47972    -0.12109    0.053364      0.24418    0.048576         0
       105    -0.42622        0.02263     0.019329       0.2418      -0.06843     0.76889    0.053364     -0.11274    0.092164         0
       106     0.62245        0.02263     0.019329     -0.19947       0.20579    -0.13107    0.053364     -0.11274    -0.22899         0
       107     0.21378      -0.087767    -0.095564       0.2418       0.47972     0.26704    0.053364     -0.11274    0.048576         0
       108    -0.26411      -0.087767     0.019329     -0.19947      -0.29217    -0.13107    0.053364     -0.11274    0.048576         0
       109     0.21378      -0.087767    -0.095564       0.2418     -0.026696    -0.13107    0.053364      0.24418    0.048576         0
       110     0.21378      -0.087767     0.019329       0.2418       0.20579    -0.13107    0.053364     -0.29895    -0.22899         0

These are the first 10 binned ages, in WOE format. The ages are mapped to the WOE values displayed by bininfo.

bdata(1:10,'CustAge')

ans=10×1 table
    CustAge
    ________

    -0.39568
     0.62245
     0.62245
     0.62245
    -0.42622
     0.62245
     0.21378
    -0.26411
     0.21378
     0.21378

Create a creditscorecard object using the CreditCardData.mat file to load the data with missing values (dataMissing). The variables CustAge and ResStatus have missing values.

load CreditCardData.mat
head(dataMissing,5)

ans=5×11 table
    CustID    CustAge    TmAtAddress     ResStatus     EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance    UtilRate    status
    ______    _______    ___________    ___________    _________    __________    _______    _______    _________    ________    ______

         1         53             62    <undefined>    Unknown           50000         55    Yes           1055.9        0.22         0
         2         61             22    Home Owner     Employed          52000         25    Yes           1161.6        0.24         0
         3         47             30    Tenant         Employed          37000         61    No            877.23        0.29         0
         4        NaN             75    Home Owner     Employed          53000         20    Yes           157.37        0.08         0
         5         68             56    Home Owner     Employed          53000         14    Yes           561.84        0.11         0

Use creditscorecard with the name-value argument 'BinMissingData' set to true to bin the missing numeric or categorical data in a separate bin. Apply automatic binning.

sc = creditscorecard(dataMissing,'IDVar','CustID','BinMissingData',true);
sc = autobinning(sc);
disp(sc)

  creditscorecard with properties:

                GoodLabel: 0
              ResponseVar: 'status'
               WeightsVar: ''
                 VarNames: {1x11 cell}
        NumericPredictors: {1x6 cell}
    CategoricalPredictors: {'ResStatus'  'EmpStatus'  'OtherCC'}
           BinMissingData: 1
                    IDVar: 'CustID'
            PredictorVars: {1x9 cell}
                     Data: [1200x11 table]

Display and plot bin information for the numeric predictor 'CustAge', which includes missing data in a separate bin labelled <missing>.

[bi,cp] = bininfo(sc,'CustAge');
disp(bi)

         Bin         Good    Bad     Odds       WOE       InfoValue
    _____________    ____    ___    ______    ________    __________

    {'[-Inf,33)'}     69      52    1.3269    -0.42156      0.018993
    {'[33,37)'  }     63      45       1.4    -0.36795      0.012839
    {'[37,40)'  }     72      47    1.5319     -0.2779     0.0079824
    {'[40,46)'  }    172      89    1.9326    -0.04556     0.0004549
    {'[46,48)'  }     59      25      2.36     0.15424     0.0016199
    {'[48,51)'  }     99      41    2.4146     0.17713     0.0035449
    {'[51,58)'  }    157      62    2.5323     0.22469     0.0088407
    {'[58,Inf]' }     93      25      3.72     0.60931      0.032198
    {'<missing>'}     19      11    1.7273    -0.15787    0.00063885
    {'Totals'   }    803     397    2.0227         NaN      0.087112

plotbins(sc,'CustAge')

Figure contains an axes object. The axes object with title CustAge contains 3 objects of type bar, line. These objects represent Good, Bad.

Display and plot bin information for the categorical predictor 'ResStatus', which includes missing data in a separate bin labelled <missing>.

[bi,cg] = bininfo(sc,'ResStatus');
disp(bi)

         Bin          Good    Bad     Odds        WOE       InfoValue
    ______________    ____    ___    ______    _________    __________

    {'Tenant'    }    296     161    1.8385    -0.095463     0.0035249
    {'Home Owner'}    352     171    2.0585     0.017549    0.00013382
    {'Other'     }    128      52    2.4615      0.19637     0.0055808
    {'<missing>' }     27      13    2.0769     0.026469    2.3248e-05
    {'Totals'    }    803     397    2.0227          NaN     0.0092627

plotbins(sc,'ResStatus')

Figure contains an axes object. The axes object with title ResStatus contains 3 objects of type bar, line. These objects represent Good, Bad.

For the 'CustAge' and 'ResStatus' predictors, there is missing data (NaNs and <undefined>) in the training data, and the binning process estimates WOE values of -0.15787 and 0.026469, respectively, for missing data in these predictors, as shown above.

For the purpose of illustration, take a few rows from the original data as test data and introduce some missing data.

tdata = dataMissing(11:14,:);
tdata.CustAge(1) = NaN;
tdata.TmAtAddress(2) = NaN;
tdata.ResStatus(3) = '<undefined>';
tdata.EmpStatus(4) = '<undefined>';
disp(tdata)

    CustID    CustAge    TmAtAddress     ResStatus      EmpStatus     CustIncome    TmWBank    OtherCC    AMBalance    UtilRate    status
    ______    _______    ___________    ___________    ___________    __________    _______    _______    _________    ________    ______

        11        NaN             24    Tenant         Unknown             34000         44    Yes            119.8        0.07         1
        12         48            NaN    Other          Unknown             44000         14    Yes           403.62        0.03         0
        13         65             63    <undefined>    Unknown             48000          6    No            111.88        0.02         0
        14         44             75    Other          <undefined>         41000         35    No            436.41        0.18         0

Convert the test data to WOE values. To do this, set the bindata name-value pair argument for 'OutputType' to 'WOE', passing the test data tdata as an optional input.

bdata = bindata(sc,tdata,'OutputType','WOE');
disp(bdata)

    CustID     CustAge    TmAtAddress    ResStatus    EmpStatus    CustIncome     TmWBank     OtherCC    AMBalance    UtilRate    status
    ______    ________    ___________    _________    _________    __________    ________    ________    _________    ________    ______

        11    -0.15787        0.02263    -0.095463     -0.19947      -0.06843    -0.12109    0.053364      0.24418    0.048576         1
        12     0.17713            NaN      0.19637     -0.19947       0.20579    -0.13107    0.053364      0.24418    0.092164         0
        13     0.60931        0.02263     0.026469     -0.19947       0.47972    -0.25547    -0.26832      0.24418    0.092164         0
        14    -0.04556        0.02263      0.19637          NaN     -0.011271    -0.12109    -0.26832      0.24418    0.048576         0

For the 'CustAge' and 'ResStatus' predictors, because there is missing data in the training data, the missing values in the test data are mapped to the WOE value estimated for the <missing> bin. Therefore, a missing value for 'CustAge' is replaced with -0.15787, and a missing value for 'ResStatus' is replaced with 0.026469.

For 'TmAtAddress' and 'EmpStatus', the training data has no missing values, so there is no bin for missing data and no way to estimate a WOE value for it. For these predictors, the WOE transformation therefore leaves missing values as missing (that is, it sets a WOE value of NaN).

These rules apply when 'OutputType' is set to 'WOE' or 'WOEModelInput'. The rationale is that if a data-based WOE value exists for missing data, it should be used for the WOE transformation and for subsequent steps (for example, fitting a logistic model or scoring).

On the other hand, when 'OutputType' is set to 'BinNumber' or 'Categorical', bindata leaves missing values as missing, which allows you to subsequently treat the missing data as you see fit.

For example, when 'OutputType' is set to 'BinNumber', missing values are set to NaN:

bdata = bindata(sc,tdata,'OutputType','BinNumber');
disp(bdata)

    CustID    CustAge    TmAtAddress    ResStatus    EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance    UtilRate    status
    ______    _______    ___________    _________    _________    __________    _______    _______    _________    ________    ______

        11        NaN              2            1            1             3          3          2            1           2         1
        12          6            NaN            3            1             6          2          2            1           1         0
        13          8              2          NaN            1             7          1          1            1           1         0
        14          4              2            3          NaN             5          3          1            1           2         0

And when 'OutputType' is set to 'Categorical', missing values are set to <undefined>:

bdata = bindata(sc,tdata,'OutputType','Categorical');
disp(bdata)

    CustID      CustAge      TmAtAddress     ResStatus      EmpStatus      CustIncome       TmWBank     OtherCC      AMBalance       UtilRate      status
    ______    ___________    ___________    ___________    ___________    _____________    _________    _______    _____________    ___________    ______

        11    <undefined>    [23,83)        Tenant         Unknown        [33000,35000)    [23,45)      Yes        [-Inf,558.88)    [0.04,0.36)         1
        12    [48,51)        <undefined>    Other          Unknown        [42000,47000)    [12,23)      Yes        [-Inf,558.88)    [-Inf,0.04)         0
        13    [58,Inf]       [23,83)        <undefined>    Unknown        [47000,Inf]      [-Inf,12)    No         [-Inf,558.88)    [-Inf,0.04)         0
        14    [40,46)        [23,83)        Other          <undefined>    [40000,42000)    [23,45)      No         [-Inf,558.88)    [0.04,0.36)         0

bindata supports the following types of WOE transformation:

  • When the 'OutputType' name-value argument is set to 'WOE', bindata applies the WOE transformation to all predictors and keeps the remaining variables in the original data in place and unchanged.

  • When the 'OutputType' name-value argument is set to 'WOEModelInput', bindata returns a table that can be used directly as input for fitting a logistic regression model for the scorecard. In this case, bindata:

    • Applies the WOE transformation to all predictors.

    • Returns predictor variables only; IDVar and unused variables are not included in the output.

    • Includes the mapped response variable as the last column.

The fitmodel function calls bindata internally using the 'WOEModelInput' option to fit the logistic regression model for the creditscorecard model.

Create a creditscorecard object using the CreditCardData.mat file to load the data (using a dataset from Refaat 2011). Use the 'IDVar' argument to indicate that 'CustID' contains ID information and should not be included as a predictor variable.

load CreditCardData
sc = creditscorecard(data,'IDVar','CustID')

sc = 
  creditscorecard with properties:

                GoodLabel: 0
              ResponseVar: 'status'
               WeightsVar: ''
                 VarNames: {1x11 cell}
        NumericPredictors: {1x6 cell}
    CategoricalPredictors: {'ResStatus'  'EmpStatus'  'OtherCC'}
           BinMissingData: 0
                    IDVar: 'CustID'
            PredictorVars: {1x9 cell}
                     Data: [1200x11 table]

Perform automatic binning.

sc = autobinning(sc);

Show the bin information for 'CustAge'.

bininfo(sc,'CustAge')

ans=8×6 table
         Bin         Good    Bad     Odds        WOE       InfoValue
    _____________    ____    ___    ______    _________    _________

    {'[-Inf,33)'}     70      53    1.3208     -0.42622     0.019746
    {'[33,37)'  }     64      47    1.3617     -0.39568     0.015308
    {'[37,40)'  }     73      47    1.5532     -0.26411    0.0072573
    {'[40,46)'  }    174      94    1.8511    -0.088658     0.001781
    {'[46,48)'  }     61      25      2.44      0.18758    0.0024372
    {'[48,58)'  }    263     105    2.5048      0.21378     0.013476
    {'[58,Inf]' }     98      26    3.7692      0.62245       0.0352
    {'Totals'   }    803     397    2.0227          NaN     0.095205

These are the first 10 age values in the original data, used to create the creditscorecard object.

data(1:10,'CustAge')

ans=10×1 table
    CustAge
    _______

      53
      61
      47
      50
      68
      65
      34
      50
      50
      49

Convert the scorecard data to WOE values. To do this, set the bindata name-value pair argument for 'OutputType' to 'WOE'.

bdata = bindata(sc,'OutputType','WOE');

These are the first 10 binned ages, in WOE format. The ages are mapped to the WOE values displayed by bininfo.

bdata(1:10,'CustAge')

ans=10×1 table
    CustAge
    ________

     0.21378
     0.62245
     0.18758
     0.21378
     0.62245
     0.62245
    -0.39568
     0.21378
     0.21378
     0.21378

The size of the original data and the size of the bdata output are the same because bindata leaves unused variables (such as 'IDVar') unchanged and in place.

whos data bdata

  Name          Size             Bytes  Class    Attributes

  bdata      1200x11            108987  table
  data       1200x11             84603  table

The response values are the same in the original data and in the binned data because, by default, bindata does not modify response values.

disp([data.status(1:10) bdata.status(1:10)])

     0     0
     0     0
     0     0
     0     0
     0     0
     0     0
     1     1
     0     0
     1     1
     1     1

When fitting a logistic regression model with WOE data, set the 'OutputType' name-value pair argument to 'WOEModelInput'.

bdata = bindata(sc,'OutputType','WOEModelInput');

The binned predictor data is the same as when the 'OutputType' name-value pair argument is set to 'WOE'.

bdata(1:10,'CustAge')

ans=10×1 table
    CustAge
    ________

     0.21378
     0.62245
     0.18758
     0.21378
     0.62245
     0.62245
    -0.39568
     0.21378
     0.21378
     0.21378

However, the size of the original data and the size of the bdata output are different. This is because bindata removes unused variables (such as 'IDVar').

whos data bdata

  Name          Size            Bytes  Class    Attributes

  bdata      1200x10            99167  table
  data       1200x11            84603  table

The response values are also modified in this case and are mapped so that "Good" is 1 and "Bad" is 0.

disp([data.status(1:10) bdata.status(1:10)])

     0     1
     0     1
     0     1
     0     1
     0     1
     0     1
     1     0
     0     1
     1     0
     1     0
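Because the 'WOEModelInput' table contains only WOE-transformed predictors followed by the mapped response, it can be passed directly to a model-fitting function. The following is a minimal sketch, assuming Statistics and Machine Learning Toolbox is available. It fits a plain logistic regression on all predictors with fitglm, which uses the last table variable as the response by default. This is not necessarily what fitmodel does internally, so treat it as an illustration only.

```matlab
% Rebuild the binned scorecard used in this example.
load CreditCardData
sc = creditscorecard(data,'IDVar','CustID');
sc = autobinning(sc);

% WOE-transformed predictors with the mapped response as the last column.
bdata = bindata(sc,'OutputType','WOEModelInput');

% Plain logistic regression on all predictors (an assumption made for
% illustration); fitglm takes the last variable, 'status', as the response.
mdl = fitglm(bdata,'Distribution','binomial');
disp(mdl.Coefficients)
```

In practice, calling fitmodel on the creditscorecard object is the documented route; the sketch only shows why the 'WOEModelInput' layout (predictors first, response last) is convenient.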

Input Arguments

Credit scorecard model (sc), specified as a creditscorecard object. Use creditscorecard to create a creditscorecard object.

Data to bin (data) according to the rules set in the creditscorecard object, specified as a table. By default, data is set to the creditscorecard object's raw data.

Before creating a creditscorecard object, prepare the data so that it is appropriately structured as input to creditscorecard.

Data Types: table

Name-Value Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: bdata = bindata(sc,'OutputType','WOE','ResponseFormat','Mapped')

Output format, specified as the comma-separated pair consisting of 'OutputType' and a character vector with one of the following values:

  • BinNumber: Returns the bin numbers corresponding to each observation.

  • Categorical: Returns the bin label corresponding to each observation.

  • WOE: Returns the Weight of Evidence (WOE) corresponding to each observation.

  • WOEModelInput: Use this option when fitting a model. This option:

    • Returns the Weight of Evidence (WOE) corresponding to each observation.

    • Returns predictor variables only; IDVar and unused variables are not included in the output.

    • Discards any predictors whose bins have Inf or NaN WOE values.

    • Includes the mapped response variable as the last column.

    Note

    When the bindata name-value pair argument 'OutputType' is set to 'WOEModelInput', the bdata output contains only the columns corresponding to predictors whose bins do not have Inf or NaN Weight of Evidence (WOE) values, and bdata includes the mapped response as the last column.

    Missing data (if any) is included in the bdata output as missing data, and does not influence the rules for discarding predictors when 'OutputType' is set to 'WOEModelInput'.

Data Types: char

Response values format, specified as the comma-separated pair consisting of 'ResponseFormat' and a character vector with one of the following values:

  • RawData: The response variable is copied unchanged into the bdata output.

  • Mapped: The response values are modified (if necessary) so that "Good" is mapped to 1 and "Bad" is mapped to 0.

Data Types: char
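The 'Mapped' format can be pictured as a comparison against the GoodLabel value. The sketch below assumes GoodLabel is 0, as in the examples on this page, and reproduces the mapping by hand. The variable names are illustrative and are not part of the bindata interface.

```matlab
% First 10 raw response values from the example data on this page.
status    = [0 0 0 0 0 0 1 0 1 1]';
goodLabel = 0;   % value of the GoodLabel property in the examples

% 'Mapped': "Good" becomes 1 and "Bad" becomes 0.
mapped = double(status == goodLabel);

% Raw response in the first column, mapped response in the second.
disp([status mapped])
```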

Output Arguments

Binned predictor variables (bdata), returned as a table. This is a table of the same size as the data input (see the exception in the following Note), but only the predictors specified in the creditscorecard object's PredictorVars property are binned; the remaining variables are unchanged.

Note

When the bindata name-value pair argument 'OutputType' is set to 'WOEModelInput', the bdata output contains only the columns corresponding to predictors whose bins do not have Inf or NaN Weight of Evidence (WOE) values, and bdata includes the mapped response as the last column.

Missing data (if any) is included in the bdata output as missing data, and does not influence the rules for discarding predictors when 'OutputType' is set to 'WOEModelInput'.

References

[1] Anderson, R. The Credit Scoring Toolkit. Oxford University Press, 2007.

[2] Refaat, M. Credit Risk Scorecards: Development and Implementation Using SAS. lulu.com, 2011.

Introduced in R2014b