Machine Learning with MATLAB

Regression with Boosted Decision Trees

In this example we will explore a regression problem using the Boston House Prices dataset available from the UCI Machine Learning Repository.

Download Housing Prices

filename = 'housing.txt';
urlwrite('http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data',filename);
inputNames = {'CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT'};
outputNames = {'MEDV'};
housingAttributes = [inputNames,outputNames];

Import Data

Once the file is saved, you can import the data into MATLAB as a table using the Import Tool with default options. Alternatively, you can use the following code, which can be auto-generated from the Import Tool:

formatSpec = '%8f%7f%8f%3f%8f%8f%7f%8f%4f%7f%7f%7f%7f%f%[^\n\r]';
fileID = fopen(filename,'r');
dataArray = textscan(fileID, formatSpec, 'Delimiter', '', 'WhiteSpace', '', 'ReturnOnError', false);
fclose(fileID);
housing = table(dataArray{1:end-1}, 'VariableNames', ...
    {'VarName1','VarName2','VarName3','VarName4','VarName5','VarName6','VarName7', ...
     'VarName8','VarName9','VarName10','VarName11','VarName12','VarName13','VarName14'});
% Delete the file and clear temporary variables
clearvars filename formatSpec fileID dataArray ans;
delete housing.txt

Read into a Table

housing.Properties.VariableNames = housingAttributes;
X = housing{:,inputNames};
y = housing{:,outputNames};
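
Before training, it can help to take a quick look at the imported table to confirm the variable names and value ranges. This preview is an optional addition, not part of the original example:

% Optional check: preview the first few rows and per-variable statistics
disp(housing(1:5,:))
summary(housing)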

Train a Regression Tree Using the Housing Data

rng(5); % For reproducibility
% Hold out 10% of the data for testing; train on the remaining 90%
cv = cvpartition(height(housing),'holdout',0.1);
t = RegressionTree.template('MinLeaf',5);
mdl = fitensemble(X(cv.training,:),y(cv.training,:),'LSBoost',500,t, ...
    'PredictorNames',inputNames,'ResponseName',outputNames{1},'LearnRate',0.01);
L = loss(mdl,X(cv.test,:),y(cv.test),'mode','ensemble');
fprintf('Mean-square testing error = %f\n',L);
Mean-square testing error = 7.056746
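
Because MEDV is the median home value in units of $1000s, the error is easier to interpret as a root-mean-square error in dollars. A minimal follow-up sketch, assuming L from above is still in scope:

% Express the ensemble error as an RMSE; MEDV is in $1000s,
% so multiply by 1000 to state the error in dollars
rmse = sqrt(L);
fprintf('RMSE = %.2f (roughly $%.0f)\n', rmse, rmse*1000);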

Compare the Fit Against the Training Data

figure(1);
% plot([y(cv.training), predict(mdl,X(cv.training,:))],'LineWidth',2);
plot(y(cv.training),'b','LineWidth',2), hold on
plot(predict(mdl,X(cv.training,:)),'r.-','LineWidth',1,'MarkerSize',15)
% Look at the first 100 points; pan to view more
xlim([0 100])
legend({'Actual','Predicted'})
xlabel('Training Data point');
ylabel('Median house price');
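
A residual plot is another quick check on the fit: the residuals should scatter evenly around zero with no visible trend. This is an optional sketch, not part of the original example:

% Plot the training residuals (actual minus predicted)
figure;
res = y(cv.training) - predict(mdl,X(cv.training,:));
plot(res,'k.'), hold on
plot(xlim,[0 0],'r--') % zero-error reference line
xlabel('Training Data point'); ylabel('Residual');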

Plot Predictor Importance

Plot the predictors sorted by importance.

[predictorImportance,sortedIndex] = sort(mdl.predictorImportance);
figure(2);
barh(predictorImportance)
set(gca,'YTickLabel',inputNames(sortedIndex))
xlabel('Predictor Importance')

Plot Error

figure(3);
trainingLoss = resubLoss(mdl,'mode','cumulative');
testLoss = loss(mdl,X(cv.test,:),y(cv.test),'mode','cumulative');
plot(trainingLoss), hold on
plot(testLoss,'r')
legend({'Training Set Loss','Test Set Loss'})
xlabel('Number of trees');
ylabel('Mean Squared Error');
set(gcf,'Position',[249 634 1009 420])
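
The cumulative test loss typically levels off well before 500 trees. As a small follow-up sketch (assuming testLoss from above), we can locate the ensemble size with the lowest test error before turning to regularization:

% Find the number of trees that minimizes the cumulative test loss
[minTestLoss,bestNumTrees] = min(testLoss);
fprintf('Minimum test MSE = %.4f at %d trees\n',minTestLoss,bestNumTrees);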

Regularize and Shrink the Ensemble

We may not need all 500 trees to achieve the full accuracy of the model. We can regularize the weights and shrink the ensemble based on a regularization parameter.

% Try two different regularization parameter values for lasso
mdl = regularize(mdl,'lambda',[0.001 0.1]);
disp('Number of Trees:')
disp(sum(mdl.Regularization.TrainedWeights > 0))
Number of Trees:
   194   128

Shrink the Ensemble Using lambda = 0.1

mdl = shrink(mdl,'weightcolumn',2);
disp('Number of Trees trained after shrinkage')
disp(mdl.NTrained)
Number of Trees trained after shrinkage
   128
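
To confirm that shrinking has not cost predictive accuracy, we can re-evaluate the loss on the held-out data. A short sketch, assuming cv, X, and y are still in scope:

% Re-compute the test error for the shrunken ensemble
Lshrunk = loss(mdl,X(cv.test,:),y(cv.test),'mode','ensemble');
fprintf('Mean-square testing error after shrinkage = %f\n',Lshrunk);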

When datasets are large, using fewer trees and fewer predictors, selected by predictor importance, results in faster computation with comparable accuracy.
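
As an illustration of that idea, the sketch below retrains a smaller ensemble on only the five most important predictors. The choice of five predictors and 200 trees is an assumption for illustration, not a value from the original example:

% Rank predictors by importance and keep the strongest five
[~,idx] = sort(mdl.predictorImportance,'descend');
topNames = inputNames(idx(1:5)); % illustrative cutoff
Xtop = housing{:,topNames};
% Retrain a smaller boosted ensemble on the reduced predictor set
mdlTop = fitensemble(Xtop(cv.training,:),y(cv.training,:),'LSBoost',200,t, ...
    'PredictorNames',topNames,'ResponseName',outputNames{1},'LearnRate',0.01);
fprintf('Test MSE with top predictors = %f\n', ...
    loss(mdlTop,Xtop(cv.test,:),y(cv.test),'mode','ensemble'));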

References and License