Machine Learning with MATLAB

Regression with Boosted Decision Trees

In this example we will explore a regression problem using the Boston House Prices dataset available from the UCI Machine Learning Repository.

Download Housing Prices

filename = 'housing.txt';
urlwrite('http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data',filename);
inputNames = {'CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT'};
outputNames = {'MEDV'};
housingAttributes = [inputNames,outputNames];

Import Data

Once the file is saved, you can import the data into MATLAB as a table using the Import Tool with default options. Alternatively, you can use the following code, which can be auto-generated from the Import Tool:

formatSpec = '%8f%7f%8f%3f%8f%8f%7f%8f%4f%7f%7f%7f%7f%f%[^\n\r]';
fileID = fopen(filename,'r');
dataArray = textscan(fileID, formatSpec, 'Delimiter', '', 'WhiteSpace', '', 'ReturnOnError', false);
fclose(fileID);
housing = table(dataArray{1:end-1}, 'VariableNames', ...
    {'VarName1','VarName2','VarName3','VarName4','VarName5','VarName6','VarName7', ...
     'VarName8','VarName9','VarName10','VarName11','VarName12','VarName13','VarName14'});
% Delete the file and clear temporary variables
clearvars filename formatSpec fileID dataArray ans;
delete housing.txt

Read into a Table

housing.Properties.VariableNames = housingAttributes;
X = housing{:,inputNames};
y = housing{:,outputNames};
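
Before training, it can help to take a quick look at the imported table to confirm the variable names and value ranges. This preview is an optional addition, not part of the original example:

% Optional check: preview the first few rows and per-variable statistics
disp(housing(1:5,:))
summary(housing)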

Train a Regression Tree Using the Housing Data

rng(5); % For reproducibility
% Hold out 10% of the data for testing; train on the remaining 90%
cv = cvpartition(height(housing),'holdout',0.1);
t = RegressionTree.template('MinLeaf',5);
mdl = fitensemble(X(cv.training,:),y(cv.training,:),'LSBoost',500,t, ...
    'PredictorNames',inputNames,'ResponseName',outputNames{1},'LearnRate',0.01);
L = loss(mdl,X(cv.test,:),y(cv.test),'mode','ensemble');
fprintf('Mean-square testing error = %f\n',L);
Mean-square testing error = 7.056746
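
Because MEDV is the median home value in units of $1000s, the error is easier to interpret as a root-mean-square error in dollars. A minimal follow-up sketch, assuming L from above is still in scope:

% Express the ensemble error as an RMSE; MEDV is in $1000s,
% so multiply by 1000 to state the error in dollars
rmse = sqrt(L);
fprintf('RMSE = %.2f (roughly $%.0f)\n', rmse, rmse*1000);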

Compare the Fit Against the Training Data

figure(1);
% plot([y(cv.training), predict(mdl,X(cv.training,:))],'LineWidth',2);
plot(y(cv.training),'b','LineWidth',2), hold on
plot(predict(mdl,X(cv.training,:)),'r.-','LineWidth',1,'MarkerSize',15)
% Look at the first 100 points; pan to view more
xlim([0 100])
legend({'Actual','Predicted'})
xlabel('Training Data point');
ylabel('Median house price');
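
A residual plot is another quick check on the fit: the residuals should scatter evenly around zero with no visible trend. This is an optional sketch, not part of the original example:

% Plot the training residuals (actual minus predicted)
figure;
res = y(cv.training) - predict(mdl,X(cv.training,:));
plot(res,'k.'), hold on
plot(xlim,[0 0],'r--') % zero-error reference line
xlabel('Training Data point'); ylabel('Residual');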

Plot Predictor Importance

Plot the predictors sorted by importance.

[predictorImportance,sortedIndex] = sort(mdl.predictorImportance);
figure(2);
barh(predictorImportance)
set(gca,'YTickLabel',inputNames(sortedIndex))
xlabel('Predictor Importance')

Plot Error

figure(3);
trainingLoss = resubLoss(mdl,'mode','cumulative');
testLoss = loss(mdl,X(cv.test,:),y(cv.test),'mode','cumulative');
plot(trainingLoss), hold on
plot(testLoss,'r')
legend({'Training Set Loss','Test Set Loss'})
xlabel('Number of trees');
ylabel('Mean Squared Error');
set(gcf,'Position',[249 634 1009 420])
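
The cumulative test loss typically levels off well before 500 trees. As a small follow-up sketch (assuming testLoss from above), we can locate the ensemble size with the lowest test error before turning to regularization:

% Find the number of trees that minimizes the cumulative test loss
[minTestLoss,bestNumTrees] = min(testLoss);
fprintf('Minimum test MSE = %.4f at %d trees\n',minTestLoss,bestNumTrees);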

Regularize and Shrink the Ensemble

We may not need all 500 trees to achieve the full accuracy of the model. We can regularize the weights and shrink the ensemble based on a regularization parameter.

% Try two different regularization parameter values for lasso
mdl = regularize(mdl,'lambda',[0.001 0.1]);
disp('Number of Trees:')
disp(sum(mdl.Regularization.TrainedWeights > 0))
Number of Trees:
   194   128

Shrink the Ensemble Using lambda = 0.1

mdl = shrink(mdl,'weightcolumn',2);
disp('Number of Trees trained after shrinkage')
disp(mdl.NTrained)
Number of Trees trained after shrinkage
   128
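
To confirm that shrinking has not cost predictive accuracy, we can re-evaluate the loss on the held-out data. A short sketch, assuming cv, X, and y are still in scope:

% Re-compute the test error for the shrunken ensemble
Lshrunk = loss(mdl,X(cv.test,:),y(cv.test),'mode','ensemble');
fprintf('Mean-square testing error after shrinkage = %f\n',Lshrunk);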

When datasets are large, using fewer trees and fewer predictors, selected by predictor importance, results in faster computation with comparable accuracy.
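
As an illustration of that idea, the sketch below retrains a smaller ensemble on only the five most important predictors. The choice of five predictors and 200 trees is an assumption for illustration, not a value from the original example:

% Rank predictors by importance and keep the strongest five
[~,idx] = sort(mdl.predictorImportance,'descend');
topNames = inputNames(idx(1:5)); % illustrative cutoff
Xtop = housing{:,topNames};
% Retrain a smaller boosted ensemble on the reduced predictor set
mdlTop = fitensemble(Xtop(cv.training,:),y(cv.training,:),'LSBoost',200,t, ...
    'PredictorNames',topNames,'ResponseName',outputNames{1},'LearnRate',0.01);
fprintf('Test MSE with top predictors = %f\n', ...
    loss(mdlTop,Xtop(cv.test,:),y(cv.test),'mode','ensemble'));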

References and License