

This example shows how to apply partial least squares regression (PLSR) and principal components regression (PCR), and explores the effectiveness of the two methods. PLSR and PCR are both methods to model a response variable when there are a large number of predictor variables, and those predictors are highly correlated or even collinear. Both methods construct new predictor variables, known as components, as linear combinations of the original predictor variables, but they construct those components in different ways. PCR creates components to explain the observed variability in the predictor variables, without considering the response variable at all. On the other hand, PLSR does take the response variable into account, and therefore often leads to models that are able to fit the response variable with fewer components. Whether or not that ultimately translates into a more parsimonious model, in terms of its practical use, depends on the context.
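The difference between the two constructions can be stated compactly for a single response. The following is the standard textbook formulation rather than anything taken from this example; the symbols $X_0$ (column-centered predictors), $y_0$ (centered response), and $w$ (a unit-length weight vector) are notation introduced here for illustration.

```latex
\begin{aligned}
w_1^{\mathrm{PCR}} &= \arg\max_{\lVert w \rVert = 1} \operatorname{Var}(X_0 w), \\
w_1^{\mathrm{PLS}} &= \arg\max_{\lVert w \rVert = 1} \operatorname{Cov}(X_0 w,\, y_0)^2 .
\end{aligned}
```

Later components solve the same problems subject to orthogonality constraints with respect to the earlier ones. Only the PLS criterion involves $y_0$, which is why PLS components tend to be more useful for fitting the response.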


Load a data set comprising spectral intensities of 60 samples of gasoline at 401 wavelengths, and their octane ratings. These data are described in Kalivas, John H., "Two Data Sets of Near Infrared Spectra," Chemometrics and Intelligent Laboratory Systems, v.37 (1997) pp.255-259.
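A minimal sketch of the loading step, assuming the data ship as the spectra MAT-file with Statistics and Machine Learning Toolbox and that the file defines the variables NIR and octane listed below:

```matlab
% Assumption: spectra.mat is on the MATLAB path (it ships with
% Statistics and Machine Learning Toolbox)
load spectra

% List the two variables used in the rest of the example
whos NIR octane
```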

  Name         Size              Bytes  Class     Attributes

  NIR         60x401            192480  double
  octane      60x1                 480  double
[dummy,h] = sort(octane);
oldorder = get(gcf,'DefaultAxesColorOrder');
set(gcf,'DefaultAxesColorOrder',jet(60));
plot3(repmat(1:401,60,1)',repmat(octane(h),1,401)',NIR(h,:)');
set(gcf,'DefaultAxesColorOrder',oldorder);
xlabel('Wavelength Index'); ylabel('Octane'); axis('tight');
grid on



Use the plsregress function to fit a PLSR model with ten PLS components and one response.

X = NIR;
y = octane;
[n,p] = size(X);
[Xloadings,Yloadings,Xscores,Yscores,betaPLS10,PLSPctVar] = plsregress(...
    X,y,10);


plot(1:10,cumsum(100*PLSPctVar(2,:)),'-bo');
xlabel('Number of PLS components');
ylabel('Percent Variance Explained in Y');



[Xloadings,Yloadings,Xscores,Yscores,betaPLS] = plsregress(X,y,2);
yfitPLS = [ones(n,1) X]*betaPLS;

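Next, fit a PCR model with two principal components. The first step is to perform principal components analysis on X, using the pca function, and to retain two principal components. PCR is then just a linear regression of the response variable on those two components. When the variables have very different amounts of variability, it often makes sense to normalize each variable first by its standard deviation; that is not done here.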
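For reference, the normalization mentioned above, which this example does not apply, might look like the following. This is a sketch only: it uses zscore to scale each wavelength to unit standard deviation, and the variable names Xstd, stdLoadings, stdScores, and stdVar are hypothetical.

```matlab
% Assumption: spectra.mat is on the path, as in the rest of the example
load spectra
X = NIR;

% Center each column and scale it to unit standard deviation
Xstd = zscore(X);

% PCA on the standardized predictors (hypothetical variable names)
[stdLoadings,stdScores,stdVar] = pca(Xstd,'Economy',false);
```

Any regression coefficients obtained from stdScores would apply to the standardized variables and would have to be rescaled before being used on the raw spectra.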

[PCALoadings,PCAScores,PCAVar] = pca(X,'Economy',false);
betaPCR = regress(y-mean(y), PCAScores(:,1:2));


betaPCR = PCALoadings(:,1:2)*betaPCR;
betaPCR = [mean(y) - mean(X)*betaPCR; betaPCR];
yfitPCR = [ones(n,1) X]*betaPCR;
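The back-transformation from component coefficients to coefficients on the original, uncentered variables follows from a short derivation. The notation is introduced here for illustration: $P_k$ holds the first $k$ columns of the PCA loadings, $\gamma$ is the coefficient vector from regressing the centered response on the first $k$ scores, $\bar{x}$ is the row vector of column means of $X$, and $\mathbf{1}$ is a column of ones.

```latex
\begin{aligned}
y - \bar{y}\,\mathbf{1} &\approx (X - \mathbf{1}\bar{x})\,P_k\,\gamma \\
y &\approx \mathbf{1}\,(\bar{y} - \bar{x}\,P_k\,\gamma) + X\,P_k\,\gamma \\
\beta &= P_k\,\gamma, \qquad \beta_0 = \bar{y} - \bar{x}\,\beta
\end{aligned}
```

The slope vector $\beta$ is the loadings times the component coefficients, and the intercept $\beta_0$ is the mean response minus the mean predictors times $\beta$, so stacking $[\beta_0;\ \beta]$ allows prediction as $[\mathbf{1}\ X]$ times that stacked vector.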

Plot fitted vs. observed response for the PLSR and PCR fits.

plot(y,yfitPLS,'bo',y,yfitPCR,'r^');
xlabel('Observed Response');
ylabel('Fitted Response');
legend({'PLSR with 2 Components' 'PCR with 2 Components'}, ...
    'location','NW');

[Figure: fitted versus observed response, with one series each for PLSR with 2 Components and PCR with 2 Components.]

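In a sense, the comparison in the plot above is not a fair one: the number of components (two) was chosen by looking at how well a two-component PLSR model predicted the response, and there is no reason why the PCR model should be restricted to that same number of components. With the same number of components, however, PLSR does a much better job of fitting y. In fact, judging by the horizontal scatter of fitted values in the plot above, PCR with two components is hardly better than a constant model. The R-squared values from the two regressions confirm that.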

TSS = sum((y-mean(y)).^2);
RSS_PLS = sum((y-yfitPLS).^2);
rsquaredPLS = 1 - RSS_PLS/TSS
rsquaredPLS = 0.9466
RSS_PCR = sum((y-yfitPCR).^2);
rsquaredPCR = 1 - RSS_PCR/TSS
rsquaredPCR = 0.1962


plot3(Xscores(:,1),Xscores(:,2),y-mean(y),'bo');
legend('PLSR');
grid on; view(-30,30);

[Figure: 3-D scatter of the centered response against the first two PLS scores; the single series is labeled PLSR.]

It's a little hard to see without being able to interactively rotate the figure, but the PLSR plot above shows points closely scattered about a plane. On the other hand, the PCR plot below shows a cloud of points with little indication of a linear relationship.

plot3(PCAScores(:,1),PCAScores(:,2),y-mean(y),'r^');
legend('PCR');
grid on; view(-30,30);


Notice that while the two PLS components are much better predictors of the observed y, the following figure shows that they explain somewhat less of the variance in the observed X than the first two principal components used in the PCR.

plot(1:10,100*cumsum(PLSPctVar(1,:)),'b-o',1:10, ...
    100*cumsum(PCAVar(1:10))/sum(PCAVar(1:10)),'r-^');
xlabel('Number of Principal Components');
ylabel('Percent Variance Explained in X');
legend({'PLSR' 'PCR'},'location','SE');


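The fact that the PCR curve is uniformly higher suggests why PCR with two components does such a poor job, relative to PLSR, of fitting y. PCR constructs components to best explain X, and as a result, those first two components ignore the information in the data that is important for fitting the observed y.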


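As more components are added in PCR, it will necessarily do a better job of fitting the original data y, simply because at some point most of the important predictive information in X will be present in the principal components. For example, the following figure shows that the difference in residuals between the two methods is much less dramatic with ten components than it was with two.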

yfitPLS10 = [ones(n,1) X]*betaPLS10;
betaPCR10 = regress(y-mean(y), PCAScores(:,1:10));
betaPCR10 = PCALoadings(:,1:10)*betaPCR10;
betaPCR10 = [mean(y) - mean(X)*betaPCR10; betaPCR10];
yfitPCR10 = [ones(n,1) X]*betaPCR10;
plot(y,yfitPLS10,'bo',y,yfitPCR10,'r^');
xlabel('Observed Response');
ylabel('Fitted Response');
legend({'PLSR with 10 Components' 'PCR with 10 Components'}, ...
    'location','NW');


Both models fit y fairly accurately, although PLSR still makes a slightly more accurate fit. However, ten components is still an arbitrarily chosen number for either model.
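To see how the in-sample fit changes with the number of components, one option is to tabulate R-squared for one through ten components. The sketch below is not part of the original example and rests on two assumptions: that the cumulative sum of the second row of the percent-variance output of plsregress equals the in-sample R-squared for PLSR, and that, because the PCA scores are centered and mutually orthogonal, each principal component contributes an independent share of explained variance in PCR. The names rsqPLS and rsqPCR are hypothetical.

```matlab
% Assumption: spectra.mat is on the path, as in the rest of the example
load spectra
X = NIR; y = octane;

y0  = y - mean(y);          % centered response
TSS = sum(y0.^2);           % total sum of squares

% PLSR: second row of the percent-variance output refers to y
[~,~,~,~,~,PLSPctVar] = plsregress(X,y,10);
rsqPLS = cumsum(PLSPctVar(2,:));

% PCR: explained sum of squares contributed by each of the first ten scores
[~,PCAScores] = pca(X,'Economy',false);
T = PCAScores(:,1:10);
explained = ((T'*y0).^2) ./ sum(T.^2,1)';
rsqPCR = cumsum(explained)'/TSS;

% One row per method, one column per number of components (1..10)
disp([rsqPLS; rsqPCR]);
```

These are in-sample values, so they increase with every added component and say nothing about prediction error on new data.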

Choosing the Number of Components with Cross-Validation



plsregress has an option to estimate the mean squared prediction error (MSEP) by cross-validation, in this case using 10-fold cross-validation.
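A minimal sketch of such a call, assuming the 'CV' name-value option of plsregress and storing the seventh output under the name PLSmsep, which the later plot uses. The random seed is a hypothetical addition that makes the fold assignment repeatable.

```matlab
% Assumption: spectra.mat is on the path, as in the rest of the example
load spectra
X = NIR; y = octane;

rng(1);  % hypothetical seed, only so the random folds are reproducible

% Seventh output: cross-validated mean squared error.
% Row 1 refers to X, row 2 to y; columns correspond to 0..10 components.
[~,~,~,~,~,~,PLSmsep] = plsregress(X,y,10,'CV',10);
```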


For PCR, crossval, combined with a simple function that computes the sum of squared errors for PCR, can estimate the MSEP, again using 10-fold cross-validation.

PCRmsep = sum(crossval(@pcrsse,X,y,'KFold',10),1) / n;
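The helper pcrsse is not listed in this text. The following is a hypothetical sketch of what such a function could look like, not the original helper: it assumes crossval calls it as fun(Xtrain,ytrain,Xtest,ytest) and expects a row vector of test-fold sums of squared errors for models with 0 through 10 components. Save it as pcrsse.m on the path.

```matlab
function sse = pcrsse(Xtrain,ytrain,Xtest,ytest)
% Hypothetical helper: test-fold SSE for PCR with 0:10 components.
maxComp = 10;

% Center training and test data with the TRAINING means only
meanX = mean(Xtrain,1);
meany = mean(ytrain);
X0train = Xtrain - meanX;   % implicit expansion, R2016b or later
X0test  = Xtest  - meanX;
y0test  = ytest  - meany;

% PCA on the training fold
[loadings,scores] = pca(X0train,'Economy',false);

% Scores are mutually orthogonal, so the coefficients of the nested
% 1..maxComp component models coincide with those of the largest model
gamma = scores(:,1:maxComp) \ (ytrain - meany);

% Accumulate residuals one component at a time
sse = zeros(1,maxComp+1);
resid = y0test;
sse(1) = sum(resid.^2);     % zero components: constant model
for k = 1:maxComp
    resid = resid - (X0test*loadings(:,k))*gamma(k);
    sse(k+1) = sum(resid.^2);
end
end
```

Summing the returned rows over the folds and dividing by n, as in the call above, then yields an estimated MSEP for each number of components.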


plot(0:10,PLSmsep(2,:),'b-o',0:10,PCRmsep,'r-^');
xlabel('Number of components');
ylabel('Estimated Mean Squared Prediction Error');
legend({'PLSR' 'PCR'},'location','NE');


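In fact, the second component in PCR increases the prediction error of the model, suggesting that the combination of predictor variables contained in that component is not strongly correlated with y. Again, that is because PCR constructs components to explain variation in X, not y.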



The PLS weights are the linear combinations of the original variables that define the PLS components, i.e., they describe how strongly each component in the PLSR depends on the original variables, and in what direction.

[Xl,Yl,Xs,Ys,beta,pctVar,mse,stats] = plsregress(X,y,3);
plot(1:401,stats.W,'-');
xlabel('Variable');
ylabel('PLS Weight');
legend({'1st Component' '2nd Component' '3rd Component'}, ...
    'location','NW');



plot(1:401,PCALoadings(:,1:4),'-');
xlabel('Variable');
ylabel('PCA Loading');
legend({'1st Component' '2nd Component' '3rd Component' ...
    '4th Component'},'location','NW');




However, the ultimate goal may be to reduce the original set of variables to a smaller subset that can still predict the response accurately. For example, it may be possible to use the PLS weights or the PCA loadings to select only those variables that contribute most to each component. As shown earlier, some components from a PCR model fit may serve primarily to describe the variation in the predictor variables, and may include large weights for variables that are not strongly correlated with the response. Thus, PCR can lead to retaining variables that are unnecessary for prediction.
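One simple way to act on that idea is to rank the variables by the size of their PLS weights and keep the largest. The sketch below is an illustrative heuristic, not a method from this example; the cutoff numKeep and the names importance, selected, and betaSub are hypothetical.

```matlab
% Assumption: spectra.mat is on the path, as in the rest of the example
load spectra
X = NIR; y = octane;

ncomp   = 3;     % number of PLS components
numKeep = 20;    % hypothetical number of wavelengths to retain

% stats.W holds the PLS weights, one column per component
[~,~,~,~,~,~,~,stats] = plsregress(X,y,ncomp);

% Score each variable by its largest absolute weight across components
importance = max(abs(stats.W),[],2);
[~,order] = sort(importance,'descend');
selected = sort(order(1:numKeep));   % indices of retained wavelengths

% Refit on the reduced set of predictors
[~,~,~,~,betaSub] = plsregress(X(:,selected),y,ncomp);
```

Whether the reduced model predicts as well as the full one would still need to be checked, for example with the cross-validation approach described earlier.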



