Using MapReduce to Fit a Logistic Regression Model
This example shows how to use mapreduce to carry out simple logistic regression using a single predictor. It demonstrates chaining multiple mapreduce calls to carry out an iterative algorithm. Since each iteration requires a separate pass through the data, an anonymous function passes information from one iteration to the next to supply information directly to the mapper.
Prepare Data
Create a datastore using the airlinesmall.csv data set. This 12-megabyte data set contains 29 columns of flight information for several airline carriers, including arrival and departure times. In this example, the variables of interest are ArrDelay (flight delay) and Distance (total flight distance).
ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
ds.SelectedVariableNames = {'ArrDelay','Distance'};
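The datastore treats 'NA' values as missing, and replaces the missing values with NaN values by default. Additionally, the SelectedVariableNames property allows you to work with only the specified variables of interest, which you can verify using preview.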
preview(ds)
ans=8×2 table
    ArrDelay    Distance
    ________    ________

        8         308
        8         296
       21         480
       13         296
        4         373
       59         308
        3         447
       11         954

Perform Logistic Regression
Logistic regression is a way to model the probability of an event as a function of another variable. In this example, logistic regression models the probability of a flight being more than 20 minutes late as a function of the flight distance, in thousands of miles.
To accomplish this logistic regression, the map and reduce functions must collectively perform a weighted least-squares regression based on the current coefficient values. The mapper computes a weighted sum of squares and cross product for each block of input data.
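The weighted least-squares step described above can be written out as follows. This is a sketch of one standard iteratively reweighted least-squares update, using notation chosen here for illustration: x_i is the distance in thousands of miles, y_i is 1 if the flight is more than 20 minutes late and 0 otherwise, X is the design matrix with a column of ones and a column of x_i, and k indexes the blocks of data seen by the mapper.

```latex
\begin{aligned}
\eta_i &= b_1 + b_2 x_i, \qquad
\mu_i = \frac{1}{1 + e^{-\eta_i}}, \\
w_i &= \mu_i\,(1 - \mu_i), \qquad
z_i = \eta_i + \frac{y_i - \mu_i}{w_i}, \\
b^{\text{new}} &= \left(X^{\top} W X\right)^{-1} X^{\top} W z,
\qquad W = \operatorname{diag}(w_1,\dots,w_n), \\
X^{\top} W X &= \sum_{k} X_k^{\top} W_k X_k,
\qquad
X^{\top} W z = \sum_{k} X_k^{\top} W_k z_k .
\end{aligned}
```

Because both cross-product matrices are sums over blocks, each mapper call only needs its own block of rows, and the reducer only needs to add the per-block results before solving for the new coefficients.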
Display the map function file.
function logitMapper(b,t,~,intermKVStore)
% Get data input table and remove any rows with missing values
y = t.ArrDelay;
x = t.Distance;
t = ~isnan(x) & ~isnan(y);
y = y(t)>20;                 % late by more than 20 min
x = x(t)/1000;               % distance in thousands of miles

% Compute the linear combination of the predictors, and the estimated mean
% probabilities, based on the coefficients from the previous iteration
if ~isempty(b)
    % Compute xb as the linear combination using the current coefficient
    % values, and derive mean probabilities mu from them
    xb = b(1)+b(2)*x;
    mu = 1./(1+exp(-xb));
else
    % This is the first iteration. Compute starting values for mu that are
    % 1/4 if y=0 and 3/4 if y=1. Derive xb values from them.
    mu = (y+.5)/2;
    xb = log(mu./(1-mu));
end

% To perform weighted least squares, compute a sum of squares and cross
% products matrix:
%      (X'*W*X) = (X1'*W1*X1) + (X2'*W2*X2) + ... + (Xn'*Wn*Xn),
% where X = [X1;X2;...;Xn]  and  W = [W1;W2;...;Wn].
%
% The mapper receives one chunk at a time and computes one of the terms on
% the right hand side. The reducer adds all of the terms to get the
% quantity on the left hand side, and then performs the regression.
w = (mu.*(1-mu));                  % weights
z = xb + (y - mu) .* 1./w;         % adjusted response

X = [ones(size(x)),x,z];           % matrix of unweighted data
wss = X' * bsxfun(@times,w,X);     % weighted cross-products X1'*W1*X1

% Store the results for this part of the data.
add(intermKVStore, 'key', wss);
end
The reducer computes the regression coefficient estimates from the sums of squares and cross products.
Display the reduce function file.
function logitReducer(~,intermValIter,outKVStore)
% We will operate over chunks of the data, accumulating the weighted
% cross-product matrix each time we add a new chunk
old = 0;

% We want to perform weighted least squares. We do this by computing a sum
% of squares and cross products matrix
%      M = (X'*W*X) = (X1'*W1*X1) + (X2'*W2*X2) + ... + (Xn'*Wn*Xn)
% where X = [X1;X2;...;Xn]  and  W = [W1;W2;...;Wn].
%
% The mapper has computed the terms on the right hand side. Here in the
% reducer we just add them up.
while hasnext(intermValIter)
    new = getnext(intermValIter);
    old = old+new;
end
M = old;  % the value on the left hand side

% Compute coefficients estimates from M. M is a matrix of sums of squares
% and cross products for [X Y] where X is the design matrix including a
% constant term and Y is the adjusted response for this iteration. In other
% words, Y has been included as an additional column of X. First we
% separate them by extracting the X'*W*X part and the X'*W*Y part.
XtWX = M(1:end-1,1:end-1);
XtWY = M(1:end-1,end);

% Solve the normal equations.
b = XtWX\XtWY;

% Return the vector of coefficient estimates.
add(outKVStore, 'key', b);
end
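The way the reducer splits the accumulated matrix can be made explicit. The following is a sketch in notation chosen here for illustration: X is the design matrix with a constant column, z is the adjusted response (called Y in the code comments), and W is the diagonal matrix of weights.

```latex
M = \begin{bmatrix} X & z \end{bmatrix}^{\top} W \begin{bmatrix} X & z \end{bmatrix}
  = \begin{bmatrix}
      X^{\top} W X & X^{\top} W z \\
      z^{\top} W X & z^{\top} W z
    \end{bmatrix},
\qquad
\left(X^{\top} W X\right) b = X^{\top} W z .
```

With a single predictor, M is 3-by-3, so the upper-left 2-by-2 block corresponds to XtWX and the first two entries of the last column correspond to XtWY. The bottom row of M is not needed to solve for b.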
Run MapReduce
Run mapreduce iteratively by enclosing the calls to mapreduce in a loop. The loop runs until the convergence criteria are met, with a maximum of five iterations.
% Define the coefficient vector, starting as empty for the first iteration.
b = [];

for iteration = 1:5
    b_old = b;
    iteration

    % Here we will use an anonymous function as our mapper. This function
    % definition includes the value of b computed in the previous
    % iteration.
    mapper = @(t,ignore,intermKVStore) logitMapper(b,t,ignore,intermKVStore);
    result = mapreduce(ds, mapper, @logitReducer, 'Display', 'off');

    tbl = readall(result);
    b = tbl.Value{1}

    % Stop iterating if we have converged.
    if ~isempty(b_old) && ...
       ~any(abs(b-b_old) > 1e-6 * abs(b_old))
       break
    end
end
iteration = 1
b = 2×1

   -1.7674
    0.1209

iteration = 2
b = 2×1

   -1.8327
    0.1807

iteration = 3
b = 2×1

   -1.8331
    0.1806

iteration = 4
b = 2×1

   -1.8331
    0.1806
View Results
Use the resulting regression coefficient estimates to plot a probability curve. This curve shows the probability of a flight being more than 20 minutes late as a function of the flight distance.
xx = linspace(0,4000);
yy = 1./(1+exp(-b(1)-b(2)*(xx/1000)));
plot(xx,yy);
xlabel('distance');
ylabel('Prob[delay>20]')
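As a rough check on what the plotted curve should look like, the fitted probability can be evaluated by hand. This sketch uses the coefficients displayed in the final iteration above (b_1 = -1.8331, b_2 = 0.1806), with d the flight distance in miles; the rounded values are approximate.

```latex
\begin{aligned}
p(d) &= \frac{1}{1 + \exp\!\left(-\left(b_1 + b_2\,\tfrac{d}{1000}\right)\right)}, \\
p(0) &= \frac{1}{1 + e^{1.8331}} \approx 0.14, \\
p(1000) &= \frac{1}{1 + e^{1.8331 - 0.1806}} = \frac{1}{1 + e^{1.6525}} \approx 0.16 .
\end{aligned}
```

Under these estimates, the probability of a delay over 20 minutes rises only slowly with distance, so the curve should appear nearly flat over the plotted range.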
Local Functions
Listed here are the map and reduce functions that mapreduce applies to the data.
function logitMapper(b,t,~,intermKVStore)
% Get data input table and remove any rows with missing values
y = t.ArrDelay;
x = t.Distance;
t = ~isnan(x) & ~isnan(y);
y = y(t)>20;                 % late by more than 20 min
x = x(t)/1000;               % distance in thousands of miles

% Compute the linear combination of the predictors, and the estimated mean
% probabilities, based on the coefficients from the previous iteration
if ~isempty(b)
    % Compute xb as the linear combination using the current coefficient
    % values, and derive mean probabilities mu from them
    xb = b(1)+b(2)*x;
    mu = 1./(1+exp(-xb));
else
    % This is the first iteration. Compute starting values for mu that are
    % 1/4 if y=0 and 3/4 if y=1. Derive xb values from them.
    mu = (y+.5)/2;
    xb = log(mu./(1-mu));
end

% To perform weighted least squares, compute a sum of squares and cross
% products matrix:
%      (X'*W*X) = (X1'*W1*X1) + (X2'*W2*X2) + ... + (Xn'*Wn*Xn),
% where X = [X1;X2;...;Xn]  and  W = [W1;W2;...;Wn].
%
% The mapper receives one chunk at a time and computes one of the terms on
% the right hand side. The reducer adds all of the terms to get the
% quantity on the left hand side, and then performs the regression.
w = (mu.*(1-mu));                  % weights
z = xb + (y - mu) .* 1./w;         % adjusted response

X = [ones(size(x)),x,z];           % matrix of unweighted data
wss = X' * bsxfun(@times,w,X);     % weighted cross-products X1'*W1*X1

% Store the results for this part of the data.
add(intermKVStore, 'key', wss);
end
%-----------------------------------------------------------------------------
function logitReducer(~,intermValIter,outKVStore)
% We will operate over chunks of the data, accumulating the weighted
% cross-product matrix each time we add a new chunk
old = 0;

% We want to perform weighted least squares. We do this by computing a sum
% of squares and cross products matrix
%      M = (X'*W*X) = (X1'*W1*X1) + (X2'*W2*X2) + ... + (Xn'*Wn*Xn)
% where X = [X1;X2;...;Xn]  and  W = [W1;W2;...;Wn].
%
% The mapper has computed the terms on the right hand side. Here in the
% reducer we just add them up.
while hasnext(intermValIter)
    new = getnext(intermValIter);
    old = old+new;
end
M = old;  % the value on the left hand side

% Compute coefficients estimates from M. M is a matrix of sums of squares
% and cross products for [X Y] where X is the design matrix including a
% constant term and Y is the adjusted response for this iteration. In other
% words, Y has been included as an additional column of X. First we
% separate them by extracting the X'*W*X part and the X'*W*Y part.
XtWX = M(1:end-1,1:end-1);
XtWY = M(1:end-1,end);

% Solve the normal equations.
b = XtWX\XtWY;

% Return the vector of coefficient estimates.
add(outKVStore, 'key', b);
end
%-----------------------------------------------------------------------------
See Also
mapreduce | tabularTextDatastore