
Using MapReduce to Fit a Logistic Regression Model

This example shows how to use mapreduce to carry out simple logistic regression using a single predictor. It demonstrates chaining two mapreduce calls to carry out an iterative algorithm. Since each iteration requires a separate pass through the data, an anonymous function passes information from one iteration to the next to supply information directly to the mapper.
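As a minimal sketch of this mechanism (illustrative only, not part of the example), a MATLAB anonymous function captures the value of a workspace variable at the moment the function handle is created, which is how the coefficient vector from one iteration can be handed to the mapper in the next:

% Minimal sketch: an anonymous function captures the value of b at the
% time the handle is created, so the handle "remembers" that value even
% if b is cleared or changed afterward.
b = [1; 2];
f = @(x) b(1) + b(2)*x;   % captures b = [1; 2]
b = [];                   % later changes to b do not affect f
f(3)                      % returns 7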

Prepare Data

Use the airlinesmall.csv data set. This 12-megabyte data set contains 29 columns of flight information for several airline carriers, including arrival and departure times. In this example, the variables of interest are ArrDelay (flight arrival delay) and Distance (total flight distance).

ds = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = {'ArrDelay', 'Distance'};

The datastore treats 'NA' values as missing and replaces them with NaN values by default. Additionally, the SelectedVariableNames property allows you to work with only the specified variables of interest, which you can verify using preview.

preview(ds)
ans =

  8×2 table

    ArrDelay    Distance
    ________    ________

        8          308
        8          296
       21          480
       13          296
        4          373
       59          308
        3          447
       11          954

Perform Logistic Regression

Logistic regression is a way to model the probability of an event as a function of another variable. In this example, logistic regression models the probability of a flight being more than 20 minutes late as a function of the flight distance, in thousands of miles.
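In other words, the model has the form shown below, where b(1) and b(2) are the intercept and slope to be estimated (a sketch of the model equation, not fitted output; Distance stands for the raw flight distance in miles):

% Probability of a delay greater than 20 minutes as a function of
% distance, with distance expressed in thousands of miles.
p = 1./(1 + exp(-(b(1) + b(2)*(Distance/1000))));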

To accomplish this logistic regression, the map and reduce functions must collectively perform a weighted least-squares regression based on the current coefficient values. The mapper computes a weighted sum of squares and cross product for each block of input data.
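For reference, the update that the mapper and reducer perform together is one step of iteratively reweighted least squares. The following is a minimal in-memory sketch of that step, under the assumption that x, y, and b hold the distance in thousands of miles, the 0/1 late indicator, and the current coefficients; it is for illustration only and is not part of the distributed example:

% One in-memory IRLS update mirroring what the map and reduce functions
% compute collectively across chunks of data.
xb = b(1) + b(2)*x;                               % linear predictor
mu = 1./(1 + exp(-xb));                           % current mean probabilities
w  = mu.*(1 - mu);                                % working weights
z  = xb + (y - mu)./w;                            % adjusted response
X  = [ones(size(x)) x];                           % design matrix with intercept
b  = (X' * bsxfun(@times,w,X)) \ (X' * (w.*z));   % weighted least-squares solve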

Display the map function file.

function logitMapper(b,t,~,intermKVStore)
% Get data input table and remove any rows with missing values
y = t.ArrDelay;
x = t.Distance;
t = ~isnan(x) & ~isnan(y);
y = y(t)>20;                 % late by more than 20 minutes
x = x(t)/1000;               % distance in thousands of miles

% Compute the linear combination of the predictors, and the estimated mean
% probabilities, based on the coefficients from the previous iteration
if ~isempty(b)
    % Compute xb as the linear combination using the current coefficient
    % values, and derive mean probabilities mu from them
    xb = b(1)+b(2)*x;
    mu = 1./(1+exp(-xb));
else
    % This is the first iteration. Compute starting values for mu that are
    % 1/4 if y=0 and 3/4 if y=1. Derive xb values from them.
    mu = (y+.5)/2;
    xb = log(mu./(1-mu));
end

% To perform weighted least squares, compute a sum of squares and cross
% products matrix:
%      (X'*W*X) = (X1'*W1*X1) + (X2'*W2*X2) + ... + (Xn'*Wn*Xn),
% where X = [X1;X2;...;Xn] and W = [W1;W2;...;Wn].
%
% The mapper receives one chunk at a time and computes one of the terms on
% the right hand side. The reducer adds all of the terms to get the
% quantity on the left hand side, and then performs the regression.
w = (mu.*(1-mu));                 % weights
z = xb + (y - mu) .* 1./w;        % adjusted response

X = [ones(size(x)),x,z];          % matrix of unweighted data
wss = X' * bsxfun(@times,w,X);    % weighted cross products X1'*W1*X1

% Store the results for this part of the data.
add(intermKVStore, 'key', wss);
end

The reducer computes the regression coefficient estimates from the sums of squares and cross products.

Display the reduce function file.

function logitReducer(~,intermValIter,outKVStore)
% We will operate over chunks of the data, updating the count, mean, and
% covariance each time we add a new chunk
old = 0;

% We want to perform weighted least squares. We do this by computing a sum
% of squares and cross products matrix
%      M = (X'*W*X) = (X1'*W1*X1) + (X2'*W2*X2) + ... + (Xn'*Wn*Xn)
% where X = [X1;X2;...;Xn] and W = [W1;W2;...;Wn].
%
% The mapper has computed the terms on the right hand side. Here in the
% reducer we just add them up.
while hasnext(intermValIter)
    new = getnext(intermValIter);
    old = old+new;
end
M = old;    % the value on the left hand side

% Compute coefficient estimates from M. M is a matrix of sums of squares
% and cross products for [X Y] where X is the design matrix including a
% constant term and Y is the adjusted response for this iteration. In other
% words, Y has been included as an additional column of X. First we
% separate them by extracting the X'*W*X part and the X'*W*Y part.
XtWX = M(1:end-1,1:end-1);
XtWY = M(1:end-1,end);

% Solve the normal equations.
b = XtWX\XtWY;

% Return the vector of coefficient estimates.
add(outKVStore, 'key', b);
end

Run MapReduce

Run mapreduce iteratively by enclosing the calls to mapreduce in a loop. The loop runs until the convergence criteria are met, with a maximum of five iterations.

% Define the coefficient vector, starting as empty for the first iteration.
b = [];

for iteration = 1:5
    b_old = b;
    iteration

    % Here we will use an anonymous function as our mapper. This function
    % definition includes the value of b computed in the previous
    % iteration.
    mapper = @(t,ignore,intermKVStore) logitMapper(b,t,ignore,intermKVStore);
    result = mapreduce(ds, mapper, @logitReducer, 'Display', 'off');

    tbl = readall(result);
    b = tbl.Value{1}

    % Stop iterating if we have converged.
    if ~isempty(b_old) && ...
       ~any(abs(b-b_old) > 1e-6 * abs(b_old))
        break
    end
end
iteration = 1

b = 2×1

   -1.7674
    0.1209

iteration = 2

b = 2×1

   -1.8327
    0.1807

iteration = 3

b = 2×1

   -1.8331
    0.1806

iteration = 4

b = 2×1

   -1.8331
    0.1806

View Results

Use the resulting regression coefficient estimates to plot a probability curve. This curve shows the probability of a flight being more than 20 minutes late as a function of the flight distance.

xx = linspace(0,4000);
yy = 1./(1+exp(-b(1)-b(2)*(xx/1000)));
plot(xx,yy);
xlabel('Distance');
ylabel('Prob[Delay>20]')

Figure contains an axes object. The axes object contains an object of type line.

Local Functions

Listed here are the map and reduce functions that mapreduce applies to the data.

function logitMapper(b,t,~,intermKVStore)
% Get data input table and remove any rows with missing values
y = t.ArrDelay;
x = t.Distance;
t = ~isnan(x) & ~isnan(y);
y = y(t)>20;                 % late by more than 20 minutes
x = x(t)/1000;               % distance in thousands of miles

% Compute the linear combination of the predictors, and the estimated mean
% probabilities, based on the coefficients from the previous iteration
if ~isempty(b)
    % Compute xb as the linear combination using the current coefficient
    % values, and derive mean probabilities mu from them
    xb = b(1)+b(2)*x;
    mu = 1./(1+exp(-xb));
else
    % This is the first iteration. Compute starting values for mu that are
    % 1/4 if y=0 and 3/4 if y=1. Derive xb values from them.
    mu = (y+.5)/2;
    xb = log(mu./(1-mu));
end

% To perform weighted least squares, compute a sum of squares and cross
% products matrix:
%      (X'*W*X) = (X1'*W1*X1) + (X2'*W2*X2) + ... + (Xn'*Wn*Xn),
% where X = [X1;X2;...;Xn] and W = [W1;W2;...;Wn].
%
% The mapper receives one chunk at a time and computes one of the terms on
% the right hand side. The reducer adds all of the terms to get the
% quantity on the left hand side, and then performs the regression.
w = (mu.*(1-mu));                 % weights
z = xb + (y - mu) .* 1./w;        % adjusted response

X = [ones(size(x)),x,z];          % matrix of unweighted data
wss = X' * bsxfun(@times,w,X);    % weighted cross products X1'*W1*X1

% Store the results for this part of the data.
add(intermKVStore, 'key', wss);
end
%-----------------------------------------------------------------------------
function logitReducer(~,intermValIter,outKVStore)
% We will operate over chunks of the data, updating the count, mean, and
% covariance each time we add a new chunk
old = 0;

% We want to perform weighted least squares. We do this by computing a sum
% of squares and cross products matrix
%      M = (X'*W*X) = (X1'*W1*X1) + (X2'*W2*X2) + ... + (Xn'*Wn*Xn)
% where X = [X1;X2;...;Xn] and W = [W1;W2;...;Wn].
%
% The mapper has computed the terms on the right hand side. Here in the
% reducer we just add them up.
while hasnext(intermValIter)
    new = getnext(intermValIter);
    old = old+new;
end
M = old;    % the value on the left hand side

% Compute coefficient estimates from M. M is a matrix of sums of squares
% and cross products for [X Y] where X is the design matrix including a
% constant term and Y is the adjusted response for this iteration. In other
% words, Y has been included as an additional column of X. First we
% separate them by extracting the X'*W*X part and the X'*W*Y part.
XtWX = M(1:end-1,1:end-1);
XtWY = M(1:end-1,end);

% Solve the normal equations.
b = XtWX\XtWY;

% Return the vector of coefficient estimates.
add(outKVStore, 'key', b);
end
%-----------------------------------------------------------------------------

See Also


Related Topics