If you have Parallel Computing Toolbox™, you can use tall arrays in your local MATLAB®session, or on a local parallel pool. You can also run tall array calculations on a cluster if you haveMATLAB并行服务器™installed. This example uses the workers in a local cluster on your machine. You can develop code locally, and then scale up, to take advantage of the capabilities offered by Parallel Computing Toolbox andMATLAB Parallel Serverwithout having to rewrite your algorithm. See alsoBig Data Workflow Using Tall Arrays and Datastores。
Create a datastore and convert it into a tall table.
ds = datastore('airlinesmall.csv');varnames = {'ArrDelay','DepDelay'}; ds.SelectedVariableNames = varnames; ds.TreatAsMissing ='NA';
If you have Parallel Computing Toolbox installed, when you use thetall
function, MATLAB automatically starts a parallel pool of workers, unless you turn off the default parallel pool preference. The default cluster uses local workers on your machine.
If you want to turn off automatically opening a parallel pool, change your parallel preferences. If you turn off theAutomatically create a parallel pool选项,然后如果您想要,您必须明确启动池tall
function to use it for parallel processing. SeeSpecify Your Parallel Preferences。
如果您有并行计算工具箱,则可以运行与MATLAB相同的代码tall table example(MATLAB) and automatically execute it in parallel on the workers of your local machine.
创造一个高大的桌子tt
from the datastore.
tt = tall(ds)
Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers. tt = M×2 tall table ArrDelay DepDelay ________ ________ 8 12 8 1 21 20 13 12 4 -1 59 63 3 -2 11 -1 : : : :
The display indicates that the number of rows,M
, is not yet known.M
is a placeholder until the calculation completes.
Extract the arrival delayArrDelay
从高大的桌子。此操作创建一个新的高阵列变量以在后续计算中使用。
a = tt.ArrDelay;
You can specify a series of operations on your tall array, which are not executed until you callgather
。这样做使您可以批量可能需要很长时间的命令。例如,计算到达延迟的平均值和标准偏差。使用这些值来构造延迟的上阈值和较低阈值,这在平均值的标准偏差范围内。
m =均值(a,'omitnan');s = std(a,'omitnan');one_sigma_bounds = [m-s m m + s];
Usegather
to calculateone_sigma_bounds
, and bring the answer into memory.
sig1 = gather(one_sigma_bounds)
Evaluating tall expression using the Parallel Pool 'local': - Pass 1 of 1: Completed in 4.5 sec Evaluation completed in 6.3 sec sig1 = -23.4572 7.1201 37.6975
You can specify multiple inputs and outputs togather
if you want to evaluate several things at once. Doing so is faster than callinggather
separately on each tall array . As an example, calculate the minimum and maximum arrival delay.
[max_delay, min_delay] = gather(max(a),min(a))
max_delay = 1014 min_delay = -64
如果你想连续发展and not use local workers or your specified cluster, enter the following command.
mapreducer(0);
mapreducer
to change the execution environment after creating a tall array, then the tall array is invalid and you must recreate it. To use local workers or your specified cluster again, enter the following command.mapreducer(gcp);
One of the benefits of developing algorithms with tall arrays is that you only need to write the code once. You can develop your code locally, and then usemapreducer
扩展到群集,无需重写算法。例如,看到使用高阵列在启用的HADOOP集群上。
datastore
|gather
|mapreducer
|parpool
|table
|tall