Process Big Data in the Cloud

This example shows how to access a large data set in the cloud and process it in a cloud cluster using MATLAB capabilities for big data.

Learn how to:

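Access a publicly available large data set on Amazon Cloud.
Find and select an interesting subset of this data set.
Process this subset in less than 20 minutes using datastores, tall arrays, and Parallel Computing Toolbox.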

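The public data set in this example is part of the Wind Integration National Dataset Toolkit, or WIND Toolkit [1], [2], [3], [4]. For more information, see Wind Integration National Dataset Toolkit.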

Requirements

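To run this example, you must set up access to a cluster in Amazon AWS. In MATLAB, you can create clusters in Amazon AWS directly from the MATLAB desktop. On the Home tab, in the Parallel menu, select Create and Manage Clusters. In the Cluster Profile Manager, click Create Cloud Cluster. Alternatively, you can use MathWorks Cloud Center to create and access compute clusters in Amazon AWS. For more information, see Getting Started with Cloud Center.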

Set Up Access to Remote Data

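The data set used in this example is the Techno-Economic WIND Toolkit. It contains 2 TB (terabytes) of data for wind power estimates and forecasts, along with atmospheric variables, from 2007 to 2013 within the continental United States.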

The Techno-Economic WIND Toolkit is available through Amazon Web Services at the location s3://pywtk-data. It contains two data sets:

s3://pywtk-data/met_data - Metrology Data
s3://pywtk-data/fcst_data - Forecast Data

To work with remote data in Amazon S3, you must define environment variables for your AWS credentials. For more information on setting up access to remote data, see Work with Remote Data. In the following code, replace YOUR_AWS_ACCESS_KEY_ID and YOUR_AWS_SECRET_ACCESS_KEY with your own Amazon AWS credentials.

setenv("AWS_ACCESS_KEY_ID","YOUR_AWS_ACCESS_KEY_ID");
setenv("AWS_SECRET_ACCESS_KEY","YOUR_AWS_SECRET_ACCESS_KEY");

This data set requires you to specify its geographic region, and so you must set the corresponding environment variable.

setenv("AWS_DEFAULT_REGION","us-west-2");

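To give the workers in your cluster access to the remote data, add these environment variable names to the EnvironmentVariables property of your cluster profile. To edit the properties of your cluster profile, use the Cluster Profile Manager, in Parallel > Create and Manage Clusters.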
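
Before starting a long computation, it can help to confirm that the three environment variables are set in the current MATLAB session. The following is a minimal sketch that is not part of the original example. It uses only getenv and assumes the variable names given above; it checks the client session, not the cluster workers.

```matlab
% Names of the environment variables that this example relies on
requiredVars = {'AWS_ACCESS_KEY_ID','AWS_SECRET_ACCESS_KEY','AWS_DEFAULT_REGION'};

% getenv returns an empty character vector for a variable that is not set
isMissing = false(size(requiredVars));
for k = 1:numel(requiredVars)
    isMissing(k) = isempty(getenv(requiredVars{k}));
end

if any(isMissing)
    % List the names that still need a setenv call
    fprintf('Missing environment variables: %s\n',strjoin(requiredVars(isMissing),', '));
else
    disp('All AWS environment variables are set.');
end
```

The check reports only whether each variable has a value. It does not validate the credentials themselves.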

Find Subset of Big Data

The 2 TB data set is quite large. This example shows you how to find a subset of the data set that you want to analyze. The example focuses on data for the state of Massachusetts.

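First, obtain the IDs that identify the metrology stations in Massachusetts, and determine the files that contain their metrology information. The metadata for each station is in a file named three_tier_site_metadata.csv. Because this data is small and fits in memory, you can access it from the MATLAB client with readtable. You can use the readtable function to access open data in S3 buckets directly, without writing special code.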

tMetadata = readtable("s3://pywtk-data/three_tier_site_metadata.csv", ...
    "ReadVariableNames",true,"TextType","string");

To find out which states are listed in this data set, use unique.

states = unique(tMetadata.state)

states = 50×1 string array
    ""
    "Alabama"
    "Arizona"
    "Arkansas"
    "California"
    "Colorado"
    "Connecticut"
    "Delaware"
    "District of Columbia"
    ⋮
    "Kentucky"
    "Louisiana"
    "Maine"
    "Maryland"
    "Massachusetts"
    "Michigan"
    "Minnesota"
    "Mississippi"
    "Missouri"
    "Montana"
    "Nebraska"
    "Nevada"
    "New Hampshire"
    "New Jersey"
    ⋮
    "North Carolina"
    "North Dakota"
    "Ohio"
    "Oklahoma"
    "Oregon"
    "Pennsylvania"
    "Rhode Island"
    "South Carolina"
    "South Dakota"
    "Tennessee"
    "Texas"
    "Utah"
    ⋮
    "West Virginia"
    "Wisconsin"
    "Wyoming"

Identify which stations are located in the state of Massachusetts.

index = tMetadata.state == "Massachusetts";
siteId = tMetadata{index,"site_id"};

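The data for a given station is contained in a file that follows this naming convention: s3://pywtk-data/met_data/folder/site_id.nc, where folder is the nearest integer less than or equal to site_id/500. Using this convention, compose a file location for each station.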

folder = floor(siteId/500);
fileLocations = compose("s3://pywtk-data/met_data/%d/%d.nc",folder,siteId);
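
To make the folder arithmetic concrete, the following sketch applies the same convention to a few made-up site IDs. The IDs are hypothetical and chosen only to show the boundaries between folders; they are not taken from the metadata file.

```matlab
% Hypothetical site IDs, used only to illustrate the folder arithmetic
exampleSiteId = [0; 499; 500; 1234];

% Each folder holds 500 consecutive site IDs
exampleFolder = floor(exampleSiteId/500);

% Build one file location per row, as in the example code
exampleLocations = compose("s3://pywtk-data/met_data/%d/%d.nc",exampleFolder,exampleSiteId);
disp(exampleLocations)
```

Site IDs 0 and 499 fall in folder 0, site ID 500 is the first entry of folder 1, and site ID 1234 falls in folder 2.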

Process Big Data

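You can use datastores and tall arrays to access and process data that does not fit in memory. When performing big data computations, MATLAB accesses smaller portions of the remote data as needed, so you do not need to download the entire data set at once. With tall arrays, MATLAB automatically breaks the data into smaller blocks that fit in memory for processing.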

If you have Parallel Computing Toolbox, MATLAB can process the many blocks in parallel. The parallelization enables you to run an analysis on a single desktop with local workers, or scale up to a cluster for more resources. When you use a cluster in the same cloud service as the data, the data stays in the cloud and you benefit from improved data transfer times. Keeping the data in the cloud is also more cost-effective. This example ran in less than 20 minutes using 18 workers on a c4.8xlarge machine in Amazon AWS.

If you use a parallel pool in a cluster, MATLAB processes this data using workers in the cluster. Create a parallel pool in the cluster. In the following code, use the name of your cluster profile instead. Attach the script to the pool, because the parallel workers need to access a helper function in it.

p = parpool("myAWSCluster");

Starting parallel pool (parpool) using the 'myAWSCluster' profile ... connected to 18 workers.

addAttachedFiles(p,mfilename("fullpath"));

Create a datastore with the metrology data for the stations in Massachusetts. The data is in the form of Network Common Data Form (NetCDF) files, and you must use a custom read function to interpret them. In this example, this function is named ncReader and reads the NetCDF data into timetables. You can explore its contents at the end of this script.

dsMetrology = fileDatastore(fileLocations,"ReadFcn",@ncReader,"UniformRead",true);
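
The combination of a read function and UniformRead can be tried locally without any cloud access. The following sketch is an illustration only: it substitutes two small CSV files in a temporary folder for the NetCDF files, and uses readtable in place of ncReader. The file names and table contents are made up.

```matlab
% Write two small CSV files to a temporary folder (stand-ins for the NetCDF files)
dataFolder = tempname;
mkdir(dataFolder);
writetable(table([1;2],[10;20],'VariableNames',{'id','value'}),fullfile(dataFolder,'a.csv'));
writetable(table([3;4],[30;40],'VariableNames',{'id','value'}),fullfile(dataFolder,'b.csv'));

% The read function returns one table per file. With UniformRead set to
% true, the datastore vertically concatenates the per-file results.
ds = fileDatastore(dataFolder,"ReadFcn",@readtable,"UniformRead",true);
allData = readall(ds);
disp(allData)
```

Because every file yields a table with the same variables, the datastore can present the files as one data set. This uniformity is also what lets a tall array be built on top of the datastore.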

Create a tall timetable with the metrology data from the datastore.

ttMetrology = tall(dsMetrology)

ttMetrology =

  M×6 tall timetable

            Time            wind_speed    wind_direction    power     density    temperature    pressure
    ____________________    __________    ______________    ______    _______    ___________    ________

    01-Jan-2007 00:00:00      5.905           189.35        3.3254    1.2374       269.74        97963
    01-Jan-2007 00:05:00      5.8898          188.77        3.2988    1.2376       269.73        97959
    01-Jan-2007 00:10:00      5.9447          187.85        3.396     1.2376       269.71        97960
    01-Jan-2007 00:15:00      6.0362          187.05        3.5574    1.2376       269.68        97961
    01-Jan-2007 00:20:00      6.1156          186.49        3.6973    1.2375       269.83        97958
    01-Jan-2007 00:25:00      6.2133          185.71        3.8698    1.2376       270.03        97952
    01-Jan-2007 00:30:00      6.3232          184.29        4.0812    1.2379       270.19        97955
    01-Jan-2007 00:35:00      6.4331          182.51        4.3382    1.2382       270.3         97957
             :                  :               :             :          :            :            :
             :                  :               :             :          :            :            :

Get the mean temperature per month using groupsummary, and sort the resulting tall table. For performance, MATLAB defers most tall operations until the data is needed. In this case, plotting the data triggers evaluation of deferred calculations.

meanTemperature = groupsummary(ttMetrology,"Time","month","mean","temperature");
meanTemperature = sortrows(meanTemperature);

Plot the results.

figure;
plot(meanTemperature.mean_temperature,"*-");
ylim([260 300]);
xlim([1 12*7+1]);
xticks(1:12:12*7+1);
xticklabels(["2007","2008","2009","2010","2011","2012","2013","2014"]);
title("Average Temperature in Massachusetts 2007-2013");
xlabel("Year");
ylabel("Temperature (K)")

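Many MATLAB functions support tall arrays, so you can perform a variety of calculations on big data sets using familiar syntax. For more information on supported functions, see Supporting Functions.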
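
The deferred evaluation of tall arrays can be observed on a small in-memory data set. The following sketch is an illustration under stated assumptions, not part of the original example: the temperature values are synthetic, and mapreducer(0) is used so that the calculation runs in the local MATLAB session rather than on a parallel pool.

```matlab
% Run tall array calculations in the local MATLAB session (no parallel pool)
mapreducer(0);

% Synthetic data: one sample per day from 01-Jan-2007 to 31-Mar-2007
Time = datetime(2007,1,1) + days(0:89)';
temperature = 270 + (0:89)'/10;
ttExample = tall(timetable(Time,temperature));

% These calls only queue the work; nothing is computed yet
monthlyMean = groupsummary(ttExample,"Time","month","mean","temperature");
monthlyMean = sortrows(monthlyMean);

% gather triggers the deferred evaluation and returns an in-memory table
monthlyMean = gather(monthlyMean);
disp(monthlyMean)
```

The 90 daily samples span January, February, and March 2007, so the gathered table has one row per month. In the cloud example, the plot call plays the role that gather plays here.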

Define Custom Read Function

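The data in the Techno-Economic WIND Toolkit is saved in NetCDF files. Define a custom read function to read its data into a timetable. For more information on reading NetCDF files, see NetCDF Files.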

function t = ncReader(filename)
% NCREADER Read NetCDF file (.nc), extract the data set, and save it as a timetable

% Get information about the NetCDF data source
fileInfo = ncinfo(filename);

% Extract variable names and data types
varNames = string({fileInfo.Variables.Name});
varTypes = string({fileInfo.Variables.Datatype});

% Transform variable names into valid names for table variables
if any(startsWith(varNames,["4","6"]))
    strVarNames = replace(varNames,["4","6"],["four","six"]);
else
    strVarNames = varNames;
end

% Extract the length of each variable
fileLength = fileInfo.Dimensions.Length;

% Extract the initial timestamp and sample period, and create the time axis
tAttributes = struct2table(fileInfo.Attributes);
startTime = datetime(cell2mat(tAttributes.Value(contains(tAttributes.Name,"start_time"))),"ConvertFrom","epochtime");
samplePeriod = seconds(cell2mat(tAttributes.Value(contains(tAttributes.Name,"sample_period"))));

% Create the output timetable
numVars = numel(strVarNames);
tableSize = [fileLength numVars];
t = timetable('Size',tableSize,'VariableTypes',varTypes,'VariableNames',strVarNames,'TimeStep',samplePeriod,'StartTime',startTime);

% Fill in the timetable with the variable data
for k = 1:numVars
    t(:,k) = table(ncread(filename,varNames{k}));
end
end
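
To see the file structure that the read function relies on, you can build a tiny NetCDF file locally and inspect it with ncinfo. The following sketch is an illustration only. The file is synthetic, with one variable and four samples; the attribute names mirror the ones the read function searches for, and the attribute values are assumptions chosen for the demonstration.

```matlab
% Create a small NetCDF file in a temporary location
ncFile = [tempname '.nc'];
nccreate(ncFile,'temperature','Dimensions',{'time',4},'Datatype','double');
ncwrite(ncFile,'temperature',[269.7; 269.8; 269.9; 270.0]);

% Global attributes, named after the ones that the read function looks for
ncwriteatt(ncFile,'/','start_time',1167609600);   % seconds since 01-Jan-1970
ncwriteatt(ncFile,'/','sample_period',300);       % seconds between samples

% Inspect the same fields that the read function uses
info = ncinfo(ncFile);
disp(string({info.Variables.Name}))
disp(string({info.Attributes.Name}))
disp(info.Dimensions.Length)

% Convert the start time in the same way as the read function
startTime = datetime(ncreadatt(ncFile,'/','start_time'),'ConvertFrom','epochtime');
disp(startTime)
```

The start time value 1167609600 corresponds to 01-Jan-2007 00:00:00, and a sample period of 300 seconds matches the five-minute spacing visible in the tall timetable preview.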

References

[1] Draxl, C., B. M. Hodge, A. Clifton, and J. McCaa. Overview and Meteorological Validation of the Wind Integration National Dataset Toolkit (Technical Report, NREL/TP-5000-61740). Golden, CO: National Renewable Energy Laboratory, 2015.

[2] Draxl, C., B. M. Hodge, A. Clifton, and J. McCaa. "The Wind Integration National Dataset (WIND) Toolkit." Applied Energy. Vol. 151, 2015, pp. 355-366.

[3] King, J., A. Clifton, and B. M. Hodge. Validation of Power Output for the WIND Toolkit (Technical Report, NREL/TP-5D00-61714). Golden, CO: National Renewable Energy Laboratory, 2014.

[4] Lieberman-Cribbin, W., C. Draxl, and A. Clifton. Guide to Using the WIND Toolkit Validation Code (Technical Report, NREL/TP-5000-62595). Golden, CO: National Renewable Energy Laboratory, 2014.