Work with Remote Data
You can read and write data from a remote location using MATLAB® functions and objects, such as file I/O functions and some datastore objects. These examples show how to set up, read from, and write to remote locations on the following cloud storage platforms:
Amazon S3™ (Simple Storage Service)
Azure® Blob Storage (previously known as Windows Azure® Storage Blob (WASB))
Hadoop® Distributed File System (HDFS™)
Amazon S3
MATLAB lets you use Amazon S3 as an online file storage web service offered by Amazon Web Services. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of the form
s3://bucketname/path_to_file
bucketname is the name of the bucket and path_to_file is the path to the file or folders.
Amazon S3 provides data storage through web service interfaces. You can use a bucket as a container to store objects in Amazon S3.
Set Up Access
To work with remote data in Amazon S3, you must set up access first:
Sign up for an Amazon Web Services (AWS) root account. See Amazon Web Services: Account.
Using your AWS root account, create an IAM (Identity and Access Management) user. See Creating an IAM User in Your AWS Account.
Generate an access key to receive an access key ID and a secret access key. See Managing Access Keys for IAM Users.
Configure your machine with the AWS access key ID, secret access key, and region using the AWS Command Line Interface tool from https://aws.amazon.com/cli/. Alternatively, set the environment variables directly by using setenv:
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY — Authenticate and enable use of Amazon S3 services. (You generated this pair of access key variables in step 3.)
AWS_DEFAULT_REGION (optional) — Select the geographic region of your bucket. The value of this environment variable is typically determined automatically, but the bucket owner might require that you set it manually.
AWS_SESSION_TOKEN (optional) — Specify the session token if you are using temporary security credentials, such as with AWS® Federated Authentication.
If you are using Parallel Computing Toolbox™, you must ensure the cluster has been configured to access S3 services. You can copy your client environment variables to the workers on a cluster by setting EnvironmentVariables in parpool, batch, createJob, or in the Cluster Profile Manager.
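As a minimal sketch of the step above, you can pass the credential variable names to parpool so they are copied from the client to each worker. The profile name 'myClusterProfile' and the pool size are placeholders for your own configuration; the credentials must already be set on the client with setenv.

```matlab
% Assumes AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and (optionally)
% AWS_DEFAULT_REGION are already set on the client via setenv.
% 'myClusterProfile' is a hypothetical cluster profile name.
parpool('myClusterProfile', 8, ...
    'EnvironmentVariables', {'AWS_ACCESS_KEY_ID', ...
                             'AWS_SECRET_ACCESS_KEY', ...
                             'AWS_DEFAULT_REGION'});
```

The same EnvironmentVariables argument is accepted by batch and createJob, or can be stored once in the Cluster Profile Manager so every pool inherits it.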
Read Data from Amazon S3
The following example shows how to use an ImageDatastore object to read a specified image from Amazon S3, and then display the image to screen.
setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');
ds = imageDatastore('s3://bucketname/image_datastore/jpegfiles', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
img = ds.readimage(1);
imshow(img)
Write Data to Amazon S3
The following example shows how to use a tabularTextDatastore object to read tabular data from Amazon S3 into a tall array, preprocess it by removing missing entries and sorting, and then write it back to Amazon S3.
setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');
ds = tabularTextDatastore('s3://bucketname/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('s3://bucketname/preprocessedData/', tt);
To read your tall data back, use the datastore function.
ds = datastore('s3://bucketname/preprocessedData/');
tt = tall(ds);
Azure Blob Storage
MATLAB lets you use Azure Blob Storage for online file storage. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of the form
wasbs://container@account/path_to_file/file.ext
container@account is the name of the container and path_to_file is the path to the file or folders.
Azure provides data storage through web service interfaces. You can use blobs to store data files in Azure. See Introduction to Azure for more information.
Set Up Access
To work with remote data in Azure storage, you must set up access first:
Sign up for a Microsoft Azure account, see Microsoft Azure Account.
Set up your authentication details by setting exactly one of the two following environment variables using setenv:
MW_WASB_SAS_TOKEN — Authentication via Shared Access Signature (SAS). Obtain an SAS. For details, see the "Get the SAS for a blob container" section in Manage Azure Blob Storage resources with Storage Explorer.
In MATLAB, set MW_WASB_SAS_TOKEN to the SAS query string. For example,
setenv MW_WASB_SAS_TOKEN '?st=2017-04-11T09%3A45%3A00Z&se=2017-05-12T09%3A45%3A00Z&sp=rl&sv=2015-12-11&sr=c&sig=E12eH4cRCLilp3Tw%2BArdYYR8RruMW45WBXhWpMzSRCE%3D'
You must set this string to a valid SAS token generated from the Azure Storage web UI or Explorer.
MW_WASB_SECRET_KEY — Authentication via one of the account's two secret keys. Each storage account has two secret keys that permit administrative privilege access. This same access can be given to MATLAB without creating an SAS token, by setting the MW_WASB_SECRET_KEY environment variable. For example:
setenv MW_WASB_SECRET_KEY '1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF'
If you are using Parallel Computing Toolbox, you must copy your client environment variables to the workers on a cluster by setting EnvironmentVariables in parpool, batch, createJob, or in the Cluster Profile Manager.
For more information, see Use Azure storage with Azure HDInsight clusters.
Read Data from Azure
To read data from an Azure Blob Storage location, specify the location using the following syntax:
wasbs://container@account/path_to_file/file.ext
container@account is the name of the container and path_to_file is the path to the file or folders.
For example, if you have a file airlinesmall.csv in a folder /airline on a test storage account wasbs://blobContainer@storageAccount.blob.core.windows.net/, then you can create a datastore by using:
location = 'wasbs://blobContainer@storageAccount.blob.core.windows.net/airline/airlinesmall.csv';
ds = tabularTextDatastore(location, 'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', {'ArrDelay'});
You can use Azure for all calculations that datastores support, including direct reading, mapreduce, tall arrays, and deep learning. For example, create an ImageDatastore object, read a specified image from the datastore, and then display the image to screen.
setenv('MW_WASB_SAS_TOKEN', 'YOUR_WASB_SAS_TOKEN');
ds = imageDatastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
img = ds.readimage(1);
imshow(img)
Write Data to Azure
This example shows how to read tabular data from Azure into a tall array using a tabularTextDatastore object, preprocess it by removing missing entries and sorting, and then write it back to Azure.
setenv('MW_WASB_SAS_TOKEN', 'YOUR_WASB_SAS_TOKEN');
ds = tabularTextDatastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('wasbs://YourContainer@YourAccount.blob.core.windows.net/preprocessedData/', tt);
To read your tall data back, use the datastore function.
ds = datastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/preprocessedData/');
tt = tall(ds);
Hadoop Distributed File System
Specify Location of Data
MATLAB lets you use Hadoop Distributed File System (HDFS) as an online file storage web service. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of one of these forms:
hdfs:/path_to_file
hdfs:///path_to_file
hdfs://hostname/path_to_file
hostname is the name of the host or server and path_to_file is the path to the file or folders. Specifying the hostname is optional. When you do not specify the hostname, Hadoop uses the default host name associated with the Hadoop Distributed File System (HDFS) installation in MATLAB.
For example, you can use either of these commands to create a datastore for the file, file1.txt, in a folder named data located at a host named myserver:
- ds = tabularTextDatastore('hdfs:///data/file1.txt')
- ds = tabularTextDatastore('hdfs://myserver/data/file1.txt')
If hostname is specified, then it must correspond to the namenode defined by the fs.default.name property in the Hadoop XML configuration files for your Hadoop cluster.
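For reference, the fs.default.name property typically lives in the cluster's core-site.xml configuration file. This is an illustrative excerpt only; the host name myserver and port 7867 are placeholder values matching the example below, not values from your cluster.

```xml
<!-- Excerpt from core-site.xml; hdfs://myserver:7867 is a placeholder. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://myserver:7867</value>
</property>
```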
Optionally, you can include the port number. For example, this location specifies a host named myserver with port 7867, containing the file file1.txt in a folder named data:
'hdfs://myserver:7867/data/file1.txt'
The specified port number must match the port number set in your HDFS configuration.
Set Hadoop Environment Variable
Before reading from HDFS, use the setenv function to set the appropriate environment variable to the folder where Hadoop is installed. This folder must be accessible from the current machine.
Hadoop v1 only — Set the HADOOP_HOME environment variable.
Hadoop v2 only — Set the HADOOP_PREFIX environment variable.
If you work with both Hadoop v1 and Hadoop v2, or if the HADOOP_HOME and HADOOP_PREFIX environment variables are not set, then set the MATLAB_HADOOP_INSTALL environment variable.
For example, use this command to set the HADOOP_HOME environment variable. hadoop-folder is the folder where Hadoop is installed, and /mypath/ is the path to that folder.
setenv('HADOOP_HOME', '/mypath/hadoop-folder');
HDFS Data on Hortonworks or Cloudera
If your current machine has access to HDFS data on Hortonworks or Cloudera®, then you do not have to set the HADOOP_HOME or HADOOP_PREFIX environment variables. MATLAB automatically assigns these environment variables when using Hortonworks or Cloudera application edge nodes.
Prevent Clearing of Code from Memory
When reading from HDFS or when reading Sequence files locally, the datastore function calls the javaaddpath command. This command does the following:
Clears the definitions of all Java® classes defined by files on the dynamic class path
Removes all global variables and variables from the base workspace
Removes all compiled scripts, functions, and MEX-functions from memory
To prevent persistent variables, code files, or MEX-files from being cleared, use the mlock function.
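The following is a minimal sketch of using mlock to protect persistent state; the function countCalls is hypothetical and not part of any MATLAB toolbox.

```matlab
function n = countCalls()
% countCalls  Hypothetical counter whose persistent variable should
% survive the implicit clearing that javaaddpath triggers.
persistent callCount
if isempty(callCount)
    callCount = 0;
end
mlock   % lock this function file so callCount is not cleared from memory
callCount = callCount + 1;
n = callCount;
end
```

After the file is locked, calling datastore on an HDFS location no longer resets callCount; use munlock('countCalls') if you later need to clear or edit the function.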
Write Data to HDFS
This example shows how to use a tabularTextDatastore object to write data to an HDFS location. Use the write function to write your tall and distributed arrays to a Hadoop Distributed File System. When you call this function on a distributed or tall array, you must specify the full path to an HDFS folder. The following example shows how to read tabular data from HDFS into a tall array, preprocess it by removing missing entries and sorting, and then write it back to HDFS.
ds = tabularTextDatastore('hdfs://myserver/some/path/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('hdfs://myserver/some/path/preprocessedData/', tt);
To read your tall data back, use the datastore function.
ds = datastore('hdfs://myserver/some/path/preprocessedData/');
tt = tall(ds);
See Also
datastore | tabularTextDatastore | write | imageDatastore | imread | imshow | javaaddpath | mlock | setenv