Work with Remote Data

You can read and write data from a remote location using MATLAB® functions and objects, such as file I/O functions and some datastore objects. These examples show how to set up, read from, and write to remote locations on the following cloud storage platforms:

  • Amazon S3™ (Simple Storage Service)

  • Azure® Blob Storage (previously known as Windows Azure® Storage Blob (WASB))

  • Hadoop® Distributed File System (HDFS™)

Amazon S3

MATLAB lets you use Amazon S3 as an online file storage web service offered by Amazon Web Services. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of the form

s3://bucketname/path_to_file

bucketname is the name of the container and path_to_file is the path to the file or folders.

Amazon S3 provides data storage through web services interfaces. You can use buckets as containers to store objects in Amazon S3.

Set Up Access

To work with remote data in Amazon S3, you must set up access first:

  1. Sign up for an Amazon Web Services (AWS) root account. See Amazon Web Services: Account.

  2. Using your AWS root account, create an IAM (Identity and Access Management) user. See Creating an IAM User in Your AWS Account.

  3. Generate an access key to receive an access key ID and a secret access key. See Managing Access Keys for IAM Users.

  4. Configure your machine with the AWS access key ID, secret access key, and region using the AWS Command Line Interface tool from https://aws.amazon.com/cli/. Alternatively, set the environment variables directly by using setenv:

    • AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY — Authenticate and enable use of Amazon S3 services. (You generated this pair of access key variables in step 3.)

    • AWS_DEFAULT_REGION (optional) — Select the geographic region of your bucket. The value of this environment variable is typically determined automatically, but the bucket owner might require that you set it manually.

    • AWS_SESSION_TOKEN (optional) — Specify the session token if you are using temporary security credentials, such as with AWS® Federated Authentication.
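For example, all of the environment variables above can be set for the current MATLAB session with setenv. The values below are placeholders, and us-east-1 is an assumed region for illustration only:

```matlab
% Required credentials (placeholder values - substitute your own).
setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');

% Optional: geographic region of the bucket, and a session token
% if you are using temporary security credentials.
setenv('AWS_DEFAULT_REGION', 'us-east-1');
setenv('AWS_SESSION_TOKEN', 'YOUR_AWS_SESSION_TOKEN');
```

Variables set this way apply only to the current MATLAB session; set them at the operating-system level if you want them to persist.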

If you are using Parallel Computing Toolbox™, you must ensure the cluster has been configured to access S3. You can copy client environment variables to the workers on a cluster by setting EnvironmentVariables in parpool, batch, createJob, or in the Cluster Profile Manager.
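As a minimal sketch, forwarding the AWS credentials to pool workers at pool creation might look like the following. The cluster profile name 'myCluster' is a hypothetical placeholder:

```matlab
% Copy the named client environment variables to every worker in the
% pool so that datastore reads from S3 succeed on the workers.
% 'myCluster' is a hypothetical cluster profile name.
parpool('myCluster', 'EnvironmentVariables', ...
    {'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'});
```

The same 'EnvironmentVariables' option can be passed to batch or createJob, or set once in the Cluster Profile Manager so that every pool created from that profile inherits the variables.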

Read Data from Amazon S3

The following example shows how to use an ImageDatastore object to read a specified image from Amazon S3, and then display the image to screen.

setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');
ds = imageDatastore('s3://bucketname/image_datastore/jpegfiles', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
img = ds.readimage(1);
imshow(img)

Write Data to Amazon S3

The following example shows how to use a tabularTextDatastore object to read tabular data from Amazon S3 into a tall array, preprocess it by removing missing entries and sorting, and then write it back to Amazon S3.

setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');
ds = tabularTextDatastore('s3://bucketname/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('s3://bucketname/preprocessedData/', tt);

To read your tall data back, use the datastore function.

ds = datastore('s3://bucketname/preprocessedData/');
tt = tall(ds);

Azure Blob Storage

MATLAB lets you use Azure Blob Storage for online file storage. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of the form

wasbs://container@account/path_to_file/file.ext

container@account is the name of the container and path_to_file is the path to the file or folders.

Azure provides data storage through web services interfaces. You can use blobs to store data files in Azure. See Introduction to Azure for more information.

Set Up Access

To work with remote data in Azure storage, you must set up access first:

  1. Sign up for a Microsoft Azure account. See Microsoft Azure Account.

  2. Set up your authentication details by setting exactly one of the two following environment variables using setenv:

    • MW_WASB_SAS_TOKEN — Authentication via Shared Access Signature (SAS)

      Obtain an SAS. For details, see the "Get the SAS for a blob container" section in Manage Azure Blob Storage resources with Storage Explorer.

      In MATLAB, set MW_WASB_SAS_TOKEN to the SAS query string. For example,

      setenv MW_WASB_SAS_TOKEN '?st=2017-04-11T09%3A45%3A00Z&se=2017-05-12T09%3A45%3A00Z&sp=rl&sv=2015-12-11&sr=c&sig=E12eH4cRCLilp3Tw%2BArdYYR8RruMW45WBXhWpMzSRCE%3D'

      You must set this string to a valid SAS token generated from the Azure Storage web UI or Explorer.

    • MW_WASB_SECRET_KEY — Authentication via one of the Account's two secret keys

      Each Storage Account hosts two secret keys that permit administrative privilege access. This same access can be given to MATLAB, without having to create an SAS token, by setting the MW_WASB_SECRET_KEY environment variable. For example:

      setenv MW_WASB_SECRET_KEY '1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF'

If you are using Parallel Computing Toolbox, you must copy the client environment variables to the workers on a cluster by setting EnvironmentVariables in parpool, batch, createJob, or in the Cluster Profile Manager.

For more information, see Use Azure Storage with Azure HDInsight clusters.

Read Data from Azure

To read data from an Azure Blob Storage location, specify the location using the following syntax:

wasbs://container@account/path_to_file/file.ext

container@account is the name of the container and path_to_file is the path to the file or folders.

For example, if you have a file airlinesmall.csv in a folder /airline on a test storage account wasbs://blobContainer@storageAccount.blob.core.windows.net/, then you can create a datastore by using:

location = 'wasbs://blobContainer@storageAccount.blob.core.windows.net/airline/airlinesmall.csv';
ds = tabularTextDatastore(location, 'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', {'ArrDelay'});

You can use Azure for all calculations that datastores support, including direct reading, mapreduce, tall arrays, and deep learning. For example, create an ImageDatastore object, read a specified image from the datastore, and then display the image to screen.

setenv('MW_WASB_SAS_TOKEN', 'YOUR_WASB_SAS_TOKEN');
ds = imageDatastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
img = ds.readimage(1);
imshow(img)

Write Data to Azure

This example shows how to use a tabularTextDatastore object to read tabular data from Azure into a tall array, preprocess it by removing missing entries and sorting, and then write it back to Azure.

setenv('MW_WASB_SAS_TOKEN', 'YOUR_WASB_SAS_TOKEN');
ds = tabularTextDatastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('wasbs://YourContainer@YourAccount.blob.core.windows.net/preprocessedData/', tt);

To read your tall data back, use the datastore function.

ds = datastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/preprocessedData/');
tt = tall(ds);

Hadoop Distributed File System

Specify Location of Data

MATLAB lets you use Hadoop Distributed File System (HDFS) as an online file storage web service. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of one of these forms:

hdfs:/path_to_file
hdfs:///path_to_file
hdfs://hostname/path_to_file

hostname is the name of the host or server and path_to_file is the path to the file or folders. Specifying the hostname is optional. When you do not specify the hostname, Hadoop uses the default host name associated with the Hadoop Distributed File System (HDFS) installation in MATLAB.

For example, you can use either of these commands to create a datastore for the file, file1.txt, in a folder named data located at a host named myserver:

  • ds = tabularTextDatastore('hdfs:///data/file1.txt')
  • ds = tabularTextDatastore('hdfs://myserver/data/file1.txt')

If hostname is specified, it must correspond to the name node defined by the fs.default.name property in the Hadoop XML configuration files for your Hadoop cluster.

Optionally, you can include the port number. For example, this location specifies a host named myserver with port 7867, containing the file file1.txt in a folder named data:

'hdfs://myserver:7867/data/file1.txt'

The specified port number must match the port number set in your HDFS configuration.

Set Hadoop Environment Variable

Before reading from HDFS, use the setenv function to set the appropriate environment variable to the folder where Hadoop is installed. This folder must be accessible from the current machine.

  • Hadoop v1 only — Set the HADOOP_HOME environment variable.

  • Hadoop v2 only — Set the HADOOP_PREFIX environment variable.

  • If you work with both Hadoop v1 and Hadoop v2, or if the HADOOP_HOME and HADOOP_PREFIX environment variables are not set, then set the MATLAB_HADOOP_INSTALL environment variable.

For example, use this command to set the HADOOP_HOME environment variable. hadoop-folder is the folder where Hadoop is installed, and /mypath/ is the path to that folder.

setenv('HADOOP_HOME', '/mypath/hadoop-folder');

HDFS Data on Hortonworks or Cloudera

If your current machine has access to HDFS data on Hortonworks or Cloudera®, then you do not have to set the HADOOP_HOME or HADOOP_PREFIX environment variables. MATLAB automatically assigns these environment variables when using Hortonworks or Cloudera application edge nodes.

Prevent Clearing Code from Memory

When reading from HDFS, or when reading Sequence files locally, the datastore function calls the javaaddpath command. This command does the following:

  • Clears the definitions of all Java® classes defined by files on the dynamic class path

  • Removes all global variables and variables from the base workspace

  • Removes all compiled scripts, functions, and MEX-functions from memory

To prevent persistent variables, code files, or MEX files from being cleared, use the mlock function.
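As a minimal sketch, a function that must keep its state across such clearing can lock itself in memory with mlock. The function and its cached value here are hypothetical:

```matlab
function cfg = getCachedConfig()
% Hypothetical helper: mlock keeps this function file and its
% persistent variable in memory, so a later call to javaaddpath
% (for example, triggered by datastore reading from HDFS) does
% not clear them.
mlock
persistent cachedCfg
if isempty(cachedCfg)
    cachedCfg = struct('retries', 3);  % expensive setup done once
end
cfg = cachedCfg;
end
```

Use munlock followed by clear to release the function again when the cached state is no longer needed.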

Write Data to HDFS

This example shows how to use a tabularTextDatastore object to write data to an HDFS location. Use the write function to write your tall and distributed arrays to a Hadoop Distributed File System. When you call this function on a distributed or tall array, you must specify the full path to an HDFS folder. The following example shows how to read tabular data from HDFS into a tall array, preprocess it by removing missing entries and sorting, and then write it back to HDFS.

ds = tabularTextDatastore('hdfs://myserver/some/path/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('hdfs://myserver/some/path/preprocessedData/', tt);

To read your tall data back, use the datastore function.

ds = datastore('hdfs://myserver/some/path/preprocessedData/');
tt = tall(ds);

See Also


Related Topics