
fileDatastore

Datastore with custom file reader

Description

Use a FileDatastore object to manage large collections of custom format files where the collection does not necessarily fit in memory or when a large custom format file does not fit in memory. You can create a FileDatastore object using the fileDatastore function, specify its properties, and then import and process the data using object functions.

Creation

Description

fds = fileDatastore(location,"ReadFcn",@fcn) creates a datastore from the collection of files specified by location and uses the function fcn to read the data from the files.


fds = fileDatastore(location,"ReadFcn",@fcn,Name,Value) specifies additional parameters and properties for fds using one or more name-value pair arguments. For example, you can specify which files to include in the datastore depending on their extensions with fileDatastore(location,"ReadFcn",@customreader,"FileExtensions",[".exts",".extx"]).

Input Arguments


Files or folders included in the datastore, specified as a FileSet object, as file paths, or as a DsFileSet object.

  • FileSet object — You can specify location as a FileSet object. Specifying the location as a FileSet object leads to a faster construction time for datastores compared to specifying a path or DsFileSet object. For more information, see matlab.io.datastore.FileSet.

  • File path — You can specify a single file path as a character vector or string scalar. You can specify multiple file paths as a cell array of character vectors or a string array.

  • DsFileSet object — You can specify a DsFileSet object. For more information, see matlab.io.datastore.DsFileSet.

Files or folders may be local or remote:

  • Local files or folders — Specify local paths to files or folders. If the files are not in the current folder, then specify full or relative paths. Files within subfolders of the specified folder are not automatically included in the datastore. You can use the wildcard character (*) when specifying the local path. This character specifies that the datastore include all matching files or all files in the matching folders.

  • Remote files or folders — Specify full paths to remote files or folders as a uniform resource locator (URL) of the form hdfs:///path_to_file. For more information, see Work with Remote Data.

When you specify a folder, the datastore includes only files with supported file formats and ignores files with any other format. To specify a custom list of file extensions to include in your datastore, see the FileExtensions property.

Example: "file1.ext"

Example: "../dir/data/file1.ext"

Example: {"C:\dir\data\file1.exts","C:\dir\data\file2.extx"}

Example: "C:\dir\data\*.ext"

Function that reads the file data, specified as a function handle.

The signature of the function represented by the function handle @fcn depends on the value of the specified ReadMode. The function that reads the file data must conform to one of these signatures.

ReadMode

ReadFcn signature

"file" (default)

The function must have this signature:

function data = MyReadFcn(filename) ... end

filename — Name of file to read.

data — Corresponding file data.

"partialfile"

The function must have this signature:

function [data,userdata,done] = MyReadFcn(filename,userdata) ... end

userdata — Set and read fields of userdata to persist data between multiple FileDatastore read calls.

done — Set this logical argument to either true or false.

  • false — Continue to read the current file.

  • true — Terminate current file read and read the next file.

data — Portion of file data.

"bytes"

The function must have this signature:

function data = MyReadFcn(filename,offset,size) ... end

offset — Specify the byte offset from the first byte in the file.

size — Specify the number of bytes to read during the current read operation.

data — Portion of file data of the size specified in BlockSize.

The FileDatastore increments both the offset and size inputs based on the value specified in BlockSize.

The value specified in @fcn sets the value of the ReadFcn property.

Example: @customreader

Data Types: function_handle

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: fds = fileDatastore("C:\dir\data","ReadFcn",@customreader,"FileExtensions",{".exts",".extx"})

Subfolder inclusion flag, specified as the comma-separated pair consisting of "IncludeSubfolders" and true, false, 0, or 1. Specify true to include all files and subfolders within each folder or false to include only the files within each folder.

If you do not specify "IncludeSubfolders", then the default value is false.

Example: "IncludeSubfolders",true

Data Types: logical | double

Custom format file extensions, specified as the comma-separated pair consisting of "FileExtensions" and a character vector, cell array of character vectors, string scalar, or string array.

When you specify a file extension, the fileDatastore function creates a datastore object only for files with the specified extension. You also can create a datastore for files without any extensions by specifying "FileExtensions" as an empty character vector, ''. If you do not specify "FileExtensions", then fileDatastore automatically includes all files within a folder.

Example: "FileExtensions",''

Example: "FileExtensions",".ext"

Example: "FileExtensions",[".exts",".extx"]

Data Types: char | cell | string
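
The following is a minimal sketch of how "FileExtensions" and "IncludeSubfolders" might change which files end up in the datastore. The scratch folder layout and the file names (a.txt, c.dat, sub/b.txt) are hypothetical, and fileread is used as the reader only because it is a simple built-in function that accepts a file name.

```matlab
% Build a small scratch folder tree (hypothetical file names).
root = tempname;
mkdir(fullfile(root,"sub"));
files = [fullfile(root,"a.txt"), fullfile(root,"c.dat"), fullfile(root,"sub","b.txt")];
for f = files
    fid = fopen(f,'w');
    fprintf(fid,'%s','sample');
    fclose(fid);
end

% Default: every file directly inside root, subfolders excluded.
fdsAll = fileDatastore(root,"ReadFcn",@fileread);

% Only the .txt files directly inside root.
fdsTxt = fileDatastore(root,"ReadFcn",@fileread,"FileExtensions",".txt");

% The .txt files in root and in all of its subfolders.
fdsDeep = fileDatastore(root,"ReadFcn",@fileread, ...
    "FileExtensions",".txt","IncludeSubfolders",true);

% Number of files picked up by each datastore.
counts = [numel(fdsAll.Files), numel(fdsTxt.Files), numel(fdsDeep.Files)];
disp(counts)
```

Comparing the three counts is a quick way to confirm that a filter selects the files you expect before you start reading data.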

Function to preview the input data, specified as a function handle.

If you do not specify a preview function, FileDatastore uses the value specified in @ReadFcn as the default preview function. Alternatively, you can specify your own custom preview function for your data.

  • @ReadFcn (default) — Use ReadFcn to sample FileDatastore data. This option can lead to slower performance for tall construction.

  • Function handle — Use your custom preview function for FileDatastore and tall construction to sample the input data. Use PreviewFcn to provide a function that reads only the minimum needed part of input data for preview and tall construction.

The function specified by PreviewFcn must return values with the same data types that the ReadFcn returns.

Data Types: function_handle
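
As a rough illustration, the sketch below pairs a full-file reader with a lighter preview function. The helper readHead, the 20-character limit, and the scratch file are all assumptions made for this example. Both functions return a character row vector, so the data types match as required.

```matlab
% Scratch text file with 1000 characters (hypothetical content).
file = fullfile(tempdir,'preview_demo.txt');
fid = fopen(file,'w');
fprintf(fid,'%s',repmat('abcdefghij',1,100));
fclose(fid);

% ReadFcn reads the whole file; PreviewFcn reads only the start of it.
fds = fileDatastore(file,"ReadFcn",@fileread,"PreviewFcn",@readHead);

head = preview(fds);   % sampled with readHead
full = read(fds);      % read with fileread

function txt = readHead(filename)
    % Hypothetical helper: read at most 20 characters from the start of the file.
    fid = fopen(filename,'r');
    closer = onCleanup(@() fclose(fid));
    txt = fread(fid,20,'*char')';   % transpose to a row vector, like fileread
end
```

Keeping the preview function cheap matters most when the files are large, because the preview runs during preview and tall construction rather than during the full read.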

Portion of the file to read, specified as "file", "partialfile", or "bytes".

"file" (default)

Use read mode "file" when your custom function, specified in ReadFcn, reads the complete file in one read operation.

Based on your custom read function, the file datastore reads the complete file with each call to read. The unit of parallelization is a complete file.

"partialfile"

Use read mode "partialfile" when your custom file read function, specified in ReadFcn, reads only a portion of the file with each read operation.

Based on your custom read function, the file datastore reads only a portion of the file with every call to the read function.

In the "partialfile" read mode, the unit of parallelization is a complete file. Multiple read operations, in serial, are necessary to read a complete file.

"bytes"

Use read mode "bytes" when your custom function, specified in ReadFcn, reads a BlockSize sized portion of the file with each read operation.

FileDatastore sets the unit of parallelization to a block of the file containing the number of bytes specified by BlockSize.

Based on your custom read function, the file datastore reads BlockSize sized portions of a file with every call to the read function. Multiple read operations in parallel are necessary to read a complete file.

To use the subset and shuffle functions on a FileDatastore object, you must set "ReadMode" to "file".

Data Types: char | string

Number of bytes to read with every read operation, specified as a positive integer.

To ensure that you can distribute multiple blocks of a file across multiple parallel MATLAB® workers, specify BlockSize as a positive integer greater than 131072 bytes (128 kilobytes).

To specify or to change the value of BlockSize, you must first set ReadMode to "bytes". FileDatastore sets the default value of BlockSize based on the value specified in ReadMode.

  • If ReadMode is "file" or "partialfile", then FileDatastore sets the default BlockSize to inf.

  • If ReadMode is "bytes", then FileDatastore sets the default BlockSize to 128 megabytes.
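
The "bytes" read mode and BlockSize can be sketched together as follows. The reader readBlock, the scratch file, and the block size of 262144 bytes are hypothetical choices for illustration. The sketch also assumes that the offset passed to the reader is zero-based, so that it can be handed directly to fseek; check this against your MATLAB release before relying on it.

```matlab
% Scratch binary file of 600000 bytes with a repeating 0..255 pattern
% (hypothetical data).
file = fullfile(tempdir,'bytes_demo.bin');
fid = fopen(file,'w');
fwrite(fid,uint8(mod(0:599999,256)),'uint8');
fclose(fid);

% Set ReadMode to "bytes" before BlockSize, then read block by block.
fds = fileDatastore(file,"ReadFcn",@readBlock, ...
    "ReadMode","bytes","BlockSize",262144);

blocks = {};
while hasdata(fds)
    blocks{end+1} = read(fds); %#ok<SAGROW>
end
totalBytes = sum(cellfun(@numel,blocks));
disp(totalBytes)

function data = readBlock(filename,offset,nbytes)
    % Hypothetical reader for the "bytes" signature (filename,offset,size).
    % Assumption: offset counts bytes from the start of the file, zero-based.
    fid = fopen(filename,'r');
    closer = onCleanup(@() fclose(fid));
    fseek(fid,offset,'bof');
    data = fread(fid,nbytes,'*uint8');   % may return fewer bytes at end of file
end
```

The third input is named nbytes here rather than size to avoid shadowing the built-in size function inside the reader; the datastore passes the arguments by position, so the name itself is free to choose.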

Alternate file system root paths, specified as the name-value argument consisting of "AlternateFileSystemRoots" and a string vector or a cell array. Use "AlternateFileSystemRoots" when you create a datastore on a local machine, but need to access and process the data on another machine (possibly of a different operating system). Also, when processing data using the Parallel Computing Toolbox™ and the MATLAB Parallel Server™, and the data is stored on your local machines with a copy of the data available on different platform cloud or cluster machines, you must use "AlternateFileSystemRoots" to associate the root paths.

  • To associate a set of root paths that are equivalent to one another, specify "AlternateFileSystemRoots" as a string vector. For example,

    ["Z:\datasets","/mynetwork/datasets"]

  • To associate multiple sets of root paths that are equivalent for the datastore, specify "AlternateFileSystemRoots" as a cell array containing multiple rows where each row represents a set of equivalent root paths. Specify each row in the cell array as either a string vector or a cell array of character vectors. For example:

    • Specify "AlternateFileSystemRoots" as a cell array of string vectors.

      {["Z:\datasets", "/mynetwork/datasets"];...
       ["Y:\datasets", "/mynetwork2/datasets","S:\datasets"]}

    • Alternatively, specify "AlternateFileSystemRoots" as a cell array of cell arrays of character vectors.

      {{'Z:\datasets','/mynetwork/datasets'};...
       {'Y:\datasets', '/mynetwork2/datasets','S:\datasets'}}

The value of"AlternateFileSystemRoots"must satisfy these conditions:

  • Contains one or more rows, where each row specifies a set of equivalent root paths.

  • Each row specifies multiple root paths and each root path must contain at least two characters.

  • Root paths are unique and are not subfolders of one another.

  • Contains at least one root path entry that points to the location of the files.

For more information, see Set Up Datastore for Processing on Different Machines or Clusters.

Example: ["Z:\datasets","/mynetwork/datasets"]

Data Types: string | cell

Properties


FileDatastore properties describe the files associated with a FileDatastore object. Except for the Files property, you can specify the value of FileDatastore properties using name-value pair arguments. To view or modify a property after creating the object, use the dot notation.

Files included in the datastore, resolved as a character vector, cell array of character vectors, string scalar, or string array, where each character vector or string is a full path to a file. The location argument in the fileDatastore and datastore functions defines Files when the datastore is created.

Example: {"C:\dir\data\file1.ext";"C:\dir\data\file2.ext"}

Example: "hdfs:///data/*.mat"

Data Types: char | cell | string

This property is read-only.

Folders used to construct datastore, returned as a cell array of character vectors. The cell array is oriented as a column vector. Each character vector is a path to a folder that contains data files. The location argument in the fileDatastore and datastore functions defines Folders when the datastore is created.

The Folders property is reset when you modify the Files property of the FileDatastore object.

Data Types: cell

Function that reads the file data, specified as a function handle.

The value specified by @fcn sets the value of the ReadFcn property.

Example: @MyCustomFileReader

Data Types: function_handle

This property is read-only.

Vertically concatenateable flag, specified as a logical true or false. Specify the value of this property when you first create the FileDatastore object.

true

Multiple reads of the FileDatastore object return uniform data that is vertically concatenateable.

When the UniformRead property value is true:

  • The ReadFcn function must return data that is vertically concatenateable; otherwise, the readall method returns an error.

  • The underlying data type of the output of the tall function is the same as the data type of the output from ReadFcn.

false (default)

Multiple reads of the FileDatastore object do not return uniform data that is vertically concatenateable.

When the UniformRead property value is false:

  • readall returns a cell array.

  • tall returns a tall cell array.

Example: fds = fileDatastore(location,"ReadFcn",@load,"UniformRead",true)

Data Types: logical | double
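
The effect of UniformRead on readall can be sketched with two small numeric files. The file names and contents below are hypothetical, and readmatrix is used as the reader on the assumption that both files have the same number of columns, which is what makes the per-file results vertically concatenateable.

```matlab
% Two scratch CSV files with the same number of columns (hypothetical data).
folder = tempname;
mkdir(folder);
f1 = fullfile(folder,"a.csv");
f2 = fullfile(folder,"b.csv");
writematrix([1 2; 3 4],f1);
writematrix([5 6],f2);

% UniformRead true: readall vertically concatenates the per-file matrices.
fdsU = fileDatastore([f1 f2],"ReadFcn",@readmatrix,"UniformRead",true);
A = readall(fdsU);

% UniformRead false (default): readall returns a cell array, one cell per file.
fdsC = fileDatastore([f1 f2],"ReadFcn",@readmatrix);
C = readall(fdsC);
```

If the files had different column counts, the UniformRead true variant would be expected to fail in readall, while the default variant would still return each matrix in its own cell.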

This property is read-only.

List of formats supported for writing, returned as a row vector of strings. This property specifies the possible output formats when using writeall to write output files from the datastore.

Data Types: string

Object Functions

hasdata Determine if data is available to read
numpartitions Number of datastore partitions
partition Partition a datastore
preview Preview subset of data in datastore
read Read data in datastore
readall Read all data in datastore
writeall Write datastore to files
reset Reset datastore to initial state
transform Transform datastore
combine Combine data from multiple datastores
isPartitionable Determine whether datastore is partitionable
isShuffleable Determine whether datastore is shuffleable
shuffle Shuffle all data in datastore
subset Create subset of datastore or file-set

Examples


Create a fileDatastore object using either a FileSet object or file paths.

Create a FileSet object. Create a fileDatastore object.

fs = matlab.io.datastore.FileSet("airlinesmall.parquet");
fds = fileDatastore(fs,"ReadFcn",@load)

fds =
  FileDatastore with properties:

                       Files: {
                              ' ...\matlab\toolbox\matlab\demos\airlinesmall.parquet'
                              }
                     Folders: {
                              '...\matlab\toolbox\matlab\demos'
                              }
                 UniformRead: 0
                    ReadMode: 'file'
                   BlockSize: Inf
                  PreviewFcn: @load
      SupportedOutputFormats: ["txt" "csv" "xlsx" "xls" "parquet" "parq" "png" "jpg" "jpeg" "tif" "tiff" "wav" "flac" "ogg" "mp4" "m4a"]
                     ReadFcn: @load
    AlternateFileSystemRoots: {}

Alternatively, you can use file paths to create your fileDatastore object.

fds = fileDatastore("airlinesmall.parquet","ReadFcn",@load);

Create a datastore containing all the .mat files within the MATLAB® demos folder, specifying the load function to read the file data.

fds = fileDatastore(fullfile(matlabroot,"toolbox","matlab","demos"),"ReadFcn",@load,"FileExtensions",".mat")

fds =
  FileDatastore with properties:

                       Files: {
                              '...\matlab\toolbox\matlab\demos\accidents.mat';
                              '...\matlab\toolbox\matlab\demos\airfoil.mat';
                              ' ...\matlab\toolbox\matlab\demos\airlineResults.mat'
                               ... and 38 more
                              }
                     Folders: {
                              '...\matlab\toolbox\matlab\demos'
                              }
                 UniformRead: 0
                    ReadMode: 'file'
                   BlockSize: Inf
                  PreviewFcn: @load
      SupportedOutputFormats: ["txt" "csv" "xlsx" "xls" "parquet" "parq" "png" "jpg" "jpeg" "tif" "tiff" "wav" "flac" "ogg" "mp4" "m4a"]
                     ReadFcn: @load
    AlternateFileSystemRoots: {}

Read the first file in the datastore, and then read the second file.

data1 = read(fds);
data2 = read(fds);

Read all files in the datastore simultaneously.

readall(fds);

Initialize a cell array to hold the data and a counter i.

dataarray = cell(numel(fds.Files), 1);
i = 1;

Reset the datastore to the first file and read the files one at a time until there is no data left. Assign the data to the array dataarray.

reset(fds);
while hasdata(fds)
    dataarray{i} = read(fds);
    i = i+1;
end

You can create a datastore to read from a large MAT-file that does not necessarily fit in memory. Assuming that each array in the large MAT-file fits in the available memory, create a datastore to read and process the data in three steps:

  1. Write a custom reading function that reads one array at a time from a MAT-file.

  2. Set up the parameters of the datastore function to perform partial reads.

  3. Read one array at a time from the MAT-file.

Write a custom function that reads one array at a time from a MAT-file. The function must have a signature as described in the @ReadFcn argument of FileDatastore. Save this file in your working folder or in a folder that is on the MATLAB path. For this example, a custom function load_variable is included here.

type load_variable.m

function [data,variables,done] = load_variable(filename,variables)
    % If variable list is empty,
    % create list of variables from the file
    if isempty(variables)
        variables = who('-file', filename);
    end

    % Load a variable from the list of variables
    data = load(filename, variables{1});

    % Remove the newly-read variable from the list
    variables(1) = [];

    % Move on to the next file if this file is done reading.
    done = isempty(variables);
end

Create and set up a FileDatastore containing accidents.mat. Specify the datastore parameters to use "partialfile" as the read mode and load_variable as the custom reading function.

fds = fileDatastore("accidents.mat","ReadMode","partialfile","ReadFcn",@load_variable);

Read the first three variables from the file using the datastore. The file accidents.mat contains nine variables and every call to read returns one variable. Therefore, to get the first three variables, call the read function three times.

data = read(fds)

data = struct with fields:
    datasources: {3x1 cell}

data = read(fds)

data = struct with fields:
    hwycols: 17

data = read(fds)

data = struct with fields:
    hwydata: [51x17 double]

Note that the sample file accidents.mat is small and fits in memory, but you can expect similar results for large MAT-files that do not fit in memory.

Tips

  • To use the subset and shuffle functions on a FileDatastore object, you must set "ReadMode" to "file".

Introduced in R2016a