Main Content

Analyze Large Data in Database Using MapReduce

This example determines the mean arrival delay of a large set of flight data that is stored in a database. You can access large data sets using adatabaseDatastoreobject with Database Toolbox™. After creating aDatabaseDatastoreobject, you can write a MapReduce algorithm that defines the chunking and reduction of the data. Alternatively, you can use a tall array to run algorithms on large data sets.

TheDatabaseDatastoreobject does not support using a parallel pool with Parallel Computing Toolbox™ installed. To analyze data using tall arrays or run MapReduce algorithms, set the global execution environment to be the local MATLAB® session.

This example uses a preconfigured JDBC data source to create the database connection. For more information, see theconfigureJDBCDataSourcefunction.

CreateDatabaseDatastoreObject

Set the global execution environment to be the local MATLAB® session.

mapreducer(0);

The fileairlinesmall.csvcontains a large set of flight data. Load this file into the Microsoft® SQL Server® database tableairlinesmall. This table contains 123,523 records.

创建一个database connection to the JDBC data sourceMSSQLServerJDBCAuth. This data source configures a JDBC driver to a Microsoft® SQL Server® database with Windows® authentication. Specify a blank user name and password.

datasource ="MSSQLServerJDBCAuth"; username =""; password =""; conn = database(datasource,username,password);

创建一个DatabaseDatastoreobject using the database connection and an SQL query. This SQL query retrieves arrival-delay data from theairlinesmalltable.

sqlquery ='select ArrDelay from airlinesmall'; dbds = databaseDatastore(conn,sqlquery);

Define Mapper and Reducer Functions

To process large data sets in chunks, you can write your own mapper function. For each chunk in this example, usemeanArrivalDelayMapper.mto:

  • Read arrival-delay data from theDatabaseDatastoreobject.

  • Determine the number of delays and the total delay in the chunk.

  • 两个值存储在KeyValueDatastore.

ThemeanArrivalDelayMapper.mfile contains this code.

functionmeanArrivalDelayMapper (data, info, intermKVStore)% Mapper function for the MeanMapReduceExample.% Copyright 2014 The MathWorks, Inc.% Data is an n-by-1 table of the ArrDelay. Remove missing value first:data(isnan(data.ArrDelay),:) = [];%记录部分的数量和金额增加er will accumulate them.partCountSum = [length(data.ArrDelay), sum(data.ArrDelay)]; add(intermKVStore,'PartialCountSumDelay',partCountSum);

You also can write your own reducer function. In this example, usemeanArrivalDelayReducer.mto read intermediate values for the number of delays and the total arrival delay. Then, determine the overall mean arrival delay.mapreducecalls this reducer function only once because the mapper function adds just one key toKeyValueStore. ThemeanArrivalDelayReducer.mfile contains this code.

functionmeanArrivalDelayReducer(intermKey, intermValIter, outKVStore)% Reducer function for the MeanMapReduceExample.% Copyright 2014 The MathWorks, Inc.% intermKey is 'PartialCountSumDelay'count = 0; sum = 0;whilehasnext(intermValIter) countSum = getnext(intermValIter); count = count + countSum(1); sum = sum + countSum(2);endmeanDelay = sum/count;% The key-value pair added to outKVStore will become the output of mapreduce添加(outKVStore'MeanArrivalDelay',meanDelay);

Run MapReduce Using Mapper and Reducer Functions

To determine the mean arrival delay in the flight data, run MapReduce with theDatabaseDatastoreobject, mapper function, and reducer function.

outds = mapreduce(dbds,@meanArrivalDelayMapper,@meanArrivalDelayReducer);
******************************** * MAPREDUCE PROGRESS * ******************************** Map 0% Reduce 0% Map 15% Reduce 0% Map 30% Reduce 0% Map 46% Reduce 0% Map 61% Reduce 0% Map 76% Reduce 0% Map 92% Reduce 0% Map 100% Reduce 0% Map 100% Reduce 100%

Display Output from MapReduce

Read the table from the output datastore usingreadall.

outtab = readall(outds)
outtab = 1×2 table Key Value __________________ ________ 'MeanArrivalDelay' [7.1201]

The table has only one row containing one key-value pair.

Display the mean arrival delay from the table.

meanArrDelay = outtab.Value{1}
meanArrDelay = 7.1201

CloseDatabaseDatastoreObject and Database Connection

close(dbds)

See Also

||||||

Related Topics

External Websites