readtext

Posted by Robert Bemis, June 13, 2008


Bob's pick this week is readtext by Peder Axensten. Jiro recently highlighted textscantool, which can make importing text data into MATLAB much easier. But you may have run into data that frustrates you, even with textscan. I recently analyzed some data I got from a web source as a CSV file. The comma-separated values all had single quotes around them, both string and numeric types. Here's a sample.

type SampleData.csv

'James Murphy','471'
'John Doe, Jr.','44'
'Bill O'Brien','127'

So there are really three kinds of delimiters per line.

a leading quote before the first value
quote-comma-quote between values
a trailing quote after the last value
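
These three delimiters can be written as one regular expression with alternation. The sketch below is only an illustration, using MATLAB's built-in regexp with the 'split' option on a single hand-typed line; the variable names are made up for this example.

```matlab
% One line of the sample, typed by hand ('' inside a string is a literal quote)
sampleLine = '''John Doe, Jr.'',''44''';

% Three alternatives: leading quote | quote-comma-quote | trailing quote
delimPattern = '^''|'',''|''$';

% Split the line wherever any of the three delimiters matches
parts = regexp(sampleLine, delimPattern, 'split');

% parts{2} is 'John Doe, Jr.' and parts{3} is '44';
% parts{1} and parts{end} are empty strings
disp(parts)
```

The embedded comma survives because a bare comma is not a delimiter here, only quote-comma-quote is. The empty first and last pieces come from the leading and trailing quote delimiters.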

If you've been doing much text data importing into MATLAB, then you probably know that textscan is great, but it can't parse this file correctly.

fid = fopen('SampleData.csv');
data = textscan(fid,'%q%q','Delimiter',',')
fclose(fid);

data = {4x1 cell} {4x1 cell}

See the problem? data should be 3 (not 4) rows. Look closer at column 1.

data{1}

ans =
    ''James Murphy''
    ''John Doe'
    ''44''
    ''Bill O'Brien''

Now look at column 2.

data{2}

ans =''471''''''''''''127''

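Ugh. The comma in John Doe, Jr. was interpreted as a delimiter, so "Jr." was treated as a second-column value. Then the number '44' dropped down to the next row. Notice also that everything is returned as cell strings, even the numeric values. On top of that, most (but not all) of the cells still have those pesky single quotes embedded. Yuck! There are many ways to attack this problem, and Peder's readtext is one of them. In particular, I'm fascinated by the power of using regular-expression-based delimiters.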

data = readtext('SampleData.csv','(?m)^''|'',''|''(?m)$')

data =
    []    'James Murphy'     [471]
    []    'John Doe, Jr.'    [ 44]
    []    'Bill O'Brien'     [127]

The empty first column is an artifact that can easily be suppressed.

data(:,1) = []

data =
    'James Murphy'     [471]
    'John Doe, Jr.'    [ 44]
    'Bill O'Brien'     [127]

In a word - wow! The embedded comma was no problem. Moreover, first column values are strings and second column values are numbers. In another word - sweet. What's your favorite trick or tool for reading particularly nasty data files? Tell us about it here.
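
As one example of such a trick, here is a minimal sketch that uses only built-in functions (regexp, cellfun, str2double) rather than readtext. It assumes every line holds exactly two single-quoted fields, and the sample lines are typed inline instead of being read from SampleData.csv. It is not a description of how readtext works internally, and the variable names are made up for the illustration.

```matlab
% Sample lines typed by hand to mirror SampleData.csv
% ('' inside a string is a literal single quote)
sampleLines = { ...
    '''James Murphy'',''471''', ...
    '''John Doe, Jr.'',''44''', ...
    '''Bill O''Brien'',''127'''};

% Capture the two quoted fields on each line (assumes exactly two per line)
tok = regexp(sampleLines, '^''(.*)'',''(.*)''$', 'tokens', 'once');

% First field stays a string, second field is converted to a number
names  = cellfun(@(t) t{1}, tok, 'UniformOutput', false);
values = cellfun(@(t) str2double(t{2}), tok);

disp(names)
disp(values)
```

This hard-codes the two-column layout, which is exactly the kind of assumption a general delimiter-based reader lets you avoid.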

Published with MATLAB® 7.6