File Exchange Pick of the Week

Our best user submissions

readtext

Bob's pick this week is readtext by Peder Axensten. Jiro recently highlighted textscantool, which can make importing text data into MATLAB easier. But you may have run into data that frustrates both you and textscan. I recently analyzed some data I got from a web source as a CSV file. The comma-separated values all had single quotes around them, both string and numeric types. Here's a sample.
type SampleData.csv
'James Murphy','471'
'John Doe, Jr.','44'
'Bill O'Brien','127'
So there are really three kinds of delimiters per line.
  • a leading quote before the first value
  • quote-comma-quote between values
  • a trailing quote after the last value
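To make the three delimiter kinds concrete, here is a minimal sketch that splits a single line of the sample with MATLAB's built-in regexp and its 'split' option. The variable names are illustrative, and it assumes the line has no trailing whitespace after the closing quote.

```matlab
% One line of the sample file as a MATLAB string
% (a doubled quote inside the literal stands for one quote character).
sampleLine = '''John Doe, Jr.'',''44''';

% Three alternatives: leading quote | quote-comma-quote | trailing quote.
pattern = '^''|'',''|''$';

% Splitting on the delimiters leaves an empty piece at each end,
% because the line starts and ends with a delimiter.
parts  = regexp(sampleLine, pattern, 'split');
fields = parts(2:end-1);    % keep only the real values

disp(fields{1})             % John Doe, Jr.
disp(fields{2})             % 44
```

Because the bare comma inside the name is not surrounded by quotes, it never matches the quote-comma-quote alternative and stays part of the value.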
If you've been doing much text data importing into MATLAB, then you probably know that textscan is good, but it can't parse this file correctly.
fid = fopen('SampleData.csv');
data = textscan(fid,'%q%q','delimiter',',')
fclose(fid);
data = {4x1 cell} {4x1 cell}
See the problem? data should have 3 (not 4) rows. Look closer at column 1.
data{1}
ans =
    ''James Murphy''
    ''John Doe'
    ''44''
    ''Bill O'Brien''
Now look at column 2.
data{2}
ans =''471''''''''''''127''
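Ugh. The comma in John Doe, Jr. was interpreted as a delimiter, so "Jr." was treated as a second-column value. Then the number '44' dropped down to the next row. Notice also that all of the returned cells are strings, even the numeric values. In addition, most (but not all) of the cells have those pesky quote marks embedded. Well! There are a number of ways to resolve this. Peder's readtext is one of them. In particular, I'm fascinated by the power of using a delimiter based on a regular expression.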
data = readtext('SampleData.csv','(?m)^''|'',''|''(?m)$')
data =
    []    'James Murphy'     [471]
    []    'John Doe, Jr.'    [ 44]
    []    'Bill O'Brien'     [127]
The empty first column is an artifact that can easily be suppressed.
data(:,1) = []
data =
    'James Murphy'     [471]
    'John Doe, Jr.'    [ 44]
    'Bill O'Brien'     [127]
In a word - wow! The embedded comma was no problem. Moreover, first column values are strings and second column values are numbers. In another word - sweet. What's your favorite trick or tool for reading particularly nasty data files? Tell us about it here.
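As one possible answer to that question, here is a rough sketch that gets a similar strings-and-numbers result using only built-in functions (fileread, regexp, str2double). It is not how readtext works internally. It assumes exactly two quoted values per line and no trailing whitespace, and it writes its own copy of the sample to a temporary file so that it runs on its own.

```matlab
% Write the three sample lines to a temporary file (illustrative setup only).
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', '''James Murphy'',''471''', ...
                     '''John Doe, Jr.'',''44''', ...
                     '''Bill O''Brien'',''127''');
fclose(fid);

% Read the whole file and break it into non-empty lines.
txt      = fileread(fname);
rowLines = regexp(txt, '\r?\n', 'split');
rowLines = rowLines(~cellfun('isempty', rowLines));

% Split each line on: leading quote | quote-comma-quote | trailing quote.
data = cell(numel(rowLines), 2);
for k = 1:numel(rowLines)
    parts  = regexp(rowLines{k}, '^''|'',''|''$', 'split');
    fields = parts(2:end-1);              % drop the empty end pieces
    data{k,1} = fields{1};                % keep the name as a string
    data{k,2} = str2double(fields{2});    % NaN if the text is not numeric
end

delete(fname);
disp(data)
```

Splitting line by line means the anchors ^ and $ apply to each line directly, so the (?m) multiline flag is not needed in this sketch.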

Published with MATLAB® 7.6
