[R] watch out for quotes in data files

Tue Jul 10 15:50:42 CEST 2001

I have had similar problems with R 1.2.2. Every time a string contains a
single quote ('), R reads up to a maximum of 8192 characters into the
item, creating memory and parsing problems. I now make it a habit to
remove all single quotes (') from the text or to replace them with
double quotes.
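
The clean-up step could look roughly like the sketch below. It is only an
illustration: the small tab-delimited file and the names tmp, raw, cleaned
and dat are made up for the example, and it strips single quotes rather
than swapping them for double quotes, since an unpaired double quote is
also treated as a quote character by default.

```r
# Hypothetical miniature input: one description contains an unpaired '
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "id\tdesc\texpr",
  "A\tgene at 5' end\t1.5",
  "B\tplain description\t2.5"
), tmp)

# Read the raw lines and drop every single quote before parsing
raw <- readLines(tmp)
cleaned <- gsub("'", "", raw, fixed = TRUE)

# Parse the cleaned text; no quote characters remain in this example
dat <- read.table(text = cleaned, sep = "\t", header = TRUE,
                  stringsAsFactors = FALSE)
```

Note that this changes the data (5' becomes 5), which may or may not be
acceptable for a given file.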
-- 
Vele Samak, Vice President
Global Quantitative Research
Salomon Smith Barney 
7 WTC, New York, NY 10048, 212-783-7007

-----Original Message-----
From: Douglas Bates [mailto:bates at stat.wisc.edu]
Sent: Monday, July 09, 2001 11:53 PM
To: R-help at stat.math.ethz.ch
Subject: [R] watch out for quotes in data files

I have just spent a day trying to determine why I seemed to be unable
to read a file of microarray expression results into R properly.  The
file was produced by the Dchip software developed by Li and Wong at
Harvard's Department of Biostatistics.  It contains rows of
tab-delimited fields in the order

Probe set identifier
Probe set description
Array 1 expression
Array 1 call
Array 2 expression
Array 2 call
...

plus an extra tab (which I think is due to a programming glitch).

There are 7130 rows, including the column headers, for results from
Affymetrix Hu6800 chips. 

When I read this file using read.table(filename, sep = "\t", head = TRUE)
I got only 3720 rows.  Furthermore count.fields(filename, sep = "\t")
gave a result of length 7130 but several of the rows were reported as
having only two fields instead of 15 like the other rows.
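
One way to narrow down a discrepancy like this is to compare the field
counts with and without quote processing. The sketch below uses a small
made-up tab-delimited file, not the Dchip output, and the names tmp,
n_default, n_noquote, expected and suspect are illustrative. What the
default call reports for the affected lines may differ between R versions,
so no particular output is assumed for it.

```r
# Hypothetical miniature input: the description on line 2 has an unpaired '
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "id\tdesc\texpr",
  "A\tgene at 5' end\t1.5",
  "B\tplain description\t2.5",
  "C\tanother one\t3.5"
), tmp)

# Field counts with the default quote set (both " and ')
n_default <- count.fields(tmp, sep = "\t")

# Field counts with quote processing switched off
n_noquote <- count.fields(tmp, sep = "\t", quote = "")

# Lines whose default count is missing or differs from the expected width
expected <- max(n_noquote)
suspect <- which(is.na(n_default) | n_default != expected)
```

If suspect is non-empty while n_noquote is constant, a stray quote
character is a likely culprit.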

It seemed to me that the important characteristic of these rows was
their having a very long "Probe set description" and I wasted quite a
bit of time looking for possible buffer overflows that might be
triggered by this.

When I finally came to my senses and created a much smaller input file
that only contained a few rows, including one that was giving an
aberrant field count, I could directly examine the results of scan()
applied to it.  I noticed that the second field for the aberrant line
contained all the subsequent lines and then I saw that its description
included "5'" (as in the 5' end of the sequence versus the 3' end).
Other descriptions had this written as "5 prime" but this one used
"5'".  What was happening was that everything from there to the next
"'" character in the file was being included as part of that
description.

I could read the file properly by adding the optional argument quote =
"" to the call to read.table.

The moral of the story is to watch out for molecular biologists who
use unpaired quote characters in their descriptions.
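
For reference, the quote = "" fix described above can be sketched as
follows. The file contents and the names tmp and dat are invented for
illustration. Setting comment.char = "" is an extra precaution that is not
part of the original fix; it is added here on the assumption that a
description might also contain a # character.

```r
# Hypothetical miniature input: the first description has an unpaired '
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "id\tdesc\texpr",
  "A\tgene at 5' end\t1.5",
  "B\tplain description\t2.5",
  "C\tanother one\t3.5"
), tmp)

# quote = "" disables quote processing, so ' is read as an ordinary character
dat <- read.table(tmp, sep = "\t", header = TRUE, quote = "",
                  comment.char = "", stringsAsFactors = FALSE)

nrow(dat)     # one row per data line
dat$desc[1]   # the description keeps its single quote
```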
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
_._