[R] read.table mystery

Sun Mar 6 18:57:11 CET 2011

You could pre-process your data into a more sensible format.
Or you could use scan to read each line of the file, count the number of colons,
then use read.table with ncolons + 1 columns.
Or you could use read.table with many more columns than are ever going to be
in the data, then delete the empty ones.
Or you could use read.table to read everything in as a single column, then use
strsplit() to split it at the colons.
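
A minimal sketch of the count-the-colons option, under some assumptions: it
uses count.fields() from utils instead of scan() to count the fields per line,
and it writes a small made-up stand-in for the attached file to a temporary
path. Only the last sample line is taken from the original post; the other
lines and the names tmp, nfields and ncol_max are illustrative.

```r
# Hypothetical stand-in for the attached file (only the last line is real).
sample_lines <- c(
  "1:>sp|P00001|AAAA_DROME protein A",
  "2:>sp|P00002|BBBB_DROME protein B",
  "3:>sp|P00003|CCCC_DROME protein C",
  "4:>sp|P00004|DDDD_DROME protein D",
  "5:>sp|P00005|EEEE_DROME protein E",
  "6:>sp|P00006|FFFF_DROME protein F",
  paste0("1065:>sp|Q9V3T9|ADRO_DROME NADPH:adrenodoxin oxidoreductase, ",
         "mitochondrial OS=Drosophila melanogaster GN=dare PE=2 SV=1")
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# Count the ":"-separated fields on every line, not just the first five.
nfields <- count.fields(tmp, sep = ":", quote = "", comment.char = "")
ncol_max <- max(nfields)

# Tell read.table() how many columns to expect via col.names.
dat <- read.table(
  tmp,
  sep = ":",
  header = FALSE,
  quote = "",
  comment.char = "",
  fill = TRUE,
  as.is = TRUE,   # keep text as character rather than factor
  col.names = paste0("V", seq_len(ncol_max))
)

dat[nrow(dat), ]
```

Because the column count comes from the whole file, the extra ":" in the last
line ends up in its own column instead of wrapping onto a new row.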

There are generally lots of ways to do things, but they vary in efficiency both
on the programming side and the execution side. For instance, the
lots-of-columns solution is by far the easiest on the programmer, but is
terribly inefficient and may fail completely for very large datasets.
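
For illustration, the lots-of-columns approach might look roughly like this.
The toy file contents, the upper bound n_guess = 20 and the helper names wide
and is_empty are assumptions made for the sketch, not anything from the
original data.

```r
# Toy input: most lines have two fields, one has four.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "1:alpha",
  "2:beta",
  "3:gamma:extra:more"
), tmp)

# Assumed upper bound on the number of columns; it must be at least as
# large as the widest line, or long lines will wrap onto extra rows.
n_guess <- 20

wide <- read.table(
  tmp,
  sep = ":",
  header = FALSE,
  quote = "",
  comment.char = "",
  fill = TRUE,
  as.is = TRUE,
  col.names = paste0("V", seq_len(n_guess))
)

# Drop the columns that contain nothing but NA or empty strings.
is_empty <- vapply(wide, function(col) all(is.na(col) | col == ""), logical(1))
dat <- wide[, !is_empty, drop = FALSE]

dim(dat)
```

The cost is that every row is padded out to n_guess fields before the empty
columns are thrown away, which is where the inefficiency comes from.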

Sarah

On Sun, Mar 6, 2011 at 12:47 PM, Johannes Graumann
<johannes_graumann at web.de> wrote:
> Thank you for pointing this out. This is really inconvenient as I do not
> know a priori how many and where those darn cases containing an additional
> (or more) ":" might be ...
>
> This seems to work, but will fail if there's a "1:sdfjhlfkh:2:adlkjf"
> somewhere (1 & 2 both integerable).
>
> na.exclude(as.integer(scan("/tmp/testfile.txt",sep=":",what="integer")))
>
> More robust pointers anyone?
>
> Joh
>
> Sarah Goslee wrote:
>
>> Not so much a mystery. read.table() only looks at the first five lines when
>> deciding how many columns your file has (as described in the Details
>> section of the help).
>>
>> The easiest solution is to add a col.names argument to read.table() with
>> the correct number of names.
>>
>> You may want to also include as.is=TRUE if you don't want your data to
>> be imported as factors. If you expect character but have factor you may
>> get unexpected results later.
>>
>> Sarah
>>
>> On Sun, Mar 6, 2011 at 5:04 AM, Johannes Graumann
>> <johannes_graumann at web.de> wrote:
>>> Hello,
>>>
>>> Please have a look at the code below, which I use to read in the attached
>>> file. As line 18 of the file reads "1065:>sp|Q9V3T9|ADRO_DROME
>>> NADPH:adrenodoxin oxidoreductase, mitochondrial OS=Drosophila
>>> melanogaster GN=dare PE=2 SV=1", I expect the code below to produce a 3
>>> column data frame with most of the last column empty and line 18 to
>>> produce a data.frame row like so:
>>>
>>> V1
>>>        1065
>>> V2
>>>        >sp|Q9V3T9|ADRO_DROME NADPH
>>> V3
>>>        adrenodoxin oxidoreductase, mitochondrial OS=Drosophila
>>> melanogaster GN=dare PE=2 SV=1
>>>
>>> Why is that not so?
>>>
>>> Thanks for any hint.
>>>
>>> Sincerely, Joh
>>>
>>> read.table(
>>>  "/tmp/testfile.txt",
>>>  sep=":",
>>>  header=FALSE,
>>>  quote="",
>>>  fill=TRUE
>>> )[19,]
>>
-- 
Sarah Goslee
http://www.functionaldiversity.org