[R] read.table mystery

Mon Mar 7 06:04:14 CET 2011

"count.fields" is a very nice hint for a clean solution - thank you!

Joh

On Sunday 06 March 2011 21:48:32 David Winsemius wrote:
> On Mar 6, 2011, at 12:47 PM, Johannes Graumann wrote:
> > Thank you for pointing this out. This is really inconvenient, as I do not
> > know a priori how many of those darn cases containing one or more
> > additional ":" there are, or where they might be ...
> 
> There is a count.fields function that might assist with this task.
> 
> You seem to have a multiline (variable number of lines) format of:
> 
> NNNN:>sp|header with "|" AND white space separators
> NNNN:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDEEEEE
> NNNN+60:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDEE
> NNNN+120:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDE
> NNNN+180:EXCEPT_LAST
> 
> No way that read.table can work. You might create an index with the
> location of the high-count headers and then reprocess.
> 
> log.idx <- count.fields("/tmp/testfile.txt") > 1
> corpus <- readLines("/tmp/testfile.txt")
> 
> Then parse the headers and rejoin the broken multi-line content. There
> may be worked examples in the archive for variable number multi-line
> file formats.
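
A minimal sketch of that index-and-rejoin idea, run on a made-up five-line file instead of /tmp/testfile.txt. The demo records, the temporary file, and the names is_header, record and result are illustrative assumptions, not taken from the original post. The extra count.fields() arguments are there so that quote characters, "#" signs and blank lines in the file do not throw the index out of step with readLines().

```r
# Made-up stand-in for /tmp/testfile.txt (two records, "NNNN:content" lines).
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "1:>sp|P00001|DEMO_ONE first demo protein",
  "1:MKTAYIAKQR",
  "11:QISFVKSHFS",
  "21:>sp|P00002|DEMO_TWO NADPH:demo oxidoreductase",
  "21:GSHMLEDPVA"
), tmp)

corpus <- readLines(tmp)

# Header lines contain white space, sequence lines do not, so a
# whitespace-based field count above 1 marks a header.
is_header <- count.fields(tmp, quote = "", comment.char = "",
                          blank.lines.skip = FALSE) > 1

# Strip the leading "NNNN:" prefix from every line.
body <- sub("^[0-9]+:", "", corpus)

# Number the records: each header starts a new one.
record <- cumsum(is_header)
headers <- body[is_header]

# Paste the sequence chunks of each record back together; a header
# without any sequence lines ends up with an empty string.
chunks <- split(body[!is_header],
                factor(record[!is_header], levels = seq_along(headers)))
sequences <- unname(vapply(chunks, paste, character(1), collapse = ""))

result <- data.frame(header = headers, sequence = sequences,
                     stringsAsFactors = FALSE)
print(result)
```

This assumes the file starts with a header line and that no sequence line contains white space; both would need checking against the real data.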
> 
> > This seems to work, but will fail if there's a "1:sdfjhlfkh:2:adlkjf"
> > somewhere (1 & 2 both integerable).
> > 
> > na.exclude(as.integer(scan("/tmp/testfile.txt",sep=":",what="integer")))
> > 
> > More robust pointers anyone?
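
One possibly more robust variant is to look only at what precedes the first ":" on each line, so that later colons and integer-like fields are ignored. The sketch below uses assumed sample lines (the sequence line is invented) and the hypothetical names x, idx and content; for the real file, x would come from readLines("/tmp/testfile.txt").

```r
# Assumed sample lines in the "NNNN:content" layout discussed here.
x <- c(
  "1065:>sp|Q9V3T9|ADRO_DROME NADPH:adrenodoxin oxidoreductase",
  "1065:MGINCLNIFRRGLHTSSARL",
  "1:sdfjhlfkh:2:adlkjf"
)

# Drop everything from the first ":" onwards and convert the rest.
idx <- as.integer(sub(":.*$", "", x))

# Everything after the first ":" is the content, later colons included.
content <- sub("^[^:]*:", "", x)

print(data.frame(idx = idx, content = content, stringsAsFactors = FALSE))
```

Lines that do not start with an integer would give NA in idx (with a coercion warning), which may itself be a useful check.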
> > 
> > Joh
> > 
> > Sarah Goslee wrote:
> >> Not so much a mystery. read.table() only looks at the first 5 lines when
> >> deciding how many columns your file has (as described in the Details
> >> section of the help).
> >> 
> >> The easiest solution is to add a col.names argument to read.table() with
> >> the correct number of names.
> >> 
> >> You may also want to include as.is=TRUE if you don't want your data to be
> >> imported as factors. If you expect character but have factor, you may get
> >> unexpected results later.
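
A small sketch of the col.names suggestion, on a made-up seven-line file in which only the last line has a third ":"-separated field. The file contents and the names tmp, n_col and dat are assumptions for illustration; count.fields() is used here only to find how many names to supply.

```r
# Made-up file: the first five lines have two fields, the last one three.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "1:AAAA", "61:BBBB", "121:CCCC", "181:DDDD", "241:EEEE", "301:FFFF",
  "361:>sp|Q9V3T9|ADRO_DROME NADPH:adrenodoxin oxidoreductase"
), tmp)

# Largest number of ":"-separated fields on any line of the file.
n_col <- max(count.fields(tmp, sep = ":", quote = "", comment.char = ""))

# Supplying that many column names stops read.table() from fixing the
# column count from the first five lines alone.
dat <- read.table(tmp, sep = ":", header = FALSE, quote = "", fill = TRUE,
                  comment.char = "", as.is = TRUE,
                  col.names = paste0("V", seq_len(n_col)))
str(dat)
```

Lines with even more colons would simply widen the data frame, so this fixes the column count but does not by itself decide which colons are real separators.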
> >> 
> >> Sarah
> >> 
> >> On Sun, Mar 6, 2011 at 5:04 AM, Johannes Graumann
> >> <johannes_graumann at web.de> wrote:
> >>> Hello,
> >>> 
> >>> 
> >>> Please have a look at the code below, which I use to read in the attached
> >>> file. As line 18 of the file reads "1065:>sp|Q9V3T9|ADRO_DROME
> >>> NADPH:adrenodoxin oxidoreductase, mitochondrial OS=Drosophila
> >>> melanogaster GN=dare PE=2 SV=1", I expect the code below to produce a
> >>> 3-column data frame with most of the last column empty, and line 18 to
> >>> produce a data.frame row like so:
> >>> 
> >>> V1: 1065
> >>> V2: >sp|Q9V3T9|ADRO_DROME NADPH
> >>> V3: adrenodoxin oxidoreductase, mitochondrial OS=Drosophila
> >>>     melanogaster GN=dare PE=2 SV=1
> >>> 
> >>> Why is that not so?
> >>> 
> >>> Thanks for any hint.
> >>> 
> >>> Sincerely, Joh
> >>> 
> >>> read.table(
> >>> "/tmp/testfile.txt",
> >>> sep=":",
> >>> header=FALSE,
> >>> quote="",
> >>> fill=TRUE
> >>> )[19,]
> >> 
> >> ---
> >> Sarah Goslee
> >> http://www.functionaldiversity.org
> > 
> 
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT