[R] write.table: strange output has been produced

Thu Sep 20 00:54:15 CEST 2012

Thank you David - you pointed me in the right direction.
Back to normal, problem sorted.
I had missed a stray single quote in the 'annot' data when I imported it
from file using read.table with the default 'quote' argument, which treats
both single and double quotes as quoting characters. Setting quote="\""
did the trick.
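
For the archives, a minimal sketch of the failure and the fix. The file
contents below are made up for illustration (the apostrophe in "5'-nucleotidase"
is an assumption about what the stray quote looked like); only read.table
and its 'quote' argument come from the actual problem.

```r
# Hypothetical tab-separated annotation file with one stray apostrophe.
tmp <- tempfile(fileext = ".tab")
writeLines(c(
  "id\tkogdefline",
  "1\tIMP-GMP specific 5'-nucleotidase",
  "2\tChitinase",
  "3\tPredicted membrane protein"
), tmp)

# Default quote = "\"'": the apostrophe opens a quoted string that is
# never closed, so the lines after it can be swallowed into one field.
# Wrapped in try() because the exact symptom depends on the file.
bad <- try(suppressWarnings(
  read.table(tmp, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
), silent = TRUE)
if (!inherits(bad, "try-error")) print(nrow(bad))

# Only the double quote counts as a quoting character: one record per line.
good <- read.table(tmp, header = TRUE, sep = "\t", quote = "\"",
                   stringsAsFactors = FALSE)
print(nrow(good))
```

Records swallowed this way never become rows of their own, which would
explain why an id could be missing from 'annot' even though it is in the file.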

Many thanks
-Igor

On Wed, 2012-09-19 at 14:55 -0700, David Winsemius wrote:
> On Sep 19, 2012, at 12:20 PM, Igor Chernukhin wrote:
> 
> > Hi David - 
> > Thank you for your reply. You are probably right. The last 'normal' line
> > doesn't have a double quote closed. There is the complete line below:
> > 
> > -------------------------8<------------------------------------
> > "4657","159998",133.10761487064,185.450704462326,80.7645252789532,0.435504009074069,-1.19924209513405,2.75544399955331e-07,4.75176501174632e-06,"IMP-GMP specific 5-nucleotidase	Nucleotide transport and metabolism 	METABOLISM
> > --------------------------8<------------------------------------
> > 
> > So it might be that the annotation dataset is actually the culprit. But
> > it gets more complicated when I try to find this string in the
> > 'annot' object using the id value. 
> > The id 159998 is present in the output from 'intersect' function:
> > 
> >> which(subset == 159998)
> > [1] 539
> > 
> > It is also present in statdata:
> > 
> >> which(statdata$id == 159998)
> > [1] 1502
> > 
> > But I cannot find it in the 'annot' object???
> > 
> >> which(annot$id == 159998)
> > integer(0)
> > 
> >> class(annot$id)
> > [1] "integer"
> > 
> > Could it be that the annot dataset contains some illegal symbols that
> > break everything? Should I just edit it first with 'sed' to remove
> > everything except alphanumerics before importing into R?
> 
> I find it very productive to use the count.fields function. It lets you experiment with removing the comment character, which you do not yet seem to have done. I find this code particularly useful:
> 
> table(count.fields(file = "fil.ext", sep=",", quote="'", comment.char=""))
> 
> This would get tripped up by commas inside double-quoted strings, but I do not see any of those in the fragments you offered.
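
A small sketch of that check on a made-up file (the file contents and the
tab separator are assumptions for illustration; count.fields and its
arguments are as David describes). Lines whose field count differs from the
majority, or comes back as NA, are the ones worth inspecting.

```r
# Hypothetical tab-separated file with one stray apostrophe.
tmp <- tempfile(fileext = ".tab")
writeLines(c(
  "id\tkogdefline",
  "1\tIMP-GMP specific 5'-nucleotidase",
  "2\tChitinase",
  "3\tPredicted membrane protein"
), tmp)

# With the apostrophe treated as a quote the counts may be irregular or NA,
# or the call may fail outright on the unterminated string, hence try().
counts_default <- try(
  count.fields(tmp, sep = "\t", quote = "\"'", comment.char = ""),
  silent = TRUE
)
if (!inherits(counts_default, "try-error")) {
  print(table(counts_default, useNA = "ifany"))
}

# With only the double quote as a quoting character, every line should
# report the same number of fields.
counts_dq <- count.fields(tmp, sep = "\t", quote = "\"", comment.char = "")
print(table(counts_dq))
```
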
> 
> -- 
> David.
> > 
> > 
> > -Igor
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > On Wed, 2012-09-19 at 10:26 -0700, David Winsemius wrote:
> >> On Sep 19, 2012, at 9:12 AM, Igor wrote:
> >> 
> >>> Good afternoon all -
> >>> 
> >>> While making steady progress in learning R after Matlab I encountered
> >>> a problem which seems to require some extra help to get past.
> >>> Basically I want to merge data from a biological statistical dataset
> >>> with annotation data extracted from another dataset using an 'id'
> >>> cross-reference and write it to a report file. The first part goes
> >>> absolutely fine: I have merged both into a data.frame. But when I try
> >>> to write it to a csv file using 'write.table', it does write
> >>> the 'data.frame' object but it also inserts some parts of the
> >>> annotation data which are not supposed to be there...
> >>> There is a little snapshot of the file output below to illustrate. The
> >>> upper half is fine, that's how it should be. The lower half, which
> >>> actually appears to be space-separated, not comma-separated, was
> >>> obviously grabbed from the annotation dataset and is not supposed to be here.
> >>> 
> >>> --------------------------------8<--------------------------------------------
> >>> "344","166128",126.44286392082,179.904700814932,72.9810270267088,0.40566492535281,-1.3016395254146,2.47449355237252e-07,4.2901159299567e-06,"Chitinas
> >>> "18816","238247",92.5282508325735,135.981255262454,49.0752464026927,0.36089714209487,-1.47034037615176,2.5330054329543e-07,4.38862252337004e-06,"Prot
> >>> "22072","222365",30.8191942806426,52.4262903365628,9.21209822472236,0.17571524068522,-2.50868876576414,2.54433836512085e-07,4.40531098485028e-06,NA,N
> >>> "25062","226605",30.808007579908,50.3976662241578,11.2183489356581,0.22259659575825,-2.16749656564076,2.54934711860645e-07,4.41103467375713e-06,NA,NA
> >>> "7539","247009",75.4175439970731,34.4643221134552,116.370765880691,3.37655751642533,1.75555313265164,2.60010673210741e-07,4.49585878338091e-06,NA,NA,
> >>> "407","267139",425.559675915702,279.393013150954,571.72633868045,2.04631580522577,1.03302881149302,2.61074218843609e-07,4.51123710239304e-06,NA,NA,NA
> >>> "26530","171300",146.80096060985,80.0063286553601,213.595592564339,2.66973370924738,1.4166958484644,2.68061220749976e-07,4.62888115991058e-06,NA,NA,N
> >>> "3078","159013",34.3260176515511,52.4580790080106,16.1939562950917,0.308702808057816,-1.69570948866688,2.69104298652827e-07,4.64379716436078e-06,"40S
> >>> "4657","159998",133.10761487064,185.450704462326,80.7645252789532,0.435504009074069,-1.19924209513405,2.75544399955331e-07,4.75176501174632e-06,"IMP-
> >>> 
> >>> 171597  171597  KOG1347 Uncharacterized membrane protein, predicted
> >>> efflux pump General function prediction only    POORLY CHARACTERIZED
> >>> 171658  171658  KOG4290 Predicted membrane protein  Function unknown
> >>> POORLY CHARACTERIZED
> >>> 171660  171660  KOG0903 Phosphatidylinositol 4-kinase, involved in
> >>> intracellular trafficking and secretion  Signal transduction mechanisms
> >>> CELLULAR 
> >>> 171660  171660  KOG0903 Phosphatidylinositol 4-kinase, involved in
> >>> intracellular trafficking and secretion  Intracellular trafficking,
> >>> secretion, and
> >>> 171703  171703  KOG2674 Cysteine protease required for autophagy -
> >>> Apg4p/Aut2p  Cytoskeleton    CELLULAR PROCESSES AND SIGNALING
> >>> 171703  171703  KOG2674 Cysteine protease required for autophagy -
> >>> Apg4p/Aut2p  Intracellular trafficking, secretion, and vesicular
> >>> transport   CELLU
> >>> and metabolism     METABOLISM
> >> 
> >> This looks like the sort of thing that occurs when there is a mismatched or missing double or single quote, or perhaps a comment character (a "#" that terminated a line read), somewhere. The logical place to look is in the line of data just above the pathological stretch of data. You have clearly only offered a truncated version of the data, since there are many instances of lines ending without matching quotes, even one in the first line.
> >> 
> >> -- 
> >> David.
> >> 
> >> 
> >>> --------------------------------8<--------------------------------------------
> >>> And this is a piece of code that produced this:
> >>> 
> >>> --------------------------------8<--------------------------------------------
> >>>> n = nrow(statdata)
> >>>> extra = data.frame(kogdefline=rep(NA,n), kogClass = rep(NA,n), kogGroup
> >>> = rep(NA,n))
> >>>> subset = intersect(statdata$id, annot$id)
> >>>> MR = match(subset, annot$id)
> >>>> ML = match(subset, statdata$id)
> >>> 
> >>>> extra[ML,1] = as.character(annot[MR,2])
> >>>> extra[ML,2] = as.character(annot[MR,3])
> >>>> extra[ML,3] = as.character(annot[MR,4])
> >>> # strangely, if I do    
> >>> # extra[ML,] = as.character(annot[MR,2:4])
> >>> # it produces digits (???) instead of a string value
> >>> 
> >>>> mergedData = data.frame(statdata, extra)
> >>>> write.table(mergedData, 'filename.csv', sep=',')
> >>> --------------------------------8<--------------------------------------------
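
On the side question in the comments above: as.character() applied to a
whole data frame treats it as a list and returns one string per column
rather than one per cell, which is probably where the digits came from.
A minimal sketch of a column-by-column conversion, using made-up stand-ins
for 'statdata' and 'annot' (all values and the name 'common' are
hypothetical):

```r
# Made-up stand-ins for the real 'statdata' and 'annot' objects.
statdata <- data.frame(id = c(10L, 20L, 30L), score = c(1.5, 2.5, 3.5))
annot <- data.frame(
  id         = c(30L, 10L),
  kogdefline = c("Chitinase", "Predicted membrane protein"),
  kogClass   = c("Cell wall", "Function unknown"),
  stringsAsFactors = TRUE   # factor columns, as read.table produced in 2012
)

n <- nrow(statdata)
extra <- data.frame(kogdefline = rep(NA_character_, n),
                    kogClass   = rep(NA_character_, n),
                    stringsAsFactors = FALSE)

# 'common' rather than 'subset', to avoid masking base::subset.
common <- intersect(statdata$id, annot$id)
MR <- match(common, annot$id)      # rows in annot
ML <- match(common, statdata$id)   # rows in statdata

# Convert each factor column to character separately, then assign.
extra[ML, ] <- lapply(annot[MR, c("kogdefline", "kogClass")], as.character)

mergedData <- data.frame(statdata, extra, stringsAsFactors = FALSE)
print(mergedData)
```

Rows of 'statdata' with no match in 'annot' keep NA in the annotation columns.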
> >>> 
> >>> Any ideas why this is happening?
> >>> 
> >>> Many thanks
> >>> -Igor
> >> 
> >> David Winsemius, MD
> >> Alameda, CA, USA
> >> 
> > 
> > -- 
> > Dr I Chernukhin
> > School of Biological Sciences
> > University of Essex
> > Wivenhoe Park
> > Colchester
> > Essex
> > CO4 3SQ
> > 
> 
> David Winsemius, MD
> Alameda, CA, USA
>