[R] duplicate rows with rbind in a loop

Cara Fiore clfiore at gmail.com
Thu Oct 8 05:19:44 CEST 2015


Dear R users,

I wrote a simple script to change the header lines in a FASTA file that
contains DNA sequences in the format:

>header1
sequence1
>header2
sequence2

I am trying to replace each header in this file with a line from another
file (a taxonomy file). To do that, I have to find the matching header in
the taxonomy file.

The output should be in FASTA format, and it is, but the rows repeat, so the
output file is huge and looks like:

>header1
sequence1
>header1
sequence1
>header2
sequence2

The code I have is:

tax=read.table("taxonomy_file.txt", header=F, quote="", sep="\t")
tax2=data.frame(tax)

library("Biostrings")
seqs=readDNAStringSet("File.fasta")
header=names(seqs)
seqs2=paste(seqs)

new.final=NULL
i=1

#Go through tax file and match the header in tax file to header in seqs file
for(i in 1:length(tax[,1])){
  sampleID=NULL
  match=NULL
  sampleID=as.character(tax2[i,1])  #sample ID in taxonomy header
  match=which(sampleID==header) #index for match in header file
  if(match>0){
    newH1=NULL
    newH2=NULL
    seqline=NULL
    new.header=NULL
    newH1=as.character(tax2[i,1])
    newH2=as.character(tax2[i,2])
    seqline=seqs2[match]
    new.header=paste(">",newH1,"|",newH2, sep="")
    new.final=rbind(new.final, new.header, seqline)
  }
  print(paste("percent complete =", round((i/length(tax2[,1]))*100,3),
              "%", sep=" "))
  write.table(new.final, file="Test_output.txt", quote=FALSE, sep="\n",
              col.names=FALSE, row.names=FALSE, append=TRUE)
  i=i+1
}


Something about rbind seems to repeat all of the rows every time the script
writes to the output file. I have not been able to find anything about this
online or in the R help for rbind, although perhaps I am missing something
obvious.
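
One possible reading of the symptom, offered as a sketch rather than a confirmed diagnosis: write.table() is called inside the loop with append=TRUE while new.final keeps growing, so each pass appends everything collected so far, not just the newest pair. Under that assumption, collecting all pairs first and writing once should avoid the repeats. The example below uses base R only; the small tax2, header and seqs2 objects are hypothetical stand-ins for the taxonomy table, names(seqs) and paste(seqs), and the output goes to a temporary file.

```r
# Hypothetical stand-ins for the taxonomy table and the FASTA contents
tax2 <- data.frame(id    = c("header1", "header2", "header3"),
                   taxon = c("taxA", "taxB", "taxC"),
                   stringsAsFactors = FALSE)
header <- c("header2", "header1")   # plays the role of names(seqs)
seqs2  <- c("GGCC", "ACGT")         # plays the role of paste(seqs)

# match() gives, for each taxonomy ID, its position in header (NA if absent),
# so IDs with no sequence (here "header3") are simply skipped
idx  <- match(tax2$id, header)
keep <- !is.na(idx)

new.header <- paste0(">", tax2$id[keep], "|", tax2$taxon[keep])
seqline    <- seqs2[idx[keep]]

# rbind() builds a 2-row matrix; as.vector() reads it column by column,
# which interleaves the lines as header, sequence, header, sequence, ...
new.final <- as.vector(rbind(new.header, seqline))

# Write once, after everything has been collected
out.file <- tempfile(fileext = ".txt")
writeLines(new.final, out.file)
```

If the loop is kept instead, the same idea would be to move the write.table() call after the closing brace of the for loop, or to write only the current new.header and seqline on each pass.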

I greatly appreciate any help with this!
