[BioC] BSgenomeForge seed file - seqnames field

Kelly V [guest] guest at bioconductor.org
Wed May 1 00:52:03 CEST 2013


I'm preparing a custom reference genome for use with the MEDIPS package. One field of the seed file, 'seqnames', is apparently not optional. The example given in the documentation is this:

paste("chr", c(1:20, "X", "M", "Un",
               paste(c(1:20, "X", "Un"), "_random", sep="")), sep="")
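
For reference, this is what that expression gives when evaluated in a plain R session. The variable name example_seqnames is only for illustration and is not part of the seed file format; the point is that the result is an ordinary character vector with one explicit name per sequence.

```r
# Evaluate the documentation example as ordinary R code.
# 'example_seqnames' is an illustrative variable name, not a seed file keyword.
example_seqnames <- paste("chr", c(1:20, "X", "M", "Un",
                                   paste(c(1:20, "X", "Un"), "_random", sep="")),
                          sep="")

length(example_seqnames)   # 45 names: 20 + 3 + 22
head(example_seqnames, 3)  # "chr1" "chr2" "chr3"
tail(example_seqnames, 3)  # "chr20_random" "chrX_random" "chrUn_random"
```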

I have two simple questions about this.

1. Does R match this information with the source sequence file? For example, if I have a single FASTA file with headers chr_01 ... chr_20, must the seqnames entries exactly match those headers?

2. The reason for my first question: my genome FASTA file contains 1427 extrachromosomal scaffolds, but they are not numbered consecutively; the names run from scaffold_1 to scaffold_3681 with gaps. Do I need to use a regular expression in my seqnames field to tell R to look for scaffold_ followed by 1-4 digits?
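
To make the second question concrete, here is a minimal sketch of the alternative to a regular expression: pulling the exact names out of the FASTA headers with base R, so that an explicit character vector could be supplied instead. This assumes the field takes literal names, as the documentation example suggests; whether BSgenomeForge really requires an exact match with the headers is what I'm asking. The toy file content and the names fasta_path and seq_ids are made up for illustration.

```r
# Toy FASTA standing in for the real genome file (hypothetical content).
fasta_path <- tempfile(fileext = ".fa")
writeLines(c(">chr_01", "ACGT",
             ">chr_02", "GGCC",
             ">scaffold_1", "TTAA",
             ">scaffold_3681 some description", "CCGG"),
           fasta_path)

# Keep only the header lines, i.e. those starting with ">".
fasta_lines <- readLines(fasta_path)
headers <- fasta_lines[grepl("^>", fasta_lines)]

# Drop the ">" and anything after the first whitespace.
seq_ids <- sub("^>([^[:space:]]+).*$", "\\1", headers)

# Print the names as an R expression that could be pasted elsewhere.
dput(seq_ids)

# Zero-padded chromosome names can also be generated directly.
chrom_names <- sprintf("chr_%02d", 1:20)
```

Because the names come straight from the file, gaps in the scaffold numbering would not matter with this approach.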

Thanks for any help,
--Kelly V.

 -- output of sessionInfo(): 

R version 3.0.0 (2013-04-03)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base   

--
Sent via the guest posting facility at bioconductor.org.


