[BioC] BSgenomeForge seed file - seqnames field

Wed May 1 08:11:58 CEST 2013

Hi Kelly,

On 04/30/2013 03:52 PM, Kelly V [guest] wrote:
>
> I'm preparing a custom reference genome for use with the MEDIPS package. I see that one field of the seed file, which is apparently not optional, is the 'seqnames' field. The example given in the documentation is this:
>
> paste("chr", c(1:20, "X", "M", "Un", paste(c(1:20, "X", "Un"), "_random",
> sep="")), sep="")
>
> I have two simple questions about this.
>
> 1. Does R match this information with the source sequence file? For example, if I have a single fasta file with fasta headers chr_01...chr_20, must the seqnames entries exactly match those headers?

No. You need to provide 1 FASTA file per single sequence, that is, 1
file per name you put in the 'seqnames' field. That means that each file
is expected to contain only 1 sequence. What's in the FASTA header of
each file is not important. What's important is that the name of each
file be of the form <prefix><seqname><suffix>, where <seqname> is the
name of the sequence as it appears in the 'seqnames' field, and <prefix>
and <suffix> are a prefix and a suffix (eventually empty) that are the
same for all the files.

If what you have is a big FASTA file containing all the chromosome
sequences, then you first need to split it into 1 file per chromosome.
This is easy to do in R. For example, here is the script I used to split
the big bosTau6.fa file provided by UCSC for the bosTau6 genome:

   library(Biostrings)
   bosTau6 <- readDNAStringSet("bosTau6.fa")

   ### Partitioning:
   is_chrUn <- grepl("^chrUn", names(bosTau6))
   is_chrom <- !is_chrUn

   ### Send each chromosome to a FASTA file.
   seqnames <- paste("chr", c(1:29, "X", "M"), sep="")
   stopifnot(setequal(seqnames, names(bosTau6)[is_chrom]))

   for (seqname in seqnames) {
     seq <- bosTau6[match(seqname, names(bosTau6))]
     filename <- paste(seqname, ".fa", sep="")
     cat("writing ", filename, "\n", sep="")
     writeXStringSet(seq, file=filename, width=50L)
   }

   ### Send the 3286 chrUn_* sequences to 1 FASTA file.
   chrUn_mseq <- bosTau6[is_chrUn]
   writeXStringSet(chrUn_mseq, file="chrUn.fa", width=50L)

The input is the bosTau6.fa file containing 3317 FASTA records:
31 records for the chromosomes, and 3286 for the chrUn_* sequences.
The script produces 32 FASTA files: 1 per chromosome (chr1.fa,
chr2.fa, chr3.fa, ..., chr29.fa, chrX.fa, chrM.fa), and the
chrUn.fa file (containing the 3286 chrUn_* sequences).

Note that you can find this script in the BSgenome package, and
display its source with:

   > splitbigfasta_R <- system.file("extdata",
                                    "GentlemanLab",
                                    "BSgenome.Btaurus.UCSC.bosTau6-tools",
                                    "splitbigfasta.R",
                                    package="BSgenome")

   > cat(readLines(splitbigfasta_R), sep="\n")

It should not be too hard to adapt this script to your own needs.

>
> 2. Revealing the reason for my first question:In my genome fasta file, I have 1427 extrachromosomal scaffolds, but they are not all sequentially numbered, so that I have scaffold_1..scaffold_3681. Do I need to use a regular expression in my seqnames field to tell R to look for scaffold_ followed by 1-4 digits?

No, not in the 'seqnames' field, because those 1427 extrachromosomal
scaffolds should not go there. They would need to go in the 'mseqnames'
field.

Unlike the 'seqnames' field where you enumerate objects that can
only contain 1 sequence, in the 'mseqnames' field you can enumerate
objects that contain more than 1 sequence. More precisely, in the
final BSgenome data package, each entry in the 'seqnames' field will
correspond to a DNAString object (the DNAString container can hold
1 sequence only), and each entry in the 'mseqnames' field will
correspond to a DNAStringSet object (the DNAStringSet container
can hold multiple sequences).

So typically, all the extrachromosomal scaffolds would go in one
DNAStringSet object in the final BSgenome package (this is for example
what is done in the BSgenome.Drerio.UCSC.danRer7 package).
To achieve this, you need to put 1 entry in the 'mseqnames' field
(e.g. "scaffolds"), and to put the 1427 extrachromosomal scaffolds
in one FASTA file named accordingly to that entry (e.g. scaffolds.fa).

In the above script, replace:

   is_chrUn <- grepl("^chrUn", names(bosTau6))

with:

   is_scaffold <- grepl("^scaffold", names(<your_genome>))

then every occurrence of 'is_chrUn' with 'is_scaffold', and
finally those 3 lines:

   ### Send the 3286 chrUn_* sequences to 1 FASTA file.
   chrUn_mseq <- bosTau6[is_chrUn]
   writeXStringSet(chrUn_mseq, file="chrUn.fa", width=50L)

with:

   ### Send the 1427 scaffold_* sequences to 1 FASTA file.
   scaffolds_mseq <- <your_genome>[is_scaffold]
   writeXStringSet(scaffolds_mseq, file="scaffolds.fa", width=50L)

and that should take care of sending all the scaffold sequences
to the scaffolds.fa file.

Let me know if you need further assistance with this.

Cheers,
H.

>
> Thanks for any help,
> --Kelly V.
>
>   -- output of sessionInfo():
>
> R version 3.0.0 (2013-04-03)
> Platform: i386-w64-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319