[BioC] BioStrings for current R: RE: BSgenomeForge seed file - seqnames field

Hervé Pagès hpages at fhcrc.org
Mon Dec 9 23:09:09 CET 2013


Hi Kelly,

On 12/06/2013 12:23 PM, Vining, Kelly wrote:
> Hi Herve,
> It has been a while since I have worked with Biostrings and the MEDIPS package. When I went to install Biostrings just now, got an error message that Biostrings isn't available:
>
>> install.packages("Biostrings")
> Installing package into â/raid1/home/pi/viningk/Râ
> (as âlibâ is unspecified)
> --- Please select a CRAN mirror for use in this session ---

Here you are asked to choose a CRAN mirror but Biostrings is not a CRAN
package. This suggests that you might be on the wrong path.


> Warning message:
> package âBiostringsâ is not available (for R version 3.0.0)
>
> Revisiting the online documentation, it does look like the package is still available.
> Is there an available update, or a different alternative I should be using?

The only proper way to install a Bioconductor package is to use
biocLite(), as documented here:

   http://bioconductor.org/install/

Unlike install.packages(), biocLite() "knows" where to find Bioconductor
packages. Also make sure you're using the latest BioC release (BioC
2.13).

Cheers,
H.

>
> Thanks,
> --Kelly Vining
> ________________________________________
> From: Hervé Pagès [hpages at fhcrc.org]
> Sent: Thursday, May 02, 2013 10:49 AM
> To: Vining, Kelly
> Cc: Kelly V [guest]; bioconductor at r-project.org; MEDIPS Maintainer
> Subject: Re: [BioC] BSgenomeForge seed file - seqnames field
>
> Hi Kelly,
>
> On 05/02/2013 06:44 AM, Vining, Kelly wrote:
>> Hi Herve,
>> Thanks for your helpful response, and for pointing me to the proper scripts. Given this, it appears that I should treat my gene feature annotation files (promoters, genes) as mseqnames objects, similarly to how I will treat the extrachromosomal scaffolds. Is there any way to include additional features that I may find of interest after I forge this initial package (say, using gff files), or am I limited such that I would have to re-forge a new reference?
>
> A BSgenome data package is for storing the sequences of a given
> genome/assembly. No annotations should go there.
>
> In Bioconductor we try to maintain a clear separation between
> packages that contain the sequences of a genome/assembly (BSgenome
> data packages), and packages that contain annotations for that
> genome/assembly (TxDb, FDb, OrganismDb, SNPlocs, etc... packages).
>
> There are at least 2 reasons for this:
>
>     1. For a given assembly there are many kinds of annotations available
>        in many places on the internet, and some of them tend to be updated
>        frequently. By contrast, the sequences of a given assembly
>        never change. So putting only the reference sequences in a
>        BSgenome package make it very stable. It almost never needs to
>        be updated, except for updating a man page or when something
>        changes in the way sequences are stored in the package. This is
>        good with such big packages (800 Mb for the biggest ones).
>
>     2. Storing the DNA sequences of the genes, promoters, or other genomic
>        feature in a BSgenome package would duplicate a lot of DNA
>        sequences that are already in the reference sequences (those
>        small sequences are substrings of the reference sequences).
>        This would make the BSgenome package unnecessarily bigger.
>        It's better to store only the locations of those genomic features,
>        not in the BSgenome package, but in a separate package, and then
>        to use those locations to extract the corresponding sequences from
>        the BSgenome object. This extraction can be done with getSeq()
>        (defined in the BSgenome software package) and is relatively
>        cheap. This approach gives you a lot more flexibility than having
>        those sequences pre-extracted and bundled with the BSgenome data
>        package.
>
>        If your annotations are in GFF files, you don't really need to
>        put them in a package: you can load them directly in GRanges
>        objects with import() (defined in the rtracklayer package),
>        or if you are particularly interested in the exon/transcript
>        structure of your genes, you could use makeTranscriptDbFromGFF()
>        to load a GFF file containing genes, mRNAs, exons, and CDSs into
>        a TranscriptDb. See ?makeTranscriptDbFromGFF in the
>        GenomicFeatures package for more information. Then it's easy to
>        extract the locations of transcripts, exons, or CDSs as a GRanges
>        or GRangesList object from this TranscriptDb object. See
>        ?transcripts and ?exonsBy. Finally once you have the locations
>        in a GRanges object, you can use getSeq() to extract the
>        corresponding sequences from the BSgenome object.
>
>        Other more specialized functions are promoters() and
>        getPromoterSeq() for extracting the promoter locations and
>        their sequences, respectively. Also extractTranscriptsFromGenome()
>        for extracting the full transcriptome from a BSgenome package.
>
> HTH,
> H.
>
> PS: The upstream sequences that you see in some of our BSgenome data
> packages are a relic of the past and will be removed soon.
>
>>
>> Thanks much,
>> --Kelly V.
>>
>> -----Original Message-----
>> From: Hervé Pagès [mailto:hpages at fhcrc.org]
>> Sent: Tuesday, April 30, 2013 11:12 PM
>> To: Kelly V [guest]
>> Cc: bioconductor at r-project.org; Vining, Kelly; MEDIPS Maintainer
>> Subject: Re: [BioC] BSgenomeForge seed file - seqnames field
>>
>> Hi Kelly,
>>
>> On 04/30/2013 03:52 PM, Kelly V [guest] wrote:
>>>
>>> I'm preparing a custom reference genome for use with the MEDIPS package. I see that one field of the seed file, which is apparently not optional, is the 'seqnames' field. The example given in the documentation is this:
>>>
>>> paste("chr", c(1:20, "X", "M", "Un", paste(c(1:20, "X", "Un"),
>>> "_random", sep="")), sep="")
>>>
>>> I have two simple questions about this.
>>>
>>> 1. Does R match this information with the source sequence file? For example, if I have a single fasta file with fasta headers chr_01...chr_20, must the seqnames entries exactly match those headers?
>>
>> No. You need to provide 1 FASTA file per single sequence, that is, 1 file per name you put in the 'seqnames' field. That means that each file is expected to contain only 1 sequence. What's in the FASTA header of each file is not important. What's important is that the name of each file be of the form <prefix><seqname><suffix>, where <seqname> is the name of the sequence as it appears in the 'seqnames' field, and <prefix> and <suffix> are a prefix and a suffix (eventually empty) that are the same for all the files.
>>
>> If what you have is a big FASTA file containing all the chromosome sequences, then you first need to split it into 1 file per chromosome.
>> This is easy to do in R. For example, here is the script I used to split the big bosTau6.fa file provided by UCSC for the bosTau6 genome:
>>
>>      library(Biostrings)
>>      bosTau6 <- readDNAStringSet("bosTau6.fa")
>>
>>      ### Partitioning:
>>      is_chrUn <- grepl("^chrUn", names(bosTau6))
>>      is_chrom <- !is_chrUn
>>
>>      ### Send each chromosome to a FASTA file.
>>      seqnames <- paste("chr", c(1:29, "X", "M"), sep="")
>>      stopifnot(setequal(seqnames, names(bosTau6)[is_chrom]))
>>
>>      for (seqname in seqnames) {
>>        seq <- bosTau6[match(seqname, names(bosTau6))]
>>        filename <- paste(seqname, ".fa", sep="")
>>        cat("writing ", filename, "\n", sep="")
>>        writeXStringSet(seq, file=filename, width=50L)
>>      }
>>
>>      ### Send the 3286 chrUn_* sequences to 1 FASTA file.
>>      chrUn_mseq <- bosTau6[is_chrUn]
>>      writeXStringSet(chrUn_mseq, file="chrUn.fa", width=50L)
>>
>> The input is the bosTau6.fa file containing 3317 FASTA records:
>> 31 records for the chromosomes, and 3286 for the chrUn_* sequences.
>> The script produces 32 FASTA files: 1 per chromosome (chr1.fa, chr2.fa, chr3.fa, ..., chr29.fa, chrX.fa, chrM.fa), and the chrUn.fa file (containing the 3286 chrUn_* sequences).
>>
>> Note that you can find this script in the BSgenome package, and display its source with:
>>
>>      > splitbigfasta_R <- system.file("extdata",
>>                                       "GentlemanLab",
>>                                       "BSgenome.Btaurus.UCSC.bosTau6-tools",
>>                                       "splitbigfasta.R",
>>                                       package="BSgenome")
>>
>>      > cat(readLines(splitbigfasta_R), sep="\n")
>>
>> It should not be too hard to adapt this script to your own needs.
>>
>>>
>>> 2. Revealing the reason for my first question:In my genome fasta file, I have 1427 extrachromosomal scaffolds, but they are not all sequentially numbered, so that I have scaffold_1..scaffold_3681. Do I need to use a regular expression in my seqnames field to tell R to look for scaffold_ followed by 1-4 digits?
>>
>> No, not in the 'seqnames' field, because those 1427 extrachromosomal scaffolds should not go there. They would need to go in the 'mseqnames'
>> field.
>>
>> Unlike the 'seqnames' field where you enumerate objects that can only contain 1 sequence, in the 'mseqnames' field you can enumerate objects that contain more than 1 sequence. More precisely, in the final BSgenome data package, each entry in the 'seqnames' field will correspond to a DNAString object (the DNAString container can hold
>> 1 sequence only), and each entry in the 'mseqnames' field will correspond to a DNAStringSet object (the DNAStringSet container can hold multiple sequences).
>>
>> So typically, all the extrachromosomal scaffolds would go in one DNAStringSet object in the final BSgenome package (this is for example what is done in the BSgenome.Drerio.UCSC.danRer7 package).
>> To achieve this, you need to put 1 entry in the 'mseqnames' field (e.g. "scaffolds"), and to put the 1427 extrachromosomal scaffolds in one FASTA file named accordingly to that entry (e.g. scaffolds.fa).
>>
>> In the above script, replace:
>>
>>      is_chrUn <- grepl("^chrUn", names(bosTau6))
>>
>> with:
>>
>>      is_scaffold <- grepl("^scaffold", names(<your_genome>))
>>
>> then every occurrence of 'is_chrUn' with 'is_scaffold', and finally those 3 lines:
>>
>>      ### Send the 3286 chrUn_* sequences to 1 FASTA file.
>>      chrUn_mseq <- bosTau6[is_chrUn]
>>      writeXStringSet(chrUn_mseq, file="chrUn.fa", width=50L)
>>
>> with:
>>
>>      ### Send the 1427 scaffold_* sequences to 1 FASTA file.
>>      scaffolds_mseq <- <your_genome>[is_scaffold]
>>      writeXStringSet(scaffolds_mseq, file="scaffolds.fa", width=50L)
>>
>> and that should take care of sending all the scaffold sequences to the scaffolds.fa file.
>>
>> Let me know if you need further assistance with this.
>>
>> Cheers,
>> H.
>>
>>>
>>> Thanks for any help,
>>> --Kelly V.
>>>
>>>     -- output of sessionInfo():
>>>
>>> R version 3.0.0 (2013-04-03)
>>> Platform: i386-w64-mingw32/i386 (32-bit)
>>>
>>> locale:
>>> [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United
>>> States.1252 [3] LC_MONETARY=English_United States.1252 [4]
>>> LC_NUMERIC=C [5] LC_TIME=English_United States.1252
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> --
>>> Sent via the guest posting facility at bioconductor.org.
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list