[BioC] chromosome name match among vcf, txdb,BSgenome

Hervé Pagès hpages at fhcrc.org
Fri Oct 5 01:07:50 CEST 2012


Hi Kasper,

On 10/04/2012 02:40 PM, Kasper Daniel Hansen wrote:
> This sounds awesome.
>
> For this, it may be worthwhile to be able to specify it universally,
> through an option or something.  I expect most of us will choose one
> style and stick to it.

Yes why not, sounds worth exploring. For the purpose of troubleshooting
and code sharing thru the mailing list, people will need to remember to
show us what global settings they have though (unfortunately, this
doesn't show up in the sessionInfo()). Nothing new here, we already
have this situation with the stringsAsFactors=FALSE global option
that alters the semantic of some functions (fortunately, very few
people seem to be using this non-default setting). Some people argue
that global options should not alter the value returned by core
functions, but only affect cosmetic things like the width of the
display, the primary & secondary prompt character, etc... and I
tend to agree with that.

>
> I hope style will also imply an ordering chr1 < ... <  chr10

We are not planning to support re-ordering of the chromosomes for
things like TranscriptDb or BSgenome objects at the moment but we
are trying to make sure that those objects are generated with the
"main chromosomes" always coming first and in the natural order i.e.

    chr1 < chr2 < ... <  chr10
    chrI < chrII < ... <  chrX

followed by the sex chromosomes (which should not be present if
roman numbers are in use), followed by the mitochondrian chromosome
if any. After that, it's a mess: there is a bunch of stuff that
varies from one genome build to the other (see hg18 vs hg19),
even from one genome provider to the other for the *same* genome
build. See for example hg19 vs GRCh37.p10 where, except for chrM,
GRCh37.p10 is a superset of hg19 but they use very different naming
conventions so it's hard to map the sequence names between the
2 assemblies.

But would there be much to be gained? The typical situation where
inconsistent naming styles are hurting the user is when a binary
operation like findOverlaps() requires that the 2 input objects
are based on the same reference genome. Right now, findOverlaps()
reject the objects if they don't use the same naming style, but if
they do, and if the chromosome lengths stored in each object are
the same (when they are stored of course, which is not required),
then it works as expected. The chromosome don't need to be stored
in the same order.

BTW, it's worth mentioning that comparing the chromosome names
and lengths is not a guarantee that the chromosomes are coming
from the same reference genome. The chromosome can have the same
name and length, but come from 2 different assemblies, have different
DNA sequences, and therefore the annotations provided for the 2
assemblies are different. Let's keep in mind that, by making it easy
to alter chromosome names, we also make it easy for the user to pass
objects that are not based on the same reference genomes to tools
like findOverlaps(), and to silently get a result that doesn't make
sense.

Finally, and FWIW, there is a universal/unique ID for DNA sequences,
which is the RefSeq ID. Storing this ID (as an additional field) in
the little Seqinfo table contained in our objects sounds like maybe
it could help? I don't know how many objects would actually end up
with IDs instead of NAs in that field, but, for example, it would
not be too hard to add those IDs to the BSgenome and SNPlocs packages
we mantain. Not so sure for other objects like TranscriptDb objects
or GappedAlignments objects though...

Cheers,
H.

>
> On Thu, Oct 4, 2012 at 4:18 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>> Hi Rebecca,
>>
>>
>> On 10/04/2012 12:10 PM, sun wrote:
>>>
>>> Hi All,
>>>
>>> I am going to use "coding <- predictCoding(vcf, txdb,
>>> seqSource=Athaliana)"
>>> to detect coding SNPs. The problem is that the chromosome names are not
>>> consistent among VCF, txdb and BSgenome. In vcf, the chromosome name is
>>> "Chr*", in txdb, the chr name is "Chr", but in BSgenome, the chr name is
>>> "chr*" .
>>>
>>> I know I can use renameSeqlevels() to adjust the seqlevels (chromosome
>>> names) of the VCF object to match that of the txdb annotation. But how can
>>> I adjust the chr name of BSgenome or TranscriptDB?
>>
>>
>> In BioC 2.11 (released yesterday), you can rename the chromosomes of a
>> TranscriptDb object, so you could rename the chromosomes of your
>> VCF and TranscriptDb objects to match the names of the BSgenome object.
>>
>> E.g. for the TranscriptDb object:
>>
>>    seqlevels(txdb) <- sub("^c", "C", seqlevels(txdb))
>>
>> Note that renaming the chromosomes of a TranscriptDb object is a new
>> feature and is not fully implemented yet. For example, if you use
>> select() on the object you'll still get the original names (those
>> stored in the db), and if you try to specify a chromosome name thru
>> the 'vals' arg of the transcripts(), exons() and cds() extractors,
>> you still need to use the original names. This will be addressed soon.
>>
>> Our plan is to also support renaming of the chromosomes of BSgenome
>> and SNPlocs objects very soon.
>>
>> Also, an additional level of convenience will be provided via the
>> seqnameStyle() getter and setter, so you'll be able to quickly rename
>> with something like:
>>
>>    seqnameStyle(x) <- "UCSC"
>>
>> or
>>
>>    seqnameStyle(vcf) <- seqnameStyle(txdb) <- seqnameStyle(genome)
>>
>> This will work on almost any 'x' object that contains chromosome
>> names (GRanges, GRangesList, GappedAlignments, TranscriptDb, VCF,
>> BSgenome, SNPlocs, etc...)
>>
>> Cheers,
>> H.
>>
>>
>>
>>>
>>> Thanks,
>>>
>>> Rebecca
>>>
>>>          [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list