[BioC] MT reference differences between hg19/GRCH37

Hervé Pagès hpages at fhcrc.org
Tue Mar 19 01:21:09 CET 2013


Hi Julian,

On 03/17/2013 05:53 AM, Julian Gehring wrote:
> Hi,
>
> The UCSC hg19 and GRCH 37 reference genome use different reference
> sequences for the mitochondrium (MT) that differ in length and have
> mismatches at multiple positions.  For a short explanation on this, see
> https://lists.soe.ucsc.edu/pipermail/genome/2009-July/019631.html.
>
> Bioconductor normally only provides the UCSC references (see e.g. the
> BSgenome.* or TxDb.* packages.

Because they're the most frequently used but other annotations can
also be used. For example you can use makeTranscriptDbFromBiomart()
to make your own TranscriptDb object from Ensembl for Human. In
that case, you get annotations relative to GRCh37, I think. Or, more
precisely, to the patched version of GRCh37 currently used by Ensembl
(GRCh37.p10 for Ensembl release 70, see Ensembl release announcements
for the details).

Also, if you work with SNPs, a tool like predictCoding() (in
VariantAnnotation) accepts a FaFile object for sequence extraction
(as an alternative to a BSgenome object). Making a FaFile object for
GRCh37 should be easy. Note that you can also make your own BSgenome
data package for GRCh37 (see the BSgenomeForge vignette in the BSgenome
software package). Probably a little more work than making a FaFile
object but shouldn't be too hard.

> When using data aligned to the GRCH
> reference, to what extend does using the UCSC reference influence the
> analysis of features located on the MT, and for which kinds of
> downstream analyses could this become critical?  E.g. locating SNPs on
> the MT is such a critical case.

Any ranged-based analysis (i.e. an analysis where the genomic ranges
of the data are compared with the genomic ranges of the annotations)
might produce wrong results. And AFAIK most HTS downstream analyses
are range-based analyses.

The point-of-view we took in the GenomicRanges package is that it's
always critical. This is the reason why any range-based operation
that involves more than 1 object (e.g. findOverlaps) starts by checking
that the underlying sequences (i.e. the chromosomes) are compatible
between the 2 objects (i.e. have the same length and circularity flag),
and raises an error if they are not. Packages that provide 
functionalities built on top of GenomicRanges package also inherit
that behavior.

This is not an absolute protection though because the chromosomes
could have the same length and circularity flag and still differ.
So it's really the responsibility of the user to check that s/he
uses annotations that match the aligned data.

>
> The problem will likely be solved with the hg20/GRCH38 release,

I hope so. Unless some chromosome sequence in GRCh38 gets updated
after the initial GRCh38 release (this is what happened for GRCh37).

> but data
> that requires the hg19/GRCH37 releases may still be relevant for several
> years.  Would it be reasonable to provide alternative reference
> packages, such as a GRCH37 BSgenome package?

Maybe we could do that, even though there was no demand so far.
Possible reasons could be that:

   - Most people manage to use annotations that match their aligned
     data. I agree that maybe we provide more BSgenome.* and TxDb.*
     packages based on UCSC than on other annotation resources,
     but we also do provide annotations from other resources. For
     example the SNPlocs.* packages contain dbSNP snapshots based
     on GRCh37, not hg19.
     Anyway we don't want people to be stuck with UCSC so we try to
     provide flexible tools that allow people to use annotations
     from different sources. Please let us know if you are in a
     situation where you are stuck and cannot use annotations that
     match your aligned data.

   - Most people simply exclude the MT chromosome from their analysis
     and are happy with that.

Hope this helps,
H.

>
> Best wishes
> Julian
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list