[BioC] IlluminaHumanMethylation450k.db: Missing probes in IlluminaHumanMethylation450kPROBELOCATION function?

Simone enomis.bioc at gmail.com
Wed Jul 3 19:06:53 CEST 2013


Wow, Tim, thank you so much, this code works perfectly! I think I
never received from anybody a code directly and completely working ...
And you are definitely right, I have to learn how to use all the
GRanges stuff, it seems to be really flexible. Will work through the
vignette!

Again, thank you for your great help.

Simone



On Tue, Jul 2, 2013 at 7:23 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
> First off, my apologies about the BSgenome requirement.  I realized
> (...slowly...) that rtracklayer requires the appropriate build of BSgenome
> to correctly assign the chromosome lengths when requested.  It's either that
> or assume the user has connectivity to UCSC, or store them statically (which
> seems like a bad idea, but perhaps it isn't).  I attempted to upload an
> updated version of the package but due to missed communications with Marc,
> it never made it onto the BioC repository.  Mea culpa.  That said...
>
> The following is for a bunch of hits from an experiment, but you could just
> swap in something like
>
> if(!require(BSgenome.Hsapiens.UCSC.hg19))
> biocLite('BSgenome.Hsapiens.UCSC.hg19')
> if(!require(FDb.InfiniumMethylation.hg19))
> biocLite('FDb.InfiniumMethylation.hg19')
>
> library(FDb.InfiniumMethylation.hg19)
> hm450 <- get450k()
> hm450.CpGs <- split(hm450, hm450$probeType)$cg
>
> strand(hm450.CpGs) <- '*' ## always worth double checking
>
>
>
> and use hm450.CpGs in the following code where I am using varByAgeProbes...
>
>
> library(Homo.sapiens)
> txsByGene <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, 'gene')
> names(txsByGene) <- mget(names(txsByGene),
>                                org.Hs.egSYMBOL, ifnotfound=NA)
> genicHits <- subsetByOverlaps(varByAgeProbes, txsByGene)
>
> genicOLs <- findOverlaps(varByAgeProbes, txsByGene)
> genesHit <- sort(table(names(txsByGene)[subjectHits(genicOLs)]))
>
> promotersByGene <- flank(txsByGene, 1500) ## adjust the number to suit your
> definitions
>
> promoterHits <- subsetByOverlaps(varByAgeProbes, promotersByGene)
>
> promoterOLs <- findOverlaps(varByAgeProbes, promotersByGene)
> promotersHit <-
> sort(table(names(promotersByGene)[subjectHits(promoterOLs)]))
>
> promoterGenicHits <- intersect( promoterHits, genicHits )  ## probes that
> hit both regions
>
> promotersAndGenes <- reduce(unlist(c(reduce(promotersByGene),
> reduce(txsByGene))))
>
> intergenicHits <- varByAgeProbes[ which(!varByAgeProbes %over%
> promotersAndGenes) ]
>
> ## trust but verify:
> length(intersect(intergenicHits, genicHits))
> ## [1] 0
> length(intersect(intergenicHits, promoterHits))
> ## [1] 0
>
> So what, right?  I mean, you could have figured this out from the
> manifest...maybe (i.e., assuming you didn't use the GEO manifest with its
> Excel-gene names, and assuming there are no recently updated or rare
> transcripts of interest in your data).  But suppose you have a bunch of
> ChIP-seq or DNAse-seq results, or footprints of transcription factor binding
> sites, and you want to see which of those are hit.  Or H3K27me3 peak calls,
> or CTCF peaks.  Let's use the latter.  (There's a very specific reason that
> I am looking at CTCF locations in this particular analysis)
>
> require(rtracklayer)
> CTCF.peaks <- import("CTCF_peaks.bed", genome='hg19')
>
> subsetByOverlaps( intergenicHits, CTCF.peaks )
>
> You *could* do the above with just the manifest, but it would be a colossal
> pain in the... posterior.  Now consider if you were looking at
> DNAse-inaccessible PU.1 sites, or HiC interacting domains, or the like.
>
> Anyways, I'll patch up the issues with FDb.InfiniumMethylation.hg18/19, try
> to make it a bit faster, and re-upload the packages over the weekend (or
> before, but I worry that can't be done).  I hope the above clarifies my
> intent with the package.  An alternative is to use minfi's mapToGenome()
> function, FWIW.
>
>
>
>
>
> On Tue, Jul 2, 2013 at 2:14 AM, Simone <enomis.bioc at gmail.com> wrote:
>>
>> Dear Tim,
>>
>> > You will need to use toggleProbes() to get all probe locations (this is
>> > an
>> > undesirable side effect of the db0 package infrastructure, which is
>> > geared
>> > towards expression arrays where probes that map to multiple accessions
>> > are a
>> > "bad thing", and that is the reason I deprecated the 27k.db and 450k.db
>> > packages).
>>
>> I understand. However, also with toggleProbes() I have got a problem.
>> I tried for example:
>>
>> > x <- toggleProbes(IlluminaHumanMethylation450kPATH2PROBE, "all")
>>
>> Error in (function (classes, fdef, mtable)  :
>>   unable to find an inherited method for function ‘toggleProbes’ for
>> signature ‘"AnnDbBimap"’
>>
>> But for example the following works
>>
>> > x <- toggleProbes(IlluminaHumanMethylation450kENSEMBL2PROBE, "all")
>>
>> However, I would need the location information, so the column
>> "transcript_location" (e.g. 5'UTR, TSS200, Body, etc.) and it seems
>> that I cannot get it like this?
>>
>> > The FDb.InfiniumMethylation packages have the locations of the targeted
>> > CpGs
>> > (and CpHs, and SNPs) for the corresponding build (just for the record,
>> > the
>> > genome build targeted by 450k.db is hg19/GRch37), but they are returned
>> > using get450k() or get27k() as a GRanges object.
>>
>> Currently, this doesn't work for me neither (see my previous post).
>> Probably I'm doing something wrong ...
>>
>> > Essentially, the mistake I made in compiling the 27k.db and 450k.db
>> > packages
>> > was to use an expression-centric (or gene- and transcript-centric)
>> > infrastructure to represent probes targeting individual loci.  It's more
>> > useful to think of the 450k array, in particular, as a SNP array, with
>> > functions for a given target locus inferred after discovery.  It's also
>> > more
>> > useful to think of the chip as a SNP array when considering
>> > differentially
>> > methylated regions, which could be loosely regarded as the methylation
>> > equivalent of copy number variations.  Last but not least, you can in
>> > fact
>> > use the array to estimate copy number along the genome, as for
>> > determining
>> > sex from chrX copy number.
>>
>> Thank you for your hints and explanations!
>> But at the moment I am still stucked at a much simpler step, I would
>> just need an easy way to distinguish probes located in promoters from
>> those located in gene bodies (and to discard those which are annotated
>> to both).
>>
>> Best wishes,
>> Simone
>
>
>
>
> --
> A model is a lie that helps you see the truth.
>
> Howard Skipper



More information about the Bioconductor mailing list