[BioC] [Rocky] - Clarification for finding SNPs at 3'UTR, 5'UTR, CDS genomic regions

Hervé Pagès hpages at fhcrc.org
Thu Jan 13 00:13:27 CET 2011


Hi Haojam,

On 01/05/2011 08:57 PM, 하오잠 로 wrote:
> Mr.Herv? Pag?s
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> Dear Sir,
> I would like to find the SNPs density located in human gene transcript
> regions i.e. 5'UTR, 3'UTR, CDS or coding exon from the total SNP count
> (30443455) located at dbSNP Build 132. For this I have already
> downloaded the 3'UTR, 5'UTR and CDS from the UCSC table genome browser.
> Could I do it with SNPBLAST or ENTREZ SNP Database but SNPBLAST is not
> supporting what I want.

If you are asking me if you could do this with SNPBLAST or ENTREZ SNP
Database, well I don't know. I don't use those tools. Did you try to ask
directly to the people supporting/distributing those tools? If you do
so, you will need to be much more specific than "SNPBLAST is not
supporting what I want". In particular, people will typically expect
you to show what you've done so far, how you did it, what worked,
what didn't, etc...

> Could you please tell me how would I proceed to
> obtain my goal. I would be glad for your kindness and assistance.

I'm sure there must be many ways to do this. Following the same kind
of approach as described in this recent thread
   https://stat.ethz.ch/pipermail/bioconductor/2010-December/036741.html
would probably work for this too. (Be aware that this thread is
continued in January 2011 and, unfortunately, navigating a thread
by clicking the "Next message" link doesn't cross month boundaries :-( )

However, pulling all the SNPs from SNPlocs.Hsapiens.dbSNP.20101109
into a single GRanges object and finding the overlaps with the
human 5'UTR, 3'UTR and CDSs might be putting too much stress on
your hardware so this is a situation where you will benefit a lot
from working one chromosome at a time.

Something along the line of (pseudo-code, not tested):

library(GenomicFeatures)

### Use the 'tablename' argument to pickup the exact transcript
### annotations you want. Also have a look at
### ?makeTranscriptDbFromBiomart for another source of
### annotations.
txdb <- makeTranscriptDbFromUCSC("hg19")

cdsbytx <- cdsBy(txdb, by="tx", use.names=TRUE)
utr5bytx <- fiveUTRsByTranscript(txdb, use.names=TRUE)
utr3bytx <- threeUTRsByTranscript(txdb, use.names=TRUE)

### Note that the lengths of the above 3 GRangesList objects
### differ because some transcripts don't have a 5' UTR or a
### 3' UTR or both.

countSNPsByFeature <- function(txdb)
{
     cdsbytx <- cdsBy(txdb, by="tx")
     utr5bytx <- fiveUTRsByTranscript(txdb)
     utr3bytx <- threeUTRsByTranscript(txdb)
     ans <- c(cds=0L, utr5=0L, utr3=0L)
     library(SNPlocs.Hsapiens.dbSNP.20101109)
     allchroms <- names(getSNPcount())
     for (seqname in allchroms) {
         snps <- getSNPlocs(seqname, as.GRanges=TRUE)
         seqnames(snps) <- sub("ch", "chr", seqnames(snps))
         seqnames(snps) <- sub("chrMT", "chrM", seqnames(snps))
         n1 <- sum(countOverlaps(snps, cdsbytx))
         n2 <- sum(countOverlaps(snps, utr5bytx))
         n3 <- sum(countOverlaps(snps, utr3bytx))
         ans <- ans + c(n1, n2, n3)
     }
     ans
}

Then:

snpcount <- countSNPsByFeature(txdb)

Don't know how long this would take, 15 minutes? one hour?
maybe more... Let us know. You might need at least 4GB of RAM for
this to work in reasonable time.

Note that some SNPs will be counted more than once because a SNP
can hit more than one feature e.g. the 5' UTR of transcript A and
the CDS of transcript B etc...

Hope this can give you a start.

Cheers,
H.

>
> Warm regards,
> Haojam Rocky
> Ph.D. scholar
> Seoul National University College of Medicine


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list