[BioC] DNAStringsSet - remove multiple entries with same name
Hervé Pagès
hpages at fhcrc.org
Wed Sep 3 21:14:49 CEST 2014
Hi Dips,
On 09/03/2014 10:30 AM, deepti anand wrote:
> Hi All,
>
> I am trying to extract promoter sequences for a few ENTREZ IDS. The problem I am having is that there exists multiple transcripts for same gene. So this gives me multiple promoter sequences for same gene. Can I filter out the redundant promoter sequences?
>
> Here is my code:
> ids.ok = c("67665" ,"13198" ,"110196","15368")## Obtain coordinates of transcript #####>grl <- transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by="gene") [ids.ok]>promoter.seqs <- getPromoterSeq(grl,Mmusculus, upstream=1500,downstream=0)>promoter.seqs<- unlist(promoter.seqs)> promoter.seqs A DNAStringSet instance of length 8 width seq names [1] 1500 CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC 67665.67665[2] 1500 CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC 67665.67665[3] 1500 CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC 67665.67665[4] 1500 CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTC
C!
> TC 67665.67665[5] 1500 CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGCTTCCC...CCCCCCCCCAGGAGGGGCCGGACAGCATAAAGGATACTCGCTCTCCGCTCTTGA 13198.13198[6] 1500 CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAA...CCGTGGCATGCCGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG 110196.110196[7] 1500 CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAA...CCGTGGCATGCCGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG 110196.110196[8] 1500 GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGTTGGAGTT...TGGGCGGCCACCACGTGACCCGCGTACTTAAAGGGCTGGCGCGGGCAGCTGCTC 15368.15368
>
> In example above, there are four sequences for same gene '67665.67665'. How can I remove these entries?
> I would appreciate any help
In this particular example, the TSS (transcription starting site) is
the same for all the transcripts in any of your genes. This explains
why you get the same promoter sequence. However, this won't always be
the case. So in order to handle the general situation, we need to
choose between 2 behaviors:
1) Return 1 sequence per gene (even for genes with more than 1 unique
TSS).
2) Return 1 sequence per unique TSS per gene.
To do 1) we need to choose a particular TSS for each gene. A common
choice is to pick-up the most upstream TSS. In this case, genes()
can be used to extract the gene ranges. genes() will return a GRanges
object with 1 genomic range per gene. This genomic range is the union
of all the transcript ranges in the gene.
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
ids.ok <- c("67665", "13198", "110196", "15368")
gn <- genes(txdb)[ids.ok]
Then:
> gn
GRanges with 4 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <CharacterList>
67665 chr18 [ 60526221, 60558762] + | 67665
13198 chr10 [127290793, 127296288] + | 13198
110196 chr3 [ 89093588, 89101967] - | 110196
15368 chr8 [ 75093618, 75100593] + | 15368
---
seqlengths:
chr1 chr2 ... chrUn_JH584304
195471971 182113224 ... 114452
Note that, by default, genes() will exclude genes that don't have all
their transcripts on the same chromosome and strand (because in that
case the gene cannot be represented with a unique genomic range).
See ?genes for more information.
Then use extractUpstreamSeqs() to extract the upstream sequences for
those genes:
library(BSgenome.Mmusculus.UCSC.mm10)
genome <- BSgenome.Mmusculus.UCSC.mm10
up_seqs <- extractUpstreamSeqs(genome, gn, width=1500)
Then:
> gn_up_seqs
A DNAStringSet instance of length 4
width seq names
[1] 1500 CTGCTGTAAAGTTACATTCCTGC...GGTGCGCCGGGCGTTGGCTCCTC 67665
[2] 1500 CAGCCCTAAAAGATGAAAGTCGC...GGATACTCGCTCTCCGCTCTTGA 13198
[3] 1500 CACGTCGGCCTGCCTATCAGGGA...TTATGTTCTGCCTCCTGAGCCTG 110196
[4] 1500 GTTAGTATTTAATATTTAAAGCT...AGGGCTGGCGCGGGCAGCTGCTC 15368
For doing 2) we can use transcriptsBy() to extract all transcript ranges
grouped by gene:
tx_by_gn <- transcriptsBy(txdb, by="gene")[ids.ok]
Then we can obtain the unique TSS for each gene by first resizing each
list element in 'tx_by_gn' and then by passing the result thru unique():
TSS_by_gn <- unique(endoapply(tx_by_gn, resize, width=1))
(Note that we use endoapply() here because we don't have a "resize"
method for GRangesList objects. However we should definitely add
one because doing endoapply() is not efficient.)
Then we can get the upstream sequence for each TSS with:
TSS_up_seqs <- relist(extractUpstreamSeqs(
genome,
unlist(TSS_by_gn, use.names=FALSE),
width=1500),
TSS_by_gn)
This time the upstream sequences are returned in a DNAStringSetList
object with 1 list element (DNAStringSet) per gene:
> TSS_up_seqs
DNAStringSetList of length 4
[["67665"]]
CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAATCTATATAAGT...
[["13198"]]
CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGCTTCCCGGGTAGTGTGC...
[["110196"]]
CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAACCATCTTAAT...
[["15368"]]
GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGTTGGAGTTGGTTTTTCTGG...
The number of sequences in each list element is the number of unique
TSS in the gene. Of course, there is no benefit to this second approach
in *your* case because none of your genes has more than 1 unique TSS.
But some genes do have more than 1 unique TSS (e.g. 100017, 100019,
100036521, and many more...).
Hope this helps,
H.
> Dips
>
>
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list