[BioC] DNAStringsSet - remove multiple entries with same name

Wed Sep 3 21:14:49 CEST 2014

Hi Dips,

On 09/03/2014 10:30 AM, deepti anand wrote:
> Hi All,
>
> I am trying to extract promoter sequences for a few ENTREZ IDS. The problem I am having is that there exists multiple transcripts for same gene. So this gives me multiple promoter sequences for same gene. Can I filter out the redundant  promoter sequences?
>
> Here is my code:
> ids.ok = c("67665" ,"13198" ,"110196","15368")## Obtain coordinates of transcript #####>grl <- transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by="gene") [ids.ok]>promoter.seqs <- getPromoterSeq(grl,Mmusculus, upstream=1500,downstream=0)>promoter.seqs<- unlist(promoter.seqs)> promoter.seqs  A DNAStringSet instance of length 8    width seq                                                                                                             names               [1]  1500 CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC 67665.67665[2]  1500 CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC 67665.67665[3]  1500 CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC 67665.67665[4]  1500 CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTC
 C!
>   TC 67665.67665[5]  1500 CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGCTTCCC...CCCCCCCCCAGGAGGGGCCGGACAGCATAAAGGATACTCGCTCTCCGCTCTTGA 13198.13198[6]  1500 CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAA...CCGTGGCATGCCGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG 110196.110196[7]  1500 CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAA...CCGTGGCATGCCGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG 110196.110196[8]  1500 GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGTTGGAGTT...TGGGCGGCCACCACGTGACCCGCGTACTTAAAGGGCTGGCGCGGGCAGCTGCTC 15368.15368
>
> In example above, there are four sequences for same gene '67665.67665'. How can I remove these entries?
> I would appreciate any help

In this particular example, the TSS (transcription starting site) is
the same for all the transcripts in any of your genes. This explains
why you get the same promoter sequence. However, this won't always be
the case. So in order to handle the general situation, we need to
choose between 2 behaviors:

   1) Return 1 sequence per gene (even for genes with more than 1 unique
      TSS).

   2) Return 1 sequence per unique TSS per gene.

To do 1) we need to choose a particular TSS for each gene. A common
choice is to pick-up the most upstream TSS. In this case, genes()
can be used to extract the gene ranges. genes() will return a GRanges
object with 1 genomic range per gene. This genomic range is the union
of all the transcript ranges in the gene.

   library(TxDb.Mmusculus.UCSC.mm10.knownGene)
   txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
   ids.ok <- c("67665", "13198", "110196", "15368")
   gn <- genes(txdb)[ids.ok]

Then:

   > gn
   GRanges with 4 ranges and 1 metadata column:
            seqnames                 ranges strand |         gene_id
               <Rle>              <IRanges>  <Rle> | <CharacterList>
      67665    chr18 [ 60526221,  60558762]      + |           67665
      13198    chr10 [127290793, 127296288]      + |           13198
     110196     chr3 [ 89093588,  89101967]      - |          110196
      15368     chr8 [ 75093618,  75100593]      + |           15368
     ---
     seqlengths:
                      chr1                 chr2 ...       chrUn_JH584304
                 195471971            182113224 ...               114452

Note that, by default, genes() will exclude genes that don't have all
their transcripts on the same chromosome and strand (because in that
case the gene cannot be represented with a unique genomic range).
See ?genes for more information.

Then use extractUpstreamSeqs() to extract the upstream sequences for
those genes:

   library(BSgenome.Mmusculus.UCSC.mm10)
   genome <- BSgenome.Mmusculus.UCSC.mm10
   up_seqs <- extractUpstreamSeqs(genome, gn, width=1500)

Then:

   > gn_up_seqs
     A DNAStringSet instance of length 4
       width seq                                               names 

   [1]  1500 CTGCTGTAAAGTTACATTCCTGC...GGTGCGCCGGGCGTTGGCTCCTC 67665
   [2]  1500 CAGCCCTAAAAGATGAAAGTCGC...GGATACTCGCTCTCCGCTCTTGA 13198
   [3]  1500 CACGTCGGCCTGCCTATCAGGGA...TTATGTTCTGCCTCCTGAGCCTG 110196
   [4]  1500 GTTAGTATTTAATATTTAAAGCT...AGGGCTGGCGCGGGCAGCTGCTC 15368

For doing 2) we can use transcriptsBy() to extract all transcript ranges
grouped by gene:

   tx_by_gn <- transcriptsBy(txdb, by="gene")[ids.ok]

Then we can obtain the unique TSS for each gene by first resizing each
list element in 'tx_by_gn' and then by passing the result thru unique():

   TSS_by_gn <- unique(endoapply(tx_by_gn, resize, width=1))

(Note that we use endoapply() here because we don't have a "resize"
method for GRangesList objects. However we should definitely add
one because doing endoapply() is not efficient.)

Then we can get the upstream sequence for each TSS with:

   TSS_up_seqs <- relist(extractUpstreamSeqs(
                             genome,
                             unlist(TSS_by_gn, use.names=FALSE),
                             width=1500),
                         TSS_by_gn)

This time the upstream sequences are returned in a DNAStringSetList
object with 1 list element (DNAStringSet) per gene:

   > TSS_up_seqs
   DNAStringSetList of length 4
   [["67665"]] 
CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAATCTATATAAGT...
   [["13198"]] 
CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGCTTCCCGGGTAGTGTGC...
   [["110196"]] 
CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAACCATCTTAAT...
   [["15368"]] 
GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGTTGGAGTTGGTTTTTCTGG...

The number of sequences in each list element is the number of unique
TSS in the gene. Of course, there is no benefit to this second approach
in *your* case because none of your genes has more than 1 unique TSS.
But some genes do have more than 1 unique TSS (e.g. 100017, 100019,
100036521, and many more...).

Hope this helps,
H.

> Dips
>
>
>
>   		 	   		
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319