[BioC] DNAStringsSet - remove multiple entries with same name

Mon Sep 8 00:20:06 CEST 2014

Thanks Herve,
It worked.
-Dips-

> Date: Thu, 4 Sep 2014 12:15:43 -0700
> From: hpages at fhcrc.org
> To: anand.deepti at outlook.com
> CC: Bioconductor at r-project.org
> Subject: Re: [BioC] DNAStringsSet - remove multiple entries with same name
> 
> Hi Dips,
> 
> I'm cc'ing the mailing list so others can benefit.
> 
> On 09/04/2014 11:34 AM, deepti anand wrote:
> > Hi H,
> >
> > Thank you for the code. I run the code but the extractUpstremSeqs() does
> > not seem to work. Here are the codes:
> >
> >> ids.ok
> > [1] "67665"  "13198"  "110196" "15368"
> >  > txdb<- TxDb.Mmusculus.UCSC.mm10.knownGene
> >  > gn<- genes(txdb)[ids.ok]
> >  > gn
> > GRanges with 4 ranges and 1 metadata column:
> >           seqnames                 ranges strand |         gene_id
> >              <Rle>              <IRanges>  <Rle> | <CharacterList>
> >     67665    chr18 [ 60526221,  60558762]      + |           67665
> >     13198    chr10 [127290793, 127296288]      + |           13198
> >    110196     chr3 [ 89093588,  89101967]      - |          110196
> >     15368     chr8 [ 75093618,  75100593]      + |           15368
> >    ---
> >    seqlengths:
> >                     chr1                 chr2                 chr3 ...
> >      chrUn_GL456394       chrUn_GL456396       chrUn_JH584304
> >                195471971            182113224            160039680 ...
> >               24323                21240               114452
> >  > genome <- BSgenome.Mmusculus.UCSC.mm10
> >  > up_seqs <- extractUpstreamSeqs(genome, gn, width=1500)
> > Error: could not find function "extractUpstreamSeqs"
> >
> > Could you suggest how should I proceed further?
> 
> Oops, extractUpstreamSeqs() is only available in the current devel
> version of Bioconductor (BioC 3.0, which will be released in October).
> Sorry. With the current release version (BioC 2.14), you can obtain
> 'gn_up_seqs' with:
> 
>    gn_up_seqs  <- getPromoterSeq(gn, genome, upstream=1500, downstream=0)
> 
> And 'TSS_up_seqs' with:
> 
>    TSS_up_seqs <- relist(getPromoterSeq(
>                              unlist(TSS_by_gn, use.names=FALSE),
>                              genome,
>                              upstream=1500, downstream=0),
>                          TSS_by_gn)
> 
> Please let me know if you run into other issues.
> 
> Thanks,
> H.
> 
> PS: Yesterday I added a "resize" method for GRangesList objects in
> BioC devel. It's in GenomicRanges 1.17.37 (which will become available
> via biocLite() in the next 24 hours or so).
> 
> >
> > Best,
> >
> > -Dips-
> >
> >
> >  > Date: Wed, 3 Sep 2014 12:14:49 -0700
> >  > From: hpages at fhcrc.org
> >  > To: anand.deepti at outlook.com; bioconductor at r-project.org
> >  > Subject: Re: [BioC] DNAStringsSet - remove multiple entries with same
> > name
> >  >
> >  > Hi Dips,
> >  >
> >  > On 09/03/2014 10:30 AM, deepti anand wrote:
> >  > > Hi All,
> >  > >
> >  > > I am trying to extract promoter sequences for a few ENTREZ IDS. The
> > problem I am having is that there exists multiple transcripts for same
> > gene. So this gives me multiple promoter sequences for same gene. Can I
> > filter out the redundant promoter sequences?
> >  > >
> >  > > Here is my code:
> >  > > ids.ok = c("67665" ,"13198" ,"110196","15368")## Obtain coordinates
> > of transcript #####>grl <-
> > transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by="gene")
> > [ids.ok]>promoter.seqs <- getPromoterSeq(grl,Mmusculus,
> > upstream=1500,downstream=0)>promoter.seqs<- unlist(promoter.seqs)>
> > promoter.seqs A DNAStringSet instance of length 8 width seq names [1]
> > 1500
> > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC
> > 67665.67665[2] 1500
> > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC
> > 67665.67665[3] 1500
> > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC
> > 67665.67665[4] 1500
> > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTC
> >  > C!
> >  > > TC 67665.67665[5] 1500
> > CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGCTTCCC...CCCCCCCCCAGGAGGGGCCGGACAGCATAAAGGATACTCGCTCTCCGCTCTTGA
> > 13198.13198[6] 1500
> > CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAA...CCGTGGCATGCCGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG
> > 110196.110196[7] 1500
> > CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAA...CCGTGGCATGCCGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG
> > 110196.110196[8] 1500
> > GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGTTGGAGTT...TGGGCGGCCACCACGTGACCCGCGTACTTAAAGGGCTGGCGCGGGCAGCTGCTC
> > 15368.15368
> >  > >
> >  > > In example above, there are four sequences for same gene
> > '67665.67665'. How can I remove these entries?
> >  > > I would appreciate any help
> >  >
> >  > In this particular example, the TSS (transcription starting site) is
> >  > the same for all the transcripts in any of your genes. This explains
> >  > why you get the same promoter sequence. However, this won't always be
> >  > the case. So in order to handle the general situation, we need to
> >  > choose between 2 behaviors:
> >  >
> >  > 1) Return 1 sequence per gene (even for genes with more than 1 unique
> >  > TSS).
> >  >
> >  > 2) Return 1 sequence per unique TSS per gene.
> >  >
> >  > To do 1) we need to choose a particular TSS for each gene. A common
> >  > choice is to pick-up the most upstream TSS. In this case, genes()
> >  > can be used to extract the gene ranges. genes() will return a GRanges
> >  > object with 1 genomic range per gene. This genomic range is the union
> >  > of all the transcript ranges in the gene.
> >  >
> >  > library(TxDb.Mmusculus.UCSC.mm10.knownGene)
> >  > txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
> >  > ids.ok <- c("67665", "13198", "110196", "15368")
> >  > gn <- genes(txdb)[ids.ok]
> >  >
> >  > Then:
> >  >
> >  > > gn
> >  > GRanges with 4 ranges and 1 metadata column:
> >  > seqnames ranges strand | gene_id
> >  > <Rle> <IRanges> <Rle> | <CharacterList>
> >  > 67665 chr18 [ 60526221, 60558762] + | 67665
> >  > 13198 chr10 [127290793, 127296288] + | 13198
> >  > 110196 chr3 [ 89093588, 89101967] - | 110196
> >  > 15368 chr8 [ 75093618, 75100593] + | 15368
> >  > ---
> >  > seqlengths:
> >  > chr1 chr2 ... chrUn_JH584304
> >  > 195471971 182113224 ... 114452
> >  >
> >  > Note that, by default, genes() will exclude genes that don't have all
> >  > their transcripts on the same chromosome and strand (because in that
> >  > case the gene cannot be represented with a unique genomic range).
> >  > See ?genes for more information.
> >  >
> >  > Then use extractUpstreamSeqs() to extract the upstream sequences for
> >  > those genes:
> >  >
> >  > library(BSgenome.Mmusculus.UCSC.mm10)
> >  > genome <- BSgenome.Mmusculus.UCSC.mm10
> >  > up_seqs <- extractUpstreamSeqs(genome, gn, width=1500)
> >  >
> >  > Then:
> >  >
> >  > > gn_up_seqs
> >  > A DNAStringSet instance of length 4
> >  > width seq names
> >  >
> >  > [1] 1500 CTGCTGTAAAGTTACATTCCTGC...GGTGCGCCGGGCGTTGGCTCCTC 67665
> >  > [2] 1500 CAGCCCTAAAAGATGAAAGTCGC...GGATACTCGCTCTCCGCTCTTGA 13198
> >  > [3] 1500 CACGTCGGCCTGCCTATCAGGGA...TTATGTTCTGCCTCCTGAGCCTG 110196
> >  > [4] 1500 GTTAGTATTTAATATTTAAAGCT...AGGGCTGGCGCGGGCAGCTGCTC 15368
> >  >
> >  > For doing 2) we can use transcriptsBy() to extract all transcript ranges
> >  > grouped by gene:
> >  >
> >  > tx_by_gn <- transcriptsBy(txdb, by="gene")[ids.ok]
> >  >
> >  > Then we can obtain the unique TSS for each gene by first resizing each
> >  > list element in 'tx_by_gn' and then by passing the result thru unique():
> >  >
> >  > TSS_by_gn <- unique(endoapply(tx_by_gn, resize, width=1))
> >  >
> >  > (Note that we use endoapply() here because we don't have a "resize"
> >  > method for GRangesList objects. However we should definitely add
> >  > one because doing endoapply() is not efficient.)
> >  >
> >  > Then we can get the upstream sequence for each TSS with:
> >  >
> >  > TSS_up_seqs <- relist(extractUpstreamSeqs(
> >  > genome,
> >  > unlist(TSS_by_gn, use.names=FALSE),
> >  > width=1500),
> >  > TSS_by_gn)
> >  >
> >  > This time the upstream sequences are returned in a DNAStringSetList
> >  > object with 1 list element (DNAStringSet) per gene:
> >  >
> >  > > TSS_up_seqs
> >  > DNAStringSetList of length 4
> >  > [["67665"]]
> >  > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAATCTATATAAGT...
> >  > [["13198"]]
> >  > CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGCTTCCCGGGTAGTGTGC...
> >  > [["110196"]]
> >  > CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAACCATCTTAAT...
> >  > [["15368"]]
> >  > GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGTTGGAGTTGGTTTTTCTGG...
> >  >
> >  > The number of sequences in each list element is the number of unique
> >  > TSS in the gene. Of course, there is no benefit to this second approach
> >  > in *your* case because none of your genes has more than 1 unique TSS.
> >  > But some genes do have more than 1 unique TSS (e.g. 100017, 100019,
> >  > 100036521, and many more...).
> >  >
> >  > Hope this helps,
> >  > H.
> >  >
> >  > > Dips
> >  > >
> >  > >
> >  > >
> >  > >
> >  > > [[alternative HTML version deleted]]
> >  > >
> >  > > _______________________________________________
> >  > > Bioconductor mailing list
> >  > > Bioconductor at r-project.org
> >  > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> >  > > Search the archives:
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
> >  > >
> >  >
> >  > --
> >  > Hervé Pagès
> >  >
> >  > Program in Computational Biology
> >  > Division of Public Health Sciences
> >  > Fred Hutchinson Cancer Research Center
> >  > 1100 Fairview Ave. N, M1-B514
> >  > P.O. Box 19024
> >  > Seattle, WA 98109-1024
> >  >
> >  > E-mail: hpages at fhcrc.org
> >  > Phone: (206) 667-5791
> >  > Fax: (206) 667-1319
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
 		 	   		  
	[[alternative HTML version deleted]]