[BioC] DNAStringsSet - remove multiple entries with same name
deepti anand
anand.deepti at outlook.com
Mon Sep 8 00:20:06 CEST 2014
Thanks Herve,
It worked.
-Dips-
> Date: Thu, 4 Sep 2014 12:15:43 -0700
> From: hpages at fhcrc.org
> To: anand.deepti at outlook.com
> CC: Bioconductor at r-project.org
> Subject: Re: [BioC] DNAStringsSet - remove multiple entries with same name
>
> Hi Dips,
>
> I'm cc'ing the mailing list so others can benefit.
>
> On 09/04/2014 11:34 AM, deepti anand wrote:
> > Hi H,
> >
> > Thank you for the code. I run the code but the extractUpstremSeqs() does
> > not seem to work. Here are the codes:
> >
> >> ids.ok
> > [1] "67665" "13198" "110196" "15368"
> > > txdb<- TxDb.Mmusculus.UCSC.mm10.knownGene
> > > gn<- genes(txdb)[ids.ok]
> > > gn
> > GRanges with 4 ranges and 1 metadata column:
> > seqnames ranges strand | gene_id
> > <Rle> <IRanges> <Rle> | <CharacterList>
> > 67665 chr18 [ 60526221, 60558762] + | 67665
> > 13198 chr10 [127290793, 127296288] + | 13198
> > 110196 chr3 [ 89093588, 89101967] - | 110196
> > 15368 chr8 [ 75093618, 75100593] + | 15368
> > ---
> > seqlengths:
> > chr1 chr2 chr3 ...
> > chrUn_GL456394 chrUn_GL456396 chrUn_JH584304
> > 195471971 182113224 160039680 ...
> > 24323 21240 114452
> > > genome <- BSgenome.Mmusculus.UCSC.mm10
> > > up_seqs <- extractUpstreamSeqs(genome, gn, width=1500)
> > Error: could not find function "extractUpstreamSeqs"
> >
> > Could you suggest how should I proceed further?
>
> Oops, extractUpstreamSeqs() is only available in the current devel
> version of Bioconductor (BioC 3.0, which will be released in October).
> Sorry. With the current release version (BioC 2.14), you can obtain
> 'gn_up_seqs' with:
>
> gn_up_seqs <- getPromoterSeq(gn, genome, upstream=1500, downstream=0)
>
> And 'TSS_up_seqs' with:
>
> TSS_up_seqs <- relist(getPromoterSeq(
> unlist(TSS_by_gn, use.names=FALSE),
> genome,
> upstream=1500, downstream=0),
> TSS_by_gn)
>
> Please let me know if you run into other issues.
>
> Thanks,
> H.
>
> PS: Yesterday I added a "resize" method for GRangesList objects in
> BioC devel. It's in GenomicRanges 1.17.37 (which will become available
> via biocLite() in the next 24 hours or so).
>
> >
> > Best,
> >
> > -Dips-
> >
> >
> > > Date: Wed, 3 Sep 2014 12:14:49 -0700
> > > From: hpages at fhcrc.org
> > > To: anand.deepti at outlook.com; bioconductor at r-project.org
> > > Subject: Re: [BioC] DNAStringsSet - remove multiple entries with same
> > name
> > >
> > > Hi Dips,
> > >
> > > On 09/03/2014 10:30 AM, deepti anand wrote:
> > > > Hi All,
> > > >
> > > > I am trying to extract promoter sequences for a few ENTREZ IDS. The
> > problem I am having is that there exists multiple transcripts for same
> > gene. So this gives me multiple promoter sequences for same gene. Can I
> > filter out the redundant promoter sequences?
> > > >
> > > > Here is my code:
> > > > ids.ok = c("67665" ,"13198" ,"110196","15368")## Obtain coordinates
> > of transcript #####>grl <-
> > transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by="gene")
> > [ids.ok]>promoter.seqs <- getPromoterSeq(grl,Mmusculus,
> > upstream=1500,downstream=0)>promoter.seqs<- unlist(promoter.seqs)>
> > promoter.seqs A DNAStringSet instance of length 8 width seq names [1]
> > 1500
> > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC
> > 67665.67665[2] 1500
> > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC
> > 67665.67665[3] 1500
> > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC
> > 67665.67665[4] 1500
> > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTC
> > > C!
> > > > TC 67665.67665[5] 1500
> > CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGCTTCCC...CCCCCCCCCAGGAGGGGCCGGACAGCATAAAGGATACTCGCTCTCCGCTCTTGA
> > 13198.13198[6] 1500
> > CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAA...CCGTGGCATGCCGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG
> > 110196.110196[7] 1500
> > CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAA...CCGTGGCATGCCGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG
> > 110196.110196[8] 1500
> > GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGTTGGAGTT...TGGGCGGCCACCACGTGACCCGCGTACTTAAAGGGCTGGCGCGGGCAGCTGCTC
> > 15368.15368
> > > >
> > > > In example above, there are four sequences for same gene
> > '67665.67665'. How can I remove these entries?
> > > > I would appreciate any help
> > >
> > > In this particular example, the TSS (transcription starting site) is
> > > the same for all the transcripts in any of your genes. This explains
> > > why you get the same promoter sequence. However, this won't always be
> > > the case. So in order to handle the general situation, we need to
> > > choose between 2 behaviors:
> > >
> > > 1) Return 1 sequence per gene (even for genes with more than 1 unique
> > > TSS).
> > >
> > > 2) Return 1 sequence per unique TSS per gene.
> > >
> > > To do 1) we need to choose a particular TSS for each gene. A common
> > > choice is to pick-up the most upstream TSS. In this case, genes()
> > > can be used to extract the gene ranges. genes() will return a GRanges
> > > object with 1 genomic range per gene. This genomic range is the union
> > > of all the transcript ranges in the gene.
> > >
> > > library(TxDb.Mmusculus.UCSC.mm10.knownGene)
> > > txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
> > > ids.ok <- c("67665", "13198", "110196", "15368")
> > > gn <- genes(txdb)[ids.ok]
> > >
> > > Then:
> > >
> > > > gn
> > > GRanges with 4 ranges and 1 metadata column:
> > > seqnames ranges strand | gene_id
> > > <Rle> <IRanges> <Rle> | <CharacterList>
> > > 67665 chr18 [ 60526221, 60558762] + | 67665
> > > 13198 chr10 [127290793, 127296288] + | 13198
> > > 110196 chr3 [ 89093588, 89101967] - | 110196
> > > 15368 chr8 [ 75093618, 75100593] + | 15368
> > > ---
> > > seqlengths:
> > > chr1 chr2 ... chrUn_JH584304
> > > 195471971 182113224 ... 114452
> > >
> > > Note that, by default, genes() will exclude genes that don't have all
> > > their transcripts on the same chromosome and strand (because in that
> > > case the gene cannot be represented with a unique genomic range).
> > > See ?genes for more information.
> > >
> > > Then use extractUpstreamSeqs() to extract the upstream sequences for
> > > those genes:
> > >
> > > library(BSgenome.Mmusculus.UCSC.mm10)
> > > genome <- BSgenome.Mmusculus.UCSC.mm10
> > > up_seqs <- extractUpstreamSeqs(genome, gn, width=1500)
> > >
> > > Then:
> > >
> > > > gn_up_seqs
> > > A DNAStringSet instance of length 4
> > > width seq names
> > >
> > > [1] 1500 CTGCTGTAAAGTTACATTCCTGC...GGTGCGCCGGGCGTTGGCTCCTC 67665
> > > [2] 1500 CAGCCCTAAAAGATGAAAGTCGC...GGATACTCGCTCTCCGCTCTTGA 13198
> > > [3] 1500 CACGTCGGCCTGCCTATCAGGGA...TTATGTTCTGCCTCCTGAGCCTG 110196
> > > [4] 1500 GTTAGTATTTAATATTTAAAGCT...AGGGCTGGCGCGGGCAGCTGCTC 15368
> > >
> > > For doing 2) we can use transcriptsBy() to extract all transcript ranges
> > > grouped by gene:
> > >
> > > tx_by_gn <- transcriptsBy(txdb, by="gene")[ids.ok]
> > >
> > > Then we can obtain the unique TSS for each gene by first resizing each
> > > list element in 'tx_by_gn' and then by passing the result thru unique():
> > >
> > > TSS_by_gn <- unique(endoapply(tx_by_gn, resize, width=1))
> > >
> > > (Note that we use endoapply() here because we don't have a "resize"
> > > method for GRangesList objects. However we should definitely add
> > > one because doing endoapply() is not efficient.)
> > >
> > > Then we can get the upstream sequence for each TSS with:
> > >
> > > TSS_up_seqs <- relist(extractUpstreamSeqs(
> > > genome,
> > > unlist(TSS_by_gn, use.names=FALSE),
> > > width=1500),
> > > TSS_by_gn)
> > >
> > > This time the upstream sequences are returned in a DNAStringSetList
> > > object with 1 list element (DNAStringSet) per gene:
> > >
> > > > TSS_up_seqs
> > > DNAStringSetList of length 4
> > > [["67665"]]
> > > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAATCTATATAAGT...
> > > [["13198"]]
> > > CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGCTTCCCGGGTAGTGTGC...
> > > [["110196"]]
> > > CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAACCATCTTAAT...
> > > [["15368"]]
> > > GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGTTGGAGTTGGTTTTTCTGG...
> > >
> > > The number of sequences in each list element is the number of unique
> > > TSS in the gene. Of course, there is no benefit to this second approach
> > > in *your* case because none of your genes has more than 1 unique TSS.
> > > But some genes do have more than 1 unique TSS (e.g. 100017, 100019,
> > > 100036521, and many more...).
> > >
> > > Hope this helps,
> > > H.
> > >
> > > > Dips
> > > >
> > > >
> > > >
> > > >
> > > > [[alternative HTML version deleted]]
> > > >
> > > > _______________________________________________
> > > > Bioconductor mailing list
> > > > Bioconductor at r-project.org
> > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > > Search the archives:
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
> > > >
> > >
> > > --
> > > Hervé Pagès
> > >
> > > Program in Computational Biology
> > > Division of Public Health Sciences
> > > Fred Hutchinson Cancer Research Center
> > > 1100 Fairview Ave. N, M1-B514
> > > P.O. Box 19024
> > > Seattle, WA 98109-1024
> > >
> > > E-mail: hpages at fhcrc.org
> > > Phone: (206) 667-5791
> > > Fax: (206) 667-1319
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
[[alternative HTML version deleted]]
More information about the Bioconductor
mailing list