[BioC] DNAStringsSet - remove multiple entries with same name

Thu Sep 4 21:15:43 CEST 2014

Hi Dips,

I'm cc'ing the mailing list so others can benefit.

On 09/04/2014 11:34 AM, deepti anand wrote:
> Hi H,
>
> Thank you for the code. I run the code but the extractUpstremSeqs() does
> not seem to work. Here are the codes:
>
>> ids.ok
> [1] "67665"  "13198"  "110196" "15368"
>  > txdb<- TxDb.Mmusculus.UCSC.mm10.knownGene
>  > gn<- genes(txdb)[ids.ok]
>  > gn
> GRanges with 4 ranges and 1 metadata column:
>           seqnames                 ranges strand |         gene_id
>              <Rle>              <IRanges>  <Rle> | <CharacterList>
>     67665    chr18 [ 60526221,  60558762]      + |           67665
>     13198    chr10 [127290793, 127296288]      + |           13198
>    110196     chr3 [ 89093588,  89101967]      - |          110196
>     15368     chr8 [ 75093618,  75100593]      + |           15368
>    ---
>    seqlengths:
>                     chr1                 chr2                 chr3 ...
>      chrUn_GL456394       chrUn_GL456396       chrUn_JH584304
>                195471971            182113224            160039680 ...
>               24323                21240               114452
>  > genome <- BSgenome.Mmusculus.UCSC.mm10
>  > up_seqs <- extractUpstreamSeqs(genome, gn, width=1500)
> Error: could not find function "extractUpstreamSeqs"
>
> Could you suggest how should I proceed further?

Oops, extractUpstreamSeqs() is only available in the current devel
version of Bioconductor (BioC 3.0, which will be released in October).
Sorry. With the current release version (BioC 2.14), you can obtain
'gn_up_seqs' with:

   gn_up_seqs  <- getPromoterSeq(gn, genome, upstream=1500, downstream=0)

And 'TSS_up_seqs' with:

   TSS_up_seqs <- relist(getPromoterSeq(
                             unlist(TSS_by_gn, use.names=FALSE),
                             genome,
                             upstream=1500, downstream=0),
                         TSS_by_gn)

Please let me know if you run into other issues.

Thanks,
H.

PS: Yesterday I added a "resize" method for GRangesList objects in
BioC devel. It's in GenomicRanges 1.17.37 (which will become available
via biocLite() in the next 24 hours or so).

>
> Best,
>
> -Dips-
>
>
>  > Date: Wed, 3 Sep 2014 12:14:49 -0700
>  > From: hpages at fhcrc.org
>  > To: anand.deepti at outlook.com; bioconductor at r-project.org
>  > Subject: Re: [BioC] DNAStringsSet - remove multiple entries with same
> name
>  >
>  > Hi Dips,
>  >
>  > On 09/03/2014 10:30 AM, deepti anand wrote:
>  > > Hi All,
>  > >
>  > > I am trying to extract promoter sequences for a few ENTREZ IDS. The
> problem I am having is that there exists multiple transcripts for same
> gene. So this gives me multiple promoter sequences for same gene. Can I
> filter out the redundant promoter sequences?
>  > >
>  > > Here is my code:
>  > > ids.ok = c("67665" ,"13198" ,"110196","15368")## Obtain coordinates
> of transcript #####>grl <-
> transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by="gene")
> [ids.ok]>promoter.seqs <- getPromoterSeq(grl,Mmusculus,
> upstream=1500,downstream=0)>promoter.seqs<- unlist(promoter.seqs)>
> promoter.seqs A DNAStringSet instance of length 8 width seq names [1]
> 1500
> CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC
> 67665.67665[2] 1500
> CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC
> 67665.67665[3] 1500
> CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTCCTC
> 67665.67665[4] 1500
> CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAA...GGAGGGAAGCGCCGGGCTGTGTCACGTGACGGGTGCGCCGGGCGTTGGCTC
>  > C!
>  > > TC 67665.67665[5] 1500
> CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGCTTCCC...CCCCCCCCCAGGAGGGGCCGGACAGCATAAAGGATACTCGCTCTCCGCTCTTGA
> 13198.13198[6] 1500
> CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAA...CCGTGGCATGCCGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG
> 110196.110196[7] 1500
> CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAA...CCGTGGCATGCCGGGAGTCGTAGTTTTATATTTATGTTCTGCCTCCTGAGCCTG
> 110196.110196[8] 1500
> GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGTTGGAGTT...TGGGCGGCCACCACGTGACCCGCGTACTTAAAGGGCTGGCGCGGGCAGCTGCTC
> 15368.15368
>  > >
>  > > In example above, there are four sequences for same gene
> '67665.67665'. How can I remove these entries?
>  > > I would appreciate any help
>  >
>  > In this particular example, the TSS (transcription starting site) is
>  > the same for all the transcripts in any of your genes. This explains
>  > why you get the same promoter sequence. However, this won't always be
>  > the case. So in order to handle the general situation, we need to
>  > choose between 2 behaviors:
>  >
>  > 1) Return 1 sequence per gene (even for genes with more than 1 unique
>  > TSS).
>  >
>  > 2) Return 1 sequence per unique TSS per gene.
>  >
>  > To do 1) we need to choose a particular TSS for each gene. A common
>  > choice is to pick-up the most upstream TSS. In this case, genes()
>  > can be used to extract the gene ranges. genes() will return a GRanges
>  > object with 1 genomic range per gene. This genomic range is the union
>  > of all the transcript ranges in the gene.
>  >
>  > library(TxDb.Mmusculus.UCSC.mm10.knownGene)
>  > txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
>  > ids.ok <- c("67665", "13198", "110196", "15368")
>  > gn <- genes(txdb)[ids.ok]
>  >
>  > Then:
>  >
>  > > gn
>  > GRanges with 4 ranges and 1 metadata column:
>  > seqnames ranges strand | gene_id
>  > <Rle> <IRanges> <Rle> | <CharacterList>
>  > 67665 chr18 [ 60526221, 60558762] + | 67665
>  > 13198 chr10 [127290793, 127296288] + | 13198
>  > 110196 chr3 [ 89093588, 89101967] - | 110196
>  > 15368 chr8 [ 75093618, 75100593] + | 15368
>  > ---
>  > seqlengths:
>  > chr1 chr2 ... chrUn_JH584304
>  > 195471971 182113224 ... 114452
>  >
>  > Note that, by default, genes() will exclude genes that don't have all
>  > their transcripts on the same chromosome and strand (because in that
>  > case the gene cannot be represented with a unique genomic range).
>  > See ?genes for more information.
>  >
>  > Then use extractUpstreamSeqs() to extract the upstream sequences for
>  > those genes:
>  >
>  > library(BSgenome.Mmusculus.UCSC.mm10)
>  > genome <- BSgenome.Mmusculus.UCSC.mm10
>  > up_seqs <- extractUpstreamSeqs(genome, gn, width=1500)
>  >
>  > Then:
>  >
>  > > gn_up_seqs
>  > A DNAStringSet instance of length 4
>  > width seq names
>  >
>  > [1] 1500 CTGCTGTAAAGTTACATTCCTGC...GGTGCGCCGGGCGTTGGCTCCTC 67665
>  > [2] 1500 CAGCCCTAAAAGATGAAAGTCGC...GGATACTCGCTCTCCGCTCTTGA 13198
>  > [3] 1500 CACGTCGGCCTGCCTATCAGGGA...TTATGTTCTGCCTCCTGAGCCTG 110196
>  > [4] 1500 GTTAGTATTTAATATTTAAAGCT...AGGGCTGGCGCGGGCAGCTGCTC 15368
>  >
>  > For doing 2) we can use transcriptsBy() to extract all transcript ranges
>  > grouped by gene:
>  >
>  > tx_by_gn <- transcriptsBy(txdb, by="gene")[ids.ok]
>  >
>  > Then we can obtain the unique TSS for each gene by first resizing each
>  > list element in 'tx_by_gn' and then by passing the result thru unique():
>  >
>  > TSS_by_gn <- unique(endoapply(tx_by_gn, resize, width=1))
>  >
>  > (Note that we use endoapply() here because we don't have a "resize"
>  > method for GRangesList objects. However we should definitely add
>  > one because doing endoapply() is not efficient.)
>  >
>  > Then we can get the upstream sequence for each TSS with:
>  >
>  > TSS_up_seqs <- relist(extractUpstreamSeqs(
>  > genome,
>  > unlist(TSS_by_gn, use.names=FALSE),
>  > width=1500),
>  > TSS_by_gn)
>  >
>  > This time the upstream sequences are returned in a DNAStringSetList
>  > object with 1 list element (DNAStringSet) per gene:
>  >
>  > > TSS_up_seqs
>  > DNAStringSetList of length 4
>  > [["67665"]]
>  > CTGCTGTAAAGTTACATTCCTGCCTAGAAATTTATATCGATTCTGCCGTCAGAATCTATATAAGT...
>  > [["13198"]]
>  > CAGCCCTAAAAGATGAAAGTCGCGACTTGCCCTGCCCCGCCCCAAAGGCTTCCCGGGTAGTGTGC...
>  > [["110196"]]
>  > CACGTCGGCCTGCCTATCAGGGAGTCTACTGCCTTTTCCCTCAGTATGAGATAACCATCTTAAT...
>  > [["15368"]]
>  > GTTAGTATTTAATATTTAAAGCTTGCTTCTAACTTGGCCCAAAATGTTGGAGTTGGTTTTTCTGG...
>  >
>  > The number of sequences in each list element is the number of unique
>  > TSS in the gene. Of course, there is no benefit to this second approach
>  > in *your* case because none of your genes has more than 1 unique TSS.
>  > But some genes do have more than 1 unique TSS (e.g. 100017, 100019,
>  > 100036521, and many more...).
>  >
>  > Hope this helps,
>  > H.
>  >
>  > > Dips
>  > >
>  > >
>  > >
>  > >
>  > > [[alternative HTML version deleted]]
>  > >
>  > > _______________________________________________
>  > > Bioconductor mailing list
>  > > Bioconductor at r-project.org
>  > > https://stat.ethz.ch/mailman/listinfo/bioconductor
>  > > Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>  > >
>  >
>  > --
>  > Hervé Pagès
>  >
>  > Program in Computational Biology
>  > Division of Public Health Sciences
>  > Fred Hutchinson Cancer Research Center
>  > 1100 Fairview Ave. N, M1-B514
>  > P.O. Box 19024
>  > Seattle, WA 98109-1024
>  >
>  > E-mail: hpages at fhcrc.org
>  > Phone: (206) 667-5791
>  > Fax: (206) 667-1319

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319