[BioC] subseq with iranges on reverse strand

Hervé Pagès hpages at fhcrc.org
Fri Aug 22 20:53:22 CEST 2014


Hi Carol,

Sorry for the late answer, I was on vacations. I hope you don't mind
that I'm cc'ing the Bioconductor mailing list.

On 08/08/2014 05:07 AM, carol white wrote:
> Hi,
> If I want to take from the end of nucleotide seq of a gene that is on
> the reverse strand a variable number of bp with iranges, does it not
> take from the end of seq for the number of bp that is fixed? because it
> seems that it takes differently that I can't find out how. So if I take
> the nt seq of the gene on the ncbi web site and extract for ex 10 or 20
> bp from the end, i don't get the same as I do with iranges. As I have
> already given the strand as the parameter to the iranges function, I
> assume that it has already reverse-complemented.
>
> How can I figure out how iranges has taken the subsequence?

A concrete example and showing the code would help understand exactly
what you are trying to do.

Note that subseq() is not strand-aware i.e. it doesn't let the user
specify the strand so the extracted sequence is never reverse-
complemented. To extract the nucleotide sequences corresponding to
a given set of genomic ranges, it's better to use the getSeq() function,
which is strand-aware.

getSeq() typically takes 2 arguments, a BSgenome object and a GRanges
object. The BSgenome object contains the full genome sequences of the
organism. The GRanges object contains the genomic ranges of your regions
of interest. These can be the regions corresponding to the entire
genes, or to their flanking sequences, or to some upstream sequences,
or to the last 10 or 20 nucleotides of each genes. Each genomic range
is defined by a sequence name (e.g. chr1), a start position, an end
position, and a strand.

Here is an example of how to extract the last 10 nucleotides of all
Human genes:

   library(TxDb.Hsapiens.UCSC.hg19.knownGene)
   txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
   gn <- genes(txdb)
   min(width(gn))  # check length of shortest gene
   gn_last10 <- flank(gn, -10, start=FALSE)

   library(BSgenome.Hsapiens.UCSC.hg19)
   genome <- BSgenome.Hsapiens.UCSC.hg19
   gn_last10_seqs <- getSeq(genome, gn_last10)

Then:

   > gn_last10_seqs
     A DNAStringSet instance of length 23056
           width seq                                           names 

       [1]    10 GTGTCCTCAA                                    1
       [2]    10 AATATTGTGG                                    10
       [3]    10 GTGGCATGCA                                    100
       [4]    10 TTGGATCTGG                                    1000
       [5]    10 TCCAGGCTAC                                    10000
       ...   ... ...
   [23052]    10 GGTATTTTTA                                    9991
   [23053]    10 AAATTTGAAG                                    9992
   [23054]    10 CCGGAAGAGG                                    9993
   [23055]    10 TTTTGTTGTA                                    9994
   [23056]    10 CTGCGTTTAA                                    9997

The names on this DNAStringSet object are the Entrez Gene ids.

If you want to use NCBI genes instead of UCSC genes, you could make
your own TxDb object by pointing the makeTranscriptDbFromUCSC() function
to the RefSeq Genes track (refGene table):

   txdb <- makeTranscriptDbFromUCSC("hg19", "refGene")

This requires an internet connection and will take a few minutes to
complete.

The short sequences you will obtain should match the ones you get
by querying directly the NCBI web site. Let me know if they don't.

Finally note that in the above example, we represented each gene with
by a single genomic range, that is, we ignored splicing (and alternate
splicing). If you wanted to take them into consideration, you would
need to work at the transcript level because "the last 10 or 20
nucleotides of a gene" is not necessarily well defined when intron
sequences need to be skipped.

Hope this helps,
H.

>
> Thanks
>
> Carol
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list