[BioC] subseq with iranges on reverse strand
hpages at fhcrc.org
Fri Aug 22 20:53:22 CEST 2014
Sorry for the late answer, I was on vacations. I hope you don't mind
that I'm cc'ing the Bioconductor mailing list.
On 08/08/2014 05:07 AM, carol white wrote:
> If I want to take from the end of nucleotide seq of a gene that is on
> the reverse strand a variable number of bp with iranges, does it not
> take from the end of seq for the number of bp that is fixed? because it
> seems that it takes differently that I can't find out how. So if I take
> the nt seq of the gene on the ncbi web site and extract for ex 10 or 20
> bp from the end, i don't get the same as I do with iranges. As I have
> already given the strand as the parameter to the iranges function, I
> assume that it has already reverse-complemented.
> How can I figure out how iranges has taken the subsequence?
A concrete example and showing the code would help understand exactly
what you are trying to do.
Note that subseq() is not strand-aware i.e. it doesn't let the user
specify the strand so the extracted sequence is never reverse-
complemented. To extract the nucleotide sequences corresponding to
a given set of genomic ranges, it's better to use the getSeq() function,
which is strand-aware.
getSeq() typically takes 2 arguments, a BSgenome object and a GRanges
object. The BSgenome object contains the full genome sequences of the
organism. The GRanges object contains the genomic ranges of your regions
of interest. These can be the regions corresponding to the entire
genes, or to their flanking sequences, or to some upstream sequences,
or to the last 10 or 20 nucleotides of each genes. Each genomic range
is defined by a sequence name (e.g. chr1), a start position, an end
position, and a strand.
Here is an example of how to extract the last 10 nucleotides of all
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
gn <- genes(txdb)
min(width(gn)) # check length of shortest gene
gn_last10 <- flank(gn, -10, start=FALSE)
genome <- BSgenome.Hsapiens.UCSC.hg19
gn_last10_seqs <- getSeq(genome, gn_last10)
A DNAStringSet instance of length 23056
width seq names
 10 GTGTCCTCAA 1
 10 AATATTGTGG 10
 10 GTGGCATGCA 100
 10 TTGGATCTGG 1000
 10 TCCAGGCTAC 10000
... ... ...
 10 GGTATTTTTA 9991
 10 AAATTTGAAG 9992
 10 CCGGAAGAGG 9993
 10 TTTTGTTGTA 9994
 10 CTGCGTTTAA 9997
The names on this DNAStringSet object are the Entrez Gene ids.
If you want to use NCBI genes instead of UCSC genes, you could make
your own TxDb object by pointing the makeTranscriptDbFromUCSC() function
to the RefSeq Genes track (refGene table):
txdb <- makeTranscriptDbFromUCSC("hg19", "refGene")
This requires an internet connection and will take a few minutes to
The short sequences you will obtain should match the ones you get
by querying directly the NCBI web site. Let me know if they don't.
Finally note that in the above example, we represented each gene with
by a single genomic range, that is, we ignored splicing (and alternate
splicing). If you wanted to take them into consideration, you would
need to work at the transcript level because "the last 10 or 20
nucleotides of a gene" is not necessarily well defined when intron
sequences need to be skipped.
Hope this helps,
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor