[BioC] BiomaRt & retrieving flanking regions

Hervé Pagès hpages at fhcrc.org
Sun Dec 29 20:53:54 CET 2013


Hi Greg,

Not sure exactly why you're getting this error. FWIW this can
also be done with the GenomicFeatures and BSgenome packages:

   library(GenomicFeatures)
   txdb <- makeTranscriptDbFromBiomart("ensembl", "hsapiens_gene_ensembl")
   gn <- genes(txdb)

Then:

  > gn["ENSG00000165702"]
   GRanges with 1 range and 1 metadata column:
                     seqnames                 ranges strand | 
gene_id
                        <Rle>              <IRanges>  <Rle> | 
<CharacterList>
     ENSG00000165702        9 [135820932, 135867083]      + | 
ENSG00000165702
     ---
     seqlengths:
                      1                 2 ...            LRG_98 
    LRG_99
              249250621         243199373 ...             18750 
     13294

Obtain the ranges of the flanking regions with flank(). By default
flank() returns the upstream flanks:

   > gnflanks <- flank(gn, width=100)

   > gnflanks["ENSG00000165702"]
   GRanges with 1 range and 1 metadata column:
                     seqnames                 ranges strand | 
gene_id
                        <Rle>              <IRanges>  <Rle> | 
<CharacterList>
     ENSG00000165702        9 [135820832, 135820931]      + | 
ENSG00000165702
     ---
     seqlengths:
                      1                 2 ...            LRG_98 
    LRG_99
              249250621         243199373 ...             18750 
     13294

To extract the corresponding sequence, we can use the BSgenome
package for hg19. This assembly is compatible with the GRCh37.p12
assembly for all the main chromosomes *except* for chrM. Also we'll
need to adjust the names of the chromosomes because Ensembl/NCBI
and UCSC use different naming conventions:

   > library(BSgenome.Hsapiens.UCSC.hg19)
   > head(seqlevels(Hsapiens))
   [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"

Keep only flank regions for genes located on chromosomes 1 to 22, X
and Y, and rename the chromosomes:

   seqlevels(gnflanks, force=TRUE) <- c(1:22, "X", "Y")
   seqlevels(gnflanks) <- paste0("chr", seqlevels(gnflanks))

Finally, use getSeq() to extract the DNA sequence:

   > getSeq(Hsapiens, gnflanks["ENSG00000165702"])
     A DNAStringSet instance of length 1
       width seq
   [1]   100 
TAATGACATCTGTGACGTGAAGAATCAAGGAGAT...GCCTAGTAAATGCTCACTCACTAGATAGGTGGC

Note that getSeq() can extract more than 1 sequence at once:

   > my_genes <- c("ENSG00000165702", "ENSG00000252380", "ENSG00000200999")

   > getSeq(Hsapiens, gnflanks[my_genes])
     A DNAStringSet instance of length 3
       width seq
   [1]   100 
TAATGACATCTGTGACGTGAAGAATCAAGGAGAT...GCCTAGTAAATGCTCACTCACTAGATAGGTGGC
   [2]   100 
AGTCAAGGTGCTAGTTTGTGATGGAGCGGCAGGC...TCAACTAAATAAAACTAGCAAACTAGCTACAAA
   [3]   100 
CTTACATGCAAGAAAACTTCTAAGAAAACTTATA...AGAAAAAGCAGAATGAGCATCAGTAAATATTTA

Cheers,
H.


On 12/28/2013 04:43 AM, Greg Slodkowicz wrote:
> I should've probably mentioned the exact error I'm getting:
>
> Error in getBM(attributes = attrs, filters = list(ensembl_gene_id =
> "ENSG00000165702",  :
>    Query ERROR: caught BioMart::Exception: non-BioMart die(): Can't use an
> undefined value as an ARRAY reference at
> /ensemblweb/wwwmart/www_74/biomart-perl/lib/BioMart/Dataset/GenomicSequence.pm
> line 396.
>
> G.
>
>
> On Sat, Dec 28, 2013 at 1:42 PM, Greg Slodkowicz <gregs at ebi.ac.uk> wrote:
>
>> Dear all,
>> I'm having some trouble fetching flanking regions for a gene from the
>> Ensembl Biomart:
>>
>> library(biomaRt)
>> ensembl <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
>> dataset="hsapiens_gene_ensembl",
>>        host="www.ensembl.org", path="/biomart/martservice")
>>
>> attrs <- c("ensembl_gene_id", "upstream_flank")
>> gene_ann <- getBM(attributes=attrs,
>> filters=list(ensembl_gene_id="ENSG00000165702", upstream_flank=100),
>> mart=ensembl, checkFilters=F)
>> gene_ann
>>
>> (here's a Gist: https://gist.github.com/jergosh/8159092)
>>
>> I also tried passing the parameters differently but it doesn't make any
>> difference:
>>
>> gene_ann <- getBM(attributes=attrs, filters=c("ensembl_gene_id",
>> "upstream_flank"), values=list(ensembl_gene_id="ENSG00000165702",
>> upstream_flank=100), mart=ensembl, checkFilters=F)
>>
>> Am I doing something wrong?
>>
>> Best,
>> Greg
>>
>> --
>> Greg Slodkowicz
>> PhD student, Nick Goldman group
>> European Bioinformatics Institute (EMBL-EBI)
>>
>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list