[BioC] All 3' UTRs for an organism - biomaRt package

Steffen sdurinck at lbl.gov
Tue Oct 2 02:11:32 CEST 2007


Dear Urska,

Your errors show that the getSequence function needed extra argument 
checking to help the user in understanding what is going wrong. This has 
been added to the new biomaRt version which will be included in the new 
BioConductor release later this week.
Firstly, getSequence always needs the type and seqType arguments.  The 
type argument specifies the type of identifiers that you use as input 
and also specifies the type of identifiers that will be present in the 
output so you know to which genes the retrieved sequences belong.
Secondly, getSequence  uses two ways to specify  the  sequences you are 
interested in:
a) you give a list of identifiers and getSequence will retrieve the 
sequences for the corresponding genes (here you'll use the id argument)
b) you give a chromosomal location and getSequence will retrieve the 
sequences of all genes that are located between the given positions. 
(here you'll use the chromosome, start and end arguments)

An additional problem that you encounter is  that the data volume you 
want to retrieve is so large (as you query for all sequences of all 
genes) that this possibly resulted  in a time-out.

Here's how I would retrieve the sequences:

First retrieve all ensembl gene identifiers in the hsapiens dataset by:

ensmart=useMart("ensembl", dataset="hsapiens_gene_ensembl")

ensemblid = getBM("ensembl_gene_id", mart=ensmart)

This should give 31189 unique ensembl gene identifiers.
To retrieve the 3UTRs for all these identifiers you could now retrieve 
them in batches of 5000 at once e.g.:

utr3Batch1 = getSequence(id = ensemblid[1:5000,1], 
type="ensembl_gene_id", seqType="3utr", mart=ensmart)
utr3Batch2 = getSequence(id = ensemblid[5001:10000,1], 
type="ensembl_gene_id", seqType="3utr", mart=ensmart)

etc.

Note that you'll have to do some cleaning as genes might have multiple 
3'UTRs associated with them for alternatively spliced transcripts and 
others might have no annotated 3'UTR.

Cheers,
Steffen





Urska Cvek wrote:
> I am using the latest version of biomaRt (1.10.1) package with R version
> 2.5.1, and would like to retrieve all 3' UTR sequences in the Ensembl 46,
> NCBI 36 (current versions) for the homo sapiens organism. I have used the
> Vignette/documentation and tried:
>
>   
>> ensmart=useMart("ensembl", dataset="hsapiens_gene_ensembl")
>>     
> Checking attributes and filters ... ok
>   
>> getSequence(type="ensembl", seqType="3utr", mart=ensmart)
>>     
> Error in getSequence(type = "ensembl", seqType = "3utr", mart = ensmart) :
>         Invalid type argument.  Use the listFilters function to select a
> valid type argument.
>
>  although both arguments to "type" and "seqType" seem to be correct, it
> looks like there is something wrong with the input.
>
> I also tried
>
>   
>> getSequence(type="entrezgene", seqType="3utr", mart=ensmart)
>>     
>
>   
>> utr3_1 = getSequence(chromosome=1, start=1, end=10000000,
>>     
> type="entrezgene", seqType="3utr", mart=ensmart)
>   
>> show(utr3_1)
>>     
> NULL
>
> When I try the advanced option for retrieval, the command takes a long time
> to return, and then returns the following:
>   
>> getBM(attributes=c("ensembl_gene_id","external_gene_id","3utr"),
>>     
> mart=ensmart)
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,
> :
>         line 21 did not have 3 elements
>
> I must be missing something... Any suggestions how I could get all 3' UTR
> sequences for the human organism using this package?
>
> Thank you in advance,
>
> U.C.
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>



More information about the Bioconductor mailing list