[BioC] retrieving mRNA sequences via biomaRt

Thu Aug 6 19:33:37 CEST 2009

Hi Simon,

The cdna attribute is the concatenation of 5utr + coding + 3utr, so you
can drop 5utr, coding and 3utr from the list of attributes you retrieve.
I would use ensembl_transcript_id instead of embl as the identifier.
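
A minimal sketch of the trimmed attribute list and of the manual length
filtering discussed in this thread, in base R. The data frame is a made-up
stand-in for a getBM() result (transcript IDs and lengths are invented for
illustration), and filterByLength is a hypothetical helper, not a biomaRt
function. The real query still needs the filters and values that are left
out in the quoted message.

```r
# Attribute list from the quoted message, with 5utr/coding/3utr dropped and
# embl replaced by ensembl_transcript_id. It would be passed to getBM() as
# attributes = myAttributes (assumed to be retrievable in one query, as in
# the quoted message).
myAttributes <- c("ensembl_transcript_id", "cdna", "5_utr_end",
                  "3_utr_start", "sequence_cdna_length", "cds_length")

# Invented stand-in for a query result; a real one comes from getBM().
qresult <- data.frame(
  ensembl_transcript_id = c("T_A", "T_B", "T_C"),
  sequence_cdna_length  = c(3500, 4200, 6000),
  cds_length            = c(2500, 1500, 2800),
  stringsAsFactors      = FALSE
)

# Hypothetical helper: keep rows whose cDNA length and CDS length both lie
# inside the given closed ranges; rows with missing lengths are dropped.
filterByLength <- function(x, cdnaRange, cdsRange) {
  keep <- !is.na(x$sequence_cdna_length) & !is.na(x$cds_length) &
    x$sequence_cdna_length >= cdnaRange[1] &
    x$sequence_cdna_length <= cdnaRange[2] &
    x$cds_length >= cdsRange[1] &
    x$cds_length <= cdsRange[2]
  x[keep, , drop = FALSE]
}

# Same limits as in the quoted message: cDNA 3000-5000 nt, CDS 2000-3000 nt.
finalResult <- filterByLength(qresult, c(3000, 5000), c(2000, 3000))
print(finalResult)
```

Only the invented transcript T_A passes both limits here. Filtering on the
client side like this is a workaround for the length attributes not being
available as filters.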

Cheers,
Steffen

> Thanks for the recommendation.
>
> So far, I just read Steffen's and your biomaRt user’s guide and had a
> look at the BioMart 0.7 Documentation, since I needed quick results.
> I'm going to have a look at the recommended book and paper, now.
>
>
> In the meantime, I arrived at a solution, though not a very satisfying one:
>
> library(biomaRt)
> ensembl = useMart("ensembl")
> ensembl = useDataset("hsapiens_gene_ensembl", mart=ensembl)
>
> myAttributes = c("embl", "cdna", "5utr", "coding", "3utr", "5_utr_end",
> "3_utr_start", "sequence_cdna_length","cds_length")
>
> ...
>
> qresult = getBM(attributes=myAttributes,
>                   filters=...,
>                   values=...,
>                   mart=ensembl)
>
> finalResult = mySeqCdsLengthFilter(qresult, c(3000, 5000), c(2000, 3000))
>
> For now, I filter my query results manually, using
> the values of "sequence_cdna_length" and "cds_length" as limits.
> I wish these attributes were available as filters, or that there were
> a BioMart database I could use in a linked query via getLDS.
>
> I'm still curious about a smarter solution.
>
>
> Best regards,
> Simon
>
>
> Wolfgang Huber wrote:
>>
>> Hi Simon,
>>
>> with all respect, for a first contact with the Bioconductor project I'd
>> also recommend studying some of the documentation.
>>
>> A (slightly biased) set of starting points would be the "Bioconductor
>> Case Studies" book by Hahne, Huber, Gentleman, Falcon and the paper
>> "Mapping identifiers for the integration of genomic datasets with the
>> R/Bioconductor package biomaRt." by Durinck et al. in Nature Protocols
>> 2009;4(8):1184-91.
>>
>>     Best wishes
>>     Wolfgang
>>
>>
>>
>>
>> Simon wrote:
>>> Hello everybody,
>>>
>>> I am trying to solve the following tasks as a first contact with the
>>> Bioconductor project:
>>>
>>> # Task 1:
>>> # find:
>>> #   * mRNA sequence (5'UTR, Coding region, 3'UTR)
>>> #   * position of start codon in sequence
>>> #   * position of stop codon in sequence
>>> #   * ID (Which ID(s) would I choose to reference my
>>> #     sequence hits? Embl, ensembl transcript id,
>>> #     Entrez Gene id, RefSeq, etc.?)
>>> #   * name of associated protein product
>>> #
>>> #  where:
>>> #   * origin is human
>>> #     Entrez Search would be: human[ORGN]
>>> #   * sequence is mRNA transcript
>>> #     Entrez Search for Molecule Type: biomol_mRNA[PROP]?
>>> #   * mRNA sequence length is 3000 to 5000 nts
>>> #     * Entrez Search for Sequence Length: 3000:5000[SLEN]
>>> #   * coding region of mRNA length is 2000 to 3000 nts
>>> #     * Entrez Search Field for stop and start of
>>> #       coding region: start:stop[CDS]
>>> #
>>> #
>>> # Task 2:
>>> # store the retrieved information to file for the first 200 hits
>>> # (Which would be a suitable file format?)
>>>
>>> I started by using and playing around with the biomaRt package for R,
>>> but I got overwhelmed by its many possibilities.
>>>
>>> I would be glad to get any feedback on how to start on, or even solve,
>>> my tasks.
>>>
>>> Best regards,
>>> Simon
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>