[BioC] Solexa fastq files and BLAT

Tue Jan 13 14:57:49 CET 2009

Hi Daniel -- 

"Sean Davis" <sdavis2 at mail.nih.gov> writes:

> On Tue, Jan 13, 2009 at 6:35 AM, Daniel Brewer <daniel.brewer at icr.ac.uk> wrote:
>
>> Hi,
>>
>> I have got hold of some solexa results in fastq format from some cancer
>> mRNA samples and I would like to analyse these to look at a number of
>> things.  I would like to be able to list the sequences that occur most
>> often and then BLAT them against the human genome to see whether that

'readFastq' and the 'top' component of the return value of 'tables' in
ShortRead would allow you to read in the fastq file(s) and tabulate
the most common sequences. If these are relatively raw reads, you'll
likely be disappointed to find that the most common are Solexa adapter
sequences and other artifacts. Also, if the sample prep involved a PCR
step then you'll likely see PCR artifacts (e.g., differential
amplification). If you have access to the _export.txt files then 'qa'
and 'report' can provide a useful overview of your data and its
limitations; the relatively high-level code used in generating the
report, visible in the file at

system.file("template", "qa_solexa.Rnw", package="ShortRead")

might be suggestive of ways to explore your data (view the file as
plain text and look for Sweave 'chunks' between <<>>= and @). The
various vignettes are worth a look, too.
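
As a rough sketch of the readFastq / tables step (the file name and
the six reads are made up so that the example is self-contained;
substitute your own directory and pattern):

```r
library(ShortRead)

# Write a tiny, made-up FASTQ file; in practice point readFastq()
# at the directory holding your own files.
fq_dir <- tempfile("fastq_demo")
dir.create(fq_dir)
reads <- c("ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAC",
           "TTTTGGGGCC", "TTTTGGGGCC", "GATTACAGAT")
records <- unlist(lapply(seq_along(reads), function(i) {
    # Four lines per record: id, sequence, separator, qualities
    c(paste0("@read", i), reads[i], "+", strrep("I", nchar(reads[i])))
}))
writeLines(records, file.path(fq_dir, "demo.fastq"))

# Read every file in fq_dir whose name matches the pattern
fq <- readFastq(fq_dir, pattern = "demo.fastq$")

# tables() returns a list; 'top' is a named integer vector of the
# most common sequences, sorted by decreasing count.
tbl <- tables(sread(fq), n = 10)
top_reads <- tbl[["top"]]
print(top_reads)
```

The names of 'top_reads' are the sequences themselves, so they can be
written out and passed to whatever aligner you settle on.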

>> sequence does occur and if so, is associated with a known transcript.
>> Further down the line I would like to do some comparisons between normal
>> and tumour tissue.
>>
>> From looking around it seems that Shortread (in the development version)
>> can be used to read in the files into BioStrings objects and then
>> BSgenome can be used to perform some sort of BLAT.  Am I on the right
>> lines here?
>>
>> Can anyone add to what packages I should be looking at and what
>> approaches or techniques I should be using.

The IRanges package provides very useful tools, at a slightly more
abstract level (current favorites include the Rle-class, which is
returned for instance by the 'coverage' function, and the manipulation
of IRanges themselves). The rtracklayer package provides a way to
expose results as tracks in genome browsers. Biobase::matchpt and the
org.* packages can be useful, too.
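
To give a flavour of 'coverage' and the Rle it returns, here is a
minimal sketch; the start positions and the read width of 5 are
invented for illustration, not taken from any real alignment:

```r
library(IRanges)

# Hypothetical alignment start positions, each read 5 bases wide
starts <- c(1L, 3L, 3L, 10L)
read_ranges <- IRanges(start = starts, width = 5L)

# coverage() counts, for each position, how many ranges overlap it;
# the result is a run-length encoded integer vector (Rle).
cvg <- coverage(read_ranges)

runValue(cvg)    # the coverage depth of each run
runLength(cvg)   # how many consecutive positions share that depth
max(cvg)         # deepest coverage anywhere
```

Because the Rle stores runs rather than individual positions, it stays
compact even for chromosome-length vectors.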

> The Bio-sig-seq list is, perhaps the best place to ask for more details.

Yes it would be good to follow up to the bioc-sig-sequencing
group. See http://bioconductor.org/docs/mailList.html

Martin

> The shortreads package combined with the Biostrings package (for doing
> alignments) is one possibility.  Also, it is possible to do the alignments
> outside of R using algorithms like Bowtie, MAQ, or ELAND and the shortreads
> package can read those results directly.
>
> Sean
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793