[BioC] DEXSeq Problems

Sun May 26 19:05:10 CEST 2013

Hi Margaret

Thanks for the feed-back. We have started working on an improved 
vignette but this might still take a bit to get finished. The main issue 
with the current one is that it somehow start in the middle, namely 
after the use of the Python scripts, and then explains these
initial steps only at the end, with some parts even been moved to the 
pasilla vignette. So, you are certainly right, this needs to be cleaned 
up and brought in a chronological order.

To not let you wait for our new vignettes, let's go through your issues:

 > Issue #1
>
> module load htseq/0.5.3p9
> python
> import HTSeq
> alignment_file = HTSeq.SAM_Reader(" sam file ")
>
> This step generates a SAM alignment object.
>
> However, I need to exit in order to use PBS code to run the
> dexseq_prepare_annotation.py
> and dexseq_count.py and generate gff and txt files,
> and thus lose the SAM alignment object which I am guessing should
> be used as the gtf file?

I am unusure where you found this code fragment. Have yiu started 
reading the "Tour through HTSeq" page? This is nice, but really, there 
is no need learn about how o program your own Python scripts that use 
HTSeq when you just want to use the HTSeq-based Python scripts supplied 
with DEXSeq.

You need to start of with preparing your annotation file. This is the 
GTF file with the gene models for your species that you get from 
Ensembl. (Make sure that the GTF file matches the genome you aligned 
against.) Just run, on the bash shell

    python dexseq_prepare_annotation.py Mus_musculus.GRCm38.71.gtf.gz \
        mouse_flattened.gff

This takes the GTF file from Ensembl (here the current one for mouse, 
taken from ftp://ftp.ensembl.org/pub/release-71/gtf/ ) and "flattens" 
it. See Figure 1 of our paper for an explanation what we mean by 
"flattening".

Then, you run for each of your BAM files the counting script:

    python dexseq_count.py mouse_flattened.gff sample1.sam sample1.counts

Here, sample1.sam is the file produced by tophat (i.e., 
"accepted_hits.sam" for the respective sample). The file 
"sample1.counts" contains the counts. Look into it to see whether it 
makes sense.

Then, you follow the Section "Creating ExonCountSet objects from files 
produced by HTSeq" to read in your count files. And then, you have the 
ExonCountDataSet that you need to perform the analysis described in the 
initial sections of the vignette.

> Can I actually skip the alignment_file step if I already posses a gtf file?

Not sure which step you mean. You always need to get a GTF file 
(preferably from Ensembl, the UCSC ones only work after some tweaking), 
"flatten it", and then count.

> Issue #2
>
> If I have a homo sapiens transcript.bam file (not from ensembl) how do
> I process the file in order to make it work with
> dexseq_prepare_annotation.py?
 >
 > Should I pass the transcript.sam file through the alignment_file code 
 > first?

A ".bam" file? Are you sure you don't mean a gtf or gff file?

Explain a bit more what this file contains, please.

> Issue #3
>
> Using an ensembl animal gene data, I have to specify strandedness=no within
> the
> python dexseq_count.py code and I am wondering if this will
> lead to errors down the line.. The dexseq_count code won't work without
> -s no, despite that the bam file is probably from the ensembl site.

Somehow you got confused here. The bam files contain the aligned reads 
from your samples, not the annotation. And the "stranded" argument is 
about whether the wet-lab protocol you used for sample preparation 
preserves strandedness (i.e., whether the strand to which a read gets 
aligned can tell you about which strand the original mRNA was 
transcribed from.)

> Issue #4
>
> The two vignettes use different ways to create an ExonCountSet object.I am
> confused about which to use...

Well, they both work. We prefer to use the Python scripts. However, many 
users said they preferred an solution that works completely in R, 
without resorting to another language such as Python, and the 
GenomicRanges workflow is an attempt to accomodate these wishes.

> Also, if I have a homo sapiens gene transcript data from illumina, put
> through tophat,  what type of functions should I use to prep the data for
> the ExonCountSet
> creation? Same question for ensembl animal gene data...

I am still confused by what you mean by "data from ensemble". Do you 
have your own RNA-Seq data, or are you using published ones?

> Should I prep the data by using this workflow:
> http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf

This is the vignette for the GenomicFeatures package. You can use this 
package _instead_ of our Python scripts, but then, there is still no 
need to work through this vignette. Rather, just follow the instructions 
in the DEXSeq vignette, Section "Creating ExonCountSet objects From 
GRanges, BamListFiles and transcriptDb objects".

I hope this helps you to get started.

   Simon