[BioC] error in function reduce from girafe

Fri Jun 25 15:02:34 CEST 2010

Hi Andreia,

please keep the discussion on the list, in order to have the answers stored in
the mailing list archives. 
For reading in GFF (version 3) files, I am using the function "readGff3" from
package genomeIntervals. However, if you look at the code of that function
(simply type in 'readGff3' without brackets), you see that the process is
straightforward. The file is first read into R (using funcion 'scan') and then
a "Genome_intervals_stranded" object is constructed from the result. Since the
format GTF is very similar to GFF, you can likely use the same source code
line-by-line with only minor changes. For example, in the attributes field, I
think that GFF requires a "=" between attribute and value and GTF a
whitespace. So after reading in the file with "scan", you might want to
replace those whitespaces at those positions by equal signs, using for example
the function 'gsub'. 

HTH,
Joern

On Fri, 25 Jun 2010 13:16:55 +0100, Andreia Fonseca wrote 
> Dear Joern, 
> 
> Thank you for your quick reply. Now its working well!!!! I have one more
question. Concerning the annotation file, I have a gtf file with all the
annotation of the human genome. I saw in your annotation script that you show
how to read gff files. Can you give me a hint on how to read the gtf file. 
> Thanks 
> with kind regards, 
> Andreia 
> 
> On Thu, Jun 24, 2010 at 4:26 PM, Joern Toedling <Joern.Toedling at curie.fr>
wrote: 
>  Hi Andreia, 
> 
> when "reducing" overlapping aligned reads, also the consensus sequence of the 
> merged reads is determined, using the function consensusString from package 
> Biostrings. And for some reason this step does not work for you. 
> 
> I cannot be completely sure but I think there may be an issue with the way 
> that you generate the AlignedGenomeIntervals object. The recommended way is 
> to use package ShortRead first to generate an AlignedRead object and then 
> coercing that object into an AlignedGenomeIntervals object. However, creating 
> it manually is possible but more prone to such problems. 
> However, thank you very much for trying it and reporting the results, maybe I 
> need to implement an additional check of the object to provide users in this 
> case with a more meaningful error message. 
> 
> I think the problem here is that your specified read sequence always seems to 
> be one shorter than the specified interval lengths. 
> For example, in the first row the sequence 
> TTTCTAATGAGCCCAGGGAGGGCTAGA 
> has 27 letters but the interval goes from position 43076512 to 43076539, which 
> is 28 bases. So maybe your alignment coordinates are in half-open interval 
> notation, in which case the interval actually ends at position 43076538. Then 
> the argument 'end' should be: 
> end=data$V6-1L 
> Also note that the coordinates in girafe are 1-based, so you may need to 
> adjust the alignment coordinates accordingly. 
> 
> It could also be an issue with factors in your data.frame 'data' (which btw. 
> is not a very good name since there also is a basic R function of that name), 
> especially the strand could be a candidate here. When you read in data using 
> read.table it's generally a good idea to use the argument 
> stringsAsFactors=FALSE to make sure there are no factor-related 
> errors later on unless you specifically want to make use of factors in your 
> data.frame. 
> 
> Please let me know if these suggested changes resolve the problem. Otherwise, 
> I would be grateful if you can send me a small excerpt of the file that you 
> read in with read.table off-list and I can have a closer look at it. 
> 
> Cheers, 
> Joern