[BioC] EasyRNAseq - error in building count table - Error in unlist

Nicolas Delhomme delhomme at embl.de
Thu Sep 26 12:30:15 CEST 2013


Hej Felix!

I figured out what the error is:

Here are the end of the exons located at the extremity of the 10 chromosomes:

Browse[1]> sapply(lapply(ranges(geneModel(obj))[gm.sel],range),end)
        1        10         2         3         4         5         6         7         8         9 
301409969 149550619 237864118 232184249 242011676 217905991 169336130 176810119 175341538 157013685

Here are the chromosome sizes as retrieved from your BAM file:

Browse[1]> chrSize(obj)
        1        10         2         3         4         5         6         7         8         9        Mt 
301354135 150189435 237068873 232140174 241473504 217872852 169174353 176764762 175793759 156750706    569630 
       Pt   Unknown 
   140384   7140151

As you can see below 8 of the chromosomal annotation - from biomaRt - have exons outside of the chromosomes - sizes derived from the BAM header. 

Browse[1]> sapply(lapply(ranges(geneModel(obj))[gm.sel],range),end) > chrSize(obj)[1:10]
    1    10     2     3     4     5     6     7     8     9 
 TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

In short, you are using a different version of the genome for the alignment as the one you are retrieving the annotation from.

easyRNASeq is obviously not behaving gracefully here, so I'll get a proper check implemented. That won't solve your issue though. What you need to figure out is which version of the genome/annotation to use and consistently doing so for your analyses. At the moment, since it would appear that the annotation have been lifted over a new genome, you would probably end up mis-assigning reads to intronic or extra-genic regions (hence not counting them) or worse assigning them to other genes. The easiest in your case is probably to get the gff3 or gtf file corresponding to the version of the genome you used for the alignments (e.g. from phytozome.org, retrieving the gene_exon.gff3 file) and use that for the easyRNASeq function with the parameter annotationMethod="gff" and annotationFile=[the path to the gff3] file.

Btw, I've come to realize that the geneModel generation is painfully slow, luckily the new version due in October should fix that. It will actually introduces a large number of improvements both computationaly and memory wise. That new version will as well provide an easier biomaRt setup, so that you don't have to fetch the annotation yourself.

I've actually looked into this a bit more and got it to work using the ggf3 file Zmays_181_gene_exons.gff3 retrieved from phytozome. The code is below with some inline comments. Actually, what I think is of highest interest for you is at the bottom of that code as I've written a work around to avoid the geneModel generation bottleneck I mentioned above.

## libs
library(easyRNASeq)
files<- c("excerpt_12a_sorted.bam",
          "excerpt_13a_sorted.bam")

## easyRNASeq v1.6 and lower
## is not very flexible in its gff support
## so we need to convert the gff file
## read the gff3 file and convert it into FlyBase like
gff3 <- readGff3("Zmays_181_gene_exons.gff3")

## get the transcript to gene mapping from the mRNA feature
transcriptGeneMapping <- data.frame(getGffAttribute(gff3[type(gff3)=="mRNA",],"ID"),
                                    getGffAttribute(gff3[type(gff3)=="mRNA",],"Parent"))

## get the exon feature
exons <- gff3[type(gff3)=="exon",]

## change their gffAttributes into FlyBase like
## where ID is geneID:exonNumber
## here, we create the new ID by retrieving the 
## geneID from the mapping and appending the 
## exon number extracted from the original ID
## and we paste that in front of the original
## gffAttribute which ID section has been
## removed
exons$gffAttributes <- paste("ID=",transcriptGeneMapping[match(getGffAttribute(exons,"Parent"),
                                                               transcriptGeneMapping$ID),"Parent"],
                             ":",sub(".*\\.","",getGffAttribute(exons,"ID")),
                             ";",sub("^ID=.*;P","P",exons$gffAttributes),
                             sep="")

## write down the modified file
## all these functionalities we used to
## play with the gff file are from the
## genomeIntervals package
writeGff3(exons,file="Zmays_181_exons.gff3")

## the following now works
## I have removed the validity check
## because your BAM files and
## the above gff file use the same naming 
## conventions for the chromosomes
rnaSeq_excerpt<- easyRNASeq(filesDirectory=".",
                            annotationMethod="gff",
                            annotationFile="Zmays_181_exons.gff3",
                            count="genes",
                            filenames=files,
                            summarization="geneModels",
                            outputFormat="RNAseq")

## AND HERE is what you could be most interested in
## if you are only interested in gene models
## as the geneModel generation is at the moment
## painfully slow, here is a work around
## were we will flatten the exon structure to have
## a single "synthetic" super-transcript per gene 

## get the exon ranges by gene
rngList <- split(IRanges(start=exons[,1],end=exons[,2]),
                    transcriptGeneMapping[match(getGffAttribute(exons,"Parent"),
                                                transcriptGeneMapping$ID),"Parent"])

## flatten the gene exons
## this will create synthetic exons if two exons
## of the gene overlap
rngList <- reduce(rngList)

## get the gff information
## here we simply duplicate the 
## first exon of every gene
## by the number of synthetic exons
## per gene. The content will be updated
## afterwards.
syntheticGeneModel <- exons[rep(match(transcriptGeneMapping$ID[match(names(rngList),
                                                                     transcriptGeneMapping$Parent)],
                                      getGffAttribute(exons,"Parent")),elementLengths(rngList)),]

## update the coordinates
syntheticGeneModel[,1] <- unlist(start(rngList))
syntheticGeneModel[,2] <- unlist(end(rngList))

## change the source
levels(syntheticGeneModel$source) <- "inhouse"

## modify the gffAttributes
## get the exon number for the minus strand
exonNumber <- sapply(elementLengths(rngList),":",1)
## reverse them on the plus strand
sel <- strand(syntheticGeneModel)[cumsum(elementLengths(rngList))] == "+"
exonNumber[sel] <- sapply(exonNumber[sel],rev)
## update the attributes
syntheticGeneModel$gffAttributes <- paste("ID=",rep(names(rngList),elementLengths(rngList)),
                                          ":",unlist(exonNumber),";Parent=",
                                          rep(names(rngList),elementLengths(rngList)),".0",sep="")

## write the gff file containing only one transcript per gene
## this transcript consists of all existing exon loci for that gene
writeGff3(syntheticGeneModel,file="Zmays_181_synthetic-gene-models.gff3")

## Now the trick is to count using transcripts
## as we have only one transcrit per gene model
## and circumvent the gene model generation
## altogether
geneModel_excerpt<- easyRNASeq(filesDirectory=".",
                            annotationMethod="gff",
                            annotationFile="Zmays_181_synthetic-gene-models.gff3",
                            count="transcripts",
                            filenames=files)

## Note that this is still not the best
## but that's due to the annotation
## if you look at it you'll see that
## the original annotation contains 
## exons that are 1bp in size...
## So you might want extra filtering :-)

Let me know if can I help you any further,

Cheers,

Nico

sessionInfo()

R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] easyRNASeq_1.6.0       ShortRead_1.18.0       latticeExtra_0.6-26    RColorBrewer_1.0-5    
 [5] Rsamtools_1.12.4       DESeq_1.12.1           lattice_0.20-23        locfit_1.5-9.1        
 [9] BSgenome_1.28.0        GenomicRanges_1.12.5   Biostrings_2.28.0      IRanges_1.18.4        
[13] edgeR_3.2.4            limma_3.16.8           biomaRt_2.16.0         Biobase_2.20.1        
[17] genomeIntervals_1.16.0 BiocGenerics_0.6.0     intervals_0.14.0      

loaded via a namespace (and not attached):
 [1] annotate_1.38.0      AnnotationDbi_1.22.6 bitops_1.0-6         DBI_0.2-7            genefilter_1.42.0   
 [6] geneplotter_1.38.0   grid_3.0.1           hwriter_1.3          RCurl_1.95-4.1       RSQLite_0.11.4      
[11] splines_3.0.1        stats4_3.0.1         survival_2.37-4      tools_3.0.1          XML_3.98-1.1        
[16] xtable_1.7-1         zlibbioc_1.6.0   

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------





On 13 Sep 2013, at 10:03, Felix Frey wrote:

> Thank you for your answer, Nico!
> 
> I can upload the data. I have 48 .bam files with each about 15GB. I don't know if this is possible in dropbox. If so, I could just upload 2 files, which are also mentionned below in the code.
> 
> Best,
> 
> Felix
> 
> On 12.09.2013 17:38, Nicolas Delhomme wrote:
>> Hej Felix!
>> 
>> This is strange. Would you be able to share the data offline with me so that I can try to reproduce the issue? If it's possible I would create a drop box folder.
>> 
>> Cheers,
>> 
>> Nico
>> 
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>> 
>> Genome Biology Computational Support
>> 
>> European Molecular Biology Laboratory
>> 
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>> ---------------------------------------------------------------
>> 
>> 
>> 
>> 
>> 
>> On Sep 12, 2013, at 4:36 PM, Felix Frey [guest] wrote:
>> 
>>> Hello,
>>> I have sequencing data of maize which I aligned to the reference sequence (The maize sequence ZmB73_RefGen_v2.tar.gz, downloaded from:
>>> # http://ftp.maizesequence.org/current/assembly/) using tophat (v2.0.3), bowtie2 (2.0.0.6) and samtools (0.1.18.0).
>>> I want to make a count table with easyRNAseq (1.4.2) to use for EdgeR.
>>> Some months ago, I did following analysis with easyRNASeq:
>>> 
>>> chr.map<- data.frame(  from=c("chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10"),
>>>  to=c("1","2","3","4","5","6","7","8","9","10"))
>>> ensembl<- useMart("ENSEMBL_MART_PLANT",dataset="zmays_eg_gene")
>>> exon.annotation<- getBM(c("ensembl_gene_id",
>>>                            "ensembl_transcript_id",
>>>                            "chromosome_name",
>>>                            "ensembl_gene_id",
>>>                            "ensembl_exon_id",
>>>                            "exon_chrom_start",
>>>                            "exon_chrom_end"),
>>>                            mart=ensembl,
>>>                            filters="chromosome_name",
>>>                            values=chrm)
>>> exon.annotation$chromosome<- paste("chr",exon.annotation$chromosome_name,sep="")
>>> exon.range<- RangedData(IRanges(
>>>    start=exon.annotation$exon_chrom_start,
>>>    end=exon.annotation$exon_chrom_end),
>>>     space=exon.annotation$chromosome,
>>>     strand=exon.annotation$strand,
>>>     transcript=exon.annotation$ensembl_transcript_id,
>>>     gene=exon.annotation$ensembl_gene_id,
>>>     exon=exon.annotation$ensembl_exon_id,
>>>     universe = "NULL")
>>> 
>>> files<- c(	"accepted_hits_12a_sorted.bam",
>>> 		"accepted_hits_13a_sorted.bam",...)
>>> 
>>> rnaSeq<- easyRNASeq(filesDirectory="/EASYRNASEQ/countdata",
>>>                          gapped=F,
>>>                          validity.check=TRUE,
>>>                          chr.map=chr.map,
>>>                          organism="custom",
>>>                          annotationMethod="env",
>>>                          annotationObject=exon.range,
>>>                          count="genes",
>>> 			  filenames=files,
>>> 			  summarization="geneModels",
>>> 			  outputFormat="RNAseq")
>>> 
>>> I got an output gene count table.
>>> Now, I wanted to repeat the analysis, but leaving out the non-protein-coding genes, which I wanted to achieve by introducing the "biotype" and filtering by "protein-coding" in the exon.annotation object.
>>> To check the analysis i performed it again like some months before and I got an error message, which I do not understand completely:
>>> 
>>> 
>>> OUTPUT:
>>> 
>>>> head( exon.annotation)
>>>  ensembl_gene_id strand ensembl_transcript_id chromosome_name
>>> 1   GRMZM2G439951      1     GRMZM2G439951_T01               5
>>> 2   GRMZM2G094632      1     GRMZM2G094632_T02               5
>>> 3   GRMZM2G094632      1     GRMZM2G094632_T01               5
>>> 4   GRMZM2G094632      1     GRMZM2G094632_T01               5
>>> 5   GRMZM2G095090      1     GRMZM2G095090_T01               5
>>> 6   GRMZM2G095094      1     GRMZM2G095094_T01               5
>>>  ensembl_gene_id.1   ensembl_exon_id exon_chrom_start exon_chrom_end
>>> 1     GRMZM2G439951 GRMZM2G439951_E01         39542856       39544325
>>> 2     GRMZM2G094632 GRMZM2G094632_E01         39435878       39436448
>>> 3     GRMZM2G094632 GRMZM2G094632_E03         39435878       39436317
>>> 4     GRMZM2G094632 GRMZM2G094632_E02         39436730       39437134
>>> 5     GRMZM2G095090 GRMZM2G095090_E01         39427195       39427706
>>> 6     GRMZM2G095094 GRMZM2G095094_E01         39424253       39427105
>>>    gene_biotype chromosome
>>> 1 protein_coding       chr5
>>> 2 protein_coding       chr5
>>> 3 protein_coding       chr5
>>> 4 protein_coding       chr5
>>> 5 protein_coding       chr5
>>> 6 protein_coding       chr5
>>> 
>>>> head( exon.range)
>>> RangedData with 6 rows and 5 value columns across 10 spaces
>>>     space                 ranges |    strand        transcript
>>>  <factor>               <IRanges>  |<integer>        <character>
>>> 1     chr1 [301409811, 301409969] |        -1 AC185612.3_FGT006
>>> 2     chr1 [301362751, 301363304] |        -1 GRMZM5G884466_T03
>>> 3     chr1 [301353831, 301354022] |        -1 GRMZM5G884466_T03
>>> 4     chr1 [301353664, 301353745] |        -1 GRMZM5G884466_T03
>>> 5     chr1 [301353366, 301353557] |        -1 GRMZM5G884466_T03
>>> 6     chr1 [301353147, 301353260] |        -1 GRMZM5G884466_T03
>>>              gene                   exon        biotype
>>>       <character>             <character>     <character>
>>> 1 AC185612.3_FG006 AC185612.3_FG006.exon1 protein_coding
>>> 2    GRMZM5G884466      GRMZM5G884466_E12 protein_coding
>>> 3    GRMZM5G884466      GRMZM5G884466_E03 protein_coding
>>> 4    GRMZM5G884466      GRMZM5G884466_E04 protein_coding
>>> 5    GRMZM5G884466      GRMZM5G884466_E08 protein_coding
>>> 6    GRMZM5G884466      GRMZM5G884466_E10 protein_coding
>>>> rnaSeq_new<- easyRNASeq(filesDirectory="/projects/irg/grp_stich/personal_folders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/tophat/EASYRNASEQ/countdata",
>>> +                           gapped=F,
>>> +                           validity.check=TRUE,
>>> +                           chr.map=chr.map,
>>> +                           organism="custom",
>>> +                           annotationMethod="env",
>>> +                           annotationObject=exon.range,
>>> +                           count="genes",
>>> +                           filenames=files,
>>> +                           summarization="geneModels",
>>> +                           outputFormat="RNAseq")
>>> Checking arguments...
>>> Fetching annotations...
>>> Computing gene models...
>>> Summarizing counts...
>>> Processing accepted_hits_12a_sorted.bam
>>> Updating the read length information.
>>> The reads are of 95 bp.
>>> Warning message:
>>> In easyRNASeq(filesDirectory = "/projects/irg/grp_stich/personal_folders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/tophat/EASYRNASEQ/countdata",  :
>>>  There are 8770 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want?
>>> Error in unlist(aggregate(readCoverage(obj)[names(geneModel(obj))[gm.sel]],  :
>>>  error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .Call2("Rle_getStartEndRunAndOffset", x, start, end, PACKAGE = "IRanges") :
>>>  'x' values larger than vector length 'sum(width)'
>>> 
>>> 
>>> 
>>> 
>>> Could you help me with this problem?
>>> 
>>> Best,
>>> Felix
>>> 
>>> --
>>> Felix Frey
>>> 
>>> Max Planck Institut für Pflanzenzüchtungsforschung
>>> Carl-von-Linné-Weg 10
>>> D-50829 Köln
>>> +49 (0) 221-5062 405
>>> 
>>> -- output of sessionInfo():
>>> 
>>>> sessionInfo()
>>> R version 3.0.1 (2013-05-16)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>> 
>>> locale:
>>> [1] C
>>> 
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>> 
>>> other attached packages:
>>> [1] easyRNASeq_1.6.0       ShortRead_1.18.0       latticeExtra_0.6-26
>>> [4] RColorBrewer_1.0-5     Rsamtools_1.12.4       DESeq_1.12.1
>>> [7] lattice_0.20-15        locfit_1.5-9.1         BSgenome_1.28.0
>>> [10] GenomicRanges_1.12.5   Biostrings_2.28.0      IRanges_1.18.3
>>> [13] edgeR_3.2.4            limma_3.16.7           biomaRt_2.16.0
>>> [16] Biobase_2.20.1         genomeIntervals_1.16.0 BiocGenerics_0.6.0
>>> [19] intervals_0.14.0
>>> 
>>> loaded via a namespace (and not attached):
>>> [1] AnnotationDbi_1.22.6 DBI_0.2-7            RCurl_1.95-4.1
>>> [4] RSQLite_0.11.4       XML_3.98-1.1         annotate_1.38.0
>>> [7] bitops_1.0-6         genefilter_1.42.0    geneplotter_1.38.0
>>> [10] grid_3.0.1           hwriter_1.3          splines_3.0.1
>>> [13] stats4_3.0.1         survival_2.37-4      tools_3.0.1
>>> [16] xtable_1.7-1         zlibbioc_1.6.0
>>> 
>>> --
>>> Sent via the guest posting facility at bioconductor.org.
> 
> 
> -- 
> Felix Frey
> 
> AG Stich
> Max Planck Institut für Pflanzenzüchtungsforschung
> Carl-von-Linné-Weg 10
> D-50829 Köln
> +49 (0) 221-5062 405
> 



More information about the Bioconductor mailing list