[BioC] EasyRNAseq - error in building count table - Error in unlist

Felix Frey frey at mpipz.mpg.de
Thu Sep 26 15:12:48 CEST 2013


Hello Nico,
thank you very much for your effort! Probably the problem is that I used 
a different version in my alignment. I'm looking forward for the new 
version.
I executed the code you wrote and it seems to work although I do not 
understand each step completely!

Furthermore I have another question regarding my actual aim for this 
analysis:
The actual reason, why I wanted to repeat this analysis, was that I 
wanted to include a filtering for protein-coding genes. For this reason 
I included "gene_biotype" and a filtering for protein_coding in my 
exon.annotation object:

  exon.annotation <- getBM(c("ensembl_gene_id",
                             "strand",
                             "ensembl_transcript_id",
                             "chromosome_name",
                             "ensembl_gene_id",
                             "ensembl_exon_id",
                             "exon_chrom_start",
                             "exon_chrom_end",
                             "gene_biotype"),
                             mart=ensembl,
                             filters=c("chromosome_name", "biotype"),
                             values=list(chrm, "protein_coding")) # 
filter only protein coding and "chrm" in chromosome!
#
  exon.annotation$chromosome <- 
paste("chr",exon.annotation$chromosome_name,sep="")
  exon.range <- RangedData(IRanges(
     start=exon.annotation$exon_chrom_start,
     end=exon.annotation$exon_chrom_end),
      space=exon.annotation$chromosome,
      strand=exon.annotation$strand,
      transcript=exon.annotation$ensembl_transcript_id,
      gene=exon.annotation$ensembl_gene_id,
      exon=exon.annotation$ensembl_exon_id,
      biotype=exon.annotation$gene_biotype,
      universe = "NULL")

Do you have any idea how to filter out non-protein-coding genes in your 
work around?

Thanks again!

Felix




On 26.09.2013 12:30, Nicolas Delhomme wrote:
> Hej Felix!
>
> I figured out what the error is:
>
> Here are the end of the exons located at the extremity of the 10 chromosomes:
>
> Browse[1]>  sapply(lapply(ranges(geneModel(obj))[gm.sel],range),end)
>          1        10         2         3         4         5         6         7         8         9
> 301409969 149550619 237864118 232184249 242011676 217905991 169336130 176810119 175341538 157013685
>
> Here are the chromosome sizes as retrieved from your BAM file:
>
> Browse[1]>  chrSize(obj)
>          1        10         2         3         4         5         6         7         8         9        Mt
> 301354135 150189435 237068873 232140174 241473504 217872852 169174353 176764762 175793759 156750706    569630
>         Pt   Unknown
>     140384   7140151
>
> As you can see below 8 of the chromosomal annotation - from biomaRt - have exons outside of the chromosomes - sizes derived from the BAM header.
>
> Browse[1]>  sapply(lapply(ranges(geneModel(obj))[gm.sel],range),end)>  chrSize(obj)[1:10]
>      1    10     2     3     4     5     6     7     8     9
>   TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
>
> In short, you are using a different version of the genome for the alignment as the one you are retrieving the annotation from.
>
> easyRNASeq is obviously not behaving gracefully here, so I'll get a proper check implemented. That won't solve your issue though. What you need to figure out is which version of the genome/annotation to use and consistently doing so for your analyses. At the moment, since it would appear that the annotation have been lifted over a new genome, you would probably end up mis-assigning reads to intronic or extra-genic regions (hence not counting them) or worse assigning them to other genes. The easiest in your case is probably to get the gff3 or gtf file corresponding to the version of the genome you used for the alignments (e.g. from phytozome.org, retrieving the gene_exon.gff3 file) and use that for the easyRNASeq function with the parameter annotationMethod="gff" and annotationFile=[the path to the gff3] file.
>
> Btw, I've come to realize that the geneModel generation is painfully slow, luckily the new version due in October should fix that. It will actually introduces a large number of improvements both computationaly and memory wise. That new version will as well provide an easier biomaRt setup, so that you don't have to fetch the annotation yourself.
>
> I've actually looked into this a bit more and got it to work using the ggf3 file Zmays_181_gene_exons.gff3 retrieved from phytozome. The code is below with some inline comments. Actually, what I think is of highest interest for you is at the bottom of that code as I've written a work around to avoid the geneModel generation bottleneck I mentioned above.
>
> ## libs
> library(easyRNASeq)
> files<- c("excerpt_12a_sorted.bam",
>            "excerpt_13a_sorted.bam")
>
> ## easyRNASeq v1.6 and lower
> ## is not very flexible in its gff support
> ## so we need to convert the gff file
> ## read the gff3 file and convert it into FlyBase like
> gff3<- readGff3("Zmays_181_gene_exons.gff3")
>
> ## get the transcript to gene mapping from the mRNA feature
> transcriptGeneMapping<- data.frame(getGffAttribute(gff3[type(gff3)=="mRNA",],"ID"),
>                                      getGffAttribute(gff3[type(gff3)=="mRNA",],"Parent"))
>
> ## get the exon feature
> exons<- gff3[type(gff3)=="exon",]
>
> ## change their gffAttributes into FlyBase like
> ## where ID is geneID:exonNumber
> ## here, we create the new ID by retrieving the
> ## geneID from the mapping and appending the
> ## exon number extracted from the original ID
> ## and we paste that in front of the original
> ## gffAttribute which ID section has been
> ## removed
> exons$gffAttributes<- paste("ID=",transcriptGeneMapping[match(getGffAttribute(exons,"Parent"),
>                                                                 transcriptGeneMapping$ID),"Parent"],
>                               ":",sub(".*\\.","",getGffAttribute(exons,"ID")),
>                               ";",sub("^ID=.*;P","P",exons$gffAttributes),
>                               sep="")
>
> ## write down the modified file
> ## all these functionalities we used to
> ## play with the gff file are from the
> ## genomeIntervals package
> writeGff3(exons,file="Zmays_181_exons.gff3")
>
> ## the following now works
> ## I have removed the validity check
> ## because your BAM files and
> ## the above gff file use the same naming
> ## conventions for the chromosomes
> rnaSeq_excerpt<- easyRNASeq(filesDirectory=".",
>                              annotationMethod="gff",
>                              annotationFile="Zmays_181_exons.gff3",
>                              count="genes",
>                              filenames=files,
>                              summarization="geneModels",
>                              outputFormat="RNAseq")
>
> ## AND HERE is what you could be most interested in
> ## if you are only interested in gene models
> ## as the geneModel generation is at the moment
> ## painfully slow, here is a work around
> ## were we will flatten the exon structure to have
> ## a single "synthetic" super-transcript per gene
>
> ## get the exon ranges by gene
> rngList<- split(IRanges(start=exons[,1],end=exons[,2]),
>                      transcriptGeneMapping[match(getGffAttribute(exons,"Parent"),
>                                                  transcriptGeneMapping$ID),"Parent"])
>
> ## flatten the gene exons
> ## this will create synthetic exons if two exons
> ## of the gene overlap
> rngList<- reduce(rngList)
>
> ## get the gff information
> ## here we simply duplicate the
> ## first exon of every gene
> ## by the number of synthetic exons
> ## per gene. The content will be updated
> ## afterwards.
> syntheticGeneModel<- exons[rep(match(transcriptGeneMapping$ID[match(names(rngList),
>                                                                       transcriptGeneMapping$Parent)],
>                                        getGffAttribute(exons,"Parent")),elementLengths(rngList)),]
>
> ## update the coordinates
> syntheticGeneModel[,1]<- unlist(start(rngList))
> syntheticGeneModel[,2]<- unlist(end(rngList))
>
> ## change the source
> levels(syntheticGeneModel$source)<- "inhouse"
>
> ## modify the gffAttributes
> ## get the exon number for the minus strand
> exonNumber<- sapply(elementLengths(rngList),":",1)
> ## reverse them on the plus strand
> sel<- strand(syntheticGeneModel)[cumsum(elementLengths(rngList))] == "+"
> exonNumber[sel]<- sapply(exonNumber[sel],rev)
> ## update the attributes
> syntheticGeneModel$gffAttributes<- paste("ID=",rep(names(rngList),elementLengths(rngList)),
>                                            ":",unlist(exonNumber),";Parent=",
>                                            rep(names(rngList),elementLengths(rngList)),".0",sep="")
>
> ## write the gff file containing only one transcript per gene
> ## this transcript consists of all existing exon loci for that gene
> writeGff3(syntheticGeneModel,file="Zmays_181_synthetic-gene-models.gff3")
>
> ## Now the trick is to count using transcripts
> ## as we have only one transcrit per gene model
> ## and circumvent the gene model generation
> ## altogether
> geneModel_excerpt<- easyRNASeq(filesDirectory=".",
>                              annotationMethod="gff",
>                              annotationFile="Zmays_181_synthetic-gene-models.gff3",
>                              count="transcripts",
>                              filenames=files)
>
> ## Note that this is still not the best
> ## but that's due to the annotation
> ## if you look at it you'll see that
> ## the original annotation contains
> ## exons that are 1bp in size...
> ## So you might want extra filtering :-)
>
> Let me know if can I help you any further,
>
> Cheers,
>
> Nico
>
> sessionInfo()
>
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] easyRNASeq_1.6.0       ShortRead_1.18.0       latticeExtra_0.6-26    RColorBrewer_1.0-5
>   [5] Rsamtools_1.12.4       DESeq_1.12.1           lattice_0.20-23        locfit_1.5-9.1
>   [9] BSgenome_1.28.0        GenomicRanges_1.12.5   Biostrings_2.28.0      IRanges_1.18.4
> [13] edgeR_3.2.4            limma_3.16.8           biomaRt_2.16.0         Biobase_2.20.1
> [17] genomeIntervals_1.16.0 BiocGenerics_0.6.0     intervals_0.14.0
>
> loaded via a namespace (and not attached):
>   [1] annotate_1.38.0      AnnotationDbi_1.22.6 bitops_1.0-6         DBI_0.2-7            genefilter_1.42.0
>   [6] geneplotter_1.38.0   grid_3.0.1           hwriter_1.3          RCurl_1.95-4.1       RSQLite_0.11.4
> [11] splines_3.0.1        stats4_3.0.1         survival_2.37-4      tools_3.0.1          XML_3.98-1.1
> [16] xtable_1.7-1         zlibbioc_1.6.0
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
>
>
>
>
> On 13 Sep 2013, at 10:03, Felix Frey wrote:
>
>> Thank you for your answer, Nico!
>>
>> I can upload the data. I have 48 .bam files with each about 15GB. I don't know if this is possible in dropbox. If so, I could just upload 2 files, which are also mentionned below in the code.
>>
>> Best,
>>
>> Felix
>>
>> On 12.09.2013 17:38, Nicolas Delhomme wrote:
>>> Hej Felix!
>>>
>>> This is strange. Would you be able to share the data offline with me so that I can try to reproduce the issue? If it's possible I would create a drop box folder.
>>>
>>> Cheers,
>>>
>>> Nico
>>>
>>> ---------------------------------------------------------------
>>> Nicolas Delhomme
>>>
>>> Genome Biology Computational Support
>>>
>>> European Molecular Biology Laboratory
>>>
>>> Tel: +49 6221 387 8310
>>> Email: nicolas.delhomme at embl.de
>>> Meyerhofstrasse 1 - Postfach 10.2209
>>> 69102 Heidelberg, Germany
>>> ---------------------------------------------------------------
>>>
>>>
>>>
>>>
>>>
>>> On Sep 12, 2013, at 4:36 PM, Felix Frey [guest] wrote:
>>>
>>>> Hello,
>>>> I have sequencing data of maize which I aligned to the reference sequence (The maize sequence ZmB73_RefGen_v2.tar.gz, downloaded from:
>>>> # http://ftp.maizesequence.org/current/assembly/) using tophat (v2.0.3), bowtie2 (2.0.0.6) and samtools (0.1.18.0).
>>>> I want to make a count table with easyRNAseq (1.4.2) to use for EdgeR.
>>>> Some months ago, I did following analysis with easyRNASeq:
>>>>
>>>> chr.map<- data.frame(  from=c("chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10"),
>>>>   to=c("1","2","3","4","5","6","7","8","9","10"))
>>>> ensembl<- useMart("ENSEMBL_MART_PLANT",dataset="zmays_eg_gene")
>>>> exon.annotation<- getBM(c("ensembl_gene_id",
>>>>                             "ensembl_transcript_id",
>>>>                             "chromosome_name",
>>>>                             "ensembl_gene_id",
>>>>                             "ensembl_exon_id",
>>>>                             "exon_chrom_start",
>>>>                             "exon_chrom_end"),
>>>>                             mart=ensembl,
>>>>                             filters="chromosome_name",
>>>>                             values=chrm)
>>>> exon.annotation$chromosome<- paste("chr",exon.annotation$chromosome_name,sep="")
>>>> exon.range<- RangedData(IRanges(
>>>>     start=exon.annotation$exon_chrom_start,
>>>>     end=exon.annotation$exon_chrom_end),
>>>>      space=exon.annotation$chromosome,
>>>>      strand=exon.annotation$strand,
>>>>      transcript=exon.annotation$ensembl_transcript_id,
>>>>      gene=exon.annotation$ensembl_gene_id,
>>>>      exon=exon.annotation$ensembl_exon_id,
>>>>      universe = "NULL")
>>>>
>>>> files<- c(	"accepted_hits_12a_sorted.bam",
>>>> 		"accepted_hits_13a_sorted.bam",...)
>>>>
>>>> rnaSeq<- easyRNASeq(filesDirectory="/EASYRNASEQ/countdata",
>>>>                           gapped=F,
>>>>                           validity.check=TRUE,
>>>>                           chr.map=chr.map,
>>>>                           organism="custom",
>>>>                           annotationMethod="env",
>>>>                           annotationObject=exon.range,
>>>>                           count="genes",
>>>> 			  filenames=files,
>>>> 			  summarization="geneModels",
>>>> 			  outputFormat="RNAseq")
>>>>
>>>> I got an output gene count table.
>>>> Now, I wanted to repeat the analysis, but leaving out the non-protein-coding genes, which I wanted to achieve by introducing the "biotype" and filtering by "protein-coding" in the exon.annotation object.
>>>> To check the analysis i performed it again like some months before and I got an error message, which I do not understand completely:
>>>>
>>>>
>>>> OUTPUT:
>>>>
>>>>> head( exon.annotation)
>>>>   ensembl_gene_id strand ensembl_transcript_id chromosome_name
>>>> 1   GRMZM2G439951      1     GRMZM2G439951_T01               5
>>>> 2   GRMZM2G094632      1     GRMZM2G094632_T02               5
>>>> 3   GRMZM2G094632      1     GRMZM2G094632_T01               5
>>>> 4   GRMZM2G094632      1     GRMZM2G094632_T01               5
>>>> 5   GRMZM2G095090      1     GRMZM2G095090_T01               5
>>>> 6   GRMZM2G095094      1     GRMZM2G095094_T01               5
>>>>   ensembl_gene_id.1   ensembl_exon_id exon_chrom_start exon_chrom_end
>>>> 1     GRMZM2G439951 GRMZM2G439951_E01         39542856       39544325
>>>> 2     GRMZM2G094632 GRMZM2G094632_E01         39435878       39436448
>>>> 3     GRMZM2G094632 GRMZM2G094632_E03         39435878       39436317
>>>> 4     GRMZM2G094632 GRMZM2G094632_E02         39436730       39437134
>>>> 5     GRMZM2G095090 GRMZM2G095090_E01         39427195       39427706
>>>> 6     GRMZM2G095094 GRMZM2G095094_E01         39424253       39427105
>>>>     gene_biotype chromosome
>>>> 1 protein_coding       chr5
>>>> 2 protein_coding       chr5
>>>> 3 protein_coding       chr5
>>>> 4 protein_coding       chr5
>>>> 5 protein_coding       chr5
>>>> 6 protein_coding       chr5
>>>>
>>>>> head( exon.range)
>>>> RangedData with 6 rows and 5 value columns across 10 spaces
>>>>      space                 ranges |    strand        transcript
>>>>   <factor>                <IRanges>   |<integer>         <character>
>>>> 1     chr1 [301409811, 301409969] |        -1 AC185612.3_FGT006
>>>> 2     chr1 [301362751, 301363304] |        -1 GRMZM5G884466_T03
>>>> 3     chr1 [301353831, 301354022] |        -1 GRMZM5G884466_T03
>>>> 4     chr1 [301353664, 301353745] |        -1 GRMZM5G884466_T03
>>>> 5     chr1 [301353366, 301353557] |        -1 GRMZM5G884466_T03
>>>> 6     chr1 [301353147, 301353260] |        -1 GRMZM5G884466_T03
>>>>               gene                   exon        biotype
>>>>        <character>              <character>      <character>
>>>> 1 AC185612.3_FG006 AC185612.3_FG006.exon1 protein_coding
>>>> 2    GRMZM5G884466      GRMZM5G884466_E12 protein_coding
>>>> 3    GRMZM5G884466      GRMZM5G884466_E03 protein_coding
>>>> 4    GRMZM5G884466      GRMZM5G884466_E04 protein_coding
>>>> 5    GRMZM5G884466      GRMZM5G884466_E08 protein_coding
>>>> 6    GRMZM5G884466      GRMZM5G884466_E10 protein_coding
>>>>> rnaSeq_new<- easyRNASeq(filesDirectory="/projects/irg/grp_stich/personal_folders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/tophat/EASYRNASEQ/countdata",
>>>> +                           gapped=F,
>>>> +                           validity.check=TRUE,
>>>> +                           chr.map=chr.map,
>>>> +                           organism="custom",
>>>> +                           annotationMethod="env",
>>>> +                           annotationObject=exon.range,
>>>> +                           count="genes",
>>>> +                           filenames=files,
>>>> +                           summarization="geneModels",
>>>> +                           outputFormat="RNAseq")
>>>> Checking arguments...
>>>> Fetching annotations...
>>>> Computing gene models...
>>>> Summarizing counts...
>>>> Processing accepted_hits_12a_sorted.bam
>>>> Updating the read length information.
>>>> The reads are of 95 bp.
>>>> Warning message:
>>>> In easyRNASeq(filesDirectory = "/projects/irg/grp_stich/personal_folders/Felix/TranscriptomeProfiling/RNASeq/RNASeq/tuxedo/tophat/EASYRNASEQ/countdata",  :
>>>>   There are 8770 synthetic exons as determined from your annotation that overlap! This implies that some reads will be counted more than once! Is that really what you want?
>>>> Error in unlist(aggregate(readCoverage(obj)[names(geneModel(obj))[gm.sel]],  :
>>>>   error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .Call2("Rle_getStartEndRunAndOffset", x, start, end, PACKAGE = "IRanges") :
>>>>   'x' values larger than vector length 'sum(width)'
>>>>
>>>>
>>>>
>>>>
>>>> Could you help me with this problem?
>>>>
>>>> Best,
>>>> Felix
>>>>
>>>> --
>>>> Felix Frey
>>>>
>>>> Max Planck Institut für Pflanzenzüchtungsforschung
>>>> Carl-von-Linné-Weg 10
>>>> D-50829 Köln
>>>> +49 (0) 221-5062 405
>>>>
>>>> -- output of sessionInfo():
>>>>
>>>>> sessionInfo()
>>>> R version 3.0.1 (2013-05-16)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>> [1] C
>>>>
>>>> attached base packages:
>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>> [8] base
>>>>
>>>> other attached packages:
>>>> [1] easyRNASeq_1.6.0       ShortRead_1.18.0       latticeExtra_0.6-26
>>>> [4] RColorBrewer_1.0-5     Rsamtools_1.12.4       DESeq_1.12.1
>>>> [7] lattice_0.20-15        locfit_1.5-9.1         BSgenome_1.28.0
>>>> [10] GenomicRanges_1.12.5   Biostrings_2.28.0      IRanges_1.18.3
>>>> [13] edgeR_3.2.4            limma_3.16.7           biomaRt_2.16.0
>>>> [16] Biobase_2.20.1         genomeIntervals_1.16.0 BiocGenerics_0.6.0
>>>> [19] intervals_0.14.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] AnnotationDbi_1.22.6 DBI_0.2-7            RCurl_1.95-4.1
>>>> [4] RSQLite_0.11.4       XML_3.98-1.1         annotate_1.38.0
>>>> [7] bitops_1.0-6         genefilter_1.42.0    geneplotter_1.38.0
>>>> [10] grid_3.0.1           hwriter_1.3          splines_3.0.1
>>>> [13] stats4_3.0.1         survival_2.37-4      tools_3.0.1
>>>> [16] xtable_1.7-1         zlibbioc_1.6.0
>>>>
>>>> --
>>>> Sent via the guest posting facility at bioconductor.org.
>>



More information about the Bioconductor mailing list