[BioC] MakeTranscriptDbFromGFF

Tue May 7 11:58:51 CEST 2013

Thank you very very much, Marc.
Ugo

> From: Marc Carlson <mcarlson at fhcrc.org>
> Date: Mon, 06 May 2013 15:00:09 -0700
> To: <bioconductor at r-project.org>
> Subject: Re: [BioC] MakeTranscriptDbFromGFF
> 
> Hi Ugo,
> 
> So I have good news and bad news regarding that data file you wanted to
> use from:
> 
> http://tophat.cbcb.umd.edu/igenomes.shtml
> 
> The good news is that I have modified makeTranscriptDbFromGFF() to
> notice whenever a file only has exon rank information for CDS and to do
> the right thing (and make use of that information) while issuing a
> warning that it is doing so.  The bad news is that after looking more
> carefully at the data from the site listed above, there is no way to
> make that data into a transcriptome without inferring the exon ranks.
> At least not in it's current state. The files on offer simply don't
> contain exon rank information. What I initially thought might be exon
> ranks is (after more careful inspection) something else, and so can't be
> used for that.
> 
> So you probably should go with the advice from my last message and make
> a custom mm9 package that makes use of the UCSC transcriptome (or maybe
> one from BiomaRt depending on what you have aligned your data to).  But
> I would not recommend using the data at that web site to make
> transcriptomes unless they can give you a file that has exon rank
> information.
> 
> 
>   Marc
> 
> 
> 
> On 05/06/2013 01:32 PM, Marc Carlson wrote:
>> Hi Ugo,
>> 
>> This worked for me:
>> 
>> library(OrganismDbi)
>> 
>> gd <- list(join1 = c(GO.db="GOID", org.Mm.eg.db="GO"),
>>            join2 = c(org.Mm.eg.db="ENTREZID",
>>                      TxDb.Mmusculus.UCSC.mm10.knownGene="GENEID"))
>> 
>> makeOrganismPackage(pkgname = "Mus.musculus.mm9",
>>                     graphData = gd,
>>                     organism = "Mus musculus",
>>                     version = "1.0",
>>                     maintainer ="You <maintainer at someplace.org>",
>>                     author = "You",
>>                     destDir = ".",
>>                     license = "Artistic-2.0")
>> 
>> 
>> Then I can do stuff like this (after I run R CMD INSTALL
>> Mus.musculus.mm9) :
>> 
>> library(Mus.musculus.mm9)
>> 
>> select(Mus.musculus.mm9,
>> head(keys(Mus.musculus.mm9,keytype="ENTREZID")),c("SYMBOL","TXSTRAND"),keytyp
>> e="ENTREZID")
>> 
>> txs <- transcripts(Mus.musculus.mm9, columns="SYMBOL")
>> 
>> 
>> Etc.
>> 
>> 
>> As for your other question, that error means that the chromosome names
>> in your input file are not all present in the chrominfo that you have
>> used.  Usually this means that there is a row that says something like
>> chrom8_random (or something like that) as it's chromosome name in the
>> input file.  And you don't have that in your chrominfo table, so the
>> database is stopping, because it doesn't have any way to know what
>> chrom8_random is?
>> 
>> To fix it, you can either remove unwanted rows from your input file or
>> you can add new rows to your chrominfo object.
>> 
>> 
>>   Marc
>> 
>> 
>> 
>> On 05/04/2013 07:18 AM, Ugo Borello wrote:
>>> Dear Marc,
>>> I apologize for bothering you again, but I was intrigued by the
>>> custom Mus.musculus.mm9 object. It sounds very convenient indeed.
>>> so I tried, as you suggested:
>>> 
>>> gd<-c(org.Mm.eg.db='SYMBOL',
>>>             TxDb.Mmusculus.UCSC.mm9.knownGene="GENEID")
>>> ## I also tried to set org.Mm.eg.db='ENTREZID'
>>> 
>>> destination <- tempfile()
>>> dir.create(destination)
>>> makeOrganismPackage(pkgname = "Mus.musculus.mm9",
>>>                     graphData = gd,
>>>                     organism = "Mus musculus",
>>>                     version = "1.0.0",
>>>                     maintainer = "Package
>>> Maintainer<maintainer at somewhere.com>",
>>>                     author = "SomeBody",
>>>                     destDir = destination,
>>>                     license = "Artistic-2.0")
>>> 
>>> and I got:
>>> Error in cbind(pkgs, keys) :
>>>   number of rows of matrices must match (see arg 2)
>>> 
>>> I am surely missing something here. Any suggestions?
>>> 
>>> 
>>> And, coming back to makeTranscriptDbFromGFF.
>>> I obtained my gtf annotation file from UCSC using genePredToGtf, as
>>> the UCSC people suggest
>>> (http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format)
>>> 
>>> txdb <-makeTranscriptDbFromGFF(file = 'mm9.gtf', format = "gtf",
>>> + exonRankAttributeName = "exon_number",
>>> + chrominfo = chrominfo,
>>> + dataSource = "UCSC",
>>> + species = "Mus musculus")
>>> extracting transcript information
>>> Estimating transcript ranges.
>>> Extracting gene IDs
>>> Processing splicing information for gtf file.
>>> Prepare the 'metadata' data frame ... metadata: OK
>>> Error in .checkForeignKey(transcripts_tx_chrom, NA,
>>> "transcripts$tx_chrom",  :
>>>   all the values in 'transcripts$tx_chrom' must be present in
>>> 'chrominfo$chrom'
>>> 
>>> Why I get this error now?
>>> 
>>>> chrominfo
>>>    chrom    length is_circular
>>> 1  chr10 129993255       FALSE
>>> 2  chr11 121843856       FALSE
>>> 3  chr12 121257530       FALSE
>>> 4  chr13 120284312       FALSE
>>> 5  chr14 125194864       FALSE
>>> 6  chr15 103494974       FALSE
>>> 7  chr16  98319150       FALSE
>>> 8  chr17  95272651       FALSE
>>> 9  chr18  90772031       FALSE
>>> 10 chr19  61342430       FALSE
>>> 11  chr1 197195432       FALSE
>>> 12  chr2 181748087       FALSE
>>> 13  chr3 159599783       FALSE
>>> 14  chr4 155630120       FALSE
>>> 15  chr5 152537259       FALSE
>>> 16  chr6 149517037       FALSE
>>> 17  chr7 152524553       FALSE
>>> 18  chr8 131738871       FALSE
>>> 19  chr9 124076172       FALSE
>>> 20  chrM     16299       FALSE
>>> 21  chrX 166650296       FALSE
>>> 22  chrY  15902555       FALSE
>>> 
>>> Thank you very much for your time and patience.
>>> 
>>> Ugo
>>> 
>>> 
>>> 
>>> 
>>> P.S.
>>> This is the annotation
>>>> annFile3<- import.gff('mm9.gtf', format='gtf', asRangedData=FALSE)
>>>> annFile3
>>> GRanges with 962651 ranges and 9 metadata columns:
>>>               seqnames               ranges strand   | source
>>> type     score     phase     gene_id transcript_id
>>>                  <Rle>            <IRanges> <Rle>   |  <factor>
>>> <factor> <numeric> <integer> <character> <character>
>>>        [1]        chr1   [3195985, 3197398]      -   | knownGene
>>>   exon      <NA>      <NA> uc007aet.1    uc007aet.1
>>>        [2]        chr1   [3203520, 3205713]      -   | knownGene
>>>   exon      <NA>      <NA> uc007aet.1    uc007aet.1
>>>        [3]        chr1   [3204563, 3207049]      -   | knownGene
>>>   exon      <NA>      <NA> Q5GH67 uc007aeu.1
>>>        [4]        chr1   [3206106, 3207049]      -   | knownGene
>>>    CDS      <NA>         2      Q5GH67 uc007aeu.1
>>>        [5]        chr1   [3411783, 3411982]      -   | knownGene
>>>   exon      <NA>      <NA> Q5GH67 uc007aeu.1
>>>        ...         ...                  ...    ... ... ...
>>> ...       ...       ...         ...           ...
>>>   [962647] chrY_random [58502132, 58502365]      +   | knownGene
>>>    CDS      <NA>         0      Q62458 uc012htl.1
>>>   [962648] chrY_random [58502369, 58502946]      +   | knownGene
>>>   exon      <NA>      <NA> Q62458 uc012htl.1
>>>   [962649] chrY_random [58502369, 58502812]      +   | knownGene
>>>    CDS      <NA>         0      Q62458 uc012htl.1
>>>   [962650] chrY_random [58502132, 58502134]      +   | knownGene
>>> start_codon      <NA>         0      Q62458    uc012htl.1
>>>   [962651] chrY_random [58502813, 58502815]      +   | knownGene
>>> stop_codon      <NA>         0      Q62458    uc012htl.1
>>>            exon_number      exon_id   gene_name
>>>              <numeric>  <character> <character>
>>>        [1]           1 uc007aet.1.1        <NA>
>>>        [2]           2 uc007aet.1.2        <NA>
>>>        [3]           1 uc007aeu.1.1      Q5GH67
>>>        [4]           1 uc007aeu.1.1      Q5GH67
>>>        [5]           2 uc007aeu.1.2      Q5GH67
>>>        ...         ...          ...         ...
>>>   [962647]           1 uc012htl.1.1      Q62458
>>>   [962648]           2 uc012htl.1.2      Q62458
>>>   [962649]           2 uc012htl.1.2      Q62458
>>>   [962650]           1 uc012htl.1.1      Q62458
>>>   [962651]           1 uc012htl.1.1      Q62458
>>>   ---
>>>   seqlengths:
>>>            chr1        chr10        chr11        chr12 chr13
>>> ...         chrX  chrX_random         chrY  chrY_random
>>>              NA           NA           NA           NA NA
>>> ...           NA
>>> 
>>> 
>>> Quoting Marc Carlson <mcarlson at fhcrc.org>:
>>> 
>>>> Hi Ugo,
>>>> 
>>>> 
>>>> On 05/03/2013 04:00 AM, Ugo Borello wrote:
>>>>> Dear Carl,
>>>>> Thank you very much; it makes sense now.
>>>>> 
>>>>> To quantify gene expression of my RNASeq samples I  use your
>>>>> TxDb.Mmusculus.UCSC.mm9.knownGene annotation together with the
>>>>> alignments to
>>>>> the BSgenome.Mmusculus.UCSC.mm9 genome.
>>>>> Now that I used the UCSC mm9 mouse genome from the Illumina iGenomes I
>>>>> wanted to use their annotation.
>>>>> 
>>>>> 
>>>>> Anyway, I have now a general question.
>>>>> For the mouse mm9 genome (genome.fa) obtained from Illumina iGenomes
>>>>>> genome<- scanFaIndex('genome.fa')
>>>>>> seqlengths(genome)
>>>>>     chr10     chr11     chr12     chr13     chr14 chr15 chr16
>>>>> chr17     chr18     chr19
>>>>> 129993255 121843856 121257530 120284312 125194864 103494974 98319150
>>>>> 95272651  90772031  61342430
>>>>>      chr1      chr2      chr3      chr4      chr5 chr6 chr7
>>>>> chr8      chr9      chrM
>>>>> 197195432 181748087 159599783 155630120 152537259 149517037 152524553
>>>>> 131738871 124076172     16299
>>>>>      chrX      chrY
>>>>> 166650296  15902555
>>>>> 
>>>>> For your TxDb.Mmusculus.UCSC.mm9.knownGene
>>>>>> seqlengths( TxDb.Mmusculus.UCSC.mm9.knownGene)
>>>>>         chr1         chr2         chr3         chr4 chr5
>>>>> chr6         chr7         chr8         chr9
>>>>>    197195432    181748087    159599783    155630120 152537259
>>>>> 149517037    152524553    131738871    124076172
>>>>>        chr10        chr11        chr12        chr13 chr14
>>>>> chr15        chr16        chr17        chr18
>>>>>    129993255    121843856    121257530    120284312 125194864
>>>>> 103494974     98319150     95272651     90772031
>>>>>        chr19         chrX         chrY         chrM chr1_random
>>>>> chr3_random  chr4_random  chr5_random  chr7_random
>>>>>     61342430    166650296     15902555        16299 1231697
>>>>> 41899       160594       357350       362490
>>>>>  chr8_random  chr9_random chr13_random chr16_random chr17_random
>>>>> chrX_random  chrY_random chrUn_random
>>>>>       849593       449403       400311         3994 628739
>>>>> 1785075     58682461      5900358
>>>>> 
>>>>> 
>>>>> So. I want to filter out the '_random' stuff; is this the only and
>>>>> right way
>>>>> to do it?
>>>> 
>>>> It depends on whether you want to use the _random stuff.  My impression
>>>> of how people use this is that most people don't.  So they would use
>>>> isActiveSeq() to toggle those off.
>>>> 
>>>>>> ann<- TxDb.Mmusculus.UCSC.mm9.knownGene
>>>>>> isActiveSeq(ann)[seqlevels(ann)] <- FALSE
>>>>>> isActiveSeq(ann) <- c("chr10"=TRUE, "chr11"=TRUE,
>>>>> +                       "chr12"=TRUE,"chr13"=TRUE,
>>>>> +                       "chr14"=TRUE,"chr15"=TRUE,
>>>>> +                       "chr16"=TRUE, "chr17"=TRUE,
>>>>> +                       "chr18"=TRUE, "chr19"=TRUE,
>>>>> +                       "chr1"=TRUE, "chr2"=TRUE,
>>>>> +                       "chr3"=TRUE, "chr4"=TRUE,
>>>>> +                       "chr5"=TRUE, "chr6"=TRUE,
>>>>> +                       "chr7"=TRUE, "chr8"=TRUE,
>>>>> +                       "chr9"=TRUE, "chrM"=TRUE,
>>>>> +                       "chrX"=TRUE, "chrY"=TRUE)
>>>>> Then get my gene info this way?
>>>>>> genesInfo<- exons(ann, columns='gene_id')
>>>>> How can I add also gene names or symbols?
>>>> Again, it depends on what you want to do.  If you are dealing with just
>>>> a TranscriptDb like this, then there are not gene symbols attached
>>>> already.  So for that object yes, you would need to do it like that.
>>>> 
>>>> 
>>>> Now if you wanted gene symbols they are pretty easy to get. You can go
>>>> and get gene symbols by loading an org package like this:
>>>> 
>>>> library(org.Mm.eg.db)
>>>> cols(org.Mm.eg.db)
>>>> 
>>>> And then you could use select() to retrieve those (by using the gene
>>>> IDs as keys from your TranscriptDb object).  But some people find this
>>>> extra step to be inconvenient, so I have created another route. And
>>>> that is to use a OrganismDb object.  There is a package for this
>>>> already that I will use to demo here:
>>>> 
>>>> library(Mus.musculus)
>>>> cols(Mus.musculus)
>>>> 
>>>> You will notice that this kind of object has all the stuff you want in
>>>> one place.  I suspect that you will find that to be a lot more
>>>> convenient:
>>>> 
>>>> This basically means that you can do the thing you were interested in
>>>> like this:
>>>> 
>>>> genesInfo <- exons(Mus.musculus, columns=c("GENEID","SYMBOL"))
>>>> 
>>>> *But* in your case, you are using mm9, so you will want to make a
>>>> custom object (the Mus.musculus object is based on mm10). This is not
>>>> hard to do, but it is an extra step.  You can read how to do it in
>>>> section 2 of the following vignette:
>>>> 
>>>> http://www.bioconductor.org/packages/2.12/bioc/vignettes/OrganismDbi/inst/d
>>>> oc/OrganismDbi.pdf
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Thank you again for your help.
>>>>> 
>>>>> Ugo
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> From: Marc Carlson <mcarlson at fhcrc.org>
>>>>>> Date: Thu, 02 May 2013 15:52:38 -0700
>>>>>> To: Ugo Borello <ugo.borello at inserm.fr>
>>>>>> Cc: <bioconductor at r-project.org>
>>>>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF
>>>>>> 
>>>>>> Hi Ugo,
>>>>>> 
>>>>>> The 15 GB tarball you sent me to contains several different GTF files
>>>>>> for genes.  I grabbed this one as it seemed to be the most recent:
>>>>>> 
>>>>>> Mus_musculus/UCSC/mm9/Annotation/Archives/archive-2013-03-06-15-01-24/Gen
>>>>>> es/ge 
>>>>>> 
>>>>>> nes.gtf
>>>>>> 
>>>>>> So looking at this file I can reproduce the problem you
>>>>>> mentioned.  And
>>>>>> it shows me two problems.  The 1st problem is that the only field
>>>>>> that
>>>>>> seems to contain any information about exon positions is called phase
>>>>>> (and not "exon_number" as was in the code I see from before).
>>>>>> There is
>>>>>> not actually any field called "exon_number" in this file. Either way,
>>>>>> one thing you can check is to make sure that the string you give
>>>>>> here is
>>>>>> the same as the appropriate field name that is used by the file.
>>>>>> There
>>>>>> is no way to know this information in advance since GTF does not
>>>>>> specify
>>>>>> how to encode this information (and in fact the information is
>>>>>> entirely
>>>>>> optional).
>>>>>> 
>>>>>> The second problem is that even "phase" can't work right now since
>>>>>> the
>>>>>> authors of this gtf file have decided to only associate the exon rank
>>>>>> information only with CDS and never with exons features. So there is
>>>>>> not any actual 'exon' position information in this file, only
>>>>>> information for CDS positions.  Now that I see people doing these
>>>>>> files
>>>>>> in this way, I plan to enhance the parser so that it can process
>>>>>> files
>>>>>> of this kind.
>>>>>> 
>>>>>> 
>>>>>> Is there a reason why you wanted to use this file and not the data
>>>>>> contained in this package here?
>>>>>> 
>>>>>> http://www.bioconductor.org/packages/2.12/data/annotation/html/TxDb.Mmusc
>>>>>> ulus. 
>>>>>> 
>>>>>> UCSC.mm9.knownGene.html
>>>>>> 
>>>>>> 
>>>>>>    Marc
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 05/02/2013 02:59 AM, Ugo Borello wrote:
>>>>>>> Dear Marc
>>>>>>> Sorry I was not precise on the origin of the gtf annotation file;
>>>>>>>  I got the
>>>>>>> gtf file from here:
>>>>>>> http://tophat.cbcb.umd.edu/igenomes.shtml
>>>>>>> 
>>>>>>> And more precisely from the Mus musculus/UCSC/mm9 folder
>>>>>>> Here the description of the content of the folder:
>>>>>>> ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/README.txt
>>>>>>> 
>>>>>>> I realized reading the README.txt file that actually the
>>>>>>> genes.gtf file I
>>>>>>> used is the Ensembl annotation of the mm9 release.
>>>>>>> So, I changed dataSource = "Ensembl" in the function call and I
>>>>>>> got the same
>>>>>>> error message:
>>>>>>> Error in data.frame(..., check.names = FALSE) :
>>>>>>>    arguments imply differing number of rows: 541775, 0
>>>>>>> 
>>>>>>> At the end of my previous email you have the result of calling:
>>>>>>> annFile<- import.gff('genes.gtf', format='gtf', asRangedData=FALSE)
>>>>>>> 
>>>>>>> 
>>>>>>> Thank you
>>>>>>> 
>>>>>>> Ugo
>>>>>>> 
>>>>>>>> From: Marc Carlson <mcarlson at fhcrc.org>
>>>>>>>> Date: Wed, 01 May 2013 15:33:02 -0700
>>>>>>>> To: <bioconductor at r-project.org>
>>>>>>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF
>>>>>>>> 
>>>>>>>> Hi Ugo,
>>>>>>>> 
>>>>>>>> Which UCSC file was it that you were trying to process?
>>>>>>>> 
>>>>>>>> 
>>>>>>>>     Marc
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 05/01/2013 02:21 AM, Ugo Borello wrote:
>>>>>>>>> Good morning,
>>>>>>>>> 
>>>>>>>>> I have a little problem creating a TranscriptDb object using
>>>>>>>>> the function
>>>>>>>>> makeTranscriptDbFromGFF. I want to use this annotation to count
>>>>>>>>> the
>>>>>>>>> overlaps
>>>>>>>>> of my genomic alignments with genes.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I ran:
>>>>>>>>> 
>>>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>>>>> + exonRankAttributeName = "exon_number",
>>>>>>>>> + chrominfo = chrominfo,
>>>>>>>>> + dataSource = "UCSC",
>>>>>>>>> + species = "Mus musculus")
>>>>>>>>> 
>>>>>>>>> And I got this error message:
>>>>>>>>> Error in data.frame(..., check.names = FALSE) :
>>>>>>>>>     arguments imply differing number of rows: 541775, 0
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> "chrominfo" was (info retrieved from the fasta genome file):
>>>>>>>>> 
>>>>>>>>>> chrominfo
>>>>>>>>>      chrom    length is_circular
>>>>>>>>> 1  chr10 129993255       FALSE
>>>>>>>>> 2  chr11 121843856       FALSE
>>>>>>>>> 3  chr12 121257530       FALSE
>>>>>>>>> 4  chr13 120284312       FALSE
>>>>>>>>> 5  chr14 125194864       FALSE
>>>>>>>>> 6  chr15 103494974       FALSE
>>>>>>>>> 7  chr16  98319150       FALSE
>>>>>>>>> 8  chr17  95272651       FALSE
>>>>>>>>> 9  chr18  90772031       FALSE
>>>>>>>>> 10 chr19  61342430       FALSE
>>>>>>>>> 11  chr1 197195432       FALSE
>>>>>>>>> 12  chr2 181748087       FALSE
>>>>>>>>> 13  chr3 159599783       FALSE
>>>>>>>>> 14  chr4 155630120       FALSE
>>>>>>>>> 15  chr5 152537259       FALSE
>>>>>>>>> 16  chr6 149517037       FALSE
>>>>>>>>> 17  chr7 152524553       FALSE
>>>>>>>>> 18  chr8 131738871       FALSE
>>>>>>>>> 19  chr9 124076172       FALSE
>>>>>>>>> 20  chrM     16299        TRUE
>>>>>>>>> 21  chrX 166650296       FALSE
>>>>>>>>> 22  chrY  15902555       FALSE
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I ran it again without the "exonRankAttributeName" argument and
>>>>>>>>> I got:
>>>>>>>>> 
>>>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>>>>> + chrominfo = chrominfo,
>>>>>>>>> + dataSource = "UCSC",
>>>>>>>>> + species = "Mus musculus")
>>>>>>>>> extracting transcript information
>>>>>>>>> Estimating transcript ranges.
>>>>>>>>> Extracting gene IDs
>>>>>>>>> Processing splicing information for gtf file.
>>>>>>>>> Deducing exon rank from relative coordinates provided
>>>>>>>>> Prepare the 'metadata' data frame ... metadata: OK
>>>>>>>>> Error in .checkForeignKey(transcripts_tx_chrom, NA,
>>>>>>>>> "transcripts$tx_chrom",
>>>>>>>>> :
>>>>>>>>>     all the values in 'transcripts$tx_chrom' must be present in
>>>>>>>>> 'chrominfo$chrom'
>>>>>>>>> In addition: Warning message:
>>>>>>>>> In .deduceExonRankings(exs, format = "gtf") :
>>>>>>>>>     Infering Exon Rankings.  If this is not what you expected,
>>>>>>>>> then please
>>>>>>>>> be
>>>>>>>>> sure that you have provided a valid attribute for
>>>>>>>>> exonRankAttributeName
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Without the "chrominfo" argument I got the same error message
>>>>>>>>> as the first
>>>>>>>>> time:
>>>>>>>>> 
>>>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>>>>> + exonRankAttributeName = "exon_number",
>>>>>>>>> + dataSource = "UCSC",
>>>>>>>>> + species = "Mus musculus")
>>>>>>>>> 
>>>>>>>>> Error in data.frame(..., check.names = FALSE) :
>>>>>>>>>     arguments imply differing number of rows: 541775, 0
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Finally when I eliminated both the "exonRankAttributeName" and the
>>>>>>>>> "chrominfo" arguments it worked but the warning reminded me of the
>>>>>>>>> "exonRankAttributeName" argument and the chromosome names are
>>>>>>>>> now different
>>>>>>>>> from the ones in the genome file and there is no info on their
>>>>>>>>> length
>>>>>>>>> 
>>>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>>>>> + dataSource = "UCSC",
>>>>>>>>> + species = "Mus musculus")
>>>>>>>>> extracting transcript information
>>>>>>>>> Estimating transcript ranges.
>>>>>>>>> Extracting gene IDs
>>>>>>>>> Processing splicing information for gtf file.
>>>>>>>>> Deducing exon rank from relative coordinates provided
>>>>>>>>> Prepare the 'metadata' data frame ... metadata: OK
>>>>>>>>> Now generating chrominfo from available sequence names. No
>>>>>>>>> chromosome
>>>>>>>>> length
>>>>>>>>> information is available.
>>>>>>>>> Warning messages:
>>>>>>>>> 1: In .deduceExonRankings(exs, format = "gtf") :
>>>>>>>>>     Infering Exon Rankings.  If this is not what you expected,
>>>>>>>>> then please
>>>>>>>>> be
>>>>>>>>> sure that you have provided a valid attribute for
>>>>>>>>> exonRankAttributeName
>>>>>>>>> 2: In matchCircularity(chroms, circ_seqs) :
>>>>>>>>>     None of the strings in your circ_seqs argument match your
>>>>>>>>> seqnames.
>>>>>>>>> 
>>>>>>>>>> seqinfo(txdb)
>>>>>>>>> Seqinfo of length 32
>>>>>>>>> seqnames     seqlengths isCircular genome
>>>>>>>>> chr13              <NA>      FALSE <NA>
>>>>>>>>> chr9               <NA>      FALSE <NA>
>>>>>>>>> chr6               <NA>      FALSE <NA>
>>>>>>>>> chrX               <NA>      FALSE <NA>
>>>>>>>>> chr17              <NA>      FALSE <NA>
>>>>>>>>> chr2               <NA>      FALSE <NA>
>>>>>>>>> chr7               <NA>      FALSE <NA>
>>>>>>>>> chr18              <NA>      FALSE <NA>
>>>>>>>>> chr8               <NA>      FALSE <NA>
>>>>>>>>> ...                 ...        ...    ...
>>>>>>>>> chrY_random        <NA>      FALSE <NA>
>>>>>>>>> chrX_random        <NA>      FALSE <NA>
>>>>>>>>> chr5_random        <NA>      FALSE <NA>
>>>>>>>>> chr4_random        <NA>      FALSE <NA>
>>>>>>>>> chrY               <NA>      FALSE <NA>
>>>>>>>>> chr7_random        <NA>      FALSE <NA>
>>>>>>>>> chr17_random       <NA>      FALSE <NA>
>>>>>>>>> chr13_random       <NA>      FALSE <NA>
>>>>>>>>> chr1_random        <NA>      FALSE <NA>
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> What am I doing wrong in my original call to
>>>>>>>>> makeTranscriptDbFromGFF?
>>>>>>>>> 
>>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>>>>> exonRankAttributeName = "exon_number",
>>>>>>>>>                                  chrominfo = chrominfo,
>>>>>>>>>                                  dataSource = "UCSC",
>>>>>>>>>                                  species = "Mus musculus")
>>>>>>>>> 
>>>>>>>>> Why am I getting this unfair error message?
>>>>>>>>> Thank you for your help
>>>>>>>>> Ugo
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> FYI, this is my annFile (My gtf  annotation file was downloaded
>>>>>>>>> together
>>>>>>>>> with a fasta file containing the mouse genome from UCSC):
>>>>>>>>> 
>>>>>>>>>> annFile
>>>>>>>>> GRanges with 595632 ranges and 9 metadata columns:
>>>>>>>>>                 seqnames               ranges strand |   source
>>>>>>>>> type
>>>>>>>>> score     phase
>>>>>>>>>                    <Rle> <IRanges> <Rle>   | <factor>
>>>>>>>>> <factor>
>>>>>>>>> <numeric> <integer>
>>>>>>>>>          [1]        chr1   [3204563, 3207049]      - |  unknown
>>>>>>>>> exon
>>>>>>>>> <NA>      <NA>
>>>>>>>>>          [2]        chr1   [3206103, 3206105]      - |  unknown
>>>>>>>>> stop_codon
>>>>>>>>> <NA>      <NA>
>>>>>>>>>          [3]        chr1   [3206106, 3207049]      - |  unknown
>>>>>>>>> CDS
>>>>>>>>> <NA>         2
>>>>>>>>>          [4]        chr1   [3411783, 3411982]      - |  unknown
>>>>>>>>> CDS
>>>>>>>>> <NA>         1
>>>>>>>>>          [5]        chr1   [3411783, 3411982]      - |  unknown
>>>>>>>>> exon
>>>>>>>>> <NA>      <NA>
>>>>>>>>>          ...         ...                  ...    ... ...      ...
>>>>>>>>> ...
>>>>>>>>> ...       ...
>>>>>>>>>     [595628] chrY_random [54422360, 54422362]      + |  unknown
>>>>>>>>> stop_codon
>>>>>>>>> <NA>      <NA>
>>>>>>>>>     [595629] chrY_random [58501955, 58502946]      + |  unknown
>>>>>>>>> exon
>>>>>>>>> <NA>      <NA>
>>>>>>>>>     [595630] chrY_random [58502132, 58502812]      + |  unknown
>>>>>>>>> CDS
>>>>>>>>> <NA>         0
>>>>>>>>>     [595631] chrY_random [58502132, 58502134]      + |  unknown
>>>>>>>>> start_codon
>>>>>>>>> <NA>      <NA>
>>>>>>>>>     [595632] chrY_random [58502813, 58502815]      + |  unknown
>>>>>>>>> stop_codon
>>>>>>>>> <NA>      <NA>
>>>>>>>>>                   gene_id  transcript_id gene_name        p_id
>>>>>>>>> tss_id
>>>>>>>>>               <character>    <character> <character> <character>
>>>>>>>>> <character>
>>>>>>>>>          [1]         Xkr4   NM_001011874 Xkr4 P2739
>>>>>>>>> TSS1881
>>>>>>>>>          [2]         Xkr4   NM_001011874 Xkr4 P2739
>>>>>>>>> TSS1881
>>>>>>>>>          [3]         Xkr4   NM_001011874 Xkr4 P2739
>>>>>>>>> TSS1881
>>>>>>>>>          [4]         Xkr4   NM_001011874 Xkr4 P2739
>>>>>>>>> TSS1881
>>>>>>>>>          [5]         Xkr4   NM_001011874 Xkr4 P2739
>>>>>>>>> TSS1881
>>>>>>>>>          ...          ...            ... ... ...
>>>>>>>>> ...
>>>>>>>>>     [595628] LOC100039753   NM_001017394 LOC100039753      P10196
>>>>>>>>> TSS19491
>>>>>>>>>     [595629] LOC100039614 NM_001160137_4 LOC100039614      P22060
>>>>>>>>> TSS4342
>>>>>>>>>     [595630] LOC100039614 NM_001160137_4 LOC100039614      P22060
>>>>>>>>> TSS4342
>>>>>>>>>     [595631] LOC100039614 NM_001160137_4 LOC100039614      P22060
>>>>>>>>> TSS4342
>>>>>>>>>     [595632] LOC100039614 NM_001160137_4 LOC100039614      P22060
>>>>>>>>> TSS4342
>>>>>>>>>     ---
>>>>>>>>>     seqlengths:
>>>>>>>>>              chr1        chr10        chr11 chr12 ...
>>>>>>>>> chrX_random
>>>>>>>>> chrY  chrY_random
>>>>>>>>>                NA           NA           NA NA ...          NA
>>>>>>>>> NA           NA
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> sessionInfo()
>>>>>>>>> R version 3.0.0 (2013-04-03)
>>>>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>>>>>>> 
>>>>>>>>> locale:
>>>>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>>>>>>> 
>>>>>>>>> attached base packages:
>>>>>>>>> [1] parallel  stats     graphics  grDevices utils datasets
>>>>>>>>> methods
>>>>>>>>> base
>>>>>>>>> 
>>>>>>>>> other attached packages:
>>>>>>>>>    [1] rtracklayer_1.20.1     Rbowtie_1.0.2 Rsamtools_1.12.2
>>>>>>>>> Biostrings_2.28.0
>>>>>>>>>    [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0
>>>>>>>>> GenomicRanges_1.12.2
>>>>>>>>>    [9] IRanges_1.18.0         BiocGenerics_0.6.0
>>>>>>>>> 
>>>>>>>>> loaded via a namespace (and not attached):
>>>>>>>>>    [1] BiocInstaller_1.10.0 biomaRt_2.16.0 bitops_1.0-5
>>>>>>>>> BSgenome_1.28.0
>>>>>>>>>    [5] DBI_0.2-5            grid_3.0.0 hwriter_1.3
>>>>>>>>> lattice_0.20-15
>>>>>>>>>    [9] RCurl_1.95-4.1       RSQLite_0.11.3 ShortRead_1.18.0
>>>>>>>>> stats4_3.0.0
>>>>>>>>> [13] tools_3.0.0          XML_3.95-0.2 zlibbioc_1.6.0
>>>>>>>>> 
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioconductor mailing list
>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>> Search the archives:
>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>> _______________________________________________
>>>>>>>> Bioconductor mailing list
>>>>>>>> Bioconductor at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>> Search the archives:
>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>> 
>>> 
>>> 
>>> ----------------------------------------------------------------
>>> This message was sent using IMP, the Internet Messaging Program.
>>> 
>>> 
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor