[BioC] MakeTranscriptDbFromGFF

Fri May 3 13:00:24 CEST 2013

Dear Carl,
Thank you very much; it makes sense now.

To quantify gene expression of my RNASeq samples I  use your
TxDb.Mmusculus.UCSC.mm9.knownGene annotation together with the alignments to
the BSgenome.Mmusculus.UCSC.mm9 genome.
Now that I used the UCSC mm9 mouse genome from the Illumina iGenomes I
wanted to use their annotation.

Anyway, I have now a general question.
For the mouse mm9 genome (genome.fa) obtained from Illumina iGenomes
> genome<- scanFaIndex('genome.fa')
> seqlengths(genome)
    chr10     chr11     chr12     chr13     chr14     chr15     chr16
chr17     chr18     chr19
129993255 121843856 121257530 120284312 125194864 103494974  98319150
95272651  90772031  61342430
     chr1      chr2      chr3      chr4      chr5      chr6      chr7
chr8      chr9      chrM
197195432 181748087 159599783 155630120 152537259 149517037 152524553
131738871 124076172     16299
     chrX      chrY
166650296  15902555

For your TxDb.Mmusculus.UCSC.mm9.knownGene
> seqlengths( TxDb.Mmusculus.UCSC.mm9.knownGene)
        chr1         chr2         chr3         chr4         chr5
chr6         chr7         chr8         chr9
   197195432    181748087    159599783    155630120    152537259
149517037    152524553    131738871    124076172
       chr10        chr11        chr12        chr13        chr14
chr15        chr16        chr17        chr18
   129993255    121843856    121257530    120284312    125194864
103494974     98319150     95272651     90772031
       chr19         chrX         chrY         chrM  chr1_random
chr3_random  chr4_random  chr5_random  chr7_random
    61342430    166650296     15902555        16299      1231697
41899       160594       357350       362490
 chr8_random  chr9_random chr13_random chr16_random chr17_random
chrX_random  chrY_random chrUn_random
      849593       449403       400311         3994       628739
1785075     58682461      5900358

So. I want to filter out the '_random' stuff; is this the only and right way
to do it?
> ann<- TxDb.Mmusculus.UCSC.mm9.knownGene
> isActiveSeq(ann)[seqlevels(ann)] <- FALSE
> isActiveSeq(ann) <- c("chr10"=TRUE, "chr11"=TRUE,
+                       "chr12"=TRUE,"chr13"=TRUE,
+                       "chr14"=TRUE,"chr15"=TRUE,
+                       "chr16"=TRUE, "chr17"=TRUE,
+                       "chr18"=TRUE, "chr19"=TRUE,
+                       "chr1"=TRUE, "chr2"=TRUE,
+                       "chr3"=TRUE, "chr4"=TRUE,
+                       "chr5"=TRUE, "chr6"=TRUE,
+                       "chr7"=TRUE, "chr8"=TRUE,
+                       "chr9"=TRUE, "chrM"=TRUE,
+                       "chrX"=TRUE, "chrY"=TRUE)
> 
Then get my gene info this way?
> genesInfo<- exons(ann, columns='gene_id')

How can I add also gene names or symbols?

Thank you again for your help.

Ugo

> From: Marc Carlson <mcarlson at fhcrc.org>
> Date: Thu, 02 May 2013 15:52:38 -0700
> To: Ugo Borello <ugo.borello at inserm.fr>
> Cc: <bioconductor at r-project.org>
> Subject: Re: [BioC] MakeTranscriptDbFromGFF
> 
> Hi Ugo,
> 
> The 15 GB tarball you sent me to contains several different GTF files
> for genes.  I grabbed this one as it seemed to be the most recent:
> 
> Mus_musculus/UCSC/mm9/Annotation/Archives/archive-2013-03-06-15-01-24/Genes/ge
> nes.gtf
> 
> So looking at this file I can reproduce the problem you mentioned.  And
> it shows me two problems.  The 1st problem is that the only field that
> seems to contain any information about exon positions is called phase
> (and not "exon_number" as was in the code I see from before).  There is
> not actually any field called "exon_number" in this file.  Either way,
> one thing you can check is to make sure that the string you give here is
> the same as the appropriate field name that is used by the file.  There
> is no way to know this information in advance since GTF does not specify
> how to encode this information (and in fact the information is entirely
> optional).
> 
> The second problem is that even "phase" can't work right now since the
> authors of this gtf file have decided to only associate the exon rank
> information only with CDS and never with exons features.  So there is
> not any actual 'exon' position information in this file, only
> information for CDS positions.  Now that I see people doing these files
> in this way, I plan to enhance the parser so that it can process files
> of this kind.
> 
> 
> Is there a reason why you wanted to use this file and not the data
> contained in this package here?
> 
> http://www.bioconductor.org/packages/2.12/data/annotation/html/TxDb.Mmusculus.
> UCSC.mm9.knownGene.html
> 
> 
>    Marc
> 
> 
> 
> On 05/02/2013 02:59 AM, Ugo Borello wrote:
>> Dear Marc
>> Sorry I was not precise on the origin of the gtf annotation file; I got the
>> gtf file from here:
>> http://tophat.cbcb.umd.edu/igenomes.shtml
>> 
>> And more precisely from the Mus musculus/UCSC/mm9 folder
>> Here the description of the content of the folder:
>> ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/README.txt
>> 
>> I realized reading the README.txt file that actually the genes.gtf file I
>> used is the Ensembl annotation of the mm9 release.
>> So, I changed dataSource = "Ensembl" in the function call and I got the same
>> error message:
>> Error in data.frame(..., check.names = FALSE) :
>>    arguments imply differing number of rows: 541775, 0
>> 
>> At the end of my previous email you have the result of calling:
>> annFile<- import.gff('genes.gtf', format='gtf', asRangedData=FALSE)
>> 
>> 
>> Thank you
>> 
>> Ugo
>> 
>>> From: Marc Carlson <mcarlson at fhcrc.org>
>>> Date: Wed, 01 May 2013 15:33:02 -0700
>>> To: <bioconductor at r-project.org>
>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF
>>> 
>>> Hi Ugo,
>>> 
>>> Which UCSC file was it that you were trying to process?
>>> 
>>> 
>>>     Marc
>>> 
>>> 
>>> 
>>> On 05/01/2013 02:21 AM, Ugo Borello wrote:
>>>> Good morning,
>>>> 
>>>> I have a little problem creating a TranscriptDb object using the function
>>>> makeTranscriptDbFromGFF. I want to use this annotation to count the
>>>> overlaps
>>>> of my genomic alignments with genes.
>>>> 
>>>> 
>>>> I ran:
>>>> 
>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>> + exonRankAttributeName = "exon_number",
>>>> + chrominfo = chrominfo,
>>>> + dataSource = "UCSC",
>>>> + species = "Mus musculus")
>>>> 
>>>> And I got this error message:
>>>> Error in data.frame(..., check.names = FALSE) :
>>>>     arguments imply differing number of rows: 541775, 0
>>>> 
>>>> 
>>>> "chrominfo" was (info retrieved from the fasta genome file):
>>>> 
>>>>> chrominfo
>>>>      chrom    length is_circular
>>>> 1  chr10 129993255       FALSE
>>>> 2  chr11 121843856       FALSE
>>>> 3  chr12 121257530       FALSE
>>>> 4  chr13 120284312       FALSE
>>>> 5  chr14 125194864       FALSE
>>>> 6  chr15 103494974       FALSE
>>>> 7  chr16  98319150       FALSE
>>>> 8  chr17  95272651       FALSE
>>>> 9  chr18  90772031       FALSE
>>>> 10 chr19  61342430       FALSE
>>>> 11  chr1 197195432       FALSE
>>>> 12  chr2 181748087       FALSE
>>>> 13  chr3 159599783       FALSE
>>>> 14  chr4 155630120       FALSE
>>>> 15  chr5 152537259       FALSE
>>>> 16  chr6 149517037       FALSE
>>>> 17  chr7 152524553       FALSE
>>>> 18  chr8 131738871       FALSE
>>>> 19  chr9 124076172       FALSE
>>>> 20  chrM     16299        TRUE
>>>> 21  chrX 166650296       FALSE
>>>> 22  chrY  15902555       FALSE
>>>> 
>>>> 
>>>> I ran it again without the "exonRankAttributeName" argument and I got:
>>>> 
>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>> + chrominfo = chrominfo,
>>>> + dataSource = "UCSC",
>>>> + species = "Mus musculus")
>>>> extracting transcript information
>>>> Estimating transcript ranges.
>>>> Extracting gene IDs
>>>> Processing splicing information for gtf file.
>>>> Deducing exon rank from relative coordinates provided
>>>> Prepare the 'metadata' data frame ... metadata: OK
>>>> Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom",
>>>> :
>>>>     all the values in 'transcripts$tx_chrom' must be present in
>>>> 'chrominfo$chrom'
>>>> In addition: Warning message:
>>>> In .deduceExonRankings(exs, format = "gtf") :
>>>>     Infering Exon Rankings.  If this is not what you expected, then please
>>>> be
>>>> sure that you have provided a valid attribute for exonRankAttributeName
>>>> 
>>>> 
>>>> Without the "chrominfo" argument I got the same error message as the first
>>>> time:
>>>> 
>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>> + exonRankAttributeName = "exon_number",
>>>> + dataSource = "UCSC",
>>>> + species = "Mus musculus")
>>>> 
>>>> Error in data.frame(..., check.names = FALSE) :
>>>>     arguments imply differing number of rows: 541775, 0
>>>> 
>>>> 
>>>> Finally when I eliminated both the "exonRankAttributeName" and the
>>>> "chrominfo" arguments it worked but the warning reminded me of the
>>>> "exonRankAttributeName" argument and the chromosome names are now different
>>>> from the ones in the genome file and there is no info on their length
>>>> 
>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>> + dataSource = "UCSC",
>>>> + species = "Mus musculus")
>>>> extracting transcript information
>>>> Estimating transcript ranges.
>>>> Extracting gene IDs
>>>> Processing splicing information for gtf file.
>>>> Deducing exon rank from relative coordinates provided
>>>> Prepare the 'metadata' data frame ... metadata: OK
>>>> Now generating chrominfo from available sequence names. No chromosome
>>>> length
>>>> information is available.
>>>> Warning messages:
>>>> 1: In .deduceExonRankings(exs, format = "gtf") :
>>>>     Infering Exon Rankings.  If this is not what you expected, then please
>>>> be
>>>> sure that you have provided a valid attribute for exonRankAttributeName
>>>> 2: In matchCircularity(chroms, circ_seqs) :
>>>>     None of the strings in your circ_seqs argument match your seqnames.
>>>> 
>>>>> seqinfo(txdb)
>>>> Seqinfo of length 32
>>>> seqnames     seqlengths isCircular genome
>>>> chr13              <NA>      FALSE   <NA>
>>>> chr9               <NA>      FALSE   <NA>
>>>> chr6               <NA>      FALSE   <NA>
>>>> chrX               <NA>      FALSE   <NA>
>>>> chr17              <NA>      FALSE   <NA>
>>>> chr2               <NA>      FALSE   <NA>
>>>> chr7               <NA>      FALSE   <NA>
>>>> chr18              <NA>      FALSE   <NA>
>>>> chr8               <NA>      FALSE   <NA>
>>>> ...                 ...        ...    ...
>>>> chrY_random        <NA>      FALSE   <NA>
>>>> chrX_random        <NA>      FALSE   <NA>
>>>> chr5_random        <NA>      FALSE   <NA>
>>>> chr4_random        <NA>      FALSE   <NA>
>>>> chrY               <NA>      FALSE   <NA>
>>>> chr7_random        <NA>      FALSE   <NA>
>>>> chr17_random       <NA>      FALSE   <NA>
>>>> chr13_random       <NA>      FALSE   <NA>
>>>> chr1_random        <NA>      FALSE   <NA>
>>>> 
>>>> 
>>>> 
>>>> 
>>>> What am I doing wrong in my original call to makeTranscriptDbFromGFF?
>>>> 
>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>                                  exonRankAttributeName = "exon_number",
>>>>                                  chrominfo = chrominfo,
>>>>                                  dataSource = "UCSC",
>>>>                                  species = "Mus musculus")
>>>> 
>>>> Why am I getting this unfair error message?
>>>> Thank you for your help
>>>> Ugo
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> FYI, this is my annFile (My gtf  annotation file was downloaded together
>>>> with a fasta file containing the mouse genome from UCSC):
>>>> 
>>>>> annFile
>>>> GRanges with 595632 ranges and 9 metadata columns:
>>>>                 seqnames               ranges strand   |   source
>>>> type
>>>> score     phase
>>>>                    <Rle>            <IRanges>  <Rle>   | <factor>
>>>> <factor>
>>>> <numeric> <integer>
>>>>          [1]        chr1   [3204563, 3207049]      -   |  unknown
>>>> exon
>>>> <NA>      <NA>
>>>>          [2]        chr1   [3206103, 3206105]      -   |  unknown
>>>> stop_codon
>>>> <NA>      <NA>
>>>>          [3]        chr1   [3206106, 3207049]      -   |  unknown
>>>> CDS
>>>> <NA>         2
>>>>          [4]        chr1   [3411783, 3411982]      -   |  unknown
>>>> CDS
>>>> <NA>         1
>>>>          [5]        chr1   [3411783, 3411982]      -   |  unknown
>>>> exon
>>>> <NA>      <NA>
>>>>          ...         ...                  ...    ... ...      ...
>>>> ...
>>>> ...       ...
>>>>     [595628] chrY_random [54422360, 54422362]      +   |  unknown
>>>> stop_codon
>>>> <NA>      <NA>
>>>>     [595629] chrY_random [58501955, 58502946]      +   |  unknown
>>>> exon
>>>> <NA>      <NA>
>>>>     [595630] chrY_random [58502132, 58502812]      +   |  unknown
>>>> CDS
>>>> <NA>         0
>>>>     [595631] chrY_random [58502132, 58502134]      +   |  unknown
>>>> start_codon
>>>> <NA>      <NA>
>>>>     [595632] chrY_random [58502813, 58502815]      +   |  unknown
>>>> stop_codon
>>>> <NA>      <NA>
>>>>                   gene_id  transcript_id    gene_name        p_id
>>>> tss_id
>>>>               <character>    <character>  <character> <character>
>>>> <character>
>>>>          [1]         Xkr4   NM_001011874         Xkr4       P2739
>>>> TSS1881
>>>>          [2]         Xkr4   NM_001011874         Xkr4       P2739
>>>> TSS1881
>>>>          [3]         Xkr4   NM_001011874         Xkr4       P2739
>>>> TSS1881
>>>>          [4]         Xkr4   NM_001011874         Xkr4       P2739
>>>> TSS1881
>>>>          [5]         Xkr4   NM_001011874         Xkr4       P2739
>>>> TSS1881
>>>>          ...          ...            ...          ...         ...
>>>> ...
>>>>     [595628] LOC100039753   NM_001017394 LOC100039753      P10196
>>>> TSS19491
>>>>     [595629] LOC100039614 NM_001160137_4 LOC100039614      P22060
>>>> TSS4342
>>>>     [595630] LOC100039614 NM_001160137_4 LOC100039614      P22060
>>>> TSS4342
>>>>     [595631] LOC100039614 NM_001160137_4 LOC100039614      P22060
>>>> TSS4342
>>>>     [595632] LOC100039614 NM_001160137_4 LOC100039614      P22060
>>>> TSS4342
>>>>     ---
>>>>     seqlengths:
>>>>              chr1        chr10        chr11        chr12 ...  chrX_random
>>>> chrY  chrY_random
>>>>                NA           NA           NA           NA ...           NA
>>>> NA           NA
>>>> 
>>>> 
>>>>> sessionInfo()
>>>> R version 3.0.0 (2013-04-03)
>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>> 
>>>> locale:
>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>> 
>>>> attached base packages:
>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>> base
>>>> 
>>>> other attached packages:
>>>>    [1] rtracklayer_1.20.1     Rbowtie_1.0.2          Rsamtools_1.12.2
>>>> Biostrings_2.28.0
>>>>    [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1   Biobase_2.20.0
>>>> GenomicRanges_1.12.2
>>>>    [9] IRanges_1.18.0         BiocGenerics_0.6.0
>>>> 
>>>> loaded via a namespace (and not attached):
>>>>    [1] BiocInstaller_1.10.0 biomaRt_2.16.0       bitops_1.0-5
>>>> BSgenome_1.28.0
>>>>    [5] DBI_0.2-5            grid_3.0.0           hwriter_1.3
>>>> lattice_0.20-15
>>>>    [9] RCurl_1.95-4.1       RSQLite_0.11.3       ShortRead_1.18.0
>>>> stats4_3.0.0
>>>> [13] tools_3.0.0          XML_3.95-0.2         zlibbioc_1.6.0
>>>> 
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>