[BioC] Finding coding SNPs with predictCoding

Valerie Obenchain vobencha at fhcrc.org
Sat Mar 3 00:31:57 CET 2012


Slight change to this -

I'm now returning the following new columns,

   \item{\code{seqTxLoc}}{
       Location in transcript-based coordinates of the first nucleotide in
       the codon sequence to be translated. This position corresponds to the
       first nucleotide in both the \code{refSeq} and \code{varSeq} columns.
     }
     \item{\code{varTxLoc}}{
       Location in transcript-based coordinates of the first nucleotide in
       the variant. This value will be the same as \code{seqTxLoc} when the
       variant starts exactly at the beginning of a codon.
     }
     \item{\code{varCdsLoc}}{
       Location in cds-based coordinates of the first nucleotide in
       the variant. This position is relative to the start of the cds region
       defined in the \code{subject} annotation.
     }
     \item{\code{subjStrand}}{
       The strand of the \code{subject} the variant matched. 
\code{predictCoding}
       determines which variants fall in a coding region by finding the 
overlaps
       between the \code{query} and \code{subject}. The \code{query} may be
       un-stranded \sQuote{*} but the \code{subject} annotation will 
have a strand.
     }


You are interested in 'protein coordinates'. Does 'varCdsLoc' described 
above meet the need or are you looking for the actual codon number in 
the coding sequence? I am interested in hearing more about what you are 
doing with the protein coordinates, how you are using them. It would 
help us better design future functions.

Thanks,
Valerie

On 03/02/2012 01:11 AM, Alex Gutteridge wrote:
> Thanks Valerie - much appreciated!
>
> On 01.03.2012 21:30, Valerie Obenchain wrote:
>> A 'txLoc' column has been added to the output of predictCoding.
>> Available in devel version 1.1.57.
>>
>> Valerie
>>
>>
>> On 02/28/2012 08:20 AM, Valerie Obenchain wrote:
>>> Good suggestion. Yes, predictCoding is does this internally. I'll 
>>> post back here when this has been added.
>>>
>>> Valerie
>>>
>>>
>>>
>>> On 02/28/2012 01:49 AM, Alex Gutteridge wrote:
>>>> Hi Valerie,
>>>>
>>>> Thanks everything works great now. One small feature request - 
>>>> would it be hard to output the protein sequence position of the 
>>>> coding SNPs? At the moment once I've run predictCoding I'm 
>>>> re-extracting the cds and working out the position of each coding 
>>>> SNP so I can see where in the protein sequence it is, but it seems 
>>>> like this is probably just replicating what predictCoding must be 
>>>> doing internally anyway?
>>>>
>>>> Alex Gutteridge
>>>
>>>
>>> On 02/24/2012 10:39 AM, Valerie Obenchain wrote:
>>>> Hi Alex,
>>>>
>>>> Thanks for the bug report. The cdsID was taken from an overlap 
>>>> between the query and GRangesList of cds by transcripts. This gave 
>>>> the correct transcript number but (incorrectly) took the first cds 
>>>> number in the list by default. Now fixed in devel 1.1.55.
>>>>
>>>> I've also updated the man page.
>>>>
>>>> Valerie
>>>>
>>>>
>>>>
>>>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote:
>>>>> On 22.02.2012 18:58, Hervé Pagès wrote:
>>>>>> Hi Alex,
>>>>>>
>>>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>>> But the predictCoding call gives this error:
>>>>>>>
>>>>>>> Error in .setSeqNames(x, value) :
>>>>>>> The replacement value for isActiveSeq must be a logical vector, 
>>>>>>> with
>>>>>>> names that match the seqlevels of the object
>>>>>>
>>>>>> The error message doesn't help much but I think the pb is that you
>>>>>> didn't rename chMT properly. Try to do this:
>>>>>>
>>>>>>   seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>>>>>>
>>>>>> before you start the for(eg in entrez.ids){..} loop again.
>>>>>>
>>>>>> Cheers,
>>>>>> H.
>>>>>
>>>>> Thanks Hervé that nailed it. I'm having some difficulty joining up 
>>>>> the output of predictCoding() with the query SNPs though. If 
>>>>> someone could point out where the disconnect in my thinking is I 
>>>>> would appreciate it!
>>>>>
>>>>> Here's my (now edited down) script:
>>>>>
>>>>> library(BSgenome.Hsapiens.UCSC.hg19)
>>>>> library(VariantAnnotation)
>>>>> library(SNPlocs.Hsapiens.dbSNP.20110815)
>>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>>>
>>>>> entrez.ids = c('6335')
>>>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene
>>>>>
>>>>> snps   = getSNPlocs(c("ch1","ch2"),as.GRanges=T)
>>>>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps))
>>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>>>>>
>>>>> gene.list = cdsBy(txdb19, by="gene")
>>>>> vsd.list = gene.list[entrez.ids]
>>>>> cds.list = cdsBy(txdb19,by="tx")
>>>>>
>>>>> eg = entrez.ids[1]
>>>>>
>>>>> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]])))
>>>>> eg.snps = snps[snp.idx]
>>>>> iupac   = values(eg.snps)[,"alleles_as_ambig"]
>>>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac]))
>>>>> variant.alleles = 
>>>>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[[1]]) 
>>>>>
>>>>>
>>>>>
>>>>> aa = 
>>>>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=variant.alleles)
>>>>>
>>>>> #####
>>>>>
>>>>> Then if I query the predictCoding results in aa in an interactive 
>>>>> session I get the following (see inline comments for what I think 
>>>>> should be happening, but I must be misinterpreting what queryID 
>>>>> means)
>>>>>
>>>>> The docs for predictCoding() contain a small typo 
>>>>> (s/queryHits/queryID), but otherwise seem clear?
>>>>>
>>>>> Columns include ‘queryID’, ‘consequence’, ‘refSeq’, ‘varSeq’,
>>>>>      ‘refAA’, ‘varAA’, ‘txID’, ‘geneID’, and ‘cdsID’.
>>>>>
>>>>>      ‘queryHits’ The ‘queryHits’ column provides a map back to the
>>>>>           variants in the original ‘query’. If the ‘query’ was a 
>>>>> ‘VCF’
>>>>>           object this index corresponds to the row in the 
>>>>> ‘GRanges’ in
>>>>>           the ‘rowData’ slot. If ‘query’ was an expanded ‘GRanges’,
>>>>>           ‘RangedData’ or ‘RangesList’ the index corresponds to 
>>>>> the row
>>>>>           in the expanded object.
>>>>>
>>>>> #####
>>>>>
>>>>>> aa[1,]
>>>>> DataFrame with 1 row and 9 columns
>>>>>     queryID   consequence         refSeq         varSeq         refAA
>>>>> <integer> <factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
>>>>> 1         1 nonsynonymous            CTC            ATC            L
>>>>>           varAA        txID   geneID     cdsID
>>>>> <AAStringSet> <character> <factor> <integer>
>>>>> 1             I       10921     6335     33668
>>>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx 
>>>>>> '10921' and cds '33668'.
>>>>>> #If I look at the first query SNP I get this:
>>>>>> eg.snps.exp[aa[1,'queryID'],]
>>>>> GRanges with 1 range and 2 elementMetadata values:
>>>>>       seqnames                 ranges strand |   RefSNP_id 
>>>>> alleles_as_ambig
>>>>> <Rle> <IRanges> <Rle> | <character> <character>
>>>>>   [1]     chr2 [167055370, 167055370]      * |   111558968         
>>>>>      R
>>>>>   ---
>>>>>   seqlengths:
>>>>>     chr1  chr2  chr3  chr4  chr5  chr6 ... chr20 chr21 chr22  chrX 
>>>>> chrY  chrM
>>>>>       NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA 
>>>>>   NA    NA
>>>>>> #So SNP 1 is at 167055370 on chr2
>>>>>> #But if I check tx '10921' I see that the cds overlapping 
>>>>>> 167055370 is actually '33651'
>>>>>> #And cds '33668' is at the other end of the tx:
>>>>>> cds.list[[aa[1,'txID']]]
>>>>> GRanges with 26 ranges and 3 elementMetadata values:
>>>>>        seqnames                 ranges strand   |    cds_id    
>>>>> cds_name
>>>>> <Rle> <IRanges> <Rle>   | <integer> <character>
>>>>>    [1]     chr2 [167168009, 167168266]      -   |     33668 <NA>
>>>>>    [2]     chr2 [167163466, 167163584]      -   |     33667 <NA>
>>>>>    [3]     chr2 [167163020, 167163109]      -   |     33666 <NA>
>>>>>    [4]     chr2 [167162302, 167162430]      -   |     33647 <NA>
>>>>>    [5]     chr2 [167160748, 167160839]      -   |     33646 <NA>
>>>>>    [6]     chr2 [167159600, 167159812]      -   |     33645 <NA>
>>>>>    [7]     chr2 [167151109, 167151172]      -   |     33644 <NA>
>>>>>    [8]     chr2 [167149741, 167149882]      -   |     33643 <NA>
>>>>>    [9]     chr2 [167144947, 167145153]      -   |     33642 <NA>
>>>>>    ...      ...                    ...    ... ...       ...        
>>>>> ...
>>>>>   [18]     chr2 [167099012, 167099166]      -   |     33659 <NA>
>>>>>   [19]     chr2 [167094604, 167094777]      -   |     33658 <NA>
>>>>>   [20]     chr2 [167089850, 167089972]      -   |     33657 <NA>
>>>>>   [21]     chr2 [167085201, 167085482]      -   |     33656 <NA>
>>>>>   [22]     chr2 [167084180, 167084233]      -   |     33655 <NA>
>>>>>   [23]     chr2 [167083077, 167083214]      -   |     33654 <NA>
>>>>>   [24]     chr2 [167060870, 167060974]      -   |     33653 <NA>
>>>>>   [25]     chr2 [167060465, 167060735]      -   |     33652 <NA>
>>>>>   [26]     chr2 [167055182, 167056374]      -   |     33651 <NA>
>>>>>        exon_rank
>>>>> <integer>
>>>>>    [1]         2
>>>>>    [2]         3
>>>>>    [3]         4
>>>>>    [4]         5
>>>>>    [5]         6
>>>>>    [6]         7
>>>>>    [7]         8
>>>>>    [8]         9
>>>>>    [9]        10
>>>>>    ...       ...
>>>>>   [18]        19
>>>>>   [19]        20
>>>>>   [20]        21
>>>>>   [21]        22
>>>>>   [22]        23
>>>>>   [23]        24
>>>>>   [24]        25
>>>>>   [25]        26
>>>>>   [26]        27
>>>>>   ---
>>>>>   seqlengths:
>>>>>                     chr1                  chr2 ... 
>>>>> chr18_gl000207_random
>>>>>                249250621             243199373 ...                 
>>>>> 4262
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: 
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list