[BioC] Finding coding SNPs with predictCoding

Mon Mar 5 10:25:37 CET 2012

Hi Valerie,

I'd exactly echo what Thomas wrote (thanks Thomas!). My specific use 
case is mapping coding SNPs to PDB protein structures, so something that 
encodes 'His88 (CAT) -> Gln88 (CAA)' is all I need.

Alex Gutteridge.

On 03.03.2012 00:58, Thomas Girke wrote:
> Hi Valerie,
>
> Based on my experience the position in the complete protein (rather 
> than
> CDS) sequence would be the most important piece of information to 
> record
> here. For instance, if a SNP changes the 88th triplet CAT (His) in 
> the
> ORF of a transcript to CAA (Gln) then you want to record it like 
> this:
> His88 (CAT) -> Gln88 (CAA). This way the user can map this change to 
> a
> protein structure or inject it into the corresponding protein 
> sequence
> without any additional remapping or coding.
>
> Another feature to consider are SNPs affecting splice sites (commonly
> first and last two nucleotides of an intron).
>
> If possible it would be also very useful to support users who want to
> work with their custom genomes and annotations files provided as 
> FASTA
> and GFF/GTF files, respectively.
>
> Best,
>
> Thomas
>
>
> On Fri, Mar 02, 2012 at 11:31:57PM +0000, Valerie Obenchain wrote:
>> Slight change to this -
>>
>> I'm now returning the following new columns,
>>
>>    \item{\code{seqTxLoc}}{
>>        Location in transcript-based coordinates of the first 
>> nucleotide in
>>        the codon sequence to be translated. This position 
>> corresponds to the
>>        first nucleotide in both the \code{refSeq} and \code{varSeq} 
>> columns.
>>      }
>>      \item{\code{varTxLoc}}{
>>        Location in transcript-based coordinates of the first 
>> nucleotide in
>>        the variant. This value will be the same as \code{seqTxLoc} 
>> when the
>>        variant starts exactly at the beginning of a codon.
>>      }
>>      \item{\code{varCdsLoc}}{
>>        Location in cds-based coordinates of the first nucleotide in
>>        the variant. This position is relative to the start of the 
>> cds region
>>        defined in the \code{subject} annotation.
>>      }
>>      \item{\code{subjStrand}}{
>>        The strand of the \code{subject} the variant matched.
>> \code{predictCoding}
>>        determines which variants fall in a coding region by finding 
>> the
>> overlaps
>>        between the \code{query} and \code{subject}. The \code{query} 
>> may be
>>        un-stranded \sQuote{*} but the \code{subject} annotation will
>> have a strand.
>>      }
>>
>>
>> You are interested in 'protein coordinates'. Does 'varCdsLoc' 
>> described
>> above meet the need or are you looking for the actual codon number 
>> in
>> the coding sequence? I am interested in hearing more about what you 
>> are
>> doing with the protein coordinates, how you are using them. It would
>> help us better design future functions.
>>
>> Thanks,
>> Valerie
>>
>> On 03/02/2012 01:11 AM, Alex Gutteridge wrote:
>> > Thanks Valerie - much appreciated!
>> >
>> > On 01.03.2012 21:30, Valerie Obenchain wrote:
>> >> A 'txLoc' column has been added to the output of predictCoding.
>> >> Available in devel version 1.1.57.
>> >>
>> >> Valerie
>> >>
>> >>
>> >> On 02/28/2012 08:20 AM, Valerie Obenchain wrote:
>> >>> Good suggestion. Yes, predictCoding is does this internally. 
>> I'll
>> >>> post back here when this has been added.
>> >>>
>> >>> Valerie
>> >>>
>> >>>
>> >>>
>> >>> On 02/28/2012 01:49 AM, Alex Gutteridge wrote:
>> >>>> Hi Valerie,
>> >>>>
>> >>>> Thanks everything works great now. One small feature request -
>> >>>> would it be hard to output the protein sequence position of the
>> >>>> coding SNPs? At the moment once I've run predictCoding I'm
>> >>>> re-extracting the cds and working out the position of each 
>> coding
>> >>>> SNP so I can see where in the protein sequence it is, but it 
>> seems
>> >>>> like this is probably just replicating what predictCoding must 
>> be
>> >>>> doing internally anyway?
>> >>>>
>> >>>> Alex Gutteridge
>> >>>
>> >>>
>> >>> On 02/24/2012 10:39 AM, Valerie Obenchain wrote:
>> >>>> Hi Alex,
>> >>>>
>> >>>> Thanks for the bug report. The cdsID was taken from an overlap
>> >>>> between the query and GRangesList of cds by transcripts. This 
>> gave
>> >>>> the correct transcript number but (incorrectly) took the first 
>> cds
>> >>>> number in the list by default. Now fixed in devel 1.1.55.
>> >>>>
>> >>>> I've also updated the man page.
>> >>>>
>> >>>> Valerie
>> >>>>
>> >>>>
>> >>>>
>> >>>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote:
>> >>>>> On 22.02.2012 18:58, Herv? Pag?s wrote:
>> >>>>>> Hi Alex,
>> >>>>>>
>> >>>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote:
>> >>>>>
>> >>>>> [...]
>> >>>>>
>> >>>>>>> But the predictCoding call gives this error:
>> >>>>>>>
>> >>>>>>> Error in .setSeqNames(x, value) :
>> >>>>>>> The replacement value for isActiveSeq must be a logical 
>> vector,
>> >>>>>>> with
>> >>>>>>> names that match the seqlevels of the object
>> >>>>>>
>> >>>>>> The error message doesn't help much but I think the pb is 
>> that you
>> >>>>>> didn't rename chMT properly. Try to do this:
>> >>>>>>
>> >>>>>>   seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>> >>>>>>
>> >>>>>> before you start the for(eg in entrez.ids){..} loop again.
>> >>>>>>
>> >>>>>> Cheers,
>> >>>>>> H.
>> >>>>>
>> >>>>> Thanks Herv? that nailed it. I'm having some difficulty 
>> joining up
>> >>>>> the output of predictCoding() with the query SNPs though. If
>> >>>>> someone could point out where the disconnect in my thinking is 
>> I
>> >>>>> would appreciate it!
>> >>>>>
>> >>>>> Here's my (now edited down) script:
>> >>>>>
>> >>>>> library(BSgenome.Hsapiens.UCSC.hg19)
>> >>>>> library(VariantAnnotation)
>> >>>>> library(SNPlocs.Hsapiens.dbSNP.20110815)
>> >>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>> >>>>>
>> >>>>> entrez.ids = c('6335')
>> >>>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene
>> >>>>>
>> >>>>> snps   = getSNPlocs(c("ch1","ch2"),as.GRanges=T)
>> >>>>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps))
>> >>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>> >>>>>
>> >>>>> gene.list = cdsBy(txdb19, by="gene")
>> >>>>> vsd.list = gene.list[entrez.ids]
>> >>>>> cds.list = cdsBy(txdb19,by="tx")
>> >>>>>
>> >>>>> eg = entrez.ids[1]
>> >>>>>
>> >>>>> snp.idx = unique(queryHits(findOverlaps(snps, 
>> vsd.list[[eg]])))
>> >>>>> eg.snps = snps[snp.idx]
>> >>>>> iupac   = values(eg.snps)[,"alleles_as_ambig"]
>> >>>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac]))
>> >>>>> variant.alleles =
>> >>>>> 
>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[[1]])
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> aa =
>> >>>>> 
>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=variant.alleles)
>> >>>>>
>> >>>>> #####
>> >>>>>
>> >>>>> Then if I query the predictCoding results in aa in an 
>> interactive
>> >>>>> session I get the following (see inline comments for what I 
>> think
>> >>>>> should be happening, but I must be misinterpreting what 
>> queryID
>> >>>>> means)
>> >>>>>
>> >>>>> The docs for predictCoding() contain a small typo
>> >>>>> (s/queryHits/queryID), but otherwise seem clear?
>> >>>>>
>> >>>>> Columns include ?queryID?, ?consequence?, ?refSeq?, ?varSeq?,
>> >>>>>      ?refAA?, ?varAA?, ?txID?, ?geneID?, and ?cdsID?.
>> >>>>>
>> >>>>>      ?queryHits? The ?queryHits? column provides a map back to 
>> the
>> >>>>>           variants in the original ?query?. If the ?query? was 
>> a
>> >>>>> ?VCF?
>> >>>>>           object this index corresponds to the row in the
>> >>>>> ?GRanges? in
>> >>>>>           the ?rowData? slot. If ?query? was an expanded 
>> ?GRanges?,
>> >>>>>           ?RangedData? or ?RangesList? the index corresponds 
>> to
>> >>>>> the row
>> >>>>>           in the expanded object.
>> >>>>>
>> >>>>> #####
>> >>>>>
>> >>>>>> aa[1,]
>> >>>>> DataFrame with 1 row and 9 columns
>> >>>>>     queryID   consequence         refSeq         varSeq        
>> refAA
>> >>>>> <integer> <factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
>> >>>>> 1         1 nonsynonymous            CTC            ATC        
>>    L
>> >>>>>           varAA        txID   geneID     cdsID
>> >>>>> <AAStringSet> <character> <factor> <integer>
>> >>>>> 1             I       10921     6335     33668
>> >>>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to 
>> tx
>> >>>>>> '10921' and cds '33668'.
>> >>>>>> #If I look at the first query SNP I get this:
>> >>>>>> eg.snps.exp[aa[1,'queryID'],]
>> >>>>> GRanges with 1 range and 2 elementMetadata values:
>> >>>>>       seqnames                 ranges strand |   RefSNP_id
>> >>>>> alleles_as_ambig
>> >>>>> <Rle> <IRanges> <Rle> | <character> <character>
>> >>>>>   [1]     chr2 [167055370, 167055370]      * |   111558968
>> >>>>>      R
>> >>>>>   ---
>> >>>>>   seqlengths:
>> >>>>>     chr1  chr2  chr3  chr4  chr5  chr6 ... chr20 chr21 chr22  
>> chrX
>> >>>>> chrY  chrM
>> >>>>>       NA    NA    NA    NA    NA    NA ...    NA    NA    NA   
>> NA
>> >>>>>   NA    NA
>> >>>>>> #So SNP 1 is at 167055370 on chr2
>> >>>>>> #But if I check tx '10921' I see that the cds overlapping
>> >>>>>> 167055370 is actually '33651'
>> >>>>>> #And cds '33668' is at the other end of the tx:
>> >>>>>> cds.list[[aa[1,'txID']]]
>> >>>>> GRanges with 26 ranges and 3 elementMetadata values:
>> >>>>>        seqnames                 ranges strand   |    cds_id
>> >>>>> cds_name
>> >>>>> <Rle> <IRanges> <Rle>   | <integer> <character>
>> >>>>>    [1]     chr2 [167168009, 167168266]      -   |     33668 
>> <NA>
>> >>>>>    [2]     chr2 [167163466, 167163584]      -   |     33667 
>> <NA>
>> >>>>>    [3]     chr2 [167163020, 167163109]      -   |     33666 
>> <NA>
>> >>>>>    [4]     chr2 [167162302, 167162430]      -   |     33647 
>> <NA>
>> >>>>>    [5]     chr2 [167160748, 167160839]      -   |     33646 
>> <NA>
>> >>>>>    [6]     chr2 [167159600, 167159812]      -   |     33645 
>> <NA>
>> >>>>>    [7]     chr2 [167151109, 167151172]      -   |     33644 
>> <NA>
>> >>>>>    [8]     chr2 [167149741, 167149882]      -   |     33643 
>> <NA>
>> >>>>>    [9]     chr2 [167144947, 167145153]      -   |     33642 
>> <NA>
>> >>>>>    ...      ...                    ...    ... ...       ...
>> >>>>> ...
>> >>>>>   [18]     chr2 [167099012, 167099166]      -   |     33659 
>> <NA>
>> >>>>>   [19]     chr2 [167094604, 167094777]      -   |     33658 
>> <NA>
>> >>>>>   [20]     chr2 [167089850, 167089972]      -   |     33657 
>> <NA>
>> >>>>>   [21]     chr2 [167085201, 167085482]      -   |     33656 
>> <NA>
>> >>>>>   [22]     chr2 [167084180, 167084233]      -   |     33655 
>> <NA>
>> >>>>>   [23]     chr2 [167083077, 167083214]      -   |     33654 
>> <NA>
>> >>>>>   [24]     chr2 [167060870, 167060974]      -   |     33653 
>> <NA>
>> >>>>>   [25]     chr2 [167060465, 167060735]      -   |     33652 
>> <NA>
>> >>>>>   [26]     chr2 [167055182, 167056374]      -   |     33651 
>> <NA>
>> >>>>>        exon_rank
>> >>>>> <integer>
>> >>>>>    [1]         2
>> >>>>>    [2]         3
>> >>>>>    [3]         4
>> >>>>>    [4]         5
>> >>>>>    [5]         6
>> >>>>>    [6]         7
>> >>>>>    [7]         8
>> >>>>>    [8]         9
>> >>>>>    [9]        10
>> >>>>>    ...       ...
>> >>>>>   [18]        19
>> >>>>>   [19]        20
>> >>>>>   [20]        21
>> >>>>>   [21]        22
>> >>>>>   [22]        23
>> >>>>>   [23]        24
>> >>>>>   [24]        25
>> >>>>>   [25]        26
>> >>>>>   [26]        27
>> >>>>>   ---
>> >>>>>   seqlengths:
>> >>>>>                     chr1                  chr2 ...
>> >>>>> chr18_gl000207_random
>> >>>>>                249250621             243199373 ...
>> >>>>> 4262
>> >>>>>
>> >>>>>
>> >>>>
>> >>>> _______________________________________________
>> >>>> Bioconductor mailing list
>> >>>> Bioconductor at r-project.org
>> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >>>> Search the archives:
>> >>>> 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >>>
>> >>> _______________________________________________
>> >>> Bioconductor mailing list
>> >>> Bioconductor at r-project.org
>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >>> Search the archives:
>> >>> 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Alex Gutteridge