[BioC] DNAStringSet_translate error in predictCoding()

Valerie Obenchain vobencha at fhcrc.org
Thu Jun 19 19:06:43 CEST 2014


Hi,

Please remember to hit 'reply all' when responding so we keep 
communication on the list.

If you're interested in mapping the ambiguity codes in alt to the 
bases they represent, see ?IUPAC_CODE_MAP in the Biostrings package.
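
As a rough sketch of that lookup (the variable names 'allele', 'codes' and 
'possible' are only for illustration), IUPAC_CODE_MAP is a named character 
vector, so it can be indexed by code letter:

```r
library(Biostrings)

# Names are IUPAC codes, values are the bases each code stands for.
IUPAC_CODE_MAP[c("W", "M", "N")]   # W -> "AT", M -> "AC", N -> "ACGT"

# Expand one ambiguous allele into the bases possible at each position.
allele <- "WA"
codes <- strsplit(allele, "")[[1]]
possible <- strsplit(unname(IUPAC_CODE_MAP[codes]), "")
possible   # list element 1: "A" "T"; element 2: "A"
```

This only shows what each code could stand for. It does not tell you which 
base is actually present, so it may not be enough to make the alleles usable 
in predictCoding().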

To identify rows with ambiguous codes you can use the 'other' column 
returned by alphabetFrequency() when called with baseOnly=TRUE:

   alt <- DNAStringSetList("WA", "G", "NA", c("AG", "M"), "A")
   af <- alphabetFrequency(unlist(alt), baseOnly=TRUE)
   ambiguous <- any(relist(af[,"other"] > 0L, alt))
   > ambiguous
   [1]  TRUE FALSE  TRUE  TRUE FALSE


A VCF can be subset by rows (variants) or columns (samples) using '['. 
Remove ambiguous rows and keep all samples:

   vcf[!ambiguous, ]
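
If you need this more than once, the steps above could be wrapped in a small 
function. This is only a sketch: 'isAmbiguousAlt' is a made-up name, not 
something provided by Biostrings or VariantAnnotation, and it assumes alt is 
a DNAStringSetList (it stops otherwise, e.g. for a CharacterList).

```r
library(Biostrings)

# Hypothetical helper (not part of Biostrings or VariantAnnotation):
# returns one logical per variant, TRUE if any ALT allele of that
# variant contains a letter other than A, C, G or T.
isAmbiguousAlt <- function(alt) {
    # This sketch handles only the DNAStringSetList case.
    stopifnot(is(alt, "DNAStringSetList"))
    af <- alphabetFrequency(unlist(alt), baseOnly=TRUE)
    any(relist(af[, "other"] > 0L, alt))
}

# Toy data from the example above
alt <- DNAStringSetList("WA", "G", "NA", c("AG", "M"), "A")
keep <- !isAmbiguousAlt(alt)
alt[keep]   # only the "G" and "A" elements remain
```

With a VCF object the same idea would be vcf[!isAmbiguousAlt(alt(vcf)), ], 
and the result can then be passed on to predictCoding().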


Valerie



On 06/19/2014 03:28 AM, "Dr. Jörg Linde" wrote:
> Dear Valerie,
> thank you sooo much. Helped a lot. Version is VariantAnnotation_1.8.13.
>
>  >hasOnlyBaseLetters(unlist(alt(vcf)))
> FALSE
>
>
>  > unlist(alt(vcf))[rowSums(alphabetFrequency(unlist(alt(vcf)))[,5:17])>0]
>    A DNAStringSet instance of length 11
>       width seq
>   [1]     2 GN
>   [2]     2 WA
>   [3]    12 GTATGTGTNTAT
>   [4]     2 NA
>   [5]     2 YC
>   ...   ... ...
>   [7]     6 AGANGA
>   [8]     2 GN
>   [9]     6 MCAATA
>  [10]    11 GTAGTANTAGT
>  [11]     2 TN
>
>
> I am just looking for an elegant way to remove these lines from my vcf
>
> best
> Jörg
>
>
>
>
> On 06/18/2014 10:17 PM, Valerie Obenchain wrote:
>> Hi Jörg,
>>
>> It looks like your sessionInfo() output was cut off and I can't tell
>> what version of VariantAnnotation you have.
>>
>> Versions >= 1.10.0 detect structural variants and create either a
>> CharacterList or DNAStringSetList. Since you have a DNAStringSetList,
>> all values should be valid bases.
>>
>> Does this return TRUE?
>>
>>     hasOnlyBaseLetters(unlist(alt(vcf)))
>>
>> Are there any non-base characters in the matrix?
>>
>>     alphabetFrequency(unlist(alt(vcf)))
>>
>>
>> To help further I'll need the version of VariantAnnotation and a
>> reproducible example.
>>
>> Valerie
>>
>>
>>
>> On 06/17/2014 05:45 AM, "Dr. Jörg Linde" wrote:
>>> Dear bioconductor team,
>>>
>>> I have a problem with predictCoding() from the VariantAnnotation package,
>>> which throws the same error as the one described here:
>>> https://stat.ethz.ch/pipermail/bioconductor/2012-November/048940.html
>>>
>>> However, after reading my vcf it clearly has a DNAStringSetList in
>>> its ALT variable.
>>> The problem remains when using vcftools to remove indels from the vcf.
>>> As far as I see there are some ALTs with two possibilities.
>>> Is there anything else which could cause the problem?
>>>
>>> I am also aware of this thread
>>> https://stat.ethz.ch/pipermail/bioconductor/2012-October/048370.html
>>> but I can't figure out how to remove those lines causing the problem.
>>>
>>> Thank you very much
>>> Jörg
>>>
>>>   vcf=readVcf("file.vcf","hg")
>>>   coding <- predictCoding(vcf, txdb, seqSource=fa)
>>> Error in .Call2("DNAStringSet_translate", x, DNA_BASE_CODES, lkup,
>>> skipcode,  :
>>>    in 'x[[6655]]': not a base at pos 3
>>>  > alt(vcf)
>>> DNAStringSetList of length 142721
>>> [[1]] C
>>> [[2]] T
>>> [[3]] G
>>> [[4]] G
>>> [[5]] G
>>> [[6]] C
>>> [[7]] C
>>> [[8]] A
>>> [[9]] G
>>> [[10]] C
>>> ..
>>> <142711 more elements>
>>>  > sessionInfo()
>>> R version 3.0.2 (2013-09-25)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>


