[BioC] how to convert genotype snp matrix to nucleotide genotypes?

Stephanie M. Gogarten sdmorris at u.washington.edu
Mon May 13 17:58:22 CEST 2013


You might have better luck using the methods in VariantAnnotation to 
access the genotype matrix, rather than converting to SnpMatrix.  For 
example:

fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
geno <- geno(vcf)$GT
geno
            NA00001 NA00002 NA00003
rs6054257  "0|0"   "1|0"   "1/1"
20:17330   "0|0"   "0|1"   "0/0"
rs6040355  "1|2"   "2|1"   "2/2"
20:1230237 "0|0"   "0|0"   "0/0"
microsat1  "0/1"   "0/2"   "1/1"

ref <- ref(vcf)
alt <- alt(vcf)
geno2 <- geno
for (i in 1:nrow(geno)) {
   geno2[i,] <- gsub("0", as.character(ref[i]), geno[i,])
   for (j in 1:elementLengths(alt[i])) {
     geno2[i,] <- gsub(as.character(j),
                       as.character(alt[[i]][j]),
                       geno2[i,])
   }
}
geno2
            NA00001 NA00002    NA00003
rs6054257  "G|G"   "A|G"      "A/A"
20:17330   "T|T"   "T|A"      "T/T"
rs6040355  "G|T"   "T|G"      "T/T"
20:1230237 "T|T"   "T|T"      "T/T"
microsat1  "GTC/G" "GTC/GTCT" "G/G"

There is probably a much more efficient way to do the string 
substitution than my nested for loop, but this gives you the idea.

Stephanie

On 5/12/13 6:47 PM, Vincent Carey wrote:
> On Sat, May 11, 2013 at 9:18 PM, Tereza Jezkova UNLV <
> jezkovat at unlv.nevada.edu> wrote:
>
>>    Hi Vincent,
>>
>> Thanks so much for your help. I did what you suggested but the resulting
>> matrix has only two values: NA (majority of cells) or A/B. There is no A/A
>> or B/B. SO I know I did something wrong. This is my entire code:
>>
>>> fl <- system.file("extdata", "Lizard_std.vcf",
>> package="VariantAnnotation")
>>> vcf <- readVcf(fl, "1342gen_fasta")
>>> mat <- genotypeToSnpMatrix(vcf)
>> Warning messages:
>> 1: In .local(x, ...) : variants with >1 ALT allele are set to NA
>> 2: In .local(x, ...) : non-single nucleotide variations are set to NA
>> 3: In .local(x, ...) : non-diploid variants are set to NA
>>
>
> could it be that your genome does not have many diallelic loci?  you may
> need to work directly with
> the genotype data and not the SnpMatrix representation.
>
>
>>> MAT <- as(mat$genotypes, "character")
>>> MAT_TRAN <- t(MAT)
>>> write.csv(MAT_TRAN, "mat.csv")
>>
>>
>> The resulting matrix (when opened in excel looks like this. I assume it
>> has something to do with the warning messages I received but I am not sure
>> what to do.
>>     1 2 3 4 5 6 7 8 9 10 RADid_0000001_depth_39:0000000058 NA NA NA NA NA
>> NA NA NA A/B NA RADid_0000003_depth_152:0000000007 NA NA NA NA NA NA NA NA
>> NA NA RADid_0000003_depth_152:0000000034 NA NA NA NA NA NA NA NA NA NA
>> RADid_0000003_depth_152:0000000046 NA NA NA A/B NA A/B NA NA NA NA
>> RADid_0000010_depth_57:0000000010 NA NA NA NA NA NA NA A/B NA NA
>> RADid_0000010_depth_57:0000000019 A/B A/B NA NA NA NA NA NA NA NA
>> RADid_0000010_depth_57:0000000020 A/B A/B A/B A/B NA NA NA A/B NA NA
>> RADid_0000010_depth_57:0000000059 A/B A/B NA A/B NA NA A/B NA NA NA
>> RADid_0000010_depth_57:0000000062 NA NA A/B NA NA NA NA NA NA NA
>>
>>
>> Will be very grateful for any advice.
>>
>> Thanks, Tereza
>>
>>
>>
>>
>>
>>   *From:* Vincent Carey <stvjc at channing.harvard.edu>
>> *Sent:* Friday, May 10, 2013 7:51 PM
>> *To:* Tereza Jezkova UNLV <jezkovat at unlv.nevada.edu>
>> *Cc:* bioconductor at r-project.org
>> *Subject:* Re: [BioC] how to convert genotype snp matrix to nucleotide
>> genotypes?
>>
>>
>>
>> On Fri, May 10, 2013 at 10:02 PM, Tereza Jezkova UNLV <
>> jezkovat at unlv.nevada.edu> wrote:
>>
>>> I created a Snp matrix using a genotypeToSnpMatrix command
>>>
>>> The matrix looks like this:
>>>
>>>> mat
>>> $genotypes
>>> A SnpMatrix with  10 rows and  50581 columns
>>> Row names:  Rodriguez_Lizard_1240_sequence_1_pileup.txt ...
>>> Rodriguez_Lizard_623_sequence_1_pileup.txt
>>> Col names:  RADid_0000001_depth_39:0000000058 ...
>>> RADid_0078132_depth_33:0000000081
>>>
>>> $map
>>> DataFrame with 50581 rows and 4 columns
>>>                                 snp.names       allele.1
>>> allele.2    ignore
>>>                               <character> <DNAStringSet>
>>> <DNAStringSetList> <logical>
>>> 1      RADid_0000001_depth_39:0000000058              C
>>> A     FALSE
>>> 2     RADid_0000003_depth_152:0000000007              G
>>> A,T      TRUE
>>> 3     RADid_0000003_depth_152:0000000034              G
>>> T,C      TRUE
>>> 4     RADid_0000003_depth_152:0000000046              T
>>> C     FALSE
>>> 5      RADid_0000010_depth_57:0000000010              T
>>> C     FALSE
>>> ...                                  ...            ...
>>> ...       ...
>>> 50577  RADid_0078129_depth_31:0000000062              C
>>> T     FALSE
>>> 50578  RADid_0078132_depth_33:0000000025              T
>>> C     FALSE
>>> 50579  RADid_0078132_depth_33:0000000033              C
>>> T     FALSE
>>> 50580  RADid_0078132_depth_33:0000000044              C
>>> A     FALSE
>>> 50581  RADid_0078132_depth_33:0000000081              C
>>> T     FALSE
>>>
>>>
>>> How do I convert my matrix to a a nucleotide genotype matrix?
>>> I would like my data to look something like:
>>>
>>> Sample 1 Snp 1 T/T
>>> Sample 2 Snp 1 T/A
>>> Sample 3 Snp 1 A/A etc.
>>>
>>>
>>
>> as(mat, "character")  will yield a conforming matrix with "A/B" notation
>> in each cell.  the
>> representation is only useful for diallelic SNP
>>
>> from ?read.snps.long
>>
>>    For nucleotide coding, nucleotides are assigned to the nominal alleles
>>    in alphabetic order. Thus, for a SNP with either  "T" and "A"
>>    nucleotides in the variant position,
>>    the nominal genotypes AA, AB and BB will refer to A/A,
>>    A/T and T/T.
>>
>> provided this convention is observed in the VCF translation, you could use
>> string substitutions
>> to transform the A/B notation to nucleotide notation.  oftentimes this is
>> not really needed.
>>
>>
>>> I noticed that people have been creating their Snp matrix from a txt
>>> file. I am importing from a vcf file and I can’t figure out how to get the
>>> desired format.
>>>
>>> Thanks a lot for your kind help,
>>>
>>> Tereza
>>>          [[alternative HTML version deleted]]
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>
> 	[[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list