[BioC] Problems Reading SNPs into snpMatrix

Chris K. Fuller chris at genome.ucsf.edu
Wed May 11 20:38:37 CEST 2011


Hello List,

I'm at a loss to explain why the read.snps.long function in snpStats
fails to read my SNPs correctly.  I've attempted this several
different ways without success.  I have 195 samples (individuals) and
about 440,000 SNPs, arranged in the "one call per line" format.  I'm
running this on Ubuntu Linux, and all R packages are up to date.  Two
of my attempts are reproduced below.

Example 1:  Reading SNPs with Nucleotide Genotype

With a file like:
1110043	rs2847443	TT
1110059	rs2847443	TT
1110060	rs2847443	TT
1110063	rs2847443	TA
1110066	rs2847443	TT
1110067	rs2847443	TT
1110070	rs2847443	TT

I use the command:
foo = read.snps.long('one_call_per_line.txt',
                     sample.id = as.character(sample_id[[1]]),
                     snp.id = as.character(snp_id_mod[[1]]),
                     fields = c(sample = 1, snp = 2, genotype = 3),
                     codes = "nucleotide", sep = '\t',
                     verbose = TRUE, in.order = TRUE)

And receive the result:
Error in read.snps.long("one_call_per_line.txt", sample.id =
as.character(sample_id[[1]]),  :
Nucleotide coded genotype should be 2 character string: line 1


Example 2:  Perform my own SNP genotype coding and try again

With a file like:
1110043	rs2847443	3
1110059	rs2847443	3
1110060	rs2847443	3
1110063	rs2847443	2
1110066	rs2847443	3
1110067	rs2847443	3
1110070	rs2847443	3

I use this command:
foo = read.snps.long('one_call_per_line_coded.txt',
                     sample.id = as.character(sample_id[[1]]),
                     snp.id = as.character(snp_id_mod[[1]]),
                     fields = c(sample = 1, snp = 2, genotype = 3),
                     codes = c("1", "2", "3"), sep = '\t',
                     verbose = TRUE, in.order = TRUE)

And receive this result:
Reading SnpMatrix with 195 rows and 443136 columns
                             Cumulative totals
                    -----------------------------------
    File      Line     Accepted   Rejected     No call     Skipped    File name
     1    9142000          0            0         9141999         0 0 ...er_line_coded.txt


In the first example, it apparently doesn't treat the third column of
my tab-delimited text file as two characters.  Why?  I also generated
a file with quotes around each nucleotide (e.g. 'AA'), but this did
not change the behavior.  In the second example, it seems to treat
everything as no call.  Why?
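
Because both commands pass in.order=TRUE together with explicit
sample.id and snp.id vectors, one thing that can be checked without
snpStats is whether the identifiers and genotype codes in the file agree
with what is being passed in. The base-R sketch below does that on a
small stand-in file. The vectors `sample_ids` and `snp_ids` are
hypothetical placeholders for as.character(sample_id[[1]]) and
as.character(snp_id_mod[[1]]), and the check on ordering is only an
assumption about what might matter, not a statement of how the reader
works internally.

```r
# Stand-in for 'one_call_per_line_coded.txt': three rows in the same layout.
path <- tempfile(fileext = ".txt")
writeLines(c("1110043\trs2847443\t3",
             "1110059\trs2847443\t3",
             "1110063\trs2847443\t2"), path)

# Hypothetical ID vectors; replace with the real sample and SNP ID vectors.
sample_ids <- c("1110043", "1110059", "1110063")
snp_ids <- c("rs2847443")
expected_codes <- c("1", "2", "3")

# Read everything as character so numeric-looking IDs are not altered.
calls <- read.table(path, sep = "\t", header = FALSE, nrows = 100000,
                    colClasses = "character", quote = "", comment.char = "",
                    col.names = c("sample", "snp", "genotype"))

# IDs that occur in the file but not in the supplied vectors.
missing_samples <- setdiff(unique(calls$sample), sample_ids)
missing_snps <- setdiff(unique(calls$snp), snp_ids)

# Order of first appearance in the file versus order in the vectors.
file_samples <- unique(calls$sample)
file_snps <- unique(calls$snp)
samples_in_order <- identical(file_samples, sample_ids[seq_along(file_samples)])
snps_in_order <- identical(file_snps, snp_ids[seq_along(file_snps)])

# Genotype values that are not among the codes being passed in.
unexpected_codes <- setdiff(unique(calls$genotype), expected_codes)

print(missing_samples)
print(missing_snps)
print(c(samples_in_order = samples_in_order, snps_in_order = snps_in_order))
print(unexpected_codes)
```

Any non-empty result for the missing IDs or unexpected codes, or a FALSE
for either ordering check, would point to a mismatch between the file
and the arguments rather than to the genotype column itself.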

I've been at this for far longer than I care to admit, so any insight
would be greatly appreciated.

Thank you,

Chris Fuller
chris at genome.ucsf.edu


