[R] Reshaping genetic data from long to wide

Wed Apr 12 01:56:50 CEST 2006

Farrel Buchinsky <fbuchins at wpahs.org> wrote:
>
> 2) Storing all the SNP data as a string seems quite clever and a space-saving
> way of doing it. However, if you were to analyze a whole chromosome at a
> time you would still be creating one almighty big table albeit only
> temporarily. Do you use R to run TDT analyses? If so, how are you setting up
> your data frames and then what commands do you issue to analyze what is in
> your dataframes?
>
> I used David Clayton's Stata add-on a few months ago and was able to get it
> to run through the analysis. I ran just one locus. Technically, the analysis
> seemed to run OK but I want to run all the loci one after the other.
>
> Currently I have my data such that I can access it from R through an ODBC
> connection to Microsoft Access which in turn has an ODBC connection to the
> Sybase database. Whether I go through strings or not, I still need to find a
> way that I can assemble it so that a program can systematically run a TDT
> analysis on all the loci. I can see how strings help me in my storing of the
> data but that is already a fait accompli. Can you explain to me how it would
> help me with sequential analysis of each locus? Do you have any history
> files so that I can see what you were doing?
>
At some stage, you will need to have a "wide" format pedigree file, at least
as wide as the haplotypes you are interested in.

Using the idea of storage as strings (which would be especially good for
SNPs), you would just do something like:
  cbind(phenotypes, unlist(strsplit(snp, " ")))
or
  ex <- function(x) matrix(unlist(strsplit(x, " ")), ncol=length(x))
  cbind(phenotypes, ex(snps[chosen.snps]))
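To make the string idea concrete, here is a minimal sketch on made-up data. It assumes each SNP is stored as one string holding the genotypes of all individuals, separated by single spaces and in the same order as the rows of the phenotype table; the ids, SNP names and genotype codes are invented for illustration.

```r
# Toy data (all values hypothetical): 3 individuals, 3 SNPs.
phenotypes <- data.frame(id = c("p1", "p2", "p3"), affected = c(2, 1, 2))

# One string per SNP; genotypes are in the same order as rows of phenotypes.
snps <- c(rs1 = "1/1 1/2 2/2", rs2 = "1/2 1/2 1/1", rs3 = "2/2 1/2 1/1")

# Split each string on spaces and fill a matrix with one column per SNP.
ex <- function(x) {
  m <- matrix(unlist(strsplit(x, " ")), ncol = length(x))
  colnames(m) <- names(x)
  m
}

# Bind only the SNPs of interest next to the phenotypes.
chosen.snps <- c("rs1", "rs3")
wide <- cbind(phenotypes, ex(snps[chosen.snps]))
print(wide)
```

Because matrix() fills column by column, each column ends up holding one SNP, so only the chosen SNPs are ever expanded to wide form.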

Here is a fragment that uses reshape() to go from long to wide:

extract.snp <- function(snplist, snpdata) {
# names of snps from one file
  snps <- read.delim(snplist, sep="\t", head=FALSE,
                     colClasses=c("character", "numeric"))
  names(snps) <- c("Names", "Pos")
  snps <- snps[order(snps[,2]),]
# snp data in long format
  x <- read.delim(snpdata, sep="\t", colClasses="character", head=FALSE)
  names(x) <- c("id","marker", "genotype")
  ped<-reshape(x, v.names="genotype", timevar="marker",
             idvar="id",direction="wide")
  # strip only the literal "genotype." prefix that reshape() adds
  names(ped) <- sub("^genotype\\.", "", names(ped))
  rpos <- match(snps$Names, names(ped))
  ped <- ped[,c(1, rpos[!is.na(rpos)])]
  ped
}
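
As a rough check of what the reshape step does, the same logic can be run on a small in-memory example instead of files. The data below are invented, and the column names ("id", "marker", "genotype", "Names", "Pos") simply follow the ones used in the function above.

```r
# Made-up long-format data: one row per (individual, marker) pair.
x <- data.frame(
  id       = rep(c("p1", "p2", "p3"), each = 2),
  marker   = rep(c("rs2", "rs1"), times = 3),
  genotype = c("1/2", "1/1", "1/2", "1/2", "1/1", "2/2"),
  stringsAsFactors = FALSE
)

# Made-up map file contents; rs9 has no genotype data on purpose.
snps <- data.frame(Names = c("rs2", "rs1", "rs9"),
                   Pos   = c(2000, 1000, 3000),
                   stringsAsFactors = FALSE)
snps <- snps[order(snps$Pos), ]

# Long to wide: one row per id, one genotype column per marker.
ped <- reshape(x, v.names = "genotype", timevar = "marker",
               idvar = "id", direction = "wide")
names(ped) <- sub("^genotype\\.", "", names(ped))

# Keep id plus the markers found in the data, ordered by map position.
rpos <- match(snps$Names, names(ped))
ped <- ped[, c(1, rpos[!is.na(rpos)])]
print(ped)
```

Markers listed in the map but absent from the data (rs9 here) are dropped silently by the !is.na(rpos) filter, so it may be worth checking for them before analysis.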

| David Duffy (MBBS PhD)                                         ,-_|\
| email: davidD at qimr.edu.au  ph: INT+61+7+3362-0217 fax: -0101  /     *
| Epidemiology Unit, Queensland Institute of Medical Research   \_,-._/
| 300 Herston Rd, Brisbane, Queensland 4029, Australia  GPG 4D0B994A v