[R] Reshaping genetic data from long to wide

dhinds at sonic.net
Sat Apr 8 09:59:16 CEST 2006


Farrel Buchinsky <fbuchins at wpahs.org> wrote:

> Bottom Line Up Front: How does one reshape genetic data from long to wide?

I avoid both your "long" and "wide" formats because they are awkward
and inefficient for large data sets.  The "long" format wastes a huge
amount of space with redundant column values, and manipulating a data
frame with 12000 columns is not much fun either.

Instead, I pack genotypes into strings: usually one string per SNP.
So in your example, I'd have a small 180-row table with a few columns
of data about the samples, and a small 6000-row SNP table with one
column of packed genotypes.  The sample table is organized so the row
order matches the positions of the sample genotypes in the packed
genotype strings.  If I want genotypes for a particular SNP, I unpack
them with strsplit(), tack them onto the sample table as a new column,
and discard that column when I'm done.
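
A minimal sketch of that layout and the unpack step follows.  The table
and column names, the sample data, and the encoding (one character per
sample: 0/1/2 copies of an allele, "-" for missing) are all assumptions
made for illustration; the post does not specify how genotypes are coded.

```r
# Hypothetical sample table: row order must match the character
# positions in every packed genotype string.
samples <- data.frame(
  id     = c("S1", "S2", "S3", "S4"),
  status = c("case", "case", "control", "control"),
  stringsAsFactors = FALSE
)

# Hypothetical SNP table: one packed genotype string per SNP.
# Assumed encoding: one character per sample ("0", "1", "2", or "-").
snps <- data.frame(
  snp    = c("rs0001", "rs0002"),
  packed = c("0120", "21-0"),
  stringsAsFactors = FALSE
)

# Unpack one SNP into a character vector with one element per sample.
unpack <- function(snp_id) {
  packed <- snps$packed[snps$snp == snp_id]
  strsplit(packed, "")[[1]]
}

# Tack the genotypes onto a working copy of the sample table,
# use them, then drop the column again.
work <- samples
work$rs0002 <- unpack("rs0002")
counts <- table(work$status, work$rs0002)
work$rs0002 <- NULL
```

If each genotype took two characters instead (for example "AG"), the
split on "" would not apply; substring() with start positions stepping
by two would be one way to unpack that encoding.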

I store genotype data in our database this way as well.  We also have
a "long" format table, but I avoid it whenever possible because the
packed format is so much more convenient.  The time saved pulling data
out of the database in this format dwarfs the time spent parsing out
the genotype strings.

-- David Hinds



