[BioC] example of working from a database dump

Vincent Carey 525-2265 stvjc at channing.harvard.edu
Wed Sep 24 13:52:19 MEST 2003

On Wed, 24 Sep 2003, Kaushik, Narendra K wrote:

> I have file in this format:
> Sequence ID  Gene_info  Avag_diff1 Avg_diff2..............Avg_diif95
> fffffffff    Gene1          45        60                    56
> --------     ------      ------     -------  ...............-----
> 9000         Gene9000       34        45                    56
> Avg_diff1--- Avg_diff95 etc values of gene expression data from N=95 chips.
> This is not Affy hgu95av2 Chip. we have custom made chip oligo synthesize
> in situ and are processed by PM-MM method.
> Narendra

there are certain conventions that need to be obeyed.  suppose
i edit your data snippet as follows

SequenceID  Gene_info  Avag_diff1 Avg_diff2           Avg_diif95
fffffffff    Gene1          45        60                    56
9000         Gene9000       34        45                    56

and now those three lines are the contents of file "tab".

in R, the command

> ERtab <- read.table("tab", header=TRUE)

assigns the data to the object ERtab, which is a data.frame.
now you can do some statistics:

> apply(ERtab[,3:5],2,mean)
Avag.diff1  Avg.diff2 Avg.diif95
      39.5       52.5       56.0

so with a little bit of massaging of a non-standard data snippet
and two R commands i have learned something about the data.
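
the two commands above can be tried without editing a file by hand.
what follows is only a minimal sketch: writing the three lines to a
temporary file is an assumption made for illustration, and colMeans is
swapped in for apply(..., 2, mean) since it should give the same numbers.
columns are picked by position because more recent R versions may keep
the underscores in the column names.

```r
# recreate the three-line file "tab" in a temporary location (illustration only)
tab <- tempfile()
writeLines(c(
  "SequenceID  Gene_info  Avag_diff1 Avg_diff2           Avg_diif95",
  "fffffffff    Gene1          45        60                    56",
  "9000         Gene9000       34        45                    56"
), tab)

# header = TRUE takes the column names from the first line
ERtab <- read.table(tab, header = TRUE)

# columns 3 to 5 hold the expression values; one mean per chip
chip_means <- colMeans(ERtab[, 3:5])
print(chip_means)   # 39.5, 52.5 and 56.0 for the three chips
```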

you will need to learn some R.  to start with:
1) no embedded blanks in variable (column) names,
2) embedded underscores are translated to ".",
3) read.table is powerful and will distinguish
between numeric and character data.
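
a quick way to see these conventions at work is to look at the class of
each column after reading, and at what make.names (the base R function
that read.table uses to tidy column names) does to a name with a blank.
this is a sketch on the two-row snippet only; the name "Gene info" is a
made-up example, and how underscores are treated may depend on your R
version, so the sketch does not rely on it.

```r
# read the snippet from a string instead of a file (illustration only)
snippet <- "SequenceID Gene_info Avag_diff1 Avg_diff2 Avg_diif95
fffffffff Gene1 45 60 56
9000 Gene9000 34 45 56"
ERtab <- read.table(text = snippet, header = TRUE, stringsAsFactors = FALSE)

# one class per column: the two identifier columns are character,
# the expression columns are numeric (stored as integer here)
col_classes <- sapply(ERtab, class)
print(col_classes)

# a column name with an embedded blank is not usable as is;
# make.names replaces the blank with "."
fixed <- make.names("Gene info")
print(fixed)   # "Gene.info"
```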

you may have trouble reading in all 95 columns unless you have
lots of RAM.  once you have the matrix of numbers you can
consider structuring the data in the exprSet class.  clearly there
are some interesting a priori distinctions among the 95 chips.  these
should be encoded in the phenoData component of the exprSet.
read the Biobase and affy vignettes to learn more about this.
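
before going to the vignettes it may help to see the two pieces the class
expects, using base R only.  this is a rough sketch, not the Biobase
constructor call (check the vignette for that): a numeric matrix with one
row per gene and one column per chip, and a data.frame with one row per
chip.  the covariate "group" and its values are invented for
illustration; your real chip-level distinctions go there.

```r
# stand-in for the data.frame produced by read.table (two genes, three chips)
ERtab <- data.frame(
  SequenceID = c("fffffffff", "9000"),
  Gene_info  = c("Gene1", "Gene9000"),
  Avg_diff1  = c(45, 34),
  Avg_diff2  = c(60, 45),
  Avg_diff95 = c(56, 56),
  stringsAsFactors = FALSE
)

# expression matrix: genes in rows, chips in columns
exprs_mat <- as.matrix(ERtab[, 3:5])
rownames(exprs_mat) <- ERtab$SequenceID

# chip-level covariates: one row per chip, row names matching the
# matrix column names ("group" is a hypothetical covariate)
pheno <- data.frame(group = c("A", "B", "A"),
                    row.names = colnames(exprs_mat))

# the chips must line up between the two objects
stopifnot(identical(rownames(pheno), colnames(exprs_mat)))
```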

there may also be facilities in limma and the marray* tools
that can help you with your custom chips.

suppose you simply don't have enough RAM to read in the data.
you will have to divide and conquer in an appropriate way.
it may be that you can cut the data up into chunks of genes,
with all 95 chips for each gene in the chunk.  or you can
cut it up into chunks of chips, with all 9000 genes on every chip
in the chunk.  you need to think about filtering
to make this manageable if your computing resources are
insufficient to deal with the whole dataset.  R has all the
tools you need to do this -- you can work with scan, e.g.,
to get subsets of records in the file.  or you may have operating
system facilities that help with file decomposition.
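
both ways of cutting up the file can be sketched in a few lines.  the toy
file below (6 genes by 4 chips, made-up numbers) stands in for the real
one, and the chunk size of 2 is arbitrary.  part (a) uses scan on an open
connection to take a few gene records at a time; part (b) uses the
colClasses argument of read.table, where "NULL" means skip the column,
to keep only some chips.  treat it as a starting point under those
assumptions rather than a finished solution.

```r
# toy stand-in for the big file: 6 genes by 4 chips (made-up numbers)
big <- tempfile()
toy <- data.frame(
  SequenceID = sprintf("s%d", 1:6),
  Gene_info  = sprintf("Gene%d", 1:6),
  matrix(1:24, nrow = 6,
         dimnames = list(NULL, paste0("Avg_diff", 1:4)))
)
write.table(toy, big, quote = FALSE, row.names = FALSE)

# (a) chunks of genes: read a few records at a time with scan
con <- file(big, open = "r")
hdr <- scan(con, what = "", nlines = 1, quiet = TRUE)   # header line
# two character fields, then one numeric field per chip
what <- c(list("", ""), rep(list(0), length(hdr) - 2))
chunk_size <- 2
gene_means <- numeric(0)
repeat {
  rec <- scan(con, what = what, nlines = chunk_size, quiet = TRUE)
  if (length(rec[[1]]) == 0) break          # nothing left to read
  vals <- do.call(cbind, rec[-(1:2)])       # genes in rows, chips in columns
  m <- rowMeans(vals)
  names(m) <- rec[[1]]
  gene_means <- c(gene_means, m)            # keep only the per-gene summary
}
close(con)
print(gene_means)

# (b) chunks of chips: keep SequenceID and the first two chips only
cc <- rep("NULL", length(hdr))              # "NULL" = skip this column
cc[1] <- "character"
cc[3:4] <- "numeric"
two_chips <- read.table(big, header = TRUE, colClasses = cc)
print(colMeans(two_chips[, 2:3]))
```

in (a) only the per-gene means are kept, so the full table never has to
sit in memory at once; in (b) the skipped columns are still scanned but
are not stored.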
