[BioC] Reading ArrayExpress data [was Re: questions on the ImaGene data using limma package]

Thu Oct 30 02:25:57 CET 2008

Dear Ming,

Thank you for mailing me example data sets and the annotation spreadsheet 
from ArrayExpress.

You are assuming that the data from ArrayExpress are in ImaGene format. 
This is incorrect.  The reason that limma gives a special treatment to 
ImaGene files is that, unlike other image analysis software, ImaGene 
writes the Cy3 and Cy5 channels into separate files.  However, ArrayExpress 
has collated the original data into data matrices with the Cy3 and Cy5 
intensities in the same file for each sample.  Therefore you should ignore 
all references to ImaGene in the limma manual, and instead use the 
instructions for generic two-color platforms.
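
One quick way to tell the two layouts apart is to look at the column headings of a raw data file.  The sketch below is only an illustration: it writes a tiny made-up file to a temporary location and checks whether the header line mentions both channels.  The helper name has_both_channels and the toy file are assumptions of mine, not part of limma or ArrayExpress, and the sketch assumes the column headings are on the first line of the file.

```r
# Hypothetical helper (not a limma function): TRUE if the first line of a
# tab-delimited file has column headings for both the Cy3 and Cy5 channels.
has_both_channels <- function(file) {
  header <- strsplit(readLines(file, n = 1), "\t", fixed = TRUE)[[1]]
  any(grepl("Cy3", header, fixed = TRUE)) &&
    any(grepl("Cy5", header, fixed = TRUE))
}

# Toy file standing in for an ArrayExpress data matrix (made-up values).
toy <- tempfile(fileext = ".txt")
writeLines(c(
  paste("Reporter identifier", "ImaGene:Signal Mean_Cy5",
        "ImaGene:Signal Mean_Cy3", sep = "\t"),
  paste("R1", "1200", "950", sep = "\t")
), toy)

both <- has_both_channels(toy)
print(both)
```

If both channels appear in one file, the generic two-color route applies and source="imagene" is not needed.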

The data sets you sent me can easily be read into limma using the 
instructions in the limma User's Guide, starting on page 14: "What should 
you do if your image analysis program is not in the above list?"  I 
demonstrate this below.

Your emails suggest that you have not yet read any two-color data into 
limma.  It is essential that you try some simple examples before trying a 
large dataset from ArrayExpress, which will have a complex structure you 
might not fully understand.

I don't fully understand the sample annotation file from ArrayExpress that 
you sent me, but I doubt that you are interpreting it correctly.  It is 
not in the format you need for a limma targets file.  My guess is that 
each row of the file corresponds to one array, and that each array has 
been hybridized with a common reference that is not mentioned in the 
annotation file.  This means that the repeated sample names you have noted 
do not represent matched Cy3 and Cy5 channels, but rather represent 
dye-swap technical replicates.  That is, they are separate arrays.
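
One way to test this guess is to count, for each extract, how many rows and how many distinct data files the annotation file lists.  The sketch below uses made-up toy rows, not the real E-NCMF-8_sdrf.txt.  The column name "Extract Name" is the one mentioned in your email, while "Array Data Matrix File" and the file names are assumptions for illustration only.

```r
# Toy sample annotation (made-up rows; the real SDRF has many more columns).
sdrf <- data.frame(
  "Extract Name" = c("3538", "3526",
                     "reference pool of 61 HNSCC",
                     "reference pool of 61 HNSCC"),
  "Array Data Matrix File" = c("file_A.txt", "file_B.txt",
                               "file_C.txt", "file_D.txt"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Number of annotation rows per extract.
rows_per_extract <- table(sdrf[["Extract Name"]])

# Number of distinct data files per extract.
files_per_extract <- tapply(
  sdrf[["Array Data Matrix File"]],
  sdrf[["Extract Name"]],
  function(x) length(unique(x))
)

print(rows_per_extract)
print(files_per_extract)
```

If a repeated extract name maps to as many distinct data files as it has rows, that is consistent with each row being a separate array.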

If my guess is correct, then a targets file would be something like below.

Let me emphasize that I do not offer a plug-in service to read 
experimental data posted to ArrayExpress.  It is your responsibility to 
figure out the experimental design and the ArrayExpress data formats. 
I am just guessing.

Best wishes
Gordon

READING YOUR DATA FILES

> f
[1] "E-NCMF-8-raw-data-1363346838.txt" "E-NCMF-8-raw-data-1363346856.txt"

> ann <- c("Database NCMF:DB:omadhuman",
+   "Database ebi.ac.uk:Database:ens_trscrpt_id",
+   "Feature coordinates: metaColumn","metaRow","column","row",
+   "Reporter identifier","Reporter sequence type")

> columns <- list(Rf="ImaGene:Signal Mean_Cy5",
+   Rb="ImaGene:Background Median_Cy5",
+   Gf="ImaGene:Signal Mean_Cy3",
+   Gb="ImaGene:Background Median_Cy3")

> RG <- read.maimages(files=f,annotation=ann,columns=columns)
Read E-NCMF-8-raw-data-1363346838.txt
Read E-NCMF-8-raw-data-1363346856.txt

> dim(RG)
[1] 37632     2

A POSSIBLE TARGETS FILE

> targets <- readTargets()
> targets
                      Source            DiseaseState              ArrayDataMatrixFile       Cy3       Cy5
1                       3560 Squamous Cell Carcinoma E-NCMF-8-raw-data-1363346838.txt Reference   SCC3560
2 reference pool of 61 HNSCC Squamous Cell Carcinoma E-NCMF-8-raw-data-1363346856.txt Reference PoolHNSCC
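
A targets file like the one above is just a tab-delimited text file.  As a rough sketch, the code below builds the same two rows in R, writes them to a temporary file, and reads them back with base R.  The file name Targets.txt in the temporary directory is an assumption for illustration.  If limma is installed, the last step calls modelMatrix() with ref="Reference" to show the design implied by a common reference; that step is skipped otherwise.

```r
# The two rows of the possible targets file shown above.
targets <- data.frame(
  Source = c("3560", "reference pool of 61 HNSCC"),
  DiseaseState = c("Squamous Cell Carcinoma", "Squamous Cell Carcinoma"),
  ArrayDataMatrixFile = c("E-NCMF-8-raw-data-1363346838.txt",
                          "E-NCMF-8-raw-data-1363346856.txt"),
  Cy3 = c("Reference", "Reference"),
  Cy5 = c("SCC3560", "PoolHNSCC"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited targets file (hypothetical location).
targets_file <- file.path(tempdir(), "Targets.txt")
write.table(targets, targets_file, sep = "\t", quote = FALSE,
            row.names = FALSE)

# Read it back; all columns are kept as character strings.
targets2 <- read.delim(targets_file, colClasses = "character",
                       stringsAsFactors = FALSE)
print(targets2)

# Optional: design matrix relative to the common reference.
if (requireNamespace("limma", quietly = TRUE)) {
  design <- limma::modelMatrix(targets2, ref = "Reference")
  print(design)
}
```

The Cy3 and Cy5 columns are the ones modelMatrix() uses, so the guessed channel assignment should be checked against the experiment description before relying on the design.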

On Wed, 29 Oct 2008, Ming YI [Contr] wrote:

> Hi, Dear Gordon:
>
> I tried to use limma to deal with an ImaGene dataset I downloaded from 
> ArrayExpress. I have never dealt with ImaGene data before and am not familiar 
> with the ImaGene data format, except for knowing that the Cy5 and Cy3 signals 
> are stored in two separate files for the same sample. I tried to read the data 
> into limma and normalize them in the context of limma, and I keep running into 
> issues and errors. I hope you can help me in this regard:
>
> I attached a file (E-NCMF-8_sdrf.txt) that was downloaded from ArrayExpress 
> and can potentially be used for making the target file, and I also attached 
> two raw data files of the ImaGene dataset as examples. The thing bothering me 
> is as follows:
>
> Extract 3538  and Extract 3526 (see column "Extract Name" of 
> E-NCMF-8_sdrf.txt file) , they do have one Cy5 and one matched Cy3 files, so 
> that's fine with me. but in particular, for "Extract reference pool of 61 
> HNSCC" (see E-NCMF-8_sdrf.txt file), there are multiple Cy3 and Cy5 for such 
> samples, how should we incorporate that into the target file?
>
> I intended to use the following code to deal with this ImaGene data
>
> targets<-readTargets()
> files<-targets[,c("FileNameCy3", "FileNameCy5")]
> RG<-read.maimages(files, source="imagene")
>
> but I need the right target file to start with particularly with the issue I 
> mentioned above.
>
> Also, for normalization, is
> RG<-backgroundCorrect(RG, method="normexp", offset=50) still appropriate for 
> ImaGene data?
>
> Thanks so much for your help!
>
> Ming Yi
> ABCC
> P.O.Box B, Bldg 430
> National Cancer Institute/SAIC-Frederick, Inc
> Frederick,Maryland
> USA