[BioC] Reading ArrayExpress data [was Re: questions on the ImaGene data using limma package]

Thu Oct 30 22:40:37 CET 2008

See http://www.bioconductor.org/docs/postingGuide.html.
Note that attachments are not permitted.

On Thu, 30 Oct 2008, Ming YI [Contr] wrote:

> Dear Gordon:
>
> Thanks a lot for your comments and suggestions. Following your suggestion, I 
> have successfully read all the data into limma objects using the generic 
> method, with the attached target file that I edited from their annotation 
> file and sent to you earlier. I assumed that the Cy3 channel is the common 
> reference, as you guessed.
>
> But the question remains, as you mentioned, of how they actually did the 
> experiment. Based on their E-NCMF-8.idf.txt file from ArrayExpress, it appears 
> to be a dye_swap_design, which is exactly what you guessed. So the data appear 
> to have been collated by ArrayExpress into data matrices with the Cy3 and Cy5 
> intensities in the same file for each sample. My concern is the "Label" column 
> in the file E-NCMF-8_sdrf.txt that I sent you in my last email: what do the 
> Cy3 and Cy5 entries mean for each sample? It looks like this column may give, 
> for each sample (and corresponding raw data file), the dye used for the 
> sample, with the other dye used for the common reference, which is not 
> mentioned in their annotation file. What do you think? If this is true, I may 
> need to change my target file accordingly to accommodate this information. 
> This assumption makes more sense, at least in explaining the repeated samples 
> in the dataset, which would be the dye-swap data.
>
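
If the "Label" column does name the dye used for the sample, each row of a targets file could be derived from it roughly as sketched below. This is only an illustration under that assumption: the file names are placeholders, and the column names Source, Label and File are hypothetical stand-ins for the SDRF columns, not the actual E-NCMF-8 layout.

```r
# Hypothetical SDRF-like rows: one row per array, with the dye used for the sample.
# (Column names and file names are placeholders, not the real E-NCMF-8 layout.)
sdrf <- data.frame(
  Source = c("3560", "3560"),
  Label  = c("Cy5", "Cy3"),
  File   = c("array1.txt", "array2.txt"),
  stringsAsFactors = FALSE
)

# The sample goes in the channel named by Label;
# the common reference is assumed to take the other channel.
targets <- data.frame(
  FileName = sdrf$File,
  Cy3 = ifelse(sdrf$Label == "Cy3", sdrf$Source, "Reference"),
  Cy5 = ifelse(sdrf$Label == "Cy5", sdrf$Source, "Reference"),
  stringsAsFactors = FALSE
)
print(targets)
```

With a table of this shape, limma's modelMatrix(targets, ref="Reference") would be expected to give the dye-swapped array the opposite sign in the design matrix.
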
> I have tried to contact them for details of the experimental design, which 
> should help to sort this out.
>
> By the way, I am not sure why my post did not go to the mailing list. I 
> changed the address a bit this time and hope it works.
>
> Thanks again for your help. Any additional suggestions would be appreciated 
> as well.
>
> Best regards,
>
> Ming
>
>
> At 09:25 PM 10/29/2008, Gordon K Smyth wrote:
>> Dear Ming,
>> 
>> Thank you for mailing me example data sets and the annotation spreadsheet 
>> from ArrayExpress.
>> 
>> You are assuming that the data from ArrayExpress are in ImaGene format. 
>> This is incorrect.  The reason that limma gives a special treatment to 
>> ImaGene files is that, unlike other image analysis software, ImaGene writes 
>> the Cy3 and Cy5 channels into separate files.  However ArrayExpress has 
>> collated the original data into data matrices with the Cy3 and Cy5 
>> intensities in the same file for each sample.  Therefore you should ignore 
>> all references to ImaGene in the limma manual, and instead use the 
>> instructions for generic two-color platforms.
>> 
>> The data sets you sent me can easily be read into limma using the 
>> instructions in the limma User's Guide starting page 14 "What should you do 
>> if your image analysis program is not in the above list?"  I demonstrate 
>> this below.
>> 
>> Your emails suggest that you have not yet read any two-color data into 
>> limma.  It is essential that you try some simple examples before trying a 
>> large dataset from ArrayExpress, which will have a complex structure you 
>> might not fully understand.
>> 
>> I don't fully understand the sample annotation file from ArrayExpress that 
>> you sent me, but I doubt that you are interpreting it correctly.  It is 
>> not in the format you need for a limma targets file.  My guess is that each 
>> row of the file corresponds to one array, and that each array has been 
>> hybridized with a common reference that is not mentioned in the annotation 
>> file.  This means that the repeated sample names you have noted do not 
>> represent matched Cy3 and Cy5 channels, but rather represent dye-swap 
>> technical replicates.  That is, they are separate arrays.
>> 
>> If my guess is correct, then a targets file would be something like below.
>> 
>> Let me emphasize that I do not offer a plug-in service to read experimental 
>> data posted to ArrayExpress.  It is your responsibility to figure out the 
>> experimental design and the ArrayExpress data formats. I am just 
>> guessing.
>> 
>> Best wishes
>> Gordon
>> 
>> 
>> READING YOUR DATA FILES
>> 
>>> f
>> [1] "E-NCMF-8-raw-data-1363346838.txt" "E-NCMF-8-raw-data-1363346856.txt"
>> 
>>> ann <- c("Database NCMF:DB:omadhuman",
>>          "Database ebi.ac.uk:Database:ens_trscrpt_id",
>>          "Feature coordinates: metaColumn","metaRow","column","row",
>>          "Reporter identifier","Reporter sequence type")
>> 
>>> columns <- list(Rf="ImaGene:Signal Mean_Cy5",
>>                 Rb="ImaGene:Background Median_Cy5",
>>                 Gf="ImaGene:Signal Mean_Cy3",
>>                 Gb="ImaGene:Background Median_Cy3")
>> 
>>> RG <- read.maimages(files=f,annotation=ann,columns=columns)
>> Read E-NCMF-8-raw-data-1363346838.txt
>> Read E-NCMF-8-raw-data-1363346856.txt
>> 
>>> dim(RG)
>> [1] 37632     2
>> 
>> 
>> A POSSIBLE TARGETS FILE
>> 
>>> targets <- readTargets()
>>> targets
>>                       Source            DiseaseState              ArrayDataMatrixFile       Cy3       Cy5
>> 1                       3560 Squamous Cell Carcinoma E-NCMF-8-raw-data-1363346838.txt Reference   SCC3560
>> 2 reference pool of 61 HNSCC Squamous Cell Carcinoma E-NCMF-8-raw-data-1363346856.txt Reference PoolHNSCC
>> 
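
A targets table like the one above can be kept as a tab-delimited text file. The sketch below writes the two rows shown to a temporary file and reads them back with base R's read.delim. The temporary path is an assumption made for illustration; readTargets() in limma looks for a file called Targets.txt in the working directory by default.

```r
# Rows copied from the possible targets file shown above.
targets <- data.frame(
  Source = c("3560", "reference pool of 61 HNSCC"),
  DiseaseState = c("Squamous Cell Carcinoma", "Squamous Cell Carcinoma"),
  ArrayDataMatrixFile = c("E-NCMF-8-raw-data-1363346838.txt",
                          "E-NCMF-8-raw-data-1363346856.txt"),
  Cy3 = c("Reference", "Reference"),
  Cy5 = c("SCC3560", "PoolHNSCC"),
  stringsAsFactors = FALSE
)

# Write as tab-delimited text (temporary path used here only for illustration).
path <- file.path(tempdir(), "Targets.txt")
write.table(targets, file = path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back; colClasses keeps every column, including Source, as text.
targets2 <- read.delim(path, stringsAsFactors = FALSE, colClasses = "character")
print(targets2)
```

Whether these two rows describe the arrays correctly still depends on the guess about the experimental design.
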
>> 
>> On Wed, 29 Oct 2008, Ming YI [Contr] wrote:
>> 
>>> Hi, Dear Gordon:
>>> 
>>> I tried to use limma to deal with an ImaGene dataset I downloaded from 
>>> ArrayExpress. I have never dealt with ImaGene data before and am not 
>>> familiar with the ImaGene data format, except for knowing that the Cy5 and 
>>> Cy3 signals are stored in two separate files for the same sample. I tried 
>>> to read the data into limma and normalize them there, but I keep running 
>>> into issues and errors. I hope you can help me in this regard:
>>> 
>>> I attached a file (E-NCMF-8_sdrf.txt), downloaded from ArrayExpress, that 
>>> could potentially be used to make the target file, and I also attached 
>>> two raw data files from the ImaGene dataset as examples. The thing 
>>> bothering me is as follows:
>>> 
>>> Extract 3538 and Extract 3526 (see column "Extract Name" of the 
>>> E-NCMF-8_sdrf.txt file) each have one Cy5 file and one matched Cy3 file, 
>>> so that's fine with me. But for "Extract reference pool of 61 HNSCC" in 
>>> particular (see the E-NCMF-8_sdrf.txt file), there are multiple Cy3 and 
>>> Cy5 files. How should we incorporate that into the target file?
>>> 
>>> I intended to use the following code to deal with this ImaGene data:
>>> 
>>> targets<-readTargets()
>>> files<-targets[,c("FileNameCy3", "FileNameCy5")]
>>> RG<-read.maimages(files, source="imagene")
>>> 
>>> but I need the right target file to start with, particularly given the 
>>> issue I mentioned above.
>>> 
>>> Also, for normalization, is
>>> RG<-backgroundCorrect(RG, method="normexp", offset=50) still appropriate 
>>> for ImaGene data?
>>> 
>>> Thanks so much for your help!
>>> 
>>> Ming Yi
>>> ABCC
>>> P.O.Box B, Bldg 430
>>> National Cancer Institute/SAIC-Frederick, Inc
>>> Frederick,Maryland
>>> USA
>