[BioC] Problem reading Cel files - Oligo Package

Wed Aug 28 17:13:58 CEST 2013

On Wed, Aug 28, 2013 at 7:44 AM, James W. MacDonald <jmacdon at uw.edu> wrote:
> Hi Atul,
>
>
> On 8/27/2013 11:18 PM, Atul wrote:
>>
>> Hi All,
>>
>> I am trying to read four *.Cel files into oligo and getting this error:
>>
>> > celFiles <- list.celfiles()
>> > celFiles
>> [1] "Iris.CEL" "Liv1.CEL" "Liv2.CEL" "Liv3.CEL"
>> > AF_data = read.celfiles(celFiles)
>> All the CEL files must be of the same type.
>> Error: checkChipTypes(filenames, verbose, "affymetrix", TRUE) is not TRUE
>>
>> Then I tried reading the files separately (one by one) and found that one
>> sample (Iris.CEL) reports the annotation package 'pd.huex.1.0.st.v1', while
>> the rest (Liv1, Liv2, Liv3) report 'pd.huex.1.0.st.v2'. I checked on GEO and
>> found that, although the samples are from different studies, they were all
>> generated on the same chip, the Human Exon 1.0 ST Array. The sample that
>> gives the error (Iris.CEL) has 'HuEx-1_0-st-v2.r2.dt1.hg18.core.ps' listed
>> under its data processing description, which suggests it is also version 2
>> of HuEx 1.0 ST.
>>
>> So I explicitly specified the annotation package 'pd.huex.1.0.st.v2' instead
>> of the one detected by oligo ('pd.huex.1.0.st.v1'), and the file is read
>> without any problem:
>>
>> > celFiles <- list.celfiles()
>> > celFiles
>> [1] "Iris.CEL"
>> > AF_data = read.celfiles(celFiles,pkgname='pd.huex.1.0.st.v2')
>> Platform design info loaded.
>> Reading in : Iris.CEL
>>
>> But if I add the other files and try the same thing, then the error is back:
>> > celFiles <- list.celfiles()
>> > celFiles
>> [1] "Iris.CEL" "Liv1.CEL" "Liv2.CEL" "Liv3.CEL"
>> > AF_data = read.celfiles(celFiles,pkgname='pd.huex.1.0.st.v2')
>> All the CEL files must be of the same type.
>> Error: checkChipTypes(filenames, verbose, "affymetrix", TRUE) is not TRUE
>>
>>
>> Can anybody please tell me why the annotation package for Iris.CEL, which is
>> from HuEx 1.0 ST v2 (according to the NCBI GEO description), is recognized as
>> 'pd.huex.1.0.st.v1'? If I explicitly give the package name pd.huex.1.0.st.v2
>> and read Iris.CEL alone, it works. But if I read it together with the other
>> CEL files that use the same annotation (pd.huex.1.0.st.v2), it gives an error.
>
>
> The Iris.cel file is a HuEx-1_0-st-v1, according to the header in that file:
>
>> sapply(fls, oligo:::getCelChipType, useAffyio=TRUE)
> GSM1008547_02_V-2_Pool-Normal-Iris_11-18-09_S1.CEL.gz
>                                      "HuEx-1_0-st-v1"
>                                      GSM486433.CEL.gz
>                                      "HuEx-1_0-st-v2"
>
> And the others you are trying to read are version 2. It doesn't really
> matter what GEO says, as the information on GEO comes from the submitter, and
> they evidently made a mistake.
>
> I don't know what, if any, differences there are between the two versions.
> In addition, there isn't anything I can see on the Affy website that says
> what differences there may be. Certainly they have the same number of probes
> and the probe IDs are all the same.

I have some old notes on this at
http://aroma-project.org/chipTypes/HuEx-1_0-st-v2:

"Note II: Older CEL files for this chip type, may be reported to have
chip type 'HuEx-1_0-st-v1'.  This chip is slightly different from the
'HuEx-1_0-st-v2' chip.  According to Affymetrix support, the
difference is only in the control probes; "There is only a minor
difference between the v1 and the v2 library files and it has to do
with the manufacturing controls on the array. There is no difference
with the probes interrogating the exons between v1 and v2.", cf.
Thread 'Discussion on affymetrix-defined-transcript-clusters' (Nov
25-Dec 2, 2008).  We don't have details on the exact differences and
we don't have access to the HuEx-1_0-st.v1.CDF (please fwd if you have
it), but from Affymetrix' feedback it sounds like one could use the
new HuEx-1_0-st-v2.CDF. "

I guess one could compare the probe sequences for the two to
ultimately find out how they differ.
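
A minimal sketch of such a comparison in base R, under loud assumptions: the two probe tables, their column names ('fid', 'sequence') and the helper compare_probe_tables() are all hypothetical, and the toy sequences are made up. Real tables would have to be extracted from the v1 and v2 library or annotation files, which this sketch does not attempt.

```r
# Hypothetical probe tables: one row per probe, with a feature ID ('fid')
# and its sequence. The values below are invented for illustration only.
probes_v1 <- data.frame(fid = c(1L, 2L, 3L, 4L),
                        sequence = c("ACGT", "TTGA", "GGCC", "CATG"),
                        stringsAsFactors = FALSE)
probes_v2 <- data.frame(fid = c(1L, 2L, 3L, 5L),
                        sequence = c("ACGT", "TTGA", "GGCA", "CATG"),
                        stringsAsFactors = FALSE)

# Hypothetical helper: report probes present in only one table, and probes
# present in both tables whose sequences differ.
compare_probe_tables <- function(a, b) {
  # Inner join on the feature ID; shared columns get .a / .b suffixes
  both <- merge(a, b, by = "fid", suffixes = c(".a", ".b"))
  list(
    only_in_a = setdiff(a$fid, b$fid),
    only_in_b = setdiff(b$fid, a$fid),
    changed   = both$fid[both$sequence.a != both$sequence.b]
  )
}

diffs <- compare_probe_tables(probes_v1, probes_v2)
str(diffs)
```

If Affymetrix support is right, one would expect any differences found this way to fall among the manufacturing control probes only.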

/Henrik

> So you can combine:
>
>> fls <- dir(pattern = "CEL.gz")
>> dat1 <- read.celfiles(fls[1], pkgname="pd.huex.1.0.st.v2")
> Loading required package: pd.huex.1.0.st.v2
> Loading required package: RSQLite
> Loading required package: DBI
> Platform design info loaded.
> Reading in : GSM1008547_02_V-2_Pool-Normal-Iris_11-18-09_S1.CEL.gz
>> dat2 <- read.celfiles(fls[2]) ## note that you would use all three of the
>> ## other celfiles for this step
> Platform design info loaded.
> Reading in : GSM486433.CEL.gz
>> dat <- combine(dat1, dat2)
> Warning messages:
> 1: In alleq(levels(x[[nm]]), levels(y[[nm]])) : 1 string mismatch
> 2: data frame column 'exprs' levels not all.equal
> 3: In alleq(levels(x[[nm]]), levels(y[[nm]])) : 1 string mismatch
> 4: data frame column 'dates' levels not all.equal
>> all.equal(featureNames(dat1), featureNames(dat2))
> [1] TRUE
>> dat
> ExonFeatureSet (storageMode: lockedEnvironment)
> assayData: 6553600 features, 2 samples
>   element names: exprs
> protocolData
>   rowNames: GSM1008547_02_V-2_Pool-Normal-Iris_11-18-09_S1.CEL.gz
>     GSM486433.CEL.gz
>   varLabels: exprs dates
>   varMetadata: labelDescription channel
> phenoData
>   rowNames: GSM1008547_02_V-2_Pool-Normal-Iris_11-18-09_S1.CEL.gz
>     GSM486433.CEL.gz
>   varLabels: index
>   varMetadata: labelDescription channel
> featureData: none
> experimentData: use 'experimentData(object)'
> Annotation: pd.huex.1.0.st.v2
>
> You should note however that this isn't a recommendation on my part that you
> should do this. I don't know what these data are, nor what you are planning
> to do with them. In general combining data from two or more completely
> different experiments is a very tricky endeavor. Using something like fRMA
> (if there are frozen estimates for this chip type) might be a better way to
> go.
>
> Best,
>
> Jim
>
>
>
>>
>> NCBI GEO ID:
>> Iris.cel - GSM1008547
>> Liv1/2/3 - GSM486433/GSM486434/GSM486435
>>
>> Awaiting help.
>>
>> AK
>>
>>
>> Session Info:
>>
>> > sessionInfo()
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
>>  [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=C                 LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] pd.huex.1.0.st.v2_3.8.0 RSQLite_0.11.4          DBI_0.2-7               oligo_1.24.2            Biobase_2.20.1          oligoClasses_1.22.0
>> [7] BiocGenerics_0.6.0
>>
>> loaded via a namespace (and not attached):
>>  [1] affxparser_1.32.3     affyio_1.28.0         BiocInstaller_1.10.1  Biostrings_2.28.0     bit_1.1-10            codetools_0.2-8
>>  [7] ff_2.2-11             foreach_1.4.0         GenomicRanges_1.12.4  IRanges_1.18.1        iterators_1.0.6       preprocessCore_1.22.0
>> [13] splines_3.0.1         stats4_3.0.1          tools_3.0.1           zlibbioc_1.6.0
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
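
The workflow discussed in this thread (check the chip type stored in each CEL header, read each group separately, then combine) can be sketched as follows. This is only an illustration: the chip types are typed in by hand from the values reported above, and the variable names chip_types and files_by_type are assumptions. The oligo calls are left as comments because they need the CEL files and pd.huex.1.0.st.v2 to be available.

```r
# Chip types as reported by the CEL headers. In practice this named vector
# could come from the call quoted earlier in the thread:
#   chip_types <- sapply(fls, oligo:::getCelChipType, useAffyio = TRUE)
# Here it is entered by hand from the output shown in this thread.
chip_types <- c(
  "GSM1008547_02_V-2_Pool-Normal-Iris_11-18-09_S1.CEL.gz" = "HuEx-1_0-st-v1",
  "GSM486433.CEL.gz" = "HuEx-1_0-st-v2"
)

# split() groups the file names by the chip type found in their headers,
# so files that would trip checkChipTypes() end up in separate groups.
files_by_type <- split(names(chip_types), unname(chip_types))
print(files_by_type)

# Each group could then be read on its own with an explicit pkgname and the
# results combined, as in the quoted session (not run here):
#   sets <- lapply(files_by_type, read.celfiles, pkgname = "pd.huex.1.0.st.v2")
#   dat  <- Reduce(combine, sets)
```

Whether combining the groups is sensible for a given analysis is a separate question, as noted in the quoted reply.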