[BioC] Hu Gene 1.0 ST v1 microarray processing and analysis

James W. MacDonald jmacdon at med.umich.edu
Tue Jan 10 00:54:25 CET 2012


Hi Asha,

On 1/9/2012 5:17 PM, Azby Cdex wrote:
> Dear Friends,
>
> First of all, let me say that I am not an expert bioinformatician. I would
> like to do some basic microarray analysis using R & Bioconductor with CEL
> files obtained using Affymetrix HuGene-1.0-ST-v1 platform. I have so many
> questions and I tried to search and read several threads in the
> Bioconductor Help List and other webpages. My questions are related to or
> the same as many of the previous threads but after reading several of those
> answers, questions remain almost the same.
>
> The main question is regarding the number of genes probed in this platform.
> According to Affymetrix Data sheet on this platform there are 764,885
> distinct probes and 28,869 estimated genes. When I use ‘affy’ and use the
> function ‘ReadAffy()’ and ‘rma’ I get an expression set with 32321
> features. Very different from 28,869!

There is a difference between the number of genes interrogated and the 
number of probesets because there can be more than one probeset that 
interrogates a particular gene. Remember that this chip is supposed to 
interrogate transcripts, and that may include different splice variants.

>
> I read in most of the replies to previous threads that ‘affy’ should not be
> used for the analysis of this platform.  (It will be great if somebody can
> explain or point to relevant literature on the reasons for these
> differences). However, with ‘affy’ it automatically identifies the correct
> annotation file (at least the name ‘hugene10stv1’) and processes the CEL
> file without giving any error message or warning.

The reason to use oligo instead of affy is that the affy package 
(actually the makecdfenv package, which makes the cdf packages) was 
designed for an older chip style that never re-used probes for different 
probesets. In both the Gene and Exon chips, there are some probes that 
are part of more than one probeset. If you use the affy package, these 
probes will only be assigned to a single probeset.
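
To make the shared-probe point concrete, here is a toy sketch in base R. The probe and probeset names are made up, and keeping only the first assignment of each probe is an assumption for illustration, not a description of what makecdfenv actually does internally.

```r
# Hypothetical probe-to-probeset table: probe "p3" belongs to two probesets.
membership <- data.frame(
  probe    = c("p1", "p2", "p3", "p3", "p4"),
  probeset = c("A",  "A",  "A",  "B",  "B"),
  stringsAsFactors = FALSE
)

# Design that allows shared probes: every membership row is kept.
shared <- split(membership$probe, membership$probeset)

# One-probeset-per-probe design (assumed here: first assignment wins),
# so the shared probe "p3" is dropped from probeset "B".
single <- membership[!duplicated(membership$probe), ]
exclusive <- split(single$probe, single$probeset)

lengths(shared)     # A = 3, B = 2
lengths(exclusive)  # A = 3, B = 1
```

In this toy case probeset "B" is summarized from one probe instead of two, which is the kind of loss described above.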

>
> As suggested in many threads and in Bioconductor website I used the package
> ‘oligo’ for processing my HuGene10STv1 based CEL file. After summarizing at
> the core level using ‘rma’ function, I obtained an expression set object
> with 33297 features, and of course it is neither 28,869 nor 32321. Here the
> annotation used is ‘pd.hugene.1.0.st.v1’ instead of the ‘hugene10stv1’ in
> the previous case.
>
> I am fine with using ‘oligo’. [See, I am ‘blindly’ using a software, like
> most of the people! I found papers, even in prestigious journals, using
> ‘affy’ to process CEL files obtained using ‘hugene10stv1’ chip. Please help
> me to open my eyes or enlighten me (and many others)!] However, when I want
> to get gene Symbols corresponding to the transcripts, again there is a
> ‘number mismatch’. For example when I used the package
> 'hugene10sttranscriptcluster.db', I found that 21995 keys out of
> 33295 (not 33297) can be mapped to gene symbols. What happened to two of
> them? Or, with ‘oligo’, do I have to use something other than
> 'hugene10sttranscriptcluster.db' to convert ‘transcript ids’ to SYMBOLS or
> ENTREZIDs?

No, if you use oligo and the 'core' transcripts, then you want to use 
the hugene10sttranscriptcluster.db annotation package. Note that the 
annotation packages are made by taking the manufacturer's mapping of 
probesets to (usually) Entrez Gene IDs, and then using that mapping to 
get all the other annotation data. So any lack of probeset -> gene 
mapping is usually due to a lack of annotation by the manufacturer.
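
As a rough illustration of that two-step construction, here is a toy sketch in base R. All IDs and symbols are invented; the real tables live inside the annotation package.

```r
# Hypothetical manufacturer table: probeset ID -> Entrez Gene ID
# (NA means the manufacturer supplied no gene for that probeset).
probeset_to_entrez <- c(ps1 = "1001", ps2 = "1002", ps3 = NA, ps4 = "1001")

# Hypothetical gene-level table: Entrez Gene ID -> gene symbol.
entrez_to_symbol <- c("1001" = "GENEA", "1002" = "GENEB")

# Chain the two lookups; a missing Entrez ID propagates to a missing symbol.
probeset_to_symbol <- setNames(
  unname(entrez_to_symbol[probeset_to_entrez]),
  names(probeset_to_entrez)
)

sum(!is.na(probeset_to_symbol))              # probesets with a symbol: 3
length(unique(na.omit(probeset_to_symbol)))  # distinct genes: 2
```

Note that ps1 and ps4 share a gene, so the number of mapped probesets exceeds the number of genes. With the real package, the analogous lookup would be something along the lines of mapIds(hugene10sttranscriptcluster.db, keys = featureNames(bset), column = "SYMBOL", keytype = "PROBEID"), assuming a reasonably current AnnotationDbi.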


>
> I read that ‘affy’ can be used with "hugene10stv1.r3cdf", but there is
> no such thing available on the Bioconductor website among the annotation
> packages. Maybe that was applicable to an older Bioconductor release, as
> those threads were 2-3 years old. Doesn’t it imply that the currently
> available ‘hugene10stv1’ is the correct one to use with ‘affy’? On the
> other hand, if it cannot be used, why is it there in Bioconductor? Where do
> we use the annotation ‘hugene10stv1’?

I am not sure what version of the unsupported cdf we used to create the 
cdf package. I see that there is in fact an unsupported cdf on the Affy 
website with an 'r3' in the file name. You could hypothetically download 
that cdf file and use the makecdfenv package to create a cdf package 
yourself. However, this will suffer from the same shortcomings as the 
cdf that we supply. Note also that this isn't an annotation package. 
Instead, it is a package that tells the affy package which probe belongs 
in which probeset, used during the summarization step.

Best,

Jim



>
> I read there are other packages such as ‘aroma-affymetrix’, xps, etc, but I
> am trying to do some simple things with standard, official, ‘bioconductor’
> packages. Any suggestions and helpful hints are highly appreciated.
>
> Here are the commands that I used in Bioconductor version 2.8 (with R 2.13)
> [Yes, I will update to the most recent version soon!].
>
> As an example, I used the CEL file 'GSM857535.CEL.gz', downloaded from
> GEO.
>
>> library('affy')
>> as <- ReadAffy('GSM857535.CEL.gz')
>> as
>> aset <- rma(as)
>> aset
>> library('hugene10sttranscriptcluster.db')
>> x <- hugene10sttranscriptclusterSYMBOL
>> xx <- x[mappedkeys(x)]
>> length(x)
> [1] 33295
>> length(xx)
> [1] 21995
>> library('oligo')
>> bs <- read.celfiles('GSM857535.CEL.gz')
>> bs
>> bset <- rma(bs, target='core')
>> bset
>
> Thanks,
> Asha
>

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
