[BioC] ath1121501probe_1.0 error (was GCRMA missing value error on ATH1 chip)

Fri Feb 13 15:31:21 MET 2004

Thanks,

The main problem is that I'm a complete beginner with R, trying to learn as
I go along. I've investigated a bit more, and because the ath1121501probe package
already corresponds to the ATH1-121501_probe_tab file, there's no point in
reading that file in.
What I did, with assistance from help, was:

> x <- ath1121501probe$Probe.Set.Name
> y <- unique(x, incomparables = FALSE)
> y

This returned a vector of the unique probeset names in the ath1121501probe
package. It contains 22814 probesets, which is 4 more than there are on the
chip! So this could account for the 43 extra sequences. The problem now is that
I don't know a simple way of finding and eliminating them.
I can get a vector of the actual probesets on the chip with:
>data <- ReadAffy()
>gn <- geneNames(data)

So basically I want to compare gn with y to find the 4 values that occur only
in y. I combined gn and y with:
> z <- c(gn, y)
But the function I want to use, 'uniquecombs', is in the mgcv package, and I
don't seem to be able to get that from CRAN for my Windows R 1.9 devel.

Is there an easier way?
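
One possibility, sketched here with toy vectors standing in for gn and y (the
IDs below are made up for illustration, not real ATH1 probeset names): base R's
setdiff() returns the elements of its first argument that do not occur in its
second, so no extra package should be needed.

```r
# Toy stand-ins (hypothetical IDs) for the two vectors described above:
# gn = probeset names from the CEL data, y = unique names from the probe package.
gn <- c("setA_at", "setB_at", "setC_at")
y  <- c("setA_at", "setB_at", "setC_at", "extra1_at", "extra2_at")

# Names that are present in y but absent from gn
extra <- setdiff(y, gn)
print(extra)

# Equivalent form using %in%, which keeps a logical index into y
extra2 <- y[!(y %in% gn)]
print(extra2)
```

With the real objects, the same call would be setdiff(y, gn), and the affected
rows of the probe package could then be listed with something like
ath1121501probe[ath1121501probe$Probe.Set.Name %in% extra, ].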

Thanks
Matt

-----Original Message-----
From: Robert Gentleman [mailto:rgentlem at jimmy.harvard.edu]
Sent: Friday, 13 February 2004 14:04
To: Matthew Hannah
Subject: Re: [BioC] ath1121501probe_1.0 error (was GCRMA missing value
error on ATH1 chip)

On Fri, Feb 13, 2004 at 12:59:48PM +0100, Matthew  Hannah wrote:
> Thanks,
> 
> I've investigated this some more and found that the ATH1-121501_probe_tab.zip
> file from the affy website contains 251,121 sequences whilst the CEL files and
> the ATH1-121501_probe_fasta.zip only contain 251,078 probes. It therefore seems
> that the errors were there in the tab file before the BioC ath1121501probe 
> package was made. I've emailed  affymetrix about it but don't expect a quick 
> response judging from past queries.
> 
> So does anyone know how to find the extra values in the tab file? It doesn't 
> look like there are simply extra values added at the start or finish. Does anyone
> familiar with R know how to obtain a list of Affy ID vs. # of probes from the
> ath1121501probe package or by reading in the ATH1-121501_probe_tab file? This 
> would be easy to cross-reference with the Affy ID vs. probe number that you get
> from the CEL file during MAS5 analysis.

 Basically you can look at the file format, then use scan to suck in
 pretty much anything. Stick the input into a data.frame and then
 compare the affy ids from the two files. I can't imagine that it is
 more than a 1/2 hour of work. And if you did it generally we could
 add it to one of the packages (matchprobes?) so that others could do
 the same if this problem resurfaces.
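
 A minimal sketch of the approach described above, under stated assumptions:
 the tab-delimited text, its column names, and the per-probeset counts for the
 CEL side are all toy values invented for illustration, and read.delim (which
 is built on scan) is used in place of a hand-written scan call.

```r
# Toy tab-delimited text standing in for the probe_tab file. The column
# names are assumptions, not the real Affymetrix header.
tab_text <- paste(
  "ProbeSetName\tProbeX\tProbeY\tSequence",
  "setA_at\t1\t1\tACGTACGTACGTACGTACGTACGTA",
  "setA_at\t2\t1\tCGTACGTACGTACGTACGTACGTAC",
  "setB_at\t3\t1\tGTACGTACGTACGTACGTACGTACG",
  "setB_at\t4\t1\tTACGTACGTACGTACGTACGTACGT",
  "setB_at\t5\t1\tACGTTCGTACGTACGTACGTACGTA",
  "setC_at\t6\t1\tCCGTACGTACGTACGTACGTACGTA",
  sep = "\n")

# Read the text into a data.frame
tab <- read.delim(text = tab_text, stringsAsFactors = FALSE)

# Number of probes per probeset according to the tab file
tab_df <- as.data.frame(table(tab$ProbeSetName), stringsAsFactors = FALSE)
names(tab_df) <- c("id", "n_tab")

# Hypothetical counts per probeset as they might come from the CEL/CDF side
cel_df <- data.frame(id = c("setA_at", "setB_at"),
                     n_cel = c(2L, 2L),
                     stringsAsFactors = FALSE)

# Full outer join on the ID, treating a missing ID as zero probes
cmp <- merge(tab_df, cel_df, by = "id", all = TRUE)
cmp$n_tab[is.na(cmp$n_tab)] <- 0
cmp$n_cel[is.na(cmp$n_cel)] <- 0

# Probesets whose probe counts disagree between the two sources
mismatch <- cmp[cmp$n_tab != cmp$n_cel, ]
print(mismatch)
```

 Keeping both counts side by side in one data.frame shows not only which
 IDs differ but also by how many probes.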

 Because of the costs involved in changing the layout of a chip I
 expect that Affy often has to drop some probes from the analysis. 
 However, since they do not seem to version anything (or at least not
 the last time I checked) there is very little point to checking until
 a problem is found - like now.

> 
> Has this been an issue for any other chips? Are we just trusting Affymetrix to
> provide the correct sequence data? I've seen some data showing that ~700 ATH1
> probesets don't match their intended target when an independent BLAST was done.
> 

  It would be nice to have a tool here too. We have played a bit with
  notions of sensitivity and specificity (do the probes go to the gene
  they are annotated at and do they only go there). That would not be
  too hard to do (although some substantial computing resource would
  be needed and again a lack of version numbers on Affy's part makes
  life a little hard). However, a somewhat larger problem looms and
  that is determining just what to blast against (and probably with
  short sequences I would not blast but rather use some sort of
  perfect matching - with 1 error algorithm; the Biostrings package
  has some stuff that could be used for this purpose).
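
  To illustrate what "perfect matching with 1 error" could mean, here is a
  base-R sketch, not the Biostrings implementation: it slides the probe along
  a target sequence and reports start positions where at most one base
  differs. The function names and the short toy sequences are assumptions made
  up for this example; real probes are longer and a dedicated package would be
  the practical choice for full-size data.

```r
# Count mismatching positions between two equal-length strings
n_mismatch <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Start positions in `target` where `probe` matches with at most
# `max_mm` mismatches (substitutions only, no insertions or deletions)
match_with_errors <- function(probe, target, max_mm = 1) {
  k <- nchar(probe)
  n <- nchar(target)
  if (n < k) return(integer(0))
  starts <- seq_len(n - k + 1)
  # All windows of length k in the target
  windows <- substring(target, starts, starts + k - 1)
  mm <- vapply(windows, n_mismatch, integer(1), b = probe, USE.NAMES = FALSE)
  starts[mm <= max_mm]
}

# Toy data: one exact occurrence and one occurrence with a single substitution
probe  <- "ACGTA"
target <- "TTACGTATTACCTATT"

exact_hits <- match_with_errors(probe, target, max_mm = 0)
near_hits  <- match_with_errors(probe, target, max_mm = 1)
print(exact_hits)
print(near_hits)
```

  A probe that hits only its annotated target at max_mm = 1 would count as
  specific under this definition; hits elsewhere would flag possible
  cross-hybridisation.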

  Robert

> Thanks
> Matt 
> 
> 
> 
> >Hi, 
> > there seems to be a disagreement on how many pm probes there are on the
> >chip. This is causing problems in matching the pm intensities with
> >sequences. I am not sure if this is true for all ATH1 chips...
> >
> >  After reading in your Cel file into "object",
> >###########
> >  pmIndex <-  unlist(indexProbes(object,"pm"))
> >  length(pmIndex)
> >  #[1] 251078
> >  #however the probe package gives 251121 pm probe sequences.
> >  length(get("ath1121501probe")$sequence)
> >  #[1] 251121
>   
> >  Right now I am not sure which should be fixed: whether the probe
> >package has some redundant sequences that are not PM probes, or
> >indexProbes missed some pm probes?
> 

-- 
+---------------------------------------------------------------------------+
| Robert Gentleman                 phone : (617) 632-5250                   |
| Associate Professor              fax:   (617)  632-2444                   |
| Department of Biostatistics      office: M1B20                            |
| Harvard School of Public Health  email: rgentlem at jimmy.harvard.edu        |
+---------------------------------------------------------------------------+