[BioC] annotating microarray data with mogene10stv1

Mon Jul 21 16:43:48 CEST 2014

Hi Jakub,

On 7/19/2014 8:30 AM, Jakub Stanislaw Nowak wrote:
> Hello everyone,
>
> This my first attempt so it may not be a perfect email. I am not very advanced in bioinformatics so I tried to be very detailed.
> Basically I am trying to annotate microarray dataset from Affymetrix using bioconductor.
>
> I used following steps:
>
> 1. loading libraries
>
>> library(annotate)
>> library(limma)
>> library(mogene10sttranscriptcluster.db)
>> library(affy)
>
> 2. I read my target files into annotated data frame using
>> adf <- read.AnnotatedDataFrame("target.txt",header=TRUE,row.names=1 ,as.is=TRUE)
>
> 3. I read my expression .CEL files into target files  using
>> mydata <- ReadAffy(filenames=pData(adf)$FileName,phenoData=adf)
>
>>> mydata
>> AffyBatch object
>> size of arrays=1050x1050 features (19 kb)
>> cdf=MoGene-1_0-st-v1 (34760 affyids)
>> number of samples=6
>> number of genes=34760
>> annotation=mogene10stv1
>> notes=
>
>
> 4. I normalised my data with ram
>> eset <- rma(mydata)
>
> when you check eset it looks ok
>>> eset
>> ExpressionSet (storageMode: lockedEnvironment)
>> assayData: 34760 features, 6 samples
>>    element names: exprs
>> protocolData
>>    sampleNames: GSM910962.CEL GSM910963.CEL ... GSM910967.CEL (6 total)
>>    varLabels: ScanDate
>>    varMetadata: labelDescription
>> phenoData
>>    sampleNames: GSM910962.CEL GSM910963.CEL ... GSM910967.CEL (6 total)
>>    varLabels: FileName Description
>>    varMetadata: labelDescription
>> featureData
>>    featureNames: 10338001 10338003 ... 10608724 (34760 total)
>>    fvarLabels: ID Symbol Name
>>    fvarMetadata: labelDescription
>> experimentData: use 'experimentData(object)'
>> Annotation: mogene10stv1
>
> 5. I was able to generate fold change vector for interesting samples but I have problem annotating eset.
>
> 6. I tried annotate the eset file using annotation from above using mogene10sttranscriptcluster.db or mogene10stprobes.db.
>
> #I build an annotation table
>
> ID <- featureNames(eset)
>
> Symbol <- getSYMBOL(ID, "mogene10sttranscriptcluster.db")
>
> Name <- as.character(lookUp(ID, "mogene10sttranscriptcluster.db", "GENENAME"))
>
> tmp <- data.frame(ID=ID, Symbol=Symbol, Name=Name,
> stringsAsFactors=F)
>
> tmp[tmp=="NA"] <- NA #fix padding with NA characters

There are two issues here. First, the all of the Gene ST arrays have a 
lot of control probes that tend to pollute your results. There is a 
function in my affycoretools package that you can use to remove them 
(getMainProbes).

Second, you don't want to annotate like that. The lookUp() function will 
by default ignore all probesets that have one-to-many mappings, and will 
return NA for them. Instead, use more current methods:

tmp <- select(mogene10sttranscriptcluster.db, ID, 
c("SYMBOL","GENENAME","ENTREZID"))

You will then get a message stating that you have multiple-mapping 
probesets. How you deal with that is up to you. Alternatives include 
concatenating:

tmp2 <- do.call("rbind", lapply(split(tmp, tmp$PROBEID), function(x) 
apply(x,2,function(y) paste(y[!duplicated(y)], collapse = " | "))))

I usually don't do that because I push results out via ReportingTools, 
and I like to have links to the Gene IDs. Plus some of the probesets map 
to a huge number of genes, so you can get really messy results. An 
alternative is to select genes at random, for those with multiple mappings:

tmp3 <- t(sapply(split(tmp, tmp$PROBEID),function(x) 
x[sample(seq_len(nrow(x)), 1),]))

Or my personal favorite, the most naive thing possible:

tmp4 <- tmp[!duplicated(tmp$PROBEID),]

Best,

Jim

>
>
>
> The problem I have is that for large number of IDs - all initial 6500 - I am getting Symbol and Name annotated as NA
> Here is the output for some of them
>
>>> Name[6500:6550]
>>   [1] "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA"
>> [22] "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA"
>> [43] "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" “NA"
>>
>>> Symbol[6500:6550]
>> 10344545 10344546 10344547 10344548 10344549 10344550 10344551 10344552 10344553 10344554 10344555 10344556
>>        NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA
>> 10344557 10344558 10344559 10344560 10344561 10344562 10344563 10344564 10344565 10344566 10344567 10344568
>>        NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA
>> 10344569 10344570 10344571 10344572 10344573 10344574 10344575 10344576 10344577 10344578 10344579 10344580
>>        NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA
>> 10344581 10344582 10344583 10344584 10344585 10344586 10344587 10344588 10344589 10344590 10344591 10344592
>>        NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA
>> 10344593 10344594 10344595
>>        NA       NA       NA
>
> #So if I assign it as feature data of the current Eset
>
> fData(eset) <- tmp
>
> #and perform stat with limma
>
> #Build the design matrix
> design <- model.matrix(~-1+factor(c(1,1,2,2,3,3)))
> colnames(design) <- c(“mock”,"siGFP”,"siLin28a",”")
>
>>> design
>>    mock siGFP siLin28a
>> 1    1     0        0
>> 2    1     0        0
>> 3    0     1        0
>> 4    0     1        0
>> 5    0     0        1
>> 6    0     0        1
>> attr(,"assign")
>> [1] 1 1 1
>> attr(,"contrasts")
>> attr(,"contrasts")$`factor(c(1, 1, 2, 2, 3, 3))`
>> [1] "contr.treatment"
>
>
> # instructs Limma which comparisons to make
> contrastmatrix <- makeContrasts(mock-siGFP,mock-siLin28a,siGFP-siLin28a,levels=design)
>
>>> contrastmatrix
>>            Contrasts
>> Levels     mock - siGFP mock - siLin28a siGFP - siLin28a
>>    mock                1               1                0
>>    siGFP              -1               0                1
>>    siLin28a            0              -1               -1
>
>   # make the contrasts
> fit <- lmFit(eset, design)
>
> fit2 <- contrasts.fit(fit, contrastmatrix)
> fit2 <- eBayes(fit2)
>
> #listed the top differentially expressed genes
>
>>> topTable(fit2,coef=1,adjust="fdr")
>>                 ID Symbol Name     logFC  AveExpr          t      P.Value adj.P.Val         B
>> 10342604 10342604   <NA> <NA>  1.831809 2.845863  14.037480 1.248257e-05 0.4338942 -3.694032
>> 10343224 10343224   <NA> <NA> -2.751868 2.658551 -10.992289 4.802754e-05 0.8347186 -3.715792
>> 10339733 10339733   <NA> <NA>  1.703917 3.426402   9.860457 8.674159e-05 1.0000000 -3.728996
>> 10341175 10341175   <NA> <NA> -2.861665 2.167471  -8.877976 1.526481e-04 1.0000000 -3.744350
>> 10340405 10340405   <NA> <NA> -1.368074 2.944242  -8.245297 2.263752e-04 1.0000000 -3.756919
>> 10343199 10343199   <NA> <NA> -1.238289 2.077839  -8.130892 2.437752e-04 1.0000000 -3.759472
>> 10339048 10339048   <NA> <NA>  1.101959 2.129288   7.949683 2.746238e-04 1.0000000 -3.763713
>> 10344413 10344413   <NA> <NA> -1.407850 2.259821  -7.810608 3.014011e-04 1.0000000 -3.767144
>> 10343919 10343919   <NA> <NA>  1.134134 2.650166   7.644642 3.374231e-04 1.0000000 -3.771452
>> 10338867 10338867   <NA> <NA>  1.162114 5.792621   7.454279 3.850651e-04 1.0000000 -3.776701
>
> As you can see just this small portion is already missing information about Symbol and Name.
>
> So my question is do I use a correct .db library for annotation? As it looks like I missing a lot of ID cannot be annotated
>
> How can I fix that problem?
>
> Many Thanks,
>
> Jakub
>
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099