[BioC] Summarisation in oligo package after exclusion of probes containing SNPs

Sun Nov 3 17:00:12 CET 2013

Dear Benilton/BioC list,

Query: Is it possible to perform summarisation using the oligo package after exclusion of probes containing SNPs?

Background: I've performed an eQTL analysis where the expression data has been obtained on Affymetrix HuGene ST 1.1 microarrays.
After reading in the CEL files, I pre-processed the GeneFeatureSet doing background correction, quantile normalisation, and summarisation to 
gene-level using the rma function from the oligo package.

# What my real data looks like:
( cels <- read.celfiles(pathToFiles) ) # generates a GeneFeatureSet

GeneFeatureSet (storageMode: lockedEnvironment)
assayData: 1178100 features, 90 samples 
element names: exprs 
protocolData
rowNames: 75_459_CD16.CEL 76_460_CD16.CEL ... triad0078_H7_498_CD16.CEL (90 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
rowNames: 75_459_CD16.CEL 76_460_CD16.CEL ... triad0078_H7_498_CD16.CEL (90 total)
varLabels: cell_type info.batch.name ... age (29 total)
varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hugene.1.1.st.v1 

#Example workfow using oligoData for reproducibilty:

library(oligo)
library(oligoData)

data(affyGeneFS)
affyGeneFS        # a GeneFeatureSet

GeneFeatureSet (storageMode: lockedEnvironment)
assayData: 1102500 features, 33 samples 
element names: exprs 
protocolData
rowNames: TisMap_Brain_01_v1_WTGene1.CEL TisMap_Brain_02_v1_WTGene1.CEL ... TisMap_Thyroid_03_v1_WTGene1.CEL (33 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
rowNames: TisMap_Brain_01_v1_WTGene1.CEL TisMap_Brain_02_v1_WTGene1.CEL ... TisMap_Thyroid_03_v1_WTGene1.CEL (33 total)
varLabels: index
varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hugene.1.0.st.v1

# generate gene-level expression values:
eset <- rma(affyGeneFS)

# equivalent to:
bgCorrected <- backgroundCorrect(affyGeneFS)                          # Background correct
normalized <- normalize(bgCorrected,  method="quantile")              # Quantile normalise
eset2 <- rma(normalized, background=F, normalize=F, subset=NULL)      # Summarize with median polish

I would like to re-run the eQTL analysis, this time excluding probes from the expression data that contain
SNPs in >1% of CEU individuals to see whether this has any substantial effect on my findings. My concern is
the possibility of apparent eQTLs which are in fact artefacts due to less efficient binding between probes
and transcripts containing the minor allele.

My questions are- 

1) Having obtained a list of the probes to exclude, is it valid to simply perform BG correction and quantile normalisation on the whole GeneFeatureSet,
but to then remove probes containing SNPs prior to summarisation? Would the validity of mapping the resulting probesets to Entrez ids etc.
be questionable if some probes had been excluded?

2) If such an approach is reasonable, how can I map Affymetrix probe-level identifiers to the GeneFeatureSet?

# I know how to get feature ids for PM probes and their groupings into gene-level probesets:
featureInfo <- oligo:::stArrayPmInfo(normalized, target = 'core')

head(featureInfo,10)

       fid  fsetid
1   116371 7892501
2   943979 7892501
3   493089 7892501
4   907039 7892501
5  1033309 7892502
6   653512 7892502
7   690769 7892502
8   997409 7892502
9   379979 7892503
10  469846 7892503

The fsetid s correspond to gene-level probesets. Are the fid s row indices in the GeneFeatureSet,
and how can they be linked to Affy probe level ids?

Assuming I can map Affy probe ids to the fid column, here is my proposed solution for
summarisation after removal of selected probes, but I wonder if there is a more straightforward way?

# for demo purposes generate a random list of fids to remove
# some fid s are duplicated in the data.frame as they map to multiple probesets, so use 'unique'
set.seed(1234)
fidToCut <- sample(unique(featureInfo$fid), 10000, replace=F)

# remove rows from GeneFeatureSet corressponding to fid s we want to exclude
normalized2 <- normalized[-fidToCut, ]

# create mapping for the new GeneFeatureSet of how old row indices map to new indices
map <- cbind(new_ind= 1:nrow(normalized2), old_ind= featureNames(normalized2) )

# pmi gives row indices of pm probes in 'normalized', not 'normalized2'
pmi <-  featureInfo[["fid"]] 

# re-order dataframe so 'old_ind' column is the same as pmi
map1 <- map[match(pmi, map[,"old_ind"]),]
map1 <- cbind(map1, featureInfo)

tail(map1)
       new_ind old_ind    fid  fsetid
861488  961136  969962 969962 8180418
861489  208831  210719 210719 8180418
861490  870109  878113 878113 8180418
861491  598190  603636 603636 8180418
861492  858752  866642 866642 8180418
861493  713834  720353 720353 8180418

# we need to get rid of rows with NA... e.g. where 'fid' but no 'new_ind' exists:
# effectively recreating a new 'featureInfo' dataframe
map1 <- na.omit(map1)

# now we can get the new row indices, and the probesets they belong to
pmi <-  as.numeric( as.character( map1[["new_ind"]] )) # 'new_ind' got coerced to a factor
pnVec <- as.character(map1[["fsetid"]])

# subset the normalized probe-level values to keep probes by indexes in pmi
pms <- exprs(normalized2)[pmi, , drop = FALSE]
dimnames(pms) <- NULL
colnames(pms) <- sampleNames(normalized2)
# get a matrix of summarized values, which can be used to create an ExpressionSet
theExprs <- basicRMA(pms, pnVec, normalize=F, background=F)  

I'd greatly appreciate any advice. I'm a biologist by background so apologies if this is rather basic.
Thanks very much
Jimmy Peters, Smith Lab, Cambridge Institute for Medical Research.

 -- output of sessionInfo(): 

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pd.hugene.1.0.st.v1_3.8.0 oligoData_1.8.0           biomaRt_2.18.0            pd.hugene.1.1.st.v1_3.8.0 RSQLite_0.11.4            DBI_0.2-7                
 [7] oligo_1.26.0              Biobase_2.22.0            oligoClasses_1.24.0       BiocGenerics_0.8.0       

loaded via a namespace (and not attached):
 [1] affxparser_1.34.0     affyio_1.30.0         BiocInstaller_1.12.0  Biostrings_2.30.0     bit_1.1-10            codetools_0.2-8       ff_2.2-12            
 [8] foreach_1.4.1         GenomicRanges_1.14.1  IRanges_1.20.0        iterators_1.0.6       preprocessCore_1.24.0 RCurl_1.95-4.1        splines_3.0.2        
[15] stats4_3.0.2          tools_3.0.2           XML_3.95-0.2          XVector_0.2.0         zlibbioc_1.8.0    

--
Sent via the guest posting facility at bioconductor.org.