[BioC] Drosophila tiling array background correction

Tobias Straub tstraub at med.uni-muenchen.de
Wed Jul 16 18:22:34 CEST 2008

you can extract the signals manually (watch out that takes some time..).



#here the location of your bpmap
bpmap_file <- "Dm35b_MR_v02-3_BDGPv4h.bpmap"
bpmap <- readBpmap(bpmap_file)

# you get a list of chromosome mappings only a few of them are  
Drosophila ones
# these you specify in here:
allchrs <- c("Dm:BDGPv4h;chr2L",

cel_files <- c("file1.cel","file2.cel",
	"file3.cel", "file4.cel",
	"file5.cel", "file6.cel")

cel_header <- readCelHeader(cel_files[1])
cel_indices <- lapply(bpmap, function(x) xy2indices(x[["pmx"]],  
x[["pmy"]], nr=cel_header[["cols"]]))

data <- readCelIntensities(cel_files, cel_indices[[chr]])
signals <- data
colnames(signals) <- cel_files
layout <- data.frame(chr=
	factor(bpmap[[chr]][["seqInfo"]][["name"]]), seq=
	bpmap[[chr]][["probeseq"]], pos=bpmap[[chr]][["startpos"]],
	stringsAsFactors = FALSE)
layout$probe.id <- paste(layout[,1], layout[,3], sep="_")
lm <- layout

for (chr in allchrs[2:length(allchrs)]) {
	data <- readCelIntensities(cel_files, cel_indices[[chr]])
	signals <-rbind(signals, data)
	layout <- data.frame(chr=
		factor(bpmap[[chr]][["seqInfo"]][["name"]]), seq=
		bpmap[[chr]][["probeseq"]], pos=bpmap[[chr]][["startpos"]],
		stringsAsFactors = FALSE)
	layout$probe.id <- paste(layout[,1], layout[,3], sep="_")
	lm <- rbind(lm, layout)

layout<-data.frame("probe.id"=lm$probe.id, "chr"=sub("chr","",lm$chr),  
"pos"=lm$pos, "sequence"=lm$seq)


you end up with a 'signal' matrix and a 'layout' dataframe
the signals you can easily normalize without having them in an  
AffyBatch object.

sth. like

norm <- exprs(vsn2(signals, subsample=200000))

or whatever else


On Jul 16, 2008, at 4:01 PM, Steve Lianoglou wrote:

> Hi Ben,
> I don't really have any answers for you, but I'm interested in  
> hearing them myself as well :-)
> I'm making a few comments below ... I'm sorry, they turned out to be  
> rather long and I'm not making any claim that I know exactly what  
> I'm doing, I'm just trying what makes most sense to me at the moment.
> If anybody has any positive/negative feedback, I'd be happy to hear.
> On Jul 15, 2008, at 8:53 PM, Benjamin Lansdell wrote:
>> Hi,
>> I have been trying to analyse drosophila development tiling arrays  
>> using Bioconductor. In particular I would like to perform some form  
>> of background correction for probe affinities (such as GCRMA) and  
>> then quantile normalisation but have been unable to find any  
>> packages that can do so without a .cdf file. I have a collection  
>> of .cel files and a .bpmap file.
> I've been wrestling with 1.0R Drosophila tiling array myself and  
> haven't really been able to use the AffyBatch objects I get (from  
> reading their .CEL files with ReadAffy) with any success. By that I  
> mean I haven't been able to then pass the batch object into any  
> other processing step for further analysis/QA assessment, even  
> though I think I made the appropriate CDFs correctly (more on that  
> below)
> So, really the only thing I use the AffyBatch object is to just get  
> the expression info from it:
> > exps <- exprs(myAffyBatch)
> You can then quantile normalize the data by just passing the exps  
> expression matrix into the affy::normalize.quantiles function since  
> it only expects a matrix.
> I've also done my own MvA plots for rudimentary QA.
>> I know the package oligo can perform RMA using a package that you  
>> build from a .bpmap file but this isn't quite what I want.
> Can it? My impression was that RMA relies on the idea of "probe  
> sets," which I don't think really applies to tiling arrays as much.  
> I see that it does SNPRMA ... I don't really know what the details  
> of that are, though.
> Certainly you can annotate your probes to construct your own probe  
> sets, though, but you also have a majority of probes on the array  
> that don't really belong to any probe set.
>> Is it possible to create the necessary structures myself  
>> (AffyBatch, something that contains the probe sequences, etc) for  
>> use with the many packages that seem to rely on a .cdf file?
> I've built the appropriate CDF-like structure (I think) by using the  
> makePlatformDesign::makePDpackage, but still never really got it act  
> like a CDF for my affybatch objects like Bioconductor expects. If  
> you haven't done so, I've put mine up previously for someone else to  
> download, and you can still grab them. They were built for both  
> chips of the 1.0R version of the tiling array (forward and backward  
> strand (aka MF and MR)) on a 64-bit Linux machine:
> http://cbio.mskcc.org/~lianos/files/bioconductor/
> I've been manually attacking this data, probably doing more work  
> than necessary since I can't figure out how to leverage most of  
> bioconductor tools just yet.
> Ideally I'd like to take a similar approach to what was done here:
> http://genomebiology.com/2008/9/7/R112
> The methods appeared earlier in the proceedings of this year's PSB  
> conference. They've also provided MATLAB source code implementing  
> their analysis techniques:
> http://www.fml.tuebingen.mpg.de/raetsch/projects/PSBTiling
> In reality, as a first approach, I'm probably just going to  
> implement (in R) their slightly modified version of M. Gerstein's  
> Sequence Quantile Normalization:
> http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/8/988
> and explore the signals I'm getting off different loci of the  
> genome. I'm hoping I can pull this off within the time constraints  
> of my current rotation project. If others find it useful, and I'm  
> successful, I'd be happy to share it back w/ the R/BioC community.
> In my adventures, I've created some SQLite database that you might  
> find useful:
> (i) A DB with all of the annotation data for the Dmel (v5) genome  
> (protein coding genes, RNA's, exonic positions, etc) from the GTF  
> file taken from ensembl:
> ftp://ftp.ensembl.org/pub/current_gtf/
> (ii) A DB with meta data about the tiling chips, this includes the  
> following tables:
>  (a) probe information as extracted from the bpmap file (sequence, x/ 
> y chip coordinates, pm/mm, etc)
>  (b) alignment information for every probe to the latest release of  
> the genome. For this I used MUMmer and had it return all alignments  
> >= 23 NTs in length (taking Wolfgang Huber's choice from his  
> tilingArray package)
>  (c) a table used for quick lookup of how many times each probe  
> aligns perfectly, and "almost perfectly" to the reference genome.
> I'm now building RData probe_annotation objects for each chromosome  
> that tie this information together. This data.frame essentially has:
> (i) probe information (sequence, x/y, pm/mm)
> (ii) number of perfect/close hits for the probe
> (iii) where in the genome the probe (perfectly) lands: gene name,  
> type of gene (protein coding, miRNA, snoRNA, rRNA, etc) and whether  
> it's exonic/intronic.
> If you think any of these would be helpful to you, let me know and  
> I'll try to put them up when I think they're reasonably correct.
> Like I said before, I'm probably not leveraging the BioC tools that  
> are available as I've gotten lost in the myriad of options (which  
> are good to have) that are there.
> -steve
> --
> Steve Lianoglou
> Graduate Student: Physiology, Biophysics and Systems Biology
> Weill Cornell Medical College
> http://cbio.mskcc.org/~lianos
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

Dr. Tobias Straub         Adolf-Butenandt-Institute, Molecular Biology
tel: +49-89-2180 75 439         Schillerstr. 44, 80336 Munich, Germany

More information about the Bioconductor mailing list