[BioC] edgeR question tags per million

David martin vilanew at gmail.com
Thu Jun 17 10:07:43 CEST 2010


Thanks Mark,
That would do the trick !!!!

I have also written some functions to do multiple comparisons in a loop 
as this is clear missing. Similar to what is in limma. Some functions 
are writtent at the bottom of this page. You might want to include this 
code snippet to do pairwise comparisons of all your conditions.
I'm not a R guru and probably people would do the same easily but it's 
quite useful to get some working code.


library(edgeR)
library(Biobase)
####
mydataFile contains the reads for each condition
#####
mydataFile=read.table(file="rnaseq_reads.txt",header=TRUE,row.names=1,sep="\t", 
check.names=FALSE)

I start reading a file that contains the link between tissus/cell names 
and description
###
pDataFile.txt
###
Experiment      Group
mouse pro B cells [09-002]      pro-bcells
mouse pre B cells [09-002]      pre-bcells
mouse immature B cells [09-002] Imm-bcells
mouse mature B cells (spleen) replicate 1 [09-002]      mBcells-1
mouse mature B cells (spleen) replicate 2 [09-002]      mBcells-2
mouse B cells activated in vitro with LPS/IL4 replicate 1 [09-002] 
LPS-IL4-1
mouse B cells activated in vitro with LPS/IL4 replicate 2 [09-002] 
LPS-IL4-2
mouse B cells activated in vitro with LPS/IL4 replicate 3 [09-002] 
LPS-IL4-3
mouse B cells activated in vitro with LPS/aIgD-Dextran [09-002] LPS-algD

pData=read.table(pDataFile,header=TRUE,row.names=1,sep="\t")

At this stage you need to make sure that the samples in your pData file 
are sorted the same way they are in your reads_file.

if (all(rownames(pData) == colnames(mydataFile)) == FALSE)
      {
        idx = which(rownames(pData) != colnames(mydataFile))
        for (i in 1:length(idx))
          {
            print(paste("pDATA", i))
            print(rownames(pData)[i])
            print(paste("mydataFile", i))
            print(colnames(mydataFile[i]))

          }
        stop ("Problems with pData file. check sorting")
      }

#Create your groups
phenotype <- new("AnnotatedDataFrame", data = pData) #All groups
group=c(as.character(unique(phenotype$Group)))
groups = factor(c(design.list(phenotype$Group,group)))
design <- model.matrix(~ 0+ groups)
colnames(design) <- c(group)


#Prepare contrast matrix
contrast.matrix <- design.pairs(c(group))
comparisons = names(contrast.matrix[1,])
l= length(comparisons)

#Run pairwise comparisons and plot smearplots
#
pdf("smearplots.pdf")
for (i in 1:l)
   {
     #print(paste("Running Comparison ",comparisons[i]))
     tmp = unlist(strsplit(comparisons[i],"-"))
     condA = tmp[1]
     condB = tmp[2]
     print(paste("Running Comparison ",condB, " vs ", condA) )

     #You submit condA vs condB but the comparison is done the other way 
around condB vs conD A. just need to remember, that's the way exacttest 
waorks !!!

     de.com <- exactTest( d, c( condA, condB ) )
     outputfile = paste(condB,"vs",condA,".csv",sep="")
     detags500.com <- rownames(topTags(de.com, n = 500)$table)
     title = paste("FC plot using common dispersion for ",condB," vs 
",condA, "\nhighlighted top 500 genes",sep="")
     plotSmear(d, de.tags = detags500.com, main = title)
     abline(h = c(-2, 2), col = "dodgerblue", lwd = 2)
     write.table(file=outputfile,de.com$table,sep="\t")
   }

and you're done. This creates a csv file for each comparison and a pdf 
file with all smearplots.
Would be useful to get volcanoplots as well but don't know how to do it.


#####
FUNCTIONS
#####

design.list <- function(phenotype,cond)
   {
n <- length(phenotype)
k = length(cond)
design <- matrix(1,1,(n))
for (i in 1:k)
   {
     #print(paste("Reading:",i))
     ind = (which(phenotype == cond[i]))
     design[,ind] = i
     #design[,ind] = cond[i]

   }
design
   }

design.pairs <- function(levels) {
n <- length(levels)
design <- matrix(0,n,choose(n,2))
rownames(design) <- levels
colnames(design) <- 1:choose(n,2)
k <- 0
for (i in 1:(n-1))
for (j in (i+1):n) {
k <- k+1
design[i,k] <- 1
design[j,k] <- -1
colnames(design)[k] <- paste(levels[i],"-",levels[j],sep="")
}
design
}

On 06/16/2010 02:17 PM, Mark Robinson wrote:
> Hi David.
>
> We generally recommend raw counts as input to edgeR.  Presumably you can
> back to this by reversing the TPM calculation --  multiply the TPM in a
> library by the total number of reads and divide by 1 million.  I presume
> this info is available.
>
> That is some seriously high dispersion.  You may also want to look into
> normalization -- see ?calcNormFactors or see the manual.
>
> Cheers,
> Mark
>
>> Hi,
>> I'm going through a sequencing dataset (GSE21630).
>> It's a 46 different libraries experiment with a large dispersion:
>>
>>   >  d$common.dispersion
>> [1] 3.101696
>>
>>   >  sqrt(d$common.dispersion)
>> [1] 1.761163
>>
>> I guess i need to cope with that as the sequencing libraries were
>> prepared with an old protocol not as efficient as current 1.5 solexa
>> protocol.
>>
>> With thar being said the librarires are given in tags per 1 million
>> (TPM). On average the experiment had 4 million reads per library. I was
>> wondering how edge analysis would be affected by using reads (sort of
>> normalized to 1million size library (TPM)) rather then using the total
>> number of reads.
>>
>> What do you think ?? Should i multiply 4 the current number of reads
>> given in TPM or can i stay with my libraries at 1million reads all.
>>
>> thanks,
>> david
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:10}}



More information about the Bioconductor mailing list