[BioC] Help With RNA-seq

Thomas J Hardcastle tjh48 at cam.ac.uk
Wed Feb 1 18:20:51 CET 2012


Well, this is embarrassing...

A bug in the code for getPriors.Pois meant that it tripped into an 
infinite loop. I've posted a fix to the release track of Bioconductor; 
version 1.8.2 should allow getPriors.Pois to work. I suppose that nobody 
finding this before means that most people are using the negative 
binomial approach - as they should be!

Tom

On 01/02/12 16:50, Valerie Obenchain wrote:
> Thinking about this a bit more ...
>
> Have you tried modifying the 'maxit' argument to getPriors.Pois() ? 
> It's possible this threshold could be reduced; it would also confirm 
> that the algorithm is converging (if you reduced it to a point where 
> you saw an error).
>
> If I understand correctly, you are not seeing any error messages but 
> getPirors.Pois() and getPrior.NB() are taking a long time to run. (fyi 
> per your comment below, CDP.Poi at priors is just accessing the data in 
> the slot of the object; it is not a function). Without having your 
> data to test on it is difficult to know what is going on. It would be 
> useful to know what kind of machine you are working on, how may cpus, 
> how much memory etc.
>
> I'm cc'ing Thomas (package author) in case he has ideas.
>
> Valerie
>
>
> On 01/31/2012 07:41 PM, Valerie Obenchain wrote:
>> Hi Tina,
>>
>> I'm pasting in your message below so we can keep all communication on 
>> the mailing list.
>>
>> The 'samplesize' argument looks wrong in your call to compute the 
>> priors.
>>
>>     CDP.Poi<- getPriors.Pois(CD, samplesize = 2, takemean = TRUE, cl 
>> = cl)
>>
>> Why have you set this to 2? It should be a much larger number. Try 
>> using the default (1e5),
>>
>>     CDP.Poi<- getPriors.Pois(CD, cl = cl)
>>
>> Does this speed it up?
>>
>> Valerie
>>
>> ## 
>> -----------------------------------------------------------------------
>>
>> Hi Valerie,
>>
>> Thank you once again for all your help.
>>
>> As you requested in the previous email for a clearer explanation to 
>> the problem I am encountering at present.
>>
>>  head(D)
>>    R1L1Kidney R1L2Liver R1L3Kidney R1L4Liver R1L6Liver R1L7Kidney 
>> R1L8Liver
>> 10          0         0          0         0         0          
>> 2         1
>> 15          4        35          7        32        31          
>> 3        29
>> 17          0         2          0         0         0          
>> 1         0
>> 18        110       177        131       135       141        
>> 149       148
>> 19      12685      9246      13204      9312      8746      
>> 12403      8496
>> 22          0         1          0         0         0          
>> 0         0
>>    R2L2Kidney R2L3Liver R2L6Kidney
>> 10          4         0          3
>> 15          6        34          7
>> 17          1         0          0
>> 18        112       145        118
>> 19      13031      9070      13268
>> 22          1         0          0
>>
>> names(D)
>>  [1] "R1L1Kidney" "R1L2Liver"  "R1L3Kidney" "R1L4Liver"  "R1L6Liver"
>>  [6] "R1L7Kidney" "R1L8Liver"  "R2L2Kidney" "R2L3Liver"  "R2L6Kidney"
>>
>> dim(D)
>> [1] 22490    10
>>
>>  g<- gsub("R[1-2]L[1-8]", "", colnames(D))
>>
>>> g
>>  [1] "Kidney" "Liver"  "Kidney" "Liver"  "Liver"  "Kidney" "Liver"  
>> "Kidney"
>>  [9] "Liver"  "Kidney"
>>
>>
>>> d<- DGEList(counts = D, group = substr(colnames(D), 5, 30))
>> Calculating library sizes from column totals.
>>
>>> d
>> An object of class "DGEList"
>> $samples
>>             group lib.size norm.factors
>> R1L1Kidney Kidney  1804977            1
>> R1L2Liver   Liver  1691734            1
>> R1L3Kidney Kidney  1855190            1
>> R1L4Liver   Liver  1696308            1
>> R1L6Liver   Liver  1630795            1
>> R1L7Kidney Kidney  1742426            1
>> R1L8Liver   Liver  1575425            1
>> R2L2Kidney Kidney  1927517            1
>> R2L3Liver   Liver  1767339            1
>> R2L6Kidney Kidney  1963420            1
>>
>> $counts
>>    R1L1Kidney R1L2Liver R1L3Kidney R1L4Liver R1L6Liver R1L7Kidney 
>> R1L8Liver
>> 10          0         0          0         0         0          
>> 2         1
>> 15          4        35          7        32        31          
>> 3        29
>> 17          0         2          0         0         0          
>> 1         0
>> 18        110       177        131       135       141        
>> 149       148
>> 19      12685      9246      13204      9312      8746      
>> 12403      8496
>>    R2L2Kidney R2L3Liver R2L6Kidney
>> 10          4         0          3
>> 15          6        34          7
>> 17          1         0          0
>> 18        112       145        118
>> 19      13031      9070      13268
>> 22485 more rows ...
>>
>> $all.zeros
>>    10    15    17    18    19
>> FALSE FALSE FALSE FALSE FALSE
>> 22485 more elements ...
>>
>> d$samples
>>             group lib.size norm.factors
>> R1L1Kidney Kidney  1804977            1
>> R1L2Liver   Liver  1691734            1
>> R1L3Kidney Kidney  1855190            1
>> R1L4Liver   Liver  1696308            1
>> R1L6Liver   Liver  1630795            1
>> R1L7Kidney Kidney  1742426            1
>> R1L8Liver   Liver  1575425            1
>> R2L2Kidney Kidney  1927517            1
>> R2L3Liver   Liver  1767339            1
>> R2L6Kidney Kidney  1963420            1
>>
>>> names(d)
>> [1] "samples"   "counts"    "all.zeros"
>>
>>
>>> dim(d)
>> [1] 22490    10
>>
>> Yes the plot does work .
>>
>> However, it is the call to CDP.Poi at priors that is taking that long.
>>
>> With regards to the rest of the code it will follow the same approach 
>> with as the one above with slight changes.
>>
>> Also the call to getPriors.NB() is also taking very long as well.
>>
>> These two calls are in essence the main contributions if the rest of 
>> the code is to work.
>>
>> Thank you so much for your help.
>>
>> Have a pleasant day.
>>
>>
>> ## 
>> -----------------------------------------------------------------------
>>
>>
>>
>>
>>
>> On 01/29/12 20:54, Valerie Obenchain wrote:
>>> On 01/29/12 07:49, Tina Asante Boahene wrote:
>>>> Hi all,
>>>>
>>>> I am still having problems with bayseq
>>>>
>>>> Having followed the pdf document associated with it and also 
>>>> tailoring it to the Marioni et al data I am using, it seems that 
>>>> the code has been running for over two days without any results.
>>>>
>>>> I am wondering this code be down to my code.
>>>>
>>>> I have therefore attached my code to this email hoping that someone 
>>>> can help me solve this problem, thank you.
>>>>
>>>>
>>>> library(baySeq)
>>>> library(edgeR)
>>>> library(limma)
>>>> library(snow)
>>>>
>>>> cl<- makeCluster(4, "SOCK")
>>>>
>>>>
>>>> ##Calculating normalization factors##
>>>> D=MA.subsetA$M
>>>> head(D)
>>>> names(D)
>>>> dim(D)
>>> Please provide the output of these (i.e., head, names, dim).
>>>>
>>>> g<- gsub("R[1-2]L[1-8]", "", colnames(D))
>>>> d<- DGEList(counts = D, group = substr(colnames(D), 5, 30))
>>>> d$samples
>>>> names(d)
>>>> dim(d)
>>> These values would be helpful too.
>>>>
>>>>
>>>> CD<- new("countData", data = as.matrix(MA.subsetA$M), libsizes =
>>>> as.integer(d$samples$lib.size), replicates = g)
>>>> groups(CD)<- list(rep(1, ncol(CD)), g)
>>>>
>>>> CD at libsizes<- getLibsizes(CD)
>>>>
>>>> plotMA.CD(CD, samplesA = c(1,3,6,8,10), samplesB = c(2,4,5,7,9), 
>>>> col = c(rep("red",
>>>> 100), rep("black", 900)))
>>>
>>> Did this work? Your original question was about plotMA.CD not 
>>> recognizing your groups. Does the plot work for you now?
>>>>
>>>> ## Optionally adding annotation details to the @annotation slot of 
>>>> the countData object. ##
>>>> CD at annotation<- data.frame(name = paste("gene", 1:1000, sep = "_"))
>>>>
>>>>
>>>>
>>>>
>>>> ### Poisson-Gamma Approach ###
>>>>
>>>> CDP.Poi<- getPriors.Pois(CD, samplesize = 2, takemean = TRUE, cl = cl)
>>>>
>>>> CDP.Poi at priors  ## This takes time ###
>>> Is the call to getPriors.Pois() that has been running for over 2 
>>> days? If not, please specify which function call is taking so long. 
>>> Did the rest of the code below work for you?
>>>
>>> Valerie
>>>>
>>>> CDPost.Poi<- getLikelihoods.Pois(CDP.Poi, pET = "BIC", cl = cl)
>>>> CDPost.Poi at estProps
>>>>
>>>> CDPost.Poi at posteriors[1:10, ]  ## A list of the posterior 
>>>> likelihoods each model for the first 10 genes ##
>>>> CDPost.Poi at posteriors[101:110, ]   ## A list of the posterior 
>>>> likelihoods each model for the genes from 101 to 110 ##
>>>>
>>>>
>>>> ### Negative-Binomial Approach ###
>>>>
>>>> CDP.NBML<- getPriors.NB(CD, samplesize = 1000, estimation = "QL", 
>>>> cl = cl)
>>>>
>>>> CDPost.NBML<- getLikelihoods.NB(CDP.NBML, pET = 'BIC', cl = cl)
>>>>
>>>> CDPost.NBML at estProps
>>>>
>>>> CDPost.NBML at posteriors[1:10,]
>>>>
>>>> CDPost.NBML at posteriors[101:110,]
>>>>
>>>>
>>>> Kind Regards
>>>>
>>>> Tina
>>>> ________________________________________
>>>> From: Valerie Obenchain [vobencha at fhcrc.org]
>>>> Sent: 25 January 2012 18:14
>>>> To: Tina Asante Boahene
>>>> Cc: bioconductor at stat.math.ethz.ch
>>>> Subject: Re: [BioC] Help With RNA-seq
>>>>
>>>> Hi Tina,
>>>>
>>>> It's difficult to help without knowing what your data look like or 
>>>> what
>>>> error message you are seeing. Both pieces of information would be 
>>>> helpful.
>>>>
>>>> For starters I think you need to provide 'replicate' and 'groups'
>>>> arguments when you create your new "countData" object. Depending on 
>>>> what
>>>> order your data are in you need something like,
>>>>
>>>>       groups<- list(NDE = c(1,1,1,1,1,1,1,1,1,1), DE =
>>>> c(1,2,1,2,2,1,2,1,2,1))
>>>>       replicates<- c("Kidney", "Liver", "Kidney", "Liver", "Liver',
>>>> "Kidney", "Liver", "Kidney", "Liver", "Kidney")
>>>>
>>>> Then create your "countData" with these variables,
>>>>
>>>>       CD<- new("countData", data = as.matrix(MA.subsetA$M), libsizes =
>>>> as.integer(d$samples$lib.size),
>>>>           replicates = replicates, groups = groups)
>>>>
>>>> Now look at the CD object and make sure the columns are labeled as 
>>>> they
>>>> should be and the other slot values make sense. The MA plot call would
>>>> look something like,
>>>>
>>>>       plotMA.CD(CD, samplesA = "Kidney", samplesB = "Liver")
>>>>
>>>> The author used the red and black colors for the vignette plot because
>>>> there was a known structure to the data; the first 100 counts showed
>>>> differential expression and the last 900 did not. You probably have a
>>>> different situation in your data so using the same color scheme may 
>>>> not
>>>> make sense.
>>>>
>>>> Valerie
>>>>
>>>>
>>>> On 01/23/2012 06:13 AM, Tina Asante Boahene wrote:
>>>>> Hi all,
>>>>>
>>>>> I am conducting some analysis using the Marioni et al data.
>>>>>
>>>>> However, I am having a bit of trouble using my data to conduct the 
>>>>> analysis based on the baySeq package.
>>>>>
>>>>>    And I was wondering if you could stir me in the right direction.
>>>>>
>>>>> I have already used edgeR to find the library sizes for the ten 
>>>>> libraries I have for my data as well as for the groups (Liver and 
>>>>> Kidney) as stated below.
>>>>>
>>>>>
>>>>> library(baySeq)
>>>>> library(edgeR)
>>>>> library(limma)
>>>>> library(snow)
>>>>>
>>>>> cl<- makeCluster(4, "SOCK")
>>>>>
>>>>>
>>>>> ##Calculating normalization factors##
>>>>> D=MA.subsetA$M
>>>>> head(D)
>>>>> names(D)
>>>>> dim(D)
>>>>>
>>>>> g<- gsub("R[1-2]L[1-8]", "", colnames(D))
>>>>> d<- DGEList(counts = D, group = substr(colnames(D), 5, 30))
>>>>> d$samples
>>>>> names(d)
>>>>> dim(d)
>>>>>
>>>>>
>>>>> I will like to know how to model my code in order to produce the 
>>>>> MA plot for count data
>>>>>
>>>>>
>>>>> This is what I have, however it runs with the wrong response.
>>>>>
>>>>> Can someone help me fix this please.
>>>>>
>>>>> CD<- new("countData", data = as.matrix(MA.subsetA$M), libsizes = 
>>>>> as.integer(d$samples$lib.size))
>>>>>
>>>>> plotMA.CD(CD, samplesA = 1:5, samplesB = 6:10, col = c(rep("red",
>>>>> 100), rep("black", 900)))
>>>>>
>>>>>
>>>>> How can I get it to recognise the "groups" as "g" (Library and 
>>>>> Kidney)
>>>>>
>>>>> This is the output for the groups   [1] "Kidney" "Liver"  "Kidney" 
>>>>> "Liver"  "Liver"  "Kidney" "Liver"  "Kidney"  "Liver"  "Kidney"
>>>>>
>>>>> thank you.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Kind Regards
>>>>>
>>>>> Tina
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives: 
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>


-- 
Dr. Thomas J. Hardcastle

Department of Plant Sciences
University of Cambridge
Downing Street
Cambridge, CB2 3EA
United Kingdom



More information about the Bioconductor mailing list