[BioC] Query regarding SomaticSignatures bioconductor package

Wed Jun 11 17:24:38 CEST 2014

Hi Anand,

> Thank you for the elaborate reply. This clears up a lot of things. Regarding the number R, you are right. Also, I guess you need more genomes to decipher more signatures.

The number of genomes will give you more power to detect signatures, 
whereas the number of potentially present signatures (and therefore 
mutational processes) will depend on the biology of the samples.

> One more question though: in the plot generated by plotSignatures(), is the y-axis 'contribution' the percentage contribution?

The 'contribution' reflects the values of the matrix decomposition. 
The values are proportional to each other, but they are not percentages. 
You can rescale them to fractions (multiply by 100 for percentages) by 
dividing the decomposed matrix of interest by its row sums.  As an example:

   sigs_nmf$w = sigs_nmf$w / rowSums(sigs_nmf$w)

I may add a convenient function for this soon.
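
To make the row-wise division concrete, here is a minimal sketch on a made-up matrix. The object `w`, its values, and its dimnames are illustrative assumptions, not output of nmfSignatures(). It also shows a column-wise variant with sweep(), in case the direction that should sum to one in your decomposed matrix is the columns rather than the rows.

```r
## Toy stand-in for one decomposed matrix; values and dimnames are
## invented for illustration and do not come from a real decomposition.
w <- matrix(c(2, 6, 2,
              1, 1, 2),
            nrow = 3,
            dimnames = list(c("m1", "m2", "m3"), c("S1", "S2")))

## Same operation as in the example above: each row is divided by its own sum
w_by_row <- w / rowSums(w)

## Alternative if each column should sum to 1 instead
w_by_col <- sweep(w, 2, colSums(w), "/")

rowSums(w_by_row)         # every row now sums to 1
colSums(w_by_col)         # every column now sums to 1
round(100 * w_by_col, 1)  # the column-wise fractions as percentages
```

Which direction is appropriate depends on how the matrix is oriented, so it is worth checking dim() and the dimnames of the matrix before rescaling.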

Best wishes
Julian

>
> Thanks again,
>
> Regards,
> -Anand
>
> -----Original Message-----
> From: Julian Gehring [mailto:julian.gehring at embl.de]
> Sent: Wednesday, 11 June, 2014 11:31 AM
> To: Anand [guest]; public-csiamt-6Bl98Hp8bEiLvajZxc+D7Q at plane.gmane.org
> Subject: Re: Query regarding SomaticSignatures bioconductor package
>
>
>
> Hi Anand,
>
>> 1. I have data from a single study (AML) with mutations obtained from
>> 14 patients. In this case, how do I group the data? If I group the
>> data by 'study' as in the vignette, I get an error while
>> running the nmfSignatures function. (I guess it's because the
>> matrix (sca_occurance) has only one column, corresponding to the single
>> study performed.) Can I group it by patient (sampleNames) instead?
>
> You can group your variants by any variable that is present in the 'VRanges' object that contains your calls.  The object behaves very much like a data frame, so you could add a column with
>
>      x$sample = ... ## your 14 samples ##
>
> and then group it with
>
>      motifMatrix(x, group = "sample")
>
> If your samples are already stored in the column 'sampleNames', you can also refer to this (see '?mutationContext' for an example).
>
>
>> 2. How do I choose the number R (number of signatures to obtain)? I guess it should be less than the number of columns of sca_occurances? In a recent publication (Niccolo Bolli et al., 2013, Nat. Commun.) involving a single study (multiple myeloma with 52 patients), they mention that they have found two signatures. Does it mean they have set the number of signatures (the R argument in nmfSignatures()) to 2?
>
> For estimating the number of signatures, there are several approaches.
> Whether and how well they perform depends largely on the input data, and
> none of them will work reliably in all cases.  For this reason, I haven't
> implemented an estimation for the number of signatures so far - I want
> to avoid giving a false sense of security/certainty.
>
> From a practical point of view, most information will be contained in the
> first few signatures - increasing the number of signatures further will
> add little information.  From a biological point of view, each signature
> should result from a different mutation-generating process.  In your
> setting with 14 patients suffering from the same type of cancer, one
> would suspect a low number of such processes.
>
> I hope this made things a bit clearer.
>
> Best wishes
> Julian
>
>