[BioC] Support vector model?

Celine Carret ckc at sanger.ac.uk
Fri Dec 12 17:36:07 CET 2008


Dear Zeljko,

Thank you for answering! First of all, I'm sorry I wasn't clear enough.
I have 57 different measurable quantities (columns) for set 1
(parasite), 18 for the negative control, etc., so in fact there is a
variable number of columns for each set, and also a variable number of
rows, as the measured events do not all have the same length in time.
The longest has 815 time points; I filled the empty cells with NA.
Each column corresponds to one measurement (either test or control or
else) recorded on a time scale: 0 means nothing happens at that time
point, 1 means something happens. The length of an event is the number
of consecutive 1s.
If you look at a column, it will look like
000000000000000000011111111111111111111111111100000000000000000111111111
111111000000000000000000000000000000000000000000000000
But the second column could be
111111111111111111111111111111111111111111111111111100000000000000000000
0000
And so on. However, the number of zeros after a measurement (1) is
irrelevant in this case, as most of the time the measurement was
stopped after seeing an event.
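
To make the description above concrete, here is a minimal Python sketch
of how the lengths of events (runs of consecutive 1s) could be pulled
out of one column. The function name `event_lengths` and the example
column are made up for illustration, and NA is represented as `None`.

```python
from itertools import groupby

def event_lengths(column):
    """Return the lengths of all runs of consecutive 1s in one column.

    `column` is a sequence of 0, 1 or None (None stands in for NA padding).
    Zeros and None values only separate events; they are never counted.
    """
    return [sum(1 for _ in run) for value, run in groupby(column) if value == 1]

# Hypothetical column: 3 zeros, an event of 4, 2 zeros, an event of 2, NA padding
column = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, None, None]
print(event_lengths(column))  # [4, 2]
```

Because trailing zeros and NA padding are ignored, columns of different
lengths can be summarised in the same way.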

Is this clearer?

I would like to draw conclusions about the relevant variables/length of
an event and possibly recurrence in the same measurement (column).

Unfortunately, this is biology :-( the measurements are not highly
reproducible.
If I split the measurements (columns) having more than one event to make
more columns, would it help? The time points themselves are not
important, only the length of each event is: purely by observing under a
microscope, the control often gives very short events while the parasite
set gives more sustained events. That is what I would like to test for
significance.
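
One possible way to test whether event lengths differ between two sets
is a permutation test on the difference in mean event length. The Python
sketch below is only an illustration under stated assumptions: the
function name `permutation_test` is hypothetical, the event lengths are
invented numbers, and every event is treated as independent, which may
not hold for several events taken from the same column.

```python
import random

def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference in mean event length."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign events to the two sets
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero
    return (hits + 1) / (n_perm + 1)

# Invented event lengths (in time points), purely for illustration
parasite = [26, 31, 52, 40, 35, 47]
control = [3, 5, 2, 8, 4, 6]
p = permutation_test(parasite, control)
print(f"p = {p:.4f}")
```

A permutation test makes no assumption about the shape of the length
distribution, which seems a reasonable fit for short, skewed event
lengths.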

I would gladly receive more guidance on how to proceed, using the
Chi2 for instance.

Thank you so much for your help
Best wishes
Celine

 

-----Original Message-----
From: Zeljko Debeljak [mailto:zeljko.debeljak at gmail.com] 
Sent: 12 December 2008 16:13
To: Celine Carret
Subject: Re: [BioC] Support vector model?

Dear Celine,

You do need to provide us with some clarifications. How many input
variables do you have? 57? At how many time points do you measure
all your variables? 815? If I am correct, you have 4 matrices, each with
57 columns corresponding to 57 different measurable quantities and 815
rows corresponding to 815 time points, and each matrix corresponds to
a specific class (set). In short, you have 4 multivariate (57x815)
fingerprints in front of you (I believe). And based on such data you
want to draw some conclusions about the relevant variables/time points,
i.e. the variables/time points which make the difference between sets?
If so, you need to have highly reproducible measurements, especially
when it comes to the time coordinate. If this is not the case (and I
believe it is not), you have to make a few repeated measurements for
each matrix, and even then you will have serious problems (from the
data analysis point of view). However, in that case you will be in a
position to draw some conclusions.

For such a task I am not sure that SVM could be of much help (at least
due to the time domain variability and the binary nature of the input
variables). I would expect better results from Random forests, but even
in that case I am not sure about the quality of the results. The
easiest, and the most unreliable, way to do that is to compare
corresponding variables at corresponding time points between different
sets. You can even use a chi2 or similar test statistic to find the
answers in a univariate fashion.

If I have interpreted your problem correctly, please contact me. I have
been dealing with this type of problem for a while and at the moment I
am benchmarking some statistical tests for similar problems. Hope this
helps.
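
The univariate comparison mentioned above (one time point compared
between two sets with a chi2 statistic) could look roughly like the
following Python sketch. It computes the Pearson chi-squared statistic
for a 2x2 table without continuity correction. The function name
`chi2_2x2` and the counts in the example are hypothetical; with small
expected counts an exact test such as Fisher's would be the safer choice.

```python
import math

def chi2_2x2(a_ones, a_zeros, b_ones, b_zeros):
    """Pearson chi-squared test (1 df, no continuity correction) for a 2x2 table.

    a_ones / a_zeros: number of columns in set A showing 1 / 0 at one time point
    b_ones / b_zeros: the same counts for set B
    """
    n = a_ones + a_zeros + b_ones + b_zeros
    row_a, row_b = a_ones + a_zeros, b_ones + b_zeros
    col_1, col_0 = a_ones + b_ones, a_zeros + b_zeros
    if 0 in (row_a, row_b, col_1, col_0):
        raise ValueError("a row or column total is zero; the test is undefined")
    stat = n * (a_ones * b_zeros - a_zeros * b_ones) ** 2 / (row_a * row_b * col_1 * col_0)
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts at one time point: 40 of 57 parasite columns show a 1,
# against 3 of 18 control columns
stat, p_value = chi2_2x2(40, 17, 3, 15)
print(f"chi2 = {stat:.2f}, p = {p_value:.2g}")
```

Repeating this at every time point means many tests, so some correction
for multiple testing would be needed before interpreting the p-values.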

Zeljko Debeljak, PhD
Medical Biochemistry Specialist
Clinical Hospital Osijek,
CROATIA

2008/12/12 Celine Carret <ckc at sanger.ac.uk>:
>
> Dear All,
>
> Apologies for sending this email to both lists, but at this point I'm
> not sure which one could help me the most.
>
> I have 4 sets of data, 1 test and 3 different sets of controls.
> The measurements are binary, with a matrix of 0 and 1.
> I'm measuring across time (rows, ~815) the behaviour of organelles in
> the cell by microscopy in response to different stimuli (several
> measurements for each set, 57 columns in total).
> Set 1: parasite test
> Set 2: no stimulus
> Set 3: inert stimulus (beads)
> Set 4: different pathogen
> Across time, a "zero" means nothing happens around my parasite
> introduced in the cell, a "1" means some cytoskeleton dynamics
> occurring around my parasite.
> I want to give some statistical value to my observations in saying
> that the cytoskeleton dynamics are specific to my parasite at that
> frequency across time.
>
> I thought of comparing profiles, like a smooth profile to summarise
> all that is happening in each set and test for distances between 2
> smoothed sets. But the timing when something is happening varies a
> lot: sometimes it's a few seconds, sometimes minutes, sometimes only
> once per measurement, sometimes more for the same parasite.
> I'm not sure how to proceed.
>
> I have been looking into the e1071 package in R for support vector
> machines, but I'm not sure this will give me the right model.
>
> I am very grateful for any help / advice anyone can think of
>
> Thank you very much
> Celine
>
>
>
>
> --
>  The Wellcome Trust Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>





