[R] Variance of multiple non-contiguous time periods?

Wed Nov 5 01:29:15 CET 2014

On Nov 4, 2014, at 3:41 PM, CJ Davies wrote:

> On 04/11/14 17:42, David Winsemius wrote:
>> On Nov 4, 2014, at 9:16 AM, CJ Davies wrote:
>> 
>>> On 04/11/14 17:02, David Winsemius wrote:
>>>> On Nov 4, 2014, at 8:35 AM, CJ Davies wrote:
>>>> 
>>>>> On 04/11/14 16:13, PIKAL Petr wrote:
>>>>>> Hi
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>>>>>>> project.org] On Behalf Of CJ Davies
>>>>>>> Sent: Tuesday, November 04, 2014 2:50 PM
>>>>>>> To: Jim Lemon; r-help at r-project.org
>>>>>>> Subject: Re: [R] Variance of multiple non-contiguous time periods?
>>>>>>> 
>>>>>>> On 04/11/14 09:11, Jim Lemon wrote:
>>>>>>>> On Mon, 3 Nov 2014 12:45:03 PM CJ Davies wrote:
>>>>>>>>> ...
>>>>>>>>> On 30/10/14 21:33, Jim Lemon wrote:
>>>>>>>>> If I understand, you mean to calculate deviations for each individual
>>>>>>>>> 'chunk' of each transition & then aggregate the results? This is what
>>>>>>>>> I'd been thinking about, but is there a sensible manner within R to
>>>>>>>>> achieve this, or is it something for which it would be easier to
>>>>>>>>> preprocess the data in an external tool? Is there some way to subset the
>>>>>>>>> data such that I can work over just contiguous 'chunks'?
>>>>>>>>> 
>>>>>>>> Exactly. If there is some combination of existing variables that can
>>>>>>>> be combined to make a set of unique values for each "chunk", you can
>>>>>>>> calculate the deviations within each "chunk", then average the squared
>>>>>>>> deviations for each type of "chunk", weighting by the duration of the
>>>>>>>> "chunks" so that you don't bias the pooled variance toward the longer
>>>>>>>> "chunks".
>>>>>>>> 
>>>>>>>> Jim
>>>>>>>> 
>>>>>>> I am stumped for a way of automating this process though. Each line of
>>>>>>> log data looks like this;
>>>>>>> 
>>>>>>> 2406  55.4  (-11.2, 1.0, -0.9)  (-4.1, 1.0, 0.0)  7.077912  0.9203392  (0.0, 0.7, -0.1, 0.7)  8.129684  89.41537  -8.212769  (0.0, 0.7, -0.1, 0.7)  8.129684  89.41537  351.7872  1  0  0  False  0.15  3  37.76761  True  False  0  transition 1
>>>>>> First you need to import it into R, which could be tricky judging by the line above.
>>>>>> Some values will probably need to be processed with regular expressions.
>>>>>> 
>>>>>> If I understand correctly, the number after "transition" is a signal which identifies contiguous chunks. If that is true, then
>>>>>> 
>>>>>> ?rle documents a function which computes the lengths of such chunks.
>>>>>> 
>>>>>> Cheers
>>>>>> Petr
>>>>>> 
>>>>>>> Where the last variable defines which transition is currently active.
>>>>>>> However to separate these data into 'chunks' would involve making a
>>>>>>> comparison between each line of data & the preceding line of data to
>>>>>>> determine whether it is part of the same contiguous 'chunk'. Is this
>>>>>>> something that would be better achieved using external preprocessing
>>>>>>> written in a language I am more familiar with, as I haven't the
>>>>>>> foggiest how I would approach this within R?
>>>>>>> 
>>>>>>> Regards,
>>>>>>> CJ Davies
>>>>>>> 
>>>>>>> ______________________________________________
>>>> snipped
>>>>> Importing into R wasn't an issue; some of the fields contain spaces & symbols, but all the fields are tab separated so I can simply use;
>>>>> 
>>>>> foo <- read.csv("bar",header=T,sep="\t")
>>>>> 
>>>>> I've just written a hacky bit of Java that gives me the lines of each 'chunk' as a separate list & I think I'll then calculate these particular values using Java's Math class rather than trying to come up with a sensible way to import these 'chunks' back into R. When it comes to string/list manipulation like this I think my knowledge in Java & lack of knowledge in R makes the former the better option!
>>>>> 
>>>> If you had offered the output of dput(head(foo, 20) ) and explained what defined a "chunk-defining transition", it would have been fairly easy to show you how to use cumsum in an ave() call to construct a grouping variable.
>>>> 
>>>> 
>>>>> Regards,
>>>>> CJ Davies
>>>>> 
>>>>> ______________________________
>>>> 
>>>> David Winsemius
>>>> Alameda, CA, USA
>>>> 
>>> Here is an example 100 lines of the input --> http://paste2.org/2LZVGP5K
>>> 
>>> The final value on each line, under the header "environment", is always one of ["real", "transition 1", "transition 2", "transition 3", "transition 4"]. A 'chunk-defining transition' is when this value changes.
>>> 
>>> If there is a way to do this in R in a more elegant fashion than my hacky Java, then I would be glad to learn.
>> That pasted material does not appear to preserve the tabs. Input with your suggested code "does not work" in the sense that it brings in an object like this. 
>> 
>>> download.file("http://paste2.org/2LZVGP5K", "bar.txt")
>> trying URL 'http://paste2.org/2LZVGP5K'
>> Content type 'text/html; charset=UTF-8' length unknown
>> opened URL
>> .......... .......... ........
>> downloaded 28 Kb
>> 
>>> foo <- read.csv("bar.txt",header=T,sep="\t")
>>> str(foo)
>> 'data.frame':	2829 obs. of  1 variable:
>> $ X..DOCTYPE.html.: Factor w/ 669 levels "","          ",..: 106 104 219 233 220 222 221 215 217 79 ...
>> 
>> I SAY AGAIN:
>> 
>> Need: output of dput(head(foo, 100))
>> 
>> 
>>> Regards,
>>> CJ Davies
>> David Winsemius
>> Alameda, CA, USA
>> 
> That was a pastebin URI, so what you downloaded was HTML instead of raw
> text. This is the raw text;

Well, it was text but it had no tabs. On this mailing list, HTML is considered evil.

> foo$chunk <- c(NA, foo$environment[-1] != head(foo$environment,-1) )
> table(foo$chunk)

FALSE  TRUE 
  503   106 
> foo$chunk <- cumsum(c(1, foo$chunk[-1]) )
> table(foo$chunk)

  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26 
 20   1   6   1   1   1   4   1   1   2  16   1   7   4  14   2   6   1   2   4   1   4   2   8   6   2 
 27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52 
  2   1   7   1   1   2   2   2   6  10   3   1  12   3   1  10  18   6   1   6  14   4   1  19  13  10 
 53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78 
  6   2  10  14   3   2   1   2   1   1   1  15   4   2   2   6  21   5   1  16   5   3   1   2  21   3 
 79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 
  1   2   3   4   4   3   5   1   9   1   3   3   7   2   5   6   6   5  13   1   1   8   1   2   2   3 
105 106 107 
  6   9  70 

So now you have a chunking index and can use `by`, `ave`, or `for()` loops to compute per-chunk statistics.
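
A minimal sketch of that last step, not code from this thread. It assumes the quantity of interest is a numeric column, hypothetically called `value` here; the toy numbers are made up, and only the `environment` column name comes from the posted data. It takes deviations from each chunk's own mean with `ave`, then pools them per environment using the usual degrees-of-freedom weighting (sum of squared deviations over sum of n - 1), which is one common reading of the pooled variance discussed earlier in the thread.

```r
# Toy data standing in for the real log; 'value' and its numbers are
# hypothetical, only 'environment' mirrors the column named in the thread.
foo <- data.frame(
  value = c(1, 2, 3, 10, 12, 4, 6, 11, 13, 15),
  environment = c("real", "real", "real",
                  "transition 1", "transition 1",
                  "real", "real",
                  "transition 1", "transition 1", "transition 1"),
  stringsAsFactors = FALSE
)

# Chunk index: increment whenever 'environment' differs from the previous row.
changed <- c(FALSE, foo$environment[-1] != head(foo$environment, -1))
foo$chunk <- cumsum(changed) + 1

# Deviations from each chunk's own mean, computed with ave().
foo$dev <- foo$value - ave(foo$value, foo$chunk)

# One row per chunk: its environment, length, and sum of squared deviations.
per_chunk <- data.frame(
  environment = as.vector(tapply(foo$environment, foo$chunk, `[`, 1)),
  n  = as.vector(tapply(foo$dev, foo$chunk, length)),
  ss = as.vector(tapply(foo$dev^2, foo$chunk, sum)),
  stringsAsFactors = FALSE
)

# Pooled within-chunk variance per environment: total squared deviation
# divided by total degrees of freedom (n - 1 per chunk).
pooled <- with(per_chunk,
               tapply(ss, environment, sum) / tapply(n - 1, environment, sum))
pooled
```

Single-row chunks contribute no degrees of freedom, so an environment made up only of such chunks would give NaN here and would need separate handling.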

> 
> http://cjdavies.org/foo

That was displayed as if it had tabs, and after correcting the error of using T for TRUE it did succeed.
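
For reference, a small sketch of the corrected import call. The sample lines and the column names other than `environment` are hypothetical stand-ins for the real file, read through the `text` argument so the example does not depend on the URL above.

```r
# Hypothetical two-row sample standing in for the real tab-separated log;
# the column names other than 'environment' are made up for illustration.
sample_lines <- c(
  "id\tposition\tenvironment",
  "2406\t(-11.2, 1.0, -0.9)\ttransition 1",
  "2407\t(-11.0, 1.0, -0.9)\treal"
)

# Spell out TRUE: T is an ordinary variable that can be reassigned,
# whereas TRUE is a reserved word.
foo <- read.csv(text = sample_lines, header = TRUE, sep = "\t",
                stringsAsFactors = FALSE)
str(foo)
```

With the real file, the same call would take the file name or URL as its first argument in place of `text`.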

> 
> Regards,
> CJ Davies

David Winsemius
Alameda, CA, USA