[R] How might I work with a data.frame where each physical row represents several logical rows?

David Wolfskill r at catwhisker.org
Thu Jan 5 04:40:10 CET 2017


I have (time series) data extracted from a repository that is stored
such that each record is for an hour, but each record contains an
ordered set of values throughout the hour.  In the following exmaple,
I'll show sets of 4, depicting 0, 15, 30, and 45 minutes after each
"start" point (respectively):

test_data <- structure(list(start = c(1482793200L, 1482793200L, 1482793200L, 
1482793200L, 1482793200L, 1482793200L, 1482793200L, 1482793200L, 
1482793200L, 1482793200L, 1482793200L, 1482793200L, 1482793200L, 
1482793200L, 1482793200L), hostname = c("c001.example.net", "c001.example.net", 
"c001.example.net", "c001.example.net", "c001.example.net", "c001.example.net", 
"c001.example.net", "c001.example.net", "c001.example.net", "c001.example.net", 
"c001.example.net", "c161.example.net", "c161.example.net", "c161.example.net", 
"c161.example.net"), mtype = c("health", "health", "net", "net", 
"net", "net", "net", "net", "net", "sys", "sys", "net", "sys", 
"sys", "sys"), limit_type = c("fill", "serve", "", "", "", "", 
"", "", "", "", "", "", "", "", ""), hw = c(1.16, 1.16, 1.16, 
1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.21, 1.21, 1.21, 
1.21), fw = c("2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", 
"2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", 
"2017Q1.1.1", "2017Q1.1.1", "2016Q4.2.13", "2016Q4.2.13", "2016Q4.2.13", 
"2016Q4.2.13"), tcp_state = c("", "", "", "", "closed", "closing", 
"fin_wait_2", "last_ack", "syn_rcvd", "", "", "", "", "", ""), 
    value_type = c("limit", "limit", "", "", "", "", "", "", 
    "", "", "", "", "", "", ""), nic = c("all", "all", "", "", 
    "", "", "", "", "", "", "", "mce0", "", "", ""), name = c("in_download_window", 
    "in_download_window", "tcpOutSegs", "tcpRetransSegs", "tcp_connection_count", 
    "tcp_connection_count", "tcp_connection_count", "tcp_connection_count", 
    "tcp_connection_count", "CpuSystem", "CpuUser", "HCOutOctets", 
    "CpuIdle", "CpuSystem", "CpuUser"), values = c("[0.0, 0.0, 0.0, 0.0]", 
    "[0.0, 0.0, 0.0, 0.0]", "[260410.94547698632, 258469.54433635762, 260579.2186617577, 258763.2815145043]", 
    "[18436.311524062934, 18248.952271420356, 18201.62259198662, 17818.39529178736]", 
    "[5.0, 3.0, 3.0, 3.0]", "[3.0, 3.0, 2.0, 2.0]", "[670.0, 677.0, 685.0, 729.0]", 
    "[1162.0, 1192.0, 1148.0, 1110.0]", "[25.0, 60.0, 71.0, 33.0]", 
    "[11.0, 10.0, 11.0, 10.0]", "[2.0, 2.0, 2.0, 2.0]", "[7.873191635959294E9, 7.7377184658927E9, 7.876630519328283E9, 7.714521544912713E9]", 
    "[70.0, 70.0, 70.0, 70.0]", "[27.0, 26.0, 27.0, 26.0]", "[4.0, 4.0, 4.0, 4.0]"
    )), .Names = c("start", "hostname", "mtype", "limit_type", 
"hw", "fw", "tcp_state", "value_type", "nic", "name", "values"
), class = "data.frame", row.names = c(NA, -15L))


So of the 15 rows in the above example, row 8 (which depicts the TCP
connection counts in the "last ACK" state) has the values:

* 1162.0
* 1192.0
* 1148.0
* 1110.0

It seems to me that what is wanted is for each of the existing rows to be
replaced by a set of 4 (in this case) rows, where the other columns are
the same (save for "start", which is a timestamp, and should be adjusted
for the respective times).

I'm fairly sure I can write code to do that, but it would end up being
something like Perl implemented in R, which seems fairly grotesque: I
can't help but think that there ought to be a ... more elegant approach
in R (which is why I am sking for help).

(I will also end up collecting all of the records for a given timestamp
and hostname, and creating one very wide record with all of the data
from the set of records thus found.  I already have (yes, Perl) code to
do this -- though if there's a reasonable way to avoid that, I'm
interested.)

Once that's all done, I'll be examining various columns, subsetting
by attributes of the systems being compared -- but I already have
code to do that (that makes use of a different -- and rather more
fragile -- approach for extracting the data from its repository).

Thanks!

Peace,
david
-- 
David H. Wolfskill				r at catwhisker.org
Epistemology for post-truthers: How do we select parts of reality to ignore?

See http://www.catwhisker.org/~david/publickey.gpg for my public key.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 603 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20170104/b83a0662/attachment.bin>


More information about the R-help mailing list