[R] Identyfing rows with specific conditions
David L Carlson
dcarlson at tamu.edu
Wed May 24 19:23:20 CEST 2017
You may run into memory issues if the table is that large, in which case you may need to break your customers into subsets and process each subset separately, then combine the results of the subsets into a single file.
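As a rough sketch of the subset idea, assuming the toy data from this thread stored as integer matrices: match_chunk() and chunk_size are names made up for this illustration, and the chunk size of 2 is only there so the demo exercises more than one chunk.

```r
# Toy data from this thread, as integer matrices (column-major fill).
Meals <- matrix(c(34L, 89L, 25L, 34L, 25L,
                  66L, 39L, 77L, 39L, 34L), ncol = 2)
Customers <- matrix(c(15L, 11L, 85L,
                      77L, 25L, 89L,
                      34L, 34L, 25L,
                      25L, 39L, 77L), ncol = 4)

# Hypothetical helper: (mealAcode, mealBcode, id) rows for one block of customers.
match_chunk <- function(Meals, Cust) {
  meals <- Cust[, -1, drop = FALSE]
  out <- vector("list", nrow(Meals))
  for (i in seq_len(nrow(Meals))) {
    # Which customers in this block have meal A, and which have meal B?
    hasA <- rowSums(meals == Meals[i, 1], na.rm = TRUE) > 0
    hasB <- rowSums(meals == Meals[i, 2], na.rm = TRUE) > 0
    ids <- Cust[hasA & hasB, 1]
    out[[i]] <- cbind(rep(Meals[i, 1], length(ids)),
                      rep(Meals[i, 2], length(ids)),
                      ids)
  }
  do.call(rbind, out)
}

chunk_size <- 2   # assumption: tiny for the demo, thousands of rows in practice
starts <- seq(1, nrow(Customers), by = chunk_size)
parts <- lapply(starts, function(s) {
  rows <- s:min(s + chunk_size - 1, nrow(Customers))
  match_chunk(Meals, Customers[rows, , drop = FALSE])
})

# Combine the per-chunk results.
Results <- do.call(rbind, parts)
colnames(Results) <- c("mealAcode", "mealBcode", "id")
Results
```

If even the combined result is too large to hold in memory, each chunk could instead be written out as it is produced, for example with write.table(..., append = TRUE).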
There are a couple of relatively easy ways to speed things up, but I don't know if it will be enough. The simplest is to change from data.frames to matrices:
Meals <- structure(list(mealAcode = c(34L, 89L, 25L, 34L, 25L), mealBcode = c(66L,
39L, 77L, 39L, 34L)), .Names = c("mealAcode", "mealBcode"), row.names = c(NA,
-5L), class = "data.frame")
Customers <- structure(list(id = c(15L, 11L, 85L), M1 = c(77L, 25L, 89L),
M2 = c(34L, 34L, 25L), M3 = c(25L, 39L, 77L)), .Names = c("id",
"M1", "M2", "M3"), class = "data.frame", row.names = c(NA, -3L
))
Meals <- as.matrix(Meals)
Meals <- unname(Meals)
Customers <- as.matrix(Customers)
Customers <- unname(Customers)
Results <- matrix(nrow=0, ncol=3)
k <- 0   # running count of matches
for (i in seq_len(nrow(Meals))) {
    for (j in seq_len(nrow(Customers))) {
        # keep customer j if both meals of pair i appear among their meals
        if (any(Customers[j, -1] %in% Meals[i, 1]) &&
            any(Customers[j, -1] %in% Meals[i, 2])) {
            k <- k+1
            Results <- rbind(Results, c(Meals[i, ], Customers[j, 1]))
        }
    }
}
colnames(Results) <- c("mealAcode", "mealBcode", "id")
Results
This should process quite a bit faster, but is still slowed down by the fact that we are allocating memory for each match. If we pre-allocate the memory, it is faster, but you have to know how much to allocate:
Customers <- unname(Customers)
Meals <- unname(Meals)
Results <- matrix(nrow=1000000, ncol=3)
k <- 0   # number of rows filled so far
for (i in seq_len(nrow(Meals))) {
    for (j in seq_len(nrow(Customers))) {
        if (any(Customers[j, -1] %in% Meals[i, 1]) &&
            any(Customers[j, -1] %in% Meals[i, 2])) {
            k <- k+1
            Results[k, ] <- c(Meals[i, ], Customers[j, 1])
        }
    }
}
Results <- Results[seq_len(k), , drop=FALSE]   # drop the unused rows
colnames(Results) <- c("mealAcode", "mealBcode", "id")
Results
This pre-allocates space for a million rows so it should be even faster, but it will fail with a "subscript out of bounds" error if there are more than a million matches, so guess high.
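If guessing the size is awkward, one alternative (not what the code above does) is to start with a modest buffer and double it whenever it fills up. This is only a sketch on the toy data; cap is a name introduced here, and it starts at 2 purely so that the growth step actually runs.

```r
Meals <- matrix(c(34L, 89L, 25L, 34L, 25L,
                  66L, 39L, 77L, 39L, 34L), ncol = 2)
Customers <- matrix(c(15L, 11L, 85L,
                      77L, 25L, 89L,
                      34L, 34L, 25L,
                      25L, 39L, 77L), ncol = 4)

cap <- 2   # assumption: deliberately too small, to show the buffer growing
Results <- matrix(NA_integer_, nrow = cap, ncol = 3)
k <- 0
for (i in seq_len(nrow(Meals))) {
  for (j in seq_len(nrow(Customers))) {
    if (any(Customers[j, -1] %in% Meals[i, 1]) &&
        any(Customers[j, -1] %in% Meals[i, 2])) {
      k <- k + 1
      if (k > cap) {
        # Buffer is full: append as many empty rows as it already has.
        Results <- rbind(Results, matrix(NA_integer_, nrow = cap, ncol = 3))
        cap <- 2 * cap
      }
      Results[k, ] <- c(Meals[i, ], Customers[j, 1])
    }
  }
}
Results <- Results[seq_len(k), , drop = FALSE]   # keep only the filled rows
colnames(Results) <- c("mealAcode", "mealBcode", "id")
Results
```

Doubling means the matrix is copied only a handful of times even for millions of matches, instead of once per match as with rbind() inside the loop.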
There are some specialized packages such as data.table and dplyr in R that might be even faster. Also, the parallel package could be used to spread the work across several cores.
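To give an idea of what a join-based approach looks like, here is a sketch in base R using merge() rather than data.table or dplyr (those packages provide their own, typically faster, joins). It assumes missing meals are coded as NA, so the "\N" markers in the real file would need converting first; long, hasA and hasB are names introduced for this example.

```r
Meals <- data.frame(mealAcode = c(34L, 89L, 25L, 34L, 25L),
                    mealBcode = c(66L, 39L, 77L, 39L, 34L))
Customers <- data.frame(id = c(15L, 11L, 85L), M1 = c(77L, 25L, 89L),
                        M2 = c(34L, 34L, 25L), M3 = c(25L, 39L, 77L))

# Reshape to one row per (id, meal). unlist() stacks M1, M2, M3 in order,
# so the ids are repeated once per meal column.
long <- data.frame(id = rep(Customers$id, times = ncol(Customers) - 1),
                   meal = unlist(Customers[-1], use.names = FALSE))
long <- unique(long[!is.na(long$meal), ])   # assumes missing meals are NA

# Customers who had meal A of each pair, and customers who had meal B.
hasA <- merge(Meals, long, by.x = "mealAcode", by.y = "meal")
hasB <- merge(Meals, long, by.x = "mealBcode", by.y = "meal")

# Joining on all three shared columns keeps customers who had both meals.
Results <- merge(hasA, hasB)
Results
```

This replaces the double loop with three joins. The intermediate tables hasA and hasB can be large on the full data, so it may still need to be combined with the subset approach.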
David C
From: Allaisone 1 [mailto:allaisone1 at hotmail.com]
Sent: Wednesday, May 24, 2017 7:54 AM
To: David L Carlson <dcarlson at tamu.edu>; Bert Gunter <bgunter.4567 at gmail.com>
Cc: r-help at r-project.org
Subject: Re: [R] Identyfing rows with specific conditions
Dear David ..,
Many thanks for spending the time and effort solving this problem.
I liked your suggestion to generate the results with the first format
as this will be very helpful for further analysis :
# mealAcode mealBcode id
# 1 25 77 15
# 2 25 77 85
# 3 34 39 11
# 4 25 34 15
# 5 25 34 11
I wonder if there is any method to process the analysis faster. The end output will be a very large table: I already know from my previous analysis that, for example, around 20,000 customers take one of these combinations, about 17,000 customers take another combination of meals, and so on. Considering that I have 2000 rows (2000 combinations) and that the number of customers per combination is very high (20,000 or lower), this will generate a huge table with millions of rows, which is very complex. I tested the code just to see and, as expected, it ran for a long time (hours) before I stopped the analysis.
________________________________________
From: David L Carlson <dcarlson at tamu.edu>
Sent: 23 May 2017 18:20:02
To: Allaisone 1; Bert Gunter
Cc: r-help at r-project.org
Subject: RE: [R] Identyfing rows with specific conditions
You will generally get a better response if you create a data set that we can test our ideas on. You sketched out the data but your final results include meal combinations that are not in your Meals data frame. Also using dput() to provide the data makes it easier to recreate the data since just printing it loses important information such as if the numbers are integer or decimal, or if the character variables are characters or factors. Also send your message as plain text, not html which messes up columns and other details. Here is a modification of your data with the beginnings of a solution:
Meals <- structure(list(mealAcode = c(34L, 89L, 25L, 34L, 25L), mealBcode = c(66L,
39L, 77L, 39L, 34L)), .Names = c("mealAcode", "mealBcode"), row.names = c(NA,
-5L), class = "data.frame")
Meals
# mealAcode mealBcode
# 1 34 66
# 2 89 39
# 3 25 77
# 4 34 39
# 5 25 34
Customers <- structure(list(id = c(15L, 11L, 85L), M1 = c(77L, 25L, 89L),
M2 = c(34L, 34L, 25L), M3 = c(25L, 39L, 77L)), .Names = c("id",
"M1", "M2", "M3"), class = "data.frame", row.names = c(NA, -3L
))
Customers
# id M1 M2 M3
# 1 15 77 34 25
# 2 11 25 34 39
# 3 85 89 25 77
Results <- data.frame(mealAcode=NA, mealBcode=NA, id=NA)
k <- 0   # number of matches found so far
for (i in seq_len(nrow(Meals))) {
    for (j in seq_len(nrow(Customers))) {
        # customer j matches pair i if both meal codes appear in columns M1..M3
        if (any(Customers[j, -1] %in% Meals[i, 1]) &&
            any(Customers[j, -1] %in% Meals[i, 2])) {
            k <- k+1
            Results[k, ] <- cbind(Meals[i, ], id=Customers[j, 1])
        }
    }
}
rownames(Results) <- NULL   # reset row names once, after the loops
Results
# mealAcode mealBcode id
# 1 25 77 15
# 2 25 77 85
# 3 34 39 11
# 4 25 34 15
# 5 25 34 11
Results is a data frame that can be modified to get various forms of output. You mention a data frame with ids in separate columns. That might be unwieldy for many types of analysis. The following list shows the ids for each meal combination, the number of meals of each combination consumed, and a matrix similar to the one you provided.
Results.lst <- split(Results$id, paste(Results[ , 1], Results[ , 2], sep=" - "))
Results.lst
# $`25 - 34`
# [1] 15 11
#
# $`25 - 77`
# [1] 15 85
#
# $`34 - 39`
# [1] 11
table(Results[, 1:2])
# mealBcode
# mealAcode 34 39 77
# 25 2 0 2
# 34 0 1 0
maxid <- 5
t(sapply(Results.lst, function(x) c(sort(as.vector(x)),
rep(NA, maxid-length(as.vector(x))))))
# [,1] [,2] [,3] [,4] [,5]
# 25 - 34 11 15 NA NA NA
# 25 - 77 15 85 NA NA NA
# 34 - 39 11 NA NA NA NA
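The value maxid <- 5 above was picked by hand. As a small sketch, it could instead be taken from the data so the matrix is exactly as wide as the largest group; Wide is a name introduced here, and Results is rebuilt from the output shown earlier so the block runs on its own.

```r
# Results as printed above, rebuilt so this block is self-contained.
Results <- data.frame(mealAcode = c(25L, 25L, 34L, 25L, 25L),
                      mealBcode = c(77L, 77L, 39L, 34L, 34L),
                      id = c(15L, 85L, 11L, 15L, 11L))
Results.lst <- split(Results$id, paste(Results[, 1], Results[, 2], sep = " - "))

# Width of the widest group instead of a hand-picked constant.
maxid <- max(lengths(Results.lst))

# Pad each group's sorted ids with NA up to maxid, one row per combination.
Wide <- t(sapply(Results.lst, function(x) c(sort(x), rep(NA, maxid - length(x)))))
Wide
```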
-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352
-----Original Message-----
From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Allaisone 1
Sent: Monday, May 22, 2017 4:41 PM
To: Bert Gunter <bgunter.4567 at gmail.com>
Cc: r-help at r-project.org
Subject: Re: [R] Identyfing rows with specific conditions
Dear Bert
I have answered your questions in my last 2 messages. If you did not see them, I will answer again. The 2 tables are both data.frames and the order of the meals does not matter. A meal cannot be repeated for a person; they are all different. The missing values are at the right end of each row, as each row starts with the meal codes. This task is just a small part of a long two-year project.
Regards
________________________________
From: Bert Gunter <bgunter.4567 at gmail.com>
Sent: 22 May 2017 14:35:49
To: Allaisone 1
Cc: r-help at r-project.org
Subject: Re: [R] Identyfing rows with specific conditions
You haven't said whether your "table" is a matrix or data frame.
Presumably the latter.
Nor have you answered my question about whether order of your meal
code pairs matters.
Another question: can meals be replicated for an ID or are they all different?
Finally, is this a homework assignment or class project of some sort?
Or is it a real task -- i.e., what is the context?
Again, be sure to cc the list.
-- Bert
Bert Gunter
"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Mon, May 22, 2017 at 1:56 AM, Allaisone 1 <allaisone1 at hotmail.com> wrote:
> Hi Bert ..,
>
>
> The number of meals differs from one customer to another. You may find
> one customer with only one meal and another one with 2, 3 or even, rarely,
> 30 meals. You may also find no meal at all for some customers, so the
> entire row takes the missing value "\N". Every row starts with the meal
> codes, and all missing values are at the right end of the table.
>
> ________________________________
> From: Bert Gunter <bgunter.4567 at gmail.com>
> Sent: 22 May 2017 03:11:11
> To: Allaisone 1
> Cc: r-help at r-project.org
> Subject: Re: [R] Identyfing rows with specific conditions
>
> Clarification:
>
> Does each customer have the same number of meals or do they differ
> from customer to customer? If the latter, how are missing meals
> notated? Do they always occur at the (right) end or can they occur
> anywhere in the row?
>
> Presumably each customer ID can have many different meal code
> combinations, right ?(since they can have 30 different meals with
> potentially 30 choose 2 = 435 combinations apiece)
>
> Please make sure you reply to the list, not just to me, as I may not
> pursue this further but am just trying to clarify for anyone else who
> may wish to help.
>
>
> Cheers,
> Bert
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Sun, May 21, 2017 at 5:10 PM, Allaisone 1 <allaisone1 at hotmail.com> wrote:
>>
>> Hi All..,
>>
>> I have 2 tables. The first one contains 2 columns with the headers say
>> "meal A code" & "meal B code " in a table called "Meals" with 2000 rows each
>> of which with a different combination of meals(unique combination per row).
>>
>>
>>>Meals
>>
>>    meal A code  meal B code
>> 1           34           66
>> 2           89           39
>> 3           25           77
>>
>> The second table(customers) shows customers ids in the first column with
>> Meals codes(M) next to each customer. There are about 300,000 customers
>> (300,000 rows).
>>
>>> Customers
>>        1    2    3    4 ..30
>>       id   M1   M2   M3
>> 1     15   77   34   25
>> 2     11   25   34   39
>> 3     85   89   25   77
>> .
>> .
>> 300,000
>>
>> I would like to identify all customers ids who have had each meal
>> combination in the first table so the final output would be the first table
>> with ids attached next to each meal combination in each row like this:
>>
>>>IdsMeals
>>
>>    MAcode  MBcode  ids
>> 1      34      39  11
>> 2      25      34  15 11
>> 3      25      77  15 85
>>
>> Would you please suggest any solutions to this problem?
>>
>> Regards
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.