[R] Loop to check for large dataset

Adrian Dușa dusa.adrian at unibuc.ro
Mon Oct 10 12:25:46 CEST 2016


This is an example of what reproducible code looks like, assuming you have
three columns in your dataset named S (store), P (product) and W (week),
and also assuming they have integer values from 1 to 19, 1 to 22 and 1 to
157 respectively:

#########

mydata <- expand.grid(seq(19), seq(22), seq(157))
names(mydata) <- c("S", "P", "W")

# randomly delete 65626 - 63127 = 2499 rows
set.seed(12345) # make it replicable

mydata <- mydata[-sample(seq(nrow(mydata)), nrow(mydata) - 63127), ]

#########
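
As a quick sanity check on the simulated data (a sketch only; the names
n_full and n_drop are introduced here for illustration and are not part of
your data), the arithmetic behind the comment above can be verified directly:

```r
# Rebuild the simulated data exactly as in the example above
mydata <- expand.grid(S = seq(19), P = seq(22), W = seq(157))
set.seed(12345) # same seed as above, so the same rows are dropped
mydata <- mydata[-sample(seq(nrow(mydata)), nrow(mydata) - 63127), ]

n_full <- 19 * 22 * 157          # size of the complete store x product x week grid
n_drop <- n_full - nrow(mydata)  # number of rows that were deleted

n_full                  # 65626
n_drop                  # 2499
anyDuplicated(mydata)   # 0, i.e. no (S, P, W) combination appears twice
```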


Now the dataframe mydata contains exactly 63127 rows, just as in your case.
The task is to find which weeks are missing, from which store and for which
product.
Below is one possible way to do that. Given that you have a small number of
stores and products, I'll keep it simple and stupid by using for loops:


#########

result <- matrix(nrow = 0, ncol = 3)

for (i in seq(19)) {
    for (j in seq(22)) {
        miss <- setdiff(seq(157), mydata$W[mydata$S == i & mydata$P == j])
        if (length(miss) > 0) {
            result <- rbind(result, cbind(S = i, P = j, W = miss))
        }
    }
}

# The result matrix contains the 2499 missing (S, P, W) combinations, one per row.

> head(result)
     S P   W
[1,] 1 1  10
[2,] 1 1  11
[3,] 1 1  82
[4,] 1 1 100
[5,] 1 1 117
[6,] 1 1 148

#########


In this example, for S(tore) number 1 and P(roduct) number 1, you are
missing W(eek) 10, 11, 82 and so on.
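
If the loops ever become too slow, the same idea can probably be written
without them. The sketch below is only one alternative, not necessarily the
best one: it builds the complete grid of combinations and keeps the rows
that do not appear in the data. The helper make_key and the objects full,
missing and counts are hypothetical names introduced for this illustration.

```r
# Rebuild the simulated data exactly as in the example above
mydata <- expand.grid(S = seq(19), P = seq(22), W = seq(157))
set.seed(12345)
mydata <- mydata[-sample(seq(nrow(mydata)), nrow(mydata) - 63127), ]

# Every store x product x week combination that should exist
full <- expand.grid(S = seq(19), P = seq(22), W = seq(157))

# Hypothetical helper: one string key per row, e.g. "1_1_10"
make_key <- function(d) paste(d$S, d$P, d$W, sep = "_")

# Keep the combinations of the full grid that are absent from the data
missing <- full[!(make_key(full) %in% make_key(mydata)), ]
missing <- missing[order(missing$S, missing$P, missing$W), ]
rownames(missing) <- NULL

nrow(missing)   # 2499, the number of deleted rows

# Number of missing weeks for each store/product pair that has any
counts <- aggregate(W ~ S + P, data = missing, FUN = length)
```

With your real data you would replace the three seq() calls by the actual
store, product and week values, and S, P, W by your own column names.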

Hoping you can adapt this code to your particular example,
Adrian


On Sun, Oct 9, 2016 at 2:26 AM, Christoph Puschmann <
c.puschmann at student.unsw.edu.au> wrote:
>
> Dear Adrian,
>
> Yes, it is a cyclical data set and theoretically it should repeat this
> interval until 61327. The data set itself is divided into 2 parts:
> 1. Product category (column 10)
> 2. Number of Stores Participating (column 01)
> Overall there are 22 different products and in each you have 19 different
> stores participating. And theoretically each store over each product
> category should have a 1 - 157 week interval.
>
> The part I am struggling with is how do I run a loop over the whole data
> set, while checking if all stores participated 157 weeks over the different
> products.
>
> So far I came up with this:
>
> n=61327                           # Generate Matrix to check for values
> Control = matrix(
>   0,
>   nrow = n,
>   ncol = 1)
>
> s <- seq(from =1 , to = 157, by = 1)
> CW = matrix(
>   s,
>   nrow = 157,
>   ncol = 1
> )
>
> colnames(CW)[1] <- 's'
>
> CW = as.data.frame(CW)
>
> for (i in 1:nrow(FD)) {           # Run through all the rows
>   for (j in 1:157) {
> if(FD$WEEk[j] == C$s[j]) {
>   Control[i] = 1                 # corresponding control row = 1
> } else {
>   Control[i] = 0                 # corresponding control row = 0
> }
> }
> }
>
> I coded an MRE and attached a sample of my data set.
>
> MRE:
>
> #MRE
>
> dat <- data.frame(
>   Store = c(rep(8, times = 157), rep(12, times = 157)),  # Number of stores
>   WEEK = rep(seq(from=1, to = 157, by = 1), times = 2)
> )
>
>
>
>



--
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr.90
050663 Bucharest sector 5
Romania
