[R] How to read a file containing two types of rows - (for the Netflix challenge data format)

Fri Jan 31 22:36:10 CET 2020

Hi All,

Thanks so much for your inputs, it's so nice to have such a helpful
community -- I wrote some kind of mix between the different replies, I copy
my final code below.

All the best,

Emmanuel

mat = read.csv("~/format_test.csv", fill=TRUE, header=FALSE, as.is=TRUE)

first.col.idx = grep(":",mat[,1])

first.col.val = grep(":",mat[,1],value=TRUE)

first.col.val = gsub(pattern=":", replacement="", first.col.val)

mat.clean = mat[-first.col.idx,]

reps = diff(c(first.col.idx,length(mat1[,1])))

reps[1] = reps[1]+1

mat.final = cbind( rep(first.col.val, reps-1), mat.clean)

On Fri, 31 Jan 2020 at 20:31, Berry, Charles <ccberry using health.ucsd.edu>
wrote:

>
>
> > On Jan 31, 2020, at 1:04 AM, Emmanuel Levy <emmanuel.levy using gmail.com>
> wrote:
> >
> > Hi,
> >
> > I'd like to use the Netflix challenge data and just can't figure out how
> to
> > efficiently "scan" the files.
> > https://www.kaggle.com/netflix-inc/netflix-prize-data
> >
> > The files have two types of row, either an *ID* e.g., "1:" , "2:", etc.
> or
> > 3 values associated to each ID:
> >
> > The format is as follows:
> > *1:*
> > value1,value2, value3
> > value1,value2, value3
> > value1,value2, value3
> > value1,value2, value3
> > *2:*
> > value1,value2, value3
> > value1,value2, value3
> > *3:*
> > value1,value2, value3
> > value1,value2, value3
> > value1,value2, value3
> > *4:*
> > etc ...
> >
> > And I want to create a matrix where each line is of the form:
> >
> > ID value1, value2, value3
> >
> > Si "ID" needs to be duplicated - I could write a Perl script to convert
> > this format to CSV, but I'm sure there's a simple R trick.
> >
>
> I'd be tempted to use pipe() to separately read the ID lines and the value
> lines, but in R you can do this:
>
> fc <- count.fields( "yourfile.txt", sep = ",")
> rlines <- split( readLines( "yourfile.txt" ), fc)
> mat <-
>   cbind( rlines[[1]],
>         do.call( rbind, strsplit( rlines[[2]], ",")))
>
>
> This assumes that there are exactly 1 or 3 fields in each row of
> "yourfile.txt", if not, some incantation of grepl() applied to the text of
> readLines() should suffice.
>
> HTH,
>
> Chuck
>
>
>
>

	[[alternative HTML version deleted]]