[R] How to read a file containing two types of rows - (for the Netflix challenge data format)

Fri Jan 31 19:30:50 CET 2020

> On Jan 31, 2020, at 1:04 AM, Emmanuel Levy <emmanuel.levy using gmail.com> wrote:
> 
> Hi,
> 
> I'd like to use the Netflix challenge data and just can't figure out how to
> efficiently "scan" the files.
> https://www.kaggle.com/netflix-inc/netflix-prize-data
> 
> The files have two types of row, either an *ID* e.g., "1:" , "2:", etc. or
> 3 values associated to each ID:
> 
> The format is as follows:
> *1:*
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> *2:*
> value1,value2, value3
> value1,value2, value3
> *3:*
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> *4:*
> etc ...
> 
> And I want to create a matrix where each line is of the form:
> 
> ID value1, value2, value3
> 
> So "ID" needs to be duplicated - I could write a Perl script to convert
> this format to CSV, but I'm sure there's a simple R trick.
> 

I'd be tempted to use pipe() to separately read the ID lines and the value lines, but in R you can do this:

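txt <- readLines( "yourfile.txt" )
fc <- count.fields( "yourfile.txt", sep = ",")
isid <- fc == 1                        # ID rows have a single field
ids <- sub( ":$", "", txt[isid] )      # drop the trailing colon
mat <-
  cbind( ids[ cumsum(isid)[!isid] ],   # repeat each ID for its value rows
        do.call( rbind, strsplit( txt[!isid], ",")))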

This assumes that there are exactly 1 or 3 fields in each row of "yourfile.txt". If not, some incantation of grepl() applied to the text of readLines() should suffice.
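
One possible shape for that grepl() incantation is sketched below. It is untested against the real Netflix files and assumes that ID rows consist only of digits followed by a colon, and that the file starts with an ID row. The sample data and the names path, is_id, id_col and vals are made up for illustration.

```r
# Hypothetical sample in the format from the question, written to a temp file.
path <- tempfile(fileext = ".txt")
writeLines(c("1:", "10,3,2005-09-06", "20,5,2005-05-13",
             "2:", "30,4,2005-10-19"), path)

txt <- readLines(path)
is_id <- grepl("^[0-9]+:$", txt)          # ID rows look like "1:", "2:", ...
ids <- sub(":$", "", txt[is_id])          # strip the trailing colon
id_col <- ids[cumsum(is_id)[!is_id]]      # ID that precedes each value row
vals <- do.call(rbind, strsplit(txt[!is_id], ",", fixed = TRUE))
mat <- cbind(id_col, vals)                # character matrix: ID plus 3 values
```

Because the ID rows are identified by pattern rather than by field count, this variant does not care how many comma-separated fields the value rows contain, as long as every value row has the same number.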

HTH,

Chuck