[R] How to read a file containing two types of rows - (for the Netflix challenge data format)

Chris Evans chr|@ho|d @end|ng |rom p@yctc@org
Fri Jan 31 11:39:48 CET 2020


I am sure Rainer's approach is good and I know my R programming is truly terrible but here's a crude script in base R that does what you want

# rawDat <- readLines(con = "netflix.dat")
fil <- tempfile(fileext = ".dat")
cat("*1:*
value1,value2, value3
value1,value2, value3
value1,value2, value3
value1,value2, value3
*2:*
value1,value2, value3
value1,value2, value3
*3:*
value1,value2, value3
value1,value2, value3
value1,value2, value3
*4:*", 
    file = fil,
    sep = "\n")
rawDat <- readLines(fil, n = -1)
unlink(fil) # tidy up data input

### create a data frame for output
### this first line will be overwritten by the actual data
outDF <- as.data.frame(list(id = 1,
                            value1 = "",
                            value2 = "",
                            value3 = ""),
                       stringsAsFactors = FALSE) # necessary to avoid mess with character to factor conversion

j <- 0 # counter for entries
for (i in 1:length(rawDat)) {
  rawDat[i] <- trimws(rawDat[i])
  if (nchar(rawDat[i]) == 0) next # skip empty lines
  if (grepl(":*", rawDat[i], fixed = TRUE)) {
    ### got an ID line
    id <- sub("\\*([0123456789]*):\\*", "\\1", rawDat[i])
  } else {
    ### not an ID line so one of the one or more following lines of data
    ### I have assumed these are all of the same form
    j <- j + 1
    rawDat[i] <- gsub(" ", "", rawDat[i], fixed = TRUE)
    tmpDat <- unlist(strsplit(rawDat[i], ","))
    outDF[j,1] <- id
    outDF[j,2:4] <- tmpDat
  }
}
outDF

I am slowly adapting to the tidyverse but this is something I still find easier to do in very crude for loop, base R.

Plea: my formal programming training is one week of "Introduction to FORTRAN" on teletypes in 1975, but I confess it's 
both lack of formal training _and_ lack of native ability that means my coding is so bad.

If any gurus have a moment, show us really elegant and tidyverse ways to do this!

Very best all,

Chris

----- Original Message -----
> From: "Rainer M Krug" <Rainer using krugs.de>
> To: "Emmanuel Levy" <emmanuel.levy using gmail.com>
> Cc: "R-help Mailing List" <r-help using r-project.org>
> Sent: Friday, 31 January, 2020 10:55:46
> Subject: Re: [R] How to read a file containing two types of rows - (for the Netflix challenge data format)

> I did something similar yesterday…
> 
> Use readLine() to read at in and identify the “*1:*, … with a regex. Than you
> have your dividers. In a second step, use read.csv(skip = …, Ncollumns = …)  to
> read the enclosed blocks, and last, combine them accordingly.
> 
> This is written without an R installation, so the argument names are likely
> wrong.
> 
> Rainer
> 
> 
>> On 31 Jan 2020, at 10:04, Emmanuel Levy <emmanuel.levy using gmail.com> wrote:
>> 
>> Hi,
>> 
>> I'd like to use the Netflix challenge data and just can't figure out how to
>> efficiently "scan" the files.
>> https://www.kaggle.com/netflix-inc/netflix-prize-data
>> 
>> The files have two types of row, either an *ID* e.g., "1:" , "2:", etc. or
>> 3 values associated to each ID:
>> 
>> The format is as follows:
>> *1:*
>> value1,value2, value3
>> value1,value2, value3
>> value1,value2, value3
>> value1,value2, value3
>> *2:*
>> value1,value2, value3
>> value1,value2, value3
>> *3:*
>> value1,value2, value3
>> value1,value2, value3
>> value1,value2, value3
>> *4:*
>> etc ...
>> 
>> And I want to create a matrix where each line is of the form:
>> 
>> ID value1, value2, value3
>> 
>> Si "ID" needs to be duplicated - I could write a Perl script to convert
>> this format to CSV, but I'm sure there's a simple R trick.
>> 
>> Thanks for suggestions!
>> 
>> Emmanuel
>> 
>> 	[[alternative HTML version deleted]]
>> 
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> --
> Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology,
> UCT), Dipl. Phys. (Germany)
> 
> Orcid ID: 0000-0002-7490-0066
> 
> Department of Evolutionary Biology and Environmental Studies
> University of Zürich
> Office Y34-J-74
> Winterthurerstrasse 190
> 8075 Zürich
> Switzerland
> 
> Office:	+41 (0)44 635 47 64
> Cell:       	+41 (0)78 630 66 57
> email:      Rainer.Krug using uzh.ch
>		Rainer using krugs.de
> Skype:     RMkrug
> 
> PGP: 0x0F52F982
> 
> 
> 
> 
>	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Chris Evans <chris using psyctc.org> Visiting Professor, University of Sheffield <chris.evans using sheffield.ac.uk>
I do some consultation work for the University of Roehampton <chris.evans using roehampton.ac.uk> and other places
but <chris using psyctc.org> remains my main Email address.  I have a work web site at:
   https://www.psyctc.org/psyctc/
and a site I manage for CORE and CORE system trust at:
   http://www.coresystemtrust.org.uk/
I have "semigrated" to France, see: 
   https://www.psyctc.org/pelerinage2016/semigrating-to-france/ 
That page will also take you to my blog which started with earlier joys in France and Spain!

If you want to book to talk, I am trying to keep that to Thursdays and my diary is at:
   https://www.psyctc.org/pelerinage2016/ceworkdiary/
Beware: French time, generally an hour ahead of UK.



More information about the R-help mailing list