[R] How to read a file containing two types of rows - (for the Netflix challenge data format)

Jan Galkowski
Fri Jan 31 17:05:45 CET 2020


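With the *data.table* package, *R* can use *fread* as follows (the function below parses a comma-separated log keyed by IP address, but the same two-pass pattern applies to other layouts):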

> library(data.table)  # provides fread
> 
> grab<- function(file)
> {
>  fin<- fread(file=file,
>  sep=NULL,
>  dec=".",
>  quote="", nrows=Inf, header=FALSE,
>  stringsAsFactors=FALSE, verbose=FALSE,
>  col.names=c("record"),
>  check.names=FALSE, fill=FALSE, blank.lines.skip=FALSE,
>  showProgress=TRUE,
>  data.table=FALSE, skip=0,
>  nThread=2, logical01=FALSE, keepLeadingZeros=FALSE)
>  cat(sprintf("Read '%s'.\n", file))
>  # Replace commas with tabs so the second fread can split fields on "\t".
>  # chartr is vectorized, so no row-wise apply is needed.
>  substance<- chartr(",", "\t", fin$record)
>  cat(sprintf("Translated '%s'.\n", file))
>  D<- fread(text=substance,
>  sep="\t",
>  dec=".",
>  quote="", nrows=Inf, header=FALSE,
>  stringsAsFactors=FALSE, verbose=FALSE,
>  col.names=c("ip", "valid.hits", "err.hits", "megabytes"),
>  check.names=FALSE, fill=FALSE, blank.lines.skip=FALSE,
>  showProgress=TRUE,
>  data.table=FALSE, skip=0,
>  nThread=2, logical01=FALSE, keepLeadingZeros=FALSE)
>  cat(sprintf("Parsed '%s'.\n", file))
>  # Extract the fourth octet of each IP address as an integer.
>  ip<- D$ip
>  withinBlock<- sapply(X=ip, FUN=function(s) as.integer((strsplit(x=s, split=".", fixed=TRUE)[[1]])[4]), USE.NAMES=FALSE)
>  D$within.block<- withinBlock
>  return(D)
> }
> 

In short, the first call to *fread* reads every record as a single string into an internal structure, which can be edited or manipulated at will, and a second call to *fread* then parses the edited text into proper columns.

*fread* is fast, even for big datasets.
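
For the two-row-type layout in the subject line, the same idea might look roughly like the sketch below. It assumes a header line such as "1:" opens each block and that the lines after it hold three comma-separated fields; the function name *parse_two_row_types*, the column names, and the sample values are made up for illustration and are not from the original question.

```r
library(data.table)

# Hypothetical helper: 'lines' is a character vector with one record per
# element, e.g. the single "record" column returned by a first fread pass
# with sep=NULL.
parse_two_row_types <- function(lines) {
  # Header rows are assumed to look like "123:" (digits, then a colon).
  is_header <- grepl("^[0-9]+:$", lines)
  stopifnot(length(lines) > 0, is_header[1])  # input must start with a header

  header_ids <- as.integer(sub(":$", "", lines[is_header]))
  block <- cumsum(is_header)          # index of the block each line belongs to
  keep <- !is_header & nzchar(lines)  # data rows only, blank lines dropped
  stopifnot(any(keep))

  # Second pass: let fread parse only the data rows.
  D <- fread(text = lines[keep], sep = ",", header = FALSE,
             col.names = c("customer", "rating", "date"))
  # Attach the id of the header row that precedes each data row.
  D[, movie := header_ids[block[keep]]]
  D[]
}

# Made-up sample input in the assumed layout.
sample_lines <- c("1:", "101,3,2005-09-06", "102,5,2005-05-13",
                  "2:", "103,4,2005-09-05")
parsed <- parse_two_row_types(sample_lines)
print(parsed)
```

Here the "editing" step between the two passes is just separating header rows from data rows and carrying each header id forward with *cumsum*; for a real file, *lines* could be the single column from the first *fread* call.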


--
Jan Galkowski 


