[R] incomplete reading of a large csv file

Uwe Ligges ||gge@ @end|ng |rom @t@t|@t|k@tu-dortmund@de
Fri Feb 21 20:16:45 CET 2020



On 21.02.2020 20:10, Christopher W. Ryan wrote:
> sessionInfo at end of message.
> 
> I have data that I was given as an Excel .xlsx file. It contains 96266
> lines and 24 columns. I opened it in OpenOffice.org and saved it in .csv
> format, using the pipe character as a field separator. This produced a
> file with 96266 lines.
> 
> When I read it into R thusly:
> 
>> skip0.dd <- read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv",
> sep = "|", header = TRUE, comment.char = "", skip = 0)
> 
> the resulting skip0.dd dataframe has only 58208 lines:
> 
>> dim(skip0.dd)
> [1] 58208    24
> 
> 
> I've tried a variety of things to troubleshoot. Using head() and tail(),
> the expected first and last lines (comparing to the .csv file) do indeed
> exist in skip0.dd.  Several arbitrary lines from the "middle" of the csv
> file are also present in the skip0.dd dataframe.
> 
> I tried reading only the first column, which is integer, but still it
> appears that not all lines are read in:
> 
>> classes <- c(NA, rep("NULL", 23))
>> skip01.dd <- read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv",
> sep = "|", header = TRUE, comment.char = "", skip = 0, colClasses = classes)
>> dim(skip01.dd)
> [1] 58208    1
> 
> Skipping the first 50000 lines nominally should give me a dataframe of
> 46266 lines, or at least one of 50000 fewer lines than skip0.dd (i.e.
> 8208 lines), but it does neither:
> 
>> skip50000.dd <-
> read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv", sep = "|",
> header = TRUE, comment.char = "", skip = 50000)
>> dim(skip50000.dd)
> [1] 22170    24
> 
> Any thoughts on what might be going wrong? Some funky characters from
> Excel or OpenOffice.org lurking in the .csv file?

quotes are a typical proiblem, what if you try with arg quote=""?




> 
> Perhaps I'd have more success with one of the packages that enables
> reading directly from an .xlsx file.
> 
> Thanks.
> 
> --Chris Ryan
> SUNY Upstate Medical University Binghamton Clinical Campus
> Broome County Health Department
> Binghamton University
> 
> 
> ####################################
>> sessionInfo()
> R version 3.5.3 (2019-03-11)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 17763)
> 
> Matrix products: default
> 
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] dplyr_0.8.3       stringr_1.4.0     Hmisc_4.2-0       ggplot2_3.2.1
> [5] Formula_1.2-3     survival_2.44-1.1 lattice_0.20-38
> 
> loaded via a namespace (and not attached):
>   [1] Rcpp_1.0.1          pillar_1.4.0        compiler_3.5.3
>   [4] RColorBrewer_1.1-2  tools_3.5.3         base64enc_0.1-3
>   [7] digest_0.6.18       zeallot_0.1.0       rpart_4.1-13
> [10] checkmate_1.9.3     tibble_2.1.1        gtable_0.3.0
> [13] htmlTable_1.13.1    pkgconfig_2.0.2     rlang_0.4.0
> [16] Matrix_1.2-15       rstudioapi_0.10     xfun_0.7
> [19] gridExtra_2.3       knitr_1.23          withr_2.1.2
> [22] cluster_2.0.7-1     htmlwidgets_1.3     vctrs_0.2.0
> [25] grid_3.5.3          nnet_7.3-12         tidyselect_0.2.5
> [28] data.table_1.12.2   glue_1.3.1          R6_2.4.0
> [31] foreign_0.8-71      latticeExtra_0.6-28 purrr_0.3.2
> [34] magrittr_1.5        htmltools_0.3.6     backports_1.1.4
> [37] scales_1.0.0        splines_3.5.3       assertthat_0.2.1
> [40] colorspace_1.4-1    stringi_1.4.3       acepack_1.4.1
> [43] lazyeval_0.2.2      munsell_0.5.0       crayon_1.3.4
> 
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



More information about the R-help mailing list