[R] Unexpected behaviour from read.table
pdalgd at gmail.com
Mon Feb 5 10:57:06 CET 2018
This looks like a bug. Specifically, inside read.table
lines <- .External(C_readtablehead, file, nlines, comment.char,
blank.lines.skip, quote, sep, skipNul)
returns "lines" as
 "ID\tValue" "=\"Total\"\t1000"
 "=\"CJ01 \"\t550\n=\"CF02\"\t450"
Notice the embedded \n in the 3rd line. I.e., there are really 4 lines there. This gets pushed back twice and the first 3 (not 4) lines get read again as part of the header logic. Then when it comes to reading the data proper, the 4th line has ended up duplicated as the top row...
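One way to check whether a given R build is affected is to compare the number of physical lines in the input with the number of rows read.table returns. This is only a sketch: the variable names are illustrative, and the outcome depends on the R version in use (the report concerns 3.4.3).

```r
# Input from the original report: note the trailing space inside "CJ01 ".
txt <- "ID\tValue\n=\"Total\"\t1000\n=\"CJ01 \"\t550\n=\"CF02\"\t450"

# Count physical lines by splitting on "\n" only; no quote handling is involved.
physical <- strsplit(txt, "\n", fixed = TRUE)[[1]]
n_physical <- length(physical)

# Read the same text with the default quote settings, as in the report.
df <- read.table(text = txt, header = FALSE, sep = "\t")
n_rows <- nrow(df)

# With header = FALSE every physical line should become exactly one row.
if (n_rows != n_physical) {
  message("Row count mismatch: ", n_rows, " rows from ", n_physical, " lines")
} else {
  message("Row count matches the number of input lines")
}
```

A mismatch, or a duplicated first row, would indicate the pushback problem described above.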
As you suggest, it seems that something is up with the quote matching logic.
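Until the quote handling is sorted out, the workaround from the original post (switch off quote matching and unwrap the strings afterwards) can be sketched as follows. The helper name strip_excel_wrapper is hypothetical and not part of R, and the trimws() step is an assumption about what is wanted for the trailing spaces.

```r
# Input from the original report: note the trailing space inside "CJ01 ".
txt <- "ID\tValue\n=\"Total\"\t1000\n=\"CJ01 \"\t550\n=\"CF02\"\t450"

# quote = "" disables quote matching, so each field is taken literally.
raw <- read.table(text = txt, header = TRUE, sep = "\t", quote = "",
                  stringsAsFactors = FALSE)

# Hypothetical helper: remove the ="..." wrapper, leaving other values untouched.
strip_excel_wrapper <- function(x) sub('^="(.*)"$', "\\1", x)

# Unwrap the strings, then drop the surrounding whitespace (optional).
raw$ID <- trimws(strip_excel_wrapper(raw$ID))
```

The Value column is not wrapped, so it is still read as numeric data.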
> On 4 Feb 2018, at 23:45 , Michael <michael77allen at gmail.com> wrote:
> I've been struggling with seemingly 'corrupt' data.frames for a few days, and believe I've narrowed the problem down to some odd behaviour from read.table.
> I receive a tab-delimited file from an external provider where strings are encoded as ="content". I'm not sure why; perhaps because most users open it in Excel.
> My specific issue is that trailing spaces in any of the strings cause strange results from read.table.
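> # No trailing spaces
> read.table(text="ID\tValue\n=\"Total\"\t1000\n=\"CJ01\"\t550\n=\"CF02\"\t450",header=FALSE,sep='\t')
>       V1    V2
> 1     ID Value
> 2 =Total  1000
> 3  =CJ01   550
> 4  =CF02   450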
> # Now with trailing spaces in line 3
> read.table(text="ID\tValue\n=\"Total\"\t1000\n=\"CJ01 \"\t550\n=\"CF02\"\t450",header=FALSE,sep='\t')
>       V1    V2
> 1  =CF02   450
> 2     ID Value
> 3 =Total  1000
> 4 =CJ01    550
> 5  =CF02   450
> I solved my specific problem by setting quote="" and extracting the string content after calling read.table. As my original code had header=TRUE, I was finding that seemingly random rows were being used as column names!
> Flagging a potential issue with read.table, although I can easily accept I'm missing something obvious here.
> R version 3.4.3 (2017-11-30)
> Platform: x86_64-apple-darwin15.6.0 (64-bit) / x86_64-pc-linux-gnu (64-bit)
> Running under: macOS High Sierra 10.13.2 / Ubuntu 16.04.3 LTS
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Office: A 4.23
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com