[R] read.xls counts more rows than I really have in my Excel

Tue Feb 8 16:38:44 CET 2011

On 2/8/2011 2:41 AM, (Ted Harding) wrote:
> On 08-Feb-11 10:05:58, agent dunham wrote:
>> Hi,
>> i'm using read.xls, and it counts more rows in my Excel
>> than I really have.
>>
>> i've used both:
>>
>>   especie26<- read.xls("especie26cargar.xls")
>>
>>   datos26<- read.xls("especie26cargar.xls", header = TRUE,
>>      as.is = FALSE, na.strings = "NA",
>>      skip = 0, check.names = TRUE, fill = FALSE,
>>      strip.white = FALSE, blank.lines.skip = TRUE)
>>
>> when i check dim(especie26) or dim(datos26) i got 858  17
>> and it should be 667  17
>>
>> thanks in advance !!!!
>> --
>
> You say "i'm using read.xls, and it counts more rows
> in my Excel than I really have." I would be inclined
> to the view that this should be "it counts more rows
> in my Excel than I think I really have."
>
> In other words, very probably there really are 858 rows
> in your spreadsheet, even though you think there are 667
> (and can only see 667).
>
> As a check, I would suggest that you export the entire
> worksheet into a CSV file, e.g. "especie26cargar.csv".
> For the purposes of this check DO NOT highlight the data
> area to be saved (since doing this would export your
> possibly false impression of the data as well).
>
> Then, using some suitable utility, count the number of
> lines in the resulting CSV file "especie26cargar.csv".
> I am prepared to bet that you will find 858 lines in
> this file (plus 1 for the header).

I, for one, would not be willing to take the other side of that bet.

> In that case the excess (858 - 667) = 191 rows would
> probably consist of rows containing only commas, like
>
>    ,,,,,,,,,,,,,,,,
>
> (i.e. 16 commas, for the 17 fields, though the nunber
> might vary from row to row).
>
> The phenomenon on which I base this guess is that very often,
> when people enter data into an Excel spreadsheet, they
> somehow allow the cursor to wander outside the area of
> the spreadsheet which is to be occupied by real data.
> Internally (and invisibly to the user) Excel will then
> attribute the extraneous cells into which the cursor had
> wandered as being data cells with empty content. They
> therefore continue to appear on screen as pure blanks,
> but they have been invisibly added to the structure of
> the spreadsheet.
>
> A possible alternative to this check (but less definitive)
> is to look at the dataframe which you get after you have
> read the spreadsheet into R.
>
> E.g., after
>
>    especie26<- read.xls("especie26cargar.xls")
>
> on the basis tghat you should only have 667 data rows,
> have a look at the result of, say,
>
>    especie26[(670:680),]
>
> or some other range of rows beyond (1:667). On the same
> basis as above, I am betting that these rows probably
> consist solely of NAs.
>
> Why am I saying this? Because more times than I have any
> wish to remember in detail I have encountered exactly this
> problem in spreadsheets sloppily prepared by people who
> believed what they saw on screen, totally unaware of
> what Excel had really done to their data. Excel is a con
> artist: it imparts an impression of success and confidence
> to the user, which may be quite false.

Let me add another way to detect this.  In Excel, on the sheet of 
interest, the key combination Ctrl-End will take you to the last (lower 
right-most) cell that Excel considers part of the worksheet.  I would 
guess that, in your case, this will be row 859, column 17 (Q859).

> By the way: By a similar mechanism, you can find that
> different rows in the spreadsheet have different numbers
> of fields.

It is possible to shrink the area Excel considers the data area. 
Highlight the rows (using the row indicators) after your data down to 
the supposed end.  Then delete them using the right-click menu option 
"delete".  A similar thing can be done with excess columns.  Save the 
file.  Then verify that Excel now notes the end of the data area the 
same place you do with Ctrl-End.

> Ted.
> feel
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding)<ted.harding at wlandres.net>
> Fax-to-email: +44 (0)870 094 0861
> Date: 08-Feb-11                                       Time: 10:41:27
> ------------------------------ XFMail ------------------------------
>

-- 
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University