[R] String comparison, trailing blanks make a difference.
john.archie.mckown at gmail.com
Sat Jul 19 20:46:09 CEST 2014
On Fri, Jul 18, 2014 at 11:17 AM, John McKown
<john.archie.mckown at gmail.com> wrote:
> Well, this was a shock to me. And I don't really see any documentation
> about it, but perhaps I just can't see it.
> > "abc" == "abc "
> [1] FALSE
> I guess that I thought of strings in R like I do in some other
> languages where the shorter value is padded with blanks to the length
> of the longer value, then compared. I.e. that trailing blanks didn't
> matter.
> The best solution that I have found is to use the str_trim() function
> from the stringr package to remove all the trailing blanks after I get
> the data from the SQL database. I cannot change the SQL schema to make
> the column a varchar instead of a char column. It is a vendor DB. And
> I don't know an ANSI SQL standard way to remove trailing blanks in the
> SELECT command. PostgreSQL has a "trim(trailing ' ' from column)", but
> MS-SQL upchucks on that syntax.
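
For reference, the behaviour and the R-side workaround quoted above can be sketched in base R, without stringr. This is only a minimal sketch: the helper names rtrim() and trim_char_cols() and the sample values are made up for illustration, and it assumes the character columns arrive as character vectors rather than factors.

```r
# Trailing blanks are significant when R compares strings.
padded <- c("abc ", "de   ", "f")   # hypothetical CHAR(n)-style values
print("abc" == padded[1])           # the comparison from the original post

# Strip trailing blanks only; leading blanks are left alone.
# (stringr::str_trim(x, side = "right") is the stringr counterpart.)
rtrim <- function(x) sub(" +$", "", x)

# Apply rtrim() to every character column of a data frame.
trim_char_cols <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], rtrim)
  df
}

df <- data.frame(code = padded, n = 1:3, stringsAsFactors = FALSE)
clean <- trim_char_cols(df)
print(clean$code == c("abc", "de", "f"))
```

Trimming once, right after the data is fetched, means every later comparison can use plain == without having to remember the padding.
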
Well, here I am - talking to myself ... again.
My "problem" was, of course, of my own making. I am getting my data
via RODBC from MS-SQL Server. I was basically doing a "SELECT * FROM
TABLE". I normally use PostgreSQL, not MS-SQL, and I tend to use the
"TEXT" data type instead of CHAR or VARCHAR. So when I do the SELECT,
I get back my data without trailing blanks. Well, the data I am
reading now is created by a software vendor. I guess in order to be
database independent, the vendor designed his tables to have only
fixed-length CHAR and INT columns. The fixed-length CHAR values
are, naturally, padded on the right with blanks. Of course, now that I
understand this (weird as it is to me), I know to use a SELECT which
specifically lists the columns that I want _and_ does a TRIM() on them
to remove trailing blanks. This will reduce the size, in bytes, of my
data.frame and make it easier to use the comparison operators. Given
how the vendor saves the data, I am quite surprised that they didn't
use SQLite. The tables are simple. There are no "stored procedures",
no VIEWs, no use of SCHEMAs to make subsets. Basically they just want
a simple data store, with the ability to do _simple_ joins. SQLite
seems, to me, to be a better fit than requiring the user to have a
full-blown RDBMS such as MS-SQL or Oracle.
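
A rough sketch of the "list the columns and trim them in the SELECT" approach, for anyone who finds this thread later. The table and column names below are invented, build_trimmed_select() is just a hypothetical helper, and it uses RTRIM(), which both MS-SQL and PostgreSQL accept even though it is not the ANSI TRIM(TRAILING ...) spelling. It does no identifier quoting, so it is only suitable for simple, trusted column names.

```r
# Build a SELECT that right-trims the named CHAR columns.
# Table and column names are hypothetical.
build_trimmed_select <- function(table, char_cols, other_cols = character(0)) {
  trimmed <- sprintf("RTRIM(%s) AS %s", char_cols, char_cols)
  sprintf("SELECT %s FROM %s",
          paste(c(trimmed, other_cols), collapse = ", "),
          table)
}

query <- build_trimmed_select("vendor_table",
                              char_cols  = c("cust_name", "status"),
                              other_cols = "cust_id")
print(query)

# With an open RODBC channel `ch` (assumed, not created here), the query
# would be run along these lines:
# dat <- RODBC::sqlQuery(ch, query, stringsAsFactors = FALSE)
```

Doing the trim on the server keeps the padding from ever reaching the data.frame, at the cost of having to name every column explicitly.
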
Well, thanks for the whack on the head to wake me up and make me
really look at my data.
There is nothing more pleasant than traveling and meeting new people!