[R] why must a named colClasses in read.table be in correct order

Andreas Leha andreas.leha at med.uni-goettingen.de
Thu Jul 9 05:15:16 CEST 2015


Hi Henrik,

Thank you very much for looking into this.  And thanks for the patch!

Yes, let's hope this is a typo that gets fixed.
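
Until then, a workaround that does not depend on the patch might be to
reorder the named vector by the header before calling read.table(),
rather than sorting the names alphabetically as in my MWE (which only
worked because the columns happen to be in alphabetical order).  This is
only a minimal sketch: 'hdr' and 'ordered' are names made up for
illustration, and it assumes the header names are already syntactic, so
that read.table() does not alter them.

```r
kkk <- c("a\tb",
         "3.14\tx")

## named colClasses, deliberately not in column order
cclasses <- c(b = "character",
              a = "numeric")

## read only the header line to learn the actual column order
hdr <- scan(text = kkk[1], what = character(), sep = "\t", quiet = TRUE)

## index by name, so the classes follow the order of the columns
ordered <- cclasses[hdr]

data <- read.table(text = kkk,
                   sep = "\t",
                   header = TRUE,
                   colClasses = ordered)
str(data)
```

If a column had no entry in cclasses, the indexing would yield NA for it,
which read.table() treats as "guess the class", so partial
specifications should still work.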

Regards,
Andreas

Henrik Bengtsson <henrik.bengtsson at ucsf.edu> writes:
> Thanks for insisting; I was wrong and I'm happy to see that there is
> indeed code intended for named 'colClasses', which even goes back to
> 2004.   But as you report, names only work when
> length(colClasses) < cols (which also explains why I thought it was not
> supported).  I'm not sure if that _strictly less than_ test is
> intentional or a mistake, but I would propose the following patch:
>
> [HB-X201]{hb}: svn diff src\library\utils\R\readtable.R
> Index: src/library/utils/R/readtable.R
> ===================================================================
> --- src/library/utils/R/readtable.R     (revision 68642)
> +++ src/library/utils/R/readtable.R     (working copy)
> @@ -139,7 +139,7 @@
>      if (rlabp) col.names <- c("row.names", col.names)
>
>      nmColClasses <- names(colClasses)
> -    if(length(colClasses) < cols)
> +    if(length(colClasses) <= cols)
>          if(is.null(nmColClasses)) {
>              colClasses <- rep_len(colClasses, cols)
>          } else {
>
>
> Your example works with this patch.  I've made it source():able so you
> can try it out (if you cannot source() https://, then download the
> file and source it locally):
>
> source("https://gist.githubusercontent.com/HenrikBengtsson/ed1eeb41a1b4d6c43b47/raw/ebe58f76e518dd014423bea466a5c93d2efd3c99/readtable-fix.R")
>
> kkk <- c("a\tb",
>          "3.14\tx")
>
> colClasses <- c(a="numeric", b="character")
> data <- read.table(textConnection(kkk),
>                    sep="\t",
>                    header = TRUE,
>                    colClasses = colClasses)
> str(data)
> ### 'data.frame':   1 obs. of  2 variables:
> ### $ a: num 3.14
> ### $ b: chr "x"
>
> ## Does not work with utils::read.table(), but does with the patch
> data <- read.table(textConnection(kkk),
>                    sep="\t",
>                    header = TRUE,
>                    colClasses = rev(colClasses))
> str(data)
> ### 'data.frame':   1 obs. of  2 variables:
> ### $ a: num 3.14
> ### $ b: chr "x"
>
> Let's hope that the above is a (10-year-old) typo, and that changing
> a < to a <= adds support for named 'colClasses', which is really
> useful functionality.
>
> /Henrik
>
> On Wed, Jul 8, 2015 at 6:42 PM, Andreas Leha
> <andreas.leha at med.uni-goettingen.de> wrote:
>> Hi Henrik,
>>
>> Thanks for your reply.
>>
>> I am not (yet) convinced, though.  The help page for read.table
>> mentions named colClasses, and if I specify colClasses for only some
>> of the columns, the names are taken into account:
>>
>> --8<---------------cut here---------------start------------->8---
>> kkk <- c("a\tb",
>>          "3.14\tx")
>> str(read.table(textConnection(kkk),
>>                sep="\t",
>>                header = TRUE))
>>
>> str(read.table(textConnection(kkk),
>>                sep="\t",
>>                header = TRUE,
>>                colClasses=c(b="character")))
>> --8<---------------cut here---------------end--------------->8---
>>
>> What am I missing?
>>
>> Best,
>> Andreas
>>
>>
>>
>> On 09/07/2015 02:21, Henrik Bengtsson wrote:
>>> read.table() does not make use of names(colClasses) - only its values.
>>> Because of this, ordering is critical, as you noted. It shouldn't be
>>> too hard to add support for a named `colClasses` argument of
>>> utils::read.table(), but someone needs to convince the R core team
>>> that this is a good idea.
>>>
>>> As an alternative, see R.filesets::readDataFrame() for a
>>> read.table()-like function that matches names(colClasses) to column
>>> names, if they exist.
>>>
>>> /Henrik
>>> (author of R.filesets)
>>>
>>> On Wed, Jul 8, 2015 at 5:41 PM, Andreas Leha
>>> <andreas.leha at med.uni-goettingen.de> wrote:
>>>> Hi all,
>>>>
>>>> Apparently, the colClasses argument to read.table needs to be in the
>>>> order of the columns *even when it is named*.  Why is that?  And where
>>>> would I find it in the documentation?
>>>>
>>>> Here is a MWE:
>>>>
>>>> --8<---------------cut here---------------start------------->8---
>>>> kkk <- c("a\tb",
>>>>          "3.14\tx")
>>>> read.table(textConnection(kkk),
>>>>            sep="\t",
>>>>            header = TRUE)
>>>>
>>>> cclasses=c(b="character",
>>>>            a="numeric")
>>>>
>>>> read.table(textConnection(kkk),
>>>>            sep="\t",
>>>>            header = TRUE,
>>>>            colClasses = cclasses)              ## <--- error
>>>>
>>>> read.table(textConnection(kkk),
>>>>            sep="\t",
>>>>            header = TRUE,
>>>>            colClasses = cclasses[order(names(cclasses))])
>>>> --8<---------------cut here---------------end--------------->8---
>>>>
>>>>
>>>> Thanks,
>>>> Andreas
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.


