[BioC] modify colClasses in read.columns?

Gordon K Smyth smyth at wehi.EDU.AU
Mon Apr 28 02:50:29 CEST 2008


Dear Henrik,

Continuing on from Wolfgang's reply ...

The main reason for the read.columns() function in the limma package is to 
avoid having to go through the rigmarole of setting up the colClasses 
argument to read.table().  If you want to set up colClasses yourself, it 
is expected that you will use read.table() directly.  I will add some 
comments to the read.columns help page to make this clearer.
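
As a rough sketch of the read.table() route for data like Henrik's: read 
the header line first, build colClasses from the column names, and let 
"NULL" drop everything else. The temporary file, the variable names and 
the regular expression below are assumptions made for illustration; the 
data rows are the two quoted further down in this thread.

```r
# Example data modelled on the rows quoted in this thread (tab-delimited).
lines <- c(
  "ID i\tID j\tNi\tNj\tS\tA\tR1\tB\tR2\tC\tR3\tD\tR4",
  "8414341.20\t8414342.20\t1\t2\t-1\t1\t0.425183\t1\t0.758413\t1\t0.551275\t1\t0.543045",
  "8414341.20\t8414343.20\t1\t3\t-1\t1\t0.128981\t1\t0.034859\t1\t-0.001998\t1\t0.002093"
)
infile <- tempfile(fileext = ".txt")  # hypothetical stand-in for a real data file
writeLines(lines, infile)

# Read only the header line to learn the column names.
header <- strsplit(readLines(infile, n = 1), "\t", fixed = TRUE)[[1]]

# Start by skipping every column ("NULL"), then switch on the wanted ones.
classes <- rep("NULL", length(header))
classes[header %in% c("ID i", "ID j")] <- "character"  # keeps the trailing zero
classes[grepl("^R[0-9]+$", header)] <- "numeric"       # R1, R2, R3, R4

# check.names = FALSE keeps "ID i" and "ID j" as they appear in the file.
dat <- read.table(infile, header = TRUE, sep = "\t",
                  colClasses = classes, check.names = FALSE)
str(dat)
```

Because the ID columns are read as character, values such as 8414341.20 
are never converted to numeric, so the ".20" is kept.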

Best wishes
Gordon

On Sun, 27 Apr 2008, Wolfgang Huber wrote:

> Dear Henrik,
>
> with a file test.txt as follows:
>
> A	B	C
> 1	4711	34.50
> 2	ZAZA	01.40
>
> and the call
>
> z=read.table("test.txt", colClasses=c("integer", "NULL", "character"),
>           header=TRUE, sep="\t")
>
> I get
>
>> str(z)
> 'data.frame':   2 obs. of  2 variables:
> $ A: int  1 2
> $ C: chr  "34.50" "01.40"
>
>
> so maybe the functionality you wish is already provided by read.table?
>
> From looking at its code and man page, I don't think read.columns is designed 
> to accept user input for what it takes as colClasses. In fact, when I try to 
> supply colClasses to read.columns, I get:
>
> Error in read.table(file = file, header = TRUE, col.names = allcnames:
>  formal argument "colClasses" matched by multiple actual arguments
>
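
That message is what R reports when a function call receives the same 
named argument twice. A minimal sketch of the mechanism, using a 
hypothetical wrapper read_some() that is NOT the limma source: it sets 
colClasses itself and also forwards extra arguments through "...", which 
is presumably similar to what happens inside read.columns.

```r
# Hypothetical wrapper for illustration only (not the limma code):
# it fixes colClasses itself and forwards any extra arguments.
read_some <- function(file, ...) {
  read.table(file = file, header = TRUE, colClasses = "character", ...)
}

infile <- tempfile()  # throwaway two-column example file
writeLines(c("A\tB", "1\t2"), infile)

# Supplying colClasses again through "..." gives read.table() two values
# for the same formal argument, so the call fails before any data is read.
res <- tryCatch(
  read_some(infile, sep = "\t", colClasses = "numeric"),
  error = function(e) conditionMessage(e)
)
print(res)
```
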
>  Best wishes
> 	Wolfgang
>
>
>
>> sessionInfo()
> R version 2.8.0 Under development (unstable) (2008-04-27 r45517)
> x86_64-unknown-linux-gnu
>
> locale:
> LC_CTYPE=it_IT.UTF-8;LC_NUMERIC=C;LC_TIME=it_IT.UTF-8;LC_COLLATE=it_IT.UTF-8;LC_MONETARY=C;LC_MESSAGES=it_IT.UTF-8;LC_PAPER=it_IT.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=it_IT.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] fortunes_1.3-4
>
>
> ------------------------------------------------------------------
> Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber
>
>
> Henrik Parn wrote on 25/04/2008 21:21:
>> Dear Herve,
>> 
>> Thanks for your rapid answer!
>> 
>> Sorry, I forgot to paste the sessionInfo into my previous mail:
>>
>>  > sessionInfo()
>> R version 2.7.0 (2008-04-22)
>> i386-pc-mingw32
>> 
>> locale:
>> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United 
>> Kingdom.1252;LC_MONETARY=English_United 
>> Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> other attached packages:
>> [1] coda_0.13-1       limma_2.13.8      lme4_0.99875-9    Matrix_0.999375-9 lattice_0.17-6
>> 
>> loaded via a namespace (and not attached):
>> [1] grid_2.7.0  tools_2.7.0
>> 
>> 
>> The read.columns function is a part of the limma package in Bioconductor:
>> source("http://bioconductor.org/biocLite.R")
>> biocLite("limma")
>> 
>> I would like to use the read.columns function to read a subset of columns 
>> from several data files. Here are some example columns (out of many) and 
>> rows of the data:
>> 
>> ID i        ID j        Ni  Nj  S   A  R1        B  R2        C  R3         D  R4
>> 8414341.20  8414342.20  1   2   -1  1  0.425183  1  0.758413  1  0.551275   1  0.543045
>> 8414341.20  8414343.20  1   3   -1  1  0.128981  1  0.034859  1  -0.001998  1  0.002093
>> 
>> In this example, there are 13 tab-delimited columns of which I want to use 
>> only ID i, ID j, R1, R2, R3 and R4. The problem with the data in its 
>> current form is the unfortunate format of the ID i and ID j columns: I need 
>> ID i and ID j to be treated as characters although they look like numeric 
>> (if they are read as numeric the .20 will become a .2). When I have used 
>> read.table(), I have first read all columns, and by using the argument 
>> colClasses = c("character", "character",...), I have preserved the format 
>> of ID i and ID j. In the next step I have selected only the relevant 
>> columns.
>> 
>> I thought read.columns could be a convenient alternative for selecting only 
>> the relevant columns when reading the data, by using e.g. required.col = 
>> c("ID i", "ID j"), text.to.search = "R". However, in read.columns I cannot 
>> specify colClasses. As the help text says, "It uses required.col and 
>> text.to.search to set up the colClasses argument of read.table." So I 
>> wonder whether anyone could advise me on how to modify the read.columns 
>> code so that I can specify colClasses, if that is not too complicated.
>> 
>> Thanks in advance!
>> 
>> 
>> Henrik 
>> 
>> 
>> Herve Pages wrote:
>> 
>>> Hi Henrik,
>>> 
>>> I don't have read.columns() when I start a fresh R session, so it looks 
>>> like it's not part of the default R installation. Which package does it 
>>> belong to? Providing your sessionInfo() is always a good idea, as it would 
>>> at least give us a clue of where to look for the read.columns() function. 
>>> Also, a small example (with code) of what you are trying to do would be 
>>> very useful.
>>> 
>>> Thanks!
>>> H.
>>> 
>>> 
>>> Henrik Parn wrote:
>>> 
>>>> Dear all,
>>>> 
>>>> I have received some data sets with some variables that certainly look 
>>>> numeric: they are individual IDs composed of numbers separated by ".", 
>>>> e.g. 6534231.18, 8783234.20. Not surprisingly, they are treated as 
>>>> numeric by read.columns, and 8783234.20 ends up as 8783234.2 when read 
>>>> into R. When I used read.table, I specified in colClasses that these 
>>>> variables should be read as character. However, in read.columns, 
>>>> required.col and text.to.search are used to set up the colClasses 
>>>> argument of read.table. Does anyone have a suggestion for how I can 
>>>> modify the read.columns function so that I can specify colClasses myself?
>>>> 
>>>> Thanks in advance!
>>>> 
>> 
>

